Navigating the Intersection of Big Data and Societal Insights

As a computer science undergraduate, I’ve been immersed in the world of code and algorithms. While the technical intricacies of data manipulation fascinated me, a nagging question lingered: how does this vast ocean of information translate into real-world impact?  

This yearning for purpose led me to explore the two faces of data: the technical prowess that unlocks its potential and the societal lens that imbues it with meaning. This exploration took the form of two distinct courses: CS4225: Big Data Systems for Data Science, a deep dive into big data systems, and UQR2215: Developing Meaningful Indicators, a USP course on developing meaningful indicators from data sets. Through this comparative lens, I embarked on a journey that fundamentally reshaped my understanding of data’s role.

Initially, I viewed data as a collection of static, empirical facts. My exposure to data processing in CS4225 provided valuable technical skills, teaching me tangible techniques for extracting data. However, UQR2215 shifted my perspective by focusing on the intangible aspect – extracting insights and crafting narratives from data. This emphasis on storytelling with data revealed its potential to spark conversations and drive social understanding. In essence, these complementary experiences transformed my perception of data, from mere data points to a powerful tool for generating insights and fostering dialogue.

 

What: The Technical Marvel of Big Data 

Let’s begin by demystifying data. At its core, data is a collection of information, encompassing numbers, descriptions, facts, and statistics. During my studies, I came across a Forbes article, How To Make Use Of The New Gold: Data. It resonated deeply with me: in today’s digital age, data is likened to the new gold, the “raw material that fuels the creation of digital products and services”. This sparked my curiosity. How can we unlock its potential? How can we turn data into something valuable? How can we structure and organise it to extract meaningful insights?

Driven by these questions, I enrolled in CS4225, eager to gain a deeper understanding of data science. While the course heavily emphasised objective analysis, it felt centred on data harvesting rather than generating insights. This observation planted a seed of doubt: in a technical field like data science, could there be an overemphasis on technical aspects, potentially neglecting the ultimate purpose of data processing? It felt incomplete – lacking a sense of connection and purpose beyond simply collecting data.

To illustrate this point further, let’s consider the data lifecycle model presented in the introductory slide of the module (Figure 1). This model depicts a series of well-defined stages, each focused on a specific task like data ingestion, transformation, and analysis. The model appears to prioritise the mechanical cycle of data processing. The focus is on efficiently moving data through these stages, potentially overlooking a crucial step: developing meaningful indicators or insights. This prioritisation of efficiency can lead to a repetitive cycle where models are built and churned out without sufficient time dedicated to extracting deeper understanding from the data. As I reviewed Figure 1, this focus on speed and defined tasks created a sense of being on autopilot – a process lacking the critical thinking necessary to truly unlock the data’s potential. This feeling of “mindlessness” prompted the question: could this purely technical approach lead us to overlook valuable insights hidden within the data?

Figure 1: A slide from CS4225 Lecture – Data Lifecycle model 

 

As the course progressed, we were exposed to intricate algorithms like K-means clustering and specialised systems like the Hadoop file system. These tools represented the pinnacle of technical brilliance in data analysis. K-means clustering, for instance, exemplifies this brilliance by automatically grouping similar data points together, uncovering hidden patterns at an impressive scale. However, despite this technical prowess, a sense of detachment began to emerge.
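To make the idea of "automatically grouping similar data points" concrete, here is a minimal sketch of K-means in plain Python. It is not the course's implementation: the function names, the toy points, and the hand-picked starting centroids are all assumptions chosen for illustration, and a real big data workload would run a distributed version instead.

```python
import math

def assign(points, centroids):
    """Assignment step: place each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else c
            for c, members in zip(centroids, clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Toy data (hypothetical): two visually obvious groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

The algorithm only needs a distance measure and a starting guess; it never knows what the points mean, which is part of why the results still need human interpretation.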

To illustrate the sense of repetitive tasks where models are churned out without enough time for deeper understanding, let’s consider a task from Assignment 2: filtering a messy server log file. My first complaint is that the professor had already set up Spark, the analytics engine for large-scale data processing (see Figure 2). I felt this defeated the purpose of learning. Being spoon-fed the tools took away the opportunity to build the assignment from scratch, which is what fosters deeper understanding and exploration of the various tools. Instead, we were given the tools and told to produce an output: a filtered, cleaned version of the messy log file. This, in my opinion, wasn’t very productive. It reinforced the idea that we only care about the final product, not the process or the journey of getting there.

Figure 2: CS4225 Assignment 2 – Filtering a messy log file
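For readers unfamiliar with this kind of task, the sketch below shows the general shape of filtering and cleaning a messy log. It is an illustration only: the assignment used Spark, whereas this uses just the Python standard library, and the log format, the `clean_log` function, and the sample lines are all hypothetical rather than taken from the actual assignment.

```python
import re

# Hypothetical line format (an assumption, not the assignment's real log):
#   2024-01-15 10:32:01 ERROR disk quota exceeded
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<message>.+)$"
)

def clean_log(lines, keep_levels=("ERROR", "WARN")):
    """Drop malformed lines and keep only records at the requested levels."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # blank or corrupted line: skip it
        if match["level"] in keep_levels:
            records.append(match.groupdict())
    return records

# Made-up sample input containing stray whitespace, junk, and a blank line.
sample = [
    "2024-01-15 10:32:01 ERROR disk quota exceeded",
    "   2024-01-15 10:32:02 INFO heartbeat ok  ",
    "###corrupted###",
    "",
    "2024-01-15 10:32:05 WARN retrying connection",
]
cleaned = clean_log(sample)
for record in cleaned:
    print(record["level"], record["message"])
```

Even in this small form, the decisions that matter (which lines count as "messy", which levels are worth keeping) are judgment calls that the code itself cannot make.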

 

So What: Beyond the Numbers Game 

Fortunately, I had the opportunity to take a contrasting course: UQR2215: Developing Meaningful Indicators, offered by USP. This course proved to be a truly memorable experience. Unlike in CS4225, I wasn’t confined to a siloed environment where projects felt like mindlessly completing tasks for the sake of finishing a typical course. Here, through our data visualisation charts, we actively gathered “negative feedback from real people,” prompting them to engage in conversations about specific social issues.

In essence, this course shifted my focus from the technical ‘how’ to the social ‘why’ of data analysis. We delved into creating indicators from diverse data sources, not just to identify trends, but to communicate complex social concepts. These indicators aimed to accurately reflect the nuances of societal problems. Through hands-on practice with data collection, processing, and visualisation, I learned to generate insights that sparked debate on platforms like Reddit (Figures 3 and 4). For the first time, the entire data processing cycle felt truly rewarding.

Figure 3: My post on Reddit for UQR2215 

 

The Russia-Ukraine war was ongoing when I took this course, and I believed it would be valuable to analyse issues related to this sensitive topic. To do this, I decided to delve into military aid data. My hope was to spark critical discourse about the war. I chose a dataset that showed the breakdown of military aid to Ukraine. The choice of visualisation was deliberate; in my opinion, it helped to highlight the significant contribution of the United States. This, I hoped, would spark conversation. Additionally, instead of using the Ukrainian flag, I used a heart with Ukrainian colours to potentially humanise the visualisation.

However, although the post received about 8,300 upvotes and I believed the visually appealing chart effectively conveyed my intended message, I received some harsh criticism (Figure 4). Instead of engaging with the social issues I aimed to highlight, many commenters focused on the data representation itself. This feedback resonated deeply: it underscored the critical importance of a balanced approach, where technical skills and social awareness go hand in hand.

For instance, one commenter wrote: “This doesn’t seem like a good use of this chart style. Why does Ukraine need to have a section of the circle? … This could just be a bar chart and it would be far more readable” (see Figure 4). The common theme in this feedback was that I had overcomplicated the visualisation; a simpler, more basic approach would have been more effective. This experience made me realise that data storytelling doesn’t require unnecessary complexity. We need to consider what works best, be open to feedback, and iterate on our designs.

Figure 4: Some comments left on my Reddit post in Figure 3

Overall, the course emphasised critical thinking – the ability to dissect data and construct indicators that accurately reflect the nuances of societal problems or the narrative you want to highlight. This human-centric approach resonated deeply with me, contrasting sharply with the technical focus of CS4225.

 

Now What: A Balanced Approach for the AI Frontier 

As data becomes an increasingly valuable asset and artificial intelligence trained on large datasets grows rapidly, the need for a balanced approach becomes ever more critical. On one hand, strong technical skills are essential: acquiring, cleaning, and manipulating data with statistical methods, programming, and visualisation tools. But a balanced approach requires acknowledging the social context as well. This means considering potential biases in data collection (e.g., historical biases reflected in datasets), analysis (e.g., algorithmic choices), and presentation (e.g., misleading visualisations), as well as the impact of data-driven decisions (e.g., perpetuating social inequalities). Striking a balance between strong technical skills and a critical awareness of the social context ensures that data analysis serves society for the better.

The contrasting yet complementary lessons learned from CS4225 and UQR2215 have equipped me with useful skills and memorable experiences. This broadened perspective leads me to ask more questions as I progress in my field. In the current age of artificial intelligence, where algorithms increasingly shape our world, we must be more cautious about how we process and present data. This balanced approach to computer science education equips me to not only harness the power of big data but also be aware of the ethical dimensions and societal impacts of such technologies. Ultimately, this combined knowledge paves the way for responsible and impactful data-driven decision-making in my future career.

In my next post, “Beyond the Classroom,” I’ll journey beyond the UHB2206 assignment and delve into a pivotal overseas entrepreneurial experience in Toronto. This trip challenged my initial disappointment with entrepreneurship and sparked a mindset shift that continues to shape my career path. More importantly, it transformed my perspective, leading me from a traditional tech focus to a more purpose-driven pursuit!

 

References: 

  1. Quek, E. (2023a). [OC] US adults consume more news on Facebook than any other social media platfor. Reddit. https://www.reddit.com/r/dataisbeautiful/comments/y81urx/oc_us_adults_consume_more_news_on_facebook_than/ 
  2. Quek, E. (2023b). [OC] Where Military Aid to Ukraine Comes From. Reddit. https://www.reddit.com/r/dataisbeautiful/comments/yrf10w/oc_where_military_aid_to_ukraine_comes_from/  
  3. Shubladze, S. (2023, March 28). Council post: How to make use of the new Gold: Data. Forbes. https://www.forbes.com/sites/forbestechcouncil/2023/03/27/how-to-make-use-of-the-new-gold-data/?sh=2a7ba1fb2bbf