
South Korea’s journey toward open public data reflects both rapid digital advancement and evolving public engagement. From early IT infrastructure and restrictive regulations like WIPI to citizen-driven breakthroughs such as the Seoul Bus app, the nation transitioned from top-down control to participatory innovation. Government initiatives like the Data Dam under the Digital New Deal expanded access to vast datasets through AI Hub and the Public Data Portal. Yet, the ultimate challenge lies in fostering true data literacy—equipping citizens not only to use data but to interpret it critically, avoiding blind faith in “Big Data” while embracing informed, evidence-based decision-making.
#Public Data #Open Data Policy #Digital New Deal #Big Data #Data literacy
South Korea has long cultivated a global reputation as a powerhouse in Information Technology, a status built on decades of strategic investment and public enthusiasm for cutting-edge technology. Yet, behind this success lies a complex and evolving journey with public data. This story charts that path, moving from early infrastructure development and significant regulatory hurdles to ambitious national data projects. Ultimately, it reveals that the most crucial challenge ahead lies not just in archiving data, but in fostering a deep and critical data literacy among the citizens who are meant to be its ultimate beneficiaries.
To fully appreciate South Korea's current focus on public data, it is essential to understand the cultural and infrastructural groundwork laid during its early ascent as a technology leader. This history reveals a nation with a high public appetite for innovation, a willingness to invest in nationwide infrastructure, and a regulatory landscape that has, at times, both spurred and hindered progress.
South Korea's leadership in IT was established through early, aggressive investment in a national IT infrastructure that outpaced even the most advanced nations. This was matched by a public that eagerly embraced high-speed networks and next-generation hardware, creating a fertile environment for technological adoption. The country's internet history dates back to 1982, when Seoul National University (SNU) and the Electronics and Telecommunications Research Institute (ETRI) constructed a network system, making Korea one of the first nations to possess an operational internet on its soil.
Despite leading the world in internet speed, Korea faced a significant and surprising challenge with smartphone adoption in the late 2000s. The country lagged behind the global storm created by Apple's iPhone because of a regulation requiring all phones to include WIPI (Wireless Internet Platform for Interoperability). This Korea-specific standard, intended to ensure multimedia interoperability, increased production costs and effectively operated as a barrier for foreign manufacturers. As a result, the iPhone did not reach the Korean market until late 2009, nearly two and a half years after its US release. For a nation known for its tech-savvy early adopters, this delay foreshadowed a major controversy over the use of public data on smartphones.
This top-down, regulation-first mindset, which stifled innovation in the name of control, set the stage for an inevitable public response—a bottom-up challenge that would fundamentally reshape the nation's public data landscape.
The controversy surrounding a simple mobile application known as 'Seoul Bus' became a pivotal moment in South Korea's public data history. It represented a grassroots, innovation-first challenge to the restrictive environment exemplified by the WIPI mandate, serving as a powerful catalyst that shifted both public perception and government policy towards greater data accessibility.
Shortly after the iPhone's belated release in Korea, two high school students developed the 'Seoul Bus' app, which used real-time bus operation data. The app quickly gained immense popularity among early smartphone users. However, this success attracted the attention of the Gyeonggi provincial government, which claimed the students had violated its location data use agreement and demanded the service be shut down.
The government's demand, while having a legal basis, was widely seen as bureaucratic overreach. The students were told to obtain a business operator license, a requirement that was financially impossible for them to meet. The ensuing public outcry highlighted the immense practical value of open data when placed in the hands of innovators. The event raised public awareness of the promise of combining mobile devices with data, ultimately leading to a significant revision of regulations and business models.
This citizen-driven push for open data paved the way for more formal, large-scale government initiatives designed to harness the power of public information.
In 2020, the South Korean government launched the so-called 'Korean New Deal'. A cornerstone of this initiative was the 'Data Dam,' a landmark project designed to formally structure, expand, and democratize the nation's public data resources for public use and innovation.
The Data Dam project's ambitious goal was to build a giant, public, and free-to-use data archive. To achieve this, it set out to collect 140,000 new datasets to supplement existing archives, making them available for research and development. These data repositories are showcased on two primary public websites: the AI Hub and the Public Data Portal.
AI Hub
Maintained by the National Information Society Agency (NIA), the AI Hub serves as a repository for high-quality, curated data from public projects across many different fields. Representative examples include:
AI-based legal document translation: This service uses Machine Translation Post Editing (MTPE) to translate legal documents. It was developed to address the delays and obsolescence of manual translations, especially as new legislation is introduced and the number of foreign residents in Korea increases.
Document write-up helper: This intelligent system analyzes the context of a document as it is being written. It then predicts and fetches relevant information in the background, providing it to the writer to streamline the process and supply multifaceted knowledge that the writer may not readily possess.
Korean facial image-based applications: This dataset consists of 19,444,000 images of 600 Koreans, captured with varying resolutions, lighting conditions, and facial accessories. It is designed to support the development of advanced facial recognition services, such as identity verification and finding similar-looking faces, for use in finance, security, and investigations.
Public Data Portal
The Public Data Portal provides a user-friendly interface that allows private and public users to easily find and utilize available public data for service and product development. Accessible via Open API, the portal categorizes data under headings such as Education, Finance, Social Welfare, Culture and Tourism, Health Care, and Transportation and Logistics. The portal also showcases services built on its data.
While these initiatives made massive amounts of data available, they also highlighted the next critical challenge: ensuring citizens have the skills to understand and use this data wisely.
Simply creating massive data archives is not enough. For public data to fulfill its promise of driving innovation and informed decision-making, citizens must be equipped with data literacy. This means educating the public on the fundamental concepts of data, its potential applications, and, just as importantly, its inherent pitfalls.
What is Data?
By definition, data can be any set of at least two symbols that has the potential to distinguish one thing from another. At its most fundamental level, data is a stream of symbols (bits like 0 and 1, text in an alphabet, or carvings on a cave wall) that serves as the physical manifestation of a chosen message. It is the physical form, the container for a message selected from the range of all possible messages.
What is Information?
While often used interchangeably with "data," "information" has a distinct and crucial meaning. If data is the physical message, information is how much new knowledge data conveys to the receiver, highlighting its subjective nature. A message that provides no new knowledge contains zero information for that specific receiver. This understanding is a core step toward building robust data literacy. Because people approach data with different levels of existing knowledge, or 'priors,' the same dataset will convey different amounts of information to different individuals. This inherent subjectivity is a key reason why data interpretation can vary so widely.
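One common way to put a number on this idea is Shannon's surprisal, which the discussion above does not name and which is brought in here only as an illustrative assumption: a message the receiver expected with probability p carries log2(1/p) bits. The sketch below uses a hypothetical function name and made-up priors to show how the same message can carry different amounts of information for different receivers.

```python
import math

def surprisal_bits(prior_probability):
    """Bits of information in a message the receiver expected with this prior probability.

    Shannon's surprisal, log2(1/p), is used here as an assumed formalization.
    """
    if not 0 < prior_probability <= 1:
        raise ValueError("prior probability must be in (0, 1]")
    return math.log2(1 / prior_probability)

# Three hypothetical receivers of the same message: "the patient is Type AB".
already_knows = surprisal_bits(1.0)   # was already certain of the answer
no_idea = surprisal_bits(0.25)        # thought all four types equally likely
informed = surprisal_bits(0.11)       # knew that Type AB is rare

for label, bits in [("already knows", already_knows),
                    ("no idea", no_idea),
                    ("informed", informed)]:
    print(f"{label:>13}: {bits:.2f} bits")
```

Under these assumed priors, the receiver who was already certain gains 0 bits, the one who considered all four types equally likely gains 2 bits, and the one who knew Type AB to be rare gains a little over 3 bits.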
Making Sense of Data: Statistical Inference
Statistical inference is the process of making useful sense of data. The term itself comes from the Latin inferentia, meaning "to bring in." It is the all-encompassing process of transforming raw data into practical knowledge. This process is powerfully illustrated by the example of an emergency room nurse. Imagine a nurse records the blood types (A, B, AB, O) of 1,000 patients. This list of 1,000 entries is raw data. Through statistical inference, the nurse finds that there have been 340 Type A patients, 260 Type B, 110 Type AB, and 280 Type O. The 1,000 raw data points are reduced to just four useful statistics. This summary has a profound practical benefit. If the hospital were to stock its blood supply based on a naïve guess of storing equal proportions—say, 250 packets of each type—it would face a critical shortfall of 90 packets for Type A blood. This misallocation, born from a lack of data, could lead to "personal catastrophe, i.e. death of the patient." Statistical inference greatly reduces the amount of data, creates a useful summary, and has practical benefits that can be life-or-death.
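The nurse's tally can be sketched in a few lines of Python. The record list here is synthetic, rebuilt from the four counts quoted above (which, as quoted, add up to 990 rather than 1,000), and the function names are illustrative only.

```python
from collections import Counter

# Synthetic raw records: one blood-type entry per patient,
# rebuilt from the counts quoted in the text.
records = ["A"] * 340 + ["B"] * 260 + ["AB"] * 110 + ["O"] * 280

def summarize(entries):
    """Reduce the raw entries to one count per blood type."""
    return Counter(entries)

def shortfall(counts, stock_per_type):
    """Units missing per type if every type is stocked at the same level."""
    return {blood_type: max(0, demand - stock_per_type)
            for blood_type, demand in counts.items()}

counts = summarize(records)
gaps = shortfall(counts, stock_per_type=250)  # the naive equal-stock guess from the text

print(dict(counts))
print(gaps)
```

With 250 units stocked per type, this sketch reports a gap of 90 units for Type A and smaller gaps of 10 for Type B and 30 for Type O, while Type AB is overstocked.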
However, even with powerful tools like statistical inference, a blind reliance on large datasets without critical thinking can lead to significant errors.
The concept of "Big Data" emerged with a compelling promise rooted in the "law of large numbers." This law theoretically states that as a sample size grows toward infinity, the sample mean will converge on the true value. The allure was that with enough data, we could eliminate uncertainty and find definitive answers. However, it is critical to understand that this promise has very limited applicability to complex, real-world problems.
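A small simulation can make the law of large numbers concrete. This is a minimal sketch, assuming the population shares are known in advance (the weights are borrowed from the nurse example purely for illustration) and that patients are drawn independently. Neither assumption is guaranteed for real data.

```python
import random

# Assumed population weights, taken from the nurse example for illustration only.
WEIGHTS = {"A": 340, "B": 260, "AB": 110, "O": 280}
TRUE_SHARE_A = WEIGHTS["A"] / sum(WEIGHTS.values())

def sample_share_a(n, rng):
    """Draw n simulated patients and return the observed share of Type A."""
    draws = rng.choices(list(WEIGHTS), weights=list(WEIGHTS.values()), k=n)
    return draws.count("A") / n

rng = random.Random(0)  # fixed seed so the run is repeatable
estimates = {n: sample_share_a(n, rng) for n in (100, 10_000, 100_000)}

for n, share in estimates.items():
    error = abs(share - TRUE_SHARE_A)
    print(f"n={n:>7}: share of A = {share:.4f} (error {error:.4f})")
```

The estimate tends to settle near the assumed true share as the sample grows, but only because the simulation already knows the full list of possible outcomes and their weights.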
The law of large numbers only works when we have perfect knowledge of all possible outcomes of a trial, such as the four human blood types. This is rarely the case. A historical example powerfully illustrates this limitation: predicting projectile trajectories. Before the space age, a "Big Data" advocate might have argued that one could simply conduct a practically infinite number of experiments, build a massive data table, and look up the trajectory for any future projectile. This was behind the proclamation that "'scientific modeling is irrelevant' in the era of Big Data."
The flaw in this argument is profound. Because every projectile in these experiments would have ended its journey by falling back to Earth, the resulting "Big Data" would be incapable of predicting a trajectory that never falls back, such as one that enters orbit or escapes Earth's gravity. Yet Isaac Newton, using purely theoretical scientific modeling, predicted such an event 250 years before it was ever observed. This demonstrates a crucial lesson: Big Data is inherently limited by how the data is collected, and it cannot tell us about what we have not already observed and understood.
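The contrast between a lookup table and a model can be sketched as follows. The table of recorded shots is hypothetical, and the model is the idealized "Newton's cannonball" setup: a horizontal launch at surface level with no air resistance, which is a simplifying assumption rather than a realistic launch calculation.

```python
import math

GM_EARTH = 3.986004418e14  # m^3/s^2, standard gravitational parameter of Earth
R_EARTH = 6.371e6          # m, mean radius of Earth

def orbital_speed(radius=R_EARTH):
    """Speed of a circular orbit at the given radius, ignoring air resistance."""
    return math.sqrt(GM_EARTH / radius)

# Hypothetical data table: every recorded shot was slow, and every one landed.
observed = [{"speed_m_s": v, "landed": True} for v in range(100, 1100, 100)]

def table_predicts_landing(speed):
    """Lookup approach: answer with the outcome of the closest recorded speed."""
    nearest = min(observed, key=lambda row: abs(row["speed_m_s"] - speed))
    return nearest["landed"]

def model_predicts_landing(speed):
    """Model approach: the shot falls back only if it is slower than orbital speed."""
    return speed < orbital_speed()

fast_shot = 9_000  # m/s, far outside anything in the table
print(f"orbital speed is about {orbital_speed():.0f} m/s")
print("table says it lands:", table_predicts_landing(fast_shot))
print("model says it lands:", model_predicts_landing(fast_shot))
```

However many slow shots are added to the table, the lookup keeps answering that the projectile lands, because no recorded outcome says otherwise. The model, by contrast, can point to a speed beyond which it does not.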
Understanding this inherent limitation is essential for developing a mature and effective approach to data literacy.
South Korea's journey—from the restrictive, top-down regulations of the WIPI era to the citizen-led data activism of 'Seoul Bus,' and from the government's ambitious 'Data Dam' to the hard-learned lessons on the perils of Big Data—offers a powerful narrative of progress and caution. This evolution underscores that the path forward requires a robust plan for data literacy that makes the public aware of both the promise and the profound limitations of data. To truly empower citizens, the following hard-won principles must be at the forefront of public education.


South Korea has long cultivated a global reputation as a powerhouse in Information Technology, a status built on decades of strategic investment and public enthusiasm for cutting-edge technology. Yet, behind this success lies a complex and evolving journey with public data. This story charts that path, moving from early infrastructure development and significant regulatory hurdles to ambitious national data projects. Ultimately, it reveals that the most crucial challenge ahead lies not just in archiving data, but in fostering a deep and critical data literacy among the citizens who are meant to be its ultimate beneficiaries.
To fully appreciate South Korea's current focus on public data, it is essential to understand the cultural and infrastructural groundwork laid during its early ascent as a technology leader. This history reveals a nation with a high public appetite for innovation, a willingness to invest in nationwide infrastructure, and a regulatory landscape that has, at times, both spurred and hindered progress.
South Korea's leadership in IT was established through early, aggressive investment in a national IT infrastructure that outpaced even the most advanced nations. This was matched by a public that eagerly embraced high-speed networks and next-generation hardware, creating a fertile environment for technological adoption. The country's internet history dates back to 1982, when Seoul National University (SNU) and the Electronics and Telecommunications Research Institute (ETRI) constructed a network system, making Korea one of the first nations to possess an operational internet on its soil.
Despite leading the world in internet speed, Korea faced a significant and surprising challenge with smartphone adoption in the mid-2000s. The country lagged behind the global storm created by Apple's iPhone due to a specific regulation requiring all phones to include WIPI (Wireless Internet Platform for Interoperability). This Korea-specific standard, intended to ensure multimedia interoperability, increased production costs and effectively operated as a barrier for foreign manufacturers. As a result, the iPhone was not introduced to the Korean market until almost a full three years after its US release. For a nation known for its tech-savvy early adopters, this delay served as an omen for a major controversy regarding the usage of public data on smartphones.
This top-down, regulation-first mindset, which stifled innovation in the name of control, set the stage for an inevitable public response—a bottom-up challenge that would fundamentally reshape the nation's public data landscape.
The controversy surrounding a simple mobile application known as 'Seoul Bus' became a pivotal moment in South Korea's public data history. It represented a grassroots, innovation-first challenge to the restrictive environment exemplified by the WIPI mandate, serving as a powerful catalyst that shifted both public perception and government policy towards greater data accessibility.
Shortly after the iPhone's belated release in Korea, two high school students developed the 'Seoul Bus' app, which used real-time bus operation schedule data. The app quickly gained immense popularity among early smartphone users. However, this success attracted the attention of the Gyeonggi municipal government, which claimed the students had violated its location data use agreement and demanded the service be shut down.
The government's demand, while having a legal basis, was widely seen as bureaucratic overreach. The students were told to obtain a business operator license, a requirement that was financially impossible for them to meet. The ensuing public outcry highlighted the immense practical value of open data when placed in the hands of innovators. The event raised public awareness of the promise of combining mobile devices with data, ultimately leading to a significant revision of regulations and business models.
This citizen-driven push for open data paved the way for more formal, large-scale government initiatives designed to harness the power of public information.
In 2020, the South Korean government launched the so-called 'Korean New Deal'. A cornerstone of this initiative was the 'Data Dam,' a landmark project designed to formally structure, expand, and democratize the nation's public data resources for public use and innovation.
.png)
The Data Dam project's ambitious goal was to build a giant, public, and free-to-use data archive. To achieve this, it set out to collect 140 thousand new datasets to supplement existing archives, making them available for research and development. These vast data repositories are showcased on two primary public websites: the AI Hub and the Public Data Portal.
AI Hub
Maintained by the National Information Agency, the AI Hub serves as a repository for high-quality, curated data from public projects across many different fields. Representative examples include:
AI-based legal document translation: This service uses Machine Translation Post Editing (MTPE) to translate legal documents. It was developed to address the delays and obsolescence of manual translations, especially as new legislation is introduced and the number of foreign residents in Korea increases.
Document write-up helper: This intelligent system analyzes the context of a document as it is being written. It then predicts and fetches relevant information in the background, providing it to the writer to streamline the process and supply multifaceted knowledge that the writer may not readily possess.
Korean facial image-based applications: This dataset consists of 19,444,000 images of 600 Koreans, captured with varying resolutions, lighting conditions, and facial accessories. It is designed to support the development of advanced facial recognition services, such as identity verification and finding similar-looking faces, for use in finance, security, and investigations.
Public Data Portal
The Public Data Portal provides a user-friendly interface that allows private and public users to easily find and utilize available public data for service and product development. Accessible via Open API, the portal categorizes data under headings such as Education, Finance, Social Welfare, Culture and Tourism, Health Care, and Transportation and Logistics. The portal showcases examples that utilize its data. (Click)
.png)
While these initiatives made massive amounts of data available, they also highlighted the next critical challenge: ensuring citizens have the skills to understand and use this data wisely.
Simply creating massive data archives is not enough. For public data to fulfill its promise of driving innovation and informed decision-making, citizens must be equipped with data literacy. This means educating the public on the fundamental concepts of data, its potential applications, and, just as importantly, its inherent pitfalls.
What is Data?
Data can, by definition, be any set of two symbols at a minimum that has the potential to distinguish one thing from another. At its most fundamental level, data is a stream of symbols—whether bits like 0 and 1, text in an alphabet, or carvings on a cave wall—that represents the physical manifestation of a chosen message. It is the physical form, the container for a message selected from a range of all possible messages.
What is Information?
While often used interchangeably with "data," "information" has a distinct and crucial meaning. If data is the physical message, information is the amount of new knowledge that the data conveys to its receiver, which makes it inherently subjective. A message that provides no new knowledge contains zero information for that specific receiver. Because people approach data with different levels of existing knowledge, or 'priors,' the same dataset will convey different amounts of information to different individuals. This subjectivity is a key reason why data interpretation can vary so widely, and recognizing it is a core step toward building robust data literacy.
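The dependence of information on the receiver's priors can be made concrete with Shannon's self-information (surprisal), which the text does not name but which is one common way to quantify "new knowledge." The sketch below is illustrative only: the three receivers and the prior probabilities assigned to them are hypothetical.

```python
import math

def surprisal_bits(prior_probability: float) -> float:
    """Bits of information a message carries for a receiver who assigned it this prior probability."""
    if not 0.0 < prior_probability <= 1.0:
        raise ValueError("prior_probability must be in (0, 1]")
    return math.log2(1.0 / prior_probability)

# Hypothetical receivers of the SAME message, each with a different prior belief in it.
receiver_priors = {
    "already_knows": 1.0,      # was certain beforehand
    "coin_flip_guess": 0.5,    # considered it a 50/50 possibility
    "thinks_unlikely": 0.125,  # considered it a 1-in-8 possibility
}

bits_by_receiver = {name: surprisal_bits(p) for name, p in receiver_priors.items()}

for name, bits in bits_by_receiver.items():
    print(f"{name}: {bits} bits")
```

Under this measure, the receiver who already knew the message gains nothing, while the more surprised receivers gain more from the identical data.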
Making Sense of Data: Statistical Inference
Statistical inference is the process of making useful sense of data. The term comes from the Medieval Latin inferentia, itself from inferre, "to bring in." It is the all-encompassing process of transforming raw data into practical knowledge.
This process is powerfully illustrated by the example of an emergency room nurse. Imagine a nurse records the blood type (A, B, AB, or O) of each of 1,000 patients. This list of 1,000 entries is raw data. Through statistical inference, the nurse finds that there have been 340 Type A patients, 270 Type B, 110 Type AB, and 280 Type O. The 1,000 raw data points are reduced to just four useful statistics. This summary has a profound practical benefit. If the hospital were to stock its blood supply based on a naïve guess of equal proportions (say, 250 packets of each type), it would face a critical shortfall of 90 packets of Type A blood. This misallocation, born from a lack of data, could lead to "personal catastrophe, i.e. death of the patient." Statistical inference greatly reduces the amount of data, creates a useful summary, and has practical benefits that can be a matter of life or death.
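The reduction from raw entries to a handful of counts, and the Type A shortfall arithmetic, can be sketched in a few lines. This is a minimal illustration, not the nurse's actual procedure: `raw_records` is a hypothetical ten-entry log standing in for the 1,000 entries, and `shortfall` is a helper name introduced here. Only the Type A figure (340 of 1,000) and the equal-share stock of 250 are taken from the text.

```python
from collections import Counter

# Hypothetical raw log: one blood-type entry per patient (shortened to 10 for illustration).
raw_records = ["A", "O", "B", "A", "AB", "O", "A", "B", "O", "A"]

# Summarizing: the raw list collapses into one count per blood type.
counts = Counter(raw_records)
print(dict(counts))

def shortfall(observed_demand: int, stocked_units: int) -> int:
    """Units missing when stock falls below observed demand (0 if stock suffices)."""
    return max(0, observed_demand - stocked_units)

# Figures from the text: 1,000 patients, 340 of them Type A, four blood types.
TOTAL_PATIENTS = 1000
TYPE_A_PATIENTS = 340
equal_share = TOTAL_PATIENTS // 4  # naive guess: stock every type equally
type_a_shortfall = shortfall(TYPE_A_PATIENTS, equal_share)

print(f"Equal share per type: {equal_share}, Type A shortfall: {type_a_shortfall}")
```

The same `shortfall` helper could be applied to each blood type in turn to see which types the equal-share guess understocks and which it overstocks.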
However, even with powerful tools like statistical inference, a blind reliance on large datasets without critical thinking can lead to significant errors.
The concept of "Big Data" emerged with a compelling promise rooted in the "law of large numbers." This law states that as the number of independent observations grows toward infinity, the sample mean converges on the true expected value. The allure was that with enough data, we could eliminate uncertainty and find definitive answers. However, it is critical to understand that this promise has very limited applicability to complex, real-world problems.
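A small simulation shows what the law of large numbers does promise in the favorable case where the possible outcomes are known in advance. The sketch below assumes a "true" Type A share of 0.34, borrowed from the nurse example. The sample sizes, the seed, and the name `estimate_share` are arbitrary choices made for illustration.

```python
import random

TRUE_SHARE_A = 0.34  # assumed "true" share of Type A, borrowed from the nurse example

def estimate_share(sample_size: int, rng: random.Random) -> float:
    """Simulate sample_size patients and return the observed share of Type A."""
    hits = sum(1 for _ in range(sample_size) if rng.random() < TRUE_SHARE_A)
    return hits / sample_size

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample_sizes = (100, 10_000, 200_000)

# Absolute gap between each simulated estimate and the assumed true share.
errors = {n: abs(estimate_share(n, rng) - TRUE_SHARE_A) for n in sample_sizes}

for n, err in errors.items():
    print(f"n={n:>7}: |estimate - truth| = {err:.4f}")
```

The gap tends to shrink as the sample grows, but only because the simulation fixes the set of outcomes and their probabilities beforehand. That is exactly the kind of prior knowledge that complex, real-world problems rarely offer.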
The law of large numbers only works when we have perfect knowledge of all possible outcomes of a trial, such as the four human blood types. This is rarely the case. A historical example powerfully illustrates this limitation: predicting projectile trajectories. Before the space age, a "Big Data" advocate might have argued that one could simply conduct a practically infinite number of experiments, build a massive data table, and look up the trajectory for any future projectile. This reasoning lay behind the proclamation that "scientific modeling is irrelevant" in the era of Big Data.
The flaw in this argument is profound. Because every projectile in these experiments would have ended its journey by falling back to Earth, the resulting "Big Data" would be incapable of predicting a trajectory that never falls back, such as one that enters orbit or escapes Earth's gravity. Yet Isaac Newton, using purely theoretical scientific modeling, predicted such an event 250 years before it was ever observed. This demonstrates a crucial lesson: Big Data is inherently limited by how the data is collected, and it cannot tell us what we cannot already observe and comprehensively understand.
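The contrast between a lookup table and a model can be sketched as follows. This is an idealized illustration under stated assumptions: a horizontal launch from Earth's surface with no air resistance, standard approximate values for the physical constants, and a hypothetical three-entry table of past experiments. The function names are introduced here and do not come from the source.

```python
import math

G = 6.674e-11          # gravitational constant, m^3 kg^-1 s^-2 (approximate)
EARTH_MASS = 5.972e24  # kg (approximate)
EARTH_RADIUS = 6.371e6 # m (approximate mean radius)

def circular_orbit_speed() -> float:
    """Speed for a circular orbit at the surface, from Newtonian gravity."""
    return math.sqrt(G * EARTH_MASS / EARTH_RADIUS)

def escape_speed() -> float:
    """Minimum speed to escape Earth's gravity from the surface."""
    return math.sqrt(2 * G * EARTH_MASS / EARTH_RADIUS)

def model_prediction(launch_speed: float) -> str:
    """Outcome predicted by the Newtonian model (idealized, no air resistance)."""
    if launch_speed >= escape_speed():
        return "escapes"
    if launch_speed >= circular_orbit_speed():
        return "orbits"
    return "falls back"

# Hypothetical "Big Data" table: every experiment ever run ended the same way.
observed_launches = {100.0: "falls back", 500.0: "falls back", 1000.0: "falls back"}

def table_prediction(launch_speed: float) -> str:
    """Look up the outcome of the most similar past experiment."""
    nearest = min(observed_launches, key=lambda speed: abs(speed - launch_speed))
    return observed_launches[nearest]

for speed in (300.0, 8000.0, 12000.0):
    print(f"{speed:>8.0f} m/s  table: {table_prediction(speed):<10}  model: {model_prediction(speed)}")
```

However many rows are added to `observed_launches`, the table can only repeat outcomes it has already recorded. The model can name outcomes that no experiment in the table ever produced.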
Understanding this inherent limitation is essential for developing a mature and effective approach to data literacy.
South Korea's journey—from the restrictive, top-down regulations of the WIPI era to the citizen-led data activism of 'Seoul Bus,' and from the government's ambitious 'Data Dam' to the hard-learned lessons on the perils of Big Data—offers a powerful narrative of progress and caution. This evolution underscores that the path forward requires a robust plan for data literacy that makes the public aware of both the promise and the profound limitations of data. To truly empower citizens, the following hard-won principles must be at the forefront of public education.