Yennie Jun
Seoul National University
Javier Cha
Seoul National University
Future humanists will potentially have access to around 40 zettabytes (about 40 trillion gigabytes) of data from 2020 alone, generated by smart devices, sensors, social media platforms, and content delivery services (Reinsel, Gantz, and Rydning 2017). This immense amount of information takes the form of unstructured, constantly changing, and highly mutable activity streams, stored, processed, and cached in server farms, data centers, and content delivery networks worldwide. What are the implications of the big data paradigm for researchers and archivists?
The challenges of archiving mutable data sources are not new. Book historian Robert Darnton compares the fragmentary nature of early modern French “anecdotes” to present-day blogging systems and social media platforms, claiming that “information comes in fragments and embeds itself in whatever niches are provided by the surrounding environment” (2013). Librarian David McKitterick explains that textual variance was common among early printed books, both within and across versions; “the mutable and malleable nature of the printed word” is not unique to big data (Howsam 2006, 68).
On the other hand, big data is unlike anything seen in previous information regimes and media landscapes. Its sheer size and scale defy imagination. According to TechJury, “it would take a person approximately 181 million years to download all the data from the internet” (Petrov 2019). In 2019 alone, some 30,000 years’ worth of video content were uploaded to YouTube. A typical data center, with hundreds of thousands of stacked servers, requires nearby power stations and backup generators. In Ireland, 54 data centers, including Facebook’s Clonee site, have access to 642 megawatts of power capacity, enough to serve 300,000 average US households (Pollock 2019). In 2017, the US Library of Congress abandoned its effort to archive every tweet because “conducting one search of the 2006 to 2010 tweet archive would take 24 hours” (Wamsley 2017).
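These magnitudes are easier to grasp through back-of-envelope arithmetic. The sketch below reproduces the order of TechJury’s figure; the 40-zettabyte corpus size comes from the IDC report cited above, while the 50 Mbps connection speed is an assumption of ours, not a figure from either source.

```python
# Back-of-envelope check: how long would one person need to download
# the 2020 datasphere? Inputs are assumptions, not figures from the sources.
DATASPHERE_BYTES = 40e21           # ~40 zettabytes (Reinsel, Gantz, and Rydning 2017)
CONNECTION_BPS = 50e6              # assumed 50 Mbps consumer broadband link
SECONDS_PER_YEAR = 365.25 * 24 * 3600

seconds = DATASPHERE_BYTES * 8 / CONNECTION_BPS
years = seconds / SECONDS_PER_YEAR
print(f"{years:,.0f} years")       # ~203 million years, the same order of
                                   # magnitude as TechJury's 181 million
```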
The social web’s drive to generate dynamic and highly personalized content makes it physically impossible to create a separate archive of what exists only in the specious present that technologists call “real-time.” One of the main characteristics of Web 2.0 is its unprecedented pace of change. According to a report by the International Data Corporation (IDC), “consumers … expect to access products and services wherever they are, over whatever connection they have, and on any device. They want data in the moment, on the go, and personalized” (Reinsel, Gantz, and Rydning 2017). Technologies such as personalization platforms are designed specifically to provide a unique experience for each user. Social media platforms and streaming services depend on “real-time” updates. Existing web archives, such as the Internet Archive, can do little more than capture versions of static pages and are not well equipped to deal with big data’s variability or volume.
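To make this limitation concrete, consider what a web archive can actually return. The Internet Archive’s public availability endpoint (https://archive.org/wayback/available) hands back the closest stored snapshot of a URL: a single static, timestamped capture. A minimal sketch, with error handling omitted:

```python
import json
from urllib.request import urlopen

# Query the Internet Archive's availability API for the snapshot of a URL
# closest to a given timestamp. What comes back is one static, timestamped
# capture -- not the personalized "real-time" view any logged-in user saw.
def closest_snapshot(url: str, timestamp: str = "20200101") -> dict:
    api = f"https://archive.org/wayback/available?url={url}&timestamp={timestamp}"
    with urlopen(api) as resp:
        return json.load(resp)["archived_snapshots"]

print(closest_snapshot("twitter.com"))
# e.g. {'closest': {'status': '200', 'available': True,
#        'url': 'http://web.archive.org/web/2020.../twitter.com', ...}}
```

Whatever a particular user’s personalized feed showed at that moment is simply absent from the record.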
Additionally, an increasing amount of this data is stored in public or private cloud services rather than on personal hard drives. We must therefore consider post-custodial, decentralized technological solutions and protocols rather than a single central location for a digital archive. Solutions involving emulation, the Universal Virtual Computer (UVC), and blockchain have been suggested, but each has its limitations (Lorie 2001). This effort will require immense technological and legal cooperation among national governments, private companies, scholars, and many other actors.
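One primitive these decentralized proposals share is content addressing, in which an object’s identifier is the cryptographic hash of its bytes, so that any custodian can verify a copy without appealing to a central repository. The sketch below illustrates only this general idea, not the protocol of any particular system; the sample record is invented.

```python
import hashlib

# Content addressing: an archived object's address is the hash of its bytes,
# so any copy held by any custodian -- or by none in particular -- can be
# verified against its address without trusting a central repository.
def content_address(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

record = b"post id=123 text='...' captured=2020-01-01T00:00:00Z"  # invented
addr = content_address(record)
assert content_address(record) == addr  # any faithful copy verifies itself
print(addr)
```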
This leads to the necessity of preserving not just the content but also the context and culture of our big data society. Merely preserving the raw data (content) cannot tell the stories of how the data is used; capturing the diversity of user experiences (context) is crucial for conveying a more complete picture of the Age of Big Data. Archiving the context requires emulating the user experience of interacting with and creating data within the larger information ecosystems, and determining what kinds of surrogates would best represent those experiences. Clifford Lynch, director of the Coalition for Networked Information (CNI), suggests deploying robotic witnesses that fabricate and record experiences on different social media platforms in order to document a broad range of them (Lynch 2017). Other solutions include video-recording users’ physical interactions with technology, capturing the actual output of the technologies, and creating snapshots in time of the underlying databases.
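Lynch’s robotic witness could be prototyped with off-the-shelf browser automation. The sketch below is one possible reading of his suggestion, not his implementation; it uses the Playwright library, and the target URL and output labels are placeholders. It records what one scripted session was actually shown: the rendered page and the HTML it was served.

```python
from playwright.sync_api import sync_playwright  # pip install playwright

# A minimal "robotic witness": a scripted browser session that records what
# one synthetic user was shown -- rendered pixels plus the served HTML.
# A fuller witness would script a persona's logins, scrolls, and clicks
# to elicit the personalized content this paper argues we must capture.
def witness(url: str, label: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=f"{label}.png", full_page=True)  # the experience
        with open(f"{label}.html", "w") as f:
            f.write(page.content())                           # the surrogate
        browser.close()

witness("https://example.com", "witness-2020-01-01")  # placeholder target
```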
There is no such thing as a perfect archive; historian Ian Milligan reminds us that “no archive is a true reflection of the world” (Milligan 2019, 17). As with books (“even if their texts can come down to us unchanged - a virtual impossibility ... our relation to those texts cannot be the same as that of readers in the past”), we cannot perfectly and accurately capture the entirety of Web 2.0 (Darnton 1990). What, then, is the goal of archiving big data? Lynch suggests that it is to “preserve a reasonably accurate sense of the present for the future” (2017). I argue, similarly, that such an archive must preserve, capture, and convey big data in ways that allow future generations to understand our present culture. It is imperative that we preserve both the physical data and the user experiences in order to best represent our present-day relationship with big data.
Darnton, Robert. 1990. “First Steps Toward a History of Reading.” In The Kiss of Lamourette: Reflections in Cultural History, 154–87. New York: W.W. Norton. http://robertdarnton.org/sites/default/files/First%20Steps%20Toward%20a%20History%20of%20Reading.pdf.
———. 2013. “Blogging, Now and Then (250 Years Ago).” European Romantic Review 24 (3): 255–70. https://doi.org/10.1080/10509585.2013.789694.
Howsam, Leslie. 2006. Old Books and New Histories: An Orientation to Studies in Book and Print Culture. Toronto: University of Toronto Press.
Lorie, Raymond. 2001. “Long Term Preservation of Digital Information.” IBM Almaden Research Center. https://zoo.cs.yale.edu/classes/cs426/2014/bib/lorie01long.pdf.
Lynch, Clifford. 2017. “Stewardship in the Age of Abundance.” First Monday 22 (12). https://firstmonday.org/ojs/index.php/fm/article/view/8097/6583.
Milligan, Ian. 2019. History in the Age of Abundance? How the Web Is Transforming Historical Research. Montreal: McGill-Queen’s University Press.
Petrov, Christo. 2019. “Big Data Statistics 2020.” TechJury. https://techjury.net/stats-about/big-data-statistics/#gref.
Pollock, Sean. 2019. “‘Hundreds of Millions’ to Be Invested in Data Centre in Meath.” Independent.ie. https://www.independent.ie/business/irish/hundreds-of-millions-to-be-invested-in-data-centre-in-meath-38720181.html.
Reinsel, David, John Gantz, and John Rydning. 2017. “The Digitization of the World From Edge to Core.” International Data Corporation. https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf.
Wamsley, Laurel. 2017. “Library of Congress Will No Longer Archive Every Tweet.” NPR. https://www.npr.org/sections/thetwo-way/2017/12/26/573609499/library-of-congress-will-no-longer-archive-every-tweet.