The short life span of storage hardware reverses our relationship with primary sources. On average, SSDs under stress hold data for about 7 years, HDDs for up to around 30 years, and optical discs for 100 to 200 years, but only under climate-controlled conditions. Beyond bit rot, future researchers of the 2010s and 2020s risk losing access to digital artifacts that come to be deemed “historical” because of compatibility issues, encryption, and data sharding. The physical existence of big data in the form of data center infrastructure poses an additional challenge. Hundreds of years from now, one will not be able to simply excavate what remains of today’s data centers, power them up, and extract information from them.

Future generations will be able to retrieve only what we choose to preserve for them. As of 2020, the Internet Archive had preserved 70 petabytes of the estimated 59 zettabytes of data generated worldwide, roughly one-millionth of the total. What does this archive represent in terms of language, jurisdiction, and content type? Considering the dynamic existence of big data, how do we define the boundaries of the Canadian, Danish, or South Korean web? Are they determined by the users’ nationality or by server location? How do we crawl online forums restricted by membership levels? How do we optimize and prioritize this process while ensuring its diversity?

Engineers have offered impressive technological solutions to these challenges. M-DISC’s corrosion-resistant optical discs are rated to last 1,000 years, and Project Silica’s 5D optical data storage could preserve data for up to 1,000,000 years. Universal Virtual Computer and the Internet Archive’s Flash emulation are examples of attempts to keep today’s digital content retrievable and readable in the future. Machine learning-driven extreme compression techniques might allow us to pack more data into archival media with limited capacity.

However, we suggest taking a step back and offering a humanist’s intervention. What is an archive in an age of big data? In Prineville, Oregon, Facebook operates a cold storage facility with low-power towers of Blu-ray discs and hard drives, which preserve infrequently accessed social media data. Is this an archive? Or is it a data dump of decontextualized information and scrambled file structures? What about a collection of notebook computers and mobile phones?

The Big Data Studies Lab pursues the following projects germane to this theme:

  • big data as archive
  • defining the Korean web
  • material bibliography of mobile phones