


The Data Refuge Project

DataRefuge is a public, collaborative, grassroots effort across the United States in which scientists, researchers, computer scientists, librarians, and other volunteers are working to download, save, and re-upload government data. The DataRefuge project, led by the UPenn Program in Environmental Humanities and Penn Libraries at the University of Pennsylvania, aims to foster resilience in an era of anthropogenic global climate change and to raise awareness of how social and political events affect transparency.

Since the inauguration, a range of climate change information has been removed from the White House website. Next under threat is the U.S. Environmental Protection Agency's climate change page.

But this was no surprise to the members of the DataRefuge project, which began in anticipation of the suppression of federal data. Given the views on climate change that Trump expressed during his campaign, researchers were concerned that climate and environmental data would be blocked from public access in the future. So, starting in January, the DataRefuge project has coordinated 'Data Rescue' events nationwide to gather and preserve data on climate change and the environment. During these Data Rescue events, participants can store data in, or access copies of data from, DataRefuge's S3 buckets and the datasets cataloged in CKAN, an open-source data catalog.
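For readers curious what looking up datasets in a CKAN catalog can look like, here is a minimal sketch in Python. It uses CKAN's standard Action API endpoint `package_search`. The base URL `https://ckan.example.org` is a placeholder and not DataRefuge's actual address, and the helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url: str, query: str, rows: int = 5) -> str:
    """Build a CKAN Action API package_search URL for a free-text query."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def resource_urls(response: dict) -> list[tuple[str, str]]:
    """Return (dataset title, resource URL) pairs from a package_search reply."""
    if not response.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    pairs = []
    for dataset in response["result"]["results"]:
        # Fall back to the dataset's machine name if it has no title.
        title = dataset.get("title", dataset["name"])
        for resource in dataset.get("resources", []):
            pairs.append((title, resource["url"]))
    return pairs


def search(base_url: str, query: str, rows: int = 5) -> list[tuple[str, str]]:
    """Query a live CKAN instance (needs network access)."""
    with urlopen(build_search_url(base_url, query, rows)) as reply:
        return resource_urls(json.load(reply))


if __name__ == "__main__":
    # Placeholder host: swap in the address of a real CKAN catalog.
    print(build_search_url("https://ckan.example.org", "climate"))
```

Calling `search(...)` against a real catalog would return the download links attached to each matching dataset record.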

Federal environmental and climate change data helps inform our scientific knowledge of the earth. For this reason, DataRefuge advocates for more robust archiving of born-digital materials and for more reliable access to them. These materials are already susceptible to poor management, bit rot (the tendency of data to degrade), and even direct attempts at reducing access. Hence, an underlying goal is to make sure that the data remains available to communities as trustworthy copies and that we don't lose those precious facts.

The cornerstone of DataRefuge is its documentation of a clear "chain of custody." Without this documentation, the trustworthiness and research quality of the data cannot be established. So, to safeguard the data and ensure its authenticity, DataRefuge relies on multiple checks along the chain: where the data originally comes from, who copies it and how, and who redistributes it and how. Trained librarians and archivists carry out these checks to provide QA at every link in the chain. Then, at the very end of the chain, DataRefuge verifies the quality of the data.

Verification is also done in a series of steps to ensure that the data is usable and trustworthy. After a dataset is harvested, it is checked against the original source copy to make sure it is complete. Then digital preservation experts inspect the dataset again to make sure that the content is correct and reflects the right information. The data is packaged into a BagIt file (or "bag") so that any later changes can be detected. The bags then move to describers, who spot-check for errors and create a descriptive record for each bag in the DataRefuge CKAN repository. After adding more metadata, the describers make the record public.
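The idea behind packaging data so that later changes can be detected is a checksum manifest: record a hash of every file once, then recompute and compare later. The Python sketch below illustrates that idea with the standard library only. It is a simplified stand-in and not a full BagIt implementation (a real bag also keeps its payload in a `data/` directory alongside tag files), and the function names are hypothetical rather than anything DataRefuge is known to use.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir: Path, manifest: Path) -> None:
    """Record one '<digest>  <relative path>' line per file under data_dir.

    The manifest itself should live outside data_dir.
    """
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(data_dir).as_posix()
        lines.append(f"{sha256_of(path)}  {rel}\n")
    manifest.write_text("".join(lines), encoding="utf-8")


def verify_manifest(data_dir: Path, manifest: Path) -> list[str]:
    """Return the relative paths that are missing or whose content changed."""
    problems = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        expected, rel = line.split("  ", 1)
        target = data_dir / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    return problems
```

An empty list from `verify_manifest` means every recorded file is still present and byte-for-byte unchanged. Any path it returns has been altered, corrupted (for example by bit rot), or removed since the manifest was written. Note that this sketch does not flag files added after the manifest was created.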

A project related to DataRefuge is Project Svalbard, the codename for a controlled collection of public scientific datasets. It takes its name from the underground backup library of plant DNA housed in the Svalbard Global Seed Vault in the Arctic. Project Svalbard also uses the peer-to-peer data-sharing tools created by the Dat Project.

References to things mentioned in the show:

  • bit rot - data degradation

  • CKAN - an open-source data management system for powering data hubs and data portals that makes it easy to publish, share and use data

  • The Dat Project - a grant-funded, open source, decentralized tool for distributing data sets.

  • BagIt - a hierarchical file packaging format for storage and transfer of arbitrary digital content.




Margaret Janz


@MargaretJanz

Margaret Janz holds a master's in library science from Indiana University Bloomington. She is presently the Scholarly Communication and Data Curation Librarian at the University of Pennsylvania. Her areas of expertise include data management and the promotion of digital information literacy. She is currently a member of the DataRefuge project, working hard to create trustworthy copies of federal climate and environmental data, and it's those efforts I invited her here to discuss today.