Module 2: Building a Preservation System

Duration: 2 weeks

Type: Technical Implementation & Infrastructure Design

Module Overview

You've investigated the dead. Now it's time to save the living.

In this module, you'll design and implement a working digital preservation system. You'll learn archival workflows, storage strategies, metadata standards, and the technical infrastructure needed to rescue content before it vanishes. This is hands-on Archive work—by the end, you'll have actually preserved something real.

Learning Objectives

By the end of this module, you will be able to:

1. Design preservation workflows: Create systematic processes for intake, processing, storage, and access
2. Implement storage infrastructure: Set up reliable, redundant storage systems for digital artifacts
3. Apply metadata standards: Use Dublin Core, PREMIS, and other standards to document preserved content
4. Use archival tools: Operate Internet Archive tools, web scrapers, and archival software
5. Preserve dynamic content: Capture not just static files but interactive web experiences
6. Document preservation decisions: Create clear records of what was saved, why, and how

Required Readings

Primary Texts

- Textbook Chapter 10: Triage Workflow—From Discovery to Preservation - Textbook Chapter 11: Sustainable Preservation Organizations

Technical Standards

- Library of Congress, "Sustainability of Digital Formats"
- Dublin Core Metadata Initiative, "Basic Guidelines"
- PREMIS Editorial Committee, "Data Dictionary for Preservation Metadata"

Case Studies

- Internet Archive, "Wayback Machine Architecture"
- Archive Team, "Project Management Workflow"
- UCLA Digital Library, "Digital Preservation Policy"

Assignment Structure

Week 1: Infrastructure & Tools

Part 1: Preservation System Design (Due: Day 3)

Design a preservation infrastructure for a specific use case. Choose one:

Option A: Personal Web Archive
- Preserve your own digital footprint (social media, blogs, websites you've created)
- Design for individual/small team use
- Budget: Free or under $50/month

Option B: Community Archive
- Preserve a specific online community (subreddit, Discord server, forum)
- Design for 5-10 volunteer archivists
- Budget: Under $200/month

Option C: Endangered Content Collection
- Preserve a category of at-risk content (indie web, GeoCities-style sites, web comics)
- Design for distributed volunteer effort
- Budget: Grant-funded ($5-10K/year)

Your Design Document (3-4 pages) must include:

1. Mission & Scope
- What are you preserving?
- Why is it at risk?
- What are the boundaries of your collection? (dates, types, geographic scope)
- What will you NOT preserve? (out-of-scope content)

2. Storage Architecture

Choose your storage strategy:

Cold Storage Options:
- External hard drives (cheap, manual, at risk if they hold the only copy)
- Network Attached Storage / NAS (prosumer, RAID redundancy)
- Cloud storage (AWS S3, Backblaze B2, Internet Archive)

Hot Storage Options (for access/discovery):
- Local web server
- Static site generator (Jekyll, Hugo)
- Archive hosting service (Omeka, CollectiveAccess)

Redundancy Strategy:
- How many copies? (follow the 3-2-1 rule: 3 copies, 2 media types, 1 offsite)
- Where are copies stored?
- How often are backups verified?
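The 3-2-1 rule above is easy to check mechanically once you list your copies. The following is a minimal sketch, not a required tool: the `Copy` class, its field names, and the example locations are all hypothetical and only illustrate the three conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One stored copy of the collection (all values here are illustrative)."""
    location: str     # e.g. "external-drive", "cloud-bucket"
    media_type: str   # e.g. "ssd", "hdd", "cloud"
    offsite: bool     # True if kept away from the primary site


def check_3_2_1(copies: list[Copy]) -> list[str]:
    """Return a list of problems; an empty list means the plan meets 3-2-1."""
    problems = []
    if len(copies) < 3:
        problems.append(f"only {len(copies)} copies, need at least 3")
    media = {c.media_type for c in copies}
    if len(media) < 2:
        problems.append(f"only {len(media)} media type(s), need at least 2")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    return problems


# Hypothetical plan for a personal archive.
plan = [
    Copy("laptop-ssd", "ssd", offsite=False),
    Copy("external-drive", "hdd", offsite=False),
    Copy("cloud-bucket", "cloud", offsite=True),
]
print(check_3_2_1(plan) or "plan satisfies 3-2-1")
```

A check like this only confirms that the plan is sound on paper. It says nothing about whether the copies are intact, which is what backup verification is for.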

3. Technical Stack

Specify your tools for:

Capture:
- Web scraping: wget, HTTrack, Webrecorder
- Social media: twitter-scraper, gallery-dl, youtube-dl
- Dynamic content: Webrecorder, Conifer, Browsertrix

Processing:
- File validation: JHOVE, DROID, Siegfried
- Format conversion: FFmpeg, ImageMagick, Pandoc
- Virus scanning: ClamAV

Metadata:
- Schema: Dublin Core, PREMIS, or custom
- Database: SQLite, PostgreSQL, or spreadsheet
- Tools: OpenRefine, Tropy, Omeka

Access:
- How will users discover preserved content?
- Search interface or browsing?
- Public or restricted access?
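If you pick SQLite for the metadata database, the metadata and access questions above can be prototyped in a few lines. This is a rough sketch under stated assumptions: the `items` table, its column names, the two access values, and the sample rows are invented for illustration, not a prescribed schema.

```python
import sqlite3

# In-memory database for the demo; a real archive would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE items (
        identifier  TEXT PRIMARY KEY,
        title       TEXT NOT NULL,
        creator     TEXT,
        source_url  TEXT NOT NULL,
        captured_on TEXT NOT NULL,  -- ISO 8601 date
        access      TEXT NOT NULL CHECK (access IN ('public', 'restricted'))
    )
""")

# Hypothetical sample rows.
rows = [
    ("item-0001", "Example Webcomic, Chapter 1", "Jane Doe",
     "https://example.com/comic/1", "2024-01-15", "public"),
    ("item-0002", "Example Webcomic, Author Notes", "Jane Doe",
     "https://example.com/notes", "2024-01-15", "restricted"),
    ("item-0003", "Forum Thread: Site Shutdown", None,
     "https://example.org/thread/9", "2024-01-16", "public"),
]
conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?, ?, ?)", rows)


def search(conn, term, include_restricted=False):
    """Title search; restricted items are hidden unless explicitly requested."""
    sql = "SELECT identifier, title FROM items WHERE title LIKE ?"
    if not include_restricted:
        sql += " AND access = 'public'"
    sql += " ORDER BY identifier"
    return conn.execute(sql, (f"%{term}%",)).fetchall()


print(search(conn, "webcomic"))
print(search(conn, "webcomic", include_restricted=True))
```

Keeping the access level in the same row as the descriptive metadata means every query has to decide, explicitly, whether restricted material is in scope.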

4. Workflow Diagram

Create a flowchart showing:
- Content discovery → Triage → Capture → Processing → Storage → Access
- Decision points: What gets preserved? How much metadata?
- Human roles: Who does what?

Deliverable: Upload design document with workflow diagram as PDF

Part 2: Hands-On Preservation (Due: End of Week 1)

Actually preserve something. Conduct a real preservation operation using the tools you specified.

Choose a preservation target:

Small-Scale (recommended for beginners):
- A personal website or blog that's inactive (with permission)
- A small subreddit or forum thread
- A YouTube channel or Instagram account
- A collection of 50-100 articles from a dying news site

Medium-Scale (for confident technologists):
- An entire small website or blog network
- A Discord server's public channels (with mod permission)
- A complete Twitter account's history
- A webcomic archive

Ambitious (for experienced archivists):
- A small online community (forum, GeoCities-style neighborhood)
- A category of endangered content (indie game devlogs, personal sites)
- A protest movement's digital artifacts
- A grassroots organization's web presence

Your Preservation Report (4-5 pages) must document:

1. Pre-Capture Assessment
- What did you decide to preserve?
- Why is it at risk or significant?
- What format(s) is the content in?
- How much data? (estimated size, number of files)

2. Technical Process

Document every command you ran:

```bash
# Example documentation format:
wget --mirror --page-requisites --adjust-extension \
     --convert-links --wait=2 \
     https://example.com
```

Include:
- Tools used and why you chose them
- Problems encountered and how you solved them
- Workarounds for technical limitations
- Time spent (be honest about how long it took!)
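One way to keep an honest record of every command and how long it took is to run your capture tools through a small logging wrapper. The sketch below is one possible approach, not a course requirement: `run_logged`, the JSON Lines log format, and the log file name are all assumptions. The demo runs a harmless Python one-liner in place of a real capture command.

```python
import json
import os
import subprocess
import sys
import tempfile
import time
from datetime import datetime, timezone


def run_logged(cmd: list[str], log_path: str) -> dict:
    """Run cmd, append one JSON line describing the run, and return that entry."""
    started = datetime.now(timezone.utc).isoformat()
    t0 = time.monotonic()
    result = subprocess.run(cmd, capture_output=True, text=True)
    entry = {
        "command": cmd,
        "started_utc": started,
        "seconds": round(time.monotonic() - t0, 2),
        "exit_code": result.returncode,
        "stderr_tail": result.stderr[-500:],  # keep only the end of error output
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


# Demo: the log lives in the temp directory; the command is a stand-in
# for a real capture tool invocation such as the wget example.
log_file = os.path.join(tempfile.gettempdir(), "capture_log.jsonl")
entry = run_logged([sys.executable, "-c", "print('capture placeholder')"], log_file)
print(entry["exit_code"], entry["seconds"])
```

The resulting log can be pasted into the Technical Process section of your report more or less as-is.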

3. Metadata Creation

Create a preservation record for each item or collection:

Minimum metadata fields (Dublin Core):
- Title
- Creator
- Date (original creation + preservation date)
- Description (50-200 words)
- Format (file types, codecs, versions)
- Source (original URL)
- Rights (copyright status, preservation justification)
- Identifier (unique ID in your system)
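A record built from the minimum fields above might look like the following sketch. The lowercase keys, the nested `date` object, and every value in the sample record are illustrative assumptions. Dublin Core itself does not mandate this JSON layout, and the 50-200 word check simply mirrors this assignment's requirement.

```python
import json

REQUIRED_FIELDS = (
    "title", "creator", "date", "description",
    "format", "source", "rights", "identifier",
)


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if not record.get(name)]
    description = record.get("description")
    if description:
        words = len(str(description).split())
        if not 50 <= words <= 200:
            problems.append(f"description has {words} words, expected 50-200")
    return problems


# Hypothetical record; the description is deliberately too short
# so the validator has something to report.
record = {
    "title": "Example Retro Computing Blog",
    "creator": "Jane Doe",
    "date": {"created": "2012-03-01", "preserved": "2024-01-15"},
    "description": "Personal blog about retro computing, inactive since 2016.",
    "format": ["text/html", "image/png"],
    "source": "https://example.com",
    "rights": "Copyright Jane Doe; preserved with permission",
    "identifier": "item-0001",
}

print(json.dumps(record, indent=2))
print(validate_record(record))
```

Running a validator like this over every record before submission catches empty fields early, while they are still cheap to fix.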

4. Quality Assurance

Verify your preservation:
- Did everything capture correctly?
- Are files readable and uncorrupted?
- Did you get images, CSS, JavaScript?
- Can dynamic content be replayed?
- What's missing? (broken links, external dependencies)
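Part of this checklist can be automated. The sketch below scans one mirrored HTML file and reports page requisites that are missing locally or that still point at external hosts. It assumes links were rewritten to relative paths (as `wget --convert-links` does) and only looks at `img`, `script`, and `link` tags. `audit_page` and the report layout are hypothetical names for illustration.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlparse


class AssetCollector(HTMLParser):
    """Collect src/href values from tags that usually load page requisites."""
    ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        wanted = self.ATTRS.get(tag)
        for name, value in attrs:
            if wanted and name == wanted and value:
                self.refs.append(value)


def audit_page(html_file: Path) -> dict:
    """Classify each referenced asset as present, missing locally, or external."""
    collector = AssetCollector()
    collector.feed(html_file.read_text(encoding="utf-8", errors="replace"))
    report = {"missing_local": [], "external": []}
    for ref in collector.refs:
        parsed = urlparse(ref)
        if parsed.scheme == "data":          # inline data, nothing to fetch
            continue
        if parsed.scheme or parsed.netloc:   # still depends on a live server
            report["external"].append(ref)
        elif not (html_file.parent / unquote(parsed.path)).is_file():
            report["missing_local"].append(ref)
    return report


# Demo on a throwaway mirror: one asset present, one missing, one external.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "style.css").write_text("body {}", encoding="utf-8")
    page = root / "index.html"
    page.write_text(
        '<link rel="stylesheet" href="style.css">'
        '<img src="logo.png">'
        '<script src="https://cdn.example.com/app.js"></script>',
        encoding="utf-8",
    )
    report = audit_page(page)
print(report)
```

A clean report does not prove the page replays correctly, so you still need to open the capture in a browser, especially for dynamic content.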

5. Storage & Backup

Document how you stored the preserved content:
- Where are the files? (local, cloud, both)
- What's your backup strategy?
- How will you verify integrity over time? (checksums, periodic testing)
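Checksum-based integrity checking, mentioned above, can be sketched with the standard library alone. The example below builds a SHA-256 manifest for a directory and later re-verifies it. The function names and the manifest layout (relative path mapped to hex digest) are assumptions for illustration, and the demo uses a temporary directory rather than a real archive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large captures do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*")) if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return a list of missing or changed files; empty means all intact."""
    problems = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(target) != expected:
            problems.append(f"changed: {rel}")
    return problems


# Demo: record checksums, then simulate silent corruption and a lost file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "page.html").write_bytes(b"hello")
    (root / "img").mkdir()
    (root / "img" / "logo.png").write_bytes(b"\x89PNG")
    manifest = build_manifest(root)
    clean = verify(root, manifest)
    (root / "page.html").write_bytes(b"hellp")
    (root / "img" / "logo.png").unlink()
    damaged = verify(root, manifest)
print(clean, damaged)
```

Store the manifest alongside each copy and re-run verification on a schedule. A checksum that is never rechecked protects nothing.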

Deliverable:
- Upload preservation report as PDF
- Submit sample metadata records (CSV or JSON)
- Include 3-5 screenshots showing before/after or archival interface

Week 2: Metadata, Ethics & Sustainability

Part 3: Metadata & Documentation (Due: Mid-Week 2)

Create a comprehensive metadata system for your preserved content.

Task 1: Metadata Schema Design

Expand beyond basic Dublin Core. Create a custom metadata schema for your specific collection type.

Required elements:
- All 15 Dublin Core elements
- PREMIS preservation metadata (capture method, software versions, fixity)
- Custom fields relevant to your content type

Example custom fields for a social media archive:
- Original post timestamp
- Engagement metrics (likes, shares, comments)
- Thread structure (replies, quote tweets)
- Content warnings or tags
- Platform-specific metadata (subreddit, hashtags, etc.)

Task 2: Metadata Entry

Create detailed metadata for at least 10 significant items from your preservation:
- Use your custom schema
- Export as CSV, JSON, or XML
- Include both human-readable and machine-readable formats
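Exporting the same records as both JSON and CSV is straightforward with the standard library. The sketch below assumes records are plain dictionaries. The field names (`post_timestamp`, `likes`, `hashtags`) are hypothetical custom fields for a social media archive, and joining list values with "; " is just one convention for fitting them into a CSV cell.

```python
import csv
import io
import json

# Hypothetical records using a custom schema; the second lacks some fields.
records = [
    {"identifier": "post-0001", "title": "Launch announcement",
     "post_timestamp": "2014-06-02T18:30:00Z", "likes": 42,
     "hashtags": ["archiving", "webcomics"]},
    {"identifier": "post-0002", "title": "Site closing notice",
     "post_timestamp": "2015-11-20T09:00:00Z"},
]


def to_json(records: list[dict]) -> str:
    """Machine-readable export that preserves nesting and value types."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records: list[dict]) -> str:
    """Spreadsheet-friendly export; missing fields become empty cells."""
    fieldnames = []
    for rec in records:                # union of keys, in first-seen order
        for key in rec:
            if key not in fieldnames:
                fieldnames.append(key)
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="")
    writer.writeheader()
    for rec in records:
        writer.writerow({
            k: "; ".join(v) if isinstance(v, list) else v
            for k, v in rec.items()
        })
    return buf.getvalue()


print(to_json(records))
print(to_csv(records))
```

JSON keeps lists and numbers intact for later processing, while the CSV is easier to review by eye or in OpenRefine. Submitting both covers the human-readable and machine-readable requirement.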

Task 3: Documentation Package

Write a Collection Policy Document (2-3 pages) that includes:

Scope Statement:
- What's in this collection?
- What collection boundaries did you define?
- What's explicitly excluded?

Capture Methodology:
- Tools and techniques used
- Capture settings and parameters
- Date(s) of capture

Known Limitations:
- What couldn't be captured?
- What quality compromises were made?
- What's missing or incomplete?

Preservation Decisions:
- Why was this content prioritized?
- What ethical considerations arose?
- Who was consulted (if applicable)?

Future Maintenance:
- How often will this be re-captured (if ongoing)?
- Who's responsible for updates?
- What's the long-term plan?

Deliverable:
- Upload metadata files (CSV/JSON)
- Upload Collection Policy PDF

Part 4: Ethics & Access (Due: End of Week 2)

Navigate the ethical complexities of preservation and access.

Write a 4-5 page ethics analysis addressing:

1. Permission & Consent

For the content you preserved:
- Did you have explicit permission?
- Was the content publicly accessible or behind login?
- If it was public, does that mean it should be preserved?
- Did you contact creators or community members?

Analyze using these frameworks:
- Legal: Copyright, terms of service, DMCA
- Ethical: Informed consent, community norms, contextual integrity
- Practical: What would happen if creators objected?

2. Privacy & Harm

Consider the risks:
- Could this preservation harm individuals?
- Is personal information exposed?
- Are there vulnerable communities involved?
- Could this be used for doxxing, harassment, or surveillance?

Apply Helen Nissenbaum's "Contextual Integrity":
- What were the original context norms?
- Does preservation violate those norms?
- How can you preserve content while respecting privacy?

3. Access Restrictions

Design an access policy:

Open Access: Anyone can view preserved content
- When is this appropriate?
- What content should be public?

Restricted Access: Researchers only, registration required
- What's the justification?
- How do you verify researcher credentials?

Dark Archive: Preserved but not accessible
- When is preservation without access ethical?
- How do you document something you won't show?

Embargoed: Time-delayed release
- What embargo period makes sense?
- Who decides when to lift it?
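Once you have chosen among the four access tiers above, the policy can be encoded so that it is applied consistently. The sketch below is one possible encoding, with several assumptions: the `Access` enum, the `Item` class, and the rule that an embargoed item becomes viewable to everyone once its date passes are illustrative choices, not the only defensible ones.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Access(Enum):
    OPEN = "open"
    RESTRICTED = "restricted"
    DARK = "dark"
    EMBARGOED = "embargoed"


@dataclass
class Item:
    identifier: str
    access: Access
    embargo_until: Optional[date] = None  # only meaningful for EMBARGOED


def can_view(item: Item, is_verified_researcher: bool, today: date) -> bool:
    """Decide whether a requester may view an item on a given day."""
    if item.access is Access.OPEN:
        return True
    if item.access is Access.RESTRICTED:
        return is_verified_researcher
    if item.access is Access.EMBARGOED:
        # An embargo with no end date is treated as still in force.
        return item.embargo_until is not None and today >= item.embargo_until
    return False  # DARK: preserved, but never served


# Hypothetical items and a fixed "today" so the demo is reproducible.
today = date(2025, 1, 1)
items = [
    Item("forum-index", Access.OPEN),
    Item("member-profiles", Access.RESTRICTED),
    Item("private-messages", Access.DARK),
    Item("teen-posts", Access.EMBARGOED, embargo_until=date(2040, 1, 1)),
]
for item in items:
    print(item.identifier, can_view(item, is_verified_researcher=False, today=today))
```

Writing the rules down this explicitly tends to surface open questions, such as who sets `embargo_until` and how researcher verification actually happens, which belong in your written policy.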

4. Takedown Policy

What happens if someone objects to their content being preserved?

Write a takedown procedure:
- How can creators request removal?
- What's your decision process?
- Will you preserve metadata even if you remove content?
- How do you balance preservation mission vs. individual rights?

Case Study to Address:

A small forum you archived includes intense personal posts from 2010-2015. Users were teens then; they're adults now. One user emails you: "I was 15 when I wrote that stuff. It's embarrassing and I don't consent to it being preserved. Delete everything I wrote."

What do you do? Analyze using:
- Legal rights (terms of service, copyright)
- Ethical obligations (consent, harm)
- Preservation principles (significance, completeness)
- Practical considerations (feasibility of removal)

Deliverable: Upload ethics analysis essay as PDF

Assessment Rubric

Preservation System Design (20 points)

- Completeness (10 pts): All components specified (storage, tools, workflow)
- Feasibility (5 pts): Realistic for stated budget and resources
- Documentation quality (5 pts): Clear diagrams and explanations

Hands-On Preservation (30 points)

- Technical execution (15 pts): Successfully preserved target content
- Documentation thoroughness (10 pts): Detailed process notes and commands
- Quality assurance (5 pts): Verified completeness and integrity

Metadata & Documentation (25 points)

- Metadata schema design (10 pts): Appropriate fields for content type
- Metadata quality (10 pts): Accurate, complete records
- Collection policy (5 pts): Clear scope and preservation decisions

Ethics Analysis (25 points)

- Ethical reasoning (15 pts): Sophisticated analysis of consent and harm
- Access policy design (5 pts): Thoughtful, practical restrictions
- Takedown procedure (5 pts): Balances preservation and individual rights

Discussion Forum Prompts

Week 1: Tool Failures

Share a technical problem you encountered during preservation. What broke? How did you troubleshoot it? Failure is part of the process, so let's learn from each other.

Week 2: The Right to Be Forgotten

Debate: Do individuals have a right to remove themselves from digital archives? Should the Internet Archive honor takedown requests for old social media posts? Where's the line between historical preservation and personal privacy?

Technical Resources & Tools

Capture Tools

Web Archiving:
- wget: `wget --mirror --page-requisites --convert-links [URL]`
- HTTrack: GUI-based website copier
- Webrecorder.io (renamed Conifer in 2020): Browser-based high-fidelity capture
- Browsertrix-Crawler: Modern web archiving with JavaScript support

Social Media:
- gallery-dl: Instagram, Twitter, Reddit, Tumblr
- youtube-dl / yt-dlp: Video platform downloads
- PSAW: Reddit archive scraper
- twarc: Twitter API client for researchers

Dynamic Content:
- Webrecorder Desktop: Capture interactive web apps
- Conifer: Cloud-based web archiving
- ArchiveBox: Self-hosted archiving suite

Storage Solutions

Free/Cheap:
- Internet Archive (free, unlimited)
- Backblaze B2 ($5/TB/month)
- Google Drive (15GB free, $2/100GB)

Prosumer:
- Synology/QNAP NAS ($400-1000 hardware)
- AWS S3 Glacier (long-term cold storage)

Institutional:
- Preservica (commercial DAM)
- Archivematica (open-source preservation)
- DSpace (institutional repository)

Metadata Tools

- OpenRefine: Clean and transform messy data
- Tropy: Photo metadata management
- Omeka: Collection management with Dublin Core
- ExifTool: Extract metadata from media files

Format Validation

- JHOVE: File format validation
- DROID: Format identification
- Siegfried: Fast format identification
- MediaInfo: Audio/video format inspector

Real-World Examples

Successful Preservation Projects

Internet Archive:
- 600+ billion web pages preserved
- Wayback Machine infrastructure
- $36M annual operating budget

Archive Team:
- 500+ platform rescue operations
- Distributed volunteer model
- Crisis response preservation

Personal Examples:
- Jason Scott's personal archive (500TB+)
- Decentralized social media backup tools
- Scholar-activists preserving protest movements

Failed Preservation Attempts

GeoCities Partial Save (2009):
- Only ~650GB of ~1TB captured
- Lost pages still being discovered
- Lesson: Need better coordination

Flickr Commons Near-Miss (2019):
- Flickr announced it would cap free accounts at 1,000 photos, ending the 1TB free allowance and deleting anything over the limit
- Community scramble to save
- Lesson: Always have backups

Recommended Supplementary Materials

Books

- Digital Preservation by Ross Harvey (technical primer)
- Preservation of Electronic Records by William Saffady
- Stewarding Digital Humanities by Jennifer Guiliano

Online Courses

- Library Juice Academy, "Digital Preservation"
- DPOE, "Digital Preservation Essentials"
- Internet Archive, "Web Archiving Basics"

Communities

- Archive Team IRC (#archiveteam on hackint; the project moved there from EFnet)
- IIPC (International Internet Preservation Consortium)
- Digital Preservation Coalition

Getting Help

Office Hours: Book time to troubleshoot technical issues

Lab Hours: Open lab sessions with hands-on assistance (Wed 6-8pm)

Community Support: The Archive Team community is a good first stop for tool questions

Library Resources: Digital preservation librarian can help with metadata standards and institutional tools

Next Module Preview

In Module 3: Designing for Sovereignty, you'll switch from Archive to Anvil. You'll apply the Three Pillars framework to design systems that CAN'T be murdered—platforms built with sovereignty principles from day one.

Get ready to: Audit platforms, prototype alternatives, and design for resistance against corporate control.

"We do not preserve to keep the past frozen. We preserve so that the lessons of the dead can teach the living how to build systems that resist murder."