What is web archiving?
According to the Society of American Archivists' Dictionary, web archiving is "the process of collecting, preserving, and providing enduring access to web content."
Web archiving can take many forms, but it most commonly involves making and storing "preserved copies of live web content collected for permanent retention and access." Practically, this means creating a copy of all of the code behind a webpage, and of the way that code is displayed at a very specific point in time, with the intention of being able to access that capture of the webpage as-is in the future.
Web archiving is often done by crawlers: automated agents (also called robots or spiders, because they crawl the web) that scrape the web for pages related to a specific URL or collection. A crawl identifies materials on the live web that belong in a collection, based on a chosen set of seed URLs and a defined scope. Crawlers are capable of making distinctions about which pages to preserve, but those decisions are based on human direction and input. For this reason, and given the sheer size and scope of the World Wide Web today, many webpages are never archived by crawlers and will naturally disappear: domains break, get blocked, or expire; servers go down; web hosts and builders abandon their projects; and transitions between site managers create massive changes to existing sites.
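To make the crawler's scoping decision concrete, here is a minimal sketch, using only the Python standard library, of how an automated agent might decide which discovered links are "in scope" for a seed URL. The sample HTML and the example.com URLs are stand-ins, not real captures.

```python
# A minimal sketch of crawler link scoping: collect every link on a page,
# then keep only those on the same host as the seed URL.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href targets from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def in_scope_links(seed_url, html):
    """Return absolute URLs on the same host as the seed (the crawl scope)."""
    parser = LinkCollector()
    parser.feed(html)
    seed_host = urlparse(seed_url).netloc
    absolute = (urljoin(seed_url, href) for href in parser.links)
    return [url for url in absolute if urlparse(url).netloc == seed_host]


sample = (
    '<a href="/about.html">About</a>'
    '<a href="https://example.com/projects.html">Projects</a>'
    '<a href="https://other-site.org/">Elsewhere</a>'
)
# Both example.com links are kept; the external link is out of scope.
print(in_scope_links("https://example.com/", sample))
```

A real crawler would then fetch each in-scope URL, archive it, and repeat the process on the new pages until the scope is exhausted.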
So why archive webpages?

Personal websites, digital portfolios, and digital humanities projects in particular are prone to dying. Since these projects may sometimes be one-off assignments for a course or digital work completed by students with more limited resources than gigantic institutions, they often fall prey to the pit of defunctness. Archiving a copy of web-based or web-hosted projects before you officially decide to step away from them is the best way to preserve your work, either for future self-reflection or for other scholars/users who may find it crucial to their own workflows or research.
What is web archiving good for?
Note that if you’re looking to archive webpages just so that you can see an earlier version of your site, there are much better ways to do that using version control systems like Git or Mercurial. Web archiving is most useful for saving an accessible copy of webpages and websites that are not likely to be repeatedly crawled or are not already part of a dedicated archiving program. At Cornell, for example, archiving a cornell.edu site is a moot point, since Cornell University Library’s Web Archiving Technician crawls all cornell.edu sites on a schedule using an institutional instance of the Internet Archive’s paid Archive-It service. Archiving my personal website, my favourite blog post, or an obscure digitized copy of a crucial 19th-century text would be a great use of web archiving, since these sites and pages may not be captured by someone else’s proprietary or institutional crawler.

Getting started
Terms to know
Archive-It has a glossary of terms that may be useful to review, but the five main ones that will come up in this guide are: snapshot, capture, WARC file, website, and webpage. You may think that website and webpage are self-explanatory terms, but it’s important to know the difference, since some web archiving tools will only capture pages while others will capture entire websites.
- Capture: A full copy of the digital information encoded within a webpage or website that is then archived.
- Snapshot: A complete capture of a website or webpage’s content at a specific point in time. As if you had taken a screenshot of the page, but interactive.
- WARC file: A web archive file format that contains a capture of a specific URL.
- Website: A collection of web resources and pages hosted in one particular place and hosted on the Internet by a web server.
- Webpage: A single HTML-based document on the web, often hosted on a website through a relative URL (e.g. website.com/web-page/).
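To make the WARC file format less abstract, here is a sketch that builds a single minimal WARC "resource" record by hand, using only the Python standard library. Real archiving tools use dedicated libraries for this, and the URL and payload below are stand-ins; the point is just to show that a WARC file is a sequence of plain records, each with named headers followed by the captured bytes.

```python
# A sketch of what's inside a WARC file: each capture is stored as a
# record with named header fields followed by the raw payload.
import uuid
from datetime import datetime, timezone


def warc_resource_record(url, payload: bytes, content_type="text/html"):
    """Return one minimal WARC/1.0 resource record as bytes."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header block, blank line, payload, then the record separator.
    return "\r\n".join(headers).encode() + b"\r\n\r\n" + payload + b"\r\n\r\n"


record = warc_resource_record("https://example.com/", b"<html>hello</html>")
print(record.decode())
```

A full WARC file is simply many such records concatenated (usually gzip-compressed, as `.warc.gz`), which is why one file can hold an entire capture session.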
Types of web-archiving tools
Before you can get started with archiving a page or a site, it’s important to know what kinds of tools exist. While there are dozens of tools out there, most of them can be categorized along a few basic axes: self-hosted or service-hosted; one-page or full-site; and open-source or closed-source.
Self-hosted and service-hosted
Whenever something goes on the Internet, it needs to be hosted and served by a web server of some sort. You can read more about the basics of web servers on our guides site. What this means is that when you make your capture of the site or page you want to archive, you’ll have to store that capture somewhere and use some sort of web service to view it.
Self-stored and self-hosted web archiving tools allow you to create a WARC file that you can store on your computer, which you can then serve from a local web server on your machine (one not exposed to the Internet) or view through a browser-based WARC viewer like Webrecorder’s ReplayWeb.page. The file stays on your computer and is fundamentally controlled and owned by you, but you can use a variety of methods to view the archived page or site.
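As a concrete sketch of the self-hosted workflow, the following Python (standard library only) saves a stand-in capture to disk and then serves it from a local-only web server, which is roughly what "serving a capture to your local server" means in practice. The file name and content here are illustrative.

```python
# A sketch of viewing a self-hosted capture: save the page locally, then
# serve its folder from a local-only web server and fetch it back.
import threading
import urllib.request
from http.server import HTTPServer, SimpleHTTPRequestHandler
from pathlib import Path

# A stand-in for a real capture file.
Path("capture.html").write_text("<html><body>archived page</body></html>")

# Port 0 asks the OS for any free port; the server only listens on localhost.
server = HTTPServer(("127.0.0.1", 0), SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/capture.html"
with urllib.request.urlopen(url) as response:
    status = response.status
print(status)  # 200: the capture is viewable in any browser at that URL
server.shutdown()
```

Opening the printed URL in a browser while the server runs shows the archived page exactly as saved; nothing ever leaves your machine.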
Service-hosted web archiving tools are full-service pieces of web-based software, meaning all you have to do is make an account, pop in URLs for web sites and pages that you’d like to preserve, and wait for the software to process them. Service-hosted archiving tools often have built-in viewers, let you add captures to specific collections, and have graphical user interfaces (visual interfaces you can interact with in your browser) for accessing, adding, viewing, and editing captures.
So which one should you use? There are many pros and cons to both self-managed and service-managed web archives, so the decision you make can be informed by those positives and negatives. Here are just a few:
Pros

Self-hosted
- Storage limits depend only on your computer and external hard drives (far beyond the 5 GB caps of some hosted services)
- Full control over the archived page
- Can experiment with the code of the archived page
- Relatively stable copy of the page DOM
- Accessible on the fly
- Can view on a local server or choose your own viewer

Service-hosted
- Almost no work
- Easily accessible
- Viewer built into the platform
- Links are easily shareable with others
- Can download WARC and WACZ files if needed
- PDF or PNG conversion of snapshots
- Automated provenance summaries

Cons

Self-hosted
- High upfront investment of time and energy
- Might need an external viewer
- Requires technical knowledge (command line, programming languages)

Service-hosted
- Upload limits
- Some features require payment
- Snapshots live on proprietary servers (and companies go out of business)
- Storage limits
- No access if the service/tool server goes down
If you’re just archiving individual webpages that are useful for your research or personal endeavours, then a service-hosted archive on an open-source platform may be your best option. While we can’t guarantee that even big service-hosted options like Internet Archive’s Wayback Machine will remain available forever, there’s a decent likelihood of their long-term stability given current institutional support.
One-page and full-site
The difference between one-page tools and full-site tools is effectively the difference between taking a snapshot and fully crawling a site. One-page web archiving tools will only capture the single page at the URL you give the tool. For example, if I want to archive the projects page of my personal website, I’d tell my tool to capture https://kam535.github.io/projects.html. If I wanted to archive the entire site, I would have to tell my tool to index the site, or find the site’s sitemap, to get a list of every page under the base URL https://kam535.github.io/. The tool would then crawl the site and archive every page on that list.
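As a sketch of the full-site approach, the following Python (standard library only) extracts the list of pages to crawl from a sitemap. The sitemap XML below is an inline stand-in for the sitemap.xml file you would normally download from the site itself.

```python
# A sketch of turning a sitemap into a crawl list for full-site archiving.
import xml.etree.ElementTree as ET

# Sitemap entries live in this XML namespace per the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://kam535.github.io/</loc></url>
  <url><loc>https://kam535.github.io/projects.html</loc></url>
</urlset>"""


def pages_to_archive(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc")]


print(pages_to_archive(sitemap_xml))
# → ['https://kam535.github.io/', 'https://kam535.github.io/projects.html']
```

A full-site tool would feed each of these URLs to its capture step, which is exactly the "index the site, then archive every page on the list" workflow described above.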
The caveat is that full-site archiving will almost always require a crawler. Crawlers take time, computing resources, and energy, and are often difficult to use if you’re unfamiliar with web archiving. For this reason, archiving of entire websites is usually only done by institutions, memory organizations, and other larger groups. Institutions that require constant captures of full websites usually use a full-service software like Archive-It, which is managed by the Internet Archive. However, archiving an entire website is possible using a variety of tools that can group individually archived webpages into collections for easy access to the entire archived site. Depending on the size of the website, you can also simply archive each page individually and then access them all from a single tool or software.
Open source and closed source
Open source refers to software whose source code (the code that makes up and runs the software) is freely available and open to the public to access, modify, distribute, etc. Open source software is (almost always) free and can be remixed into other tools and software, tailored to an individual’s needs and use case, and shared with anyone you want. Open source software is typically published under a specific open-source license, such as MIT or the GPL, that stipulates what can be done with the software (for example, whether it can be modified and then sold commercially).
Closed source refers to software whose source code is owned and copyrighted by an individual, institution, or corporate entity and that cannot be freely shared with the public, modified for personal or commercial use, or distributed openly. Closed source software is a black box; you can’t see the mechanisms that make it work, only what you put into it and what it spits out. Closed source software is usually paid, with tiers depending on how much storage and which features you need.
At the Digital CoLab, we highly prioritize the use of open source software for teaching, learning, and consultations, since we aim to contribute and help others contribute to the greatest possible human good.
Software and tool options
Michael Hucka, a Staff Software Engineer at Google, has compiled a gigantic web archiving software comparison list assessing a broad range of web archiving tools, both open-source and closed-source. Given that this list may be impenetrable to those new to web archiving, we have a few suggestions and recommendations for tools and software depending on your needs.
Note that because of the Digital CoLab’s values, all of the tools listed will be open-source (and thus free). Paying for web archiving software can be useful for institutions, but individuals should have access to high quality tools and software that can help them contribute to the public good without accruing a financial burden. Additionally, open-source tools invite innovation; developments and improvements to open-source software are common and encouraged, meaning that a useful plugin or even a better version of what you’re using may crop up in the future (and who knows, maybe even you could build one!).
Full-service software

Perma.cc
🌲 Conifer
The Wayback Machine
Self-managed
🍨 Scoop
ArchiveBox
With all of the above self-managed software, you can use your local server or a viewer like ReplayWeb.page to look at your WARC files.
Guide for using self-managed software
Step 1. Learn to use the command line
If you’d like to have a self-managed web archive, you will need to learn how to use your computer’s command line effectively. Almost all self-hosted web archiving software uses the command line, in part because this is where you’ll connect to your local server. Programming Historian has approachable lessons in English for using the Bash command line (macOS) or PowerShell (Windows).
Step 2. Decide what software you want to use
Deciding what software you’d like to use can be difficult! Luckily, there are plenty of lists like this web archiving software comparison list and this awesome web archiving list, which will give you basic information about the features of each software or tool. If you don’t understand a term that’s used to describe a piece of software, make sure you look it up! The difference between a software that offers a CLI (command line interface) and a GUI (graphical user interface) could be the difference between having a usable tool and having an archive you don’t enjoy working with. Looking for something specific or not sure what you’re looking for? Feel free to reach out to the Digital CoLab for help.
Step 3. Read the software documentation
After you’ve learned to use the command line and have decided on a software, you’ll want to navigate to the documentation page for whichever software you decide to use. Documentation includes information for every step and feature of a piece of software. Good documentation should give you a list of dependencies, which are pieces of software or tools that you need to install on your computer before you can use the software. The best place to start in any piece of documentation is usually the Quickstart. Quickstart sections will give you a basic list of things to do to just get the software up and running, and then give you a list of things to do to get the software to do basic tasks (like capture a single webpage and save it into a folder).
For example, ArchiveBox’s documentation has a Quickstart section that guides you through three steps for getting started with archiving pages. You’ll want to make sure you’re following the steps that match your operating system (macOS, Linux, Windows, etc.). If you’re not sure what your operating system is, you can find it in your computer’s system settings.
Step 4. Follow the documentation and start archiving!
Once you’ve followed the documentation, you should be all set up to start archiving pages! If you want to further control the methods of archiving, the output (snapshot, WARC file, etc.), and any other settings (such as setting up a browser-based interface to access your links), you should peruse the documentation for your software for more advanced guidance and information.
Step 5. (Optional) Learning proper file management
You may assume that self-hosted or self-captured options are more stable, since they don’t live on someone else’s servers, but widespread gaps in basic file-management literacy mean this isn’t necessarily true. If you delete or misplace your WARC file, your copy is gone.
File management is a skill dependent on understanding how your computer organizes files. One of the best ways to learn how to set up a file management system is to watch other people do it and try to find what works for you.
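As one illustration, here is a small Python sketch of a file-management convention for self-hosted captures: each WARC goes into a folder per site, with the capture date in the file name. The layout and names here are just one possible convention, not a requirement.

```python
# A sketch of one file-management convention for a self-hosted archive:
# sort each WARC into web-archive/<site>/, prefixed with its capture date.
from datetime import date
from pathlib import Path

archive_root = Path("web-archive")


def file_capture(warc_path: Path, site: str) -> Path:
    """Move a WARC into web-archive/<site>/<date>-<original name>."""
    destination_dir = archive_root / site
    destination_dir.mkdir(parents=True, exist_ok=True)
    destination = destination_dir / f"{date.today().isoformat()}-{warc_path.name}"
    return warc_path.rename(destination)


# Example: an empty stand-in WARC file, filed under its site.
Path("homepage.warc.gz").touch()
stored = file_capture(Path("homepage.warc.gz"), "kam535.github.io")
print(stored)  # e.g. web-archive/kam535.github.io/2024-05-01-homepage.warc.gz
```

Whatever convention you choose, the point is that every capture has a predictable home, so a file can always be found again years later.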
Some file management guides for Mac:
- Understanding the File And Folder Structure Of Your Mac by macmostvideo
- MacOS File and Folder Structure Tutorial – The Basics by Craig Neidel
Some file management guides for Windows:
- Windows 11 – Files & Folders for Beginners – Get Organized – Get Control of Your Files & Folders by Your Windows Guru
- Windows 10 – File Explorer Management Tutorial – How to Organize Files and Folders – Folder Manager by Professor Adam Morgan
- Note that Windows 11 has a Files app that is intended to optimize file management. See The File Explorer Replacement – The Files App for Windows! by Productive Tech.
Some file management guides for Linux:
- File Management in Linux by Peter Kay
You can find more guides on YouTube; just make sure you’re searching for a guide that addresses your operating system and its version, since operating systems are updated fairly regularly.
Step 6. (Optional) View your archive using an external viewer
Depending on what software you’re using, you may have to use an external viewer to look at your archived pages, especially if you have a folder full of WARC files. If your software produces HTML files, you should be able to open those files in any browser on your computer just by double clicking or right clicking the file and clicking “Open”.
The main browser-based viewer for WARC and WACZ files is ReplayWeb.page, which lets you upload WARC and WACZ files and view their contents in your browser. You can also install a replay tool like pywb, SolrWayback, or OpenWayback and use the command line to serve your archives to a browser.