Frequently Asked Questions

What is Browsertrix Cloud?

Browsertrix Cloud is a new open-source, user-friendly, high-fidelity crawling system from Webrecorder, featuring an intuitive interface for creating and managing web crawls.

The system will be offered as a service by Webrecorder and is also fully open source (available at https://github.com/webrecorder/browsertrix-cloud).

After many years of building high-fidelity browser-based archiving tools, we built this platform so anyone can perform high-fidelity archiving at scale.

Who is it for?

Anyone who would like to create high-fidelity web archives in an automated way!

We hope to support archivists and curators, as well as journalists, artists, and independent researchers, with their automated web archiving needs.

Does Browsertrix Cloud require being run in the cloud? How will it be offered?

While Browsertrix Cloud is designed to be run in a cloud environment, we hope to support local / non-cloud based deployments as well.

At this time, there are several plans for running the system:

  • Webrecorder will offer a cloud-based Kubernetes deployment in the future. Sign up for our mailing list to receive more information.

  • Several IIPC member institutions will run their own instances, either in the cloud or on existing infrastructure, as part of our collaboration with the IIPC.

  • As an open source project, anyone can run Browsertrix Cloud using Docker or Kubernetes on their own!
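For local experimentation, a self-hosted deployment might look something like the following. This is a sketch only: the exact commands and files (such as a docker-compose.yml) are assumptions, and the authoritative steps live in the repository's README.

```shell
# Hypothetical local deployment sketch; the actual setup files and
# commands may differ -- consult the Browsertrix Cloud README.
git clone https://github.com/webrecorder/browsertrix-cloud.git
cd browsertrix-cloud

# Bring up the services locally with Docker Compose
docker-compose up -d
```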

When will it be available?

We hope to have an alpha-ready version by the middle of 2022, and a hosted version of Browsertrix Cloud available for invite-only testing later in 2022.

Per our collaboration with IIPC, other institutions will begin testing the system starting in the middle of 2022 as well.

How does Browsertrix Cloud compare to Browsertrix Crawler?

Browsertrix Crawler is the core crawling system that is at the heart of Browsertrix Cloud. The Browsertrix Cloud service automates and schedules multiple instances of Browsertrix Crawler.

If you’re a developer looking to run a single crawl via the command-line, you can try Browsertrix Crawler directly.

If you’re not a developer, or want to run many crawls on a schedule and view the results via a simple web interface, Browsertrix Cloud may be for you!
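As an illustration of the command-line route, a single crawl with Browsertrix Crawler can be run via Docker. The exact flags may vary by version, so treat this as a sketch and check the Browsertrix Crawler README for current options:

```shell
# Crawl a single site, writing output (including a WACZ file) into
# ./crawls on the host. Flag names may differ across versions.
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler \
  crawl --url https://example.com/ --generateWACZ --collection example
```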

Where will you store the data?

One of our goals with Browsertrix Cloud is to allow crawl outputs (WACZ or WARC) to be stored in any storage of the users' choosing.

For our service, we may offer storage through an S3-compatible storage bucket, and also allow users to bring their own storage. This will keep the data accessible via our existing tools, like ReplayWeb.page and pywb, without necessarily relying on Webrecorder-hosted infrastructure.

Other institutions may simply configure WARC or WACZ output to be ingested into their existing institutional repositories.

Our hope is to support whatever storage options make sense for our users and community, from cloud storage like Google Drive to decentralized storage options like IPFS.

How is it different from other archiving tools?

Browsertrix Cloud combines Webrecorder’s high-fidelity archiving approaches with a focus on automated crawling and ease of use. We hope this tool will make it easier to create the best quality web archives at slightly greater scale than with some of our other tools.

See Features for a more detailed list of some of the planned features for Browsertrix Cloud.