Browsertrix Cloud is an open-source, high-fidelity, browser-based crawling system. All crawling is done using real browsers and custom behaviors designed to produce the most accurate web archives possible!
Browsertrix Cloud is an ambitious project, and we hope to support the following key features:
All archiving activity happens within a shared archive workspace.
Users will be able to create multiple archives to keep projects separate. For example, a user can maintain separate personal and institutional archiving projects, each with its own user permissions, storage options, and crawl configurations.
Archive admins can invite others and delegate different levels of access to their archive, such as admin, crawling-only, and viewing-only permissions.
The system will support scheduled crawling on a predefined time schedule.
Users will be able to create crawl configurations, with seeds, scoping rules and time schedules.
Crawl configurations will be editable, with a tracked revision history, to ensure a full record of the crawling activity.
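A crawl configuration with seeds, scoping rules, and a time schedule might look roughly like the sketch below. The field names (`seeds`, `scopeType`, `schedule`, and so on) are illustrative assumptions for this sketch, not the actual Browsertrix Cloud schema:

```python
# Illustrative sketch of a crawl configuration; the field names here are
# assumptions for demonstration, not the actual Browsertrix Cloud schema.
crawl_config = {
    "name": "Example nightly crawl",
    "seeds": [
        {"url": "https://example.com/", "scopeType": "prefix"},
    ],
    "exclude": [r".*\?action=logout.*"],   # scoping rule: skip matching URLs
    "schedule": "30 2 * * *",              # cron syntax: every night at 02:30
}

def validate_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks OK."""
    problems = []
    if not config.get("seeds"):
        problems.append("at least one seed URL is required")
    for seed in config.get("seeds", []):
        if not seed.get("url", "").startswith(("http://", "https://")):
            problems.append(f"seed is not an http(s) URL: {seed!r}")
    # A cron schedule has five whitespace-separated fields.
    if "schedule" in config and len(config["schedule"].split()) != 5:
        problems.append("schedule is not a five-field cron expression")
    return problems
```

Since configurations are editable with a tracked revision history, each saved revision of a structure like this would be retained rather than overwritten.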
Users will also be able to start, stop, and monitor crawls in real time while they are running.
Real-Time Crawl Visibility
While a crawl is running, users will be able to watch one or more of the actively running browsers in real time, allowing instant feedback on whether the crawl is running as expected (or not).
The crawl watch page may allow additional options, such as stopping the crawl, or skipping pages that appear problematic.
Users will also be able to access other information about the state of the crawl, such as the current crawl queue and crawl logs to make informed decisions about the crawl and amend it as necessary.
For example, if a crawl appears to be fetching pages that are not needed, a user may be able to amend the crawl queue dynamically instead of stopping and starting a new crawl. If a user does need to cancel the crawl, they will be able to make an informed decision based on the improved crawl monitoring.
Crawling with Browser Profiles
Users will have the ability to create custom browser profiles, allowing them to log in to sites, accept cookies, and otherwise configure browser settings before crawling starts.
Browser profiles can allow crawling of private or paywalled content without the login credentials themselves being captured in the crawl. Crawling with browser profiles will allow for accurate capture of content exactly as it appears to users after they’ve logged in.
Unified Automated and Manual Browser-Based Capture
For some complex sites, it may be necessary to augment, or patch, automatically archived content with user-driven, manual archiving.
Browsertrix Cloud will support a way to upload externally created WACZ files which can be used to augment content from scheduled crawls.
The ArchiveWeb.page extension may eventually support uploading of browser-based captures created locally to be combined with automatically crawled content.
Standardized Web Archive Formats
All crawled content will be stored in static storage and will be accessible via client-side tools such as ReplayWeb.page.
Users will be able to configure their own storage options for the crawls, allowing them to own the data as soon as the crawl finishes, or to download or move the data to their own storage.
The output of the crawls will be standard WARC files or the new portable WACZ format. A WACZ file contains all the data and metadata for a crawl, including the raw WARC data, page indexes, full-text search data, and any other metadata defined by the WACZ format.
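At its core, a WACZ file is a ZIP container with a defined layout. The sketch below builds a minimal WACZ-like file to illustrate that layout; it is a simplified assumption-laden example, not a spec-compliant writer (real WACZ files also carry CDXJ indexes, resource checksums, and more):

```python
import json
import zipfile

def write_minimal_wacz(path: str, warc_bytes: bytes) -> None:
    """Write a minimal WACZ-like ZIP to illustrate the container layout.

    This is a simplified sketch, not a spec-compliant WACZ writer: real
    WACZ files also include CDXJ indexes, checksums, and other metadata.
    """
    with zipfile.ZipFile(path, "w") as z:
        # Raw capture data lives under archive/ as standard WARC files.
        z.writestr("archive/data.warc.gz", warc_bytes)
        # One JSON line per captured page, used for page lists and search.
        z.writestr("pages/pages.jsonl",
                   json.dumps({"url": "https://example.com/",
                               "title": "Example"}) + "\n")
        # datapackage.json describes the contents of the container.
        z.writestr("datapackage.json",
                   json.dumps({"profile": "data-package",
                               "resources": [
                                   {"path": "archive/data.warc.gz"},
                                   {"path": "pages/pages.jsonl"},
                               ]}))
```

Because the container is plain ZIP plus JSON, client-side tools like ReplayWeb.page can read it directly from static storage.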
Browsertrix Cloud will support storing and loading web archives to and from any S3-compatible storage, which may include cloud-based storage or a local MinIO instance.
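An S3-compatible storage configuration might look something like the following fragment. The key names here are illustrative assumptions, not the actual Browsertrix Cloud configuration schema; either a cloud S3 endpoint or a local MinIO endpoint would be configured the same way:

```yaml
# Illustrative storage configuration -- key names are assumptions,
# not the actual Browsertrix Cloud schema.
storage:
  type: s3
  endpoint_url: "http://minio.local:9000"   # or a cloud provider's S3 endpoint
  bucket: "crawl-data"
  access_key: "${STORAGE_ACCESS_KEY}"       # injected from a secret
  secret_key: "${STORAGE_SECRET_KEY}"
```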
The system will also support webhooks for uploading WACZ or WARC files to additional systems as needed.
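A webhook notification for a finished crawl might carry a payload like the one sketched here. The event name and field names are assumptions for illustration, not a defined Browsertrix Cloud webhook schema:

```python
# Illustrative webhook payload builder -- the event and field names are
# assumptions, not a defined Browsertrix Cloud webhook schema.
def build_upload_webhook(crawl_id: str, wacz_url: str, size_bytes: int) -> dict:
    """Build the JSON body a webhook POST might carry when a crawl finishes."""
    return {
        "event": "crawl.finished",
        "crawlId": crawl_id,
        "resources": [{"url": wacz_url, "size": size_bytes}],
    }

# A receiving system would be POSTed this body, e.g. with
# requests.post(webhook_url, json=build_upload_webhook(...)).
```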
Signed and Authenticatable Web Archives
All WACZ files produced by Browsertrix Cloud will be signed to allow verification that the archives were created by the operator of Browsertrix Cloud, likely via a third-party witness server or other possible approaches, as developed in the WACZ signing specification.
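One building block of authenticating an archive is verifying content hashes; the signing specification goes further, covering signatures, timestamps, and witness servers. A minimal sketch of the hash-checking step, assuming the expected SHA-256 digest is obtained from a trusted source:

```python
import hashlib

def sha256_matches(path: str, expected_hex: str) -> bool:
    """Check a file's SHA-256 digest against an expected value.

    Hash checking is only one building block: the WACZ signing spec
    additionally covers signatures, timestamps, and witness servers.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large WACZ files don't need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex
```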
Improved QA Processes
Browsertrix Cloud will feature a simplified, automated QA process to help determine which pages did not archive properly. The process may attempt to recrawl missing content or warn the user about what may have gone wrong.
Docker and Kubernetes Support
The system will be geared towards cloud-based deployment with Kubernetes, as well as single-machine deployment via Docker Compose.
The Kubernetes deployment is recommended for running in a cloud environment and allows Browsertrix Cloud to be deployed on any cloud provider that supports Kubernetes.
For single-machine deployments, Browsertrix Cloud will also run with Docker Compose.
Docker Swarm may be added as well in the future.
Standard API and Customizable UI
All crawling operations will be accessible via a well-defined API, documented with an OpenAPI specification.
The UI will interface with the crawling system entirely via the API, allowing for creation of alternative UIs.
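Because all operations go through the API, an alternative client only needs to know the endpoints. The sketch below builds request descriptions for a few hypothetical endpoints; the paths are illustrative assumptions, not the actual Browsertrix Cloud OpenAPI routes, and a real client would send these requests with a library such as requests or httpx:

```python
# Hypothetical API client sketch: the endpoint paths below are
# illustrative assumptions, not the actual Browsertrix Cloud routes.
class CrawlClient:
    def __init__(self, base_url: str, token: str):
        self.base_url = base_url.rstrip("/")
        self.headers = {"Authorization": f"Bearer {token}"}

    def _request(self, method: str, path: str, payload=None) -> dict:
        """Describe a request; a real client would send it via requests/httpx."""
        return {"method": method,
                "url": f"{self.base_url}{path}",
                "headers": self.headers,
                "json": payload}

    def start_crawl(self, archive_id: str, config_id: str) -> dict:
        return self._request(
            "POST", f"/archives/{archive_id}/crawlconfigs/{config_id}/run")

    def list_crawls(self, archive_id: str) -> dict:
        return self._request("GET", f"/archives/{archive_id}/crawls")
```

An alternative UI, a CLI, or an automation script could all be built on the same small surface.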
Advanced Customized Crawling
The system will support all the crawling options available in the Browsertrix Crawler command-line tool, including the ability to add custom crawl drivers and use automated in-page behaviors via Browsertrix Behaviors.
Advanced users will be able to create their own custom crawl scripts directly, and some advanced features may be available in the UI as well.