ikreymer/pywb-webrecorder

Language: Python

git: https://github.com/ikreymer/pywb-webrecorder

pywb Wayback Web Recorder (Archiver)

Note: this is an older prototype. We suggest taking a look at https://github.com/webrecorder/webrecorder, the Docker deployment for https://webrecorder.io/, which includes improved features and is more actively maintained than this prototype.

This project provides a bare-bones example of how to create a simple web recording and replay system.

This project demonstrates how to create a simple web recorder tool by combining the pywb (python wayback) web archive replay tools with warcprox, an HTTP/S recording proxy that writes WARC files.

For additional reference, please consult the pywb and warcprox docs.

As a further reference, https://webrecorder.io is a hosted service built using some of the same tools.

Basic Usage

To start, simply install with pip install -r requirements.txt under a Python 2.7.x environment.

Then, run python pywb-webrecorder.py

The pywb-webrecorder.py script starts one instance each of pywb, warcprox, and a timed CDX index updater.
By default, pywb runs on port 8080 and warcprox on port 9001.

warcprox stores the WARC currently being written (one at a time) in the ./recording/ directory. Once a WARC is completed (or on shutdown), it is moved to the ./done/ directory.

(All settings can be adjusted in config.yaml)

The pywb web app running on port 8080 will have the following endpoints available:

  • /live/<url> -- Fetch a live version of <url> (same as live-rewrite-server in pywb).

  • /record/<url> -- Fetch a live version of <url>, but through the warcprox recording proxy, recording all traffic.

  • /replay/<url> -- Replay an archived version of <url> if found in the ./recording or ./done dirs. Display a 404 if not archived. This is the standard pywb Wayback behavior.

  • /replay-record/<url> -- Replay an archived version of <url> if found in the ./recording or ./done dirs. If not available, internally call the /record/ handler to record a new copy of <url>.
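As a small illustration of the endpoint layout listed above, the following sketch builds the request URL for each endpoint. It assumes pywb is reachable on localhost at the default port 8080; the helper name endpoint_url is hypothetical and not part of pywb or this project.

```python
# Sketch only: endpoint_url is a hypothetical helper, not part of pywb.
# Assumes pywb listens on localhost:8080, the default named in this README.
PYWB_BASE = "http://localhost:8080"
ENDPOINTS = ("live", "record", "replay", "replay-record")


def endpoint_url(endpoint, target_url, base=PYWB_BASE):
    """Return the pywb URL that serves target_url through the given endpoint."""
    if endpoint not in ENDPOINTS:
        raise ValueError("unknown endpoint: %r" % (endpoint,))
    # The target URL is appended as-is after the endpoint prefix.
    return "%s/%s/%s" % (base.rstrip("/"), endpoint, target_url)


# One URL per endpoint for the same target page.
urls = [endpoint_url(name, "http://example.com/") for name in ENDPOINTS]
```

Requesting the /replay-record/ URL twice for the same page should, after the CDX index has been updated, serve the second request from the archive rather than from the live web.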

Archive On-Demand

The replay-record endpoint demonstrates a way to auto-record any resources missing from an existing archive.

The first time a resource is requested, it will be recorded. On each subsequent request (after the cdx has been updated), it will be replayed from an existing WARC.

The banner will read either "live fetch" or "archived page" to indicate whether the page was fetched live or served from the archive.

How it Works

pywb features a 'live rewrite' replay mode, which fetches live web content and displays it the same way as if it had been read from an archive file. (See the live-rewrite-server tool.)

With pywb >= 0.5.0, it is now possible to specify a proxy server for the live fetching. This allows the live fetching to go through warcprox,
which proxies HTTP/S traffic and records it to WARC files.

The /record/ endpoint is configured to fetch live content via the proxy on port 9001, while the /live/ endpoint just fetches live content without recording.
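The difference between the two endpoints comes down to whether the live fetch goes through a proxy. The sketch below shows that idea with the Python 3 standard library only; it is not how pywb implements it internally. It assumes warcprox is listening on localhost:9001, and make_opener is a hypothetical name.

```python
import urllib.request

# Assumption: warcprox listens on localhost:9001, the default named above.
WARCPROX_PROXY = "http://localhost:9001"


def make_opener(record):
    """Return a URL opener; when record is True, route traffic via warcprox."""
    if record:
        # Both plain HTTP and HTTPS requests are sent to the recording proxy.
        proxies = {"http": WARCPROX_PROXY, "https": WARCPROX_PROXY}
    else:
        # An empty mapping disables proxying: a plain live fetch.
        proxies = {}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


record_opener = make_opener(record=True)   # analogous to /record/
live_opener = make_opener(record=False)    # analogous to /live/
```

Calling record_opener.open(url) would only succeed while a proxy is actually running on that port, so the sketch stops at constructing the openers.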

In some cases, it is useful to record only when content is missing from an archive. pywb 0.5.0 includes a new fallback mechanism which allows
pywb to call a different handler instead of showing a 404.

The /replay-record/ endpoint uses this feature to replay archived content from WARCs in either ./recording or ./done. However, if a resource is not found, the request is delegated to /record/ and a new recording is made.
(The /replay/ endpoint just provides regular replay without auto-recording.)
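The fallback behavior can be pictured with a toy model. The sketch below is not pywb code: the archive is a plain dict standing in for the WARC files and CDX index, fetch is a caller-supplied function standing in for the live fetch through warcprox, and every name is hypothetical. It also makes a new recording visible immediately, whereas the real setup only replays it after the next CDX update.

```python
class NotFound(Exception):
    """Raised when the archive has no copy of the requested URL."""


def replay(url, archive):
    """Regular replay: return the archived copy or raise NotFound (the 404 case)."""
    if url not in archive:
        raise NotFound(url)
    return "archived page", archive[url]


def record(url, archive, fetch):
    """Fetch the live page and store it (stands in for warcprox writing a WARC)."""
    body = fetch(url)
    archive[url] = body
    return "live fetch", body


def replay_record(url, archive, fetch):
    """Try replay first; on a miss, delegate to the record handler."""
    try:
        return replay(url, archive)
    except NotFound:
        return record(url, archive, fetch)


archive = {}
first = replay_record("http://example.com/", archive, lambda url: "<html>hello</html>")
second = replay_record("http://example.com/", archive, lambda url: "<html>changed</html>")
```

Here the first request is recorded and the second is answered from the stored copy, which mirrors the banner text switching from live fetch to archived page.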

Index Updating

All the above functionality is provided by pywb and warcprox side-by-side.

The last missing piece is automatically updating the CDX index for pywb.

pywb does not provide a way to add CDX indexes on the fly. However, since the CDX is read on each request,
it is possible (and more efficient) to simply update an existing CDX index in place while pywb is running.

pywb starts with two cdx files ./recording/index.cdx and ./done/index.cdx, which may be updated as new content is recorded.

The pywb-webrecorder.py bootstrap script launches pywb and warcprox as subprocesses, then starts a periodic CDX updater that runs
every few seconds (configured by the update_freq property in config.yaml).

Of course, there are many ways to do this. For simplicity, the following approach is taken:

The periodic updater finds the latest WARC held open by warcprox (a file ending in .warc.gz.open) and checks whether it has been updated.
If it has, the updater calls the pywb cdx-indexer on the open file to create a new sorted ./recording/index.cdx.

When warcprox is finished with a file, the .open extension is dropped. The updater also checks for any .warc.gz files, moves
them to the ./done directory, and regenerates ./done/index.cdx. This happens on startup, on shutdown, or whenever the current open file is no longer accessible.
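One pass of such an updater might look like the sketch below. It is a simplified illustration, not the project's actual code: the function names are hypothetical, the indexing step is passed in as a callback (index_warcs) instead of invoking cdx-indexer, and finished WARCs are moved on every pass rather than only in the situations listed above.

```python
import glob
import os
import shutil


def find_open_warc(recording_dir):
    """Return the most recently modified *.warc.gz.open file, or None."""
    candidates = glob.glob(os.path.join(recording_dir, "*.warc.gz.open"))
    return max(candidates, key=os.path.getmtime) if candidates else None


def update_once(recording_dir, done_dir, state, index_warcs):
    """Run one updater pass and return the number of WARCs moved to done_dir.

    state is a dict remembering the last seen (path, mtime) of the open WARC.
    index_warcs(warc_paths, cdx_path) is a hypothetical callback that writes
    a sorted CDX index for the given WARC files.
    """
    open_warc = find_open_warc(recording_dir)
    if open_warc is not None:
        seen = (open_warc, os.path.getmtime(open_warc))
        if state.get("last") != seen:
            # The open WARC changed since the last pass: rebuild its index.
            index_warcs([open_warc], os.path.join(recording_dir, "index.cdx"))
            state["last"] = seen

    # Files without the .open suffix are finished: move them and reindex.
    finished = glob.glob(os.path.join(recording_dir, "*.warc.gz"))
    for path in finished:
        shutil.move(path, os.path.join(done_dir, os.path.basename(path)))
    if finished:
        all_done = sorted(glob.glob(os.path.join(done_dir, "*.warc.gz")))
        index_warcs(all_done, os.path.join(done_dir, "index.cdx"))
    return len(finished)
```

A driver loop would call update_once every update_freq seconds. Passing the indexer in as a callback keeps the sketch independent of how the index is actually produced.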

On graceful shutdown (with SIGTERM), pywb-webrecorder.py also shuts down pywb and warcprox.
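A parent process can forward a shutdown to its children roughly as sketched below. This uses only the Python 3 standard library and is an illustration under stated assumptions, not the script's actual code: the function names and the 10-second timeout are hypothetical.

```python
import signal
import subprocess
import sys


def stop_children(children, timeout=10):
    """Terminate each child process and wait for it to exit."""
    for proc in children:
        if proc.poll() is None:   # child is still running
            proc.terminate()      # sends SIGTERM on POSIX systems
    for proc in children:
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Child ignored the request: force it to stop.
            proc.kill()
            proc.wait()


def install_sigterm_handler(children):
    """Arrange for SIGTERM to stop the children and then exit the parent."""
    def handler(signum, frame):
        stop_children(children)
        sys.exit(0)
    signal.signal(signal.SIGTERM, handler)
```

With children being the subprocess.Popen objects for pywb and warcprox, sending SIGTERM to the parent would then stop both before the parent exits.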

After a graceful shutdown, the ./done/ dir should contain all the finished WARCs and the ./recording/ dir should contain no WARCs.

Other Settings

The config.yaml file contains the command line settings for starting pywb and warcprox. Please refer to the warcprox README for its command line options, such as the max WARC size or max idle time before rotating WARCs, WARC filenames, etc.

The max WARC size and max idle time options may be especially useful for adjusting how long a WARC file remains open and when it is moved
to the ./done/ directory.

For instance, to have a WARC file considered done when no new content has been recorded for 60 seconds OR when its size exceeds 1 KB (1000 bytes), the
recorder_exec setting in the config can be modified as follows:
recorder_exec: 'warcprox --rollover-idle-time 60 -s 1000 ...
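A launcher could turn such a command string into an argument list before starting the subprocess. The sketch below shows one possible approach with shlex; whether the script does it this way is an assumption. It uses only the options visible in the example above, since the rest of the command line is elided and is not reconstructed here. The helper names are hypothetical.

```python
import shlex

# Only the visible part of the example; the elided remainder is left out.
recorder_exec = "warcprox --rollover-idle-time 60 -s 1000"


def split_command(command):
    """Split a command string into an argv list usable with subprocess.Popen."""
    return shlex.split(command)


def option_value(argv, flag):
    """Return the value following flag in argv, or None if flag is absent."""
    if flag in argv:
        index = argv.index(flag)
        if index + 1 < len(argv):
            return argv[index + 1]
    return None


argv = split_command(recorder_exec)
idle_seconds = option_value(argv, "--rollover-idle-time")
max_size = option_value(argv, "-s")
```

Splitting with shlex rather than str.split keeps quoted arguments intact, which matters if a setting contains spaces.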

uWSGI is used to run pywb but other WSGI containers can of course be used instead.

The config also demonstrates the use of a custom home page and custom error pages with pywb:

  • index.html is a simple custom home page for pywb-webrecorder.

  • error.html modifies the standard pywb error page to also include an explicit /record/ link for 'not found' errors (this only makes sense when using the /replay/ endpoint).

A note on Dedup and Revisits

warcprox uses its own dedup db, written to dedup.db by default. The dedup scheme is decoupled from the actual WARC files being present/available. Thus, if removing WARCs from ./done, be sure to also delete dedup.db to avoid revisit records that point to WARCs that no longer
exist (unless that is the intent).
By default, dedup.db is persisted when pywb-webrecorder is shut down.
When starting pywb-webrecorder, dedup.db can be automatically deleted and created anew via the -f flag: python pywb-webrecorder.py -f
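The effect of the -f flag can be sketched as follows. This is an illustration rather than the script's actual code: it assumes dedup.db is a single file in the working directory, and the function names and the fresh attribute are hypothetical.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the command line; -f requests a fresh dedup db."""
    parser = argparse.ArgumentParser(description="sketch of the -f option")
    parser.add_argument("-f", dest="fresh", action="store_true",
                        help="delete the dedup db before starting")
    return parser.parse_args(argv)


def reset_dedup_db(path="dedup.db"):
    """Remove the dedup db file if present; return True if it was removed."""
    if os.path.isfile(path):
        os.remove(path)
        return True
    return False


def prepare(argv=None, dedup_path="dedup.db"):
    """Apply the -f option; return True if the dedup db was removed."""
    args = parse_args(argv)
    return reset_dedup_db(dedup_path) if args.fresh else False
```

Without -f the file is left untouched, matching the default of persisting dedup.db across restarts.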

Contributions

This project is intended as a demo of different web recording scenarios that could be used by combining pywb and warcprox. The project is under the MIT license and can be used freely (although pywb and warcprox may have different licenses).

Changes and adaptations for different use cases are encouraged. Feedback and pull requests are welcome!