ggabelmann/jcrawl

Language: Java

git: https://github.com/ggabelmann/jcrawl

At a high level, jcrawl is a web crawler, written in Java, which uses the Chain of Responsibility design pattern to co…

jcrawl

At a high level, jcrawl is a web crawler written in Java that uses the Chain of Responsibility design pattern to control which URLs it visits and prints. The chain is composed of Handlers, which decide one at a time whether to handle a URL. If a Handler accepts the URL, it can take some action, return a new list of URLs, or both. If it declines, the next Handler in the chain gets a chance to handle the URL.
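The handler chain described above can be sketched roughly as follows. This is an illustration only, not jcrawl's actual code: the names `ChainSketch`, `Handler`, `FileHandler`, `StubPageHandler`, `dispatch`, and `crawl` are all hypothetical, the page handler reads from a fixed map instead of fetching over HTTP, and the visited set is an assumption added so the loop terminates. It assumes Java 9 or later for `List.of` and `Map.of`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

class ChainSketch {

    /** Hypothetical interface: an empty Optional means "not mine, ask the next Handler". */
    interface Handler {
        Optional<List<String>> handle(String url);
    }

    /** Handles file URLs by recording them; it yields no new URLs to visit. */
    static final class FileHandler implements Handler {
        final List<String> found = new ArrayList<>();

        @Override
        public Optional<List<String>> handle(String url) {
            String lower = url.toLowerCase();
            if (lower.endsWith(".mp3") || lower.endsWith(".pdf") || lower.endsWith(".zip")) {
                found.add(url);
                return Optional.of(List.of());
            }
            return Optional.empty();
        }
    }

    /** Stands in for an HTTP fetch: looks the page up in a fixed map of outgoing links. */
    static final class StubPageHandler implements Handler {
        private final Map<String, List<String>> links;

        StubPageHandler(Map<String, List<String>> links) {
            this.links = links;
        }

        @Override
        public Optional<List<String>> handle(String url) {
            return Optional.ofNullable(links.get(url));
        }
    }

    /** Offers the URL to each Handler in order; the first one that accepts it wins. */
    static List<String> dispatch(List<Handler> chain, String url) {
        for (Handler handler : chain) {
            Optional<List<String>> result = handler.handle(url);
            if (result.isPresent()) {
                return result.get();
            }
        }
        return List.of(); // no Handler accepted the URL
    }

    /** Breadth-first crawl from a seed, visiting each URL at most once (an assumption). */
    static void crawl(List<Handler> chain, String seed) {
        Deque<String> queue = new ArrayDeque<>(List.of(seed));
        Set<String> seen = new HashSet<>(queue);
        while (!queue.isEmpty()) {
            for (String next : dispatch(chain, queue.poll())) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
    }

    public static void main(String[] args) {
        FileHandler files = new FileHandler();
        Handler pages = new StubPageHandler(Map.of(
                "http://example.com/",
                List.of("http://example.com/a.mp3", "http://example.com/b.html")));
        crawl(List.of(files, pages), "http://example.com/");
        files.found.forEach(System.out::println);
    }
}
```

The order of the chain matters in this sketch: the file handler comes first, so a file URL is recorded and never passed on to the page handler. A URL that no Handler accepts, such as `b.html` here, is simply dropped.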

Usage

Currently, there is only one crawler: CrawlBootieMashup.java. It can be run directly from the command line. It does not download anything; it only prints the URLs of MP3, PDF, and ZIP files.

HTTP requests are throttled: the crawler sleeps for one second before sending each request.
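As a rough sketch of this kind of throttling (not the code jcrawl actually uses), a request function can be wrapped so that every call sleeps first. `ThrottleSketch` and `throttled` are hypothetical names, and the lambda stands in for a real HTTP request.

```java
import java.time.Duration;
import java.util.function.Function;

class ThrottleSketch {

    /** Wraps a request function so that each call first sleeps for the given delay. */
    static <T, R> Function<T, R> throttled(Duration delay, Function<T, R> request) {
        return input -> {
            try {
                Thread.sleep(delay.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                throw new IllegalStateException("interrupted while throttling", e);
            }
            return request.apply(input);
        };
    }

    public static void main(String[] args) {
        // The lambda is a stand-in for a real HTTP request.
        Function<String, String> fetch =
                throttled(Duration.ofSeconds(1), url -> "fetched " + url);
        System.out.println(fetch.apply("http://example.com/"));
    }
}
```

Sleeping before each request is simple, but it adds the full delay on top of the time the request itself takes, so the effective rate is somewhat below one request per second.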

Future

Add more crawlers.

Add tests.