Bixo is an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop. By building a customized Cascading pipe assembly, you can quickly create specialized web mining applications that are optimized for a particular use case.
Bixo consists of a number of Cascading Operations and Subassemblies, which can be combined to form a data processing workflow that (typically) starts with a set of URLs to be fetched, and ends with some results extracted from parsed HTML pages.
The Fetch Subassembly consists of several phases, all of which are needed for efficient, polite fetching. The exact details may vary, but the general sequence is:
- Group URLs by hostname.
- Resolve each hostname to an IP address, fetch and parse the hostname’s robots.txt file, and apply the Robots Exclusion Protocol rules to filter URLs.
- Group filtered URLs by IP address, and (optionally) restrict the number of URLs per IP address.
- Create small batches of URLs, typically no more than 10, that share the same IP address. Assign increasing target fetch times, based on the number of URLs and the crawl delay (which might be a default value, or specified in the robots.txt file).
- Group batches of URLs by a partitioning key with N unique values for N reducers, so that URL batches with the same IP address go to the same reducer.
- Start a multi-threaded reduce operation to fetch batches of URLs, using keep-alive on the HTTP connection.
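The grouping, filtering, batching, and partitioning steps above can be sketched in a few lines of plain Python. This is only an illustration of the general sequence, not Bixo's actual implementation or API: the function names, the CRC32-based partitioning key, and the rule that each URL in a batch pushes the next batch's target time back by one crawl delay are all assumptions made for the example. DNS resolution and the fetch itself are left out because they need network access.

```python
import zlib
from collections import defaultdict
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

MAX_BATCH_SIZE = 10  # "typically no more than 10" URLs per batch


def group_by_hostname(urls):
    """Bucket URLs by hostname (hypothetical helper, not a Bixo class)."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).hostname].append(url)
    return dict(groups)


def filter_by_robots(urls, robots_txt, user_agent):
    """Keep only the URLs that the given robots.txt text allows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def make_batches(urls, crawl_delay, start_time=0.0, max_batch_size=MAX_BATCH_SIZE):
    """Split one IP address's URLs into batches with increasing target fetch times.

    Assumption: each URL in a batch delays the next batch by one crawl_delay.
    Returns a list of (target_fetch_time, batch) tuples.
    """
    batches = []
    target_time = start_time
    for i in range(0, len(urls), max_batch_size):
        batch = urls[i:i + max_batch_size]
        batches.append((target_time, batch))
        target_time += len(batch) * crawl_delay
    return batches


def partition_key(ip_address, num_reducers):
    """Map an IP address to one of num_reducers keys.

    Assumption: a stable CRC32 hash stands in for whatever key is really used.
    The same IP always yields the same key, so its batches reach one reducer.
    """
    return zlib.crc32(ip_address.encode("ascii")) % num_reducers
```

With 25 allowed URLs on one IP address and a crawl delay of 2 seconds, `make_batches` yields batches of 10, 10, and 5 URLs with target fetch times of 0, 20, and 40 seconds. Sending all batches for one IP address to the same reducer is what lets a single fetcher enforce the crawl delay and reuse a keep-alive connection.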
Note: If you are looking for the Bi(x)o command line tool project, the home page is here.