Home

Bixo is an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop. By building a customized Cascading pipe assembly, you can quickly create specialized web mining applications that are optimized for a particular use case.

Take a look at the Getting Started page, and also the list of resources (mailing list, bug database, source code, etc.).

Bixo is an open source project released under the Apache License, Version 2.0. Note that Bixo relies on Cascading, which is released under the GNU General Public License, version 3.

Bixo Architecture

Bixo consists of a number of Cascading Operations and Subassemblies, which can be combined to form a data processing workflow that (typically) starts with a set of URLs to be fetched, and ends with some results extracted from parsed HTML pages.

The Fetch Subassembly is the component where the heavy lifting is done. URLs are passed to it via UrlDatum tuple wrappers, and two tail pipes emit StatusDatums and FetchedDatums.

The Parse Subassembly is commonly used to process the fetched content. It uses Tika to handle the details of extracting text from various formats, most typically HTML pages.

Fetch Process

The Fetch Subassembly consists of several phases, all of which are needed for efficient, polite fetching. The exact details may vary, but the general sequence is:

  1. Group URLs by hostname.
  2. Resolve each hostname to an IP address, fetch and parse the hostname's robots.txt file, and apply the Robots Exclusion Protocol rules to filter URLs.
  3. Group filtered URLs by IP address, and (optionally) restrict the number of URLs per IP address.
  4. Create small batches of URLs, typically no more than 10, that share the same IP address. Assign increasing target fetch times, based on the number of URLs and the crawl delay (which might be a default value, or specified in the robots.txt file).
  5. Group batches of URLs by a partitioning key with N unique values for N reducers, so that URL batches with the same IP address go to the same reducer.
  6. Start a multi-threaded reduce operation to fetch batches of URLs, using keep-alive on the HTTP connection.
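
The grouping, batching, and partitioning steps above can be sketched in plain Java. This is only an illustration of the idea, not Bixo's code: the class FetchPlanSketch, its methods, and the Batch type are hypothetical names, hostnames stand in for resolved IP addresses so that the example needs no network access, and the fetch-time formula (each batch starts after one crawl delay per URL in the earlier batches) is an assumption about how "increasing target fetch times" might be computed.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; none of these names come from Bixo or Cascading.
class FetchPlanSketch {

    /** A small batch of URLs sharing one group key, plus its earliest fetch time. */
    static final class Batch {
        final String groupKey;
        final List<String> urls;
        final long targetFetchTimeMs;

        Batch(String groupKey, List<String> urls, long targetFetchTimeMs) {
            this.groupKey = groupKey;
            this.urls = urls;
            this.targetFetchTimeMs = targetFetchTimeMs;
        }
    }

    /** Step 1: group URLs by lower-cased hostname; URLs without a host are skipped. */
    static Map<String, List<String>> groupByHost(List<String> urls) {
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                continue;
            }
            byHost.computeIfAbsent(host.toLowerCase(Locale.ROOT), h -> new ArrayList<>()).add(url);
        }
        return byHost;
    }

    /**
     * Step 4: split one group's URLs into batches of at most maxBatchSize.
     * Assumed formula: each batch's target time is the start time plus one
     * crawl delay for every URL in the batches before it.
     */
    static List<Batch> makeBatches(String groupKey, List<String> urls,
                                   int maxBatchSize, long crawlDelayMs, long startTimeMs) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        List<Batch> batches = new ArrayList<>();
        long targetTime = startTimeMs;
        for (int i = 0; i < urls.size(); i += maxBatchSize) {
            int end = Math.min(i + maxBatchSize, urls.size());
            List<String> slice = new ArrayList<>(urls.subList(i, end));
            batches.add(new Batch(groupKey, slice, targetTime));
            targetTime += slice.size() * crawlDelayMs;
        }
        return batches;
    }

    /** Step 5: map a group key to one of numReducers partitions, always non-negative. */
    static int partitionFor(String groupKey, int numReducers) {
        return Math.floorMod(groupKey.hashCode(), numReducers);
    }
}
```

Because the partition is derived only from the group key, every batch for the same key lands on the same reducer, which is what lets a single reducer enforce the crawl delay for that server. In a real crawl the key would be the resolved IP address rather than the hostname.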

Note: If you are looking for the Bi(x)o command line tool project, the home page is here.