About

Background

The Bixo project came about because two different companies needed the same thing – a web mining toolkit that could easily fit into an existing Cascading-based workflow.

In discussing various ways to solve this problem, it became clear that refactoring Nutch to work in this environment would be a painful and error-prone process. In addition, the known limitations of Nutch would still need to be worked around, while the resulting massive fork would have little to no chance of being rolled back into the main Nutch codebase.

So the shortest distance between the two points was a new, slimmed-down implementation that satisfied the following constraints:

  • Used Cascading to manage the internal workflow and to integrate with external data sources and sinks (outputs).
  • Supported only http and https protocols, at least initially.
  • Efficiently yet politely crawled white lists covering a limited number of discrete domains.
  • Was testable at multiple levels (unit, integration, simulated web crawl).

Powered By

The following is a partial list of companies using Bixo, along with any public details of use cases.

  • Bebo – Help ensure the quality of their user experience.
  • EMI Music – Extract music/artist popularity data from sources such as Facebook.
  • ShareThis – Fetch, parse, and generate a searchable index for shared URLs, and mine a larger set of viewed web pages.
  • Bixo Labs – Bixo is a key component of their new EC2-based elastic web mining platform.

Acknowledgements

We’d like to thank the following companies and individuals for their support of Bixo:

  • EMI Music and ShareThis for sponsoring Bixo.
  • 101tec for initially hosting the project.
  • Stefan Groschupf for getting the project started.
  • Chris Wensel (author of Cascading, co-founder of Scale Unlimited) for extensive technical support.
  • YourKit for providing a free license to their excellent Java Profiler. YourKit is kindly supporting open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of innovative and intelligent tools for profiling Java and .NET applications. Take a look at YourKit’s leading software products: YourKit Java Profiler and YourKit .NET Profiler.

Bixo also makes heavy use of a number of open source projects:

  • Nutch – a great source for ideas and inspiration.
  • HttpClient 4 – for all your HTTP protocol needs.
  • Tika – a relatively new parser framework.
  • Cascading – the key to efficient and reliable workflow.
  • Hadoop – our foundation for distributed data processing.