Getting Started
Requirements
A Java Runtime Environment (JRE) – version 1.6 or later.
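The version requirement above can be checked from the command line before going further. The following is a minimal sketch, not part of the Bixo distribution: the function names `java_version_ok` and `installed_java_version` are hypothetical, and the parsing assumes that `java -version` prints a line containing `version "<number>"`, which is common but not guaranteed for every JRE vendor.

```shell
# Hypothetical helper: succeeds if a Java version string such as "1.6.0_45"
# or "11.0.2" denotes version 1.6 or later.
java_version_ok() {
    version="$1"
    major=${version%%[!0-9]*}      # leading digits, e.g. "1" or "11"
    [ -n "$major" ] || return 1    # no leading number: treat as unusable

    # Versions numbered 2 and up (9, 11, 17, ...) are all later than 1.6.
    if [ "$major" -gt 1 ]; then
        return 0
    fi
    [ "$major" -eq 1 ] || return 1

    # Old-style "1.x" numbering: the second component must be at least 6.
    rest=${version#1.}
    minor=${rest%%[!0-9]*}
    [ -n "$minor" ] && [ "$minor" -ge 6 ]
}

# Hypothetical helper: extracts the quoted version from `java -version`,
# which writes its report to standard error (assumed output format).
installed_java_version() {
    java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}
```

A possible use, once both functions are defined in your shell, is `java_version_ok "$(installed_java_version)" && echo "Java is recent enough"`.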
Running locally with pre-built binaries
The Bixo distribution comes with some examples that demonstrate how to use the Bixo toolkit. The best way to get started with Bixo is to experiment with those examples and perhaps use them as templates for your own workflows.
- Download the latest distribution file and save it to your computer.
- Expand the file into a directory on your computer.
- Using the command line:
% cd <bixo distribution directory>/examples
% bin/bixodemo crawl -agentname <name> -domain <domain> -outputdir <dir> -numloops 3
This runs the DemoCrawlTool, an example that showcases how to write a simple crawler using Bixo. With the parameters above, it starts crawling in <domain> and performs three loops of the crawl cycle. The results are saved to the output directory you specify. This directory should not exist yet; otherwise the crawl assumes you are continuing from a previous crawl. The <domain> should be a valid domain name, e.g. cnn.com, and the <name> you specify for the agent name should be something specific to your organization or use case, NOT “bixo”. To list all options that the DemoCrawlTool accepts, run the bixodemo script with just the ‘crawl’ command and no arguments.
After the crawl has completed, you can dump some statistics by running the DemoStatusTool:
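The two constraints described above (an output directory that does not exist yet, and an agent name other than “bixo”) can be checked before the crawl is launched. The following is a minimal sketch under stated assumptions: the function name `run_fresh_crawl` and the `BIXODEMO` variable are hypothetical and not part of Bixo; only the `crawl` command and its four options come from the example above.

```shell
# Hypothetical wrapper around the documented `bin/bixodemo crawl` command.
# BIXODEMO is an assumed override for the script path (handy for dry runs);
# by default the script is expected at bin/bixodemo, relative to the
# examples directory of the distribution.
run_fresh_crawl() {
    agent="$1"; domain="$2"; outdir="$3"; loops="${4:-3}"

    # The agent name should identify your organization, not Bixo itself.
    if [ -z "$agent" ] || [ "$agent" = "bixo" ]; then
        echo "error: pick an agent name specific to your organization" >&2
        return 1
    fi

    # An existing output directory would make the crawl continue a previous run.
    if [ -e "$outdir" ]; then
        echo "error: $outdir already exists; the crawl would continue from it" >&2
        return 1
    fi

    "${BIXODEMO:-bin/bixodemo}" crawl -agentname "$agent" -domain "$domain" \
        -outputdir "$outdir" -numloops "$loops"
}
```

Called from the examples directory as, say, `run_fresh_crawl acme-research-crawler cnn.com /tmp/bixo-crawl`, this refuses to start if `/tmp/bixo-crawl` already exists. Setting `BIXODEMO=echo` first prints the command line instead of running it. The agent name and path shown here are placeholders.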
% bin/bixodemo status -crawldir <dir>
Another included example is the DemoWebMiningTool, which demonstrates how to create a focused crawler with an emphasis on extracting (semi-)structured data from web pages by analyzing the fetched data with a DOM parser. To run the DemoWebMiningTool, execute:
% bin/bixodemo webmining -agentname <name> -workingdir <dir>
See Building Bixo for details on how to build Bixo from source.
Running locally in Eclipse
- Follow the Building Bixo steps for getting the source and creating an Eclipse project.
- Open the Run dialog for the DemoCrawlTool class, and specify appropriate parameters for the file containing URLs to crawl, the directory to use for results, and the user-agent name. You’ll also need to set the JVM parameters to “-Xmx256m” so that there’s enough memory to run the Hadoop jobs.
Running in Amazon’s EC2
See the detailed instructions on the Running Bixo in EC2 page.
Bixo Maven Information
By including the Bixo and Conjars repositories in your pom.xml file
<repositories>
  <repository>
    <id>Bixo</id>
    <name>Bixo GitHub repository</name>
    <url>http://bixo.github.com/repo/</url>
  </repository>
  <repository>
    <id>Conjars</id>
    <name>Cascading repository</name>
    <url>http://conjars.org/repo/</url>
  </repository>
</repositories>
you can directly pull in stable releases of Bixo (and Cascading) via:
<dependency>
  <groupId>bixo</groupId>
  <artifactId>bixo-core</artifactId>
  <version>0.8.0</version>
</dependency>