Getting Started

Requirements

A Java Runtime Environment (JRE) – version 1.6 or later.

Running locally with pre-built binaries

The Bixo distribution comes with some examples that demonstrate how to use the Bixo toolkit. The best way to get started with Bixo is to experiment with those examples, and maybe even use them as templates for your own workflows.

  1. Download the latest distribution file and save it to your computer.
  2. Expand the file into a directory on your computer.
  3. Using the command line:
    • % cd <bixo-dir>/examples
    • % bin/bixodemo crawl -agentname <agent name> -domain <domain> -outputdir <output dir> -numloops 3

This will run the DemoCrawlTool, an example that showcases how to write a simple crawler using Bixo. With the above set of parameters it starts crawling in the domain given by -domain and does three loops of the crawl cycle. The results will be saved to the output directory you specify with -outputdir. This directory shouldn’t exist yet; otherwise the crawl will assume you’re continuing from a previous crawl. The domain should be a valid top-level domain, e.g. cnn.com, and the agent name you specify with -agentname should be something specific to your organization or use case, NOT “bixo”. To get the list of all options that can be supplied to the DemoCrawlTool, run the bixodemo script with just the ‘crawl’ command and no arguments.
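Because an existing output directory makes the crawl continue from a previous run, it can help to check the directory before launching. The following is a minimal sketch, not part of the Bixo distribution: the function names is_fresh_output_dir and run_crawl are hypothetical, and only the flags shown in the command above are used.

```shell
#!/bin/sh
# Sketch only: hypothetical helper functions around the crawl command above.
# Neither function ships with Bixo.

# Succeeds only if the path does not exist yet, i.e. the crawl would start
# fresh instead of continuing from a previous crawl.
is_fresh_output_dir() {
    [ ! -e "$1" ]
}

# Arguments: agent name, domain, output directory, number of loops.
# Refuses to start if the output directory already exists.
# Must be called from the examples directory of the distribution.
run_crawl() {
    if ! is_fresh_output_dir "$3"; then
        echo "Output directory '$3' already exists; the crawl would continue from it." >&2
        return 1
    fi
    bin/bixodemo crawl -agentname "$1" -domain "$2" -outputdir "$3" -numloops "$4"
}
```

For example, after sourcing these functions, run_crawl acme-crawler cnn.com /tmp/bixo-crawl 3 would start a fresh three-loop crawl, where acme-crawler and /tmp/bixo-crawl are placeholder values. To continue a previous crawl on purpose, call bin/bixodemo directly instead.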

After the crawl has completed, you can dump some statistics by executing the DemoStatusTool:

% bin/bixodemo status -workingdir <working dir>

Another included example is the DemoWebMiningTool, which demonstrates how to create a focused crawler with an emphasis on extracting (semi-)structured data from web pages by analyzing the fetched data with a DOM parser. To run the DemoWebMiningTool, just execute:

% bin/bixodemo webmining -agentname <agent name> -workingdir <working dir>

See Building Bixo for details on how to build Bixo from source.

Running locally in Eclipse

  1. Follow the Building Bixo steps for getting the source and creating an Eclipse project.
  2. Open the Run dialog for the DemoCrawlTool class, and specify appropriate parameters for the file containing URLs to crawl, the directory to use for results, and the user-agent name. You’ll also need to set the JVM parameters to “-Xmx256m” so that there’s enough memory to run the Hadoop jobs.

Running in Amazon’s EMR

See the instructions on the Running Bixo in EMR page.

Bixo Maven Information

By including the Conjars repository in your pom.xml file

    <repositories>
        <repository>
            <id>Conjars</id>
            <url>http://conjars.org/repo</url>
        </repository>
    </repositories>

you can directly pull in stable releases of Bixo (and Cascading) via:

        <dependency>
            <groupId>bixo</groupId>
            <artifactId>bixo-core</artifactId>
            <version>0.9.2</version>
        </dependency>