
Bixo gets a YourKit license

November 19, 2010

We’re very excited that YourKit is granting the project a free license to their excellent Java profiler. I used it extensively a few years ago, during my Krugle startup days, and it saved my butt on more than one occasion. As Bixo scales out to a billion pages, the ability to diagnose performance issues and JVM memory usage becomes increasingly critical.

Bixo on CIA.vc

October 10, 2010

CIA.vc is an interesting site that I’ve kept an eye on for many years, ever since we first noticed it during the early days of Krugle. It calls itself “The open source version control informant”, and its goal is to track changes across all of the open source projects with publicly accessible source code management systems.

I recently noticed that GitHub has a post-commit hook for CIA.vc, so I created an entry for Bixo (http://cia.vc/stats/project/bixo), and connected it to GitHub. So now it should be possible for anyone using CIA to easily track activity on the Bixo project.

All of these HTTP-based, at least somewhat RESTful-ish APIs are making it increasingly trivial to connect the dots, which is great. It’s too bad the Galactic Project Registry never took off, as I’d love to have One Source of Truth about all of the open source projects in the world, especially for license, activity, and usage data.

Redirect Mode added

August 29, 2010

The master version of Bixo now lets you control how redirects get handled during fetching. Why would you care? Well, if the URLs you are processing wind up redirecting between domains, then you often want to avoid blindly following them, because when a redirect is followed there is no check for whether the target URL is blocked by robots.txt. Also, if you’re building a link graph, then you need to know that a link from Page A to Page B, where Page B redirects to Page C, should actually be treated as a link to Page C.

How has this been implemented? It’s a new FetcherPolicy setting. From FetcherPolicy.java:


// Possible redirect handling modes. If a redirect is NOT followed
// because of this setting, then a RedirectFetchException is thrown,
// which is the same as what happens if too many redirects occur.
// But RedirectFetchException now has a reason field, which can 
// be set to TOO_MANY_REDIRECTS, PERM_REDIRECT_DISALLOWED,
// or TEMP_REDIRECT_DISALLOWED.

public enum RedirectMode {
    FOLLOW_ALL,     // Fetcher will try to follow all redirects.
    FOLLOW_TEMP,    // Temporary redirects are followed automatically; permanent ones are not.
    FOLLOW_NONE     // No redirects are followed.
}

The default setting is FOLLOW_ALL, in which case the SimpleHttpFetcher behaves the same as before. To set a new mode, you’d do something like:


    FetcherPolicy policy = new FetcherPolicy();
    policy.setRedirectMode(RedirectMode.FOLLOW_TEMP);
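
To make the three modes concrete, here is a small self-contained sketch of the decision they imply. It is not Bixo's implementation. The class RedirectModeSketch, the RedirectReason enum name, the checkRedirect helper, and the rule that only HTTP 301 counts as permanent are all assumptions made for illustration. Only the RedirectMode values and the three reason names come from the FetcherPolicy.java comment quoted above.

```java
import java.util.Optional;

// Illustrative only: a standalone model of the redirect decision, not Bixo code.
class RedirectModeSketch {

    // Same values as the enum quoted from FetcherPolicy.java.
    enum RedirectMode { FOLLOW_ALL, FOLLOW_TEMP, FOLLOW_NONE }

    // Reason names from the quoted comment; the enum name itself is hypothetical.
    enum RedirectReason { TOO_MANY_REDIRECTS, PERM_REDIRECT_DISALLOWED, TEMP_REDIRECT_DISALLOWED }

    // Assumption: 301 is the only permanent redirect; 302, 303 and 307 are temporary.
    static boolean isPermanent(int status) {
        return status == 301;
    }

    // Returns empty if the redirect may be followed, otherwise the reason it is refused.
    static Optional<RedirectReason> checkRedirect(RedirectMode mode, int status,
                                                  int redirectsSoFar, int maxRedirects) {
        boolean permanent = isPermanent(status);
        if (mode == RedirectMode.FOLLOW_NONE || (mode == RedirectMode.FOLLOW_TEMP && permanent)) {
            return Optional.of(permanent
                    ? RedirectReason.PERM_REDIRECT_DISALLOWED
                    : RedirectReason.TEMP_REDIRECT_DISALLOWED);
        }
        if (redirectsSoFar >= maxRedirects) {
            return Optional.of(RedirectReason.TOO_MANY_REDIRECTS);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Print the outcome for a permanent (301) and a temporary (302) redirect in each mode.
        for (RedirectMode mode : RedirectMode.values()) {
            for (int status : new int[] {301, 302}) {
                Optional<RedirectReason> refusal = checkRedirect(mode, status, 0, 20);
                System.out.println(mode + " + " + status + " -> "
                        + refusal.map(Enum::name).orElse("follow"));
            }
        }
    }
}
```

In Bixo itself a refused redirect surfaces as a RedirectFetchException carrying the reason, rather than as a return value. The sketch uses an Optional only to keep the example free of any dependency.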

Bixo Hackathon September 7th & 8th

August 26, 2010

There’s a Bixo hackathon next month, and you’re invited.

The trip is probably a long jaunt for many, but even if you can’t make it you can still help by telling us which areas of Bixo you think need the most love.

Note that even if you’re not a hard-core Bixo user, fringe benefits from participating include learning a lot about the very useful underlying technologies (Cascading, Hadoop, HttpClient) as well as getting an excuse to visit beautiful Nevada City, California.

Some known issues are:

  • Documentation & tutorials (of course).
  • Changing the xxxDatum data model to be wrappers for Cascading tuples, versus POJOs.
  • Making datum metadata into a single unchecked field in datums passed through pre-defined sub-assemblies.
  • Using abstract base classes versus interfaces for many/most extension points.
  • Creating separate crawl and parse policies, versus having just a fetch policy.
  • Emitting (optional) binary data and cleaned-up XHTML text in ParseDatum.
  • Supporting page scoring/link scores out-of-the-box.
  • Switching back to using SequenceFiles for maintaining crawl state.

So send a note to the bixo-dev mailing list if you’re interested in attending, or just want to cast a vote/suggest additional changes.

We’ve Moved

January 15, 2010

OK, so that’s pretty obvious since this is at a new domain, with a new look.

101tec graciously hosted Bixo for the first 9 months, and we’d like to thank them for their support.

We’re in the process of migrating other services off of 101tec’s web site as well. The short list is:

  • Issue tracking is moving from Jira to the GitHub Bixo issues system. Existing issues will be ported over incrementally, based on priority.
  • Nexus (Maven repository) will be replaced by static web content served up by either GitHub or code.google.com.
  • TeamCity (build system) will be disabled until we have a pressing need for it.

The mailing list at Yahoo Groups and the source repository at GitHub remain unchanged.

If anything is broken, please send an email to the mailing list – thanks!