Redirect Mode added

August 29, 2010

The master version of Bixo now lets you control how redirects are handled during fetching. Why would you care? Well, if the URLs you are processing wind up redirecting between domains, then you often want to avoid blindly following them, because when a redirect is followed automatically there is no check for whether the target URL is blocked by robots.txt. Also, if you need to track links because you’re building a link graph, then you need to know when Page B redirects to Page C, so that the link from Page A to Page B can be treated as a link to Page C.

How has this been implemented? It’s a new FetcherPolicy setting. From FetcherPolicy.java:


// Possible redirect handling modes. If a redirect is NOT followed
// because of this setting, then a RedirectFetchException is thrown,
// which is the same as what happens if too many redirects occur.
// But RedirectFetchException now has a reason field, which can 
// be set to TOO_MANY_REDIRECTS, PERM_REDIRECT_DISALLOWED,
// or TEMP_REDIRECT_DISALLOWED.

public enum RedirectMode {
    FOLLOW_ALL,     // Fetcher will try to follow all redirects.
    FOLLOW_TEMP,    // Temp redirects are auto-followed, but not permanent ones.
    FOLLOW_NONE     // No redirects are followed.
}
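
To make the three modes concrete, here is a rough, self-contained sketch of the decision each mode implies. It is not Bixo's code: the class RedirectModeSketch, the Reason enum and the check() method are hypothetical names, and treating only HTTP 301 as a permanent redirect (with 302, 303 and 307 as temporary) is an assumption made for illustration.

```java
import java.util.Optional;

public class RedirectModeSketch {
    // Mirrors the enum quoted from FetcherPolicy.java.
    public enum RedirectMode { FOLLOW_ALL, FOLLOW_TEMP, FOLLOW_NONE }

    // Hypothetical stand-in for the two "disallowed" reason values.
    public enum Reason { PERM_REDIRECT_DISALLOWED, TEMP_REDIRECT_DISALLOWED }

    // Assumption: only 301 counts as permanent; any other redirect status is temporary.
    public static boolean isPermanent(int status) {
        return status == 301;
    }

    // Returns empty if the redirect may be followed, otherwise the reason it is refused.
    public static Optional<Reason> check(RedirectMode mode, int status) {
        boolean permanent = isPermanent(status);
        if (mode == RedirectMode.FOLLOW_ALL) {
            return Optional.empty();
        }
        if (mode == RedirectMode.FOLLOW_TEMP && !permanent) {
            return Optional.empty();
        }
        // FOLLOW_TEMP with a permanent redirect, or FOLLOW_NONE with any redirect.
        Reason reason = permanent
                ? Reason.PERM_REDIRECT_DISALLOWED
                : Reason.TEMP_REDIRECT_DISALLOWED;
        return Optional.of(reason);
    }

    public static void main(String[] args) {
        System.out.println(check(RedirectMode.FOLLOW_TEMP, 301));
        System.out.println(check(RedirectMode.FOLLOW_TEMP, 302));
    }
}
```

In this sketch a refused redirect always carries a reason, which is the information a caller would need in order to decide what to do with the redirect target.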

The default setting is FOLLOW_ALL, in which case the SimpleHttpFetcher behaves the same as before. To set a new mode, you’d do something like:


    FetcherPolicy policy = new FetcherPolicy();
    policy.setRedirectMode(RedirectMode.FOLLOW_TEMP);
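
Once a redirect is refused, the caller has to decide what to do with it. The sketch below shows one way that might look. It uses stand-in types rather than the real Bixo classes so that it compiles on its own: the nested RedirectFetchException, its getRedirectedUrl() and getReason() accessors, and the nextAction() helper are all hypothetical, and the action strings are only placeholders for whatever your workflow does.

```java
public class RedirectHandlingSketch {
    // Reason values as named in the FetcherPolicy.java comment.
    public enum RedirectReason {
        TOO_MANY_REDIRECTS, PERM_REDIRECT_DISALLOWED, TEMP_REDIRECT_DISALLOWED
    }

    // Hypothetical stand-in for Bixo's exception; the accessors are assumptions.
    public static class RedirectFetchException extends Exception {
        private final String redirectedUrl;
        private final RedirectReason reason;

        public RedirectFetchException(String redirectedUrl, RedirectReason reason) {
            super("Redirect not followed: " + reason);
            this.redirectedUrl = redirectedUrl;
            this.reason = reason;
        }

        public String getRedirectedUrl() { return redirectedUrl; }
        public RedirectReason getReason() { return reason; }
    }

    // Maps the reason to a placeholder action for the crawl workflow.
    public static String nextAction(RedirectFetchException e) {
        switch (e.getReason()) {
            case PERM_REDIRECT_DISALLOWED:
                // Link graph case: record the target as the real destination.
                return "relink:" + e.getRedirectedUrl();
            case TEMP_REDIRECT_DISALLOWED:
                // Queue the target so robots.txt is checked before it is fetched.
                return "requeue:" + e.getRedirectedUrl();
            default:
                // TOO_MANY_REDIRECTS: treat as a failed fetch.
                return "give-up";
        }
    }

    public static void main(String[] args) {
        RedirectFetchException e = new RedirectFetchException(
                "http://example.org/c", RedirectReason.PERM_REDIRECT_DISALLOWED);
        System.out.println(nextAction(e));
    }
}
```

Routing the redirect target back through the normal URL queue, instead of following it inside the fetcher, is what gives the robots.txt check a chance to run against the new domain.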