Redirect Mode added

The master version of Bixo now lets you control how redirects get handled during fetching. Why would you care? Well, if the URLs you are processing wind up redirecting between domains, then you often want to avoid blindly following them, because when a redirect is followed automatically there is no check for whether the target URL is blocked by robots.txt. Also, if you’re building a link graph, then you need to know when Page B redirects to Page C, so that the link from Page A to Page B can be treated as a link to Page C.

How has this been implemented? It’s a new FetcherPolicy setting. From FetcherPolicy.java:


// Possible redirect handling modes. If a redirect is NOT followed
// because of this setting, then a RedirectFetchException is thrown,
// which is the same as what happens if too many redirects occur.
// But RedirectFetchException now has a reason field, which can 
// be set to TOO_MANY_REDIRECTS, PERM_REDIRECT_DISALLOWED,
// or TEMP_REDIRECT_DISALLOWED.

public enum RedirectMode {
    FOLLOW_ALL,     // Fetcher will try to follow all redirects
    FOLLOW_TEMP,    // Temp redirects are auto-followed, but not permanent.
    FOLLOW_NONE     // No redirects are followed.
}
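To make the three modes concrete, here is a minimal sketch of the kind of decision a fetcher has to make each time it sees a redirect. This is not Bixo's code: the class RedirectModeSketch, the method refusalReason and the hop-count parameters are hypothetical, the enums are stand-ins that reuse the names above, and the mapping of status codes to "permanent" (301, 308) versus "temporary" (everything else) is an assumption.

```java
import java.util.Optional;

// Illustrative only: these types reuse names from the post but are
// hypothetical stand-ins, not Bixo's classes.
class RedirectModeSketch {
    enum RedirectMode { FOLLOW_ALL, FOLLOW_TEMP, FOLLOW_NONE }

    enum RedirectReason {
        TOO_MANY_REDIRECTS, PERM_REDIRECT_DISALLOWED, TEMP_REDIRECT_DISALLOWED
    }

    // Assumption: 301 and 308 count as permanent; any other redirect
    // status (302, 303, 307) is treated as temporary.
    static boolean isPermanent(int status) {
        return status == 301 || status == 308;
    }

    // Returns the reason a redirect must NOT be followed,
    // or an empty Optional if following it is allowed.
    static Optional<RedirectReason> refusalReason(RedirectMode mode, int status,
                                                  int hopsSoFar, int maxHops) {
        if (hopsSoFar >= maxHops) {
            return Optional.of(RedirectReason.TOO_MANY_REDIRECTS);
        }
        boolean permanent = isPermanent(status);
        switch (mode) {
            case FOLLOW_NONE:
                return Optional.of(permanent
                        ? RedirectReason.PERM_REDIRECT_DISALLOWED
                        : RedirectReason.TEMP_REDIRECT_DISALLOWED);
            case FOLLOW_TEMP:
                return permanent
                        ? Optional.of(RedirectReason.PERM_REDIRECT_DISALLOWED)
                        : Optional.empty();
            default: // FOLLOW_ALL
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // A permanent redirect under FOLLOW_TEMP is refused; a temporary one is not.
        System.out.println(refusalReason(RedirectMode.FOLLOW_TEMP, 301, 0, 5));
        System.out.println(refusalReason(RedirectMode.FOLLOW_TEMP, 302, 0, 5));
    }
}
```

In this sketch the hop limit is checked first, so a redirect chain that is already too long is reported as TOO_MANY_REDIRECTS regardless of the mode.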

The default setting is FOLLOW_ALL, in which case the SimpleHttpFetcher behaves the same as before. To set a new mode, you’d do something like:


    FetcherPolicy policy = new FetcherPolicy();
    policy.setRedirectMode(RedirectMode.FOLLOW_TEMP);
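Once a stricter mode is set, the calling code has to do something with the RedirectFetchException that results. The sketch below shows one possible way to branch on the reason. It is self-contained and does not use Bixo: the RedirectFetchException here is a hypothetical stand-in (the real class may expose its reason through a different accessor than getReason()), and the NextStep values are just one policy a crawler might choose.

```java
// Illustrative only: a stand-alone model of reacting to a refused redirect.
class RedirectHandlingSketch {
    enum RedirectReason {
        TOO_MANY_REDIRECTS, PERM_REDIRECT_DISALLOWED, TEMP_REDIRECT_DISALLOWED
    }

    // Hypothetical stand-in for Bixo's exception; the accessor name is assumed.
    static class RedirectFetchException extends Exception {
        private final RedirectReason reason;

        RedirectFetchException(String url, RedirectReason reason) {
            super("Redirect not followed for " + url + ": " + reason);
            this.reason = reason;
        }

        RedirectReason getReason() {
            return reason;
        }
    }

    // One possible policy; these outcomes are assumptions, not Bixo behavior.
    enum NextStep { RECORD_NEW_LOCATION, RETRY_LATER, GIVE_UP }

    static NextStep nextStep(RedirectFetchException e) {
        switch (e.getReason()) {
            case PERM_REDIRECT_DISALLOWED:
                // The page has moved for good: queue the new URL as its own fetch.
                return NextStep.RECORD_NEW_LOCATION;
            case TEMP_REDIRECT_DISALLOWED:
                // The original URL is still canonical: try it again later.
                return NextStep.RETRY_LATER;
            case TOO_MANY_REDIRECTS:
                return NextStep.GIVE_UP;
            default:
                throw new IllegalStateException("Unknown reason: " + e.getReason());
        }
    }

    public static void main(String[] args) {
        try {
            throw new RedirectFetchException("http://example.com/old",
                    RedirectReason.PERM_REDIRECT_DISALLOWED);
        } catch (RedirectFetchException e) {
            System.out.println(e.getMessage() + " -> " + nextStep(e));
        }
    }
}
```

Queuing the redirect target as a separate fetch means it goes through the same robots.txt check as any other URL, which is the reason for not following cross-domain redirects blindly in the first place.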
