Frequently Asked Questions (FAQ)

How much time should I allocate for my crawl?

Unfortunately you can’t know in advance how long it will take to crawl all of your URLs, which is one of the classic problems with a vertical crawl. The time required to crawl N URLs depends on the domain that takes the longest to crawl.

And that depends on the number of URLs for the domain, the robots.txt policy for the domain, that domain’s server performance, the performance of your cluster’s DNS/internet pipe, etc.

That’s why Bixo supports a time-based crawl policy, where you set the target duration for the fetch phase. The concept of time-based crawling is to say “Get me as many of the most important URLs as you can, in the next 90 minutes (or whatever)”.

You often won’t get all of the URLs, but you will get the most important ones. And then you can move on to parsing, analyzing, and doing other useful things with the data you fetched, versus watching the fetch phase slowly and painfully limp to completion. See this blog post for more details on why a vertical/focused crawl typically suffers from having a limited number of unique domains.
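
To make the idea concrete, here’s a minimal, self-contained sketch of time-based fetching: work through a prioritized list of URLs until a wall-clock deadline passes, then stop. It deliberately doesn’t use Bixo’s actual API; the class and method names below are illustrative only.

import java.util.List;
import java.util.concurrent.TimeUnit;

public class TimeBasedCrawlSketch {

    // Stand-in for the real fetch of a single URL (issue the request, store the result).
    static void fetchOne(String url) {
        // ...
    }

    // Fetch as many of the already-prioritized URLs as possible before the deadline.
    static int fetchUntilDeadline(List<String> urlsByImportance, long targetMinutes) {
        long endTime = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(targetMinutes);
        int fetched = 0;
        for (String url : urlsByImportance) {
            if (System.currentTimeMillis() >= endTime) {
                break;  // Deadline reached: stop fetching and move on to parsing/analysis.
            }
            fetchOne(url);
            fetched++;
        }
        return fetched;
    }
}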

Can I ignore robots.txt?

Well, yes – but you want to do this with extreme caution. Ignoring the information found in robots.txt files is considered very impolite, so you would only want to do this in very specific situations. For example, if you and the site owner were best friends, and they explicitly said it was OK for you to crawl everything as fast as you wanted.

If so, then the easiest way to ignore robots.txt is to create and use your own implementation of IGroupingKeyGenerator, e.g.

public class NoRobotsKeyGenerator implements IGroupingKeyGenerator {

    // Group URLs solely by their paid-level domain, so robots.txt is never consulted.
    // Requires java.io.IOException plus Bixo's IGroupingKeyGenerator, UrlDatum, and
    // DomainNames classes (import paths depend on your Bixo version).
    public String getGroupingKey(UrlDatum urlDatum) throws IOException {
        return DomainNames.getPLD(urlDatum.getUrl());
    }
}

How does Bixo crawl politely?

Good question. There are four steps taken to ensure efficient yet polite crawling.

First, robots.txt is fetched for each unique sub-domain, and the resulting rules are used to filter URLs.
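
As a rough illustration of this first step (using the open-source crawler-commons robots parser rather than Bixo’s own internals), the sketch below fetches a sub-domain’s robots.txt, parses it, and exposes both the allow/disallow rules used to filter URLs and the crawl delay used in the next step.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsFilterSketch {

    // Fetch http://<sub-domain>/robots.txt and parse it into a set of rules.
    static BaseRobotRules fetchRules(String subDomain, String userAgent) throws Exception {
        String robotsUrl = "http://" + subDomain + "/robots.txt";
        ByteArrayOutputStream content = new ByteArrayOutputStream();
        try (InputStream in = new URL(robotsUrl).openStream()) {
            byte[] buffer = new byte[4096];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                content.write(buffer, 0, bytesRead);
            }
        }
        return new SimpleRobotRulesParser()
                .parseContent(robotsUrl, content.toByteArray(), "text/plain", userAgent);
    }

    public static void main(String[] args) throws Exception {
        BaseRobotRules rules = fetchRules(args[0], "mycrawler");
        // Keep only URLs that the rules allow for this user agent.
        System.out.println(rules.isAllowed("http://" + args[0] + "/some/page.html"));
        // The crawl delay (if any) feeds into the grouping key described next.
        System.out.println(rules.getCrawlDelay());
    }
}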

Second, the crawl delay specified by robots.txt (or a default value of 30 seconds) is combined with the server’s IP address to create the key used to group URLs. This means that when it comes time to process a queue of URLs, every URL in that queue is from the same server and has the same crawl delay. Note that without using the IP address as the key, it’s impossible to avoid slamming a server that handles multiple domains. However, we do offer an alternative that groups by the paid-level domain (PLD) instead of the IP address, since for certain types of crawls it’s OK to assume that different domains are handled by different servers.
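
Here’s a rough sketch of that grouping key, under the assumption (purely for illustration) that the key is just the server’s IP address and the crawl delay joined into a string; the exact format Bixo uses isn’t spelled out here, and the DNS lookup would normally be cached.

import java.net.InetAddress;
import java.net.URL;

public class PoliteGroupingKeySketch {

    private static final long DEFAULT_CRAWL_DELAY_SECONDS = 30;

    // Combine the server's IP address with the crawl delay, so that every URL in a
    // given queue comes from the same server and shares the same delay.
    static String makeGroupingKey(String url, Long robotsCrawlDelaySeconds) throws Exception {
        String host = new URL(url).getHost();
        String ipAddress = InetAddress.getByName(host).getHostAddress();
        long delaySeconds = (robotsCrawlDelaySeconds != null)
                ? robotsCrawlDelaySeconds : DEFAULT_CRAWL_DELAY_SECONDS;
        return ipAddress + "-" + delaySeconds;
    }

    public static void main(String[] args) throws Exception {
        // Two domains hosted on the same server (same IP) land in the same queue.
        System.out.println(makeGroupingKey("http://www.example.com/a.html", null));
    }
}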

Third, when the queue is being created for all of the URLs that share the same server and crawl delay, these are sorted by the URL’s initial score, which is typically calculated from the length of time since the URL was last successfully fetched. Based on the target crawl duration, only the top N URLs are actually queued for fetching. So you wind up with a sorted list of URLs for the same server that are likely to be fetchable during the crawl.
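
Here’s a sketch of that selection step, assuming (for illustration only) that N is estimated as the target crawl duration divided by the crawl delay; Bixo’s actual scoring and sizing logic may differ.

import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class TopUrlsSketch {

    // A URL plus its score, e.g. the time since it was last successfully fetched.
    static class ScoredUrl {
        final String url;
        final double score;

        ScoredUrl(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    // Sort one server's URLs by score and keep only as many as are likely to be
    // fetchable within the target duration, at one fetch per crawl delay.
    static List<ScoredUrl> selectTopUrls(List<ScoredUrl> candidates,
                                         long targetDurationSeconds,
                                         long crawlDelaySeconds) {
        long maxUrls = Math.max(1, targetDurationSeconds / crawlDelaySeconds);
        return candidates.stream()
                .sorted(Comparator.comparingDouble((ScoredUrl u) -> u.score).reversed())
                .limit(maxUrls)
                .collect(Collectors.toList());
    }
}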

Finally, when a thread is available for fetching, it is given a list of the top URLs from the next available queue. This list contains as many URLs as the server should be able to fetch over a 5-minute window, given the crawl delay. The thread then attempts to fetch the list of URLs as quickly as possible, using keep-alive. This approach balances the desire to avoid repeatedly creating and abandoning server connections against the need to avoid slamming a server with too many requests in any given time window. This “batched keep-alive” approach is experimental, and we’ll be discussing it in more detail with IT people to determine how best to balance the value of reusing connections against the load spike caused by batched fetches.
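
The batch sizing for this last step can be sketched as follows, assuming the five-minute window is simply divided by the crawl delay (again illustrative, not Bixo’s exact arithmetic).

public class FetchBatchSketch {

    private static final long BATCH_WINDOW_SECONDS = 5 * 60;

    // Roughly how many URLs one thread should take from a queue: the number the
    // server can politely handle in a five-minute window at the given crawl delay.
    static int batchSize(long crawlDelaySeconds) {
        return (int) Math.max(1, BATCH_WINDOW_SECONDS / crawlDelaySeconds);
    }

    public static void main(String[] args) {
        // With the default 30-second delay, a thread grabs about 10 URLs per batch.
        System.out.println(batchSize(30));
    }
}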
