When the crawler loops, it fetches every URL found in the previous loop, right? So the crawl time will likely increase exponentially with each loop, right? So
... The most recent loop dir has an up-to-date snapshot of the crawlDB, which is regenerated after each loop. ... Yes, exactly. You've hit upon a fundamental
OK, so if I understand correctly, every time I restart a crawl on an existing one, it will extend the original crawl with newly fetched data. It never recrawls
Hi Pat, see below. But in general the SimpleCrawlTool is a demo of Bixo, not a complete crawler, so much of the functionality you're asking about is missing.
If, for some reason, a crawl fails to finish properly, what is the recommended way to restart it where it left off, or somewhere close? I tried deleting what