... A 404 should result in UrlStatus.HTTP_NOT_FOUND. What you do with those entries in the crawlDB is up to your processing code. In the DemoCrawlTool, the
The miner is getting URLs that lead to 404s. So Pinterest is allowing removal of pages but leaving the links to those pages around. If leaving them in the crawldb marked
... One other thought - we do "batch" fetching of URLs, using keep-alive to optimize the connection that we create with the server. Pinterest might not like
Again, thanks. You are the only one I know with crawler-fu skills. I'll take a look at the headers in the simple fetcher. I suspect that Pinterest is not