Hi, all,
I am using Nutch 2.1 + HBase to crawl one website. It seems that the remote
site has some rate limit and occasionally returns HTTP code=500. I know I
probably need to tune the crawl parameters, such as the server delay, etc.
But given that I have already crawled lots of pages successfully and only
about 10% of the pages failed, is there a way to fetch only those failed
pages incrementally?
For interrupted jobs, I used the following command to resume:
./bin/nutch fetch 1364930286-844556485 -resume
It successfully resumes the job and crawls the unfetched pages from the
previously failed job. I checked the code; in FetcherJob.java it has:
{{{
if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; already fetched");
  }
  return;
}
}}}
For those failed URLs in the HBase table, the row has:
{{{
f:prot
timestamp=1365478335194, value= \x02nHttp code=500, url=
mk:_ftcmrk_
timestamp=1365478335194, value=1364930286-844556485
}}}
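For reference, I put together a rough, untested sketch with the plain HBase
Java client to list those failed rows (the "webpage" table name is the
Nutch 2.x default, and I simply match on the "Http code=500" text that shows
up in f:prot above; both are assumptions from my setup, so adjust as needed):
{{{
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ListFailedFetches {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // "webpage" is the default Nutch 2.x table name; change it if yours differs.
    HTable table = new HTable(conf, "webpage");
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("f"), Bytes.toBytes("prot"));
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        byte[] prot = r.getValue(Bytes.toBytes("f"), Bytes.toBytes("prot"));
        // Crude check: the serialized ProtocolStatus of my failed rows
        // contains the "Http code=500" text shown in the dump above.
        if (prot != null && Bytes.toString(prot).contains("Http code=500")) {
          System.out.println(Bytes.toString(r.getRow()));
        }
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}
}}}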
It seems that the code only checks _ftcmrk_, regardless of whether the row
has an "f:cnt" or not.
So the question: does Nutch have some option or method that lets me fetch
only those failed pages?
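If there is no built-in option, would a small local change in FetcherMapper
be a reasonable workaround? Below is only an untested sketch; it assumes that
WebPage.getContent() is the field stored as f:cnt, so rows that carry the
fetch mark but never got any content would be fetched again on -resume:
{{{
// Untested sketch of the check in FetcherJob's FetcherMapper.map():
// skip a row only if it carries the fetch mark AND content (f:cnt) was stored,
// so rows that failed with e.g. Http code=500 are fetched again on -resume.
if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null
    && page.getContent() != null) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; already fetched");
  }
  return;
}
}}}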
Thanks a lot.
Tianwei