Hi, all,

I used Nutch 2.1 + HBase to crawl one website. It seems that the remote
website has some rate limit and occasionally returns HTTP code 500. I know
that I probably need to tune the crawl parameters, such as the server
delay, etc. But given that I have already crawled lots of pages
successfully and only about 10% of the pages failed, is there a way to
fetch only those failed pages incrementally?

For interrupted jobs, I used the following command to resume:

./bin/nutch fetch 1364930286-844556485 -resume

It successfully resumed the job and crawled the unfetched pages from the
previous failed job. I checked the code; in FetcherJob.java it has:

{{{
      if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; already
fetched");
        }
        return;
      }
}}}

For those failed URLs in the HBase table, the row has:
{{{
f:prot
  timestamp=1365478335194, value=\x02nHttp code=500, url=
mk:_ftcmrk_
  timestamp=1365478335194, value=1364930286-844556485
}}}


It seems that the code only checks the _ftcmrk_ mark, regardless of
whether an "f:cnt" (content) column is present or not.


So the question: does Nutch have some option or method for me to fetch
only those failed pages?
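
If not, would it be reasonable to simply clear the fetch mark on the
failed rows directly in HBase and then re-run the fetch with -resume?
A rough, untested sketch of what I mean is below; it assumes the table is
named "webpage" and just looks for "code=500" inside the serialized f:prot
value, which may well be too naive:

{{{
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

/** Untested sketch: remove mk:_ftcmrk_ from rows whose f:prot mentions code=500. */
public class ClearFailedFetchMarks {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "webpage");   // table name is an assumption
    byte[] f = Bytes.toBytes("f");
    byte[] mk = Bytes.toBytes("mk");
    byte[] prot = Bytes.toBytes("prot");
    byte[] ftcmrk = Bytes.toBytes("_ftcmrk_");

    Scan scan = new Scan();
    scan.addColumn(f, prot);
    scan.addColumn(mk, ftcmrk);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        byte[] status = r.getValue(f, prot);
        if (status != null && Bytes.toString(status).contains("code=500")
            && r.getValue(mk, ftcmrk) != null) {
          // drop the fetch mark so a later "fetch <batchId> -resume" retries this row
          Delete d = new Delete(r.getRow());
          d.deleteColumns(mk, ftcmrk);
          table.delete(d);
        }
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}
}}}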

Thanks a lot.

Tianwei
