You can set the fetcher.server.delay and fetcher.server.min.delay properties to larger values; with a longer delay between requests to the same server, the crawl success rate will likely be higher. The failed pages will be re-fetched once their fetch time comes around again. You can refer to this post on re-crawling with Nutch: http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
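For example, in conf/nutch-site.xml (the values below are only illustrative; tune them to whatever rate the remote site tolerates):

{{{
<!-- Illustrative values only: increase the per-host politeness delay. -->
<property>
  <name>fetcher.server.delay</name>
  <value>10.0</value>
  <description>Seconds the fetcher waits between successive requests
  to the same server.</description>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>5.0</value>
  <description>Minimum delay between requests to the same server; only
  applies when fetcher.threads.per.queue is greater than 1.</description>
</property>
}}}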
On Wed, Apr 10, 2013 at 3:16 AM, Tianwei Sheng <[email protected]> wrote:

> Hi, all,
>
> I used nutch 2.1 + HBase to crawl one website. It seems that the remote
> website may have some rate limit and will give me http code=500
> occasionally. I know that I probably need to tune the crawl parameters,
> such as the various delays, etc. But given that I have crawled lots of
> pages successfully and only have maybe 10% failed pages, is there a way
> to fetch only those failed pages incrementally?
>
> For interrupted jobs, I used the following command to resume:
>
> ./bin/nutch fetch 1364930286-844556485 -resume
>
> It successfully resumes the job and crawls the unfetched pages from the
> previously failed job. I checked the code; in FetcherJob.java, it has:
>
> {{{
> if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) {
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; already fetched");
>   }
>   return;
> }
> }}}
>
> For the failed urls in the hbase table, the row has:
>
> {{{
> f:prot        timestamp=1365478335194, value=\x02nHttp code=500, url=
> mk:_ftcmrk_   timestamp=1365478335194, value=1364930286-844556485
> }}}
>
> It seems that the code only checks _ftcmrk_, regardless of whether there
> is an "f:cnt" or not.
>
> So the question: does nutch have some option or method for me to fetch
> only those failed pages?
>
> Thanks a lot.
>
> Tianwei
>
--
Don't Grow Old, Grow Up... :-)
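Nutch 2.1 does not seem to have a switch that resumes only the rows whose last fetch failed; as noted above, the -resume path skips anything carrying _ftcmrk_. One possible workaround, if you don't mind patching FetcherJob, is to also look at the stored protocol status (the f:prot column) before skipping, so that rows such as the Http code=500 one above fall through and get fetched again. This is only an untested sketch: the getProtocolStatus()/getCode() accessors and the ProtocolStatusCodes.SUCCESS constant are assumed from the 2.1 storage classes, so double-check them against your source tree.

{{{
// Untested sketch for FetcherJob's mapper: on -resume, skip a row only if
// its fetch mark is set AND its last protocol status was a success.
if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) {
  ProtocolStatus pstatus = page.getProtocolStatus();  // backed by f:prot
  boolean lastFetchOk = pstatus != null
      && pstatus.getCode() == ProtocolStatusCodes.SUCCESS;
  if (lastFetchOk) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; already fetched");
    }
    return;
  }
  // Otherwise the mark is present but the last attempt failed (e.g. Http
  // code=500), so fall through and let this row be fetched again.
}
}}}

A simpler alternative, with no code change, is to wait for the failed pages' fetch time to expire and let the next generate/fetch cycle pick them up, as described in the re-crawl link above.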

