Hi Feng, thanks a lot for the suggestions. I have now increased fetcher.server.delay and the job seems OK.
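For anyone else who hits this: the change was along these lines in conf/nutch-site.xml. The values below are only illustrative for our setup, not recommendations; note that fetcher.server.min.delay applies only when fetcher.threads.per.queue is greater than 1.

{{{
<!-- conf/nutch-site.xml: illustrative values, tune for your own crawl -->
<property>
  <name>fetcher.server.delay</name>
  <value>10.0</value>
  <description>Seconds the fetcher waits between successive requests
  to the same server.</description>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>5.0</value>
  <description>Minimum delay between requests to the same server;
  applies only when fetcher.threads.per.queue is greater than 1.</description>
</property>
}}}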
But due to our own crawling restrictions, we set the fetch time expiration very large, about 6 months (see the PS below for the setting involved). This is mainly because we have a large pool of websites and also want to give new URLs high priority.

On Wed, Apr 10, 2013 at 9:08 AM, feng lu <[email protected]> wrote:

> You can set the fetcher.server.delay and fetcher.server.min.delay
> properties to bigger values; then the crawl success rate may be higher.
> Failed pages will be re-fetched when their fetch time comes. You can
> refer to http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
>
> On Wed, Apr 10, 2013 at 3:16 AM, Tianwei Sheng <[email protected]> wrote:
>
> > Hi, all,
> >
> > I used Nutch 2.1 + HBase to crawl one website. It seems that the remote
> > website may have a rate limit and will occasionally give me HTTP code
> > 500. I know that I probably need to tune the crawl parameters, such as
> > the various delays. But given that I have crawled lots of pages
> > successfully and only about 10% of pages failed, is there a way to
> > fetch only those failed pages incrementally?
> >
> > For interrupted jobs, I used the following command to resume:
> >
> > ./bin/nutch fetch 1364930286-844556485 -resume
> >
> > It successfully resumed the job and crawled the unfetched pages from
> > the previous failed job. I checked the code; in FetcherJob.java it has:
> >
> > {{{
> > if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) {
> >   if (LOG.isDebugEnabled()) {
> >     LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; already fetched");
> >   }
> >   return;
> > }
> > }}}
> >
> > For the failed URLs, the HBase row has:
> >
> > {{{
> > f:prot       timestamp=1365478335194, value=\x02nHttp code=500, url=
> > mk:_ftcmrk_  timestamp=1365478335194, value=1364930286-844556485
> > }}}
> >
> > It seems that the code only checks _ftcmrk_, regardless of whether
> > there is an "f:cnt" or not.
> >
> > So the question: does Nutch have an option or method to fetch only
> > those failed pages?
> >
> > Thanks a lot.
> >
> > Tianwei
>
> --
> Don't Grow Old, Grow Up... :-)
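PS: For the fetch time expiration I mentioned above, the knob we used is db.fetch.interval.default in conf/nutch-site.xml (the default is 30 days). A minimal sketch, with roughly six months expressed in seconds; the exact value is illustrative:

{{{
<!-- conf/nutch-site.xml: re-fetch interval, ~6 months in seconds -->
<property>
  <name>db.fetch.interval.default</name>
  <value>15552000</value>
</property>
}}}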

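PS2: On the original question of fetching only the failed pages, I did not find a built-in option. One idea, untested and only a sketch against the FetcherJob.java check quoted above, is to skip a marked page only when its previous fetch actually succeeded. The retryFailed flag here is hypothetical (it would need a new CLI option), and the accessors (getProtocolStatus(), getCode()) are my reading of the Nutch 2.1 storage classes, so verify against your source tree:

{{{
// Sketch only, not existing Nutch behavior. Assumes:
//   import org.apache.nutch.protocol.ProtocolStatusCodes;
//   import org.apache.nutch.storage.ProtocolStatus;
// and a hypothetical boolean field retryFailed set from a new CLI flag.
if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) {
  ProtocolStatus pstat = page.getProtocolStatus();
  boolean fetchedOk = pstat != null
      && pstat.getCode() == ProtocolStatusCodes.SUCCESS;
  if (fetchedOk || !retryFailed) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; already fetched");
    }
    return;
  }
  // Fall through: the fetch mark is present but the last fetch failed
  // (e.g. HTTP 500), so fetch this page again.
}
}}}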
