Hi Sheng, I haven't tried this, but I have read about something similar on this mailing list.
Maybe you can do a test with a separate Nutch crawl and see how it works.
Are you using 1.x or 2.x? I have also put two rough, untested sketches for
pulling out and re-queuing the failed rows at the very bottom of this mail.

On Wed, Apr 10, 2013 at 11:17 PM, Tianwei Sheng <[email protected]> wrote:

> Hi, Kiran,
>
> Yeah, that's what I want. We also use Pig, so I can just write a Pig
> script to get those URLs and inject them into the table again.
>
> Btw, are you sure that reinjecting a URL into an existing table under
> the same row key will force Nutch to recrawl it? Where can I find the
> documentation or code for this?
>
>
> On Wed, Apr 10, 2013 at 9:25 AM, kiran chitturi
> <[email protected]> wrote:
>
> > In addition to feng lu's suggestions,
> >
> > you can also try to reinject the records. An HBase query with a filter
> > on HTTP status code 500 will give you the list of URLs with status
> > code 500.
> >
> > Then you can simply reinject them, which should make Nutch crawl them
> > again, if I am correct.
> >
> >
> > On Wed, Apr 10, 2013 at 12:08 PM, feng lu <[email protected]> wrote:
> >
> > > You can set the fetcher.server.delay and fetcher.server.min.delay
> > > properties to larger values; maybe the crawl success rate will be
> > > higher. The failed pages will be re-fetched when their fetch time
> > > comes. You can refer to this:
> > > http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
> > >
> > >
> > > On Wed, Apr 10, 2013 at 3:16 AM, Tianwei Sheng
> > > <[email protected]> wrote:
> > >
> > > > Hi, all,
> > > >
> > > > I used Nutch 2.1 + HBase to crawl one website. It seems that the
> > > > remote website has some rate limit and occasionally gives me HTTP
> > > > code 500. I know that I probably need to tune the crawl
> > > > parameters, such as the server delays, etc. But given that I have
> > > > already crawled lots of pages successfully and only about 10% of
> > > > the pages failed, is there a way to fetch only those failed pages
> > > > incrementally?
> > > >
> > > > For interrupted jobs, I used the following command to resume:
> > > >
> > > > ./bin/nutch fetch 1364930286-844556485 -resume
> > > >
> > > > It successfully resumes the job and crawls the pages left
> > > > unfetched by the previous failed job. I checked the code; in
> > > > FetcherJob.java it has:
> > > >
> > > > {{{
> > > > if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) {
> > > >   if (LOG.isDebugEnabled()) {
> > > >     LOG.debug("Skipping " + TableUtil.unreverseUrl(key)
> > > >         + "; already fetched");
> > > >   }
> > > >   return;
> > > > }
> > > > }}}
> > > >
> > > > For those failed URLs, the row in the HBase table has:
> > > >
> > > > {{{
> > > > f:prot
> > > >   timestamp=1365478335194, value=\x02nHttp code=500, url=
> > > > mk:_ftcmrk_
> > > >   timestamp=1365478335194, value=1364930286-844556485
> > > > }}}
> > > >
> > > > It seems that the code only checks _ftcmrk_, regardless of whether
> > > > an "f:cnt" is present or not.
> > > >
> > > > So the question: does Nutch have some option or method for me to
> > > > fetch only those failed pages?
> > > >
> > > > Thanks a lot.
> > > >
> > > > Tianwei
> > >
> > >
> > >
> > > --
> > > Don't Grow Old, Grow Up... :-)
> >
> >
> >
> > --
> > Kiran Chitturi
> >
> > <http://www.linkedin.com/in/kiranchitturi>


--
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>
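Sketch 1 (untested, only a starting point): scan the webpage table for rows
whose protocol status recorded "Http code=500" and write their URLs to a
seed file. This assumes the default Gora HBase mapping for Nutch 2.x
(table "webpage", protocol status in f:prot, row keys are reversed URLs);
the table name, column family, and the exact substring stored for the
protocol status may differ in your setup, so double-check against your own
table first.

{{{
import java.io.PrintWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.nutch.util.TableUtil;

/** Untested sketch: dump URLs whose f:prot value contains "Http code=500". */
public class FailedUrlDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Assumption: default Nutch 2.x schema, i.e. a table named "webpage".
    HTable table = new HTable(conf, "webpage");

    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("f"), Bytes.toBytes("prot"));
    // Keep only rows whose f:prot value contains the 500 marker; adjust the
    // substring if your stored protocol status text looks different.
    SingleColumnValueFilter filter = new SingleColumnValueFilter(
        Bytes.toBytes("f"), Bytes.toBytes("prot"),
        CompareOp.EQUAL, new SubstringComparator("Http code=500"));
    // Skip rows that have no f:prot at all (e.g. never-fetched pages).
    filter.setFilterIfMissing(true);
    scan.setFilter(filter);

    PrintWriter seeds = new PrintWriter("failed-seeds.txt");
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // Row keys are reversed URLs (com.example:http/...), so unreverse
        // them before writing the seed file.
        seeds.println(TableUtil.unreverseUrl(Bytes.toString(r.getRow())));
      }
    } finally {
      scanner.close();
      seeds.close();
      table.close();
    }
  }
}
}}}

You could then feed failed-seeds.txt to ./bin/nutch inject as discussed
above; whether injecting alone is enough to trigger a refetch depends on
the fetch schedule already stored on those rows, which is why the re-crawl
article feng lu linked also resets the fetch time.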

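Sketch 2 (also untested): clear the mk:_ftcmrk_ cell for those same URLs so
that the -resume check quoted above no longer treats them as "already
fetched". The column family "mk" and qualifier "_ftcmrk_" are taken from
your dump; the table name and the assumption that the generate mark for
that batch is still present on those rows are mine, so please verify before
running this against a table you care about.

{{{
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.nutch.util.TableUtil;

/** Untested sketch: drop mk:_ftcmrk_ for every URL in failed-seeds.txt. */
public class ClearFetchMark {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "webpage"); // assumption: default name
    BufferedReader in = new BufferedReader(new FileReader("failed-seeds.txt"));
    try {
      String url;
      while ((url = in.readLine()) != null) {
        if (url.trim().isEmpty()) {
          continue;
        }
        // Row keys are reversed URLs, as in the dump above.
        Delete d = new Delete(Bytes.toBytes(TableUtil.reverseUrl(url.trim())));
        // With the fetch mark gone, FetcherJob's resume path should no
        // longer skip the row as "already fetched" for that batch.
        d.deleteColumns(Bytes.toBytes("mk"), Bytes.toBytes("_ftcmrk_"));
        table.delete(d);
      }
    } finally {
      in.close();
      table.close();
    }
  }
}
}}}

After that, re-running ./bin/nutch fetch 1364930286-844556485 -resume
should pick those pages up again, if I remember the fetcher's batch-id
checks correctly; if the generate mark has already been removed (e.g. by
updatedb), re-generating or reinjecting is the safer route, combined with
the larger fetcher.server.delay / fetcher.server.min.delay values that
feng lu suggested.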
