In addition to feng lu's suggestions, you can also try to reinject the records. An HBase query with a filter on HTTP status code 500 will give you the list of URLs with status code 500.
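To illustrate the idea, here is an in-memory Python sketch of that filter, not real HBase client code: the `f:prot` payload and the reversed row-key format are guesses based on the row dump later in this thread, and `unreverse_url` is a hypothetical stand-in for Nutch's `TableUtil.unreverseUrl`.

```python
# Sketch only: simulates scanning the webpage table for rows whose
# f:prot value mentions an HTTP 500, then turning the Nutch-style
# reversed row key back into a plain URL.

def unreverse_url(reversed_key):
    """Assumed key format: "<reversed-host>:<protocol>/<path>",
    e.g. "com.example.www:http/page" -> "http://www.example.com/page"."""
    host_part, rest = reversed_key.split(":", 1)
    protocol, _, path = rest.partition("/")
    host = ".".join(reversed(host_part.split(".")))
    return "%s://%s/%s" % (protocol, host, path)

def failed_urls(rows, marker="Http code=500"):
    """Collect URLs whose f:prot value contains the failure marker."""
    return [unreverse_url(key)
            for key, families in rows.items()
            if marker in families.get("f:prot", "")]

# Tiny in-memory stand-in for a scan result:
rows = {
    "com.example.www:http/ok":  {"f:prot": "success"},
    "com.example.www:http/bad": {"f:prot": "\x02nHttp code=500, url="},
}
print(failed_urls(rows))  # -> ['http://www.example.com/bad']
```

The resulting list is what you would feed back to the injector.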
Then you can simply reinject them, which will ask Nutch to crawl them again, if I am correct.

On Wed, Apr 10, 2013 at 12:08 PM, feng lu <[email protected]> wrote:

> You can set the fetcher.server.delay and fetcher.server.min.delay
> properties to larger values; maybe the crawl success rate will be higher.
> The failed pages will be re-fetched when their fetch time has come. You can
> refer to this:
> http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
>
>
> On Wed, Apr 10, 2013 at 3:16 AM, Tianwei Sheng <[email protected]> wrote:
>
> > Hi all,
> >
> > I used Nutch 2.1 + HBase to crawl one website. It seems that the remote
> > website may have some rate limit and will give me HTTP code 500
> > occasionally. I know that I probably need to tune the crawl parameters,
> > such as the various delays, etc. But given that I have crawled lots of
> > pages successfully and only about 10% of the pages failed, is there a way
> > to fetch only those failed pages incrementally?
> >
> > For interrupted jobs, I used the following command to resume:
> >
> > ./bin/nutch fetch 1364930286-844556485 -resume
> >
> > It will successfully resume the job and crawl the unfetched pages from
> > the previous failed job. I checked the code; in FetcherJob.java, it has:
> >
> > {{{
> > if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) {
> >   if (LOG.isDebugEnabled()) {
> >     LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; already fetched");
> >   }
> >   return;
> > }
> > }}}
> >
> > For the failed URLs in the HBase table, the row has:
> >
> > {{{
> > f:prot       timestamp=1365478335194, value=\x02nHttp code=500, url=
> > mk:_ftcmrk_  timestamp=1365478335194, value=1364930286-844556485
> > }}}
> >
> > It seems that the code will only check _ftcmrk_, regardless of whether
> > there is an "f:cnt" or not.
> >
> > So the question: does Nutch have some option or method for me to fetch
> > only those failed pages?
> >
> > Thanks a lot.
> >
> > Tianwei
> >
>
> --
> Don't Grow Old, Grow Up... :-)

--
Kiran Chitturi <http://www.linkedin.com/in/kiranchitturi>
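For what it's worth, the check quoted from FetcherJob.java boils down to the following Python sketch (not Nutch code; `should_skip` is a hypothetical helper, and the column names come from the row dump in the thread). It shows why `-resume` treats the 500-failed page as already fetched: only the fetch mark is consulted, never a success indicator such as "f:cnt".

```python
# Minimal sketch of: if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null)
def should_skip(page, resuming):
    """Skip a page on resume whenever its fetch mark is set."""
    return resuming and page.get("mk:_ftcmrk_") is not None

fetched_ok  = {"mk:_ftcmrk_": "1364930286-844556485", "f:cnt": "1"}
fetched_500 = {"mk:_ftcmrk_": "1364930286-844556485",
               "f:prot": "Http code=500"}   # failed fetch, no f:cnt
unfetched   = {}                            # no fetch mark yet

print(should_skip(fetched_ok, True))   # True: skipped, as expected
print(should_skip(fetched_500, True))  # True: also skipped, the problem
print(should_skip(unfetched, True))    # False: fetched on resume
```

That is why reinjection (or waiting for the scheduled re-fetch time) is needed for the failed pages: `-resume` alone will never pick them up.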

