Hi, Feng,

Thanks a lot for the suggestions. I have increased fetcher.server.delay
and the job seems OK now.

But due to our own crawling restrictions, we set the fetch time expiration
very large, around 6 months. This is mainly because we have a large pool of
websites and also want to give high priority to new URLs.
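
For reference, here is roughly what we now have in nutch-site.xml. This is
only a sketch with illustrative values; the property names
(fetcher.server.delay, fetcher.server.min.delay, db.fetch.interval.default,
db.fetch.interval.max) are the stock ones from nutch-default.xml:

{{{
<!-- Politeness: wait longer between requests to the same server. -->
<property>
  <name>fetcher.server.delay</name>
  <value>10.0</value>
</property>
<!-- Only applies when several fetcher threads hit the same queue. -->
<property>
  <name>fetcher.server.min.delay</name>
  <value>5.0</value>
</property>
<!-- Re-fetch interval: ~6 months, in seconds (our crawling restriction). -->
<property>
  <name>db.fetch.interval.default</name>
  <value>15552000</value>
</property>
<!-- Keep the max interval at least as large as the default. -->
<property>
  <name>db.fetch.interval.max</name>
  <value>15552000</value>
</property>
}}}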


On Wed, Apr 10, 2013 at 9:08 AM, feng lu <[email protected]> wrote:

> You can set the fetcher.server.delay and fetcher.server.min.delay
> properties a bit higher; the crawl success rate may then improve. The
> failed pages will be re-fetched when their fetch time comes. You can refer
> to this:
> http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
>
>
> On Wed, Apr 10, 2013 at 3:16 AM, Tianwei Sheng <[email protected]> wrote:
>
> > Hi, all,
> >
> > I used Nutch 2.1 + HBase to crawl one website. It seems that the remote
> > website may have some rate limit and will occasionally give me HTTP
> > code=500. I know that I probably need to tune the crawl parameters, such
> > as the various delay settings. But given that I have crawled lots of
> > pages successfully and only about 10% of the pages failed, is there a
> > way to fetch only those failed pages incrementally?
> >
> > For interrupted jobs, I used the following command to resume:
> >
> > ./bin/nutch fetch 1364930286-844556485 -resume
> >
> > It successfully resumes the job and crawls the unfetched pages from the
> > previous failed run. I checked the code; in FetcherJob.java, it has:
> >
> > {{{
> >       if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) {
> >         if (LOG.isDebugEnabled()) {
> >           LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; already fetched");
> >         }
> >         return;
> >       }
> > }}}
> >
> > For the failed URLs in the HBase table, the row has:
> > {{{
> > f:prot       timestamp=1365478335194, value=\x02nHttp code=500, url=
> > mk:_ftcmrk_  timestamp=1365478335194, value=1364930286-844556485
> > }}}
> >
> >
> > It seems that the code will only check _ftcmrk_, regardless of whether
> > an "f:cnt" column is present or not.
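> >
> > One workaround I am considering (just a rough, untested sketch; it
> > assumes the default "webpage" table name, the plain HBase client API
> > that ships with Nutch 2.1, and that matching on the f:prot value is a
> > reliable way to spot the failures) is to delete the fetch mark on the
> > failed rows and then re-run the fetch with -resume:
> >
> > {{{
> > import java.io.IOException;
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.hbase.HBaseConfiguration;
> > import org.apache.hadoop.hbase.client.Delete;
> > import org.apache.hadoop.hbase.client.HTable;
> > import org.apache.hadoop.hbase.client.Result;
> > import org.apache.hadoop.hbase.client.ResultScanner;
> > import org.apache.hadoop.hbase.client.Scan;
> > import org.apache.hadoop.hbase.util.Bytes;
> >
> > public class ClearFailedFetchMarks {
> >   public static void main(String[] args) throws IOException {
> >     Configuration conf = HBaseConfiguration.create();
> >     // "webpage" is the default Nutch 2.x table name; adjust if you
> >     // override the schema name.
> >     HTable table = new HTable(conf, "webpage");
> >     Scan scan = new Scan();
> >     scan.addColumn(Bytes.toBytes("f"), Bytes.toBytes("prot"));
> >     ResultScanner scanner = table.getScanner(scan);
> >     for (Result r : scanner) {
> >       byte[] prot = r.getValue(Bytes.toBytes("f"), Bytes.toBytes("prot"));
> >       // Assumption: a failed fetch leaves "Http code=500" in f:prot,
> >       // as seen in the row dump above.
> >       if (prot != null && Bytes.toString(prot).contains("Http code=500")) {
> >         Delete del = new Delete(r.getRow());
> >         // Remove the fetch mark so "nutch fetch <batchId> -resume"
> >         // no longer skips this row.
> >         del.deleteColumns(Bytes.toBytes("mk"), Bytes.toBytes("_ftcmrk_"));
> >         table.delete(del);
> >       }
> >     }
> >     scanner.close();
> >     table.close();
> >   }
> > }
> > }}}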
> >
> >
> > So the question: does Nutch already have some option or method for me
> > to fetch only those failed pages?
> >
> > Thanks a lot.
> >
> > Tianwei
> >
>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>
