You can set the fetcher.server.delay and fetcher.server.min.delay properties to larger values; with a longer delay between requests to the same server, the crawl success rate will likely be higher. The failed pages will be re-fetched once their fetch time comes around again. You can refer to this post on re-crawling with Nutch: http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
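For example, in conf/nutch-site.xml (the values below are only illustrative; tune them to whatever rate the remote site tolerates):

{{{
<!-- Illustrative values only: increase the per-host politeness delay. -->
<property>
  <name>fetcher.server.delay</name>
  <value>10.0</value>
  <description>Seconds the fetcher waits between successive requests
  to the same server.</description>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>5.0</value>
  <description>Minimum delay between requests to the same server; only
  applies when fetcher.threads.per.queue is greater than 1.</description>
</property>
}}}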
On Wed, Apr 10, 2013 at 3:16 AM, Tianwei Sheng <[email protected]> wrote:

> Hi, all,
>
> I used nutch 2.1 + HBase to crawl one website. It seems that the remote
> website may have some rate limit and will give me http code=500
> occasionally. I know that I probably need to tune the crawl parameters,
> such as the various delays, etc. But given that I have crawled lots of
> pages successfully and only have maybe 10% failed pages, is there a way
> to fetch only those failed pages incrementally?
>
> For interrupted jobs, I used the following command to resume:
>
> ./bin/nutch fetch 1364930286-844556485 -resume
>
> It successfully resumes the job and crawls the unfetched pages from the
> previously failed job. I checked the code; in FetcherJob.java, it has:
>
> {{{
> if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) {
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; already fetched");
>   }
>   return;
> }
> }}}
>
> For the failed urls in the hbase table, the row has:
>
> {{{
> f:prot        timestamp=1365478335194, value=\x02nHttp code=500, url=
> mk:_ftcmrk_   timestamp=1365478335194, value=1364930286-844556485
> }}}
>
> It seems that the code only checks _ftcmrk_, regardless of whether there
> is an "f:cnt" or not.
>
> So the question: does nutch have some option or method for me to fetch
> only those failed pages?
>
> Thanks a lot.
>
> Tianwei
>
--
Don't Grow Old, Grow Up... :-)
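Nutch 2.1 does not seem to have a switch that resumes only the rows whose last fetch failed; as noted above, the -resume path skips anything carrying _ftcmrk_. One possible workaround, if you don't mind patching FetcherJob, is to also look at the stored protocol status (the f:prot column) before skipping, so that rows such as the Http code=500 one above fall through and get fetched again. This is only an untested sketch: the getProtocolStatus()/getCode() accessors and the ProtocolStatusCodes.SUCCESS constant are assumed from the 2.1 storage classes, so double-check them against your source tree.

{{{
// Untested sketch for FetcherJob's mapper: on -resume, skip a row only if
// its fetch mark is set AND its last protocol status was a success.
if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) {
  ProtocolStatus pstatus = page.getProtocolStatus();  // backed by f:prot
  boolean lastFetchOk = pstatus != null
      && pstatus.getCode() == ProtocolStatusCodes.SUCCESS;
  if (lastFetchOk) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; already fetched");
    }
    return;
  }
  // Otherwise the mark is present but the last attempt failed (e.g. Http
  // code=500), so fall through and let this row be fetched again.
}
}}}

A simpler alternative, with no code change, is to wait for the failed pages' fetch time to expire and let the next generate/fetch cycle pick them up, as described in the re-crawl link above.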

