Hi Sheng,

I haven't tried this, but I have read about something similar on this mailing list.

Maybe you can do a test with a separate Nutch crawl and see how it works.
Are you using 1.x or 2.x?

On Wed, Apr 10, 2013 at 11:17 PM, Tianwei Sheng <[email protected]> wrote:

> Hi, Kiran,
>
> Yeah, that's what I want. We also use Pig, so I can just write a Pig script
> to get those URLs and inject them again into the table.
>
> By the way, are you sure that reinjecting a URL into an existing table with the
> same row key will force Nutch to recrawl it? Where can I find the
> documentation or code for this?
>
>
> On Wed, Apr 10, 2013 at 9:25 AM, kiran chitturi <[email protected]> wrote:
>
> > In addition to feng lu's suggestions,
> >
> > You can also try to reinject the records. An HBase query with a filter on
> > HTTP status code 500 will give you the list of URLs with status code 500.
> >
> > Then you can simply reinject them, which should make Nutch crawl them
> > again, if I am correct.
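> >
> > A rough sketch of the extraction in plain Java (just a sketch: it assumes
> > the default Gora table name 'webpage', matches the literal text
> > "Http code=500" inside the serialized f:prot value instead of using a
> > server-side filter, and prints the un-reversed URLs):
> >
> > {{{
> > import java.io.IOException;
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.hbase.HBaseConfiguration;
> > import org.apache.hadoop.hbase.client.HTable;
> > import org.apache.hadoop.hbase.client.Result;
> > import org.apache.hadoop.hbase.client.ResultScanner;
> > import org.apache.hadoop.hbase.client.Scan;
> > import org.apache.hadoop.hbase.util.Bytes;
> > import org.apache.nutch.util.TableUtil;
> >
> > public class FailedUrlDump {
> >   public static void main(String[] args) throws IOException {
> >     Configuration conf = HBaseConfiguration.create();
> >     // 'webpage' is the default table name for Nutch 2.x -- adjust if yours differs
> >     HTable table = new HTable(conf, "webpage");
> >     Scan scan = new Scan();
> >     // only read the protocol status column (f:prot)
> >     scan.addColumn(Bytes.toBytes("f"), Bytes.toBytes("prot"));
> >     ResultScanner scanner = table.getScanner(scan);
> >     try {
> >       for (Result r : scanner) {
> >         byte[] prot = r.getValue(Bytes.toBytes("f"), Bytes.toBytes("prot"));
> >         // crude client-side check: the serialized status contains the message text
> >         if (prot != null && Bytes.toString(prot).contains("Http code=500")) {
> >           // row keys are stored url-reversed, so un-reverse them before printing
> >           System.out.println(TableUtil.unreverseUrl(Bytes.toString(r.getRow())));
> >         }
> >       }
> >     } finally {
> >       scanner.close();
> >       table.close();
> >     }
> >   }
> > }
> > }}}
> >
> > You could then write the printed URLs to a seed directory and feed that to
> > bin/nutch inject.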
> >
> >
> > On Wed, Apr 10, 2013 at 12:08 PM, feng lu <[email protected]> wrote:
> >
> > > You can set the fetcher.server.delay and fetcher.server.min.delay properties
> > > to larger values; the crawl success rate may then be higher. The failed pages
> > > will be re-fetched when their fetch time comes. You can refer to this:
> > > http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
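> > >
> > > For example, in nutch-site.xml (the values here are only an illustration,
> > > tune them for your target site):
> > >
> > > {{{
> > > <property>
> > >   <name>fetcher.server.delay</name>
> > >   <value>10.0</value>
> > > </property>
> > > <property>
> > >   <name>fetcher.server.min.delay</name>
> > >   <value>5.0</value>
> > > </property>
> > > }}}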
> > >
> > >
> > > On Wed, Apr 10, 2013 at 3:16 AM, Tianwei Sheng <[email protected]> wrote:
> > >
> > > > Hi, all,
> > > >
> > > > I am using Nutch 2.1 + HBase to crawl one website. It seems that the remote
> > > > website may have some rate limit and will occasionally give me HTTP code=500.
> > > > I know that I probably need to tune the crawl parameters, such as the various
> > > > delays, etc. But given that I have already crawled lots of pages successfully
> > > > and may only have about 10% of such failed pages, is there a way to fetch
> > > > only those failed pages incrementally?
> > > >
> > > > For interrupted jobs, I used the following command to resume:
> > > >
> > > > ./bin/nutch fetch 1364930286-844556485 -resume
> > > >
> > > > It successfully resumes the job and crawls those unfetched pages from the
> > > > previous failed job. I checked the code; in FetcherJob.java, it has:
> > > >
> > > > {{{
> > > >       if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) {
> > > >         if (LOG.isDebugEnabled()) {
> > > >           LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; already fetched");
> > > >         }
> > > >         return;
> > > >       }
> > > > }}}
> > > >
> > > > For those failed URLs, the row in the HBase table has:
> > > > {{{
> > > > f:prot        timestamp=1365478335194, value=\x02nHttp code=500, url=
> > > > mk:_ftcmrk_   timestamp=1365478335194, value=1364930286-844556485
> > > > }}}
> > > >
> > > >
> > > > It seems that the code will only check _ftcmrk_, regardless of whether
> > > > there is an "f:cnt" or not.
> > > >
> > > >
> > > > So the question is: does Nutch have some option or method for me to
> > > > fetch only those failed pages?
> > > >
> > > > Thanks a lot.
> > > >
> > > > Tianwei
> > > >
> > >
> > >
> > >
> > > --
> > > Don't Grow Old, Grow Up... :-)
> > >
> >
> >
> >
> > --
> > Kiran Chitturi
> >
> > <http://www.linkedin.com/in/kiranchitturi>
> >
>



-- 
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>
