Hi, Kiran,

I just ran a test, and it works.  I also read InjectorJob.java a bit; it
seems that the injector resets the fetchTime and fetchInterval, so the
reinjected URLs may be recrawled.
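
Roughly, here is what I read (a paraphrased sketch from memory, not a
verbatim copy of the 2.x InjectorJob.UrlMapper; names follow the Gora
WebPage schema):

{{{
// Each injected URL is written as a WebPage row with a fresh fetch time
// and the default fetch interval, so re-injecting an existing row key
// appears to reset those fields and make the URL due for fetching again.
WebPage row = new WebPage();
row.setFetchTime(curTime);          // "now" => due for fetching immediately
row.setFetchInterval(interval);     // from db.fetch.interval.default
Mark.INJECT_MARK.putMark(row, new Utf8("y"));
context.write(TableUtil.reverseUrl(url), row);
}}}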

I used 2.x.

Tianwei


On Thu, Apr 11, 2013 at 6:25 AM, kiran chitturi
<[email protected]>wrote:

> Hi Sheng,
>
> I haven't tried this, but I have read something similar on this mailing
> list.
>
> Maybe you can do a test with a separate Nutch crawl and see how it works.
> Are you using 1.x or 2.x?
>
>
>
>
>
>
>
> On Wed, Apr 10, 2013 at 11:17 PM, Tianwei Sheng <[email protected]>wrote:
>
> > Hi, Kiran,
> >
> > Yeah, that's what I want. We also use Pig, so I can just write a Pig script
> > to get those URLs and inject them into the table again.
> >
> > Btw, are you sure that reinjecting a URL into an existing table with the
> > same row key will force Nutch to recrawl it?  Where can I find the
> > documentation or code for this?
> >
> >
> > On Wed, Apr 10, 2013 at 9:25 AM, kiran chitturi
> > <[email protected]>wrote:
> >
> > > In addition to feng lu's suggestions,
> > >
> > > You can also try to reinject the records. An HBase query with a filter
> > > on HTTP status code 500 will give you the list of URLs with status code
> > > 500.
> > >
> > > Then you can simply reinject them, which will ask Nutch to crawl them
> > > again, if I am correct.
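> > >
> > > If it helps, here is a rough, untested sketch of such a scan with the
> > > plain HBase client API. It assumes the default 'webpage' table name from
> > > gora-hbase-mapping.xml and simply substring-matches the serialized f:prot
> > > value (as in the row dump quoted below) instead of decoding the status:
> > >
> > > {{{
> > > import java.io.IOException;
> > > import org.apache.hadoop.conf.Configuration;
> > > import org.apache.hadoop.hbase.HBaseConfiguration;
> > > import org.apache.hadoop.hbase.client.HTable;
> > > import org.apache.hadoop.hbase.client.Result;
> > > import org.apache.hadoop.hbase.client.ResultScanner;
> > > import org.apache.hadoop.hbase.client.Scan;
> > > import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
> > > import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
> > > import org.apache.hadoop.hbase.filter.SubstringComparator;
> > > import org.apache.hadoop.hbase.util.Bytes;
> > >
> > > public class FailedUrlDump {
> > >   public static void main(String[] args) throws IOException {
> > >     Configuration conf = HBaseConfiguration.create();
> > >     HTable table = new HTable(conf, "webpage"); // default Nutch 2.x table
> > >     Scan scan = new Scan();
> > >     scan.addColumn(Bytes.toBytes("f"), Bytes.toBytes("prot"));
> > >     // Keep only rows whose protocol status mentions "code=500"; the value
> > >     // is a serialized ProtocolStatus, so a substring match is pragmatic.
> > >     SingleColumnValueFilter filter = new SingleColumnValueFilter(
> > >         Bytes.toBytes("f"), Bytes.toBytes("prot"),
> > >         CompareOp.EQUAL, new SubstringComparator("code=500"));
> > >     filter.setFilterIfMissing(true); // skip rows with no f:prot at all
> > >     scan.setFilter(filter);
> > >     ResultScanner scanner = table.getScanner(scan);
> > >     for (Result r : scanner) {
> > >       // Row keys are reversed URLs; unreverse them (TableUtil.unreverseUrl)
> > >       // before writing them out as a seed list.
> > >       System.out.println(Bytes.toString(r.getRow()));
> > >     }
> > >     scanner.close();
> > >     table.close();
> > >   }
> > > }
> > > }}}
> > >
> > > You could then drop the unreversed URLs into a seed directory and run
> > > bin/nutch inject on it, if I remember the 2.x syntax right.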
> > >
> > >
> > > On Wed, Apr 10, 2013 at 12:08 PM, feng lu <[email protected]>wrote:
> > >
> > > > You can set the fetcher.server.delay and fetcher.server.min.delay
> > > > properties to larger values; then the crawl success rate may be higher.
> > > > The failed pages will be re-fetched when their fetch time comes. You can
> > > > refer to this:
> > > > http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
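> > > >
> > > > For example, something like this in conf/nutch-site.xml (the values are
> > > > only illustrative; pick delays your target site tolerates):
> > > >
> > > > {{{
> > > > <!-- Delay (in seconds) between successive fetches from the same server. -->
> > > > <property>
> > > >   <name>fetcher.server.delay</name>
> > > >   <value>10.0</value>
> > > > </property>
> > > > <!-- Minimum delay kept when several fetcher threads share one host. -->
> > > > <property>
> > > >   <name>fetcher.server.min.delay</name>
> > > >   <value>5.0</value>
> > > > </property>
> > > > }}}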
> > > >
> > > >
> > > > On Wed, Apr 10, 2013 at 3:16 AM, Tianwei Sheng <[email protected]>wrote:
> > > >
> > > > > Hi, all,
> > > > >
> > > > > I used Nutch 2.1 + HBase to crawl one website. It seems that the
> > > > > remote website may have some rate limit and will occasionally give me
> > > > > HTTP code=500. I know that I probably need to tune the crawl
> > > > > parameters, such as the various delay settings, etc.  But given that I
> > > > > have crawled lots of pages successfully and only about 10% of the
> > > > > pages failed, is there a way to fetch only those failed pages
> > > > > incrementally?
> > > > >
> > > > > For interrupted jobs, I used the following command to resume:
> > > > >
> > > > > ./bin/nutch fetch 1364930286-844556485 -resume
> > > > >
> > > > > It successfully resumes the job and crawls the unfetched pages left
> > > > > over from the previously failed job. I checked the code; in
> > > > > FetcherJob.java, it has:
> > > > >
> > > > > {{{
> > > > >       if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) {
> > > > >         if (LOG.isDebugEnabled()) {
> > > > >           LOG.debug("Skipping " + TableUtil.unreverseUrl(key)
> > > > >               + "; already fetched");
> > > > >         }
> > > > >         return;
> > > > >       }
> > > > > }}}
> > > > >
> > > > > For those failed URLs in the HBase table, the row has:
> > > > > {{{
> > > > > f:prot        timestamp=1365478335194, value=\x02nHttp code=500, url=
> > > > > mk:_ftcmrk_   timestamp=1365478335194, value=1364930286-844556485
> > > > > }}}
> > > > >
> > > > >
> > > > > It seems that the code only checks _ftcmrk_, regardless of whether
> > > > > there is an "f:cnt" column or not.
> > > > >
> > > > >
> > > > > So the question is: does Nutch have some option or method for me to
> > > > > fetch only those failed pages?
> > > > >
> > > > > Thanks a lot.
> > > > >
> > > > > Tianwei
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Don't Grow Old, Grow Up... :-)
> > > >
> > >
> > >
> > >
> > > --
> > > Kiran Chitturi
> > >
> > > <http://www.linkedin.com/in/kiranchitturi>
> > >
> >
>
>
>
> --
> Kiran Chitturi
>
> <http://www.linkedin.com/in/kiranchitturi>
>
