Hi Alex,

I see two ways of doing it. I don't have the scripts right now, but I will
give you some pointers.

1) You can use the HBase shell or any HBase client and query HBase using
filters. In this case, ValueFilter can be used to check for an exact value
in the column [0]. (See the sketch just after this list.)

2) The second is to write a Pig script which reads the data from HBase
(the url and http status code fields), filters the records based on the
status code, and stores the output wherever you want [1]. (See the Pig
sketch further below.)
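
For 1), a rough, untested Java sketch of that kind of scan is below. It
assumes the default Nutch 2.x Gora table name "webpage" and the "f:prot"
column that shows up in Tianwei's row dump further down the thread; adjust
the names to your gora-hbase-mapping.xml and your HBase client version.

{{{
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FindFailedUrls {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // "webpage" is the default Nutch 2.x table name; change it if your
    // Gora mapping uses something else.
    HTable table = new HTable(conf, "webpage");
    try {
      Scan scan = new Scan();
      // Only pull the protocol status column (family "f", qualifier "prot").
      scan.addColumn(Bytes.toBytes("f"), Bytes.toBytes("prot"));
      // Keep only cells whose value contains the failed-fetch marker.
      scan.setFilter(new ValueFilter(CompareFilter.CompareOp.EQUAL,
          new SubstringComparator("Http code=500")));
      ResultScanner scanner = table.getScanner(scan);
      for (Result result : scanner) {
        // Row keys are reversed urls in Nutch; you can unreverse them with
        // Nutch's TableUtil.unreverseUrl() if you need the original urls.
        System.out.println(Bytes.toString(result.getRow()));
      }
      scanner.close();
    } finally {
      table.close();
    }
  }
}
}}}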

I would suggest going with Pig if you are operating on a cluster; it is
easier to write Pig scripts for these kinds of jobs than to write a
MapReduce job.
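
For 2), a minimal Pig Latin sketch, again untested: it assumes Pig's
HBaseStorage loader, the same "webpage" table and "f:prot" column as
above, and a placeholder output path.

{{{
-- Load the row key (the reversed url) plus the protocol status column.
pages  = LOAD 'hbase://webpage'
         USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('f:prot', '-loadKey true')
         AS (url:chararray, prot:chararray);

-- Keep only the rows whose protocol status mentions Http code=500.
failed = FILTER pages BY prot MATCHES '.*Http code=500.*';

-- Store the matching urls wherever you want.
STORE failed INTO 'failed_urls';
}}}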

[0] http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ValueFilter.html
[1] http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#FILTER

Hope this helps.




On Wed, Apr 10, 2013 at 1:24 PM, <[email protected]> wrote:

> Hi,
>
> ==
>
> An HBase query with a filter on
> http status code 500 will give you the list of urls with status code 500.
> ==
> Could you please let me know how to do this? I was trying to get an answer
> for this kind of selection on the HBase mailing list, without success.
>
> Thanks.
> Alex.
>
>
>
>
>
>
>
> -----Original Message-----
> From: kiran chitturi <[email protected]>
> To: user <[email protected]>
> Sent: Wed, Apr 10, 2013 9:25 am
> Subject: Re: Only recrawl the pages with http code=500
>
>
> In addition to feng lu's suggestions,
>
> You can also try to reinject the records. An HBase query with a filter on
> http status code 500 will give you the list of urls with status code 500.
>
> Then you can simply reinject them, which, if I am correct, will make Nutch
> crawl them again.
>
>
> On Wed, Apr 10, 2013 at 12:08 PM, feng lu <[email protected]> wrote:
>
> > you can set the fetcher.server.delay and fetcher.server.min.delay
> > properties to bigger values; the crawl success rate will probably be
> > higher. The failed pages will be re-fetched when their fetch time comes.
> > You can refer to this:
> > http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
> >
> >
> > On Wed, Apr 10, 2013 at 3:16 AM, Tianwei Sheng <[email protected]
> > >wrote:
> >
> > > Hi, all,
> > >
> > > I used Nutch 2.1 + HBase to crawl one website. It seems that the
> > > remote website may have some rate limit and will occasionally give me
> > > http code=500. I know that I probably need to tune the crawl
> > > parameters, such as the various delays, etc. But given that I have
> > > crawled lots of pages successfully and may have only 10% of such
> > > failed pages, is there a way to fetch only those failed pages
> > > incrementally?
> > >
> > > For interrupted jobs, I used the following command to resume:
> > >
> > > ./bin/nutch fetch 1364930286-844556485 -resume
> > >
> > > It will successfully resume the job and crawl the unfetched pages
> > > from the previous failed job. I checked the code; in FetcherJob.java,
> > > it has:
> > >
> > > {{{
> > >       if (shouldContinue && Mark.FETCH_MARK.checkMark(page) != null) {
> > >         if (LOG.isDebugEnabled()) {
> > >           LOG.debug("Skipping " + TableUtil.unreverseUrl(key)
> > >               + "; already fetched");
> > >         }
> > >         return;
> > >       }
> > > }}}
> > >
> > > For those failed urls in the HBase table, the row has:
> > > {{{
> > > f:prot       timestamp=1365478335194, value=\x02nHttp code=500, url=
> > > mk:_ftcmrk_  timestamp=1365478335194, value=1364930286-844556485
> > > }}}
> > >
> > >
> > > It seems that the code will only check _ftcmrk_, regardless of whether
> > > there is an "f:cnt" or not.
> > >
> > >
> > > So the question is: does Nutch have some option or method for me to
> > > fetch only those failed pages?
> > >
> > > Thanks a lot.
> > >
> > > Tianwei
> > >
> >
> >
> >
> > --
> > Don't Grow Old, Grow Up... :-)
> >
>
>
>
> --
> Kiran Chitturi
>
> <http://www.linkedin.com/in/kiranchitturi>
>
>
>


-- 
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>
