Nutch already uses it in the Fetcher. It outputs counters like those below in the Hadoop GUI, and on stdout after each job has finished.
Counter                                 Map  Reduce           Total
exception                            30,154       0          30,154
access_denied                           380       0             380
gone                                  3,159       0           3,159
moved                                18,601       0          18,601
robots_denied                         7,889       0           7,889
robots_denied_maxcrawldelay             167       0             167
hitByThrougputThreshold                   5       0               5
bytes_downloaded             24,012,066,657       0  24,012,066,657
hitByTimeLimit                        3,020       0           3,020
notmodified                          30,223       0          30,223
temp_moved                           21,653       0          21,653
success                             433,955       0         433,955
notfound                             23,384       0          23,384

On Monday 28 November 2011 15:09:49 Bai Shen wrote:
> We looked at the hadoop reporter and aren't sure how to access it with
> nutch. Is there a certain way it works? Can you give me an example?
> Thanks.
>
> On Mon, Nov 21, 2011 at 3:11 PM, Markus Jelsma <[email protected]> wrote:
>>> On Thu, Nov 10, 2011 at 3:32 PM, Markus Jelsma <[email protected]> wrote:
>>>>> Interesting. How do you tell if the segments have been fetched, etc?
>>>>
>>>> After a job the shell script waits for its completion and return code.
>>>> If it returns 0 all is fine and we move it to another queue. If != 0
>>>> then there's an error and it reports via mail.
>>>
>>> Ah, okay. I didn't realize it was returning an error code. How do you
>>> know if there are any urls that had problems?
>>
>> Hadoop reporter shows statistics. There are always many errors for many
>> reasons. This is normal because we crawl everything.
>>
>>> How are you running Hadoop reporter?
>>
>> You'll get it for free when operating a Hadoop cluster.
>>
>>> Or fetch jobs that errored out, etc.
>>
>> The non-zero return code.
>>
>> --
>> Markus Jelsma - CTO - Openindex
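The queue-driven workflow described in the quoted thread (run a job, wait for its exit code, advance the segment on 0, report otherwise) can be sketched roughly like this. The function name, the messages, and the queue/mail steps are illustrative assumptions, not part of Nutch; a real script would wrap something like `bin/nutch fetch <segment>` and pipe a report into `mail`:

```shell
#!/bin/sh
# Sketch of the crawl driver described above: run each job, wait for its
# exit code, and either advance the segment or flag an error.
# run_step and its messages are illustrative, not Nutch conventions.

run_step() {
  # Run the given command (e.g. bin/nutch fetch <segment>) and act on
  # its exit code: 0 means success, anything else is treated as an error.
  if "$@"; then
    echo "step ok: moving segment to the next queue"
  else
    echo "step failed (exit $?): reporting via mail" >&2
    return 1
  fi
}
```

In the real setup the success branch would `mv` the segment directory into the next queue, and the failure branch would send the report by mail, matching the "non-zero return code" check mentioned above.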

