> Interesting.  How do you tell if the segments have been fetched, etc?

After starting a job the shell script waits for its completion and checks the 
return code. If it returns 0 all is fine and we move the segment to another 
queue. If it is non-zero there's an error and the script reports it via mail.
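
A minimal sketch of such a wrapper, assuming one invocation per segment; the 
queue directory, the mail address and the use of a local filesystem are made 
up for illustration (with segments on HDFS the move would be 'hadoop fs -mv'):

  #!/bin/bash
  # Fetch one segment, then either queue it for updatedb or report the failure.
  SEGMENT=$1                       # segment handed to us by the cron job
  QUEUE_DIR=/data/queue/updatedb   # hypothetical "done" queue

  bin/nutch fetch "$SEGMENT"
  RC=$?

  if [ $RC -eq 0 ]; then
      # all is fine: hand the segment over to the next stage
      mv "$SEGMENT" "$QUEUE_DIR/"
  else
      # non-zero return code: report the error by mail
      echo "fetch of $SEGMENT failed with code $RC" \
          | mail -s "nutch fetch failed" [email protected]
  fi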

> How
> do you know if there are any urls that had problems?

The Hadoop reporter shows statistics. There are always many errors, for many 
reasons; this is normal because we crawl everything.
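
If you don't want to dig through the job tracker UI, the same counters are 
printed at the end of the job's console output, so the wrapper can simply keep 
that around; a rough sketch (the log file name is just an example, and the 
exact counter labels depend on your Nutch/Hadoop version):

  # capture the job output so the counter dump at the end is kept
  bin/nutch fetch "$SEGMENT" > "$SEGMENT.fetch.log" 2>&1
  RC=$?
  # the per-status numbers (successes, failures, etc.) follow the 'Counters:' line
  grep -A 30 'Counters:' "$SEGMENT.fetch.log"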

> Or fetch jobs that
> errored out, etc.

The non-zero return code.
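
To make the cycle from my earlier mail (quoted below) a bit more concrete, 
here is a rough sketch of one pass of the loop; all directory names are 
invented, and whether you need the separate parse step depends on whether 
fetcher.parse is enabled:

  #!/bin/bash
  # One pass of the generate/fetch/updatedb circus described below.
  CRAWLDB=crawl/crawldb
  SEGMENTS=crawl/segments      # freshly generated segments
  FETCHED=crawl/fetched        # queue of fetched segments awaiting updatedb

  # fetch (and parse) whatever the generator produced, then queue it
  for seg in "$SEGMENTS"/*; do
      [ -d "$seg" ] || continue
      bin/nutch fetch "$seg" || exit 1
      bin/nutch parse "$seg" || exit 1    # skip if the fetcher already parses
      mv "$seg" "$FETCHED/"
  done

  # once everything is fetched, fold the results back into the crawldb
  if [ -n "$(ls -A "$FETCHED" 2>/dev/null)" ]; then
      bin/nutch updatedb "$CRAWLDB" "$FETCHED"/* || exit 1
  fi

  # trigger the generator again and the whole circus repeats
  # (check 'bin/nutch generate' usage for the option to create several
  # segments per run in your version)
  bin/nutch generate "$CRAWLDB" "$SEGMENTS"

In practice each step is its own cron-triggered script watching the queue 
directories, so the stages run independently rather than in one big loop.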

> 
> On Thu, Nov 10, 2011 at 2:01 PM, Markus Jelsma <[email protected]> wrote:
> > I prefer a suite of shell scripts and cron jobs. We simply generate many
> > segments at once, have a cron job checking for available segments we can
> > fetch and fetch them. If all are fetched, the segments are moved to a queue
> > directory for updating the DB. Once the DB has been updated the generators
> > are triggered and the whole circus repeats.
> > 
> > > I've done some searching on this, but haven't found any real solutions.
> > > Is there an existing way to do a continuous crawl using Nutch?  I know I
> > > can use the bin/nutch crawl command, but that stops after a certain
> > > number of iterations.
> > > 
> > > Right now I'm working on a java class to do it, but I would assume it's
> > > a problem that's been solved already.  Unfortunately I can't seem to
> > > find any evidence of this.
> > > 
> > > Thanks.
