Interesting.  How do you tell whether the segments have been fetched?  How
do you know if any URLs had problems, or if fetch jobs errored out?
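
For example, I can poke at it from the shell like this (a rough sketch; the
crawl/ paths are made up and I'm assuming the standard Nutch 1.x segment
layout and the stock readseg/readdb tools), but I'm hoping there's
something more systematic:

  # A generated-but-unfetched segment only contains crawl_generate;
  # crawl_fetch shows up once the fetcher has run over it.
  SEG=crawl/segments/20111110140100        # hypothetical segment path
  [ -d "$SEG/crawl_fetch" ] && echo "fetched" || echo "not fetched yet"

  # Per-segment counts (generated vs. fetched vs. parsed):
  bin/nutch readseg -list "$SEG"

  # Aggregate URL status across the whole crawldb
  # (db_fetched, db_gone, db_unfetched, retry counts, ...):
  bin/nutch readdb crawl/crawldb -stats

  # Individual fetch failures are logged by the fetcher, so grepping
  # logs/hadoop.log for "failed" catches most of them.
  grep -i "failed" logs/hadoop.log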

On Thu, Nov 10, 2011 at 2:01 PM, Markus Jelsma
<[email protected]>wrote:

> I prefer a suite of shell scripts and cron jobs. We simply generate many
> segments at once, have a cron job checking for available segments we can
> fetch, and fetch them. If all are fetched, the segments are moved to a
> queue directory for updating the DB. Once the DB has been updated, the
> generators are triggered and the whole circus repeats.
>
>
> > I've done some searching on this, but haven't found any real solutions.
> > Is there an existing way to do a continuous crawl using Nutch?  I know
> > I can use the bin/nutch crawl command, but that stops after a certain
> > number of iterations.
> >
> > Right now I'm working on a java class to do it, but I would assume it's
> > a problem that's been solved already.  Unfortunately I can't seem to
> > find any evidence of this.
> >
> > Thanks.
>
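
A loop along the lines Markus describes might look roughly like this -- a
sketch only, with made-up directory names (crawl/, queue/, done/) and the
stock bin/nutch commands:

  #!/bin/bash
  # Cron job A: fetch any segment that was generated but not fetched yet,
  # then hand it off to a queue directory for the updatedb step.
  CRAWLDB=crawl/crawldb
  SEGMENTS=crawl/segments
  QUEUE=crawl/queue

  for seg in "$SEGMENTS"/*; do
    [ -d "$seg/crawl_fetch" ] && continue          # already fetched
    bin/nutch fetch "$seg" || { echo "fetch failed: $seg" >&2; continue; }
    bin/nutch parse "$seg"                         # only if fetcher.parse=false
    mv "$seg" "$QUEUE/"
  done

  # Cron job B: once segments are queued, update the DB, archive the
  # segments, and trigger the generator again for the next round.
  if ls "$QUEUE"/* >/dev/null 2>&1; then
    bin/nutch updatedb "$CRAWLDB" "$QUEUE"/*
    mkdir -p crawl/done && mv "$QUEUE"/* crawl/done/
    bin/nutch generate "$CRAWLDB" "$SEGMENTS" -topN 50000 -maxNumSegments 4
  fi

In practice you'd also want a lock (flock or a pidfile) around each job so
overlapping cron runs don't fight over the same segment.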
