> Interesting. How do you tell if the segments have been fetched, etc?
After a job, the shell script waits for its completion and return code. If it
returns 0 all is fine and we move the segment to another queue. If it returns
non-zero there's an error and the script reports it via mail.

> How do you know if there are any urls that had problems?

The Hadoop reporter shows statistics. There are always many errors, for many
reasons. This is normal because we crawl everything.

> Or fetch jobs that errored out, etc.

The non-zero return code. A rough sketch of such a wrapper script is at the
bottom of this mail.

> > On Thu, Nov 10, 2011 at 2:01 PM, Markus Jelsma
> > <[email protected]> wrote:
> >
> > I prefer a suite of shell scripts and cron jobs. We simply generate many
> > segments at once, have a cron job checking for available segments we can
> > fetch and fetch them. If all are fetched, the segments are moved to a
> > queue directory for updating the DB. Once the DB has been updated the
> > generators are triggered and the whole circus repeats.
> >
> > > I've done some searching on this, but haven't found any real solutions.
> > > Is there an existing way to do a continuous crawl using Nutch? I know I
> > > can use the bin/nutch crawl command, but that stops after a certain
> > > number of iterations.
> > >
> > > Right now I'm working on a java class to do it, but I would assume it's
> > > a problem that's been solved already. Unfortunately I can't seem to
> > > find any evidence of this.
> > >
> > > Thanks.
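
For what it's worth, here is a minimal sketch of what the fetch step of such a
wrapper can look like. The directory names, crawl db layout and mail address
are made-up placeholders for illustration, not our actual setup; only the
bin/nutch calls are real Nutch commands.

    #!/bin/bash
    # Rough sketch of the fetch step, run from cron. Directory layout and
    # the alert address are placeholders for illustration only.

    NUTCH=bin/nutch
    READY_DIR=crawl/segments_ready      # generated segments waiting to be fetched
    QUEUE_DIR=crawl/segments_fetched    # fetched segments queued for updatedb
    ALERT=ops@example.org               # where error mails go

    for SEGMENT in "$READY_DIR"/*; do
        [ -d "$SEGMENT" ] || continue

        # Run the fetch job and wait for its return code.
        "$NUTCH" fetch "$SEGMENT"
        RC=$?

        if [ "$RC" -eq 0 ]; then
            # 0 means the fetch job finished fine: move the segment to the
            # queue directory so the updatedb step can pick it up later.
            mv "$SEGMENT" "$QUEUE_DIR"/
        else
            # Anything else is an error: report it by mail and leave the
            # segment where it is for inspection.
            echo "fetch failed (exit $RC) for $SEGMENT" \
                | mail -s "nutch fetch error" "$ALERT"
        fi
    done

A second script, also on cron, does the same kind of loop over the queue
directory: run bin/nutch updatedb for each fetched segment, then trigger
bin/nutch generate again, and the whole circus repeats.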

