NutchGora continuous indexing

Daniel Rosher Thu, 08 Mar 2012 02:33:44 -0800

Hi,

We want to have continuous indexing with NutchGora and are wondering what
implementation others might already use?


Our current thinking is along these lines:

A helper script (Perl) starts the Crawler. The helper script runs
frequently from cron and has 2 main tasks:

- Overruns are avoided by use of a pid file, so only 1 Crawler runs at a
  time.
- Failures are recorded with a lock file, which stops subsequent runs until
  the problem is resolved.

The custom Crawler records the batch-id in the log so we can resolve the
issue and run the batch manually before removing the lock file.
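
To make the pid file / lock file idea concrete, here is a rough sketch of
the wrapper logic. It is written in Python rather than Perl purely for
illustration, and the file paths, the run_once name and the returned status
strings are all made up for this example. It assumes a Unix host (cron,
os.kill with signal 0), and the check-then-write on the pid file is not
atomic, which is probably acceptable when cron runs are spaced apart.

```python
import os
import subprocess

# Hypothetical locations; adjust to wherever the crawler keeps its state.
PID_FILE = "/tmp/crawler.pid"
LOCK_FILE = "/tmp/crawler.lock"


def pid_is_running(pid):
    """Return True if a process with this pid exists (Unix only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without killing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def run_once(command, pid_file=PID_FILE, lock_file=LOCK_FILE):
    """Run the crawler command unless a lock or a live pid file blocks it.

    Returns one of "locked", "busy", "ok" or "failed".
    """
    # A lock file means an earlier run failed and nobody has resolved it yet.
    if os.path.exists(lock_file):
        return "locked"

    # A pid file that points at a live process means a crawl is still running.
    if os.path.exists(pid_file):
        try:
            with open(pid_file) as f:
                pid = int(f.read().strip())
        except ValueError:
            pid = None  # unreadable pid file, treat it as stale
        if pid is not None and pid_is_running(pid):
            return "busy"

    # Record the wrapper's own pid for the duration of the crawl.
    with open(pid_file, "w") as f:
        f.write(str(os.getpid()))
    try:
        result = subprocess.run(command)
        if result.returncode != 0:
            # Leave a lock file behind so later cron runs do nothing.
            with open(lock_file, "w") as f:
                f.write("crawler exited with status %d\n" % result.returncode)
            return "failed"
        return "ok"
    finally:
        os.remove(pid_file)
```

Cron would call run_once with the actual crawler command line on each
tick. Once the failed batch has been re-run by hand, deleting the lock file
lets the scheduled runs resume.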

One thing that might also be useful in NutchGora is something like a
batch-prefix: e.g. convert a date to epoch seconds, then, for example,
NutchJob.shouldProcess can compare current marks with the prefix to index
all records since that date. Any thoughts on this?
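
A small sketch of the comparison I have in mind, in Python for brevity
rather than as a patch against NutchJob. It assumes that a mark is a
batch-id that starts with epoch seconds followed by a hyphen (e.g.
"1331202824-12345"); that format, and the function names below, are
assumptions for illustration and not the existing Nutch API.

```python
import re
from datetime import datetime, timezone


def date_to_prefix(date_str):
    """Convert a UTC date such as '2012-03-08' to epoch seconds."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def batch_epoch(mark):
    """Return the leading epoch seconds of a mark, or None if it has none."""
    if mark is None:
        return None
    match = re.match(r"(\d+)-", mark, flags=re.ASCII)
    return int(match.group(1)) if match else None


def should_process_since(mark, since_epoch):
    """True if the mark's batch was generated at or after since_epoch."""
    epoch = batch_epoch(mark)
    return epoch is not None and epoch >= since_epoch


since = date_to_prefix("2012-03-08")
print(should_process_since("1331202824-12345", since))  # batch from 8 Mar 2012
print(should_process_since("1330000000-777", since))    # batch from Feb 2012
```

Rows with no mark, or with a mark that does not start with digits and a
hyphen, are skipped in this sketch; whether that is the right default is
open to discussion.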

Cheers,
Dan
