Re: NutchGora continuous indexing

Mathijs Homminga Thu, 08 Mar 2012 06:57:49 -0800

Hi Dan,

Are you using Solr? What exactly do you mean by 'continuous'?


Perhaps this relates to your thoughts: 
We thought about adding an optional preliminary indexing step directly after the 
fetch/parse step, that is, within the FetcherReducer.
In this indexing step, we output newly fetched documents directly to Solr as 
soon as they are parsed (and of course also persist them through Gora in e.g. 
HBase). 

The idea here is that you can have a first version of your page in the index 
almost immediately.
The trade-off is that you index the page before a DbUpdate, which means that 
you cannot use (up-to-date) link information while indexing. However, you might 
still be able to run a DbUpdate job + Indexing job later to overwrite these 
preliminary documents with the real deal.

Also, if you mark the preliminary documents in the index, you can choose to 
include/exclude them in your search. Or at least see that these results are 
preliminary.
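
To make the idea concrete, here is a minimal sketch in Python rather than 
actual Nutch or Solr code. The in-memory index, the `preliminary` field and the 
`anchors` field are all assumptions made up for illustration. It only shows the 
flow: index a first version right after parsing, optionally hide it at search 
time, and overwrite it by id once link information is available.

```python
class InMemoryIndex:
    """Stand-in for a search index keyed by document id (illustrative only)."""

    def __init__(self):
        self.docs = {}

    def add(self, doc):
        # Adding a document with an existing id overwrites the old version.
        self.docs[doc["id"]] = dict(doc)

    def search(self, term, include_preliminary=True):
        # Return documents whose text contains the term, optionally
        # leaving out those still marked as preliminary.
        hits = []
        for doc in self.docs.values():
            if term not in doc["text"]:
                continue
            if doc["preliminary"] and not include_preliminary:
                continue
            hits.append(doc)
        return hits


def preliminary_doc(url, text):
    # Built right after fetch/parse: no link information is available yet.
    return {"id": url, "text": text, "anchors": [], "preliminary": True}


def full_doc(url, text, anchors):
    # Built after the DbUpdate: link information (here, anchor texts) is known.
    return {"id": url, "text": text, "anchors": list(anchors), "preliminary": False}
```

With a real Solr index the same effect would presumably come from a boolean 
field on the document plus a filter on that field at query time, but the exact 
field name and schema are up to you.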

Mathijs





> Hi,
> 
> We want to have continuous indexing with NutchGora and are wondering what
> implementation others might already use?
> 
> Our current thinking is along these lines:
> 
> A helper script (Perl) to start the Crawler. The helper script runs often
> from cron and has 2 main tasks:
> 
> - Overruns are avoided by use of a pid file, so only 1 Crawler runs at a
>   time.
> - Failures are recorded with a lock file; this stops subsequent runs until
>   the problem is resolved.
> 
> The custom Crawler records the batch-id in the log so we can resolve the
> issue and run manually before removing the lockfile.
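
A rough sketch of such a cron guard, written in Python instead of Perl purely 
for illustration. The function names and return values are made up, the 
liveness check via `os.kill(pid, 0)` is POSIX-only, and the small race between 
checking and writing the pid file is ignored here.

```python
import os
import subprocess
import sys
from pathlib import Path


def pid_is_running(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def run_guarded(command, pid_file, lock_file):
    """Run command unless a failure lock or a live pid file forbids it."""
    if lock_file.exists():
        return "blocked"  # an earlier failure has not been resolved yet
    if pid_file.exists():
        try:
            pid = int(pid_file.read_text().strip())
        except ValueError:
            pid = None  # unreadable pid file is treated as stale
        if pid is not None and pid_is_running(pid):
            return "running"  # avoid overrunning the previous crawl
    pid_file.write_text(str(os.getpid()))
    try:
        result = subprocess.run(command)
    finally:
        pid_file.unlink(missing_ok=True)
    if result.returncode != 0:
        # Record the failure; later runs stay blocked until this is removed.
        lock_file.write_text(f"exit code {result.returncode}\n")
        return "failed"
    return "ok"
```

From cron you would call something like 
`run_guarded(["/path/to/your-crawl-script"], Path("crawler.pid"), Path("crawler.lock"))`, 
where the script path is a placeholder for whatever starts your Crawler.
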
> 
> One thing that might also be useful in NutchGora is something like a
> batch-prefix, e.g. convert a date to epoch seconds; then, e.g.,
> NutchJob.shouldProcess can compare current marks with the prefix to index
> all records since that date, for example. Any thoughts on this?
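
A small sketch of how such a prefix comparison could look, in Python and not 
as a patch to NutchJob. It assumes that a mark starts with epoch seconds 
followed by a hyphen (for example `1331218669-12345`); that format and both 
function names are assumptions for illustration only.

```python
from datetime import datetime, timezone


def date_to_epoch_secs(date_str):
    """Convert a UTC date such as '2012-03-08' to epoch seconds."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def should_process_since(mark, since_epoch_secs):
    """Hypothetical check: True when the mark's epoch-seconds prefix is at or
    after the cut-off, False for missing or unparsable marks."""
    if mark is None:
        return False
    prefix = mark.split("-", 1)[0]
    try:
        batch_secs = int(prefix)
    except ValueError:
        return False
    return batch_secs >= since_epoch_secs
```

For example, with a cut-off of `date_to_epoch_secs("2012-03-08")`, a mark of 
`1331218669-12345` would be selected and a mark of `1331000000-1` would not.
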
> 
> Cheers,
> Dan
>
