>
> I am using Nutch for a search engine. I cannot index webpages until the
> entire crawling process has ended, but I would like a quicker update
> cycle: the data crawled in the first few rounds should be added to the
> index even if the entire crawl process is not over yet.
> 1. Any good ideas?
> 2. If I run the indexing operation after every crawl depth, it wastes a
> lot of time, because the current solution rebuilds the whole index. Is
> it possible to index incrementally?
>

It won't rebuild the whole index if you specify which segments to use. Just
write a script instead of using the all-in-one crawl command and index only
the last segment. Avoiding the Crawl command is good practice anyway.

Basically you want to have rounds of: generate - fetch - parse - update -
invert - index.
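Something along these lines, as a rough sketch only: the crawl directory,
the Solr URL and the exact solrindex arguments below are placeholders and
vary between Nutch 1.x versions, so check the usage messages of your
bin/nutch build before relying on it.

#!/bin/bash
# One crawl round, indexing only the newest segment (sketch, Nutch 1.x)

CRAWL_DIR=crawl
SOLR_URL=http://localhost:8983/solr

# Generate a new fetch list from the crawldb
bin/nutch generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments

# Pick up the segment that was just created (the newest one)
SEGMENT=$CRAWL_DIR/segments/$(ls -t $CRAWL_DIR/segments | head -1)

# Fetch, then parse the fetched content
# (skip the parse step if your fetcher is configured to parse on the fly)
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT

# Update the crawldb with the results of this round
bin/nutch updatedb $CRAWL_DIR/crawldb $SEGMENT

# Invert links into the linkdb
bin/nutch invertlinks $CRAWL_DIR/linkdb -dir $CRAWL_DIR/segments

# Index only the last segment instead of rebuilding everything.
# Note: newer 1.x releases take the linkdb via a -linkdb flag instead of
# positionally; adjust to your version.
bin/nutch solrindex $SOLR_URL $CRAWL_DIR/crawldb $CRAWL_DIR/linkdb $SEGMENT

Run the round in a loop for as many depths as you need; since only the
newest segment is indexed each time, documents from earlier rounds become
searchable without rebuilding the whole index.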

BTW there is a JIRA issue open for adding a generic shell script to replace
the Crawl command: https://issues.apache.org/jira/browse/NUTCH-1087.

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
