> I am using Nutch for a search engine. I cannot index webpages until the
> entire crawling process has ended, but I would like a quick update
> operation: the data crawled in the earlier rounds should be added to the
> index even if the entire crawl process is not over yet.
> 1. Any good ideas?
> 2. If I do the indexing operation after every crawl depth, it will waste a
> lot of time, because the current solution rebuilds the whole index. Is
> it possible to index incrementally?
It won't rebuild the whole index if you specify which segments to use. Just
write a script instead of using the all-in-one Crawl command, and index only
the last segment. Avoiding the Crawl command is a good practice anyway.
Basically you want rounds of: generate - fetch - parse - update - invert -
index (see the sketch below).

BTW there is a JIRA open for adding a generic shell script to replace the
Crawl command: https://issues.apache.org/jira/browse/NUTCH-1087
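A minimal sketch of one such round for Nutch 1.x (the crawl/ paths, the -topN
value and the Solr URL are placeholders, and the exact solrindex arguments
vary between Nutch versions, so check the usage output of bin/nutch first):

#!/bin/bash
# one generate - fetch - parse - update - invert - index round,
# indexing only the newest segment instead of rebuilding the index

CRAWLDB=crawl/crawldb
LINKDB=crawl/linkdb
SEGMENTS=crawl/segments
SOLR_URL=http://localhost:8983/solr/

# generate a new segment from the crawldb
bin/nutch generate $CRAWLDB $SEGMENTS -topN 1000

# pick up the segment that was just created (the newest directory)
SEGMENT=`ls -d $SEGMENTS/2* | tail -1`

# fetch and parse only that segment
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT

# update the crawldb with what was fetched
bin/nutch updatedb $CRAWLDB $SEGMENT

# invert links for the new segment
bin/nutch invertlinks $LINKDB $SEGMENT

# index just the last segment
bin/nutch solrindex $SOLR_URL $CRAWLDB -linkdb $LINKDB $SEGMENT

Run the script in a loop (once per crawl depth) and the index gets updated
after every round without being rebuilt from scratch.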
Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble