We have done that too. The biggest problems are not having a reliable
lastModified date (and, indeed, inlinks) and not knowing whether the document
has changed. The inlink problem can be solved with the new Solr update
semantics, where partial updates are possible.
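For the inlink case, a Solr 4 atomic update can overwrite just the inlinks
field without resubmitting the whole document. Below is a minimal SolrJ
sketch; the core URL and the "id"/"inlinks" field names are assumptions, not
something from the Nutch schema per se:

  import java.util.Collections;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class PartialInlinkUpdate {
      public static void main(String[] args) throws Exception {
          // Hypothetical Solr core URL; adjust to your deployment.
          HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/nutch");

          // Identify the existing document by its unique key (here: "id" = page URL).
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "http://example.com/page.html");

          // Atomic update: a Map value with a "set" key tells Solr to replace
          // only this field, leaving the rest of the stored document untouched.
          doc.addField("inlinks",
              Collections.singletonMap("set", "http://example.com/referrer.html"));

          solr.add(doc);
          solr.commit();
          solr.shutdown();
      }
  }

Note that atomic updates require the relevant fields to be stored in the
schema and an <updateLog/> configured in solrconfig.xml, so check your setup
before relying on this.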
-----Original message-----
> From:Ferdy Galema <[email protected]>
> Sent: Mon 02-Jul-2012 10:32
> To: [email protected]
> Subject: Re: How to update the index quickly?
>
> We also have plans to make a quick indexer, but we have not got around to
> it yet. The trick is to simply call the indexing code for a page when it is
> parsed. (This can even happen during fetch, so this would combine fetch,
> parse, and index into a single step.) The tradeoff is that some
> information might not be available yet, such as inlink information.
>
> Keep an eye on the Jira list for a possible implementation. (Or try
> yourself if you are into Nutch hacking).
>
> On Mon, Jul 2, 2012 at 5:20 AM, 何建云 <[email protected]> wrote:
>
> > Hi,
> > I am using Nutch for a search engine. I cannot index webpages until the
> > entire crawling process has ended, but I would like a quicker update
> > operation: data crawled in the first several rounds could be added to the
> > index even if the entire crawl process is not over yet.
> > 1. Do you have any good ideas?
> > 2. If I do the indexing operation after every crawl depth, it wastes a
> > lot of time, because the current solution rebuilds the whole index. Is
> > it possible to index incrementally?
> > Thanks.
>