Hi,

> I'm writing a custom IndexWriter and I had some questions on the execution
> workflow.
Have a look at NUTCH-1527 and NUTCH-1541.

> 
> I notice that when I run my index writer plugin the following happens:
> 
> - the describe String is printed
> - the .open method is called once
> - the .write method is called for every NutchDocument
> - the .close method is called
> - the .open method is called
with argument "name" = "commit"
> - the .commit method is called
> - the .close method is called again
> 
> This in most cases seems fine, however I'm not totally clear on what the
> .update or the .delete methods would be used. What is the "expected" use
> for these?
Intuitively, they update or delete documents which are already in the index.
Delete is used, e.g., to be sure that 404 documents are definitely removed
from a Solr index.
Update is actually not used. It may be useful for index end-points which
support field-level updates to update only some fields (e.g. score/boost
and anchor texts which depend on many documents and are permanently changing).
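
To make the difference between write, delete and update concrete, here is a
minimal, self-contained sketch. It is NOT the real o.a.n.indexer.IndexWriter
interface: the class InMemoryIndexWriterSketch, its simplified method
signatures (plain strings and maps instead of NutchDocument), and the helpers
contains() and field() are all hypothetical and only illustrate the semantics
described above, with an in-memory map standing in for the index.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for an index writer plugin.
// It uses no Nutch or Hadoop types and is not the actual Nutch API.
class InMemoryIndexWriterSketch {

    // The "index": document id -> (field name -> field value).
    private final Map<String, Map<String, String>> index = new LinkedHashMap<>();
    private boolean open = false;

    // Assumption: the name only labels the phase (e.g. "commit").
    void open(String name) {
        open = true;
    }

    // write: add a document, replacing any previous version as a whole.
    void write(String id, Map<String, String> fields) {
        requireOpen();
        index.put(id, new LinkedHashMap<>(fields));
    }

    // delete: remove a document that is already in the index,
    // e.g. a page that now returns 404.
    void delete(String id) {
        requireOpen();
        index.remove(id);
    }

    // update: field-level update, only the given fields are overwritten.
    void update(String id, Map<String, String> changedFields) {
        requireOpen();
        Map<String, String> existing = index.get(id);
        if (existing == null) {
            // Assumption: updating an unknown document behaves like write.
            index.put(id, new LinkedHashMap<>(changedFields));
        } else {
            existing.putAll(changedFields);
        }
    }

    // commit: nothing to flush for an in-memory map.
    void commit() {
        requireOpen();
    }

    void close() {
        open = false;
    }

    // Hypothetical inspection helpers, for the demo only.
    boolean contains(String id) {
        return index.containsKey(id);
    }

    String field(String id, String name) {
        Map<String, String> doc = index.get(id);
        return doc == null ? null : doc.get(name);
    }

    private void requireOpen() {
        if (!open) {
            throw new IllegalStateException("writer is not open");
        }
    }

    public static void main(String[] args) {
        InMemoryIndexWriterSketch writer = new InMemoryIndexWriterSketch();
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "A");
        doc.put("boost", "1.0");

        writer.open("index");
        writer.write("http://example.org/a", doc);
        writer.write("http://example.org/gone",
                Collections.singletonMap("title", "Gone"));
        writer.delete("http://example.org/gone");            // page is 404 now
        writer.update("http://example.org/a",
                Collections.singletonMap("boost", "2.5"));   // only the boost changes
        writer.close();

        writer.open("commit");
        writer.commit();
        writer.close();

        System.out.println("gone still indexed: " + writer.contains("http://example.org/gone"));
        System.out.println("title of a: " + writer.field("http://example.org/a", "title"));
        System.out.println("boost of a: " + writer.field("http://example.org/a", "boost"));
    }
}
```

In a real plugin, update would only pay off if the index back-end supports
partial updates; otherwise it could simply fall back to write.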

But you are definitely right. The interface o.a.n.indexer.IndexWriter
should provide good documentation for all required methods. Feel free
to open a Jira issue.

> As a possibly related question, is it possible to change the workflow of
> the plugin (without editing Nutch source beyond the plugin)?

Hardly. You have some control over what is done via the command-line options
-noCommit and -deleteGone. See o.a.n.indexer.IndexingJob.run(); the options
are also shown by
 % bin/nutch index

Bye,
Sebastian
