Hi,

> I'm writing a custom IndexWriter and I had some questions on the execution
> workflow.

Have a look at NUTCH-1527 and NUTCH-1541.

> I notice that when I run my index writer plugin the following happens:
>
> - the describe String is printed
> - the .open method is called once
> - the .write method is called for every NutchDocument
> - the .close method is called
> - the .open method is called with argument "name" = "commit"
> - the .commit method is called
> - the .close method is called again
>
> This in most cases seems fine, however I'm not totally clear on what the
> .update or the .delete methods would be used for. What is the "expected" use
> for these?

Intuitively, they update or delete documents which are already in the index.

Delete is used, e.g., to make sure that 404 documents are definitely removed
from a Solr index.

Update is actually not used. It may be useful for index end-points which
support field-level updates, to update only some fields (e.g. score/boost
and anchor texts, which depend on many documents and are permanently
changing).

But you are definitely right: the interface o.a.n.indexer.IndexWriter
should provide good documentation for all required methods.
Feel free to open a Jira issue.

> As a possibly related question, is it possible to change the workflow of
> the plugin (without editing Nutch source beyond the plugin)?

Hardly. You have some control over what is done via the command-line options
-noCommit and -deleteGone. See o.a.n.indexer.IndexingJob.run(); the options
are also shown by
 % bin/nutch index

Bye,
Sebastian
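
P.S. In case it helps to see the call order in one place, here is a rough, self-contained sketch of the sequence you describe. It does not use the Nutch classes: SketchIndexWriter is a hypothetical stand-in for o.a.n.indexer.IndexWriter with simplified signatures, the name "index" passed to the first open call is made up, and the way -noCommit and -deleteGone are modelled (skip the commit phase, call delete for gone documents) is my assumption, not a copy of IndexingJob.

```java
import java.util.ArrayList;
import java.util.List;

public class LifecycleSketch {

    // Hypothetical stand-in for o.a.n.indexer.IndexWriter (simplified signatures).
    interface SketchIndexWriter {
        void open(String name);
        void write(String docId);
        void delete(String docId);
        void commit();
        void close();
    }

    // Records every call so the resulting sequence can be inspected.
    static class RecordingWriter implements SketchIndexWriter {
        final List<String> calls = new ArrayList<>();
        public void open(String name)    { calls.add("open(" + name + ")"); }
        public void write(String docId)  { calls.add("write(" + docId + ")"); }
        public void delete(String docId) { calls.add("delete(" + docId + ")"); }
        public void commit()             { calls.add("commit()"); }
        public void close()              { calls.add("close()"); }
    }

    // Drives the writer in the order observed in the question.
    // noCommit / deleteGone mimic the command-line options (assumed behaviour).
    static List<String> run(List<String> docs, List<String> gone,
                            boolean noCommit, boolean deleteGone) {
        RecordingWriter writer = new RecordingWriter();

        // Phase 1: open once, write every document, close.
        writer.open("index"); // "index" is a made-up name for illustration
        for (String doc : docs) {
            writer.write(doc);
        }
        if (deleteGone) {
            // Assumption: gone (e.g. 404) documents are removed via delete.
            for (String doc : gone) {
                writer.delete(doc);
            }
        }
        writer.close();

        // Phase 2: reopen with name "commit", commit, close again.
        if (!noCommit) {
            writer.open("commit");
            writer.commit();
            writer.close();
        }
        return writer.calls;
    }

    public static void main(String[] args) {
        for (String call : run(List.of("doc1", "doc2"), List.of("doc3"), false, true)) {
            System.out.println(call);
        }
    }
}
```

The update method is left out of the stand-in on purpose, since it is not called in the current workflow.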