I am using Nutch 1.6 on Ubuntu in a standalone configuration.
I have a custom plugin based on HtmlParseFilter that adds metadata to a
document. It has been working well and has been used to parse at least
100,000 documents successfully. However, I now see some strange behavior
that I am struggling to debug. The specifics are as follows:
I have a field that I want to add to the metadata called 'education'.
There are multiple values parsed from each document - usually 2, sometimes
more.
When I retrieve a value I call:
metadata.add("education", value);
For a test document with two values for the education field, I can see in
the log file that this is called twice with the correct values.
I have added logging to the metadata.add method and I can see that on the
first call it creates a new key in the metadata and on the second call it
adds to the existing key.
So far, so good.
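For reference, the add behavior described above can be sketched in isolation. This is only a minimal stand-alone illustration, not the Nutch Metadata class: MultiValuedMetadata and getValues are hypothetical stand-ins, and "BSc" and "MSc" are placeholder values rather than data from my documents.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata container.
// It is NOT the Nutch Metadata class; it only mirrors the behavior seen in the log.
class MultiValuedMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // The first call for a name creates the key; later calls append to it.
    public void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Returns all values stored under a name, or an empty list if there are none.
    public List<String> getValues(String name) {
        return values.getOrDefault(name, new ArrayList<>());
    }

    public static void main(String[] args) {
        MultiValuedMetadata metadata = new MultiValuedMetadata();
        metadata.add("education", "BSc"); // placeholder value: creates the key
        metadata.add("education", "MSc"); // placeholder value: appends to the key
        System.out.println(metadata.getValues("education").size()); // prints 2
    }
}
```

With this behavior, two add calls should leave exactly two values under the key, which is what the parser log shows.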
Before the parser plugin returns, I print the metadata to the log using
metadata.toString(). In the log file I see the two entries for the
education field that I would expect, and they match what was logged in the
earlier add calls.
However, when I index to Solr I get four education values: each of the two
entries that I expected is duplicated.
If I run indexchecker it also shows four values.
I have stripped down the plugins list so that my plugin is the last to
execute, but the behavior is the same.
Any ideas what else to look at?
Thanks,
Iain