I am using Nutch 1.6 on Ubuntu in a standalone configuration.

 

I have a custom plugin based on HtmlParseFilter that adds metadata to a
document.  It has been working well and has been used to parse at least
100,000 documents successfully.  However, I now see some strange behavior
that I am struggling to debug.  The specifics are as follows:

 

I have a field that I want to add to the metadata called 'education'.

There are multiple values parsed from each document - usually 2, sometimes
more.

When I retrieve a value I call:

                

   metadata.add("education", value);

 

For a test document with two values for the education field, I can see in
the log file that this is called twice with the correct values.

 

I have added logging to the metadata.add method and I can see that on the
first call it creates a new key in the metadata and on the second call it
adds to the existing key.
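
To make the add behavior described above concrete, here is a minimal
stand-alone sketch. It is an assumption-laden model, not the Nutch code: it
uses a plain java.util map in place of Nutch's Metadata class, and the class
name MultiValuedMetadataSketch and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata container.
// This is NOT Nutch's Metadata class; it only models the behavior described.
class MultiValuedMetadataSketch {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // First call for a name creates the key; later calls append to it.
    public void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Returns a copy of the values stored under a name (empty if none).
    public List<String> getValues(String name) {
        List<String> stored = values.get(name);
        return stored == null ? new ArrayList<>() : new ArrayList<>(stored);
    }

    public static void main(String[] args) {
        MultiValuedMetadataSketch metadata = new MultiValuedMetadataSketch();
        metadata.add("education", "first value");   // creates the key
        metadata.add("education", "second value");  // appends to the key
        System.out.println(metadata.getValues("education"));
    }
}
```

In this model two add calls leave exactly two values under the key, which is
what the log shows for the real plugin. If the same pair of calls ran a second
time against the same object, the key would hold four values.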

 

So far, so good.

 

Before the parser plugin returns, I print the metadata to the log using
metadata.toString().  In the log file I see the two entries for the
education field that I would expect, and they match what was logged in the
prior calls.

 

However, when I index to Solr I get four education values: each of the two
entries that I expected is duplicated.

 

If I run indexchecker it also shows four values.

 

I have stripped down the plugins list so that my plugin is the last to
execute, but the behavior is the same.

 

Any ideas what else to look at?

 

Thanks,

 

Iain
