I am indexing pages from Nutch into Solr.  The pages that I am indexing
generally occur in pairs - let's call them Page A and Page B.  Page B is
linked from Page A, but not vice versa.  You can think of Page B as
providing additional information that supplements Page A.

Ideally the data from Page A and Page B would be consolidated into a single
document in Solr because they relate to a single business entity. Page A is
HTML, but Page B will require a custom parser.

I am considering two options to achieve this:

1.       Build an HtmlParseFilter plugin
<http://nutch.apache.org/apidocs/apidocs-1.8/org/apache/nutch/parse/HtmlParseFilter.html>
that will be invoked for Page A and
will retrieve Page B outside Nutch's fetch process.  The plugin would add
the data parsed from Page B to the data for Page A and this would then be
stored as a single document in Nutch and passed to Solr accordingly.

2.       Build a Parser plugin for Page B and treat it as a separate
document as far as Nutch is concerned.  Then use the 'join' functionality in
Solr 4.x to retrieve the two documents (I have not worked with this before).
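
To make Option 2 more concrete, here is a minimal sketch of how such a join
query might be built.  The field names (parent_url on the Page B document,
url on the Page A document, and detail_text) are hypothetical and depend on
what the indexing plugins actually emit; only the {!join from=... to=...}
query parser syntax comes from Solr itself.

```python
from urllib.parse import urlencode


def build_join_params(page_b_query, from_field="parent_url", to_field="url"):
    """Build Solr request parameters for a join from Page B docs to Page A docs.

    The default field names are assumptions made for illustration only.
    """
    return {
        # Match Page B docs with page_b_query, collect their from_field values,
        # and return the docs whose to_field equals one of those values.
        "q": "{!join from=%s to=%s}%s" % (from_field, to_field, page_b_query),
        "wt": "json",
    }


params = build_join_params("detail_text:widget")
query_string = urlencode(params)  # append to the core's /select URL
print(params["q"])
print(query_string)
```

One caveat worth checking: a Solr join acts like a filter, so the result
contains only the Page A documents, and fields from the matching Page B
documents are not merged into them.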

I can see pros and cons with each approach.  Option 1 simplifies Solr
queries and ensures that the pairs are always a consistent set since the
data is retrieved at the same time (a minor issue but worth noting).  Option
2 takes better advantage of Nutch's capabilities for handling redirects and
respecting robots.txt and fetch-rate limits, and a dedicated parser
separates out the Page B functionality more cleanly.
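
As an illustration of the consolidated document that Option 1 would produce,
the following sketch merges the fields parsed from Page B into Page A's
fields under a prefix.  It is plain Python rather than Nutch plugin code, and
every field name, the example URLs, and the b_ prefix are hypothetical.

```python
def merge_pair(page_a, page_b, prefix="b_"):
    """Return one document: Page A's fields plus Page B's fields, prefixed.

    Prefixing avoids collisions when both pages have a field with the
    same name (for example "id" or "title").
    """
    merged = dict(page_a)  # copy so the caller's dict is not modified
    for name, value in page_b.items():
        merged[prefix + name] = value
    return merged


# Hypothetical parsed fields for one Page A / Page B pair.
page_a = {"id": "http://example.com/a", "title": "Entity overview"}
page_b = {"id": "http://example.com/b", "detail_text": "extra data"}

doc = merge_pair(page_a, page_b)
print(doc)
```

With this shape, Page A's URL stays the document key and a single Solr query
sees both pages' data without a join.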

What would you recommend in this situation?  Are there other options that I
am missing?

Thanks
