I am indexing pages from Nutch into Solr. The pages that I am indexing generally occur in pairs - let's call them Page A and Page B. Page B is linked from Page A, but not vice versa. You can think of Page B as providing additional information that supplements Page A.
Ideally the data from Page A and Page B would be consolidated into a single document in Solr, because they relate to a single business entity. Page A is HTML, but Page B will require a custom parser. I am considering two options to achieve this:

1. Build an HtmlParseFilter plugin <http://nutch.apache.org/apidocs/apidocs-1.8/org/apache/nutch/parse/HtmlParseFilter.html> that will be invoked for Page A and will retrieve Page B outside Nutch's fetch process. The plugin would add the data parsed from Page B to the data for Page A, and this would then be stored as a single document in Nutch and passed to Solr accordingly.

2. Build a Parser plugin for Page B and treat it as a separate document as far as Nutch is concerned. Then use the 'join' functionality in Solr 4.x to retrieve the two documents (I have not worked with this before).

I can see pros and cons with each approach. Option 1 simplifies Solr queries and ensures that the pairs are always a consistent set, since the data is retrieved at the same time (a minor issue but worth noting). Option 2 takes better advantage of Nutch's capabilities for handling redirects and respecting robots.txt and fetch rates, and a dedicated parser separates out this functionality more cleanly.

What would you recommend in this situation? Are there other options that I am missing?

Thanks
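
For reference, here is a rough sketch of how I understand the Solr join query in Option 2 would be put together. The field names are purely assumptions for illustration: `url` as the key on Page A documents, `parent_url` as a field on Page B documents holding the URL of the Page A that links to it, and `detail_text` as a field produced by the custom parser. As far as I understand, the join only returns the documents on the "to" side (Page A here) and does not merge Page B's fields into the result.

```java
// Sketch only: builds the "q" parameter for a Solr join query.
// Field names (url, parent_url, detail_text) are hypothetical.
class JoinQuerySketch {

    // Solr join syntax: {!join from=<field> to=<field>}<inner query>
    // Documents matching the inner query supply values of fromField;
    // documents whose toField equals one of those values are returned.
    static String buildJoinQuery(String fromField, String toField, String innerQuery) {
        return "{!join from=" + fromField + " to=" + toField + "}" + innerQuery;
    }

    public static void main(String[] args) {
        // Find Page A documents whose linked Page B mentions "plumber".
        String q = buildJoinQuery("parent_url", "url", "detail_text:plumber");
        System.out.println(q);
    }
}
```

The resulting string would be sent as the `q` parameter (URL-encoded) to the select handler.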

