Hi,

If you need to merge different NutchDocument objects into a single SolrDocument, you will need to partially redesign IndexerMapReduce and create some composite key. The goal is to produce the same map/reduce key for all the documents that belong together, and then process them accordingly in the reducer, where you can merge them.
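To make the idea concrete, here is a minimal, Nutch-agnostic sketch of the reducer-side merge. The field→values map stands in for a NutchDocument, and `mergeByGroup()` stands in for the reduce() call; none of these names come from the Nutch API, they are purely illustrative.

```java
import java.util.*;

// Minimal sketch of the reducer-side merge. The field->values map stands
// in for a NutchDocument; mergeByGroup() stands in for reduce(). None of
// these names are part of the Nutch API.
public class MergeSketch {

    // Each fragment is {groupKey, fieldName, fieldValue}. In a real
    // IndexerMapReduce redesign the group key would be emitted by the
    // mapper (e.g. the main page's URL, or a marker set during parsing),
    // so that all fragments of one item arrive in the same reduce() call.
    static Map<String, Map<String, List<String>>> mergeByGroup(
            List<String[]> fragments) {
        Map<String, Map<String, List<String>>> merged = new LinkedHashMap<>();
        for (String[] f : fragments) {
            merged.computeIfAbsent(f[0], k -> new LinkedHashMap<>())
                  .computeIfAbsent(f[1], k -> new ArrayList<>())
                  .add(f[2]);
        }
        return merged;
    }
}
```

Once merged, you would emit one document per group key instead of one per URL, which is exactly what the reducer in a redesigned IndexerMapReduce would do.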
This is not going to be easy if you're not familiar with map/reduce programming. In the mapper you must check each input object for a marker that tells you it belongs to a group, and emit a single shared key for all objects in that group. In the reducer you can then process them together. The hard part is determining whether the various input objects are part of a group, because you receive ParseData, ParseText and CrawlDatum objects. However, there may be another user with a better idea ;)

Cheers,

On Thursday 24 November 2011 15:37:04 Jose Gil wrote:
> Hi,
>
> we are crawling a site which splits the content about a single item across
> a main page and several sub-pages.
>
> We've written custom parsers to extract the necessary data from each of
> those pages, but we can't think of a clean way to merge all that into one
> single document for indexing in Solr -- especially taking into account that
> Nutch doesn't provide guarantees that the sub-pages will be parsed just
> after the main page, or even in the same run.
>
> One solution would be to store each of the information fragments as a
> separate document in Solr and run a batch process to merge them together
> and store the "complete" document -- but we would really prefer Nutch to
> index the complete document at first and not have the noise of incomplete
> documents in Solr.
>
> Ideas welcome.
>
> Thanks,

-- 
Markus Jelsma - CTO - Openindex

