Hi, we are crawling a site that splits the content about a single item across a main page and several sub-pages.
We've written custom parsers to extract the necessary data from each of those pages, but we can't think of a clean way to merge it all into a single document for indexing in Solr, especially since Nutch doesn't guarantee that the sub-pages will be parsed right after the main page, or even in the same run.

One option would be to store each information fragment as a separate document in Solr and run a batch process that merges them into the "complete" document, but we'd much rather have Nutch index the complete document in the first place and avoid the noise of incomplete documents in Solr.

Ideas welcome. Thanks,
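P.S. To make the batch-merge fallback concrete, here is a rough sketch of the merge step in Python. All field names (`parent_id`, `title`, `specs`) are made up for illustration; in practice the fragments would be queried from Solr and the merged document written back.

```python
from collections import defaultdict

def merge_fragments(fragments):
    """Group fragment documents by their shared parent id and fold
    their fields into one 'complete' document per item."""
    merged = defaultdict(dict)
    for frag in fragments:
        doc = merged[frag["parent_id"]]
        doc["id"] = frag["parent_id"]
        for key, value in frag.items():
            if key in ("id", "parent_id"):
                continue  # bookkeeping fields, not item content
            if key in doc and doc[key] != value:
                # Same field seen in several fragments: collect as multi-valued
                existing = doc[key] if isinstance(doc[key], list) else [doc[key]]
                doc[key] = existing + [value]
            else:
                doc.setdefault(key, value)
    return list(merged.values())

# Example: one main page plus two sub-pages for the same item
fragments = [
    {"id": "main-1", "parent_id": "item-1", "title": "Widget"},
    {"id": "sub-1a", "parent_id": "item-1", "specs": "10 kg"},
    {"id": "sub-1b", "parent_id": "item-1", "specs": "blue"},
]
complete = merge_fragments(fragments)
```

This is just the in-memory merge; the surrounding job would delete the fragment documents after re-indexing the merged one.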

