Hi,

we are crawling a site which splits the content about a single item across
a main page and several sub-pages.

We've written custom parsers to extract the necessary data from each of
those pages, but we can't think of a clean way to merge all that into one
single document for indexing in Solr -- especially since Nutch doesn't
guarantee that the sub-pages will be parsed right after the main page,
or even in the same run.

One solution would be to store each information fragment as a separate
document in Solr and run a batch process to merge them into a single
"complete" document -- but we would much prefer Nutch to index the
complete document from the start, so we don't have the noise of
incomplete documents in Solr.
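For what it's worth, here is a minimal sketch of the batch-merge idea, kept independent of the Solr API. It assumes each custom parser emits its fields as a flat map and includes a shared key (called "itemId" here, a hypothetical field name) linking sub-pages to their main page; the merge step just groups fragments by that key and unions their fields before the combined document is sent to Solr:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FragmentMerger {

    /**
     * Group parsed field maps by their shared "itemId" key (an assumed
     * field that every parser would need to emit) and union the fields
     * of all fragments belonging to the same item into one document.
     */
    public static Map<String, Map<String, String>> merge(
            List<Map<String, String>> fragments) {
        Map<String, Map<String, String>> docs = new HashMap<>();
        for (Map<String, String> frag : fragments) {
            String id = frag.get("itemId");
            if (id == null) {
                continue; // fragment without a join key: skip it
            }
            // later fragments overwrite earlier ones on field collisions
            docs.computeIfAbsent(id, k -> new HashMap<>()).putAll(frag);
        }
        return docs;
    }

    public static void main(String[] args) {
        // one main page and one sub-page for the same item
        Map<String, String> mainPage = Map.of("itemId", "42", "title", "Widget");
        Map<String, String> subPage = Map.of("itemId", "42", "specs", "steel, 1kg");

        Map<String, Map<String, String>> merged =
                merge(List.of(mainPage, subPage));

        System.out.println(merged.get("42").get("title"));
        System.out.println(merged.get("42").get("specs"));
    }
}
```

The same grouping could of course run inside Solr itself (re-fetch the partial documents, union them, re-index), but doing it before indexing is what avoids the incomplete-document noise.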

Ideas welcome.

Thanks,
