Hi,

we are crawling a site which splits the content about a single item across
a main page and several sub-pages.

We've written custom parsers to extract the necessary data from each of
those pages, but we can't think of a clean way to merge all that into one
single document for indexing in Solr -- especially since Nutch doesn't
guarantee that the sub-pages will be parsed right after the main page,
or even in the same run.

One solution would be to store each information fragment as a separate
document in Solr and run a batch process to merge them into a single
"complete" document -- but we would much prefer Nutch to index the
complete document from the start, so we don't have the noise of
incomplete documents in Solr.
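For what it's worth, here is a minimal sketch of the batch-merge idea, kept independent of the Solr API. It assumes each custom parser emits its fields as a flat map and includes a shared key (called "itemId" here, a hypothetical field name) linking sub-pages to their main page; the merge step just groups fragments by that key and unions their fields before the combined document is sent to Solr:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FragmentMerger {

    /**
     * Group parsed field maps by their shared "itemId" key (an assumed
     * field that every parser would need to emit) and union the fields
     * of all fragments belonging to the same item into one document.
     */
    public static Map<String, Map<String, String>> merge(
            List<Map<String, String>> fragments) {
        Map<String, Map<String, String>> docs = new HashMap<>();
        for (Map<String, String> frag : fragments) {
            String id = frag.get("itemId");
            if (id == null) {
                continue; // fragment without a join key: skip it
            }
            // later fragments overwrite earlier ones on field collisions
            docs.computeIfAbsent(id, k -> new HashMap<>()).putAll(frag);
        }
        return docs;
    }

    public static void main(String[] args) {
        // one main page and one sub-page for the same item
        Map<String, String> mainPage = Map.of("itemId", "42", "title", "Widget");
        Map<String, String> subPage = Map.of("itemId", "42", "specs", "steel, 1kg");

        Map<String, Map<String, String>> merged =
                merge(List.of(mainPage, subPage));

        System.out.println(merged.get("42").get("title"));
        System.out.println(merged.get("42").get("specs"));
    }
}
```

The same grouping could of course run inside Solr itself (re-fetch the partial documents, union them, re-index), but doing it before indexing is what avoids the incomplete-document noise.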

Ideas welcome.

Thanks,
