Hi Iain,

Just thinking aloud here, so please don't take any of what follows for granted. I can't think of a way of doing this out of the box, but here is what you could do:
1. Have a normalizer to be used at indexing time so that you could rewrite URL A into URL B. B would be the one used as a field in the index.

2. Modify the IndexerMapReduce class so that it merges all the ParseData it finds in the reduce step (https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L195) instead of keeping only the last one found. If a ParseData has already been found, then you could add metadata to it from any new ones.

3. You could simply add a prefix to distinguish the metadata from the various ParseData instances, and then have a custom IndexingFilter to rewrite/normalise the key/values in any way you'd want.

I think this should work, but it requires adding a few lines of code for the merging of the ParseData.

HTH

Julien

On 8 May 2014 13:55, Iain Lopata <[email protected]> wrote:

> I have a situation in which, ideally, I would like to combine data parsed
> from two separate web pages into a single document, which would then be
> indexed into Solr. I have looked at the options for passing two separate
> documents to Solr and combining the data at query time, but none of the
> available options fit my needs very well.
>
> Does anyone have any suggestions on how to approach this? Do I need to
> write a custom Indexer or is there a better approach?
>
> It may be worth noting that there is a fixed relationship between the URLs
> of the two pages, so given the URL of either one I can derive the URL of
> the other.
>
> Thanks for any ideas

-- 
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
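P.S. To make step 1 of my suggestion above concrete: a real Nutch normalizer would implement org.apache.nutch.net.URLNormalizer and be registered as a plugin, but the rewrite logic itself is tiny. This is a standalone sketch only; the /data/ -> /meta/ path relationship is invented for illustration, since you said the actual relationship between your two URLs is fixed and derivable.

```java
// Standalone sketch of the step-1 rewrite logic. In Nutch proper this
// would live inside an org.apache.nutch.net.URLNormalizer plugin; the
// /data/ -> /meta/ mapping below is a hypothetical stand-in for your
// own fixed URL relationship.
public class CompanionUrlNormalizer {

    // Rewrite URL A into its companion URL B so both pages end up
    // under the same key in the index.
    public static String normalize(String url) {
        if (url.contains("/data/")) {
            return url.replace("/data/", "/meta/");
        }
        return url; // already the canonical (B) form
    }
}
```

Because both A and B normalize to the same string, the two fetched pages collapse onto one key by the time IndexerMapReduce groups records in its reduce step.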
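And for steps 2 and 3, the merge you'd add in the reduce step amounts to folding each ParseData's metadata into one map under a per-page prefix, so a custom IndexingFilter can pick the keys apart later. This sketch uses a plain java.util.Map in place of Nutch's Metadata class, and the "pageA"/"pageB" tags are invented names, not anything Nutch defines:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the reduce-side merge from steps 2 and 3: instead of
// keeping only the last ParseData found, fold every page's metadata
// into a single map, prefixing each key with an origin tag. A plain
// Map stands in for Nutch's Metadata class here.
public class ParseDataMerger {

    // perPage maps an origin tag (e.g. "pageA") to that page's metadata.
    public static Map<String, String> merge(Map<String, Map<String, String>> perPage) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> page : perPage.entrySet()) {
            String prefix = page.getKey();
            for (Map.Entry<String, String> kv : page.getValue().entrySet()) {
                // The prefix keeps metadata from the two pages
                // distinguishable, per step 3.
                merged.put(prefix + "." + kv.getKey(), kv.getValue());
            }
        }
        return merged;
    }
}
```

A custom IndexingFilter would then see keys like "pageA.title" and "pageB.title" on the single merged document and could rewrite or rename them however you'd want before the document goes to Solr.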

