Hi Iain,

Just thinking aloud here, so please don't take any of what follows as
gospel. I can't think of a way of doing this out of the box, but here is
what you could do:

1. have a URL normalizer run at indexing time so that URL A gets rewritten
into URL B; B would then be the URL used as the key in the index
2. modify the IndexerMapReduce class so that it merges all the ParseData
instances it finds in the reduce step (
https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L195)
instead of keeping only the last one found. If a ParseData has already been
found, you could add the metadata from any new ones to it
3. you could simply add a prefix to distinguish the metadata coming from
the various ParseData instances and then have a custom IndexingFilter
rewrite/normalise the keys and values in any way you'd want
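To make step 1 concrete, here is a minimal sketch of the rewriting logic as
a plain static method. The actual relationship between your two URLs is
whatever your site uses; I'm assuming a hypothetical pairing like
/item/42 and /item/42/details. In Nutch you would put this inside a
URLNormalizer plugin activated for the indexing scope rather than a
standalone class:

```java
// Hypothetical sketch of step 1: collapse the "secondary" URL of a page
// pair onto its partner so both pages share one key in the index.
// The "/details" suffix is an assumption, not something from the thread.
public class PairNormalizer {

    static String normalize(String url) {
        // If this is the secondary page, strip the suffix to get URL B;
        // otherwise the URL is already the canonical one.
        if (url.endsWith("/details")) {
            return url.substring(0, url.length() - "/details".length());
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://example.com/item/42/details"));
        System.out.println(normalize("http://example.com/item/42"));
    }
}
```

Since Iain says the URL of either page can be derived from the other, a
single rule like this (in whichever direction you pick) should be enough.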

I think this should work, but it requires adding a few lines of code for
the merging of the ParseData.
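For steps 2 and 3, the merging itself is only a few lines. The sketch below
uses plain Maps in place of org.apache.nutch.metadata.Metadata just to keep
it self-contained; the prefix scheme ("pd0.", "pd1.", ...) is my own
invention, and in IndexerMapReduce the accumulation would happen in the
reduce loop where the ParseData is currently overwritten:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of merging the metadata of several ParseData instances into one,
// prefixing each key with the index of its source so a custom
// IndexingFilter can later tell them apart and rewrite them as needed.
public class ParseDataMerge {

    static Map<String, String> merge(List<Map<String, String>> allMetadata) {
        Map<String, String> merged = new HashMap<>();
        int i = 0;
        for (Map<String, String> md : allMetadata) {
            String prefix = "pd" + i++ + ".";  // e.g. pd0.title, pd1.title
            for (Map.Entry<String, String> e : md.entrySet()) {
                merged.put(prefix + e.getKey(), e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> first = new HashMap<>();
        first.put("title", "Summary page");
        Map<String, String> second = new HashMap<>();
        second.put("title", "Detail page");

        List<Map<String, String>> all = new ArrayList<>();
        all.add(first);
        all.add(second);

        // Keys from both pages survive side by side instead of the
        // second ParseData clobbering the first.
        System.out.println(merge(all));
    }
}
```

The IndexingFilter would then strip or rename the "pdN." prefixes into
whatever Solr field names you want.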

HTH

Julien

On 8 May 2014 13:55, Iain Lopata <[email protected]> wrote:

> I have a situation in which, ideally, I would like to combine data parsed
> from two separate web pages into a single document, which would then be
> indexed into Solr.  I have looked at the options for passing two separate
> documents to Solr and combining the data at query time, but none of the
> available options fit my needs very well.
>
>
>
> Does anyone have any suggestions on how to approach this?  Do I need to
> write a custom Indexer or is there a better approach?
>
>
>
> It may be worth noting that there is a fixed relationship between the urls
> of the two pages, so given the url of either one I can derive the url of
> the
> other.
>
>
>
> Thanks for any ideas
>
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
