At the top of the De-Duplication wiki page is a note about collapsing
results. Once you have the signature (identical for each of the duplicates),
you'll want to collapse your results on it, keeping the document with the
latest date.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
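
As a rough sketch of what that query could look like, here is a small
Python helper that builds the request parameters. The field names
"signature" and "release_year" and the helper itself are just placeholders
I made up for illustration; substitute whatever your schema uses. As far as
I recall, the collapse parser's "max" option needs a numeric field (or a
function), so a date may have to be indexed as a number for this to work.

```python
from urllib.parse import urlencode


def collapsed_query_params(user_query,
                           signature_field="signature",
                           date_field="release_year"):
    """Build Solr select parameters that collapse duplicates.

    Documents sharing the same value in signature_field are collapsed into
    one, and the document with the highest date_field value is the one kept.
    Both field names are hypothetical placeholders, not from a real schema.
    """
    return {
        "q": user_query,
        # Collapsing query parser applied as a filter query.
        "fq": "{!collapse field=%s max=%s}" % (signature_field, date_field),
        "wt": "json",
    }


params = collapsed_query_params("product A")
# URL-encoded query string to append to your collection's /select handler.
print(urlencode(params))
```

The nice part of doing it this way is that nothing has to be tagged at
index time: all the yearly copies stay in the index for the per-year
searches, and only the generic search adds the collapse filter.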


k/r,
Scott

On Thu, Oct 29, 2015 at 11:59 PM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:

> Yes, you can try using the SignatureUpdateProcessorFactory to hash the
> content into a signature field, and then group on the signature field
> during your search.
>
> You can find more information here:
> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>
> I have been using this method to group documents with duplicated content,
> and it is working fine.
>
> Regards,
> Edwin
>
>
> On 30 October 2015 at 07:20, Shamik Bandopadhyay <sham...@gmail.com>
> wrote:
>
> > Hi,
> >
> >   I'm looking to customize index-time de-duplication. Here's my use case
> > and what I'm trying to achieve.
> >
> > I have identical documents coming from different release years of a given
> > product. I need to index them in Solr as they are required in individual
> > year context. But there's a generic search which spans across all the
> > years and hence brings back duplicate/identical content. My goal is to
> > return only the latest document and filter out the rest. E.g., if product
> > A has identical documents for 2015, 2014 and 2013, search should return
> > only 2015 (the latest document) and filter out the rest.
> >
> > What I'm thinking of doing (if possible) at index time:
> >
> > Index all documents, but add a special tag (e.g. dedup=true) to 2013 and
> > 2014 content, keeping 2015 (the latest release) untouched. At query
> > time, I'll add a filter which will exclude content tagged with "dedup".
> >
> > Just wondering if this is achievable by perhaps extending
> > UpdateRequestProcessorFactory or customizing
> > SignatureUpdateProcessorFactory?
> >
> > Any pointers will be appreciated.
> >
> > Regards,
> > Shamik
> >
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com
