Re: Question on index time de-duplication

2015-11-01 Thread shamik
That's what I observed as well. Perhaps there's a way to customize
SignatureUpdateProcessorFactory to support my use case. I'll look into the
source code and figure out if there's a way to do it.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-on-index-time-de-duplication-tp4237306p4237623.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Question on index time de-duplication

2015-10-31 Thread Zheng Lin Edwin Yeo
Hi Shamik,

I'm using most of the configuration out of the box, but I'm also looking at
tagging an identifier or something similar so that it will always show the
latest document.

At first I thought it would automatically show the one that was indexed
later, but that doesn't seem to be the case. It just shows a random one if
we use the default configuration.

I'll also update here if I find any solutions or tips.

Regards,
Edwin


On 31 October 2015 at 00:38, shamik <sham...@gmail.com> wrote:

> Thanks for your reply. Have you customized SignatureUpdateProcessorFactory,
> or are you using the configuration out of the box? I know it works for
> simple dedup, but my requirement is a tad different as I need to tag an
> identifier to the latest document. My goal is to understand if that's
> possible using SignatureUpdateProcessorFactory.


RE: Question on index time de-duplication

2015-10-30 Thread Markus Jelsma
Hello - keep in mind that neither SignatureUpdateProcessorFactory nor field
collapsing works in distributed search unless you map identical
signatures to identical shards.
Markus
 
-Original message-
> From:Scott Stults <sstu...@opensourceconnections.com>
> Sent: Friday 30th October 2015 11:58
> To: solr-user@lucene.apache.org
> Subject: Re: Question on index time de-duplication
> 
> At the top of the De-Duplication wiki page is a note about collapsing
> results. Once you have the signature (identical for each of the duplicates)
> you'll want to collapse your results, keeping the one with max date.
> 
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
> 
> 
> k/r,
> Scott
> 
> On Thu, Oct 29, 2015 at 11:59 PM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> wrote:
> 
> > Yes, you can try to use the SignatureUpdateProcessorFactory to do a hashing
> > of the content to a signature field, and group the signature field during
> > your search.
> >
> > You can find more information here:
> > https://cwiki.apache.org/confluence/display/solr/De-Duplication
> >
> > I have been using this method to group the index with duplicated content,
> > and it is working fine.
> >
> > Regards,
> > Edwin
> >
> >
> > On 30 October 2015 at 07:20, Shamik Bandopadhyay <sham...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > >   I'm looking to customize index time de-duplication. Here's my use
> > > case and what I'm trying to achieve.
> > >
> > > I have identical documents coming from different release years of a given
> > > product. I need to index them in Solr as they are required in individual
> > > year context. But there's a generic search which spans all the years
> > > and hence brings back duplicate/identical content. My goal is to return
> > > only the latest document and filter out the rest. E.g., if product A has
> > > identical documents for 2015, 2014 and 2013, search should only return
> > > 2015 (latest document) and filter out the rest.
> > >
> > > What I'm thinking (if possible) during index time :
> > >
> > > Index all documents, but add a special tag (e.g. dedup=true) to 2013 and
> > > 2014 content, keeping 2015 (the latest release) untouched. During query
> > > time, I'll add a filter which will exclude contents tagged with "dedup".
> > >
> > > Just wondering if this is achievable by perhaps extending
> > > UpdateRequestProcessorFactory or
> > > customizing SignatureUpdateProcessorFactory ?
> > >
> > > Any pointers will be appreciated.
> > >
> > > Regards,
> > > Shamik
> > >
> >
> 


RE: Question on index time de-duplication

2015-10-30 Thread shamik
Thanks Markus. I've been using field collapsing till now but the performance
constraint is forcing me to think about index time de-duplication. I've been
using a composite router to make sure that duplicate documents are routed to
the same shard. Won't that work for SignatureUpdateProcessorFactory ?
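
A rough illustration of the composite-router idea mentioned above: with Solr's compositeId router, documents whose ids share the prefix before the `!` are co-located on the same shard, so prefixing each id with the dedup key should keep duplicates together. The Python sketch below rests on assumptions: the key and id values, the helper names `routed_id` and `toy_shard_for`, and the shard count are all hypothetical, and `toy_shard_for` is only a stand-in for Solr's real hash-range logic.

```python
import hashlib

def routed_id(dedup_key, doc_id):
    """Build a compositeId-style id of the form '<dedup_key>!<doc_id>'."""
    if "!" in dedup_key:
        raise ValueError("dedup key must not contain '!'")
    return f"{dedup_key}!{doc_id}"

def toy_shard_for(composite_id, num_shards):
    """Toy stand-in for shard selection: hash only the prefix before '!'.

    This is NOT Solr's algorithm; it only illustrates that ids sharing a
    prefix are routed together.
    """
    prefix = composite_id.split("!", 1)[0]
    digest = hashlib.md5(prefix.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

# Hypothetical duplicates of the same content from two release years.
id_2014 = routed_id("sig-abc123", "productA-2014")
id_2015 = routed_id("sig-abc123", "productA-2015")
```

Whether this is enough for SignatureUpdateProcessorFactory in a given setup is exactly the open question in this message; the sketch only shows how the ids would be shaped.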





Re: Question on index time de-duplication

2015-10-30 Thread shamik
Thanks Scott. I could directly use field collapsing on the adskdedup field
without the signature field. The problem with field collapsing is the
performance overhead: it slows the query down tenfold.
CollapsingQParserPlugin is a better option; unfortunately, it doesn't
support an ngroups equivalent, which is a requirement for me.
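
For readers comparing the two options named above, here is a small Python sketch of the request parameters each one would use. `group`, `group.field`, `group.ngroups` and the `{!collapse field=...}` filter are standard Solr parameters; the helper names and the sample query are hypothetical, and the field name `adskdedup` is taken from this message.

```python
def grouping_params(query, field="adskdedup"):
    """Result Grouping request that also asks for the number of groups."""
    return {
        "q": query,
        "group": "true",
        "group.field": field,
        "group.ngroups": "true",
    }

def collapse_params(query, field="adskdedup"):
    """The same de-duplication expressed as a CollapsingQParserPlugin filter."""
    return {"q": query, "fq": "{!collapse field=%s}" % field}

grouped = grouping_params("install guide")
collapsed = collapse_params("install guide")
```

The collapse form has no parameter corresponding to `group.ngroups`, which is the limitation described above.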





Re: Question on index time de-duplication

2015-10-30 Thread shamik
Thanks for your reply. Have you customized SignatureUpdateProcessorFactory, or
are you using the configuration out of the box? I know it works for simple
dedup, but my requirement is a tad different as I need to tag an identifier to
the latest document. My goal is to understand if that's possible using
SignatureUpdateProcessorFactory.





Re: Question on index time de-duplication

2015-10-30 Thread Scott Stults
At the top of the De-Duplication wiki page is a note about collapsing
results. Once you have the signature (identical for each of the duplicates)
you'll want to collapse your results, keeping the one with max date.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results


k/r,
Scott

On Thu, Oct 29, 2015 at 11:59 PM, Zheng Lin Edwin Yeo 
wrote:

> Yes, you can try to use the SignatureUpdateProcessorFactory to do a hashing
> of the content to a signature field, and group the signature field during
> your search.
>
> You can find more information here:
> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>
> I have been using this method to group the index with duplicated content,
> and it is working fine.
>
> Regards,
> Edwin
>
>
> On 30 October 2015 at 07:20, Shamik Bandopadhyay 
> wrote:
>
> > Hi,
> >
> >   I'm looking to customize index time de-duplication. Here's my use
> > case and what I'm trying to achieve.
> >
> > I have identical documents coming from different release years of a given
> > product. I need to index them in Solr as they are required in individual
> > year context. But there's a generic search which spans all the years
> > and hence brings back duplicate/identical content. My goal is to return
> > only the latest document and filter out the rest. E.g., if product A has
> > identical documents for 2015, 2014 and 2013, search should only return
> > 2015 (latest document) and filter out the rest.
> >
> > What I'm thinking (if possible) during index time :
> >
> > Index all documents, but add a special tag (e.g. dedup=true) to 2013 and
> > 2014 content, keeping 2015 (the latest release) untouched. During query
> > time, I'll add a filter which will exclude contents tagged with "dedup".
> >
> > Just wondering if this is achievable by perhaps extending
> > UpdateRequestProcessorFactory or
> > customizing SignatureUpdateProcessorFactory ?
> >
> > Any pointers will be appreciated.
> >
> > Regards,
> > Shamik
> >
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com
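
The "collapse, keeping the one with max date" suggestion above could look roughly like this as request parameters. This is a Python sketch under assumptions: the field names `signature` and `release_year` and the helper name are hypothetical, while `max=` is the collapse parser's option for choosing the group head by the highest value of a numeric field.

```python
from urllib.parse import urlencode

def collapse_latest_params(query, signature_field="signature",
                           date_field="release_year"):
    """Collapse on the signature field, keeping the doc with the max date."""
    collapse = "{!collapse field=%s max=%s}" % (signature_field, date_field)
    return {"q": query, "fq": collapse}

params = collapse_latest_params("install guide")
# URL-encoded form, ready to append to a /select request.
query_string = urlencode(params)
```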


Question on index time de-duplication

2015-10-29 Thread Shamik Bandopadhyay
Hi,

  I'm looking to customize index time de-duplication. Here's my use case
and what I'm trying to achieve.

I have identical documents coming from different release years of a given
product. I need to index them in Solr as they are required in individual
year context. But there's a generic search which spans all the years
and hence brings back duplicate/identical content. My goal is to return only
the latest document and filter out the rest. E.g., if product A has
identical documents for 2015, 2014 and 2013, search should only return 2015
(latest document) and filter out the rest.

What I'm thinking (if possible) during index time :

Index all documents, but add a special tag (e.g. dedup=true) to 2013 and
2014 content, keeping 2015 (the latest release) untouched. During query
time, I'll add a filter which will exclude contents tagged with "dedup".

Just wondering if this is achievable by perhaps extending
UpdateRequestProcessorFactory or
customizing SignatureUpdateProcessorFactory ?

Any pointers will be appreciated.

Regards,
Shamik
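
To make the tagging idea in this message concrete, here is a minimal Python sketch of the logic, done as a batch step before indexing rather than inside Solr. It is an illustration under assumptions, not an UpdateRequestProcessor: the field names `signature`, `year` and `dedup`, the helper name and the sample documents are all hypothetical, and it assumes every duplicate already carries the same `signature` value.

```python
from collections import defaultdict

def tag_older_duplicates(docs, key_field="signature", year_field="year"):
    """Return copies of docs with dedup=True on all but the latest duplicate."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key_field]].append(doc)

    tagged = []
    for duplicates in groups.values():
        latest = max(d[year_field] for d in duplicates)
        for d in duplicates:
            copy = dict(d)
            if d[year_field] != latest:
                copy["dedup"] = True  # older release: hide from generic search
            tagged.append(copy)
    return tagged

docs = [
    {"id": "A-2013", "signature": "sig-a", "year": 2013},
    {"id": "A-2014", "signature": "sig-a", "year": 2014},
    {"id": "A-2015", "signature": "sig-a", "year": 2015},
    {"id": "B-2015", "signature": "sig-b", "year": 2015},
]
tagged = tag_older_duplicates(docs)

# Filter query for the generic search: exclude anything tagged as a duplicate.
FILTER_QUERY = "-dedup:true"
visible = [d["id"] for d in tagged if not d.get("dedup")]
```

One caveat with this approach: when a newer release arrives later, the previously "latest" document has to be re-tagged, which a processor that sees one incoming document at a time cannot do without also updating documents already in the index.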


Re: Question on index time de-duplication

2015-10-29 Thread Zheng Lin Edwin Yeo
Yes, you can try to use the SignatureUpdateProcessorFactory to do a hashing
of the content to a signature field, and group the signature field during
your search.

You can find more information here:
https://cwiki.apache.org/confluence/display/solr/De-Duplication

I have been using this method to group the index with duplicated content,
and it is working fine.

Regards,
Edwin


On 30 October 2015 at 07:20, Shamik Bandopadhyay  wrote:

> Hi,
>
>   I'm looking to customize index time de-duplication. Here's my use case
> and what I'm trying to achieve.
>
> I have identical documents coming from different release years of a given
> product. I need to index them in Solr as they are required in individual
> year context. But there's a generic search which spans all the years
> and hence brings back duplicate/identical content. My goal is to return only
> the latest document and filter out the rest. E.g., if product A has
> identical documents for 2015, 2014 and 2013, search should only return 2015
> (latest document) and filter out the rest.
>
> What I'm thinking (if possible) during index time :
>
> Index all documents, but add a special tag (e.g. dedup=true) to 2013 and
> 2014 content, keeping 2015 (the latest release) untouched. During query
> time, I'll add a filter which will exclude contents tagged with "dedup".
>
> Just wondering if this is achievable by perhaps extending
> UpdateRequestProcessorFactory or
> customizing SignatureUpdateProcessorFactory ?
>
> Any pointers will be appreciated.
>
> Regards,
> Shamik
>
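
The approach suggested in this reply (hash the content into a signature field, then group on that field) can be sketched in Python as follows. This only illustrates the idea: it does not reproduce the bytes Solr's signature classes would produce, and the field names `title`, `content` and `signature`, the helper name and the sample documents are hypothetical.

```python
import hashlib

def content_signature(doc, fields=("title", "content")):
    """Hash the chosen fields into one hex digest (illustrative MD5)."""
    md5 = hashlib.md5()
    for name in fields:
        md5.update(str(doc.get(name, "")).encode("utf-8"))
    return md5.hexdigest()

doc_2014 = {"id": "A-2014", "title": "Install", "content": "Run setup."}
doc_2015 = {"id": "A-2015", "title": "Install", "content": "Run setup."}
doc_other = {"id": "B-2015", "title": "Remove", "content": "Run uninstall."}

for doc in (doc_2014, doc_2015, doc_other):
    doc["signature"] = content_signature(doc)

# Grouping on the signature field at query time (standard grouping params).
group_params = {"q": "install", "group": "true", "group.field": "signature"}
```

Because the id and year are left out of the hashed fields, identical content from different years gets the same signature and lands in one group.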