Re: Question on index time de-duplication
That's what I observed as well. Perhaps there's a way to customize SignatureUpdateProcessorFactory to support my use case. I'll look into the source code and figure out whether it can be done.

--
View this message in context: http://lucene.472066.n3.nabble.com/Question-on-index-time-de-duplication-tp4237306p4237623.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Question on index time de-duplication
Hi Shamik,

I'm using most of the configuration out of the box, but I'm also looking at tagging an identifier or something similar so that it always shows the latest documents. At first I thought it would automatically show the one that was indexed later, but that doesn't seem to be the case. With the default configuration it just shows a random one.

I'll update here as well if I find any solutions or tips.

Regards,
Edwin

On 31 October 2015 at 00:38, shamik <sham...@gmail.com> wrote:

> Thanks for your reply. Have you customized SignatureUpdateProcessorFactory,
> or are you using the configuration out of the box? I know it works for simple
> dedup, but my requirement is a tad different, as I need to tag an identifier
> to the latest document. My goal is to understand whether that's possible
> using SignatureUpdateProcessorFactory.
RE: Question on index time de-duplication
Hello,

Keep in mind that neither SignatureUpdateProcessorFactory nor field collapsing works in distributed search unless you map identical signatures to identical shards.

Markus

-----Original message-----
> From: Scott Stults <sstu...@opensourceconnections.com>
> Sent: Friday 30th October 2015 11:58
> To: solr-user@lucene.apache.org
> Subject: Re: Question on index time de-duplication
>
> At the top of the De-Duplication wiki page is a note about collapsing
> results. Once you have the signature (identical for each of the duplicates)
> you'll want to collapse your results, keeping the one with max date.
>
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
>
> k/r,
> Scott
RE: Question on index time de-duplication
Thanks Markus. I've been using field collapsing until now, but the performance constraint is forcing me to consider index-time de-duplication. I've been using a composite router to make sure that duplicate documents are routed to the same shard. Won't that work for SignatureUpdateProcessorFactory?
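To make the composite-router point concrete, here is a minimal sketch. It assumes Solr's compositeId router, where a document id of the form prefix!suffix is routed by hashing the prefix, so documents that share a prefix generally land on the same shard. The function name and the example key are hypothetical and not from the thread, and whether co-location alone is enough for SignatureUpdateProcessorFactory is not settled here.

```python
def composite_id(dedup_key: str, doc_id: str) -> str:
    """Build a compositeId-style document id: '<dedup_key>!<doc_id>'.

    Assumption: every copy of a duplicated document shares the same
    dedup_key, so the router hashes them into the same range.
    """
    if "!" in dedup_key:
        # '!' is the separator, so it must not appear inside the prefix.
        raise ValueError("dedup_key must not contain '!'")
    return dedup_key + "!" + doc_id


# Three release years of the same (hypothetical) document share one prefix.
ids = [composite_id("productA-install-guide", year) for year in ("2013", "2014", "2015")]
```

All three ids start with the same prefix, which is what keeps the duplicates together on one shard.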
Re: Question on index time de-duplication
Thanks Scott. I could use field collapsing directly on the adskdedup field, without the signature field. The problem with field collapsing is the performance overhead: it slows the query down about 10-fold. CollapsingQParserPlugin is a better option, but unfortunately it doesn't support an ngroups equivalent, which is a requirement for me.
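For reference, the result-grouping request described above might look roughly like the sketch below. group, group.field, group.sort, group.limit and group.ngroups are standard Solr grouping parameters; the release_year sort field and the function name are assumptions made for illustration. Note that in distributed mode group.ngroups is only accurate when all documents of a group live on the same shard.

```python
from urllib.parse import urlencode


def grouping_params(query: str, group_field: str = "adskdedup",
                    group_sort: str = "release_year desc") -> dict:
    """Request parameters for Solr result grouping with a group count."""
    return {
        "q": query,
        "group": "true",
        "group.field": group_field,
        "group.sort": group_sort,   # newest document first within each group
        "group.limit": "1",         # return only the top document per group
        "group.ngroups": "true",    # include the number of groups in the response
    }


# Encode as a query string that could be appended to a /select URL.
query_string = urlencode(grouping_params("installation guide"))
```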
Re: Question on index time de-duplication
Thanks for your reply. Have you customized SignatureUpdateProcessorFactory, or are you using the configuration out of the box? I know it works for simple dedup, but my requirement is a tad different, as I need to tag an identifier to the latest document. My goal is to understand whether that's possible using SignatureUpdateProcessorFactory.
Re: Question on index time de-duplication
At the top of the De-Duplication wiki page is a note about collapsing results. Once you have the signature (identical for each of the duplicates) you'll want to collapse your results, keeping the one with max date.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

k/r,
Scott

On Thu, Oct 29, 2015 at 11:59 PM, Zheng Lin Edwin Yeo wrote:

> Yes, you can try to use the SignatureUpdateProcessorFactory to do a hashing
> of the content to a signature field, and group the signature field during
> your search.
>
> You can find more information here:
> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>
> I have been using this method to group the index with duplicated content,
> and it is working fine.
>
> Regards,
> Edwin

--
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC | 434.409.2780
http://www.opensourceconnections.com
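A minimal sketch of the collapse request Scott describes, using the CollapsingQParserPlugin's `{!collapse field=... max=...}` filter syntax. The field names signature and release_year and the function name are assumptions; max expects a numeric field or function, so a release-year integer is used here in place of a date.

```python
def collapse_params(query: str, signature_field: str = "signature",
                    latest_field: str = "release_year") -> dict:
    """Request parameters that collapse duplicates onto one document each.

    Within every group of documents sharing signature_field, the document
    with the largest latest_field value is kept as the group head.
    """
    collapse_fq = "{!collapse field=" + signature_field + " max=" + latest_field + "}"
    return {"q": query, "fq": collapse_fq}


params = collapse_params("installation guide")
```

Here params["fq"] is the string `{!collapse field=signature max=release_year}`, which would be sent as a filter query alongside the main query.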
Question on index time de-duplication
Hi,

I'm looking to customize index-time de-duplication. Here's my use case and what I'm trying to achieve.

I have identical documents coming from different release years of a given product. I need to index them in Solr, as they are required in the context of each individual year. But there's a generic search which spans all the years and hence brings back duplicate/identical content. My goal is to return only the latest document and filter out the rest. For example, if product A has identical documents for 2015, 2014 and 2013, search should return only 2015 (the latest document) and filter out the rest.

What I'm thinking of (if possible) at index time:

Index all documents, but add a special tag (e.g. dedup=true) to the 2013 and 2014 content, keeping 2015 (the latest release) untouched. At query time, I'll add a filter which excludes content tagged with "dedup".

Just wondering if this is achievable, perhaps by extending UpdateRequestProcessorFactory or customizing SignatureUpdateProcessorFactory?

Any pointers will be appreciated.

Regards,
Shamik
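The tagging idea above can be sketched outside Solr as a pre-processing step over a batch of documents. This is only an illustration under stated assumptions, not an UpdateRequestProcessor: the field names dedup_key and release_year, the function name, and the idea of tagging before sending documents to Solr are all hypothetical. The query-side filter uses ordinary negative filter-query syntax.

```python
def tag_older_duplicates(docs, key_field="dedup_key", year_field="release_year"):
    """Return copies of docs with dedup=True on all but the newest per key."""
    # First pass: find the newest release year for every duplicate key.
    newest = {}
    for doc in docs:
        key = doc[key_field]
        if key not in newest or doc[year_field] > newest[key]:
            newest[key] = doc[year_field]

    # Second pass: tag every document that is older than the newest one.
    tagged = []
    for doc in docs:
        copy = dict(doc)
        if copy[year_field] < newest[copy[key_field]]:
            copy["dedup"] = True
        tagged.append(copy)
    return tagged


# Filter query that hides the tagged (older) documents in the generic search.
EXCLUDE_TAGGED_FQ = "-dedup:true"

docs = [
    {"id": "a-2013", "dedup_key": "A", "release_year": 2013},
    {"id": "a-2015", "dedup_key": "A", "release_year": 2015},
    {"id": "a-2014", "dedup_key": "A", "release_year": 2014},
    {"id": "b-2014", "dedup_key": "B", "release_year": 2014},
]
tagged_docs = tag_older_duplicates(docs)
```

One consequence of this design is that the tags depend on which release is currently the latest, so when a newer release is indexed, the previously untagged documents would need to be re-tagged.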
Re: Question on index time de-duplication
Yes, you can try using the SignatureUpdateProcessorFactory to hash the content into a signature field, and then group on the signature field during your search.

You can find more information here:
https://cwiki.apache.org/confluence/display/solr/De-Duplication

I have been using this method to group the index with duplicated content, and it is working fine.

Regards,
Edwin

On 30 October 2015 at 07:20, Shamik Bandopadhyay wrote:

> Hi,
>
> I'm looking to customize index-time de-duplication. Here's my use case
> and what I'm trying to achieve.
>
> I have identical documents coming from different release years of a given
> product. I need to index them in Solr, as they are required in the context
> of each individual year. But there's a generic search which spans all the
> years and hence brings back duplicate/identical content. My goal is to
> return only the latest document and filter out the rest. For example, if
> product A has identical documents for 2015, 2014 and 2013, search should
> return only 2015 (the latest document) and filter out the rest.
>
> What I'm thinking of (if possible) at index time:
>
> Index all documents, but add a special tag (e.g. dedup=true) to the 2013
> and 2014 content, keeping 2015 (the latest release) untouched. At query
> time, I'll add a filter which excludes content tagged with "dedup".
>
> Just wondering if this is achievable, perhaps by extending
> UpdateRequestProcessorFactory or customizing
> SignatureUpdateProcessorFactory?
>
> Any pointers will be appreciated.
>
> Regards,
> Shamik
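As a rough illustration of what "hashing the content to a signature field" means, the sketch below computes an MD5 digest over a set of content fields, so that documents with identical content get an identical signature regardless of release year. It is not Solr's implementation and will not produce the same bytes as Solr's signature classes; the field names title and body and the function name are assumptions.

```python
import hashlib


def content_signature(doc: dict, fields=("title", "body")) -> str:
    """Hex MD5 digest over the selected content fields of a document."""
    digest = hashlib.md5()
    for name in fields:
        digest.update(str(doc.get(name, "")).encode("utf-8"))
        # Separator byte so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(b"\x00")
    return digest.hexdigest()


doc_2014 = {"id": "a-2014", "title": "Install guide", "body": "Run setup.", "release_year": 2014}
doc_2015 = {"id": "a-2015", "title": "Install guide", "body": "Run setup.", "release_year": 2015}
other = {"id": "b-2015", "title": "Upgrade guide", "body": "Run upgrade.", "release_year": 2015}

signatures = [content_signature(d) for d in (doc_2014, doc_2015, other)]
```

The two install-guide documents share a signature because the id and release year are not part of the hashed fields, which is what makes grouping or collapsing on the signature field possible.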