Re: Removing duplicate documents from search results

François Schiettecatte Tue, 28 Jun 2011 04:47:34 -0700

Create a hash of the URL and use it as the unique key; MD5 or SHA-1 would 
probably be good enough.
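
As a rough sketch of that idea (not a definitive implementation), the snippet below derives a key from the URL before the document is sent to Solr. The function name `url_key` and the light normalization (lowercasing scheme and host, dropping the fragment) are assumptions made for illustration; adjust them to whatever "same URL" means for your sources. It relies on the fact that Solr replaces an existing document when a new one arrives with the same uniqueKey value.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def url_key(url: str) -> str:
    """Return a 40-character hex SHA-1 digest of a lightly normalized URL.

    The normalization here is an assumption, not something Solr does:
    scheme and host are lowercased and any #fragment is dropped.
    """
    parts = urlsplit(url.strip())
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# Hypothetical document; "id" stands in for whatever field your schema
# declares as the uniqueKey.
doc = {
    "id": url_key("http://Example.com/news/story-1#comments"),
    "url": "http://Example.com/news/story-1#comments",
}
```

Indexing a second article with the same URL then produces the same key, so it overwrites the first document instead of creating a duplicate.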


Cheers

François

On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:

> I also have a problem with duplicate docs.
> I am indexing news articles, and every news article has a source URL.
> If two news articles have the same URL, only one needs to be indexed,
> so I want duplicates removed at index time.
> 
> 
> 
> On 23 June 2011 21:24, simon <mtnes...@gmail.com> wrote:
> 
>> Have you checked out the deduplication process that's available at
>> indexing time? This includes a fuzzy hash algorithm.
>> 
>> http://wiki.apache.org/solr/Deduplication
>> 
>> -Simon
>> 
>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pra...@gmail.com> wrote:
>>> This approach would definitely work if the two documents are *exactly* the
>>> same. But it is very fragile: even if one extra space has been added, the
>>> whole hash would change. What I am really looking for is some percentage
>>> similarity between documents, so that I can remove those documents which
>>> are more than 95% similar.
>>> 
>>> *Pranav Prakash*
>>> 
>>> 
>>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:
>>> 
>>>> What you need to do is calculate a hash (using any message digest
>>>> algorithm you want: MD5, SHA-1 and so on), then read up on Solr's
>>>> field collapsing capabilities. It should not be too complicated.
>>>> 
>>>> *Omri Cohen*
>>>> 
>>>> ---------- Forwarded message ----------
>>>> From: Pranav Prakash <pra...@gmail.com>
>>>> Date: Thu, Jun 23, 2011 at 12:26 PM
>>>> Subject: Removing duplicate documents from search results
>>>> To: solr-user@lucene.apache.org
>>>> 
>>>> 
>>>> How can I remove very similar documents from search results?
>>>> 
>>>> My scenario is that there are documents in the index which are almost
>>>> identical (people submitting the same stuff multiple times, sometimes
>>>> different people submitting the same stuff). Now when a search is
>>>> performed for "keyword", the same document quite frequently comes up
>>>> multiple times in the top N results. I want to remove those duplicate
>>>> (or possibly duplicate) documents, very similar to what Google does when
>>>> they say "In order to show you most relevant result, duplicates have been
>>>> removed". How can I achieve this functionality using Solr? Does Solr have
>>>> a built-in feature or plugin which could help me with it?
>>>> 
>>>> 
>>>> *Pranav Prakash*
>>>> 
>>> 
>> 
> 
> 
> 
> -- 
> Thanks and Regards
> Mohammad Shariq
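
For the near-duplicate case raised in the quoted thread (documents that are more than 95% similar, where an exact hash breaks on a single extra space), here is a minimal sketch of one common way to measure similarity: Jaccard overlap of word shingles. This is not the fuzzy signature described on the Solr Deduplication page linked above; the function names, the shingle size of 3 and the way the 0.95 threshold is applied are all assumptions for illustration.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of text, ignoring case and spacing."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts: treat the whole text as one shingle (assumption).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a: str, b: str, threshold: float = 0.95) -> bool:
    """True when the shingle overlap reaches the threshold (0.95 by default)."""
    return jaccard(a, b) >= threshold
```

Because the text is split on whitespace before comparison, an extra space no longer changes the result. Comparing every pair of documents this way does not scale to a large index, which is why a precomputed signature at index time is the usual route.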
