I found the deduplication suggestion really useful. I have not started on it yet, though, as there is some other low-hanging fruit I need to pick first. Will share my thoughts soon.
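For reference, a minimal sketch of the index-time deduplication setup described on the Deduplication wiki page linked downthread. The chain name, the field list and the "signature" field name are illustrative assumptions; TextProfileSignature is the fuzzy option to try when exact hashes (Lookup3Signature, MD5Signature) are too strict for the "more than 95% similar" case:

  <!-- solrconfig.xml: compute a fuzzy signature for every document at index time -->
  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <!-- field the signature is written to; must be defined in schema.xml -->
      <str name="signatureField">signature</str>
      <!-- true: a new document with an existing signature replaces the old one -->
      <bool name="overwriteDupes">true</bool>
      <!-- document fields the signature is computed from (assumed field names) -->
      <str name="fields">title,body</str>
      <!-- near-duplicate (fuzzy) signature; Lookup3Signature/MD5Signature are the exact variants -->
      <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

  <!-- schema.xml: the field that holds the computed signature -->
  <field name="signature" type="string" indexed="true" stored="true" multiValued="false" />

The chain is then enabled on the update handler with update.chain=dedupe (update.processor on older releases). With overwriteDupes=false the duplicates stay in the index and only the shared signature is stored, which is the variant to use if you prefer to collapse them at query time instead.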
*Pranav Prakash*
"temet nosce"
Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>

2011/6/28 François Schiettecatte <fschietteca...@gmail.com>

> Maybe there is a way to get Solr to reject documents that already exist in
> the index, but I doubt it; maybe someone else can chime in here. You could
> do a search for each document prior to indexing it to see if it is already
> in the index, but that is probably non-optimal. Maybe it is easiest to
> check whether the document exists in your Riak repository: if not, add it
> and index it, and drop it if it already exists.
>
> François
>
> On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote:
>
>> I am making the hash from the URL, but I can't use this as the uniqueKey
>> because I am using a UUID as the uniqueKey. Since I am using Solr as the
>> index engine only and Riak (key-value storage) as the storage engine, I
>> don't want to overwrite on duplicates; I just need to discard the
>> duplicates.
>>
>> 2011/6/28 François Schiettecatte <fschietteca...@gmail.com>
>>
>>> Create a hash from the URL and use that as the unique key; md5 or sha1
>>> would probably be good enough.
>>>
>>> Cheers
>>>
>>> François
>>>
>>> On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:
>>>
>>>> I also have the problem of duplicate docs. I am indexing news articles,
>>>> and every news article has a source URL. If two news articles have the
>>>> same URL, only one needs to be indexed; I want the duplicate removed at
>>>> index time.
>>>>
>>>> On 23 June 2011 21:24, simon <mtnes...@gmail.com> wrote:
>>>>
>>>>> Have you checked out the deduplication process that's available at
>>>>> indexing time? It includes a fuzzy hash algorithm.
>>>>>
>>>>> http://wiki.apache.org/solr/Deduplication
>>>>>
>>>>> -Simon
>>>>>
>>>>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pra...@gmail.com> wrote:
>>>>>
>>>>>> This approach would definitely work if the two documents are *exactly*
>>>>>> the same, but it is very fragile: even if one extra space has been
>>>>>> added, the whole hash changes. What I am really looking for is some
>>>>>> percentage similarity between documents, so that I can remove those
>>>>>> which are more than 95% similar.
>>>>>>
>>>>>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:
>>>>>>
>>>>>>> What you need to do is calculate some hash (using any message digest
>>>>>>> algorithm you want: md5, sha-1 and so on), then do some reading on
>>>>>>> Solr's field collapse capabilities. It should not be too complicated.
>>>>>>>
>>>>>>> *Omri Cohen*
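On the query side, Omri's hash-plus-field-collapse suggestion above could look roughly like the following, assuming the documents carry a signature (hash) field as sketched earlier and a Solr version recent enough to have result grouping:

  /select?q=keyword&group=true&group.field=signature&group.limit=1&group.main=true

group.main=true flattens the grouped response back into an ordinary result list with at most one document per signature value, so near-duplicates that share a signature collapse into a single hit.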
>>>>>>>
>>>>>>> ---------- Forwarded message ----------
>>>>>>> From: Pranav Prakash <pra...@gmail.com>
>>>>>>> Date: Thu, Jun 23, 2011 at 12:26 PM
>>>>>>> Subject: Removing duplicate documents from search results
>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>
>>>>>>> How can I remove very similar documents from search results?
>>>>>>>
>>>>>>> My scenario is that there are documents in the index which are almost
>>>>>>> identical (people submitting the same content multiple times, and
>>>>>>> sometimes different people submitting the same content). When a search
>>>>>>> is performed for a keyword, the same document quite frequently comes
>>>>>>> up multiple times in the top N results. I want to remove those
>>>>>>> duplicate (or probable duplicate) documents, very much like what
>>>>>>> Google does when it says "In order to show you the most relevant
>>>>>>> results, duplicates have been removed". How can I achieve this
>>>>>>> functionality using Solr? Does Solr have a built-in feature or plugin
>>>>>>> which could help me with it?
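And a rough illustration of François's check-before-indexing suggestion from earlier in the thread, for the case where duplicates should simply be discarded rather than overwritten. The url_hash field name and the local Solr URL are assumptions; the idea is to query for the hash of the incoming article's URL and only add the document when nothing matches:

  # url_hash is an assumed field holding the md5/sha1 hex digest of the source URL
  curl 'http://localhost:8983/solr/select?q=url_hash:<hash-of-url>&rows=0&wt=json'

If numFound in the response is 0, index the document; otherwise drop it. This costs one extra query per document, which is the overhead François noted.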