This approach would definitely work if the two documents are *exactly* the
same. But it is very fragile: if even one extra space has been added, the
whole hash changes. What I am really looking for is a percentage
similarity between documents, so that I can remove documents that are more
than 95% similar.
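
To illustrate the difference, here is a minimal sketch in Python. It is not
anything Solr provides: the function names, the choice of 3-word shingles and
the Jaccard measure are all assumptions made for illustration. It shows that
an MD5 digest changes when a single space is added, while a set-overlap
similarity score does not, and how a 95% threshold could be applied.

```python
import hashlib


def md5_hex(text):
    """Exact-match fingerprint: any change to the bytes changes the digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of k-word shingles, lowercased, with whitespace collapsed by split()."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the two shingle sets, a value between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


def drop_near_duplicates(docs, threshold=0.95, k=3):
    """Keep a document only if it is not more than `threshold` similar
    to any document already kept (hypothetical helper, quadratic in len(docs))."""
    kept = []
    for doc in docs:
        if all(jaccard(doc, other, k) <= threshold for other in kept):
            kept.append(doc)
    return kept


a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox  jumps over the lazy dog"  # one extra space
c = "a completely different text about search indexing"

print(md5_hex(a) == md5_hex(b))       # the digests differ
print(jaccard(a, b))                  # the shingle sets are identical
print(drop_near_duplicates([a, b, c]))
```

Comparing every pair like this would be too slow for a large index, so it
only shows the kind of measure I have in mind, not how it should be done at
scale.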

*Pranav Prakash*

"temet nosce"

Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
Google <http://www.google.com/profiles/pranny>


On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:

> What you need to do is calculate a hash of each document (using any
> message digest algorithm you want: MD5, SHA-1 and so on), then read up on
> Solr's field collapsing capabilities. It should not be too complicated.
>
> *Omri Cohen*
>
> ---------- Forwarded message ----------
> From: Pranav Prakash <pra...@gmail.com>
> Date: Thu, Jun 23, 2011 at 12:26 PM
> Subject: Removing duplicate documents from search results
> To: solr-user@lucene.apache.org
>
>
> How can I remove very similar documents from search results?
>
> My scenario is that there are documents in the index which are almost
> identical (people submitting the same stuff multiple times, sometimes
> different people submitting the same stuff). Now when a search is
> performed for "keyword", the same document quite frequently comes up
> multiple times in the top N results. I want to remove those duplicate (or
> possibly duplicate) documents, very similar to what Google does when they
> say "In order to show you most relevant result, duplicates have been
> removed". How can I achieve this functionality using Solr? Does Solr have
> a built-in feature or a plugin which could help me with it?
>
>
> *Pranav Prakash*
