This approach would definitely work if the two documents are *exactly* the same, but it is very fragile: even one extra space changes the whole hash. What I am really looking for is a percentage similarity between documents, so that I can remove those documents which are more than 95% similar.
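To make the idea concrete, here is a rough sketch of the kind of similarity check I have in mind. It is not a Solr feature, just an illustration in Python: each document is broken into overlapping word n-grams ("shingles"), and two documents are compared by Jaccard similarity of their shingle sets. The function names, the shingle size `k=5`, and the greedy filtering are all my own assumptions; only the 95% threshold comes from the requirement above.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles of a text.

    Lowercasing and tokenizing on word characters means extra
    whitespace or punctuation does not change the result.
    """
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short text: treat the whole word sequence as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets, in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.95, k=5):
    """Keep a document only if it is not more than `threshold`
    similar to any document already kept (greedy, order-preserving)."""
    kept = []
    kept_shingles = []
    for doc in docs:
        s = shingles(doc, k)
        if all(jaccard(s, other) <= threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

Calling `drop_near_duplicates(results)` on the top N results would keep the first copy of each near-duplicate group. The comparison is pairwise, so this only makes sense on a small result list, not on the whole index.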

*Pranav Prakash*
"temet nosce"
Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>

On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:

> What you need to do, is to calculate some HASH (using any message digest
> algorithm you want, md5, sha-1 and so on), then do some reading on solr
> field collapse capabilities. Should not be too complicated..
>
> *Omri Cohen*
> Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295
> My profiles: LinkedIn <http://www.linkedin.com/in/omric> |
> Twitter <http://www.twitter.com/omricohe> | WordPress <http://omricohen.me>
>
> ---------- Forwarded message ----------
> From: Pranav Prakash <pra...@gmail.com>
> Date: Thu, Jun 23, 2011 at 12:26 PM
> Subject: Removing duplicate documents from search results
> To: solr-user@lucene.apache.org
>
> How can I remove very similar documents from search results?
>
> My scenario is that there are documents in the index which are almost
> similar (people submitting same stuff multiple times, sometimes different
> people submitting same stuff). Now when a search is performed for
> "keyword", in the top N results, quite frequently, same document comes up
> multiple times. I want to remove those duplicate (or possible duplicate)
> documents. Very similar to what Google does when they say "In order to
> show you most relevant result, duplicates have been removed". How can I
> achieve this functionality using Solr? Does Solr has an implied or plugin
> which could help me with it?