Create a hash from the URL and use that as the unique key; MD5 or SHA-1 would probably be good enough.
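
A minimal sketch of that idea in Python, assuming the hashing happens on the client before the document is sent to Solr. The function names `normalize_url` and `url_key` and the field names `id` and `url` are hypothetical, and the light URL normalization is an extra assumption that the suggestion above does not require:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lower-case the host and drop the fragment so trivial variants of one URL match.

    This is an illustrative simplification, not a full URL canonicalizer.
    """
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def url_key(url: str) -> str:
    """Return the hex SHA-1 digest of the normalized URL, for use as the unique key."""
    return hashlib.sha1(normalize_url(url).encode("utf-8")).hexdigest()


# Hypothetical document: "id" is assumed to be the schema's uniqueKey field.
source_url = "http://Example.com/news/article-1#comments"
doc = {"id": url_key(source_url), "url": source_url}
```

Because Solr replaces an existing document when a new one arrives with the same unique key, two articles with the same source URL would end up as a single indexed document. Swapping `hashlib.sha1` for `hashlib.md5` works the same way.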

Cheers

François

On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:

> I also have the problem of duplicate docs.
> I am indexing news articles. Every news article will have the source URL.
> If two news articles have the same URL, only one needs to be indexed,
> with duplicates removed at index time.
>
> On 23 June 2011 21:24, simon <mtnes...@gmail.com> wrote:
>
>> Have you checked out the deduplication process that's available at
>> indexing time? This includes a fuzzy hash algorithm.
>>
>> http://wiki.apache.org/solr/Deduplication
>>
>> -Simon
>>
>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pra...@gmail.com> wrote:
>>> This approach would definitely work if the two documents are *exactly* the
>>> same. But this is very fragile. Even if one extra space has been added, the
>>> whole hash would change. What I am really looking for is some percentage
>>> similarity between documents, so that I can remove those documents which
>>> are more than 95% similar.
>>>
>>> *Pranav Prakash*
>>>
>>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:
>>>
>>>> What you need to do is calculate some hash (using any message digest
>>>> algorithm you want, MD5, SHA-1 and so on), then do some reading on Solr's
>>>> field collapsing capabilities. Should not be too complicated.
>>>>
>>>> *Omri Cohen*
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Pranav Prakash <pra...@gmail.com>
>>>> Date: Thu, Jun 23, 2011 at 12:26 PM
>>>> Subject: Removing duplicate documents from search results
>>>> To: solr-user@lucene.apache.org
>>>>
>>>> How can I remove very similar documents from search results?
>>>>
>>>> My scenario is that there are documents in the index which are almost
>>>> identical (people submitting the same stuff multiple times, sometimes
>>>> different people submitting the same stuff). Now when a search is
>>>> performed for "keyword", in the top N results, quite frequently, the same
>>>> document comes up multiple times. I want to remove those duplicate (or
>>>> possible duplicate) documents. Very similar to what Google does when they
>>>> say "In order to show you most relevant result, duplicates have been
>>>> removed". How can I achieve this functionality using Solr? Does Solr have
>>>> a built-in feature or a plugin which could help me with it?
>>>>
>>>> *Pranav Prakash*
>>>
>>
>
> --
> Thanks and Regards
> Mohammad Shariq