I found the deduplication suggestion really useful. I have not started on it yet, though, as there is some other low-hanging fruit I need to pick first. Will share my thoughts soon.
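For reference, a minimal sketch of the index-time deduplication setup described on the Deduplication wiki page linked downthread. The chain name, the field list and the "signature" field name are illustrative assumptions; TextProfileSignature is the fuzzy option to try when exact hashes (Lookup3Signature, MD5Signature) are too strict for the "more than 95% similar" case:

  <!-- solrconfig.xml: compute a fuzzy signature for every document at index time -->
  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <!-- field the signature is written to; must be defined in schema.xml -->
      <str name="signatureField">signature</str>
      <!-- true: a new document with an existing signature replaces the old one -->
      <bool name="overwriteDupes">true</bool>
      <!-- document fields the signature is computed from (assumed field names) -->
      <str name="fields">title,body</str>
      <!-- near-duplicate (fuzzy) signature; Lookup3Signature/MD5Signature are the exact variants -->
      <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

  <!-- schema.xml: the field that holds the computed signature -->
  <field name="signature" type="string" indexed="true" stored="true" multiValued="false" />

The chain is then enabled on the update handler with update.chain=dedupe (update.processor on older releases). With overwriteDupes=false the duplicates stay in the index and only the shared signature is stored, which is the variant to use if you prefer to collapse them at query time instead.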
*Pranav Prakash*
"temet nosce"
Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>

2011/6/28 François Schiettecatte <fschietteca...@gmail.com>

> Maybe there is a way to get Solr to reject documents that already exist in
> the index, but I doubt it; maybe someone else can chime in here. You could
> do a search for each document prior to indexing it to see if it is already
> in the index, but that is probably non-optimal. Maybe it is easiest to
> check whether the document exists in your Riak repository: if not, add it
> and index it, and drop it if it already exists.
>
> François
>
> On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote:
>
>> I am making the hash from the URL, but I can't use this as the uniqueKey
>> because I am using a UUID as the uniqueKey. Since I am using Solr as the
>> index engine only and Riak (key-value storage) as the storage engine, I
>> don't want to overwrite on duplicates; I just need to discard the
>> duplicates.
>>
>> 2011/6/28 François Schiettecatte <fschietteca...@gmail.com>
>>
>>> Create a hash from the URL and use that as the unique key; md5 or sha1
>>> would probably be good enough.
>>>
>>> Cheers
>>>
>>> François
>>>
>>> On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:
>>>
>>>> I also have the problem of duplicate docs. I am indexing news articles,
>>>> and every news article has a source URL. If two news articles have the
>>>> same URL, only one needs to be indexed; I want the duplicate removed at
>>>> index time.
>>>>
>>>> On 23 June 2011 21:24, simon <mtnes...@gmail.com> wrote:
>>>>
>>>>> Have you checked out the deduplication process that's available at
>>>>> indexing time? It includes a fuzzy hash algorithm.
>>>>>
>>>>> http://wiki.apache.org/solr/Deduplication
>>>>>
>>>>> -Simon
>>>>>
>>>>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pra...@gmail.com> wrote:
>>>>>
>>>>>> This approach would definitely work if the two documents are *exactly*
>>>>>> the same, but it is very fragile: even if one extra space has been
>>>>>> added, the whole hash changes. What I am really looking for is some
>>>>>> percentage similarity between documents, so that I can remove those
>>>>>> which are more than 95% similar.
>>>>>>
>>>>>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:
>>>>>>
>>>>>>> What you need to do is calculate some hash (using any message digest
>>>>>>> algorithm you want: md5, sha-1 and so on), then do some reading on
>>>>>>> Solr's field collapse capabilities. It should not be too complicated.
>>>>>>>
>>>>>>> *Omri Cohen*
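On the query side, Omri's hash-plus-field-collapse suggestion above could look roughly like the following, assuming the documents carry a signature (hash) field as sketched earlier and a Solr version recent enough to have result grouping:

  /select?q=keyword&group=true&group.field=signature&group.limit=1&group.main=true

group.main=true flattens the grouped response back into an ordinary result list with at most one document per signature value, so near-duplicates that share a signature collapse into a single hit.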
>>>>>>>
>>>>>>> ---------- Forwarded message ----------
>>>>>>> From: Pranav Prakash <pra...@gmail.com>
>>>>>>> Date: Thu, Jun 23, 2011 at 12:26 PM
>>>>>>> Subject: Removing duplicate documents from search results
>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>
>>>>>>> How can I remove very similar documents from search results?
>>>>>>>
>>>>>>> My scenario is that there are documents in the index which are almost
>>>>>>> identical (people submitting the same content multiple times, and
>>>>>>> sometimes different people submitting the same content). When a search
>>>>>>> is performed for a keyword, the same document quite frequently comes
>>>>>>> up multiple times in the top N results. I want to remove those
>>>>>>> duplicate (or probable duplicate) documents, very much like what
>>>>>>> Google does when it says "In order to show you the most relevant
>>>>>>> results, duplicates have been removed". How can I achieve this
>>>>>>> functionality using Solr? Does Solr have a built-in feature or plugin
>>>>>>> which could help me with it?
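And a rough illustration of François's check-before-indexing suggestion from earlier in the thread, for the case where duplicates should simply be discarded rather than overwritten. The url_hash field name and the local Solr URL are assumptions; the idea is to query for the hash of the incoming article's URL and only add the document when nothing matches:

  # url_hash is an assumed field holding the md5/sha1 hex digest of the source URL
  curl 'http://localhost:8983/solr/select?q=url_hash:<hash-of-url>&rows=0&wt=json'

If numFound in the response is 0, index the document; otherwise drop it. This costs one extra query per document, which is the overhead François noted.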