I'm very interested in this topic as well. Please let the community know if/when you get something cool implemented =)
On Saturday, June 23, 2012, parnab kumar wrote:
> Hi,
>
> I have crawled and indexed around 2.5 million web pages. However,
> almost 30% of the pages are near duplicates. Is there any functionality
> in SOLR or nutch to remove those near duplicates from the index? Nutch
> dedup command only handles exact duplicates, I guess. Exact duplicates won't
> serve my purpose.
> Please help / advise me on how to address the problem.
>
> Thanks,
> Parnab

