Re: Near Duplicate Detection in nutch /Solr

remi tassing Sat, 23 Jun 2012 02:59:40 -0700

I'm very interested in this topic as well. Please let the community know
if/when you get something cool implemented =)


On Saturday, June 23, 2012, parnab kumar wrote:

> Hi,
>
> I have crawled and indexed around 2.5 million web pages. However,
> almost 30% of the pages are near duplicates. Is there any functionality
> in Solr or Nutch to remove those near duplicates from the index? The Nutch
> dedup command only handles exact duplicates, I guess. Removing only exact
> duplicates won't serve my purpose.
>     Please help / advise me on how to address the problem.
>
> Thanks,
> Parnab
>
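
For readers following the thread, here is a rough illustration of what "near duplicate" can mean in practice. It is a minimal sketch of one common technique (word shingling plus Jaccard similarity), not a Nutch or Solr feature. The function names, the shingle size k=3, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts become a single shingle; empty texts have none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets; two empty sets count as equal."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, threshold=0.8, k=3):
    """Hypothetical helper: True if the shingle sets overlap at or above threshold."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold
```

Comparing every pair of pages this way would not scale to 2.5 million documents. At that size one would more likely compute a compact fuzzy signature per page (for example MinHash or SimHash) and group pages by signature, but that is beyond this sketch.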
