Thanks a lot Markus for your answer, it was very useful to me.
The problem with the DeduplicationJob solution in Nutch is that I have 
sometimes deleted the crawldb, so the duplicate pages exist only in Solr.
I think the best solution for me has to be on the Solr side.
I was reading about dedupe in Solr, and the page below was very useful; 
it explains exactly what I need.

https://cwiki.apache.org/confluence/display/solr/De-Duplication

I have used TextProfileSignature (the fuzzy-hashing implementation from 
Nutch for near-duplicate detection).
I will wait for Tika's Boilerpipe support to avoid repetitive page content.
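For reference, the update-processor chain from that wiki page looks roughly 
like the sketch below; the field names "signature" and "content" are 
placeholders that have to match your own schema:

```xml
<!-- solrconfig.xml: dedupe chain (sketch based on the De-Duplication wiki
     page); "signature" and "content" are placeholder field names -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- when true, Solr deletes existing docs that share the signature
         of a newly indexed doc -->
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain then has to be attached to the update handler (e.g. with 
update.chain=dedupe) so the signature is computed at index time.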
I have detected that two pages have an identical signature.
Do you know of a mechanism to delete the older of these duplicate 
documents in Solr?
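One idea (a sketch, not something from the thread): for documents indexed 
from now on, setting overwriteDupes=true in the signature processor makes 
Solr replace older documents that share a signature. For duplicates that 
are already in the index, you could fetch id, signature, and a timestamp 
field, keep the newest document per signature, and delete the rest by id. 
The field names id/signature/tstamp below are hypothetical; the selection 
logic itself is just:

```python
from collections import defaultdict


def older_duplicate_ids(docs):
    """Given dicts with hypothetical keys 'id', 'signature', and a
    comparable 'tstamp', return the ids of every document except the
    newest one in each signature group."""
    by_signature = defaultdict(list)
    for doc in docs:
        by_signature[doc["signature"]].append(doc)

    stale_ids = []
    for group in by_signature.values():
        # newest first; everything after index 0 is an older duplicate
        group.sort(key=lambda d: d["tstamp"], reverse=True)
        stale_ids.extend(d["id"] for d in group[1:])
    return stale_ids
```

The returned ids could then be sent to Solr as a delete-by-id request, 
followed by a commit.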

October 17: 2015 Cuban Finals of the ACM-ICPC Programming Contest.
http://coj.uci.cu/contest/contestview.xhtml?cid07
