Thanks a lot Markus for your answer, it was very useful to me.
The problem with the DeduplicationJob solution in Nutch is that I have 
sometimes deleted the crawldb, so the duplicate pages exist only in Solr.
I think the best solution for me has to be on the Solr side.
I was reading about dedupe in Solr, and the page below was very useful; 
it explains exactly what I need.

https://cwiki.apache.org/confluence/display/solr/De-Duplication

I have used TextProfileSignature (the fuzzy-hashing implementation from 
Nutch for near-duplicate detection).
I will wait for Tika's Boilerpipe support to avoid repetitive page content.
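For reference, the update-processor chain from that wiki page looks roughly 
like the sketch below; the field names "signature" and "content" are 
placeholders that have to match your own schema:

```xml
<!-- solrconfig.xml: dedupe chain (sketch based on the De-Duplication wiki
     page); "signature" and "content" are placeholder field names -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- when true, Solr deletes existing docs that share the signature
         of a newly indexed doc -->
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain then has to be attached to the update handler (e.g. with 
update.chain=dedupe) so the signature is computed at index time.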
I have detected that two pages have an identical signature.
Do you know of a mechanism to delete the older of these duplicate 
documents in Solr?
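One idea (a sketch, not something from the thread): for documents indexed 
from now on, setting overwriteDupes=true in the signature processor makes 
Solr replace older documents that share a signature. For duplicates that 
are already in the index, you could fetch id, signature, and a timestamp 
field, keep the newest document per signature, and delete the rest by id. 
The field names id/signature/tstamp below are hypothetical; the selection 
logic itself is just:

```python
from collections import defaultdict


def older_duplicate_ids(docs):
    """Given dicts with hypothetical keys 'id', 'signature', and a
    comparable 'tstamp', return the ids of every document except the
    newest one in each signature group."""
    by_signature = defaultdict(list)
    for doc in docs:
        by_signature[doc["signature"]].append(doc)

    stale_ids = []
    for group in by_signature.values():
        # newest first; everything after index 0 is an older duplicate
        group.sort(key=lambda d: d["tstamp"], reverse=True)
        stale_ids.extend(d["id"] for d in group[1:])
    return stale_ids
```

The returned ids could then be sent to Solr as a delete-by-id request, 
followed by a commit.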

October 17: 2015 Cuban Finals of the ACM-ICPC Programming Contest.
http://coj.uci.cu/contest/contestview.xhtml?cid07
