Thanks a lot Markus for your answer, it was very useful. The problem with the DeduplicationJob solution in Nutch is that I have sometimes deleted the crawldb, so the duplicate pages exist only in Solr. I think the best solution for me has to be on the Solr side. I was reading about dedupe in Solr, and the page below was very helpful; it explains exactly what I need:
https://cwiki.apache.org/confluence/display/solr/De-Duplication I have used TextProfileSignature (the fuzzy hashing implementation from Nutch for near-duplicate detection). I will wait for Tika's Boilerpipe support to avoid repetitive page content. I have detected that two pages have an identical signature. Do you know of a mechanism to delete the older of these duplicate documents in Solr?
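One possible approach, sketched below under assumptions: query Solr for documents grouped by the signature field, then issue a delete-by-id for every copy except the most recently indexed one. The field names (`id`, `signature`, `tstamp`) are assumptions and would need to match your schema; the selection logic itself is plain Python:

```python
# Sketch: given documents that share a TextProfileSignature, pick the
# older copies to delete, keeping only the most recently indexed one.
# Field names ("id", "signature", "tstamp") are assumptions -- adjust
# them to match your Solr schema.

def older_duplicates(docs):
    """Group docs by signature; return ids of all but the newest per group."""
    groups = {}
    for doc in docs:
        groups.setdefault(doc["signature"], []).append(doc)
    to_delete = []
    for dupes in groups.values():
        dupes.sort(key=lambda d: d["tstamp"], reverse=True)
        to_delete.extend(d["id"] for d in dupes[1:])  # keep only the newest
    return to_delete

if __name__ == "__main__":
    docs = [
        {"id": "a", "signature": "s1", "tstamp": "2015-01-01T00:00:00Z"},
        {"id": "b", "signature": "s1", "tstamp": "2015-06-01T00:00:00Z"},
        {"id": "c", "signature": "s2", "tstamp": "2015-03-01T00:00:00Z"},
    ]
    print(older_duplicates(docs))
```

In practice you would feed this the results of a Solr query (e.g. faceting on the signature field to find values with count > 1, then fetching the matching docs) and send the returned ids in a delete-by-id update request. Alternatively, if you configure `SignatureUpdateProcessorFactory` with `overwriteDupes=true` and use the signature as the `signatureField`, Solr overwrites duplicates at index time so the cleanup is not needed for newly indexed documents.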

