Hello Eyeris - Boilerpipe support in Nutch is not coming very soon, but it should be straightforward to apply a patch to your version of Nutch. Regarding the duplicates still in Solr: you need to remove these manually, because the update processor only deletes on update, or wait until the crawler refetches one of the duplicates and reindexes it. By faceting on the signature/digest field you can find the duplicates.
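A rough sketch of that manual cleanup, in Python: query Solr for id, signature and fetch time, group by signature, and collect the ids of all but the newest document per group for deletion. The field names `signature` and `tstamp` are assumptions here; adapt them to your schema, and wire the returned ids into a Solr delete-by-id request yourself.

```python
from collections import defaultdict

def older_duplicate_ids(docs):
    """Group docs by signature; return the ids of all but the
    newest document in each group of duplicates.

    `docs` is a list of dicts such as a Solr query with
    fl=id,signature,tstamp would return (field names assumed).
    """
    by_sig = defaultdict(list)
    for doc in docs:
        by_sig[doc["signature"]].append(doc)
    to_delete = []
    for group in by_sig.values():
        if len(group) > 1:
            # Oldest first; keep only the last (newest) document.
            group.sort(key=lambda d: d["tstamp"])
            to_delete.extend(d["id"] for d in group[:-1])
    return to_delete

# Example: docs "a" and "b" share a signature; the older "a" is flagged.
docs = [
    {"id": "a", "signature": "s1", "tstamp": "2015-10-01T00:00:00Z"},
    {"id": "b", "signature": "s1", "tstamp": "2015-10-20T00:00:00Z"},
    {"id": "c", "signature": "s2", "tstamp": "2015-10-05T00:00:00Z"},
]
print(older_duplicate_ids(docs))  # -> ['a']
```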
Markus

-----Original message-----
> From: Eyeris Rodriguez Rueda <[email protected]>
> Sent: Thursday 22nd October 2015 17:34
> To: [email protected]
> Subject: Re: [MASSMAIL]RE: how to avoid duplicate pages in nutch and solr?
>
> Thanks a lot, Markus, for your answer; it was very useful for me.
> The problem with the DeduplicationJob solution in Nutch is that I have
> sometimes deleted the crawldb, so the duplicate pages exist only in Solr.
> I think the best solution for me has to be in Solr.
> I was reading about dedupe in Solr, and the post below was very useful for
> me; it explains exactly what I need.
>
> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>
> I have used TextProfileSignature (the fuzzy hashing implementation from Nutch
> for near-duplicate detection).
> I will wait for Tika Boilerpipe to avoid repetitive page content.
> I have detected that two pages have identical signatures.
> Do you know of a mechanism to delete the older of these duplicate
> documents in Solr?
>
> October 17: 2015 Cuban Final of the ACM-ICPC Programming Contest.
> http://coj.uci.cu/contest/contestview.xhtml?cid07
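For reference, the Solr de-duplication setup described on the wiki page quoted above amounts to an update processor chain in solrconfig.xml like the following sketch. The signature field name and the list of input fields are illustrative; `SignatureUpdateProcessorFactory` and `TextProfileSignature` are the real Solr classes, and `overwriteDupes=true` is what makes new documents replace existing ones with the same signature (which is why it only deletes on update, as noted above).

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Field that stores the computed signature (name is an assumption) -->
    <str name="signatureField">signature</str>
    <!-- Replace existing docs that share the same signature -->
    <bool name="overwriteDupes">true</bool>
    <!-- Fields the signature is computed from (adjust to your schema) -->
    <str name="fields">content</str>
    <!-- Fuzzy hashing for near-duplicate detection -->
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain then has to be referenced from your update request handler for it to take effect.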

