Hello Eyeris,

This is possible with both Nutch and Solr out of the box, but only if you manage to get identical Nutch signatures for all duplicates at all times. Nutch generates a signature for each URL and stores it in the CrawlDB; by default, these signatures are also indexed to Solr. There are two ways to deduplicate:
1. Via Nutch: use the DeduplicationJob to mark URLs as duplicates, then run the clean job to remove them from your indexing backend. This works for both Solr and Elasticsearch.
2. Via Solr: use the SignatureUpdateProcessor. On update, it can remove existing documents that have the same signature/digest as the document you are indexing, thus removing duplicates. This does not work in SolrCloud.

Your HTML pages are dynamic and include hyperlinks to very recent pages, which is going to cause a problem. I assume you are using Nutch's default parse-html or parse-tika to extract text from pages, meaning you will get different output for the same URL at different times, and therefore a different signature. That signature will not match the signature of a near-duplicate, which breaks the deduplication.

There are two options left, but neither is available out of the box in either software package. You can employ proper custom text extraction so that the 'recent items' are no longer extracted as part of the document text, or you can generate a custom LSH signature for all documents and write a custom update processor to deal with hashes that are similar but not identical.

Regards,
Markus

-----Original message-----
> From: Eyeris Rodriguez Rueda <[email protected]>
> Sent: Monday 19th October 2015 22:07
> To: [email protected]
> Subject: how to avoid duplicate pages in nutch and solr?
>
> Hello all.
> I am using Nutch 1.9 (local mode) and Solr 4.10.3.
> I have detected that some pages appear as duplicates in Solr, with different
> URLs but the same information.
> These are two example URLs:
>
> http://www.cubadebate.cu/noticias/2012/07/06/cientificos-espanoles-trabajan-en-gel-para-prevenir-el-sida/
> http://www.cubadebate.cu/noticias/2012/07/06/cientificos-espanoles-trabajan-en-gel-para-prevenir-el-sida/comment-page-1/
>
> How does Nutch deal with duplicate pages?
> Should the solution be in Nutch or in Solr?
> Can anybody suggest a way to avoid and solve this problem?
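
The custom text extraction option mentioned in the reply can be illustrated with a small standalone sketch. This is plain Python, not Nutch code, and it only demonstrates the idea: if a dynamic 'recent items' block is part of the extracted text, two fetches of the same article hash differently, and if the block is skipped they hash the same. The class name `recent-items`, the sample pages, and the helper names `TextExtractor` and `text_signature` are all hypothetical; the real site's markup and Nutch's own signature implementation will differ.

```python
import hashlib
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextExtractor(HTMLParser):
    """Collect page text, skipping any element whose class is in skip_classes.

    Assumes reasonably well-formed markup (every non-void tag is closed).
    """

    def __init__(self, skip_classes=frozenset()):
        super().__init__()
        self.skip_classes = frozenset(skip_classes)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        # Start skipping at a matching element, and keep counting nested tags.
        if self.skip_depth or classes & self.skip_classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def text_signature(html, skip_classes=frozenset()):
    """MD5 over whitespace-normalised text, ignoring the skipped blocks."""
    parser = TextExtractor(skip_classes)
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical pages: same article, different 'recent items' on each fetch.
page_monday = ('<html><body><p>Article body.</p>'
               '<div class="recent-items"><a href="/a">Story A</a></div>'
               '</body></html>')
page_tuesday = ('<html><body><p>Article body.</p>'
                '<div class="recent-items"><a href="/b">Story B</a></div>'
                '</body></html>')

# Hashing all text: the signatures differ, so exact deduplication fails.
print(text_signature(page_monday) == text_signature(page_tuesday))  # False

# Skipping the dynamic block: the signatures match.
skip = {"recent-items"}
print(text_signature(page_monday, skip) == text_signature(page_tuesday, skip))  # True
```

In Nutch itself, the equivalent logic would presumably live in a custom parse plugin, so that the text used for the signature no longer contains the dynamic block.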

