RE: how to avoid duplicate pages in nutch and solr?

Markus Jelsma Mon, 19 Oct 2015 14:15:18 -0700

Hello Eyeris,

This is possible with both Nutch and Solr out of the box, but only if you 
manage to get identical Nutch signatures for all duplicates at all times. Nutch 
generates a signature for each URL and stores it in the CrawlDB; by default 
these signatures are also indexed to Solr. There are two ways to deduplicate:


1. Via Nutch: use the DeduplicationJob to mark URLs as duplicates, then run the 
clean job to remove them from your indexing backend. This works for Solr and 
Elasticsearch. 
2. Via Solr: use the SignatureUpdateProcessor. On update, it can remove documents 
with the same signature/digest as the one you are indexing, thus removing 
duplicates. This does not work in SolrCloud.
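
To make the idea behind both approaches concrete, here is a minimal Python sketch of exact-signature deduplication. It is not the Nutch or Solr implementation: the MD5 digest over the extracted text, the example.org URLs and the "keep the shortest URL" rule are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def signature(text: str) -> str:
    """Exact content digest (MD5 is an assumption standing in for the real signature)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def find_duplicates(pages: Dict[str, str]) -> Dict[str, List[str]]:
    """Group URLs by signature; map each kept URL to the duplicates it replaces."""
    by_signature = defaultdict(list)
    for url, text in pages.items():
        by_signature[signature(text)].append(url)

    result = {}
    for urls in by_signature.values():
        if len(urls) > 1:
            keep = min(urls, key=len)  # hypothetical rule: keep the shortest URL
            result[keep] = [u for u in urls if u != keep]
    return result


# Hypothetical pages: two URLs with identical text, one with different text.
pages = {
    "http://example.org/article/": "Same article text",
    "http://example.org/article/comment-page-1/": "Same article text",
    "http://example.org/other/": "Different text",
}
duplicates = find_duplicates(pages)
print(duplicates)
```

Note that the grouping only works when the two texts are byte-for-byte identical, which is exactly the condition mentioned above.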

Your HTML pages are dynamic and include hyperlinks to very recent pages, which 
is going to cause a problem. I assume you are using Nutch's default parse-html 
or parse-tika to extract text from pages, which means you will get 
different output for the same URL at different times and thus a different 
signature. That signature will not match the signature of a near-duplicate, 
which breaks deduplication.
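
A small Python sketch of why this happens, assuming an MD5 digest over the extracted text (the page text and the "Recent:" block are made up for illustration): a change of a few words in a dynamic block yields a completely different digest.

```python
import hashlib


def digest(text: str) -> str:
    """Exact digest of the extracted text (MD5 assumed for illustration)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical extracted text: the same article fetched on two different days.
article = "Scientists are working on a gel. " * 20
fetch_monday = article + "Recent: story A, story B"
fetch_tuesday = article + "Recent: story B, story C"

same = digest(fetch_monday) == digest(fetch_tuesday)
print(same)  # False: the article is unchanged, but the digests differ
```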

Two options remain, but neither is available out of the box in either software 
package. You can employ proper custom text extraction so that 'recent items' 
are no longer extracted as part of the document text, or you can generate a 
custom LSH signature for all documents and write a custom update processor to 
deal with hashes that are not identical.
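
As a rough illustration of the second option, here is a Python sketch of SimHash, one well-known LSH scheme. I am not claiming it is the right choice for your data: the 3-word shingles, the 64-bit size and the max_distance threshold are assumptions that would need tuning, and the sample texts are made up. Near-duplicates get fingerprints that differ in only a few bits, so they must be compared by Hamming distance rather than by equality, which is why a custom update processor is needed.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """SimHash fingerprint over 3-word shingles (shingle size is an assumption)."""
    words = re.findall(r"\w+", text.lower())
    shingles = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]

    weights = [0] * bits
    for shingle in shingles:
        raw = hashlib.md5(shingle.encode("utf-8")).digest()[:bits // 8]
        h = int.from_bytes(raw, "big")
        for b in range(bits):
            # Each shingle votes +1 or -1 on every bit position.
            weights[b] += 1 if (h >> b) & 1 else -1

    # A bit is set in the fingerprint when the positive votes win.
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, max_distance: int = 8) -> bool:
    """Hypothetical decision rule; the threshold needs tuning on real pages."""
    return hamming(simhash(a), simhash(b)) <= max_distance


# Hypothetical texts: same body, different 'recent items' block.
body = " ".join("word%d" % i for i in range(300))
page_a = body + " recent story one two"
page_b = body + " recent story three four"
print(hamming(simhash(page_a), simhash(page_b)))
```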

Regards,
Markus

 
 
-----Original message-----
> From:Eyeris Rodriguez Rueda <[email protected]>
> Sent: Monday 19th October 2015 22:07
> To: [email protected]
> Subject: how to avoid duplicate pages in nutch and solr?
> 
> Hello all.
> I am using Nutch 1.9 (local mode) and Solr 4.10.3.
> I have noticed that some pages appear as duplicates in Solr, with different 
> URLs but the same information.
> These are two example URLs:
> 
> http://www.cubadebate.cu/noticias/2012/07/06/cientificos-espanoles-trabajan-en-gel-para-prevenir-el-sida/
> http://www.cubadebate.cu/noticias/2012/07/06/cientificos-espanoles-trabajan-en-gel-para-prevenir-el-sida/comment-page-1/
> 
> How does Nutch deal with duplicate pages? 
> Should the solution be in Nutch or in Solr?
> Can anybody suggest a way to avoid and solve this problem? 
>
