Hi,

A few days back there was a discussion on how to extract data from raw HTML content and read it as a DOM (http://lucene.472066.n3.nabble.com/Getting-the-parsed-HTML-content-back-td3916555.html). We have a custom parser which ends up working on the raw content.
This is how it works for us:

Crawl cycle -> Custom URL Filter -> Custom Parser -> Rest of Nutch plugins

In the custom parser, we parse the content as a DOM and populate our database. I am wondering whether Nutch can do anything in this scenario to help with de-duplication of content, or whether it would be the responsibility of the parse logic to verify that the content is not a duplicate by keeping a hash of already-existing content.

I see that there is a Nutch tool for Solr dedup (http://wiki.apache.org/nutch/bin/nutch%20solrdedup), but we are not using Solr.

Also, for link deduplication, is my assumption correct that the CrawlDB would not allow duplicate links to get into it?

Regards | Vikas
www.knoldus.com
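For what it's worth, the "keep a hash of already existing content" idea can be sketched very simply. This is only an illustration, not Nutch code: Nutch itself computes a per-page signature during parsing for a similar purpose, but the class and set names below (ContentDedup, signature, isNew) are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a custom parser could hash each page's content
// and skip the database write when the hash has been seen before.
public class ContentDedup {
    private final Set<String> seen = new HashSet<>();

    // MD5 hex digest of the content, used as a dedup signature.
    static String signature(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }

    // true if this content has not been seen before, i.e. safe to store
    boolean isNew(String content) {
        return seen.add(signature(content));
    }

    public static void main(String[] args) {
        ContentDedup d = new ContentDedup();
        System.out.println(d.isNew("<p>hello</p>")); // true
        System.out.println(d.isNew("<p>hello</p>")); // false: duplicate
    }
}
```

In a real deployment the seen-set would live in the database rather than in memory, and an exact byte hash like this only catches identical content; near-duplicates (boilerplate differences, timestamps) would need a fuzzier signature.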

