Hi,

A few days back there was a discussion on how to extract data from raw HTML content and read it as a DOM (http://lucene.472066.n3.nabble.com/Getting-the-parsed-HTML-content-back-td3916555.html). We have a custom parser which ends up working on the raw content.
This is how it works for us:

Crawl cycle -> Custom URL Filter -> Custom Parser -> Rest of Nutch plugins

In the custom parser, we parse the content as a DOM and populate our database. I am wondering whether Nutch can do anything in this scenario to help with de-duplication of content, or whether it would be the responsibility of the parse logic to verify that the content is not a duplicate by keeping a hash of already-existing content.

I see that there is a Nutch tool for Solr dedup (http://wiki.apache.org/nutch/bin/nutch%20solrdedup), but we are not using Solr.

Also, for link deduplication, is my assumption correct that the CrawlDB would not allow duplicate links to get into it?

Regards | Vikas
www.knoldus.com
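For what it's worth, the "keep a hash of already existing content" idea can be sketched very simply. This is only an illustration, not Nutch code: Nutch itself computes a per-page signature during parsing for a similar purpose, but the class and set names below (ContentDedup, signature, isNew) are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a custom parser could hash each page's content
// and skip the database write when the hash has been seen before.
public class ContentDedup {
    private final Set<String> seen = new HashSet<>();

    // MD5 hex digest of the content, used as a dedup signature.
    static String signature(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }

    // true if this content has not been seen before, i.e. safe to store
    boolean isNew(String content) {
        return seen.add(signature(content));
    }

    public static void main(String[] args) {
        ContentDedup d = new ContentDedup();
        System.out.println(d.isNew("<p>hello</p>")); // true
        System.out.println(d.isNew("<p>hello</p>")); // false: duplicate
    }
}
```

In a real deployment the seen-set would live in the database rather than in memory, and an exact byte hash like this only catches identical content; near-duplicates (boilerplate differences, timestamps) would need a fuzzier signature.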

