Any ideas? On Tue, May 8, 2012 at 4:44 PM, Vikas Hazrati <[email protected]> wrote:
> Hi,
>
> A few days back there was a discussion on how to extract data from raw
> HTML content
> (http://lucene.472066.n3.nabble.com/Getting-the-parsed-HTML-content-back-td3916555.html)
> and how to read it as a DOM. We have a custom parser that works on the
> raw content.
>
> This is how it works for us:
> Crawl cycle - Custom URL Filter - Custom Parser - Rest of Nutch plugins
>
> In the custom parser, we parse the content as a DOM and populate our
> database.
>
> I am wondering: can Nutch do anything in this scenario to help with
> de-duplication of content, or would it be the responsibility of the parse
> logic to verify whether the content is a duplicate by keeping a hash of
> already existing content?
>
> I see that there is a Nutch command for Solr dedup
> (http://wiki.apache.org/nutch/bin/nutch%20solrdedup), but we are not
> using Solr.
>
> Also, for link deduplication, is my assumption correct that the CrawlDB
> would not allow duplicate links into it?
>
> Regards | Vikas
> www.knoldus.com
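For what it's worth, the "keep a hash of already existing content" approach Vikas mentions can be sketched in a few lines. This is a hypothetical illustration, not Nutch's own dedup mechanism: the `ContentDedup` class and its `isNew` method are made-up names, and a real deployment would persist the hashes (e.g. in the database the parser already writes to) rather than an in-memory set.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of content-hash de-duplication in a custom parser.
// In practice the seen-hash set would live in persistent storage, not memory.
public class ContentDedup {
    private final Set<String> seenHashes = new HashSet<>();

    // MD5 of the parsed content, hex-encoded. Any stable digest works here.
    static String hash(String content) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Returns true the first time a given piece of content is seen,
    // false for duplicates (Set.add reports whether the hash was new).
    public boolean isNew(String content) throws Exception {
        return seenHashes.add(hash(content));
    }

    public static void main(String[] args) throws Exception {
        ContentDedup dedup = new ContentDedup();
        System.out.println(dedup.isNew("<html>page A</html>")); // true
        System.out.println(dedup.isNew("<html>page A</html>")); // false
    }
}
```

Note that exact-hash matching only catches byte-identical content; near-duplicates (same page with a different timestamp, say) would need fuzzier signatures.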

