hi

On Thu, 10 May 2012 00:26:40 +0530, Vikas Hazrati <[email protected]> wrote:
> Any ideas?
>
> On Tue, May 8, 2012 at 4:44 PM, Vikas Hazrati <[email protected]> wrote:
>
> > Hi,
> >
> > A few days back there was a discussion on how to extract data from raw
> > HTML content
> > (http://lucene.472066.n3.nabble.com/Getting-the-parsed-HTML-content-back-td3916555.html)
> > and how to read it as a DOM. We have a custom parser which ends up
> > working on the raw content.
> >
> > This is how it works for us:
> > Crawl cycle - Custom URL Filter - Custom Parser - Rest of Nutch plugins
> >
> > In the custom parser, we parse the content as a DOM and populate our
> > database.
> >
> > I am wondering: can Nutch do anything in this scenario to help with
> > de-duplication of content, or is it the responsibility of the parse
> > logic to verify whether the content is a duplicate by keeping a hash
> > of the already existing content?

What do you want to deduplicate? CrawlDB records based on what? Segment records? ParseData? ParseText?
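If it's the parse text you want to deduplicate, the content-hash approach you describe is straightforward to keep in your own parse logic. A minimal sketch (the class and method names here are hypothetical, not Nutch API; in practice the digests would live in your database alongside the parsed rows rather than in memory):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for a custom parser: remembers a digest of every
// piece of parse text seen so far and flags repeats as duplicates.
public class ContentDeduper {
    private final Set<String> seenDigests = new HashSet<>();

    // Returns true the first time this content is seen, false for duplicates.
    public boolean isNew(String parseText) {
        return seenDigests.add(digest(parseText));
    }

    private static String digest(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hash) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```

Hashing the extracted text (rather than the raw bytes) also makes the check robust against markup-only differences between pages.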


> > I see that there is a Nutch plugin for Solr dedup
> > (http://wiki.apache.org/nutch/bin/nutch%20solrdedup), but we are not
> > using Solr.
> >
> > Also, for link deduplication, is my assumption correct that the CrawlDB
> > would not allow duplicate links to get into it?

What link deduplication do you mean? CrawlDB records have a unique key on the URL.
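To illustrate that point: the CrawlDB behaves like a map keyed by URL, so injecting or discovering the same link again merges into the existing entry rather than creating a second record. A toy sketch of that behavior (this is a model for illustration, not Nutch code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the CrawlDB's URL-keyed storage: re-adding a URL that is
// already present updates the existing record instead of duplicating it.
public class CrawlDbModel {
    private final Map<String, String> records = new LinkedHashMap<>();

    // Adding a link for a known URL merges into the old entry.
    public void addLink(String url, String status) {
        records.merge(url, status, (oldStatus, newStatus) -> newStatus);
    }

    public int size() {
        return records.size();
    }
}
```

Note that two URLs are only "the same" if they compare equal as strings, which is why URL normalization before injection matters.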


> > Regards | Vikas
> > www.knoldus.com



--
Markus Jelsma - CTO - Openindex
