hi
On Thu, 10 May 2012 00:26:40 +0530, Vikas Hazrati <[email protected]>
wrote:
Any ideas?
On Tue, May 8, 2012 at 4:44 PM, Vikas Hazrati <[email protected]>
wrote:
Hi,
A few days back there was a discussion on the way to extract data from raw HTML content (http://lucene.472066.n3.nabble.com/Getting-the-parsed-HTML-content-back-td3916555.html) and how to read it as DOM. We have a custom parser which ends up working on the raw content.
This is how it works for us:
Crawl cycle - Custom URL Filter - Custom Parser - Rest of Nutch plugins
In the custom parser, we end up parsing the content as DOM and populating our database.
I am wondering: can Nutch do anything in this scenario to help with de-duplication of content, or would it be the responsibility of the parse logic to verify whether the content is a duplicate by keeping a hash of already existing content?
What do you want to deduplicate? CrawlDB records based on what? Segment
records? ParseData? ParseText?
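If the parse logic does end up owning de-duplication, the hash-keeping idea above could look roughly like the following. This is only a hypothetical sketch, not a Nutch API: the `ContentDeduper` class and its in-memory `HashSet` are assumptions for illustration (a real crawl would persist the digests, e.g. in the database the parser already populates).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: keep a digest of each parsed document's content so
// the custom parser can skip records whose text has already been seen.
public class ContentDeduper {
    private final Set<String> seenDigests = new HashSet<>();

    // Returns true the first time a given content string is seen,
    // false for any later duplicate of the same bytes.
    public boolean isNew(String content) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] hash = md.digest(content.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        // Set.add returns false when the digest was already present.
        return seenDigests.add(hex.toString());
    }

    public static void main(String[] args) throws Exception {
        ContentDeduper deduper = new ContentDeduper();
        System.out.println(deduper.isNew("<p>page one</p>")); // first sighting
        System.out.println(deduper.isNew("<p>page one</p>")); // duplicate
    }
}
```

Note that an exact digest only catches byte-identical content; near-duplicate pages (same text, different boilerplate) would need a fuzzier signature.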
I see that there is a Nutch plugin for Solr dedup (http://wiki.apache.org/nutch/bin/nutch%20solrdedup), but we are not using Solr.
Also, for link deduplication, is my assumption correct that CrawlDB would not allow duplicate links to get inside it?
What link deduplication do you mean? CrawlDB records have a unique key
on the URL.
Regards | Vikas
www.knoldus.com
--
Markus Jelsma - CTO - Openindex