hi
On Thu, 10 May 2012 00:26:40 +0530, Vikas Hazrati <[email protected]>
wrote:
Any ideas?
On Tue, May 8, 2012 at 4:44 PM, Vikas Hazrati <[email protected]>
wrote:
Hi,
A few days back there was a discussion on the way to extract data from raw HTML content (http://lucene.472066.n3.nabble.com/Getting-the-parsed-HTML-content-back-td3916555.html) and how to read it as DOM. We have a custom parser which ends up working on the raw content.
This is how it works for us:
Crawl cycle - Custom URL Filter - Custom Parser - Rest of Nutch plugins
In the custom parser, we end up parsing the content as DOM and populating our database.
I am wondering: can Nutch do anything in this scenario to help with de-duplication of content, or would it be the responsibility of the parse logic to verify whether the content is a duplicate by keeping a hash of already existing content?
What do you want to deduplicate? CrawlDB records based on what? Segment
records? ParseData? ParseText?
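If the parse logic does end up owning de-duplication, the hash-keeping idea above could look roughly like the following. This is only a hypothetical sketch, not a Nutch API: the `ContentDeduper` class and its in-memory `HashSet` are assumptions for illustration (a real crawl would persist the digests, e.g. in the database the parser already populates).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: keep a digest of each parsed document's content so
// the custom parser can skip records whose text has already been seen.
public class ContentDeduper {
    private final Set<String> seenDigests = new HashSet<>();

    // Returns true the first time a given content string is seen,
    // false for any later duplicate of the same bytes.
    public boolean isNew(String content) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] hash = md.digest(content.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        // Set.add returns false when the digest was already present.
        return seenDigests.add(hex.toString());
    }

    public static void main(String[] args) throws Exception {
        ContentDeduper deduper = new ContentDeduper();
        System.out.println(deduper.isNew("<p>page one</p>")); // first sighting
        System.out.println(deduper.isNew("<p>page one</p>")); // duplicate
    }
}
```

Note that an exact digest only catches byte-identical content; near-duplicate pages (same text, different boilerplate) would need a fuzzier signature.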
I see that there is a Nutch plugin for Solr dedup (http://wiki.apache.org/nutch/bin/nutch%20solrdedup), but we are not using Solr.
Also, for link deduplication, is my assumption correct that CrawlDB would not allow duplicate links to get inside it?
What link deduplication do you mean? CrawlDB records have a unique key
on the URL.
Regards | Vikas
www.knoldus.com
--
Markus Jelsma - CTO - Openindex