Any ideas?

On Tue, May 8, 2012 at 4:44 PM, Vikas Hazrati <[email protected]> wrote:

> Hi,
>
> A few days back there was a discussion on the way to extract data from raw
> html content (
> http://lucene.472066.n3.nabble.com/Getting-the-parsed-HTML-content-back-td3916555.html)
> and how to read it as DOM. We have a custom parser which ends up working on
> the raw content.
>
>
> This is how it works for us:
> Crawl cycle -> Custom URL Filter -> Custom Parser -> rest of the Nutch plugins
>
> In the custom parser, we end up parsing content as DOM and populating our
> database.
>
>
> I am wondering: can Nutch do anything in this scenario to help with
> de-duplication of content, or would it be the responsibility of the parse
> logic to check whether content is a duplicate by keeping a hash of the
> content already seen?
>
> I see that there is a Nutch command for Solr dedup (
> http://wiki.apache.org/nutch/bin/nutch%20solrdedup), but we are not using
> Solr.
>
> Also, regarding link deduplication: is my assumption correct that the
> CrawlDB would not allow duplicate links to get into it?
>
> Regards | Vikas
> www.knoldus.com
>
>
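For anyone landing on this thread: one Solr-independent option is to compute a digest of the parsed text inside the parser itself and skip pages whose digest has already been seen. Below is a minimal sketch of that idea; the class name and in-memory set are hypothetical illustrations only (a real crawl would persist the digests, and Nutch itself already computes a per-page signature for its dedup jobs, e.g. MD5-based or text-profile-based signatures).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: content de-duplication by digest in a custom parser.
public class ContentDeduper {
    // In-memory only for illustration; a real crawl would persist this.
    private final Set<String> seenDigests = new HashSet<>();

    // Returns true the first time a given text is seen, false for duplicates.
    public boolean isNew(String parsedText) {
        return seenDigests.add(digest(parsedText));
    }

    // Hex-encoded MD5 of the parsed text.
    private static String digest(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hash) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

An exact-hash scheme only catches byte-identical text; near-duplicate pages (boilerplate differences, timestamps) need a fuzzier signature, which is exactly why Nutch offers a text-profile-style signature in addition to a plain MD5 one.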
