On Thu, May 10, 2012 at 3:25 PM, Markus Jelsma <[email protected]> wrote:
> hi
>
> On Thursday 10 May 2012 15:19:09 Vikas Hazrati wrote:
> > Hi Markus,
> >
> > Thanks for your response. My responses are inline.
> >
> > On Thu, May 10, 2012 at 12:34 AM, Markus Jelsma
> > <[email protected]> wrote:
> > > hi
> > >
> > > On Thu, 10 May 2012 00:26:40 +0530, Vikas Hazrati <[email protected]> wrote:
> > > > Any ideas?
> > > >
> > > > On Tue, May 8, 2012 at 4:44 PM, Vikas Hazrati <[email protected]> wrote:
> > > > > Hi,
> > > > >
> > > > > A few days back there was a discussion on the way to extract data
> > > > > from raw HTML content (
> > > > > http://lucene.472066.n3.nabble.com/Getting-the-parsed-HTML-content-back-td3916555.html )
> > > > > and how to read it as a DOM. We have a custom parser which ends up
> > > > > working on the raw content.
> > > > >
> > > > > This is how it works for us:
> > > > > Crawl cycle - Custom URL Filter - Custom Parser - Rest of Nutch plugins
> > > > >
> > > > > In the custom parser, we end up parsing the content as a DOM and
> > > > > populating our database.
> > > > >
> > > > > I am wondering: can Nutch do anything in this scenario to help with
> > > > > de-duplication of content, or would it be the responsibility of the
> > > > > parse logic to also verify whether the content is a duplicate by
> > > > > keeping a hash of already existing content?
> > >
> > > What do you want to deduplicate? CrawlDB records based on what? Segment
> > > records? ParseData? ParseText?
> >
> > Primarily parse text. But your questions have got me thinking. I guess
> > the parse text might be different because of the dynamic content that
> > might appear on the page at different times, right? Parse data is mostly
> > meta and outlinks, which is not as interesting.
> >
> > Would Nutch have helped if we were getting the same parsed text?
> > Nevertheless, since the data is extracted and persisted before it
> > reaches the segment, it should be the custom parser which is
> > responsible.
>
> Hmm, there is no way to deduplicate segment data. It's also isolated from
> other segments. I think deduplication would only be required when all
> segments end up in some database or index.
>
> You could, however, use the CrawlDatum's signature to deduplicate. With it
> you can find different records sharing the same signature, which is based
> on the parsed text.

Ok, that is where I believe the following could help:

<property>
<name>db.signature.class</name>

Btw, just confirming: since I end up extracting data and persisting it as
part of my custom parser plugin, deduping on segments would not help, as
ultimately the data would get there only _after_ it has reached my db, and
I need a way for the data not to reach the db.

> > > > > I see that there is a Nutch plugin for Solr dedup,
> > > > > http://wiki.apache.org/nutch/bin/nutch%20solrdedup
> > > > > but we are not using Solr.
> > > > >
> > > > > Also, for the link deduplication, is my assumption correct that
> > > > > CrawlDB would not allow duplicate links to get inside it?
> > >
> > > What link deduplication do you mean? CrawlDB records have a unique key
> > > on the URL.
> >
> > Ok good, that helps.
> >
> > > > > Regards | Vikas
> > > > > www.knoldus.com
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
>
> --
> Markus Jelsma - CTO - Openindex
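[Editor's note] For reference, a complete definition of the db.signature.class property mentioned in the thread would go in conf/nutch-site.xml along these lines. TextProfileSignature and the default MD5Signature are the two stock implementations in Nutch 1.x of that era; the description text here is illustrative, not copied from nutch-default.xml:

```xml
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>TextProfileSignature builds the signature from a profile of
  the parsed text, so small changes (e.g. dynamic page fragments) still
  yield the same signature. The default, org.apache.nutch.crawl.MD5Signature,
  hashes the content exactly, so any byte-level change produces a new
  signature.</description>
</property>
```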
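[Editor's note] Since the dedup check here has to happen before the data reaches the database, the "keep a hash of already existing content" idea from the thread could be sketched as below. This is an illustration only, not Nutch API: the class name, the shouldPersist method, and the in-memory set are all hypothetical, and a real setup would store the signatures durably (e.g. a unique index on a signature column) so they survive across crawl cycles:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: skip persisting parse text whose signature was already seen. */
public class ContentDeduper {
    // In-memory only; a production variant would back this with the database.
    private final Set<String> seenSignatures = new HashSet<>();

    /** Returns true if the text is new and should be persisted, false if a duplicate. */
    public boolean shouldPersist(String parseText) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(parseText.getBytes(StandardCharsets.UTF_8));
        String signature = new BigInteger(1, digest).toString(16);
        // Set.add returns false when the signature is already present.
        return seenSignatures.add(signature);
    }

    public static void main(String[] args) throws Exception {
        ContentDeduper deduper = new ContentDeduper();
        System.out.println(deduper.shouldPersist("some parsed text"));  // true
        System.out.println(deduper.shouldPersist("some parsed text"));  // false (duplicate)
        System.out.println(deduper.shouldPersist("other parsed text")); // true
    }
}
```

Note this hashes the parse text exactly, like MD5Signature does, so dynamic page fragments defeat it; mimicking TextProfileSignature (hashing a frequency profile of the terms instead of the raw text) would be the more tolerant variant.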

