On Thu, May 10, 2012 at 3:25 PM, Markus Jelsma <[email protected]> wrote:
> hi
>
> On Thursday 10 May 2012 15:19:09 Vikas Hazrati wrote:
> > Hi Markus,
> >
> > Thanks for your response. My responses are inline.
> >
> > On Thu, May 10, 2012 at 12:34 AM, Markus Jelsma
> > <[email protected]> wrote:
> > > hi
> > >
> > > On Thu, 10 May 2012 00:26:40 +0530, Vikas Hazrati <[email protected]> wrote:
> > > > Any ideas?
> > > >
> > > > On Tue, May 8, 2012 at 4:44 PM, Vikas Hazrati <[email protected]> wrote:
> > > > > Hi,
> > > > >
> > > > > A few days back there was a discussion on the way to extract data
> > > > > from raw HTML content (
> > > > > http://lucene.472066.n3.nabble.com/Getting-the-parsed-HTML-content-back-td3916555.html )
> > > > > and how to read it as a DOM. We have a custom parser which ends up
> > > > > working on the raw content.
> > > > >
> > > > > This is how it works for us:
> > > > > Crawl cycle - Custom URL Filter - Custom Parser - Rest of Nutch plugins
> > > > >
> > > > > In the custom parser, we end up parsing the content as a DOM and
> > > > > populating our database.
> > > > >
> > > > > I am wondering: can Nutch do anything in this scenario to help with
> > > > > de-duplication of content, or would it be the responsibility of the
> > > > > parse logic to also verify whether the content is a duplicate by
> > > > > keeping a hash of already existing content?
> > >
> > > What do you want to deduplicate? CrawlDB records based on what? Segment
> > > records? ParseData? ParseText?
> >
> > Primarily parse text. But your questions have got me thinking. I guess
> > the parse text might be different because of the dynamic content that
> > might appear on the page at different times, right? Parse data is mostly
> > meta and outlinks, which is not as interesting.
> >
> > Would Nutch have helped if we were getting the same parsed text?
> > Nevertheless, since the data is extracted and persisted before it
> > reaches the segment, it should be the custom parser which is
> > responsible.
>
> Hmm, there is no way to deduplicate segment data. It's also isolated from
> other segments. I think deduplication would only be required when all
> segments end up in some database or index.
>
> You could, however, use the CrawlDatum's signature to deduplicate. With it
> you can find different records sharing the same signature, which is based
> on the parsed text.

Ok, that is where I believe the following could help:

<property>
<name>db.signature.class</name>

Btw, just confirming: since I end up extracting data and persisting it as
part of my custom parser plugin, deduping on segments would not help, as
ultimately the data would get there only _after_ it has reached my db, and
I need a way for the data not to reach the db.

> > > > > I see that there is a Nutch plugin for Solr dedup,
> > > > > http://wiki.apache.org/nutch/bin/nutch%20solrdedup
> > > > > but we are not using Solr.
> > > > >
> > > > > Also, for the link deduplication, is my assumption correct that
> > > > > CrawlDB would not allow duplicate links to get inside it?
> > >
> > > What link deduplication do you mean? CrawlDB records have a unique key
> > > on the URL.
> >
> > Ok good, that helps.
> >
> > > > > Regards | Vikas
> > > > > www.knoldus.com
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
>
> --
> Markus Jelsma - CTO - Openindex
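[Editor's note] For reference, a complete definition of the db.signature.class property mentioned in the thread would go in conf/nutch-site.xml along these lines. TextProfileSignature and the default MD5Signature are the two stock implementations in Nutch 1.x of that era; the description text here is illustrative, not copied from nutch-default.xml:

```xml
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>TextProfileSignature builds the signature from a profile of
  the parsed text, so small changes (e.g. dynamic page fragments) still
  yield the same signature. The default, org.apache.nutch.crawl.MD5Signature,
  hashes the content exactly, so any byte-level change produces a new
  signature.</description>
</property>
```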
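[Editor's note] Since the dedup check here has to happen before the data reaches the database, the "keep a hash of already existing content" idea from the thread could be sketched as below. This is an illustration only, not Nutch API: the class name, the shouldPersist method, and the in-memory set are all hypothetical, and a real setup would store the signatures durably (e.g. a unique index on a signature column) so they survive across crawl cycles:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: skip persisting parse text whose signature was already seen. */
public class ContentDeduper {
    // In-memory only; a production variant would back this with the database.
    private final Set<String> seenSignatures = new HashSet<>();

    /** Returns true if the text is new and should be persisted, false if a duplicate. */
    public boolean shouldPersist(String parseText) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(parseText.getBytes(StandardCharsets.UTF_8));
        String signature = new BigInteger(1, digest).toString(16);
        // Set.add returns false when the signature is already present.
        return seenSignatures.add(signature);
    }

    public static void main(String[] args) throws Exception {
        ContentDeduper deduper = new ContentDeduper();
        System.out.println(deduper.shouldPersist("some parsed text"));  // true
        System.out.println(deduper.shouldPersist("some parsed text"));  // false (duplicate)
        System.out.println(deduper.shouldPersist("other parsed text")); // true
    }
}
```

Note this hashes the parse text exactly, like MD5Signature does, so dynamic page fragments defeat it; mimicking TextProfileSignature (hashing a frequency profile of the terms instead of the raw text) would be the more tolerant variant.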

