This is really tricky given the familiarity I have with Nutch. I will try with Sørensen similarity as you suggested. Thanks for the input, Markus.
Regards,
Madan Patil

On Wed, Feb 18, 2015 at 1:28 PM, Markus Jelsma <[email protected]> wrote:

> Hi - this is not going to work. The URLFilter interface operates on single
> URLs only; it is not aware of content, nor of possible metadata (simhash)
> attached to the CrawlDatum. It would be more straightforward to implement
> Signature and calculate the simhash there.
>
> Now, Nutch has a DeduplicationJob, but it operates on equal signatures as
> the MapReduce key, and this is not going to work with simhashes. I remember
> there was a trick to get similar hashes into the same key buckets by
> emitting them to multiple buckets from the mapper, so that in the reducer
> you can compute a Sørensen similarity on the hashes.
>
> This is really tricky stuff, especially getting the hashes into the same
> bucket.
>
> Are you doing this to remove duplicates from search results? Then it might
> be easier to implement the Sørensen similarity in a custom Lucene
> collector. Because the top docs contain the duplicates, they pass through
> the same collector implementation, giving a single point at which to
> remove them. The problem then is that it won't really work with
> distributed search unless you hash similar URLs to the same shard, but
> the cluster then becomes unbalanced and difficult to manage, plus IDF and
> norms become skewed.
>
> Good luck, we have tried many different approaches to this problem,
> especially online deduplication. But offline is also hard because of the
> reducer keys.
>
> Markus
>
> -----Original message-----
> > From: Madan Patil <[email protected]>
> > Sent: Wednesday 18th February 2015 22:16
> > To: user <[email protected]>
> > Subject: Re: URL filter plugins for nutch
> >
> > I am not sure I understand you right, but here is what I am trying to
> > implement:
> >
> > I have implemented Charikar's simhash and now want to use it to detect
> > near-duplicates/duplicates. I would like to make it a plugin (one which
> > implements the URLFilter interface).
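[Editor's note: the "emit to multiple buckets" trick Markus describes can be sketched roughly as below. This is not code from the thread; the 64-bit hash width, the 3-bit tolerance, and the band count are illustrative assumptions. The idea is an LSH-style banding scheme: split the hash into k+1 bands, so by pigeonhole any two hashes within Hamming distance k agree on at least one whole band, and emitting each (band index, band value) pair as a mapper key lands near-duplicates in at least one common reducer bucket.]

```python
# Sketch of the multi-bucket trick for grouping similar simhashes
# (illustrative parameters, not from the thread).

HASH_BITS = 64
K = 3                           # tolerate up to 3 differing bits
BANDS = K + 1                   # pigeonhole: <= K flips leave one band intact
BAND_BITS = HASH_BITS // BANDS  # 16 bits per band

def map_to_buckets(url, simhash):
    """Mapper side: emit the record once per band, keyed by the band value."""
    for band in range(BANDS):
        band_value = (simhash >> (band * BAND_BITS)) & ((1 << BAND_BITS) - 1)
        yield (band, band_value), (url, simhash)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def reduce_bucket(values, max_distance=K):
    """Reducer side: verify candidate pairs that landed in the same bucket."""
    vals = list(values)
    pairs = []
    for i in range(len(vals)):
        for j in range(i + 1, len(vals)):
            (u1, h1), (u2, h2) = vals[i], vals[j]
            if hamming(h1, h2) <= max_distance:
                pairs.append((u1, u2))
    return pairs
```

Since a hash is emitted once per band, each record is duplicated BANDS times across the shuffle; the final distance check in the reducer filters out bucket collisions that are not actually near-duplicates.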
> > Hence filter out all those URLs whose content is nearly the same as ones
> > which have already been fetched. Would this be possible, or am I heading
> > in the wrong direction?
> >
> > Thanks for your patience, Markus.
> >
> > Regards,
> > Madan Patil
> >
> > On Wed, Feb 18, 2015 at 1:05 PM, Markus Jelsma <[email protected]> wrote:
> > >
> > > Easiest is to set the signature to TextProfileSignature and delete
> > > duplicates from the index, but you will still crawl and waste
> > > resources on them. Or are you by any chance trying to prevent spider
> > > traps from being crawled?
> > >
> > > -----Original message-----
> > > > From: Madan Patil <[email protected]>
> > > > Sent: Wednesday 18th February 2015 21:58
> > > > To: user <[email protected]>
> > > > Subject: Re: URL filter plugins for nutch
> > > >
> > > > Hi Markus,
> > > >
> > > > I am looking for the ones with similar content.
> > > >
> > > > Regards,
> > > > Madan Patil
> > > >
> > > > On Wed, Feb 18, 2015 at 12:53 PM, Markus Jelsma <[email protected]> wrote:
> > > > >
> > > > > By near-duplicate, do you mean similar URLs, or URLs with similar
> > > > > content?
> > > > >
> > > > > -----Original message-----
> > > > > > From: Madan Patil <[email protected]>
> > > > > > Sent: Wednesday 18th February 2015 21:10
> > > > > > To: user <[email protected]>
> > > > > > Subject: URL filter plugins for nutch
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am working on an assignment where I am supposed to use Nutch
> > > > > > to crawl Antarctic data. I am writing a plugin which extends
> > > > > > URLFilter to avoid crawling duplicate (exact and near-duplicate)
> > > > > > URLs. All the plugins, the default ones and others on the web,
> > > > > > operate on only one URL: they decide what to do based on the
> > > > > > content of that single URL.
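[Editor's note: for reference, the Charikar-style simhash Madan mentions having implemented can be computed along these lines. This is a minimal sketch; the whitespace tokenizer, MD5 feature hash, and uniform weights are illustrative assumptions, not details from the thread.]

```python
import hashlib

def simhash(text, bits=64):
    """Charikar-style simhash: per-bit votes summed over feature hashes."""
    votes = [0] * bits
    for token in text.lower().split():  # naive tokenizer, illustrative
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # Each output bit is the sign of its vote tally.
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Because the fingerprint is a sum over a bag of features, documents sharing most of their tokens tend to produce fingerprints at a small Hamming distance, which is what makes the bucketing-plus-similarity approach discussed above workable.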
> > > > > > Could anyone point me to resources which would help me compare
> > > > > > the content of one URL with the ones already crawled?
> > > > > >
> > > > > > Thanks in advance.
> > > > > >
> > > > > > Regards,
> > > > > > Madan Patil
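[Editor's note: one concrete way to compare the content of two pages along the lines discussed in this thread is the Sørensen (Dice) coefficient over sets of word shingles, 2|A∩B| / (|A| + |B|). The thread does not specify this exact formulation; the shingle size here is an illustrative assumption.]

```python
def shingles(text, n=3):
    """Set of n-word shingles from whitespace-split tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def sorensen_dice(a, b):
    """Sørensen coefficient: 2|A∩B| / (|A| + |B|), in [0, 1]."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))
```

A crawl-side deduplicator would then flag a fetched page as a near-duplicate when its coefficient against a stored page exceeds some threshold (e.g. 0.9), with the bucketing trick above used to limit how many stored pages each new page is compared against.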

