This is really tricky given the familiarity I have with Nutch. I will
try the Sørensen approach as you suggested.
Thanks for the input Markus.

Regards,
Madan Patil

On Wed, Feb 18, 2015 at 1:28 PM, Markus Jelsma <[email protected]>
wrote:

> Hi - this is not going to work. The URLFilter interface operates on single
> URLs only; it is not aware of content, nor of any metadata (such as a
> simhash) attached to the CrawlDatum. It would be more straightforward to
> implement Signature and calculate the simhash there. Now, Nutch has a
> DeduplicationJob, but it operates on equal signatures as the MapReduce key,
> and that is not going to work with simhashes. I remember there was a trick
> to get similar hashes into the same key buckets by emitting them to
> multiple buckets from the mapper, so that in the reducer you can do a
> Sørensen similarity comparison on the hashes.
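>
> A rough sketch of the Signature idea (untested; SimHash.of() here stands
> in for your own Charikar implementation, it is not a Nutch class):
>
>   import org.apache.nutch.crawl.Signature;
>   import org.apache.nutch.parse.Parse;
>   import org.apache.nutch.protocol.Content;
>
>   // Use the simhash as the page signature, so it ends up on the
>   // CrawlDatum / in the CrawlDb where a dedup job can get at it.
>   public class SimHashSignature extends Signature {
>     @Override
>     public byte[] calculate(Content content, Parse parse) {
>       long hash = SimHash.of(parse.getText()); // your Charikar simhash
>       byte[] bytes = new byte[8];
>       for (int i = 0; i < 8; i++) {
>         bytes[i] = (byte) (hash >>> (56 - 8 * i));
>       }
>       return bytes;
>     }
>   }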
>
> This is really tricky stuff, especially getting the hashes in the same
> bucket.
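>
> The trick is basically the banding scheme from locality-sensitive hashing:
> split the 64-bit simhash into, say, 4 bands of 16 bits. Two hashes within
> Hamming distance 3 then necessarily agree on at least one band, so the
> mapper emits each record once per band (sketch only, untested):
>
>   // Mapper: emit once per 16-bit band so that near-identical hashes
>   // collide on at least one reducer key.
>   for (int band = 0; band < 4; band++) {
>     long bandValue = (simhash >>> (16 * band)) & 0xFFFFL;
>     context.write(new Text(band + ":" + bandValue),
>                   new Text(url + "\t" + simhash));
>   }
>
>   // Reducer: pairwise-compare everything that landed in one bucket,
>   // e.g. by Hamming distance on the hashes (or Sørensen on token sets).
>   boolean nearDuplicate = Long.bitCount(hashA ^ hashB) <= 3;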
>
> Are you doing this to remove duplicates from search results? Then it might
> be easier to implement the Sørensen similarity in a custom Lucene
> collector. Because the top docs contain the duplicates, they pass through
> the same collector implementation, which gives you a single point to remove
> them. The problem then is that it won't really work with distributed
> search, unless you hash similar URLs to the same shard, but the cluster
> then becomes unbalanced and difficult to manage, plus IDF and norms become
> skewed.
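>
> Such a collector would look roughly like this (Lucene 4.x style API,
> untested sketch; it assumes the simhash is indexed per document as a
> NumericDocValues field named "simhash"):
>
>   import java.io.IOException;
>   import java.util.ArrayList;
>   import java.util.List;
>   import org.apache.lucene.index.AtomicReaderContext;
>   import org.apache.lucene.index.NumericDocValues;
>   import org.apache.lucene.search.Collector;
>   import org.apache.lucene.search.Scorer;
>
>   // Drop any hit whose simhash is within Hamming distance 3 of a hit
>   // accepted earlier in the result stream.
>   public class DedupCollector extends Collector {
>     private final List<Long> seen = new ArrayList<Long>();
>     private NumericDocValues simhashes;
>
>     @Override
>     public void setScorer(Scorer scorer) {}
>
>     @Override
>     public void setNextReader(AtomicReaderContext context) throws IOException {
>       simhashes = context.reader().getNumericDocValues("simhash");
>     }
>
>     @Override
>     public void collect(int doc) {
>       long hash = simhashes.get(doc);
>       for (long other : seen) {
>         if (Long.bitCount(hash ^ other) <= 3) {
>           return; // near-duplicate of an earlier hit, skip it
>         }
>       }
>       seen.add(hash);
>       // ... hand doc on to the wrapped collector / top-docs logic
>     }
>
>     @Override
>     public boolean acceptsDocsOutOfOrder() {
>       return false;
>     }
>   }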
>
> Good luck, we have tried many different approaches to this problem,
> especially online deduplication. But offline is also hard because of
> reducer keys.
>
> Markus
>
>
>
> -----Original message-----
> > From:Madan Patil <[email protected]>
> > Sent: Wednesday 18th February 2015 22:16
> > To: user <[email protected]>
> > Subject: Re: URL filter plugins for nutch
> >
> > I am not sure I understand you right, but here is what I am trying to
> > implement:
> >
> > I have implemented Charikar's simhash and now want to use it to detect
> > near-duplicates/duplicates.
> > I would like to make it a plugin (one that implements the URLFilter
> > interface), and hence filter out all those URLs whose content is nearly
> > the same as one that has already been fetched. Would this be possible,
> > or am I heading in the wrong direction?
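> >
> > For reference, my simhash follows the usual Charikar scheme, roughly
> > (hash64() is just some 64-bit token hash, not shown here):
> >
> >   long simhash(Iterable<String> tokens) {
> >     int[] v = new int[64];
> >     for (String t : tokens) {
> >       long h = hash64(t);
> >       // each token votes +1/-1 on every bit position
> >       for (int i = 0; i < 64; i++) {
> >         v[i] += ((h >>> i) & 1L) != 0 ? 1 : -1;
> >       }
> >     }
> >     long result = 0L;
> >     // the sign of each vote tally becomes the output bit
> >     for (int i = 0; i < 64; i++) {
> >       if (v[i] > 0) result |= 1L << i;
> >     }
> >     return result;
> >   }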
> >
> > Thanks for your patience Markus.
> >
> >
> > Regards,
> > Madan Patil
> >
> > On Wed, Feb 18, 2015 at 1:05 PM, Markus Jelsma <
> [email protected]>
> > wrote:
> >
> > > Easiest is to set the signature to TextProfileSignature and delete
> > > duplicates from the index, but you will still crawl them and waste
> > > resources on them. Or are you by any chance trying to prevent spider
> > > traps from being crawled?
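> > >
> > > For reference, that is a one-property change in conf/nutch-site.xml
> > > (property name from memory):
> > >
> > >   <property>
> > >     <name>db.signature.class</name>
> > >     <value>org.apache.nutch.crawl.TextProfileSignature</value>
> > >     <description>Near-identical text then yields equal signatures,
> > >     so the dedup job can delete the duplicates from the index.
> > >     </description>
> > >   </property>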
> > >
> > >
> > >
> > > -----Original message-----
> > > > From:Madan Patil <[email protected]>
> > > > Sent: Wednesday 18th February 2015 21:58
> > > > To: user <[email protected]>
> > > > Subject: Re: URL filter plugins for nutch
> > > >
> > > > Hi Markus,
> > > >
> > > > I am looking for the ones with similar content.
> > > >
> > > > Regards,
> > > > Madan Patil
> > > >
> > > > On Wed, Feb 18, 2015 at 12:53 PM, Markus Jelsma <
> > > [email protected]>
> > > > wrote:
> > > >
> > > > > By near-duplicate do you mean similar URLs, or URLs with similar
> > > > > content?
> > > > >
> > > > > -----Original message-----
> > > > > > From:Madan Patil <[email protected]>
> > > > > > Sent: Wednesday 18th February 2015 21:10
> > > > > > To: user <[email protected]>
> > > > > > Subject: URL filter plugins for nutch
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am working on an assignment where I am supposed to use Nutch to
> > > > > > crawl Antarctic data.
> > > > > > I am writing a plugin which extends URLFilter so that duplicate
> > > > > > (exact and near-duplicate) URLs are not crawled. All the plugins,
> > > > > > the default ones and the others on the web, only ever see one URL:
> > > > > > they decide what to do based on the content of a single URL.
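> > > > > >
> > > > > > As far as I can tell, the URLFilter extension point
> > > > > > (org.apache.nutch.net.URLFilter) only ever receives the URL
> > > > > > string, roughly:
> > > > > >
> > > > > >   public interface URLFilter extends Pluggable, Configurable {
> > > > > >     // Return the (possibly rewritten) URL, or null to reject it.
> > > > > >     String filter(String urlString);
> > > > > >   }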
> > > > > >
> > > > > > Could anyone point me to resources that would help me compare the
> > > > > > content of one URL with the ones already crawled?
> > > > > >
> > > > > > Thanks in advance.
> > > > > >
> > > > > > Regards,
> > > > > > Madan Patil
> > > > > >
> > > > >
> > > >
> > >
> >
>
