The easiest approach is to set the signature to TextProfileSignature and delete duplicates from the index, but you will still crawl them and waste resources doing so. Or are you by any chance trying to prevent spider traps from being crawled?
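
To illustrate why a text-profile style signature catches near-duplicates, here is a rough Python sketch. It is only loosely modeled on the idea behind TextProfileSignature (tokenize, count, quantize the counts, hash the resulting profile) and is not Nutch's actual code; the function name, the tokenizer regex and the default parameters are my own assumptions. In Nutch itself the signature class is, as far as I know, selected with the db.signature.class property in nutch-site.xml.

```python
import hashlib
import re
from collections import Counter


def text_profile_signature(text, min_token_len=2, quant_rate=0.01):
    """Hypothetical, simplified text-profile signature (not Nutch's code)."""
    # Lowercase, split on non-alphanumerics, drop very short tokens.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step: a fraction of the highest frequency, but at least 2
    # whenever some token repeats, so that one-off words fall out.
    max_freq = max(counts.values())
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round each count down to a multiple of quant; drop tokens that hit 0.
    profile = []
    for token, count in counts.items():
        quantized = (count // quant) * quant
        if quantized > 0:
            profile.append((token, quantized))

    # Sort by descending frequency, then by token, for a stable ordering.
    profile.sort(key=lambda item: (-item[1], item[0]))
    body = "\n".join(f"{token} {freq}" for token, freq in profile)
    return hashlib.md5(body.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    page_a = "antarctic ice sheet data " * 5 + "updated monday"
    page_b = "antarctic ice sheet data " * 5 + "updated tuesday"
    print(text_profile_signature(page_a) == text_profile_signature(page_b))
```

Two pages that differ only in rare words end up with the same signature, so they can be treated as duplicates after fetching. Note that this needs the page content, which is why it cannot be done in a URLFilter: a filter only ever sees the URL string.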

-----Original message-----
> From:Madan Patil <[email protected]>
> Sent: Wednesday 18th February 2015 21:58
> To: user <[email protected]>
> Subject: Re: URL filter plugins for nutch
>
> Hi Markus,
>
> I am looking for the ones with similar content.
>
> Regards,
> Madan Patil
>
> On Wed, Feb 18, 2015 at 12:53 PM, Markus Jelsma <[email protected]>
> wrote:
>
> > By near-duplicate do you mean similar URLs, or URLs with similar content?
> >
> > -----Original message-----
> > > From:Madan Patil <[email protected]>
> > > Sent: Wednesday 18th February 2015 21:10
> > > To: user <[email protected]>
> > > Subject: URL filter plugins for nutch
> > >
> > > Hi,
> > >
> > > I am working on an assignment where I am supposed to use Nutch to crawl
> > > Antarctic data.
> > > I am writing a plugin which extends URLFilter to not crawl duplicate
> > > (exact and near-duplicate) URLs. All the plugins, the default ones and
> > > others on the web, look at only one URL. They decide what to do or not
> > > to do based on the content of one URL.
> > >
> > > Could anyone point me to resources which would help me compare the
> > > content of one URL with the ones already crawled?
> > >
> > > Thanks in advance.
> > >
> > > Regards,
> > > Madan Patil

