Hi Markus,

I am looking for the ones with similar content.
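
To make "similar content" concrete: one common way to score it is to split each page's text into overlapping word shingles and compare the two sets with Jaccard similarity. The sketch below is only an illustration of that idea under my own assumptions. It is not Nutch code, and the class name, method names, shingle size and threshold are all hypothetical. As far as I understand, a URLFilter only receives the URL string, so a comparison like this would have to run somewhere the fetched page text is available.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: not part of Nutch. All names here are hypothetical.
final class ShingleSimilarity {

    // Break text into overlapping runs of k lowercase words ("shingles").
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return result;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty documents are treated as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Two texts count as near-duplicates when their score reaches the threshold.
    static boolean isNearDuplicate(String a, String b, int k, double threshold) {
        return jaccard(shingles(a, k), shingles(b, k)) >= threshold;
    }
}
```

Comparing every new page against every crawled page this way gets expensive as the crawl grows, so in practice a compact fingerprint per page would probably be stored instead of the full shingle sets. The threshold would need tuning on real data.
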

Regards,
Madan Patil

On Wed, Feb 18, 2015 at 12:53 PM, Markus Jelsma <[email protected]> wrote:

> By near-duplicate you mean similar URLs, or URLs with similar content?
>
> -----Original message-----
> > From: Madan Patil <[email protected]>
> > Sent: Wednesday 18th February 2015 21:10
> > To: user <[email protected]>
> > Subject: URL filter plugins for nutch
> >
> > Hi,
> >
> > I am working on an assignment where I am supposed to use Nutch to crawl
> > Antarctic data.
> > I am writing a plugin which extends URLFilter so that it does not crawl
> > duplicate (exact and near-duplicate) URLs. All the plugins, the default
> > ones and others on the web, look at only one URL. They decide what to do
> > or not to do based on the content of one URL.
> >
> > Could anyone point me to resources which would help me compare the content
> > of one URL with the ones already crawled?
> >
> > Thanks in advance.
> >
> > Regards,
> > Madan Patil

