By near-duplicate, do you mean similar URLs, or URLs with similar content?
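
If it is the second case, note that a URLFilter only receives the URL string, so comparing page content would have to happen somewhere after fetching and parsing. As a rough illustration of what "similar content" could mean, here is a minimal sketch that compares two page texts by the Jaccard similarity of their word shingles. It is plain Java, not Nutch API; the class name NearDuplicateSketch, the method names, and the shingle size are all made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Nutch: compares two texts by shingle overlap.
class NearDuplicateSketch {

    // Split lowercased text on whitespace and collect every run of k consecutive words.
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return out;
    }

    // Jaccard similarity: |A intersect B| / |A union B|, defined as 1.0 for two empty sets.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }
}
```

Two pages could then be treated as near duplicates when the score is above some threshold you choose (say 0.9, which is only an assumed value). Comparing every new page against all pages already crawled this way gets expensive quickly, so in practice a compact signature per page is usually stored and compared instead.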
 
-----Original message-----
> From:Madan Patil <[email protected]>
> Sent: Wednesday 18th February 2015 21:10
> To: user <[email protected]>
> Subject: URL filter plugins for nutch
> 
> Hi,
> 
> I am working on an assignment where I am supposed to use Nutch to crawl
> Antarctic data.
> I am writing a plugin which extends URLFilter to not crawl duplicate (exact
> and near-duplicate) URLs. All the plugins, the default ones and others on
> the web, look at only one URL. They decide what to do or not to do based on
> the content of that one URL.
> 
> Could anyone point me to resources that would help me compare the content
> of one URL with the ones already crawled?
> 
> Thanks in advance.
> 
> Regards,
> Madan Patil
> 
