Hi Markus,

I am looking for the ones with similar content.

Regards,
Madan Patil

On Wed, Feb 18, 2015 at 12:53 PM, Markus Jelsma <[email protected]>
wrote:

> By near-duplicate, do you mean similar URLs, or URLs with similar content?
>
> -----Original message-----
> > From:Madan Patil <[email protected]>
> > Sent: Wednesday 18th February 2015 21:10
> > To: user <[email protected]>
> > Subject: URL filter plugins for nutch
> >
> > Hi,
> >
> > I am working on an assignment where I am supposed to use Nutch to crawl
> > Antarctic data.
> > I am writing a plugin which extends URLFilter to not crawl duplicate (exact
> > and near-duplicate) URLs. All the plugins, the default ones and others on
> > the web, work with only one URL. They decide what to do or not to do based
> > on the content of one URL.
> >
> > Could anyone point me to resources which would help me compare the content
> > of one URL with the ones already crawled?
> >
> > Thanks in advance.
> >
> > Regards,
> > Madan Patil
> >
>
