Hi,

I am working on an assignment where I am supposed to use Nutch to crawl
Antarctic data.
I am writing a plugin that extends URLFilter so that duplicate (exact and
near-duplicate) URLs are not crawled. All the plugins I have looked at, both
the default ones and others on the web, work with only one URL at a time.
They decide what to do based on the content of that one URL.

Could anyone point me to resources that would help me compare the content of
one URL with the content of the URLs already crawled?
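
To make the question concrete, here is a rough sketch of the kind of
comparison I have in mind. It is plain Java and does not use any Nutch API;
the class name SeenContentIndex, its methods, the shingle size, and the
similarity threshold are all my own assumptions for illustration. It treats
a page as an exact duplicate if its normalized text was seen before, and as a
near duplicate if the Jaccard similarity of its word shingles with any
earlier page reaches the threshold. What I do not know is where something
like this would hook into Nutch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Nutch: remembers the text of pages
// already seen and reports whether a new page is an exact or near duplicate.
public class SeenContentIndex {
    private final Set<String> exactTexts = new HashSet<>();
    private final List<Set<String>> shingleSets = new ArrayList<>();
    private final int shingleSize;   // words per shingle (assumed value)
    private final double threshold;  // Jaccard similarity cut-off (assumed value)

    public SeenContentIndex(int shingleSize, double threshold) {
        this.shingleSize = shingleSize;
        this.threshold = threshold;
    }

    // Split text into overlapping runs of k consecutive words.
    static Set<String> shingles(String text, int k) {
        String[] words = text.split("\\s+");
        Set<String> out = new HashSet<>();
        if (words.length < k) {
            // Text shorter than one shingle: use the whole text as one shingle.
            out.add(String.join(" ", words));
            return out;
        }
        for (int i = 0; i + k <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return out;
    }

    // Jaccard similarity: size of intersection divided by size of union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / union;
    }

    // Returns true if the text duplicates an earlier page; otherwise
    // records the text and returns false.
    public boolean isDuplicate(String text) {
        String normalized = text.toLowerCase().trim().replaceAll("\\s+", " ");
        if (exactTexts.contains(normalized)) {
            return true; // exact duplicate after normalization
        }
        Set<String> current = shingles(normalized, shingleSize);
        for (Set<String> seen : shingleSets) {
            if (jaccard(current, seen) >= threshold) {
                return true; // near duplicate
            }
        }
        exactTexts.add(normalized);
        shingleSets.add(current);
        return false;
    }
}
```

This compares every new page against every stored one, so it would only
work for a small crawl; I assume a real solution would use some kind of
fingerprint instead.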

Thanks in advance.

Regards,
Madan Patil
