Hi, I am working on an assignment where I am supposed to use Nutch to crawl Antarctic data. I am writing a plugin that extends URLFilter so that duplicate URLs (exact and near duplicates) are not crawled. All the plugins I have found, both the default ones and others on the web, look at only one URL at a time: they decide what to do or not to do based on the content of that single URL.
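
To make the goal concrete, the comparison could look roughly like the sketch below. It is plain Java and does not use any Nutch API. The class name `SeenContentIndex`, the method `addIfNew`, the 3-word shingle size and the similarity threshold are all hypothetical choices for illustration. Exact duplicates are detected with an MD5 digest of the text, near duplicates with Jaccard similarity over word shingles.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of Nutch): remembers the text of pages seen so far. */
final class SeenContentIndex {
    private static final int SHINGLE_SIZE = 3;        // words per shingle (assumed value)
    private final double nearDuplicateThreshold;      // Jaccard cutoff (assumed value)
    private final Set<String> exactDigests = new HashSet<>();
    private final List<Set<String>> shingleSets = new ArrayList<>();

    SeenContentIndex(double nearDuplicateThreshold) {
        this.nearDuplicateThreshold = nearDuplicateThreshold;
    }

    /** Hex-encoded MD5 digest of the text, used for the exact-duplicate check. */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Set of overlapping word n-grams (shingles) of the lower-cased text. */
    static Set<String> shingles(String text) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<String> result = new HashSet<>();
        if (words.length < SHINGLE_SIZE) {
            result.add(String.join(" ", words));  // short text: one shingle
            return result;
        }
        for (int i = 0; i + SHINGLE_SIZE <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + SHINGLE_SIZE)));
        }
        return result;
    }

    /** Jaccard similarity: size of the intersection divided by size of the union. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    /** Returns true and records the text if it is new; returns false for a duplicate. */
    boolean addIfNew(String text) {
        String digest = md5Hex(text);
        if (exactDigests.contains(digest)) {
            return false;  // exact duplicate
        }
        Set<String> candidate = shingles(text);
        for (Set<String> seen : shingleSets) {
            if (jaccard(candidate, seen) >= nearDuplicateThreshold) {
                return false;  // near duplicate
            }
        }
        exactDigests.add(digest);
        shingleSets.add(candidate);
        return true;
    }
}
```

The near-duplicate loop compares against every stored page, which is only workable for a small crawl; the open part is where the text of the already crawled pages would come from inside a Nutch plugin.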
Could anyone point me to resources that would help me compare the content of one URL with the content of the ones already crawled? Thanks in advance. Regards, Madan Patil

