Hi Shiva,

1. You can define URL normalizer rules to rewrite the URLs, but this only works
for sites where you know which URL form is the canonical one.
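For point 1, such rules go into conf/regex-normalize.xml (read by the
urlnormalizer-regex plugin, which must be enabled in plugin.includes). A
sketch, assuming the two hosts from your mail really do serve identical
content under both URL variants -- verify that before applying rules like
these:

```xml
<regex-normalize>
  <!-- Strip "www." for a host known to serve the same content either way. -->
  <regex>
    <pattern>^(https?://)www\.samacharplus\.com/</pattern>
    <substitution>$1samacharplus.com/</substitution>
  </regex>
  <!-- Drop a single trailing slash on non-root paths for this host. -->
  <regex>
    <pattern>^(http://eng\.belta\.by/.+)/$</pattern>
    <substitution>$1</substitution>
  </regex>
</regex-normalize>
```

The patterns are Java regular expressions; anchoring them to a specific host
keeps the rewrite from breaking sites where www. and non-www. (or slash and
no-slash) are genuinely different pages.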
2. You can deduplicate (command "nutch dedup") based on the content checksum:
the duplicates are still crawled but are deleted afterwards.

It's a frequent problem (plus http:// vs. https://), but there is no solution
that works for all sites because every site or web server behaves differently.
A well-configured server wouldn't present variant URLs and could also redirect
the user or crawler to the canonical page.

Best,
Sebastian

On 03/15/2018 10:12 AM, ShivaKarthik S wrote:
> Hi,
>
> I am crawling many websites using Nutch-1.11, 1.13, or 1.14. While crawling
> I am getting near-duplicate URLs like the following, where the content is
> exactly the same:
>
> *_Case 1: URLs with and without www_*
> http://www.samacharplus.com/~samachar/index.php/en/worlds/11-india/24151-nine-crpf-soldiers-martyred-in-naxal-attack-in-sukma
> http://samacharplus.com/~samachar/index.php/en/worlds/11-india/24151-nine-crpf-soldiers-martyred-in-naxal-attack-in-sukma
>
> *_Case 2: URLs ending with and without a slash (/)_*
> http://eng.belta.by/news-headers
> http://eng.belta.by/news-headers/
> http://eng.belta.by/products
> http://eng.belta.by/products/
>
> Nutch is not able to handle this and sends a separate document in each
> case, although the URLs are actually duplicates. Can you give me a solution
> to handle these kinds of pages and treat them as a single one?
>
> --
> Thanks and Regards
> Shiva
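P.S. For anyone wondering what the checksum-based deduplication in (2) boils
down to: each fetched page gets a digest of its content, and URLs sharing a
digest are flagged as duplicates (the first one seen is kept). A minimal
stand-alone sketch in plain Python -- this is an illustration, not Nutch
code, and the URLs/bodies are made-up examples based on the mail above:

```python
import hashlib

# Hypothetical fetched documents: URL -> page body (made-up example data).
docs = {
    "http://eng.belta.by/news-headers":  b"<html>same content</html>",
    "http://eng.belta.by/news-headers/": b"<html>same content</html>",
    "http://eng.belta.by/products":      b"<html>other content</html>",
}

seen = {}        # digest -> canonical URL (first URL seen wins)
duplicates = []  # URLs flagged for deletion, as "nutch dedup" would do

for url, body in docs.items():
    digest = hashlib.sha1(body).hexdigest()
    if digest in seen:
        duplicates.append(url)   # same checksum as an earlier URL
    else:
        seen[digest] = url

print(duplicates)
```

Note the limitation: the variant URLs are still fetched (wasting crawl
budget); only the index/crawldb is cleaned up afterwards. Normalizer rules
(point 1) avoid the duplicate fetch in the first place.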