Hi Shiva,

1. You can define URL normalizer rules to rewrite the URLs, but this
   only works for sites where you know which form of the URL is the
   canonical one (see the example rules below).

2. You can deduplicate (command "nutch dedup") based on the content
   checksum: the duplicates are still crawled but deleted afterwards
   (see the command sketch below).
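
Re 1.: a rough sketch of what such rules could look like (untested,
and you'd have to verify per site that both URL forms really return
the same content). The rules go into conf/regex-normalize.xml, which
is read by the urlnormalizer-regex plugin (enabled in the default
plugin.includes); the two <regex> elements below belong inside the
existing <regex-normalize> root element of that file:

  <!-- strip a leading "www." from the host name -->
  <regex>
    <pattern>^(https?://)www\.</pattern>
    <substitution>$1</substitution>
  </regex>

  <!-- remove a trailing slash -->
  <regex>
    <pattern>(.+)/$</pattern>
    <substitution>$1</substitution>
  </regex>

Iirc you can check the effect of the rules on a given URL with
"bin/nutch normalizerchecker".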
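
Re 2.: roughly like this (the path crawl/crawldb is just a placeholder
for your crawl directory):

  # mark duplicates in the CrawlDb based on the signature (content checksum)
  bin/nutch dedup crawl/crawldb

  # remove the documents marked as duplicates from the index
  bin/nutch clean crawl/crawldb

If you run the bin/crawl script, it should already include a dedup and
clean step for you.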

It's a frequent problem (plus http:// vs. https://), but there is no
solution that works for all sites because each site or web server
behaves differently. A well-configured server wouldn't present variant
URLs in the first place, or would redirect the user or crawler to the
canonical page.
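
Just for illustration (that's a web server configuration issue, not a
Nutch one, and I'm assuming nginx here), such a canonical redirect
could look like:

  server {
      listen 80;
      server_name www.samacharplus.com;
      # permanent redirect to the canonical host
      return 301 http://samacharplus.com$request_uri;
  }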

Best,
Sebastian



On 03/15/2018 10:12 AM, ShivaKarthik S wrote:
> Hi,
> 
>       I am crawling many websites using Nutch 1.11, 1.13, or 1.14.
> While crawling I am getting near-duplicate URLs like the following,
> where the content is exactly the same:
> 
> *_Case 1: URLs with and without WWW_*
> http://www.samacharplus.com/~samachar/index.php/en/worlds/11-india/24151-nine-crpf-soldiers-martyred-in-naxal-attack-in-sukma
> http://samacharplus.com/~samachar/index.php/en/worlds/11-india/24151-nine-crpf-soldiers-martyred-in-naxal-attack-in-sukma
> 
> *_Case 2: URLs ending with and without a slash (/)_*
> http://eng.belta.by/news-headers
> http://eng.belta.by/news-headers/
> http://eng.belta.by/products
> http://eng.belta.by/products/
> 
> Nutch is not able to handle this and treats each as a separate
> document, whereas they are actually duplicate URLs. Can you give me
> a solution to handle these kinds of pages and treat them as a single
> one?
> 
> -- 
> Thanks and Regards
> Shiva
