Another analysis job just finished, and it shows how serious this problem 
is for internet crawling: 10.01% of all domains we've visited so far contain 
one or more duplicates of themselves.

We're glad we select and limit the generator by domain name; this limits 
the problem of wasted resources. But it also hinders crawling of sites with 
many legitimate sub domains, such as wikipedia: when limiting on domain, 
such a massive website will never be fully indexed. Limiting on host, on 
the other hand, would mean we quickly download _all_ duplicates of a bad site.
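
To make the trade-off concrete, here is a minimal sketch of the two counting 
keys (the naive last-two-labels rule stands in for a proper public suffix 
lookup, and all class and method names are made up for illustration):

import java.net.URI;

// Sketch of the two keys a generator could limit on.
public class CountKey {

    // Limiting on host: the full host name is the key.
    static String hostKey(String url) throws Exception {
        return new URI(url).getHost();
    }

    // Limiting on domain: naively the last two labels. A real
    // implementation must consult the public suffix list, or
    // example.co.uk collapses to co.uk.
    static String domainKey(String url) throws Exception {
        String[] labels = hostKey(url).split("\\.");
        int n = labels.length;
        return n <= 2 ? hostKey(url)
                      : labels[n - 2] + "." + labels[n - 1];
    }

    public static void main(String[] args) throws Exception {
        // All wikipedia hosts share one domain key, and thus one
        // generator budget, which is why the site never gets
        // fully indexed under domain mode.
        System.out.println(domainKey("http://en.wikipedia.org/"));  // wikipedia.org
        System.out.println(domainKey("http://de.wikipedia.org/"));  // wikipedia.org
        System.out.println(hostKey("http://en.wikipedia.org/"));    // en.wikipedia.org
    }
}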

We intend to filter these legitimate sites out by calculating the ratio of 
duplicate sub domains per domain and comparing it against some threshold 
that is yet to be determined.
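
In rough code, the idea is something like this (a sketch only; how a sub 
domain gets flagged as duplicate comes from the analysis job, and the 0.5 
threshold is a placeholder for the value we still have to determine):

import java.util.Set;

public class DuplicateRatioFilter {

    // Placeholder; the real threshold is yet to be determined.
    static final double THRESHOLD = 0.5;

    // allSubDomains: every sub domain seen for one domain.
    // duplicateSubDomains: the subset the analysis job flagged as
    // duplicating another sub domain of the same domain.
    static boolean isBadDomain(Set<String> allSubDomains,
                               Set<String> duplicateSubDomains) {
        double ratio =
            (double) duplicateSubDomains.size() / allSubDomains.size();
        // A few duplicates among many legitimate sub domains (the
        // wikipedia case) stay below the threshold; a bad site where
        // nearly every sub domain mirrors the others exceeds it.
        return ratio > THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(isBadDomain(
            Set.of("www", "m", "forum", "massage", "massages"),
            Set.of("m", "massage", "massages")));  // 0.6 -> true
    }
}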

> Hi,
> 
> How do you handle the issue of duplicate (sub) domains? We measure a
> significant number of duplicate pages across sub domains. Badly behaved
> websites, for example, do not enforce a single sub domain and accept
> anything. With regex normalizers we can easily tackle a portion of the
> problem by normalizing www derivatives such as ww. wwww. or www.w.ww.www.
> to www. This still leaves a huge number of incorrect sub domains, leading
> to duplicates of _entire_ websites.
> 
> We've built analysis jobs to detect and list duplicate pages within sub
> domains (they also work across domains), which we can then reduce with
> another job to a list of bad sub domains. Still, one sub domain for each
> given domain must be kept, but I've yet to figure out which sub domain
> should prevail.
> 
> Here's an example of one such site:
> 113425:188    example.org
> 114314:186    startpagina.example.org
> 114334:186    mobile.example.org
> 114339:186    massages.example.org
> 114340:186    massage.example.org
> 114362:186    http.www.example.org
> 114446:185    www.example.org
> 115280:184    m.example.org
> 115316:184    forum.example.org
> 
> In this case it may be simple to select www as the sub domain we want to
> keep, but it is not always so trivial.
> 
> Does anyone have insights to share on the edge cases that make up the
> bulk of the duplicates?
> 
> Thanks,
> markus
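
As for the www derivatives mentioned above, a normalizer rule along these 
lines covers the quoted examples (a sketch only; the pattern is my 
approximation of the intent and would need tuning against real host names):

import java.util.regex.Pattern;

public class WwwNormalizer {

    // A leading run of labels consisting only of w's: ww.,
    // wwww., www.w.ww.www., and so on.
    static final Pattern WWW_DERIVATIVES =
        Pattern.compile("^(w+\\.)+");

    // Collapse any such run to the canonical "www." prefix.
    static String normalizeHost(String host) {
        return WWW_DERIVATIVES.matcher(host).replaceFirst("www.");
    }

    public static void main(String[] args) {
        System.out.println(normalizeHost("www.w.ww.www.example.org")); // www.example.org
        System.out.println(normalizeHost("ww.example.org"));           // www.example.org
        System.out.println(normalizeHost("forum.example.org"));        // unchanged
    }
}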
