Hi,

How do you handle duplicate (sub) domains? We measure a significant number
of duplicate pages across sub domains. Badly configured websites, for
example, do not enforce a single sub domain and accept any. With regex
normalizers we can easily tackle part of the problem by normalizing www
derivatives such as ww., wwww. or www.w.ww.www. to www. This still leaves a
huge number of incorrect sub domains, leading to duplicates of _entire_
websites.
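
To make the www normalization concrete, here is a minimal sketch in Python. It is not our actual normalizer configuration; the function name normalize_host and the exact pattern are only assumptions for illustration. It collapses any run of leading labels made up solely of "w" characters into a single "www.", and leaves the host alone if fewer than two labels would remain.

```python
import re

# Hypothetical pattern: one or more leading labels consisting only of "w",
# followed (lookahead) by at least two more labels, so that a host such as
# "ww.com" is not rewritten to "www.com".
WWW_DERIVATIVE = re.compile(r"^(?:w+\.)+(?=[^.]+\.[^.]+)")

def normalize_host(host: str) -> str:
    """Collapse www derivatives (ww., wwww., www.w.ww.www.) into www."""
    host = host.lower()
    return WWW_DERIVATIVE.sub("www.", host, count=1)

if __name__ == "__main__":
    for h in ("ww.example.org", "wwww.example.org",
              "www.w.ww.www.example.org", "forum.example.org"):
        print(h, "->", normalize_host(h))
```

Note that this would also rewrite a legitimate w.example.org to www.example.org, and it does nothing for hosts such as http.www.example.org, so it only covers the easy portion of the problem.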

We've built analysis jobs to detect and list duplicate pages within sub
domains (they also work across domains), whose output we can then reduce
with another job to a list of bad sub domains. Yet one sub domain per
domain must be kept, and I have yet to figure out which one should prevail.

Here's an example of one such site:
113425:188      example.org
114314:186      startpagina.example.org
114334:186      mobile.example.org
114339:186      massages.example.org
114340:186      massage.example.org
114362:186      http.www.example.org
114446:185      www.example.org
115280:184      m.example.org
115316:184      forum.example.org

In this case it may be simple to select www as the sub domain we want to keep 
but it is not always so trivial.
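
For the simple case, the selection could be sketched as below. This is only an assumed heuristic, not something we run: prefer www, then the bare domain, then the shortest remaining host. The function name pick_canonical and the ranking order are hypothetical.

```python
def pick_canonical(hosts, domain):
    """Pick the host to keep from a group of duplicate hosts of one domain.

    Assumed ranking (illustrative only):
      0. www.<domain>
      1. the bare <domain>
      2. any other host, shortest first, ties broken alphabetically
    """
    def rank(host):
        if host == "www." + domain:
            return (0, 0, host)
        if host == domain:
            return (1, 0, host)
        return (2, len(host), host)

    return min(hosts, key=rank)

if __name__ == "__main__":
    duplicates = [
        "example.org", "startpagina.example.org", "mobile.example.org",
        "massages.example.org", "massage.example.org",
        "http.www.example.org", "www.example.org", "m.example.org",
        "forum.example.org",
    ]
    print(pick_canonical(duplicates, "example.org"))
```

The third rule is exactly where this breaks down: when neither www nor the bare domain is among the duplicates, the shortest host is just a guess.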

Does anyone have inspiring insights to share on the edge cases that make up
the bulk of duplicates?

Thanks,
markus
