Hi,

How do you handle the issue of duplicate (sub)domains? We measure a significant number of duplicate pages across subdomains. Bad websites, for example, do not force a single subdomain and accept anything. With regex normalizers we can easily tackle a portion of the problem by normalizing www derivatives such as ww., wwww. or www.w.ww.www. to www. This still leaves a huge number of incorrect subdomains, leading to duplicates of _entire_ websites.

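
To make the www normalization concrete, here is a minimal sketch in Python of the kind of rule meant above. It is not the actual normalizer configuration in use. The function name `normalize_www` and the exact pattern are assumptions made for illustration. The pattern collapses any leading run of labels consisting only of the letter w into a single www., and it only fires when at least two labels remain afterwards, so a host like www.com is left alone. A site that legitimately serves different content on w.example.org would be wrongly rewritten, so treat this as a heuristic.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Leading labels made only of 'w' (ww., wwww., www.w.ww.www.), followed by
# at least two more labels so that the registered domain is never consumed.
_W_LABELS = re.compile(r"^(?:w+\.)+(?=[^.]+\.)")


def normalize_www(url: str) -> str:
    """Collapse www derivatives in the host of `url` to a single 'www.'.

    Hypothetical helper for illustration only. Userinfo (user:pass@) is
    dropped when the host is rewritten, which is acceptable for a sketch.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""          # hostname is lower-cased by urlsplit
    fixed = _W_LABELS.sub("www.", host)
    if fixed == host:
        return url                       # nothing to normalize
    netloc = fixed if parts.port is None else f"{fixed}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    for u in (
        "http://ww.example.org/a?b=1",
        "http://www.w.ww.www.example.org/",
        "http://http.www.example.org/",  # not a www derivative: left as-is
    ):
        print(u, "->", normalize_www(u))
```

Hosts such as http.www.example.org or m.example.org pass through unchanged, which is exactly the remaining part of the problem.
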
We've built analysis jobs to detect and list duplicate pages within subdomains (this also works across domains), which we can then reduce with another job to a list of bad subdomains. Yet one subdomain must be kept for each domain, and I have yet to figure out which subdomain should prevail.

Here's an example of one such site:

113425:188 example.org
114314:186 startpagina.example.org
114334:186 mobile.example.org
114339:186 massages.example.org
114340:186 massage.example.org
114362:186 http.www.example.org
114446:185 www.example.org
115280:184 m.example.org
115316:184 forum.example.org

In this case it may be simple to select www as the subdomain we want to keep, but it is not always so trivial. Does anyone have inspiring insights to share for the edge cases that make up the bulk of duplicates?

Thanks,
markus

