Hi Mathijs,

I've already implemented several of your clues and am getting good results. The final problem is deciding when an entire subdomain should be filtered out based on just a couple of duplicates; I've seen similar metrics for both good and bad sites.
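To make that concrete, this is roughly the kind of decision rule I have in mind. It is only a minimal sketch with made-up class names and thresholds, not our actual analysis job:

// Hypothetical sketch only: decide whether a whole subdomain should be
// dropped, given how many of its pages duplicate content already seen on
// a sibling subdomain of the same domain.
public class SubdomainFilterSketch {

  // Placeholder thresholds; real values would need tuning per corpus.
  private static final int MIN_PAGES = 25;
  private static final double DUPLICATE_RATIO = 0.9;

  /** True if the subdomain looks like a duplicate of a sibling. */
  static boolean shouldFilter(long totalPages, long duplicatePages) {
    if (totalPages < MIN_PAGES) {
      // Too little evidence: a couple of duplicates alone should not
      // condemn an entire subdomain.
      return false;
    }
    return (double) duplicatePages / totalPages >= DUPLICATE_RATIO;
  }

  public static void main(String[] args) {
    // e.g. mobile.example.org: 186 pages, 180 duplicating www.example.org
    System.out.println(shouldFilter(186, 180)); // true
    System.out.println(shouldFilter(8, 3));     // false: not enough pages
  }
}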
I haven't yet decided on how to prevent false positives for these edge cases.

Thanks,
Markus

On Sunday 27 November 2011 20:12:48 Mathijs Homminga wrote:
> Hi Markus,
>
> What is your definition of duplicate (sub)domains?
>
> By reading your examples, I think you are looking for domains (or host
> IPs) that are interchangeable. That is, domains that give an identical
> response when combined with the same protocol, port, path and query (a
> URL).
>
> You could indeed use heuristics (like normalizing wwww. to www.).
>
> I guess that most of the time this happens when the domain has set a
> wildcard DNS record (catch-all). No guarantee, however, that wildcard
> domains act 'identical', of course. Although (sub)domains may point to
> the same canonical name or IP address, they may still give different
> responses because of domain/URL-based dispatching on that host (think
> virtual hosts in Apache) or application-level logic. I guess this is
> why you can never be 100% sure that the domains are duplicates...
>
> Clues I can think of (none of them are hard guarantees):
>
> - Your heuristics using common patterns.
> - Do a DNS lookup of the domains: do they point to another domain or an
>   IP address which is shared among other domains?
> - Did we find duplicate URLs on different hosts?
> - Quick: if there are a lot of identical URLs (path+query of substantial
>   length) on different subdomains, then the domains might be identical.
> - You might want to include a content check in the above.
> - Actively check a fingerprint of the main page of each subdomain (e.g.
>   title + some headers) and group domains based on this.
>
> I'm currently working on the Host table (in nutchgora) and would like
> to include some of this in there too.
>
> Mathijs
>
> On Nov 27, 2011, at 15:46, Markus Jelsma wrote:
>
> > Hi,
> >
> > How do you handle the issue of duplicate (sub)domains? We measure a
> > significant amount of duplicate pages across subdomains. Bad websites,
> > for example, do not force a single subdomain and accept anything. With
> > regex normalizers we can easily tackle a portion of the problem by
> > normalizing www derivatives such as ww., wwww. or www.w.ww.www. to
> > www. This still leaves a huge number of incorrect subdomains, leading
> > to duplicates of _entire_ websites.
> >
> > We've built analysis jobs to detect and list duplicate pages within
> > subdomains (this also works across domains), which we can then reduce
> > with another job to bad subdomains. Yet one subdomain for each given
> > domain must be kept, and I still have to figure out which subdomain
> > will prevail.
> >
> > Here's an example of one such site:
> > 113425:188 example.org
> > 114314:186 startpagina.example.org
> > 114334:186 mobile.example.org
> > 114339:186 massages.example.org
> > 114340:186 massage.example.org
> > 114362:186 http.www.example.org
> > 114446:185 www.example.org
> > 115280:184 m.example.org
> > 115316:184 forum.example.org
> >
> > In this case it may be simple to select www as the subdomain we want
> > to keep, but it is not always so trivial.
> >
> > Anyone care to share some inspiring insights for the edge cases that
> > make up the bulk of duplicates?
> >
> > Thanks,
> > Markus

--
Markus Jelsma - CTO - Openindex
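To make the clues in this thread a bit more concrete, here are a few rough, self-contained Java sketches. First, the "www derivatives" normalization mentioned above: collapse any leading run of host labels made up only of 'w' characters to a single www. label. This is only an illustration, not the actual Nutch regex-normalizer configuration, and it deliberately leaves cases like http.www.example.org alone:

import java.util.regex.Pattern;

public class WwwNormalizerSketch {
  // Any leading sequence of labels consisting solely of 'w' characters.
  private static final Pattern WWW_VARIANTS = Pattern.compile("^(w+\\.)+");

  static String normalizeHost(String host) {
    return WWW_VARIANTS.matcher(host).replaceAll("www.");
  }

  public static void main(String[] args) {
    System.out.println(normalizeHost("wwww.example.org"));         // www.example.org
    System.out.println(normalizeHost("www.w.ww.www.example.org")); // www.example.org
    System.out.println(normalizeHost("forum.example.org"));        // unchanged
  }
}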
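Second, the DNS clue: group subdomains by the set of addresses they resolve to. Shared addresses are only a hint (virtual hosting can still serve different content), so hosts that end up in the same group are merely candidates for the content or fingerprint checks. The host names are the placeholders from the example above:

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class DnsGroupingSketch {
  public static void main(String[] args) {
    List<String> hosts = Arrays.asList(
        "www.example.org", "m.example.org", "forum.example.org");

    Map<String, List<String>> byAddress = new HashMap<>();
    for (String host : hosts) {
      TreeSet<String> addrs = new TreeSet<>();
      try {
        for (InetAddress a : InetAddress.getAllByName(host)) {
          addrs.add(a.getHostAddress());
        }
      } catch (UnknownHostException e) {
        addrs.add("unresolved");
      }
      byAddress.computeIfAbsent(addrs.toString(), k -> new ArrayList<>()).add(host);
    }
    // Hosts with an identical address set are duplicate candidates.
    byAddress.forEach((ips, group) -> System.out.println(ips + " -> " + group));
  }
}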
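And third, a crude version of the front-page fingerprint: fetch / from each subdomain, hash the <title> plus a couple of response headers, and group hosts by the resulting digest. Again just a sketch (blocking fetches, naive title extraction), not how a Nutch job would actually do it:

import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FingerprintSketch {
  private static final Pattern TITLE =
      Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  static String fingerprint(String host) throws Exception {
    HttpURLConnection conn =
        (HttpURLConnection) new URL("http://" + host + "/").openConnection();
    conn.setConnectTimeout(5000);
    conn.setReadTimeout(5000);
    String body;
    try (Scanner s = new Scanner(conn.getInputStream(), "UTF-8").useDelimiter("\\A")) {
      body = s.hasNext() ? s.next() : "";
    }
    Matcher m = TITLE.matcher(body);
    String title = m.find() ? m.group(1).trim() : "";
    // Title plus a few headers is usually enough to group obvious mirrors.
    String raw = title + "|" + conn.getHeaderField("Server")
        + "|" + conn.getHeaderField("Content-Type");
    byte[] digest = MessageDigest.getInstance("MD5").digest(raw.getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  public static void main(String[] args) throws Exception {
    // Subdomains producing the same fingerprint are duplicate candidates.
    for (String host : new String[] {"www.example.org", "m.example.org"}) {
      System.out.println(host + " -> " + fingerprint(host));
    }
  }
}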

