Hi Mathijs,

I've already implemented several of your clues and I'm getting good
results. The remaining problem is deciding when an entire sub domain
should be filtered out based on just a couple of duplicates; I've seen
similar metrics for both good and bad sites.

I haven't yet decided how to prevent false positives for these edge cases.
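
To illustrate the kind of decision I mean (just a sketch; the threshold
values are arbitrary placeholders, not anything we have validated):

// Sketch: only filter a sub domain when the evidence is strong enough.
public class SubDomainFilterHeuristic {

  // placeholders: fraction of pages that must be duplicates, and the
  // minimum number of fetched pages before we judge a host at all
  private static final double MIN_DUPLICATE_RATIO = 0.8;
  private static final long MIN_SAMPLE_SIZE = 50;

  /**
   * @param duplicatePages pages on this sub domain also found on a sibling host
   * @param totalPages     total pages fetched from this sub domain
   */
  public static boolean shouldFilter(long duplicatePages, long totalPages) {
    if (totalPages < MIN_SAMPLE_SIZE) {
      return false; // too little data; keep the host to avoid false positives
    }
    return (double) duplicatePages / totalPages >= MIN_DUPLICATE_RATIO;
  }

  public static void main(String[] args) {
    System.out.println(shouldFilter(3, 10));    // false: sample too small
    System.out.println(shouldFilter(180, 186)); // true: nearly everything duplicated
  }
}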

Thanks
Markus

On Sunday 27 November 2011 20:12:48 Mathijs Homminga wrote:
> Hi Markus,
> 
> What is your definition of duplicate (sub) domains?
> 
> From your examples, I think you are looking for domains (or host IPs)
> that are interchangeable. That is, domains that give an identical
> response when combined with the same protocol, port, path and query (a
> URL).
> 
> You could indeed use heuristics (like normalizing wwww. to www.).
> 
> I guess that most of the time this happens when the domain has set a
> wildcard DNS record (catch-all). There is no guarantee, however, that
> wildcard domains act 'identically'. Although (sub) domains may point to
> the same canonical name or IP address, they may still give different
> responses because of domain/URL based dispatching on that host (think
> virtual hosts in Apache) or application level logic. I guess this is why
> you can never be 100% sure that the domains are duplicates...
> 
> Clues I can think of (none of them are hard guarantees):
> 
> - Your heuristics using common patterns.
> - Do a DNS lookup of the domains... does it point to another domain or
>   to an IP address which is shared among other domains?
> - Did we find duplicate URLs on different hosts?
>   - Quick: if there are a lot of identical URLs (path + query of
>     substantial length) on different subdomains, then the domains might
>     be identical.
> - You might want to include a content check in the above.
> - Actively check a fingerprint of the main page of each subdomain (e.g.
>   title + some headers) and group domains based on this; a rough sketch
>   follows below.
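> 
> Just to illustrate that last idea (a rough sketch, not Nutch code; hashing
> the title plus a couple of headers is an arbitrary choice):
> 
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> import java.math.BigInteger;
> import java.net.HttpURLConnection;
> import java.net.URL;
> import java.security.MessageDigest;
> import java.util.*;
> import java.util.regex.*;
> 
> public class HostFingerprint {
> 
>   // fingerprint a host's main page by hashing its <title> plus a couple
>   // of response headers
>   static String fingerprint(String host) throws Exception {
>     HttpURLConnection conn =
>         (HttpURLConnection) new URL("http://" + host + "/").openConnection();
>     conn.setConnectTimeout(5000);
>     conn.setReadTimeout(5000);
> 
>     StringBuilder body = new StringBuilder();
>     BufferedReader in =
>         new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
>     for (String line; (line = in.readLine()) != null; ) body.append(line);
>     in.close();
> 
>     // crude title extraction is good enough for a fingerprint
>     Matcher m = Pattern.compile("(?is)<title>(.*?)</title>").matcher(body);
>     String title = m.find() ? m.group(1).trim() : "";
> 
>     String raw = conn.getHeaderField("Server") + "|"
>                + conn.getHeaderField("Content-Type") + "|" + title;
>     byte[] digest = MessageDigest.getInstance("MD5").digest(raw.getBytes("UTF-8"));
>     return new BigInteger(1, digest).toString(16);
>   }
> 
>   public static void main(String[] args) throws Exception {
>     // hosts that end up under the same key are candidate duplicates
>     Map<String, List<String>> groups = new HashMap<String, List<String>>();
>     for (String host : args) {
>       String fp = fingerprint(host);
>       if (!groups.containsKey(fp)) groups.put(fp, new ArrayList<String>());
>       groups.get(fp).add(host);
>     }
>     System.out.println(groups);
>   }
> }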
> 
> I'm currently working on the Host table (in nutchgora) and would like to
> include some of this in there too.
> 
> Mathijs
> 
> On Nov 27, 2011, at 15:46 , Markus Jelsma wrote:
> > Hi,
> > 
> > How do you handle the issue of duplicate (sub) domains? We see a
> > significant number of duplicate pages across sub domains. Badly behaved
> > websites, for example, do not enforce a single sub domain and accept
> > anything. With regex normalizers we can easily tackle a portion of the
> > problem by normalizing www derivatives such as ww. wwww. or
> > www.w.ww.www. to www. This still leaves a huge number of incorrect sub
> > domains, leading to duplicates of _entire_ websites.
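> > 
> > For illustration, a normalizer rule along these lines can collapse those
> > variants (just a sketch; the pattern would need tuning before real use):
> > 
> > import java.util.regex.Pattern;
> > 
> > public class WwwNormalizer {
> >   // collapse host prefixes made up only of w's and dots (ww., wwww.,
> >   // www.w.ww.www.) into a single "www."
> >   private static final Pattern BROKEN_WWW =
> >       Pattern.compile("^(https?://)(?:w[w.]*w\\.)+");
> > 
> >   public static String normalize(String url) {
> >     return BROKEN_WWW.matcher(url).replaceFirst("$1www.");
> >   }
> > 
> >   public static void main(String[] args) {
> >     System.out.println(normalize("http://wwww.example.org/page"));     // www.
> >     System.out.println(normalize("http://www.w.ww.www.example.org/")); // www.
> >     System.out.println(normalize("http://forum.example.org/"));        // unchanged
> >   }
> > }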
> > 
> > We've built analysis jobs to detect and list duplicate pages within sub
> > domains (this also works across domains), which we can then reduce with
> > another job to a list of bad sub domains. However, one sub domain for
> > each domain must be kept, and I still have to figure out which sub
> > domain should prevail.
> > 
> > Here's an example of one such site:
> > 113425:188  example.org
> > 114314:186  startpagina.example.org
> > 114334:186  mobile.example.org
> > 114339:186  massages.example.org
> > 114340:186  massage.example.org
> > 114362:186  http.www.example.org
> > 114446:185  www.example.org
> > 115280:184  m.example.org
> > 115316:184  forum.example.org
> > 
> > In this case it may be simple to select www as the sub domain we want to
> > keep but it is not always so trivial.
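> > 
> > For illustration, a simple preference order covers the easy cases (just
> > a sketch; the ordering is debatable):
> > 
> > import java.util.*;
> > 
> > public class PreferredHost {
> >   // Prefer www.<domain>, then the bare domain, then the shortest name,
> >   // then alphabetical order as a tie breaker.
> >   public static String pick(String domain, Collection<String> hosts) {
> >     if (hosts.contains("www." + domain)) return "www." + domain;
> >     if (hosts.contains(domain)) return domain;
> >     return Collections.min(hosts, new Comparator<String>() {
> >       public int compare(String a, String b) {
> >         if (a.length() != b.length()) return a.length() - b.length();
> >         return a.compareTo(b);
> >       }
> >     });
> >   }
> > 
> >   public static void main(String[] args) {
> >     List<String> hosts = Arrays.asList(
> >         "mobile.example.org", "massage.example.org", "www.example.org");
> >     System.out.println(pick("example.org", hosts)); // www.example.org
> >   }
> > }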
> > 
> > Can anyone share some inspiring insights for the edge cases that make
> > up the bulk of the duplicates?
> > 
> > Thanks,
> > markus

-- 
Markus Jelsma - CTO - Openindex
