Hi Markus,
What is your definition of duplicate (sub) domains?
By reading your examples, I think you are looking for domains (or host IPs)
that are interchangeable.
That is, domains that give an identical response when combined with the same
protocol, port, path and query (i.e. the same URL).
You could indeed use heuristics (like normalizing wwww. to www.).
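For illustration, here is a minimal sketch of such a normalization in Java
(the pattern and class name are my own; real rules would need tuning per
corpus):

    import java.util.regex.Pattern;

    // Collapse leading www-like labels (ww., wwww., www.w.ww.www.) to "www.".
    // Illustrative only: this rewrites any leading labels made entirely of w's.
    public class HostNormalizer {
        private static final Pattern WWW_VARIANTS = Pattern.compile("^(?:w+\\.)+");

        public static String normalize(String host) {
            return WWW_VARIANTS.matcher(host).replaceAll("www.");
        }

        public static void main(String[] args) {
            System.out.println(normalize("wwww.example.org"));         // www.example.org
            System.out.println(normalize("www.w.ww.www.example.org")); // www.example.org
        }
    }

In Nutch itself a rule like this would typically live in the regex normalizer
configuration (regex-normalize.xml) rather than in code.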
I guess that most of the time such interchangeable domains appear when the
domain has a wildcard DNS record (a catch-all).
There is no guarantee, however, that wildcard domains act identically.
Although (sub)domains may point to the same canonical name or IP address, they
may still give different responses because of domain/URL-based dispatching on
that host (think virtual hosts in Apache) or application-level logic.
I guess this is why you can never be 100% sure that the domains are
duplicates...
Clues I can think of (none of them are hard guarantees):
- Your heuristics using common patterns, as in the sketch above.
- Do a DNS lookup of the domains: do they point to another domain, or to an IP
address that is shared among other domains? (See the first sketch after this
list.)
- Did we find duplicate URLs on different hosts?
- Quick check: if there are a lot of identical URLs (path + query of
substantial length) on different subdomains, then the domains might be
identical (see the grouping sketch below).
- You might want to include a content check in the above.
- Actively fetch a fingerprint of the main page of each subdomain (e.g. title +
some headers) and group domains based on it (see the last sketch below).
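For the DNS clue, a quick sketch in Java (the hosts in main() are
hypothetical). Keep in mind that shared IPs are only a hint; virtual hosting
can still serve different content per name:

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.Arrays;

    // Resolve two hosts and check whether they share at least one IP address.
    public class SharedIpCheck {
        public static boolean shareAddress(String hostA, String hostB)
                throws UnknownHostException {
            InetAddress[] a = InetAddress.getAllByName(hostA);
            InetAddress[] b = InetAddress.getAllByName(hostB);
            // InetAddress.equals compares the raw IP addresses, not hostnames.
            return Arrays.stream(a)
                .anyMatch(x -> Arrays.stream(b).anyMatch(x::equals));
        }

        public static void main(String[] args) throws UnknownHostException {
            System.out.println(shareAddress("www.example.org", "m.example.org"));
        }
    }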
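For the identical-URLs clue, a minimal grouping sketch. In Nutch this would
more naturally be a MapReduce job over the crawldb; the in-memory version
below (class name and length threshold are my own) just shows the idea:

    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.*;

    // Group URLs by path + query and collect the hosts serving each.
    // Hosts that co-occur for many long paths are duplicate-domain candidates.
    public class DuplicatePathGrouper {
        public static Map<String, Set<String>> groupByPath(List<String> urls)
                throws MalformedURLException {
            Map<String, Set<String>> hostsByPath = new HashMap<>();
            for (String u : urls) {
                URL url = new URL(u);
                String key = url.getPath()
                    + (url.getQuery() != null ? "?" + url.getQuery() : "");
                // Short paths like "/" match everywhere; require some length.
                if (key.length() < 20) continue;
                hostsByPath.computeIfAbsent(key, k -> new HashSet<>())
                           .add(url.getHost());
            }
            return hostsByPath;
        }
    }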
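And for the fingerprint clue, a rough sketch that keys on the title plus two
response headers. Which fields to include is of course a choice, and a naive
regex title extraction like this is fine for a fingerprint but not for real
HTML parsing:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Fetch the main page of a host and build a crude fingerprint from the
    // <title> plus the Server and Last-Modified headers. Equal fingerprints
    // mark duplicate candidates, nothing more.
    public class HostFingerprint {
        private static final Pattern TITLE = Pattern.compile(
            "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        public static String fingerprint(String host) throws Exception {
            HttpURLConnection conn =
                (HttpURLConnection) new URL("http://" + host + "/").openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null && body.length() < 65536) {
                    body.append(line).append('\n');
                }
            }
            Matcher m = TITLE.matcher(body);
            String title = m.find() ? m.group(1).trim() : "";
            return title + "|" + conn.getHeaderField("Server")
                         + "|" + conn.getHeaderField("Last-Modified");
        }
    }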
I'm currently working on the Host table (in nutchgora) and would like to
include some of this in there too.
Mathijs
On Nov 27, 2011, at 15:46, Markus Jelsma wrote:
> Hi,
>
> How do you handle the issue of duplicate (sub) domains? We measure a
> significant number of duplicate pages across sub domains. Badly configured
> websites, for example, do not force a single sub domain and accept anything.
> With regex normalizers we can easily tackle a portion of the problem by
> normalizing www derivatives such as ww. wwww. or www.w.ww.www. to www. This
> still leaves a huge number of incorrect sub domains, leading to duplicates
> of _entire_ websites.
>
> We've built analysis jobs to detect and list duplicate pages within sub
> domains (they also work across domains), which we can then reduce with
> another job to a list of bad sub domains. Yet, one sub domain for each
> domain must be kept, but I've still to figure out which sub domain should
> prevail.
>
> Here's an example of one such site:
> 113425:188 example.org
> 114314:186 startpagina.example.org
> 114334:186 mobile.example.org
> 114339:186 massages.example.org
> 114340:186 massage.example.org
> 114362:186 http.www.example.org
> 114446:185 www.example.org
> 115280:184 m.example.org
> 115316:184 forum.example.org
>
> In this case it may be simple to select www as the sub domain we want to
> keep, but it is not always so trivial.
>
> Anyone care to share some inspiring insights on the edge cases that make up
> the bulk of the duplicates?
>
> Thanks,
> markus