Markus,

Interestingly enough, we do use OpenDNS to filter undesirable content, 
including parked content. In this case, however, the domain in question isn't 
tagged in OpenDNS and is therefore "allowed", along with all its subdomains.

This particular domain is "hjsjp.com". It's Chinese-owned, and the URLs all 
appear to point to the same link-filled content, possibly a domain park site. 
Example URLs:
- http://e2qya.hjsjp.com/
- http://ml081.hjsjp.com/xzudb
- http://www.ch8yu.hjsjp.com/1805/8371.html
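
For now, the quickest stopgap looks like excluding the domain at the URL-filter 
level. A minimal sketch, assuming the stock urlfilter-regex plugin and its 
conf/regex-urlfilter.txt (the negative rule has to appear before the final "+." 
accept-everything rule):

  # reject hjsjp.com and every subdomain of it
  -^https?://([a-z0-9-]+\.)*hjsjp\.com(/|$)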

As Julien mentioned, partitioning and fetching by IP would help.
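
If I understand correctly, that's a matter of switching the partition and fetch 
queue modes in nutch-site.xml. A sketch, untested on our side (property names as 
in nutch-default.xml):

  <property>
    <name>partition.url.mode</name>
    <value>byIP</value>
  </property>
  <property>
    <name>fetcher.queue.mode</name>
    <value>byIP</value>
  </property>

The caveat is that grouping byIP needs a DNS lookup per host at generate/partition 
time, which is exactly what's slow for us at the moment.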

---
Joe Naegele
Grier Forensics

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Wednesday, January 11, 2017 9:43 AM
To: user@nutch.apache.org
Subject: RE: General question about subdomains

Hello Joseph,

The only feasible method, as I see it, is to detect these kinds of spam sites, as 
well as domain park sites, which produce lots of garbage too. Once you detect 
them, you can choose not to follow their outlinks, or mark them in a 
domain-blacklist urlfilter.
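
For the latter, a sketch assuming the urlfilter-domainblacklist plugin available 
in recent Nutch releases (add it to plugin.includes): list the offending domains, 
one per line, in conf/domainblacklist-urlfilter.txt, e.g. with made-up entries:

  # any URL whose host falls under these domains is rejected
  spammy-park.example
  another-link-farm.example

A negative rule in regex-urlfilter.txt achieves the same thing if you prefer to 
keep everything in one file.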

We have seen examples like these as well, and they caused similar problems, but 
we lost track of them; those domains don't exist anymore. Could you send me the 
domains that are causing you trouble? We could use them for our classification 
training sets.

Regards,
Markus
 
-----Original message-----
> From:Joseph Naegele <jnaeg...@grierforensics.com>
> Sent: Wednesday 11th January 2017 15:21
> To: user@nutch.apache.org
> Subject: General question about subdomains
> 
> This is more of a general question, not Nutch-specific:
> 
> Our crawler discovered some URLs pointing to a number of subdomains of a 
> Chinese-owned domain. It then proceeded to discover millions more URLs 
> pointing to other subdomains (hosts) of the same domain. Most of the names 
> appear to be gibberish but they do have robots.txt files and the URLs appear 
> to serve HTML. A few days later I found that our crawler machine was no 
> longer able to resolve these subdomains, as if it was blocked by their DNS 
> servers, significantly slowing our crawl (due to DNS timeouts). This led me 
> to investigate and find that 40% of all our known URLs pointed to hosts under 
> this same parent domain.
> 
> Since the hosts are actually different, is Nutch able to prevent this 
> trap-like behavior? Are there any established methods for preventing similar 
> issues in web crawlers?
> 
> Thanks
> 
> ---
> Joe Naegele
> Grier Forensics
> 
> 
> 
