Interestingly enough, we do use OpenDNS to filter undesirable content,
including parked content. In this case, however, the domain in question isn't
tagged in OpenDNS and is therefore "allowed", along with all its subdomains.
This particular domain is "hjsjp.com". It's Chinese-owned and the URLs appear
to all point to the same link-filled content, possibly a domain park site.
As Julien mentioned, partitioning and fetching by IP would help.
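
To make the partition-by-IP idea concrete, here is a rough sketch in Python. It is not Nutch's implementation, just an illustration of the principle: if URLs are queued by resolved IP address rather than by hostname, millions of gibberish subdomains that all point at the same server collapse into a single queue, so per-queue politeness delays and limits apply to all of them at once. The function name `partition_by_ip` and the pluggable `resolve` parameter are my own, hypothetical choices.

```python
import socket
from collections import defaultdict
from urllib.parse import urlsplit


def partition_by_ip(urls, resolve=socket.gethostbyname):
    """Group URLs into fetch queues keyed by the IP their host resolves to.

    `resolve` is any callable mapping a hostname to an IP string; it is
    expected to raise OSError (socket.gaierror is a subclass) on failure.
    """
    queues = defaultdict(list)
    cache = {}  # hostname -> IP string, or None if resolution failed
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # not an absolute URL with a host part
        if host not in cache:
            try:
                cache[host] = resolve(host)
            except OSError:
                cache[host] = None  # remember the failure, don't retry
        ip = cache[host]
        if ip is not None:
            queues[ip].append(url)
    return dict(queues)
```

Each hostname is resolved at most once, including failures, which also keeps repeated DNS timeouts for the same dead host from piling up. URLs whose host cannot be resolved are simply left out of every queue in this sketch.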
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, January 11, 2017 9:43 AM
Subject: RE: General question about subdomains
The only feasible method, as I see it, is to detect these kinds of spam
sites, as well as domain park sites, which produce lots of garbage too. Once
you detect them, you can choose not to follow outlinks, or to mark them in a
We have seen examples like this as well, and they caused similar problems, but
we lost track of them and those domains don't exist anymore. Can you send me
the domains that cause you trouble? We could use them for our classification
> From: Joseph Naegele <jnaeg...@grierforensics.com>
> Sent: Wednesday 11th January 2017 15:21
> To: email@example.com
> Subject: General question about subdomains
> This is more of a general question, not Nutch-specific:
> Our crawler discovered some URLs pointing to a number of subdomains of a
> Chinese-owned domain. It then proceeded to discover millions more URLs
> pointing to other subdomains (hosts) of the same domain. Most of the names
> appear to be gibberish but they do have robots.txt files and the URLs appear
> to serve HTML. A few days later I found that our crawler machine was no
> longer able to resolve these subdomains, as if it was blocked by their DNS
> servers, significantly slowing our crawl (due to DNS timeouts). This led me
> to investigate and find that 40% of all our known URLs were for hosts on this
> same parent domain.
> Since the hosts are actually different, is Nutch able to prevent this
> trap-like behavior? Are there any established methods for preventing similar
> issues in web crawlers?
> Joe Naegele
> Grier Forensics
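
For the kind of check described in the original question (noticing that one parent domain accounts for a huge share of known hosts), a minimal sketch might look like the following. It counts distinct hostnames per parent domain and flags any domain above a threshold. Everything here is an assumption for illustration: the function names, the `max_hosts` threshold, and especially `parent_domain`, which naively keeps the last two labels and would misjudge suffixes such as `co.uk`. A real crawler would want a public-suffix list instead.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def parent_domain(host):
    # Naive assumption: the parent domain is the last two labels.
    # This is wrong for multi-label suffixes such as "co.uk".
    return ".".join(host.lower().rstrip(".").split(".")[-2:])


def hosts_per_domain(urls):
    """Return {parent domain: number of distinct hostnames seen}."""
    seen = defaultdict(set)
    for url in urls:
        host = urlsplit(url).hostname
        if host:
            seen[parent_domain(host)].add(host)
    return {domain: len(hosts) for domain, hosts in seen.items()}


def suspicious_domains(urls, max_hosts=1000):
    """List parent domains with more than `max_hosts` distinct hosts."""
    counts = hosts_per_domain(urls)
    return sorted(d for d, n in counts.items() if n > max_hosts)
```

Domains flagged this way could then be reviewed by hand, excluded from outlink following, or capped at a fixed number of hosts per crawl cycle. Which of those is appropriate depends on the crawl.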