This is more of a general question, not Nutch-specific: Our crawler discovered some URLs pointing to a number of subdomains of a Chinese-owned domain. It then proceeded to discover millions more URLs pointing to other subdomains (hosts) of the same domain. Most of the host names appear to be gibberish, but they do have robots.txt files and the URLs appear to serve HTML. A few days later I found that our crawler machine was no longer able to resolve these subdomains, as if it had been blocked by their DNS servers, which significantly slowed our crawl due to DNS timeouts. That led me to investigate, and I found that 40% of all our known URLs were hosts on this same parent domain.
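For reference, this is roughly how I arrived at that 40% figure: a throwaway program (not Nutch code) that groups a plain-text dump of our known URLs by their parent domain and prints the biggest offenders. The input file argument and the naive "last two host labels" heuristic are just placeholders for illustration; a real version would consult the public suffix list.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URI;
import java.util.*;

public class DomainSkew {
    public static void main(String[] args) throws Exception {
        Map<String, Long> counts = new HashMap<>();
        long total = 0;
        // args[0]: a plain-text file with one URL per line (assumed format)
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String host;
                try {
                    host = new URI(line.trim()).getHost();
                } catch (Exception e) {
                    continue; // skip malformed URLs
                }
                if (host == null) continue;
                // naive "parent domain": last two labels of the host name
                // (ignores multi-label public suffixes like .com.cn)
                String[] labels = host.split("\\.");
                String domain = labels.length >= 2
                        ? labels[labels.length - 2] + "." + labels[labels.length - 1]
                        : host;
                counts.merge(domain, 1L, Long::sum);
                total++;
            }
        }
        final long t = total;
        // print the ten parent domains holding the most known URLs
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
              .limit(10)
              .forEach(e -> System.out.printf("%-40s %10d  (%.1f%%)%n",
                      e.getKey(), e.getValue(), 100.0 * e.getValue() / t));
    }
}
```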
Since the hosts are actually different, is Nutch able to prevent this trap-like behavior? Are there any established methods for preventing similar issues in web crawlers?

Thanks

---
Joe Naegele
Grier Forensics

