This is more of a general question, not Nutch-specific:

Our crawler discovered some URLs pointing to a number of subdomains of a 
Chinese-owned domain. It then proceeded to discover millions more URLs 
pointing to other subdomains (hosts) of the same domain. Most of the hostnames 
appear to be gibberish, but they do have robots.txt files and the URLs appear 
to serve HTML. A few days later I found that our crawler machine could no 
longer resolve these subdomains, as if it had been blocked by their DNS 
servers, which significantly slowed our crawl due to DNS timeouts. This led me 
to investigate, and I found that 40% of all our known URLs pointed to hosts 
under this same parent domain.

Since the hosts are actually different, is Nutch able to prevent this trap-like 
behavior? Are there any established methods for preventing similar issues in 
web crawlers?
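
For what it's worth, the crude approach I had in mind is to cap the number of 
distinct hosts accepted per parent domain. A rough plain-Java sketch is below 
(this is not Nutch API code; the two-label "registered domain" heuristic and 
the threshold are just placeholders, and a real filter would presumably use 
the Public Suffix List):

    import java.net.URI;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    /**
     * Sketch: reject URLs once a single parent domain has contributed
     * more than MAX_HOSTS distinct hostnames.
     */
    public class HostsPerDomainFilter {
        private static final int MAX_HOSTS = 1000; // illustrative threshold
        private final Map<String, Set<String>> hostsByDomain = new HashMap<>();

        /** Returns the URL if accepted, or null if its domain has too many hosts. */
        public String filter(String url) {
            try {
                String host = URI.create(url).getHost();
                if (host == null) return null;
                String[] labels = host.split("\\.");
                if (labels.length < 2) return url;
                // naive registered-domain guess: the last two labels
                String domain = labels[labels.length - 2] + "." + labels[labels.length - 1];
                Set<String> hosts = hostsByDomain.computeIfAbsent(domain, d -> new HashSet<>());
                hosts.add(host);
                return hosts.size() > MAX_HOSTS ? null : url;
            } catch (IllegalArgumentException e) {
                return null; // unparsable URL
            }
        }

        public static void main(String[] args) {
            HostsPerDomainFilter f = new HostsPerDomainFilter();
            System.out.println(f.filter("http://abc123.example.cn/page")); // accepted while under the cap
        }
    }

I'm not sure whether something along these lines already exists as a Nutch 
plugin, or whether there is a better-established technique.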

Thanks

---
Joe Naegele
Grier Forensics
