Hello Joseph,

The only feasible method, as I see it, is to detect these kinds of spam sites as well as domain parking sites, which produce lots of garbage too. Once you detect them, you can choose not to follow their outlinks, or mark them in a domain-blacklist urlfilter.
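If you go the blacklist route, the setup is roughly the following (a sketch from memory, assuming the urlfilter-domainblacklist plugin of recent Nutch 1.x; check conf/nutch-default.xml for the exact property and file names in your version, and note the domains below are just placeholders):

    <!-- conf/nutch-site.xml: make sure the plugin is enabled in plugin.includes -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-(regex|domainblacklist)|parse-(html|tika)|index-basic|scoring-opic</value>
    </property>

    # conf/domainblacklist-urlfilter.txt: one domain, host or suffix per line;
    # any URL whose domain matches an entry is rejected before it is fetched
    parked-example.com
    spammy-example.cn

That keeps the known bad domains out of your fetch lists; detecting new ones automatically is the harder part, which is where the classification comes in.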
We have seen such examples as well and they caused similar problems, but we lost track of them and those domains no longer exist. Could you send me the domains that are causing you trouble? We could use them for our classification training sets.

Regards,
Markus

-----Original message-----
> From: Joseph Naegele <[email protected]>
> Sent: Wednesday 11th January 2017 15:21
> To: [email protected]
> Subject: General question about subdomains
>
> This is more of a general question, not Nutch-specific:
>
> Our crawler discovered some URLs pointing to a number of subdomains of a
> Chinese-owned spammy domain. It then proceeded to discover millions more URLs
> pointing to other subdomains (hosts) of the same domain. Most of the names
> appear to be gibberish, but they do have robots.txt files and the URLs appear
> to serve HTML. A few days later I found that our crawler machine was no
> longer able to resolve these subdomains, as if it had been blocked by their DNS
> servers, significantly slowing our crawl (due to DNS timeouts). This led me
> to investigate and find that 40% of all our known URLs were hosts on this
> same parent domain.
>
> Since the hosts are actually different, is Nutch able to prevent this
> trap-like behavior? Are there any established methods for preventing similar
> issues in web crawlers?
>
> Thanks
>
> ---
> Joe Naegele
> Grier Forensics

