Thanks Julien,

The subdomains do, in fact, point to the same IP address. In the end the issue
was that our DNS service flagged our traffic because we were resolving millions
of subdomains against the domain's authoritative nameservers (we use OpenDNS
to filter inappropriate content).

Partitioning and fetching by IP is definitely a step in the right direction.
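
For anyone else who runs into this, here is a rough sketch of what we plan to
put in nutch-site.xml based on your suggestions (the byIP values and the
exception cap are untested guesses on our side, not settings we've validated):

<configuration>
  <!-- Group URLs into fetch queues by IP so the politeness settings
       apply across all subdomains that resolve to the same address -->
  <property>
    <name>fetcher.queue.mode</name>
    <value>byIP</value>
  </property>
  <!-- Partition the fetch lists the same way -->
  <property>
    <name>partition.url.mode</name>
    <value>byIP</value>
  </property>
  <!-- Give up on a queue after repeated failures to avoid a long tail
       of fetch errors (the value here is a guess; tune as needed) -->
  <property>
    <name>fetcher.max.exceptions.per.queue</name>
    <value>20</value>
  </property>
</configuration>

As you note, grouping by IP means the whole parent domain shares one queue, so
URL discovery under that domain will slow down, which in our case is exactly
what we want.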

---
Joe Naegele
Grier Forensics

-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] 
Sent: Wednesday, January 11, 2017 9:32 AM
To: user@nutch.apache.org
Subject: Re: General question about subdomains

Hi Joe,

Do these subdomains point to the same IP address? Did they blacklist your
server, i.e. can you still connect to these domains from the crawl server
using a different tool like curl?

Not a silver bullet but a way of preventing this is to group by IP or
domain (fetcher.queue.mode and partition.url.mode) so that the politeness
settings are applied to all the subdomains. This will reduce the risk of
being blacklisted - assuming you were - and slow down the discovery of URLs
for the TLD.

fetcher.max.exceptions.per.queue should also help by preventing a long tail
of fetch errors during the fetch step.

HTH

Julien


On 11 January 2017 at 14:21, Joseph Naegele <jnaeg...@grierforensics.com>
wrote:

> This is more of a general question, not Nutch-specific:
>
> Our crawler discovered some URLs pointing to a number of subdomains of a
> Chinese-owned domain. It then proceeded to discover millions more
> URLs pointing to other subdomains (hosts) of the same domain. Most of the
> names appear to be gibberish but they do have robots.txt files and the URLs
> appear to serve HTML. A few days later I found that our crawler machine was
> no longer able to resolve these subdomains, as if it was blocked by their
> DNS servers, significantly slowing our crawl (due to DNS timeouts). This
> led me to investigate and find that 40% of all our known URLs pointed to
> hosts under this same parent domain.
>
> Since the hosts are actually different, is Nutch able to prevent this
> trap-like behavior? Are there any established methods for preventing
> similar issues in web crawlers?
>
> Thanks
>
> ---
> Joe Naegele
> Grier Forensics
>
>
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
