By the way - Unbound looks very interesting indeed! I'll be sure to remember 
that as an option for DNS caching within the DC.

Thanks,
Markus

 
 
-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Wednesday 17th February 2016 20:42
> To: [email protected]
> Subject: RE: DNS caching best practices
> 
> Hello Alexander - for Nutch and other JVM crawlers, crawl speed does not 
> really matter. The JVM caches DNS lookups. The only thing we ever had to 
> worry about (when crawling at large scale and at high speed) was whether 
> or not we'd overwhelm the DNS server in our local DC.
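
To illustrate the JVM-side caching mentioned above, here is a minimal sketch, assuming the standard `networkaddress.cache.ttl` and `networkaddress.cache.negative.ttl` Java security properties. The class name and the TTL values (300 and 10 seconds) are made up for this example and are not Nutch settings. The properties should be set before the JVM performs its first lookup, otherwise they may not take effect.

```java
import java.security.Security;

// Hypothetical helper, not part of Nutch: tunes the JVM's built-in DNS cache.
final class JvmDnsCacheSettings {

    // Sets how long (in seconds) the JVM keeps successful and failed lookups.
    // Call this early, before any InetAddress lookup has been made.
    static void apply(int positiveTtlSeconds, int negativeTtlSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }

    public static void main(String[] args) {
        // Example values only: cache hits for 5 minutes, failures for 10 seconds.
        apply(300, 10);
        System.out.println("networkaddress.cache.ttl="
                + Security.getProperty("networkaddress.cache.ttl"));
        System.out.println("networkaddress.cache.negative.ttl="
                + Security.getProperty("networkaddress.cache.negative.ttl"));
    }
}
```

A longer positive TTL means fewer queries reach the DNS server in the DC, at the cost of noticing address changes later.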
> 
> In the case of Nutch, I would not worry about having a machine-local DNS 
> cache. It eats memory and, in the case of Nutch, won't have a very high hit 
> rate. I'd prefer a close but central, powerful DNS server that can dedicate 
> its memory to DNS.
> 
> Markus 
>  
> -----Original message-----
> > From:Alexander Sibiryakov <[email protected]>
> > Sent: Tuesday 16th February 2016 13:57
> > To: [email protected]
> > Subject: Re: DNS caching best practices
> > 
> > Otis, Markus,
> > it depends on the speed at which you operate your crawler. If it’s 
> > relatively slow, then it’s OK to use the ISP’s general-purpose DNS for it.
> > 
> > I think the information below could be useful, just to realize what kind 
> > of problems we cause for Internet infrastructure.
> > 
> > I was talking with one of the guys from https://selectel.ru/ (a huge cloud 
> > and hosting provider) who is responsible for the DNS service, and he said 
> > they built a dedicated DNS cache for various crawlers and bots, to help 
> > preserve the cache in their main DNS server. Before that, during the night 
> > (crawler time!) the cache was changing significantly and causing slowdowns 
> > for typical users the next day.
> > 
> > His recommendation was to use http://unbound.net/ as a local caching DNS 
> > service and to configure it without an upstream, so that it resolves DNS 
> > recursively on its own. It even provides a way to dump the cache to disk 
> > and load it back.
> > 
> > Linux has no internal DNS cache, so a local cache makes sense if your 
> > crawler makes repeated requests to the same website.
> > 
> > A.
> > 
> > > On 1 Feb 2016, at 11:18, Markus Jelsma <[email protected]> 
> > > wrote:
> > > 
> > > Otis - we tried local DNS caching when we did very large scale crawls but 
> > > decided to get rid of it as soon as possible because it caused too much 
> > > overhead. Instead, we relied on an apparently powerful DNS server made 
> > > available by the ISP in the network center. If the server is fast and has 
> > > a lot of RAM, the mapper won't quickly overwhelm it.
> > > 
> > > Markus
> > > 
> > > 
> > > -----Original message-----
> > >> From:Otis Gospodnetić <[email protected]>
> > >> Sent: Sunday 31st January 2016 23:36
> > >> To: Nutch User List <[email protected]>
> > >> Subject: DNS caching best practices
> > >> 
> > >> Hi,
> > >> 
> > >> The first item on http://wiki.apache.org/nutch/OptimizingCrawls is DNS
> > >> caching.  Is this still something people regularly do?  Even when running
> > >> in EC2, which I assume has nameservers that are relatively close to
> > >> instances doing crawling and nameserver lookups?
> > >> 
> > >> If so, are there any recommendations for the best DNS caching
> > >> server/config to use?
> > >> 
> > >> Thanks,
> > >> Otis
> > >> --
> > >> Monitoring - Log Management - Alerting - Anomaly Detection
> > >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> > >> 
> > 
> > 
