By the way - Unbound looks very interesting indeed! I'll be sure to remember that as an option for DNS caching within the DC.

Thanks,
Markus

-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Wednesday 17th February 2016 20:42
> To: [email protected]
> Subject: RE: DNS caching best practices
>
> Hello Alexander - for Nutch and other JVM crawlers crawl speed does not
> really matter. The JVM caches DNS lookups. The only thing we ever had to
> worry about (when crawling large scale and at high speed) was whether or
> not we'd overload the DNS server in our local DC.
>
> In the case of Nutch, I would not worry about having a machine-local DNS
> cache. It eats memory and, in the case of Nutch, won't have a very high
> hit rate. I'd prefer a close but central, powerful DNS server that can
> dedicate its memory to DNS.
>
> Markus
>
> -----Original message-----
> > From:Alexander Sibiryakov <[email protected]>
> > Sent: Tuesday 16th February 2016 13:57
> > To: [email protected]
> > Subject: Re: DNS caching best practices
> >
> > Otis, Markus,
> > it depends on the speed at which you operate your crawler. If it's
> > relatively slow, then it's OK to use the ISP's general-purpose DNS for it.
> >
> > I think the information below could be useful, just to realize what kind
> > of problems we cause for internet infrastructure.
> >
> > I was talking with one of the guys from https://selectel.ru/ (a huge
> > cloud and hosting provider) responsible for the DNS service, and he said
> > they built a dedicated DNS cache for various crawlers and bots, to help
> > preserve the cache in their main DNS server. Before that, during the
> > night time (the crawlers' time!) the cache was changing significantly
> > and causing slowdowns for typical users the next day.
> >
> > His recommendation was to use http://unbound.net/ as a local caching DNS
> > service, configured without an upstream, so that it resolves DNS
> > recursively on its own. It even provides a way to dump the cache to disk
> > and load it back.
> >
> > Linux has no built-in DNS cache, so a local cache makes sense if your
> > crawler makes repeated requests to the same website.
> >
> > A.
> >
> > > On 1 Feb 2016, at 11:18, Markus Jelsma <[email protected]> wrote:
> > >
> > > Otis - we tried local DNS caching when we did very large scale crawls
> > > but decided to get rid of it as soon as possible because it caused too
> > > much overhead. Instead, we relied on an apparently powerful DNS server
> > > made available by the ISP in the network center. If the server is fast
> > > and has a lot of RAM, the mapper won't quickly overwhelm it.
> > >
> > > Markus
> > >
> > >
> > > -----Original message-----
> > >> From:Otis Gospodnetić <[email protected]>
> > >> Sent: Sunday 31st January 2016 23:36
> > >> To: Nutch User List <[email protected]>
> > >> Subject: DNS caching best practices
> > >>
> > >> Hi,
> > >>
> > >> The first item on http://wiki.apache.org/nutch/OptimizingCrawls is DNS
> > >> caching. Is this still something people regularly do? Even when running
> > >> in EC2, which I assume has nameservers that are relatively close to
> > >> instances doing crawling and nameserver lookups?
> > >>
> > >> If so, are there any recommendations for the best DNS caching
> > >> server/config to use?
> > >>
> > >> Thanks,
> > >> Otis
> > >> --
> > >> Monitoring - Log Management - Alerting - Anomaly Detection
> > >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
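
As a footnote to the point in the thread that the JVM caches DNS lookups: here is a minimal sketch of how that cache's lifetime can be tuned from Java. The `networkaddress.cache.ttl` and `networkaddress.cache.negative.ttl` security properties are standard JDK settings. The class name `DnsCacheSettings`, the TTL values of 300 and 10 seconds, and the `localhost` lookup are only illustrative assumptions. None of this is something Nutch ships, and nothing here was measured in this thread.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

/** Hypothetical helper that tunes the JVM's built-in DNS cache. Not part of Nutch. */
public class DnsCacheSettings {

    /**
     * Sets how long the JVM keeps successful and failed lookups, in seconds.
     * Call this early, before the first lookup, because the JDK may read
     * these properties only once.
     */
    public static void configure(int positiveTtlSeconds, int negativeTtlSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }

    public static void main(String[] args) throws UnknownHostException {
        configure(300, 10); // assumed values, tune them for your own crawl

        String host = args.length > 0 ? args[0] : "localhost";
        // The first call goes to the resolver; repeats within the TTL are
        // normally answered from the JVM's own cache.
        for (int i = 0; i < 3; i++) {
            long start = System.nanoTime();
            InetAddress address = InetAddress.getByName(host);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println(address + " resolved in " + micros + " us");
        }
    }
}
```

On Java 11 or later this can be tried with `java DnsCacheSettings.java some.host.name`. The timings it prints depend entirely on your resolver and network, so treat them as a rough indication only.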

