On 08/06/2015 02:56, Jonathan de Boyne Pollard wrote:
I had to get rid of it for nosh; again, because what's fine on a
hobbyist PC isn't fine in a datacentre.  In the daemontools-style
avoid-restarting-too-often 1 second sleep that ensued whilst dnscache
was doing a restart (to quickly clear the cache of a bogus DNS
resource record set), application X processed several hundred
transaction requests.  Unfortunately, since application X was talking
to dnscache over the loopback interface, the UDP/IP subsystem merrily
informed the DNS client library that it couldn't reach port 53.  (On
a non-loopback interface, the ICMP messages would return too late.)
And thus instead of waiting and retransmitting, the DNS client
library immediately returned a failure to application X for all of
that 1 second's worth of requests.
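The loopback behaviour described above can be demonstrated in a few lines. This is a minimal Python sketch, not dnscache or any real DNS client library; it assumes Linux-style reporting of ICMP port-unreachable on connected UDP sockets, and the port is just a freshly freed one standing in for port 53 while the cache is down:

```python
import socket, time

# Reserve a loopback UDP port, then close it so nothing is listening
# there (a stand-in for dnscache being down during its restart window).
probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
probe.bind(("127.0.0.1", 0))
dead_port = probe.getsockname()[1]
probe.close()

# A *connected* UDP socket gets ICMP port-unreachable reported back
# to it as an error on a subsequent send or recv.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(("127.0.0.1", dead_port))
s.send(b"query")            # on loopback the ICMP reply arrives almost instantly
time.sleep(0.1)
refused = False
try:
    s.send(b"query")        # the queued error surfaces here: ECONNREFUSED
except ConnectionRefusedError:
    refused = True          # hard failure, no retransmit/timeout window
s.close()
print("refused immediately:", refused)
```

On a non-loopback path the ICMP message would come back too late (or not at all), and the client would simply wait out its timeout and retransmit instead of failing instantly.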

 Sorry Jonathan, but this is just screaming unreliable design. If your
dnscache is so critical that you can't afford to have it down for one
second, then why isn't there a backup? Why don't you have several
"nameserver" lines in your /etc/resolv.conf, so that when you restart
one of them, queries are still served by the others?
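For instance, a sketch of such a resolv.conf, assuming the glibc resolver (the addresses and timeout values are placeholders, not a recommendation):

```
# /etc/resolv.conf -- two resolvers; if the first (the local dnscache)
# is unreachable, the query is retried against the second.
nameserver 127.0.0.1
nameserver 192.0.2.53          # example backup resolver address
options timeout:1 attempts:2   # fail over after 1s instead of the default 5s
```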

 In datacenters, you do not ensure continuity of service by minimizing
process downtime (although this is of course a valuable secondary goal).
You ensure continuity of service by making sure it is not a problem at
all when a process goes down, and you give yourself a reasonable margin
of downtime for every process, which will help for outages as well as
rollouts.

 What is true for datacenters even more than for hobbyist PCs, however,
is that you definitely do not want cascading failure. And instant restart
is a recipe for cascading failure: if your dnscache cannot start for
some reason and dies instantly, and your supervisor restarts it
immediately, the machine burns its CPU spinning in that restart loop,
and now you have a whole machine down instead of just one process down.
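The usual defence against that loop is a restart delay that grows with consecutive failures. A minimal sketch of such a backoff policy (the function name and the 1s/60s parameters are illustrative, not any particular supervisor's defaults):

```python
def restart_delay(consecutive_failures, base=1.0, cap=60.0):
    """Exponential backoff: 1s, 2s, 4s, ... capped at `cap` seconds.

    A supervisor that sleeps this long before each restart attempt
    cannot be driven into a busy loop by a service that dies
    instantly on startup; the delay resets once the service has
    stayed up for a while.
    """
    return min(cap, base * (2 ** consecutive_failures))

# Delays after a run of immediate deaths:
print([restart_delay(n) for n in range(8)])
# → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

A healthy service pays the base delay once per deliberate restart; only a service that keeps dying on startup gets pushed out toward the cap.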


Sometimes, one does _not_ want these things.  If it's doing a
graceful restart, I want dnscache back up *right now*, not 1 second
from now.

 Not if it comes at the price of risking a cascading failure in some
cases, no you don't.


Application X, whose rate of continual DNS lookups is why
there's a local dnscache in the first place, needs as close to
uninterrupted DNS service as it can get, even in the face of system
administrators who know that "we can just clear that problem out of
the local cache and get things fixed today by killing the DNS server
and letting it auto-restart, can't we?" and then terminate the
service twice.

 Sysadmins *should* be able to make that assumption, and even if the
restart is delayed by one second (zomg one second of downtime for one
process), they should never hesitate to go for the easy fix.

 I've been an SRE. Trust me, when you're an SRE, you *want* the easy
fixes. You need all your brain power to address the complex issues
without being bothered by something as trivial as a bogus cache
entry. And you also do *not* want to risk a cascading failure every
time you restart a freakin' cache.

 If your process is mission-critical, have more than one instance,
end of story. One second of downtime on one of your processes should
not be visible to the end users, *ever*.

--
 Laurent
