Public bug reported:
Binary package hint: bind
Bind lived up to its name this morning. I have a bind server that
effectively serves triple duty providing:
- Public zones
- Private zones (for the lan)
- Recursive lookups and caching for lan hosts
The first two are effectively the same thing except for some ACLs, but
that's really beside the point. Anyway, for the sake of this example,
assume the public zones and private zones are all slave zones. The
public zones load over public ip addresses, while the private (lan)
zones load over an ipsec connection to the master's network.
This morning the ipsec connection went away, as it occasionally does,
and shortly thereafter so did bind. What really boggled me was that
named would become completely unresponsive, even after rebooting the
server, within a couple minutes of startup. It would refuse to do any
lookups, would fail to resolve its own zones, and even failed to respond
to rndc for restarts. Basically, I'd have to kill -9 it, then restart.
It would work for a minute or two, then re-hang.
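A hang like this is easier to catch with an outside probe than by waiting for lan hosts to complain. The following is a minimal sketch, not anything that ships with bind: it sends one hand-built DNS query over UDP and reports whether a reply with the same transaction ID comes back within a timeout. The names `build_query` and `named_responds`, the query ID, and the timeout values are all assumptions made for illustration.

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (default: A record, class IN)."""
    # Header: ID, flags (recursion desired), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def named_responds(server, name, timeout=2.0, port=53):
    """Return True if `server` answers a UDP query for `name` within `timeout` seconds."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            data, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable-host errors
            return False
    # Only check that the reply is long enough to hold a header and echoes our ID.
    return len(data) >= 12 and data[:2] == query[:2]


# Hypothetical usage from a cron job or monitoring script:
#   if not named_responds("127.0.0.1", "example.com"):
#       ...alert or restart named...
```

The probe deliberately does not parse the answer; for detecting a hung named, "any reply at all" versus "no reply" is the distinction that matters.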
After poking around for some time, I broke out strace and had a look at
a few of the threads running under named. One thread looked like it was
waiting for input from the master over the down ipsec connection.
Another appeared to be in a bit of an infinite loop of the following:
clock_gettime(CLOCK_REALTIME, {1198072309, 972541645}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1671, {0, 403081355}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072310, 377730}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1198072310, 379388030}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1673, {0, 11452970}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072310, 392700}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1198072310, 394254492}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1675, {0, 77563508}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072310, 473689}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1198072310, 475455403}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1677, {0, 498233597}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072310, 976660}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1198072310, 977375142}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1679, {0, 499284858}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072311, 478563}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1198072311, 480351624}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1681, {0, 498211376}) = ? ERESTART_RESTARTBLOCK (To be restarted)
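To make the trace above easier to read, the timed waits can be pulled out with a few lines of Python. This is only a sketch: the regex and the function name `futex_wait_timeouts` are assumptions for illustration, and the sample lines are copied from the trace. Each FUTEX_WAIT there carries a relative timeout of roughly half a second or less and returns ETIMEDOUT, which suggests (but does not prove) that this thread is sleeping on a timer rather than spinning.

```python
import re

# Matches strace lines such as:
#   futex(0xb7a72044, FUTEX_WAIT, 1671, {0, 403081355}) = -1 ETIMEDOUT (...)
FUTEX_WAIT = re.compile(
    r"futex\((0x[0-9a-f]+), FUTEX_WAIT, \d+, \{(\d+), (\d+)\}\)\s*=\s*(.*)"
)


def futex_wait_timeouts(trace):
    """Return a list of (timeout_in_seconds, result_text) for each timed FUTEX_WAIT."""
    waits = []
    for line in trace.splitlines():
        match = FUTEX_WAIT.search(line)
        if match:
            seconds = int(match.group(2)) + int(match.group(3)) / 1e9
            waits.append((seconds, match.group(4).strip()))
    return waits


SAMPLE = """\
futex(0xb7a72044, FUTEX_WAIT, 1671, {0, 403081355}) = -1 ETIMEDOUT (Connection timed out)
futex(0xb7a72010, FUTEX_WAKE, 1) = 0
futex(0xb7a72044, FUTEX_WAIT, 1677, {0, 498233597}) = -1 ETIMEDOUT (Connection timed out)
"""

for timeout, result in futex_wait_timeouts(SAMPLE):
    print(f"{timeout:.3f}s -> {result}")
```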
I took the hint and commented out the couple of private zones that required the
master over ipsec. Following that, named has stayed up and running as normal.
Apparently somewhere in the bind code, if it doesn't hear back from a master it
will literally wait forever and stop serving all data. This, imo, is not good.
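For comparison, here is what a bounded wait on a master could look like from the outside. This is a sketch of a standalone reachability check, not a description of how bind handles zone transfers internally: it attempts a TCP connection to the master's DNS port and gives up after a fixed timeout instead of blocking indefinitely. The function name `master_reachable`, the default timeout, and the example address are hypothetical.

```python
import socket


def master_reachable(host, port=53, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection applies the timeout to the connect attempt itself,
        # so a silently dropped path (e.g. a dead tunnel) cannot block forever.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


# Hypothetical usage, with a placeholder documentation address for the master:
#   if not master_reachable("192.0.2.10"):
#       ...log that the ipsec path to the master looks down...
```

A check like this could run before re-enabling the private zones, so they are only brought back once the path to the master is actually up.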
I also have the following additional observations to add:
- This is not the first time the ipsec connection has gone away, but it's the
first time I've seen this. It may also be the first time ipsec has been down
since upgrading to edgy, so the problem may be new in bind 9.4. It could also
be a bizarre coincidence.
- The public zones, which resolve over public ip addresses, did not cause a
failure even when their master was unreachable. This leads me to believe that
there is something about the way ipsec dealt with bind's queries that was
creating the condition, but I still think it's a condition bind should be able
to deal with.
** Affects: bind (Ubuntu)
Importance: Undecided
Status: New
--
loss of masters causing bind to become unresponsive
https://bugs.launchpad.net/bugs/177489