════════════════════════════════════════════════════════════════
ADDENDUM — 2026-05-16: SECOND DEADLOCK: write_lock(ups) HELD
ACROSS USB/MODBUS I/O → CLOSE-WAIT ACCUMULATION WHEN A
USB TRANSACTION STALLS
════════════════════════════════════════════════════════════════
BACKGROUND
The fix described above was deployed on 2026-05-15 10:02 UTC. At
15:59 UTC the same day (5 hours 57 minutes later), a remote apcupsd
instance monitoring the NIS port declared comms lost and sent an alert.
Inspection revealed a qualitatively different failure mode, also caused
by blocking I/O under a mutex — but in the driver layer rather than the
NIS layer.
SYMPTOM
• Munin netstat_established gauge climbed steadily at ~3/min.
NOTE: Linux's CurrEstab (read by the munin netstat plugin from
/proc/net/snmp) includes both ESTABLISHED and CLOSE-WAIT sockets,
per RFC 4022. The gauge therefore reflects the new bug accurately
even though the connection state is CLOSE-WAIT, not ESTABLISHED.
• ss -tan showed:
2581 CLOSE-WAIT ← all on local port 3551 (NIS), Recv-Q=9 each
7 ESTAB
48 LISTEN
15 TIME-WAIT
Recv-Q=9 on every CLOSE-WAIT socket means the remote end closed
after sending its 9-byte NIS request (2-byte length prefix + 7-byte
"status\n"), but apcupsd had not read even the first byte — the
connection threads were blocked before reaching net_recv().
• apcupsd thread count: 2825. All NIS handler threads were in
futex_do_wait (confirmed via /proc/<pid>/task/<tid>/stack).
• All 2825 threads were waiting on the SAME futex address:
syscall 202 (futex) arg1=0x6194fc09a2e0 op=FUTEX_WAIT_PRIVATE val=2
The address falls in the process heap (confirmed via /proc/<pid>/maps),
identifying it as ups->mutex (a pthread_mutex allocated in new_ups()).
• The main thread (TID = PID) was not in futex_wait. Its kernel stack:
reap_as+0xb4/0x120
usbdev_do_ioctl+0x3e9/0x1510
usbdev_ioctl+0xe/0x20
__x64_sys_ioctl+0xa3/0x100
Its syscall: 16 (ioctl) fd=7 request=0x4008550c.
Decoding: _IOW('U', 12, void *) = USBDEVFS_REAPURB.
FD 7 = /dev/bus/usb/005/003 = the APC UPS (lsusb: 051d:0003
American Power Conversion UPS).
The main thread was permanently blocked waiting for a USB URB to
complete, while holding ups->mutex. Every NIS connection thread was
blocked acquiring that mutex.
ROOT CAUSE
The UPS in question is a Smart-UPS 1500 communicating via the apcupsd
MODBUS driver (DRIVER: MODBUS UPS Driver / src/drivers/modbus/).
apcupsd also links libusb-0.1 for the generic USB HID driver; both
drivers exhibit the same structural flaw.
─── Flaw 1: UPS I/O under write_lock ──────────────────────────────
read_volatile_data() and read_static_data() in src/drivers/usb/usb.c
(and their equivalents in src/drivers/modbus/modbus.cpp) hold
write_lock(_ups) — which is P(ups->mutex) — across every USB or MODBUS
I/O call:
MODBUS (modbus.cpp):
bool ModbusUpsDriver::read_volatile_data() {
write_lock(_ups); ← ups->mutex acquired
bool ret = UpdateCis(true); ← calls _comm->ReadRegister() per CI
_ups->poll_time = time(NULL);
write_unlock(_ups);
}
USB/libusb (usb.c):
bool UsbUpsDriver::read_volatile_data() {
write_lock(_ups); ← ups->mutex acquired
...
usb_update_value(ci); ← calls pusb_get_value() → GetReport()
...
write_unlock(_ups);
}
Additionally, GenericUsbUpsDriver::Open() in
src/drivers/usb/generic/generic-usb.c held write_lock(_ups) across
_hidups.Open(), which enumerates the USB bus and opens the device.
─── Flaw 2: libusb-0.1 timeout mechanism fails in multi-threaded
processes ─────────────────────────────────────────────
HidUps::GetReport() (src/libusbhid/HidUps.cpp) calls usb_control_msg()
with a 1000 ms timeout:
actlen = usb_control_msg(_fd, ..., 1000);
libusb-0.1 implements this timeout via SIGALRM. In a multi-threaded
process, SIGALRM is delivered to an arbitrary thread — not necessarily
the one blocked in USBDEVFS_REAPURB. The ioctl therefore never receives
EINTR, and the 1-second timeout silently misfires. When the UPS USB
interface stalls (firmware hiccup or USB bus error), USBDEVFS_REAPURB
blocks indefinitely.
This is a known limitation of libusb-0.1; libusb-1.0 addresses it with
a proper async API.
─── Combined effect ────────────────────────────────────────────────
1. A USB/MODBUS transaction fails to complete (cause unknown —
UPS, USB controller, or bus).
2. Main thread blocks in USBDEVFS_REAPURB while holding ups->mutex.
3. Every incoming NIS connection spawns a thread that immediately
blocks at P(ups->mutex) inside attach_ups().
4. Each blocked thread has accepted a socket FD. The remote end
eventually closes (sending FIN), transitioning the socket to
CLOSE-WAIT. The thread never calls net_recv() or net_close()
because it is blocked before that point.
5. CLOSE-WAIT sockets accumulate without bound. With RLIMIT_NOFILE
raised to 65536 (from the first fix), accumulation continued
undetected for ~14.5 hours before the daemon was restarted.
RELATIONSHIP TO THE FIRST FIX
The first fix resolved the NIS mutex deadlock and, as a necessary
consequence, RLIMIT_NOFILE was raised from 1024 to 65536.
This second bug existed prior to the first fix, but its effect was
masked: a USB stall would produce FD exhaustion (indistinguishable in
symptom from the write_nbytes bug) at 1024 FDs, much sooner. With the
NIS mutex bug fixed and the FD limit raised, the second bug became the
dominant failure mode and accumulated long enough to be diagnosed
cleanly.
The trigger itself is an independent event whose precise cause is
unknown. It is not caused by the first fix.
TIMELINE (all times UTC)
2026-05-15 10:02:32 apcupsd restarted with first fix deployed
2026-05-15 ~15:00 USB/MODBUS transaction stalls (cause unknown)
2026-05-15 15:59:36 Remote monitor declares NIS comms lost; alert sent
(seeder.siblings.net, 192.168.122.33 — the same
host visible in all CLOSE-WAIT connections)
2026-05-16 06:27:30 apcupsd restarted with second fix deployed;
2581 CLOSE-WAIT sockets cleared immediately
FIX
The same principle as the first fix: do not hold the mutex across
blocking I/O. Take write_lock only for the brief struct update after
I/O completes.
Fix A — src/drivers/modbus/modbus.cpp
UpdateCi(): acquire write_lock after _comm->ReadRegister() returns
and the data is fully parsed into local variables; release after the
switch statement that updates the _ups struct. Callers
read_volatile_data() and read_static_data() remove their outer
write_lock/_unlock wrappers entirely, since UpdateCi() now owns the
lock. read_volatile_data() takes a brief write_lock only to update
_ups->poll_time after the I/O loop completes.
Fix B — src/drivers/usb/usb.c
usb_update_value(): acquire write_lock after usb_get_value() (which
calls the driver's pusb_get_value() and therefore any USB I/O)
returns; release after usb_process_value() updates the _ups struct.
read_volatile_data() and read_static_data() remove their outer
write_lock/_unlock; read_volatile_data() takes a brief write_lock
for the initial _ups->poll_time and Status bookkeeping, then
releases before the I/O loop.
Fix C — src/drivers/usb/generic/generic-usb.c
Open(): move _hidups.Open() (USB bus enumeration and device open)
outside write_lock. Retain write_lock only for the two _ups struct
field assignments (_ups->fd, _ups->clear_slave()) that follow.
── Patch ──────────────────────────────────────────────────────────
--- src/drivers/modbus/modbus.cpp.orig
+++ src/drivers/modbus/modbus.cpp
@@ UpdateCi() — after "delete [] data;" @@
+ /* MODBUS I/O complete; hold write_lock only for the brief struct
+ * update below. Callers must not hold write_lock on entry. */
+ write_lock(_ups);
+
astring tmpstr;
struct tm tmp;
time_t date;
switch (info->ci)
{
...all cases unchanged...
}
+ write_unlock(_ups);
return true;
@@ read_static_data() @@
bool ModbusUpsDriver::read_static_data()
{
- write_lock(_ups);
- bool ret = UpdateCis(false);
- write_unlock(_ups);
- return ret;
+ /* UpdateCi() takes write_lock only for each struct update. */
+ return UpdateCis(false);
}
@@ read_volatile_data() @@
bool ModbusUpsDriver::read_volatile_data()
{
- write_lock(_ups);
- bool ret = UpdateCis(true);
- if (ret)
- {
- _ups->poll_time = time(NULL);
- }
- write_unlock(_ups);
- return ret;
+ bool ret = UpdateCis(true);
+ if (ret)
+ {
+ write_lock(_ups);
+ _ups->poll_time = time(NULL);
+ write_unlock(_ups);
+ }
+ return ret;
}
--- src/drivers/usb/usb.c.orig
+++ src/drivers/usb/usb.c
@@ usb_update_value() @@
bool UsbUpsDriver::usb_update_value(int ci)
{
USB_VALUE uval;
- if (!usb_get_value(ci, &uval))
- return false;
- usb_process_value(ci, &uval);
- return true;
+ /* USB I/O without write_lock: libusb-0.1's SIGALRM-based timeout
+ * is unreliable in multi-threaded processes; usb_control_msg() can
+ * block indefinitely if the UPS stalls, deadlocking NIS threads. */
+ if (!usb_get_value(ci, &uval))
+ return false;
+ write_lock(_ups);
+ usb_process_value(ci, &uval);
+ write_unlock(_ups);
+ return true;
}
@@ read_volatile_data() @@
- write_lock(_ups);
- _ups->poll_time = now;
- _ups->Status &= ~0xFF;
- for (int i=0; _known_info[i].usage_code; i++) {
- if (_known_info[i].isvolatile && _known_info[i].ci != CI_NONE)
- usb_update_value(_known_info[i].ci);
- }
- write_unlock(_ups);
+ /* Hold write_lock only for initial bookkeeping; usb_update_value()
+ * takes it briefly for each struct update after USB I/O completes. */
+ write_lock(_ups);
+ _ups->poll_time = now;
+ _ups->Status &= ~0xFF;
+ write_unlock(_ups);
+ for (int i=0; _known_info[i].usage_code; i++) {
+ if (_known_info[i].isvolatile && _known_info[i].ci != CI_NONE)
+ usb_update_value(_known_info[i].ci);
+ }
return 1;
@@ read_static_data() @@
bool UsbUpsDriver::read_static_data()
{
- write_lock(_ups);
- for (int i=0; _known_info[i].usage_code; i++) {
- if (!_known_info[i].isvolatile && _known_info[i].ci != CI_NONE)
- usb_update_value(_known_info[i].ci);
- }
- write_unlock(_ups);
+ /* usb_update_value() handles its own write_lock. */
+ for (int i=0; _known_info[i].usage_code; i++) {
+ if (!_known_info[i].isvolatile && _known_info[i].ci != CI_NONE)
+ usb_update_value(_known_info[i].ci);
+ }
return 1;
}
--- src/drivers/usb/generic/generic-usb.c.orig
+++ src/drivers/usb/generic/generic-usb.c
@@ Open() @@
bool GenericUsbUpsDriver::Open()
{
- write_lock(_ups);
- bool rc = _hidups.Open(_ups->device);
- _ups->fd = 1;
- _ups->clear_slave();
- write_unlock(_ups);
+ /* USB device enumeration can block; no lock during I/O. */
+ bool rc = _hidups.Open(_ups->device);
+ write_lock(_ups);
+ _ups->fd = 1;
+ _ups->clear_slave();
+ write_unlock(_ups);
return rc;
}
NOTES
• The comment in src/lib/apcstatus.c (the output_status() function)
explicitly acknowledges the long-lock problem from the NIS reader's
side ("Note that this creates the potential for a misbehaving client
to create long lock hold times, essentially locking the driver out")
and even suggests the correct fix ("making a local copy of UPSINFO
under the lock, then releasing the lock and feeding the client from
the copy"). The driver side has the symmetric problem, with no
analogous comment.
• check_state() in both the linux and generic USB drivers already
performs USB interrupt reads (select() + read()) without holding
write_lock, acquiring it only briefly when a changed value is found.
The polling path (read_volatile_data) should follow the same pattern,
and now does.
• check_state() in modbus.cpp calls UpdateCi(CI_STATUS) without holding
write_lock (a pre-existing omission). After this fix, UpdateCi()
takes write_lock internally, so that call is now also protected.
• The trigger is a USB/MODBUS transaction that fails to complete —
whether due to UPS firmware, USB controller behaviour, or some other
cause is unknown. It is an independent event, not reproducible on
demand in the way the first bug was. The fix is verified by code
inspection and by the clean disappearance of CLOSE-WAIT accumulation
after the patched binary was started. A recurrence would require
another such event.
• This analysis, test script, and patch were developed with the assistance
of Claude Code (Anthropic), model claude-sonnet-4-6, on 2026-05-16.
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2152692
Title:
apcupsd NIS server deadlock: write_nbytes() has no send timeout, mutex
held across network I/O → FD exhaustion under broken TCP
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/apcupsd/+bug/2152692/+subscriptions
--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs