Reproduced on Ubuntu 24.04.4 LTS (Noble) with apt 2.8.3 — same signature
as Walter's report. Three separate hosts across different mirror
infrastructures, same hang: two with full captured evidence (gdb / pcap
/ /proc, below), and a third (Host C) that pins down a clean,
deterministic environmental trigger — an AAAA-resolving mirror with no
working IPv6 route. Backport to Noble would be very welcome.


=== TWO CAPTURED INSTANCES ===

    Host A — active reproduction
      OS        Ubuntu 24.04.4 LTS (Noble)
      Kernel    6.8.0-110-generic
      apt       2.8.3
      Mirror    archive.ubuntu.com -> Cloudflare CDN (104.20.28.x)
      Trigger   tight loop with rm /var/lib/apt/lists/* between rounds,
                Acquire::http::Pipeline-Depth=0, Acquire::http::No-Cache=true
      Frozen    ~10 minutes when snapshot taken (then killed)
      Captured  gdb backtrace, pcap (full + SYN/FIN/RST trim), apt-debug.log,
                ps, lsof, /proc/<pid>/{wchan,stack,...}

    Host B — production zombie
      OS        Ubuntu 24.04.4 LTS (Noble)
      Kernel    6.8.0-106-generic
      apt       2.8.3
      Mirror    archive.ubuntu.com -> direct Canonical mirror
                (ubuntu-mirror-{2,3}.ps6.canonical.com)
      Trigger   spontaneous, started by apt.systemd.daily update cron
      Frozen    4 days 15 hours at snapshot time, still alive, holding
                /var/lib/apt/lists/lock (no operator action yet)
      Captured  kernel /proc/<pid>/stack, lsof (incl. CLOSE_WAIT sockets), ps,
                apt-logs/, pipes.txt (gdb couldn't be installed because the
                lock was held by the very zombie we wanted to debug)

Both show the same kernel-side signature:

    [<0>] do_select+0x6e6/0x890
    [<0>] core_sys_select+0x3f6/0x5f0
    [<0>] do_pselect.constprop.0+0xe9/0x190
    [<0>] __x64_sys_pselect6+0x68/0xa0

(syscall 270 = pselect6 on x86_64.) wchan = do_select for parent
and every method process, CPU time 00:00:00, no spinning.
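
The same signature can be checked without gdb by reading /proc directly.
What follows is a minimal sketch, assuming Linux procfs and enough
privilege to read other users' wchan files. The helper names (parse_stat,
looks_parked, scan) are made up for illustration, and "sleeping in
do_select with zero CPU ticks" is only a heuristic: a freshly started
process can match it too.

```python
from pathlib import Path


def parse_stat(stat_text):
    """Return (comm, cpu_ticks) from the text of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may contain spaces or ')',
    # so split on the last ')' instead of on whitespace.
    head, _, tail = stat_text.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = tail.split()
    # fields[0] is stat field 3 (state); utime/stime are fields 14 and 15.
    return comm, int(fields[11]) + int(fields[12])


def looks_parked(wchan_text, cpu_ticks):
    """Heuristic: asleep in select() and never having used any CPU."""
    return wchan_text.strip() == "do_select" and cpu_ticks == 0


def scan(names=("apt-get", "http", "https", "gpgv", "store")):
    """Yield (pid, comm) for matching processes that look parked."""
    for proc in Path("/proc").iterdir():
        if not proc.name.isdigit():
            continue
        try:
            comm, ticks = parse_stat((proc / "stat").read_text())
            wchan = (proc / "wchan").read_text()
        except OSError:
            continue  # process exited, or not readable by this user
        if comm in names and looks_parked(wchan, ticks):
            yield int(proc.name), comm


if __name__ == "__main__":
    for pid, comm in scan():
        print(pid, comm)
```

Combining this with process age (for example, only flagging processes
older than an hour) would cut down false positives.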


=== HOST A — GDB USERSPACE BACKTRACE ===

Parent apt-get update (PID 2383996):

    #0  __select (nfds=8, timeout={tv_sec=0, tv_nsec=381243623})
    #1  pkgAcquire::Run(int)                     <- libapt-pkg.so.6.0
    #2  AcquireUpdate(pkgAcquire&, int, bool, bool)
    #3  ListUpdate(pkgAcquireStatus&, pkgSourceList&, int)
    #4  DoUpdate()                               <- libapt-private.so.0.0

Child /usr/lib/apt/methods/http (PID 2384005):

    #0  __select (nfds=1, timeout=NULL)          <- blocked indefinitely
    #1  WaitFd(int, bool, unsigned long)         <- libapt-pkg.so.6.0


=== HOST B — PIPE TOPOLOGY + CLOSE_WAIT TCP ===

lsof on the four-day zombie shows the parent <-> workers IPC pipes
intact, the parent still holding the apt list lock, and two HTTP workers
still attached to TCP sockets the mirror has already half-closed:

    apt-get 315784 root  4uW REG /var/lib/apt/lists/lock     <- held since May 2 00:14 UTC
    apt-get 315784 root  5r  pipe:[41374374]
    apt-get 315784 root  6r  pipe:[41374391]
    apt-get 315784 root  7r  pipe:[41374419]
    apt-get 315784 root  8w  pipe:[41374375]
    apt-get 315784 root 10w  pipe:[41374392]
    ...
    http    315793 _apt  3u  TCP …:51594 -> ubuntu-mirror-2.ps6.canonical.com:http  (CLOSE_WAIT)
    http    315794 _apt  3u  TCP …:35958 -> ubuntu-mirror-3.ps6.canonical.com:http  (CLOSE_WAIT)

So the Canonical-side mirror sent FIN, the apt-method's read loop saw
select() wake up with readable=1, and read() returned 0. But instead of
closing the socket, cleaning up, and asking the parent for a new URI,
the worker is still in WaitFd(NULL), waiting forever for the next "URI
Acquire" from the parent. The parent's queue ordering means that next
command never comes. The lock stays held, cron retries fail, and the
host needs an operator with kill -9 to recover.
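
The half-close sequence described above is easy to reproduce on
loopback. The sketch below only illustrates the socket semantics (FIN
makes the fd readable, the read returns zero bytes, and the socket stays
in CLOSE_WAIT until the local side closes it). It is not apt's code, and
the names peer_has_closed and demo are hypothetical.

```python
import select
import socket


def peer_has_closed(sock, timeout=5.0):
    """Return True if the peer sent FIN and no data is pending.

    MSG_PEEK leaves any pending payload in the receive buffer.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return False  # nothing arrived within the bounded wait
    return sock.recv(1, socket.MSG_PEEK) == b""


def demo():
    """A loopback server accepts one connection and closes it at once."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()
    server_side.close()   # server sends FIN; client side enters CLOSE_WAIT
    closed = peer_has_closed(client)
    client.close()        # only this close() lets the socket leave CLOSE_WAIT
    listener.close()
    return closed


if __name__ == "__main__":
    print("peer closed:", demo())
```

Note that the select() call here carries a timeout. With timeout=None it
would block indefinitely if nothing ever arrived, which is the shape of
the WaitFd(NULL) frame in the Host A backtrace.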


=== HOST A — TCP-LEVEL EVIDENCE ===

During a 30-minute aggressive-loop capture (with Pipeline-Depth=0 and
No-Cache=true), tcpdump "tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0"
shows:

    [S]   client SYN        603
    [S.]  server SYN-ACK     591
    [F.]  FIN              1 243
    [R]   RST from peer      424

So roughly 70% of established connections (424 RSTs against 591
SYN-ACKs, about 72%) were torn down by the server side. Under those
conditions the queue-ordering bug (post-MR !500) hits within 1–10 rounds.
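
For anyone checking the ratio, the arithmetic from the flag counts is
below. It assumes that the SYN-ACK count is a fair proxy for established
connections, that each reset connection contributes one RST, and that
"1 243" in the table means 1243.

```python
# Counts copied from the tcpdump flag summary above.
syn, syn_ack, fin, rst = 603, 591, 1243, 424

established = syn_ack            # handshakes the server answered
rst_share = rst / established    # fraction that ended with a peer RST
unanswered = syn - syn_ack       # SYNs with no SYN-ACK in the capture

print(f"RST share of established connections: {rst_share:.1%}")
print(f"SYNs without a SYN-ACK: {unanswered}")
```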


=== AT THE MOMENT OF EITHER SNAPSHOT ===

On Host A, ss -tnp returned zero open sockets owned by apt-method
processes. On Host B, lsof shows the sockets still attached to the
worker processes but in CLOSE_WAIT: the same root cause, just observed
earlier in the half-close lifecycle (a socket stays in CLOSE_WAIT until
the process holding the fd closes it, and here the apt-method still
holds it).


=== HOST C — CLEAN ENVIRONMENTAL TRIGGER (NO IPv6 ROUTE + AAAA MIRROR) ===

A third production host (Ubuntu 24.04.4 LTS Noble, kernel
6.8.0-106-generic, apt 2.8.3) hit the same hang, again spontaneously via
apt.systemd.daily update, and sat as a 4-day-6-hour zombie holding
/var/lib/apt/lists/lock (apt-get -qq -y update, CPU time 00:00:00,
parent apt.systemd.daily lock_is_held update, child http/gpgv/store
methods all blocked) until an operator killed it. Same signature as
Hosts A and B.

What makes this instance useful is an unusually clean trigger: the host
has no IPv6 default route and no global IPv6 address, but its mirror
(archive.ubuntu.com / security.ubuntu.com, Cloudflare-fronted) returns
AAAA records (2606:4700:10::ac42:98b0). apt's method prefers IPv6, opens
a connection to an unroutable v6 address, and the same WaitFd() never
returns — so on a dual-stack-DNS / v4-only-routing host the bug fires
every single run, deterministically, with no need for server-side RST
churn.

Two things confirmed this as the proximate cause rather than a slow
mirror:

  - A plain Acquire::http::Timeout alone let apt-get update finish, but only
    after timing out each source:
        W: Failed to fetch http://security.ubuntu.com/.../InRelease  Connection timed out [IP: 104.20.28.246 80]
  - Adding Acquire::ForceIPv4 "true" made apt-get update return rc=0 with no
    warnings at all — apt stopped attempting the unroutable v6 connect entirely.

So for the substantial population of hosts that have AAAA-resolving
mirrors but no working IPv6 path (common behind NAT/CGNAT gateways), the
hang is not intermittent — it is the steady state, and ForceIPv4 removes
the trigger outright.
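
A host can be checked for this condition before it bites. The sketch
below is an assumption-laden heuristic, not a definitive test: it parses
the ten-column /proc/net/ipv6_route table, treats a ::/0 entry as usable
only if it is up, not a reject placeholder, and not on lo, and asks the
resolver for AAAA records. The function names are hypothetical; the
RTF_* values are the ones from linux/route.h.

```python
import socket

RTF_UP = 0x0001      # route is usable
RTF_REJECT = 0x0200  # placeholder entry that rejects traffic


def has_default_v6_route(route_table_text):
    """Return True if the /proc/net/ipv6_route text has a usable ::/0."""
    for line in route_table_text.splitlines():
        fields = line.split()
        if len(fields) != 10:
            continue
        dest, prefix_len = fields[0], fields[1]
        flags, iface = int(fields[8], 16), fields[9]
        if int(dest, 16) != 0 or int(prefix_len, 16) != 0:
            continue  # not a default route
        if flags & RTF_UP and not flags & RTF_REJECT and iface != "lo":
            return True
    return False


def resolves_to_v6(host, port=80):
    """Return True if the resolver hands back at least one AAAA result."""
    try:
        return bool(socket.getaddrinfo(host, port, socket.AF_INET6,
                                       socket.SOCK_STREAM))
    except socket.gaierror:
        return False


if __name__ == "__main__":
    try:
        with open("/proc/net/ipv6_route") as fh:
            routed = has_default_v6_route(fh.read())
    except OSError:
        routed = False  # no table at all, e.g. IPv6 disabled
    if resolves_to_v6("archive.ubuntu.com") and not routed:
        print("AAAA mirror, no v6 default route: consider Acquire::ForceIPv4")
```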


=== WORKAROUND IN PRODUCTION ===

While the proper fix is the queue-ordering patch from apt 3.1.3 — please
backport! — operators on Noble can avoid forever-hangs by giving the
apt-method workers a real timeout:

    # /etc/apt/apt.conf.d/99-timeouts
    Acquire::http::Timeout "30";
    Acquire::https::Timeout "30";
    Acquire::Retries "3";

This converts the eternal select(NULL) in WaitFd() into a bounded wait;
the failed round exits with a normal error and a cron retry usually
succeeds.

On hosts with AAAA-resolving mirrors but no working IPv6 route (Host C
above), also add:

    Acquire::ForceIPv4 "true";

This removes the trigger entirely rather than merely bounding it — apt
never opens the unroutable v6 connection in the first place, so the
update succeeds cleanly instead of timing out each source.
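
As a belt-and-braces measure, the scheduled job itself can be run under
a hard wall-clock deadline, so that even an unbounded wait cannot hold
the lock for days. The wrapper below is a sketch under stated
assumptions, not a tested replacement for apt.systemd.daily: the name
run_with_deadline and the 15-minute limit are arbitrary, and it assumes
the method workers stay in the process group of the apt-get parent so
that one killpg() reaches all of them.

```python
import os
import signal
import subprocess
import sys


def run_with_deadline(argv, deadline_seconds):
    """Run argv; kill its whole process group if the deadline passes.

    Returns the exit code, or None if the deadline was hit.
    """
    # start_new_session puts the child in its own session and process
    # group, so its pid doubles as the process group id.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=deadline_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the child so it does not linger as a zombie
        return None


if __name__ == "__main__":
    rc = run_with_deadline(["apt-get", "-qq", "-y", "update"], 15 * 60)
    # 124 mirrors the exit status timeout(1) uses for an expired deadline.
    sys.exit(124 if rc is None else rc)
```

SIGKILL matches what the operators ended up doing by hand on Hosts B and
C. The list lock is released when the holding process dies.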


=== ATTACHED EVIDENCE ===

apt-hang-noble-2.8.3-evidence.tar.gz contains:

Host A (active reproduction, ~915 KB):
  - info.txt, ps.txt, gdb-backtrace.txt (41 KB, full "thread apply all bt full"
    for every PID — no debug symbols, but the libapt-pkg.so.6.0 symbol table is
    sufficient)
  - proc-<pid>/{wchan,stack,syscall,status,cmdline} per process
  - lsof.txt, ss-all.txt
  - apt-debug.log (187 KB, Debug::pkgAcquire(::Worker) + Debug::Acquire::http(s))
  - apt-hang-flags.pcap (239 KB — 30-min SYN/FIN/RST trim)

Host B (4-day production zombie, ~22 KB):
  - info.txt, ps.txt, locks.txt, pipes.txt
  - proc-<pid>/{wchan,stack,syscall,status,cmdline,fd} per process — kernel-side
    stack identical to Host A
  - lsof.txt — shows the CLOSE_WAIT sockets to canonical mirrors
  - ss-all.txt
  - apt-logs/ — last successful apt run was the upgrade that brought
    kmod 31+20240202-2ubuntu7.2 on 2026-05-01 06:31:49 UTC; the zombie started at
    2026-05-02 00:14 UTC during the next apt.systemd.daily update

Host C (4-day zombie, no-IPv6 trigger, ~5 KB):
  - info.txt — host state proving the trigger: no IPv6 default route / no global
    v6 address, mirror returns AAAA, connect() to the v6 address times out;
    apt-get update with only Acquire::http::Timeout warns "Connection timed out"
    per source, with Acquire::ForceIPv4 returns rc=0 clean
  - zombie-incident.txt — frozen process tree (4 d 6 h, CPU 00:00:00), lsof of
    the held lists/lock, the http methods' mirror endpoint. No gdb/kernel-stack:
    the live zombie was killed by the operator before capture

Full unfiltered pcap from Host A (101 MB) available on request.


** Attachment added: "apt-hang-noble-2.8.3-evidence.tar.gz"
   https://bugs.launchpad.net/ubuntu/+source/apt/+bug/2003851/+attachment/5977615/+files/apt-hang-noble-2.8.3-evidence.tar.gz

https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04
