[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

2024-03-25 Thread Adam Saponara
Following up, `Acquire::Retries=0` seems to be avoiding the bug in prod,
which strengthens my suspicion that it's related to retries.

As for the actual bug, when apt is hanging, I notice that the work queue
has items in it, but they have not been delegated to a worker. There is
a condition that is supposed to "unstuck" the queue here[1], but for
some reason `GetFetchAfter` is returning a zeroed time_point which
avoids that condition[2].

The way we're setting `FetchAfter` looks OK to me[3] though I'm not
familiar with std::chrono. Maybe there's some other place where we're
partially copying the item or instantiating a new item and failing to
copy the `FetchAfter` value.

A bandaid patch is attached. I'm not sure if there's any downside in
calling `Cycle` there. A real bugfix IMO would address the zeroed
`FetchAfter` value described above, or potentially refactor so that work
items are immediately delegated instead of having to rely on calling
`Cycle` at the right time. I am missing a lot of context so that maybe
that's not possible.

[1] https://salsa.debian.org/apt-
team/apt/-/blob/db797312548f6e7d6d1d1faa9f1e48c1e521d290/apt-
pkg/acquire.cc#L738-740

[2] https://salsa.debian.org/apt-
team/apt/-/blob/db797312548f6e7d6d1d1faa9f1e48c1e521d290/apt-
pkg/acquire.cc#L733-736

[3] https://salsa.debian.org/apt-
team/apt/-/blob/db797312548f6e7d6d1d1faa9f1e48c1e521d290/apt-
pkg/acquire-worker.cc#L651-655


** Attachment added: "patch"
   
https://bugs.launchpad.net/ubuntu/+source/apt/+bug/2003851/+attachment/5759339/+files/patch

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

Status in apt package in Ubuntu:
  Confirmed

Bug description:
  Hi!

  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.

  
  We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```

  Task listing:
  ```
  root  153929  \_ /usr/sbin/CRON -f -P
  root  153942  \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root  153943  \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root  154026  \_ apt-get update
  _apt  154029  \_ /usr/lib/apt/methods/http
  _apt  154030  \_ /usr/lib/apt/methods/http
  _apt  154031  \_ /usr/lib/apt/methods/http
  _apt  154033  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
  tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http 
  tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http 
  tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http 
  ```
  All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, 
NULL, NULL).

  The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [],
  NULL, {tv_sec=0, tv_nsec=5}, NULL).

  The http sockets in the children were at fd=3.

  Parent lsof:
  ```
  # lsof -p 154026 +E
  ...
  apt-get   154026 root4uW  REG8,10  262281 
/var/lib/apt/lists/lock
  apt-get   154026 root5r  FIFO   0,13  0t0 4015176 pipe 154029,http,1w
  apt-get   154026 root6r  FIFO   0,13  0t0 4012448 pipe 154030,http,1w
  apt-get   154026 root7r  FIFO   0,13  0t0 4015192 pipe 154031,http,1w
  apt-get   154026 root8w  FIFO   0,13  0t0 4015177 pipe 154029,http,0r
  apt-get   154026 root9r  FIFO   0,13  0t0 4015233 pipe 154033,gpgv,1w
  apt-get   154026 root   10w  FIFO   0,13  0t0 4012449 pipe 154030,http,0r
  apt-get   154026 root   12w  FIFO   0,13  0t0 4015193 pipe 154031,http,0r
  apt-get   154026 root   14w  FIFO   0,13  0t0 4015234 pipe 154033,gpgv,0r
  http  154029 _apt0r  FIFO   0,13  0t0 4015177 pipe 
154026,apt-get,8w
  http  154029 _apt1w  FIFO   0,13  0t0 4015176 pipe 
154026,apt-get,5r
  http  154030 _apt0r  FIFO   0,13  0t0 4012449 pipe 
154026,apt-get,10w
  http  154030 _apt1w  FIFO   0,13  0t0 4012448 pipe 

[Touch-packages] [Bug 2003851] Re: occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

2024-03-21 Thread Adam Saponara
Hello, also affected by this.

I'm able to reproduce the bug using Walter's mock server. I suspect it
may be related to the retry code that kicks in when a transient error is
encountered (like a 503). Retries are enabled by default since apt
2.3.2, which seems to fit the regression window people have mentioned
here.

Frustratingly, I can't reproduce while step debugging, I'm guessing
because the bug is clock-dependent.

Setting either `Acquire::Retries::Delay=false` or `Acquire::Retries=0`
seems to mitigate it. I'm curious if that works for others as well. I'll
be testing this in prod soon and will report back.

In the mean time I'll keep trying to find the actual bug.

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to apt in Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

Status in apt package in Ubuntu:
  Confirmed

Bug description:
  Hi!

  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.

  
  We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```

  Task listing:
  ```
  root  153929  \_ /usr/sbin/CRON -f -P
  root  153942  \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root  153943  \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root  154026  \_ apt-get update
  _apt  154029  \_ /usr/lib/apt/methods/http
  _apt  154030  \_ /usr/lib/apt/methods/http
  _apt  154031  \_ /usr/lib/apt/methods/http
  _apt  154033  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
  tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http 
  tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http 
  tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http 
  ```
  All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, 
NULL, NULL).

  The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [],
  NULL, {tv_sec=0, tv_nsec=5}, NULL).

  The http sockets in the children were at fd=3.

  Parent lsof:
  ```
  # lsof -p 154026 +E
  ...
  apt-get   154026 root4uW  REG8,10  262281 
/var/lib/apt/lists/lock
  apt-get   154026 root5r  FIFO   0,13  0t0 4015176 pipe 154029,http,1w
  apt-get   154026 root6r  FIFO   0,13  0t0 4012448 pipe 154030,http,1w
  apt-get   154026 root7r  FIFO   0,13  0t0 4015192 pipe 154031,http,1w
  apt-get   154026 root8w  FIFO   0,13  0t0 4015177 pipe 154029,http,0r
  apt-get   154026 root9r  FIFO   0,13  0t0 4015233 pipe 154033,gpgv,1w
  apt-get   154026 root   10w  FIFO   0,13  0t0 4012449 pipe 154030,http,0r
  apt-get   154026 root   12w  FIFO   0,13  0t0 4015193 pipe 154031,http,0r
  apt-get   154026 root   14w  FIFO   0,13  0t0 4015234 pipe 154033,gpgv,0r
  http  154029 _apt0r  FIFO   0,13  0t0 4015177 pipe 
154026,apt-get,8w
  http  154029 _apt1w  FIFO   0,13  0t0 4015176 pipe 
154026,apt-get,5r
  http  154030 _apt0r  FIFO   0,13  0t0 4012449 pipe 
154026,apt-get,10w
  http  154030 _apt1w  FIFO   0,13  0t0 4012448 pipe 
154026,apt-get,6r
  http  154031 _apt0r  FIFO   0,13  0t0 4015193 pipe 
154026,apt-get,12w
  http  154031 _apt1w  FIFO   0,13  0t0 4015192 pipe 
154026,apt-get,7r
  gpgv  154033 _apt0r  FIFO   0,13  0t0 4015234 pipe 
154026,apt-get,14w
  gpgv  154033 _apt1w  FIFO   0,13  0t0 4015233 pipe 
154026,apt-get,9r
  ```
  So:
  - apt-get is waiting for any data written by any of its four children (at fd 
5/6/7/9)
  - http and gpgv are waiting for any data written by their parent (at their 
respective fd 0)

  Parent backtrace:
  ```
  #0  0x7f420116a74d in select ()
 from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x7f420153fb5d in pkgAcquire::Run(int) ()
 from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #2  0x7f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) ()
 from