** Description changed:

+ [Impact]
+ APT can hang during `apt update` if a server that we request multiple 
repositories from produces a temporary failure such as a 503 for at least one 
of the InRelease file after successfully fetching another one.
+ 
+ The expected behavior is that APT retries here, and does not hang, and
+ eventually fails if the error is persistent.
+ 
+ [Test plan]
+ The extensive test suite of apt has been extended with a regression test 
+ 
+     test/integration/test-ubuntu-bug-2003851-retry-after
+ 
+ that reproduces the problematic scenario and ensures it is fixed.
+ 
+ Reproducing this with real-life deployments is tricky since it does
+ depend on network timing to some extend: The failing InRelease file must
+ be recorded as failing *before* we start queueing Packages from the good
+ repository.
+ 
+ [Where problems could occur]
+ The SRU addresses the problem by changing the ordering of items in the queue 
to enqueue them with decreasing order of retry-after times, rather than 
increasing ones. This ensures that queues that got stuck are detected as such, 
but it invariably changes the behavior of the retry functionality:
+ 
+ Previously the retry-after was file specific. If file A failed, we still
+ tried file B before trying file A again. Now files with a retry-after
+ come in decreasing order, so the longest retry time effectively is
+ applied to all files not yet requested from the server.
+ 
+ This can cause APT to operate slower, but on the flip side, it avoids
+ bombarding the server with follow-up requests for a good InRelease file
+ such as Packages which then likely also fail - failures in real life
+ aren't usually file specific, given that we serve fixed files; even on a
+ CDN, it's likely the entire backend fails rather than only one file.
+ 
+ There's some risk that APT could get stuck on other scenarios now,
+ albeit, given the nature of the fix, this seems reasonably low.
+ 
+ [Original bug report]
  Hi!
  
  Yesterday I spotted several machines of ours where a period `apt-get
  update` was stalled. The `http` children were hanging in `WaitFd`
  (waiting for parent instructions/queue). The parent was looping in
  `AcquireUpdate` every 500ms.
  
- 
- We have a cronjob that runs every few hours which calls `apt-get update` and 
does some post-processing. We noticed that several of them had stalled at some 
point in time. Killing the parent (apt-get) got it unstuck, removing the locks.
+ We have a cronjob that runs every few hours which calls `apt-get update`
+ and does some post-processing. We noticed that several of them had
+ stalled at some point in time. Killing the parent (apt-get) got it
+ unstuck, removing the locks.
  
  Example:
  ```
  # apt-get update
  Reading package lists... Done
  E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 
(apt-get)
  N: Be aware that removing the lock file is not a solution and may break your 
system.
  E: Unable to lock directory /var/lib/apt/lists/
  ```
  
  Task listing:
  ```
  root      153929  \_ /usr/sbin/CRON -f -P
  root      153942      \_ /bin/sh -c  [ -x /etc/zabbix/scripts/dpkg.updates ] 
&& /etc/zabbix/scripts/dpkg.updates --cron
  root      153943          \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
  root      154026              \_ apt-get update
  _apt      154029                  \_ /usr/lib/apt/methods/http
  _apt      154030                  \_ /usr/lib/apt/methods/http
  _apt      154031                  \_ /usr/lib/apt/methods/http
  _apt      154033                  \_ /usr/lib/apt/methods/gpgv
  ```
  Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
  ```
  # netstat -apn | grep -E '154026|154029|154030|154031|154033'
- tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http         
- tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http         
- tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http         
+ tcp  1  0  10.x.x.x:60868  217.x.x.x:80  CLOSE_WAIT  154030/http
+ tcp  1  0  10.x.x.x:40756  178.x.x.x:80  CLOSE_WAIT  154029/http
+ tcp  1  0  10.x.x.x:56818  185.x.x.x:80  CLOSE_WAIT  154031/http
  ```
  All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, 
NULL, NULL).
  
  The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL,
  {tv_sec=0, tv_nsec=500000000}, NULL).
  
  The http sockets in the children were at fd=3.
  
  Parent lsof:
  ```
  # lsof -p 154026 +E
  ...
  apt-get   154026 root    4uW  REG    8,1        0  262281 
/var/lib/apt/lists/lock
  apt-get   154026 root    5r  FIFO   0,13      0t0 4015176 pipe 154029,http,1w
  apt-get   154026 root    6r  FIFO   0,13      0t0 4012448 pipe 154030,http,1w
  apt-get   154026 root    7r  FIFO   0,13      0t0 4015192 pipe 154031,http,1w
  apt-get   154026 root    8w  FIFO   0,13      0t0 4015177 pipe 154029,http,0r
  apt-get   154026 root    9r  FIFO   0,13      0t0 4015233 pipe 154033,gpgv,1w
  apt-get   154026 root   10w  FIFO   0,13      0t0 4012449 pipe 154030,http,0r
  apt-get   154026 root   12w  FIFO   0,13      0t0 4015193 pipe 154031,http,0r
  apt-get   154026 root   14w  FIFO   0,13      0t0 4015234 pipe 154033,gpgv,0r
  http      154029 _apt    0r  FIFO   0,13      0t0 4015177 pipe 
154026,apt-get,8w
  http      154029 _apt    1w  FIFO   0,13      0t0 4015176 pipe 
154026,apt-get,5r
  http      154030 _apt    0r  FIFO   0,13      0t0 4012449 pipe 
154026,apt-get,10w
  http      154030 _apt    1w  FIFO   0,13      0t0 4012448 pipe 
154026,apt-get,6r
  http      154031 _apt    0r  FIFO   0,13      0t0 4015193 pipe 
154026,apt-get,12w
  http      154031 _apt    1w  FIFO   0,13      0t0 4015192 pipe 
154026,apt-get,7r
  gpgv      154033 _apt    0r  FIFO   0,13      0t0 4015234 pipe 
154026,apt-get,14w
  gpgv      154033 _apt    1w  FIFO   0,13      0t0 4015233 pipe 
154026,apt-get,9r
  ```
  So:
  - apt-get is waiting for any data written by any of its four children (at fd 
5/6/7/9)
  - http and gpgv are waiting for any data written by their parent (at their 
respective fd 0)
  
  Parent backtrace:
  ```
  #0  0x00007f420116a74d in select ()
-    from /lib/x86_64-linux-gnu/libc.so.6
+    from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x00007f420153fb5d in pkgAcquire::Run(int) ()
-    from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
+    from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #2  0x00007f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) ()
-    from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
+    from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #3  0x00007f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) 
()
-    from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
+    from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #4  0x00007f42016d127b in DoUpdate (CmdL=...)
-     at ./apt-private/private-update.cc:73
+     at ./apt-private/private-update.cc:73
  #5  0x00007f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch 
const*, bool) ()
-    from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
- #6  0x00007f420169fa97 in DispatchCommandLine (CmdL=..., 
-     Cmds=std::vector of length 27, capacity 32 = {...})
-     at ./apt-private/private-cmndline.cc:588
+    from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
+ #6  0x00007f420169fa97 in DispatchCommandLine (CmdL=...,
+     Cmds=std::vector of length 27, capacity 32 = {...})
+     at ./apt-private/private-cmndline.cc:588
  #7  0x0000561fc06bafbd in main (argc=2, argv=0x7ffc0e14aa38)
-     at ./cmdline/apt-get.cc:447
+     at ./cmdline/apt-get.cc:447
  ```
  Child backtrace:
  ```
  #0  0x00007f58b1c9b74d in select ()
-    from /lib/x86_64-linux-gnu/libc.so.6
+    from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x00007f58b237cb43 in WaitFd(int, bool, unsigned long) ()
-    from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
+    from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
  #2  0x00005643e37bc7b3 in BaseHttpMethod::Loop (
-     this=0x7ffdeaed56c0) at ./methods/basehttp.cc:611
+     this=0x7ffdeaed56c0) at ./methods/basehttp.cc:611
  #3  main (argv=<optimized out>) at ./methods/http.cc:1052
  ```
  Where this is:
  ```
  (gdb) print *this
  $1 = {<aptAuthConfMethod> = {<aptMethod> = {<pkgAcqMethod> = {<No data 
fields>}, <aptConfigWrapperForMethods> = {
-         methodNames = std::vector of length 2, capacity 2 = {
-           "http", "http::XXXXXX.nl"}}, Binary = "http", 
-       SeccompFlags = 6}, 
-     authconfs = std::vector of length 1, capacity 1 = {
-       std::unique_ptr<FileFd> = {get() = 0x5643e5343100}}}, 
-   Server = std::unique_ptr<ServerState> = {
-     get() = 0x5643e534fba0}, NextURI = "", 
-   AllowRedirect = true, Debug = false, PipelineDepth = 10, 
-   static FailFile = {static npos = 18446744073709551615, 
-     _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> 
= {<No data fields>}, <No data fields>}, 
-       _M_p = 0x5643e534d650 
"/var/lib/apt/lists/partial/apt.osso.nl_ubuntu-security_dists_jammy-security_InRelease"},
 
-     _M_string_length = 85, {
-       _M_local_buf = "\226", '\000' <repeats 14 times>, 
-       _M_allocated_capacity = 150}}, static FailFd = -1, 
-   static FailTime = 1674071760}
+         methodNames = std::vector of length 2, capacity 2 = {
+           "http", "http::XXXXXX.nl"}}, Binary = "http",
+       SeccompFlags = 6},
+     authconfs = std::vector of length 1, capacity 1 = {
+       std::unique_ptr<FileFd> = {get() = 0x5643e5343100}}},
+   Server = std::unique_ptr<ServerState> = {
+     get() = 0x5643e534fba0}, NextURI = "",
+   AllowRedirect = true, Debug = false, PipelineDepth = 10,
+   static FailFile = {static npos = 18446744073709551615,
+     _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> 
= {<No data fields>}, <No data fields>},
+       _M_p = 0x5643e534d650 
"/var/lib/apt/lists/partial/apt.osso.nl_ubuntu-security_dists_jammy-security_InRelease"},
+     _M_string_length = 85, {
+       _M_local_buf = "\226", '\000' <repeats 14 times>,
+       _M_allocated_capacity = 150}}, static FailFd = -1,
+   static FailTime = 1674071760}
  ```
  
  Relevant code in parent:
  ```
  pkgAcquire::RunResult pkgAcquire::Run(int PulseInterval)
  {
  ...
-    // Run till all things have been acquired
-    struct timeval tv = 
SteadyDurationToTimeVal(std::chrono::microseconds(PulseInterval));
-    while (ToFetch > 0)
+    // Run till all things have been acquired
+    struct timeval tv = 
SteadyDurationToTimeVal(std::chrono::microseconds(PulseInterval));
+    while (ToFetch > 0)
  ...
-       int Res;
-       do
-       {
-          Res = select(Highest+1,&RFds,&WFds,0,&tv);
-       }
-       while (Res < 0 && errno == EINTR);
+       int Res;
+       do
+       {
+          Res = select(Highest+1,&RFds,&WFds,0,&tv);
+       }
+       while (Res < 0 && errno == EINTR);
  ```
  Relevant code in child:
  ```
  int BaseHttpMethod::Loop()
  {
-    signal(SIGTERM,SigTerm);
-    signal(SIGINT,SigTerm);
-    
-    Server = 0;
-    
-    int FailCounter = 0;
-    while (1)
-    {      
-       // We have no commands, wait for some to arrive
-       if (Queue == 0)
-       {
-          if (WaitFd(STDIN_FILENO) == false)
-             return 0;
+    signal(SIGTERM,SigTerm);
+    signal(SIGINT,SigTerm);
+ 
+    Server = 0;
+ 
+    int FailCounter = 0;
+    while (1)
+    {
+       // We have no commands, wait for some to arrive
+       if (Queue == 0)
+       {
+          if (WaitFd(STDIN_FILENO) == false)
+             return 0;
  ```
  
  Versions:
  ```
  Ubuntu 22.04.1 LTS (Jammy Jellyfish)
  apt 2.4.8
  ```
  
  Preliminiary conclusion/notes:
  
  - I don't know why it stalls, but it does. And this was not a single
  incident. All the stalling apt processes had stopped at a different
  time/day.
  
  - Apparently there is a situation possible where there are no commands
  queued from the parent, and the parent isn't planning on sending any
  either.
  
  - Our apt-get is invoked by a cron job which calls either `update-
  notifier-common` or `apt-get update`: https://github.com/ossobv/zabbix-
  agent-
  
osso/blob/c61aee6087c0d03e66d03013c61acd9f65d0eaab/scripts/dpkg.updates#L14-L31
  -- on the machines where apt-get was hanging, there was no `update-
  notifier-common`.
  
  - I don't know if this is 2.4.x related (on Focal we have apt 2.0.x),
  but it sure looks like it. I did check a few other Focal machines that
  lacked `update-notifier-common`, but they were not having a stalled apt-
  get (although _not_ seeing a hung process proves nothing)
  
- 
- Let me know if there's anything I can get you. Should this be filed 
elsewhere? (Debian Salsa?)
+ Let me know if there's anything I can get you. Should this be filed
+ elsewhere? (Debian Salsa?)
  
  Cheers,
  Walter Doekes
  OSSO B.V.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2003851

Title:
  occasional hanging 'apt-get update' from daily cronjob since Jammy
  22.04

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/apt/+bug/2003851/+subscriptions


-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

Reply via email to