Bug#442189: (infinite(?) loop during update, possible race condition)
Vincent McIntyre wrote: I also would be glad if you can check latest apt (0.7.18) whether it has this bug or not. I'll try to look at this in the coming weeks. Do you mean me to try a backport of this version or just try on a system running lenny? Don't matters for me. I'd suggest to set up a lenny chroot for this (via debootstrap) to don't touch your system. -- Eugene V. Lyubimkin aka JackYF, JID: jackyf.devel(maildog)gmail.com Ukrainian C++ developer, Debian Maintainer, APT contributor signature.asc Description: OpenPGP digital signature
Bug#442189: (infinite(?) loop during update, possible race condition)
Vincent McIntyre wrote: [snip] Possibly the problem is that we aren't using pdiffs in our repo? On the test machine /etc/apt/apt.conf has them turned off - Acquire { Pdiffsfalse; }; so I'm at a bit of a loss why apt would even ask for a Packages.decomp. Or maybe I'm just misunderstanding this. This is normal. Packages.decomp is just unarchived version of Packages.{bz2,gz}, apt should ask for new Packages if PDiffs are no there or this option is turned off in config. Since some time in the past, I've looked into code and, possibly, found the race condition. Bad news that it wouldn't be easy to fix it. -- Eugene V. Lyubimkin aka JackYF, JID: jackyf.devel(maildog)gmail.com Ukrainian C++ developer, Debian Maintainer, APT contributor signature.asc Description: OpenPGP digital signature
Bug#442189: (infinite(?) loop during update, possible race condition)
I forgot to list the other stuck processes: # ps -fade|grep apt root 18380 18327 0 Nov14 ?00:00:00 /usr/bin/apt-get update root 18382 18380 0 Nov14 ?00:00:00 /usr/lib/apt/methods/http root 18384 18380 0 Nov14 ?00:00:00 /usr/lib/apt/methods/gpgv root 18389 18380 0 Nov14 ?00:00:00 /usr/lib/apt/methods/bzip2 root 18409 18380 0 Nov14 ?00:00:00 /usr/lib/apt/methods/gzip and then I thought let's just check # strace -s 2048 -f -p 18380 Process 18380 attached - interrupt to quit select(10, [5 6 7 9], [], NULL, {0, 468000}) = 0 (Timeout) stat64(/var/lib/apt/lists/partial/debian-archive:_atnf_dists_etch_main_binary-i386_Packages.decomp, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 select(10, [5 6 7 9], [], NULL, {0, 50}) = 0 (Timeout) stat64(/var/lib/apt/lists/partial/debian-archive:_atnf_dists_etch_main_binary-i386_Packages.decomp, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 select(10, [5 6 7 9], [], NULL, {0, 50}) = 0 (Timeout) stat64(/var/lib/apt/lists/partial/debian-archive:_atnf_dists_etch_main_binary-i386_Packages.decomp, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 select(10, [5 6 7 9], [], NULL, {0, 50} unfinished ... Process 18380 detached aha. It's stuck trying to get the proxied versions of our local repo's Packages file. Maybe Joey's on to something. the other child processes are stuck too. # strace -s 2048 -f -p 18382 Process 18382 attached - interrupt to quit select(1, [0], NULL, NULL, NULL unfinished ... Process 18382 detached # strace -s 2048 -f -p 18389 Process 18389 attached - interrupt to quit select(1, [0], NULL, NULL, NULL unfinished ... Process 18389 detached I have in fact having some problems with apt-proxy noticing that the local package repo has new packages. I end up having to go and delete the relevant Packages.gz in the apt-proxy cache directories. Possibly the problem is that we aren't using pdiffs in our repo? On the test machine /etc/apt/apt.conf has them turned off - Acquire { Pdiffsfalse; }; so I'm at a bit of a loss why apt would even ask for a Packages.decomp. Or maybe I'm just misunderstanding this. -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#442189: (infinite(?) loop during update, possible race condition)
(For the bug report - I sent this to Eugene on 23rd December) Hi Eugene, thanks for your interest. can you also show strace output of gpgv and gzip processes when apt goes to infinite loop? they both seem to be stuck doing nothing... # ps -fade|grep gpgv root 18384 18380 0 Nov14 ?00:00:00 /usr/lib/apt/methods/gpgv # strace -f -s2048 -o /tmp/gpgv.strace -p 18384 Process 18384 attached - interrupt to quit Process 18384 detached # more /tmp/gpgv.strace 18384 select(1, [0], NULL, NULL, NULL unfinished ... # ps -fade|grep gzip root 18409 18380 0 Nov14 ?00:00:00 /usr/lib/apt/methods/gzip # strace -f -s2048 -o /tmp/gzip.strace -p 18409 Process 18409 attached - interrupt to quit Process 18409 detached sachin:/u/mci156# cat /tmp/gzip.strace 18409 select(1, [0], NULL, NULL, NULL unfinished ... note the date of the process... # stat /var/lib/apt/lists/lock File: `/var/lib/apt/lists/lock' Size: 0 Blocks: 0 IO Block: 4096 regular empty file Device: fe02h/65026dInode: 215126 Links: 1 Access: (0640/-rw-r-) Uid: (0/root) Gid: (0/root) Access: 2007-10-22 08:26:27.0 +1000 Modify: 2008-12-23 10:36:26.0 +1100 Change: 2008-12-23 10:36:26.0 +1100 The modify time is recent because I noticed the machine was stuck when I tried to do an update - # aptitude update E: Could not get lock /var/lib/apt/lists/lock - open (11 Resource temporarily unavailable) E: Couldn't lock list directory..are you root? # id -u 0 Regarding JoeyK's comments - I am using apt-proxy sources exclusively in /etc/apt/sources.list. One of these sources I am proxying is a local package repo, which is managed with reprepro. I also would be glad if you can check latest apt (0.7.18) whether it has this bug or not. I'll try to look at this in the coming weeks. Do you mean me to try a backport of this version or just try on a system running lenny? -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#442189: infinite(?) loop during update, possible race condition
Package: apt Version: 0.6.46.4-0.1 Severity: important *** Please type your report below this line *** Hi, I could not see an existing report of this problem, e.g. I don't think this is #409336. Background -- I have a nightly cron job that installs security updates only, using a short sources list file: deb http://debian-archive.atnf.csiro.au:/security/ etch/updates main contrib non-free deb-src http://debian-archive.atnf.csiro.au:/security/ etch/updates main contrib non-free I'm using an apt-proxy to cache the packages spare your mirrors. This system has worked reliably since midway through the woody release. The script is breaking on one of the machines I have that is running etch. I have one other machine that exhibited the same underlying problem though (see below). It is working fine on our sarge machines. The sequence of events in the job is (leaving out various sanity checks): 1. /usr/bin/apt-get -o=Dir::Etc::SourceList=/etc/apt/sources.list.security-only clean 2. /usr/bin/apt-get -o=Dir::Etc::SourceList=/etc/apt/sources.list.security-only update || \ (sleep 5; /usr/bin/apt-get -o=Dir::Etc::SourceList=/etc/apt/sources.list.security-only update) 3. [ if there are any packages to do, filter out certain classes of package such as kernels and database, then continue as follows ] 4. /usr/bin/apt-get -o=Dir::Etc::SourceList=/etc/apt/sources.list.security-only upgrade 5. /usr/bin/apt-get -o=Dir::Etc::SourceList=/etc/apt/sources.list.security-only clean 6. /usr/bin/apt-get update The first 'update' call (step 2) is failing with status of 100. The error messages given are: E: Could not get lock /var/lib/apt/lists/lock - open (11 Resource temporarily unavailable) E: Unable to lock the list directory E: Could not get lock /var/lib/apt/lists/lock - open (11 Resource temporarily unavailable) E: Unable to lock the list directory Analysis The failure above seems to be because of another apt-get process that is running, apparently indefinitely. I have seen one running for at least 8 days. Attaching to one such process with strace (it had been running for 2 days), the process appears to be in an infinite loop: # strace -p 9809 Process 9809 attached - interrupt to quit select(10, [5 6 7 9], [], NULL, {0, 168000}) = 0 (Timeout) stat64(/var/lib/apt/lists/partial/debian-archive:_atnf_dists_etch_non-free_binary-i386_Packages.decomp, 0xbfce3fa0) = -1 ENOENT (No such file or directory) select(10, [5 6 7 9], [], NULL, {0, 50}) = 0 (Timeout) stat64(/var/lib/apt/lists/partial/debian-archive:_atnf_dists_etch_non-free_binary-i386_Packages.decomp, 0xbfce3fa0) = -1 ENOENT (No such file or directory) select(10, [5 6 7 9], [], NULL, {0, 50} unfinished ... Process 9809 detached # I was able to get another apt-get process into this state, and saw the same output, but for a different file: # strace -p 30990 Process 30990 attached - interrupt to quit select(10, [5 6 7 9], [], NULL, {0, 12000}) = 0 (Timeout) stat64(/var/lib/apt/lists/partial/debian-archive.atnf.csiro.au:_atnf_dists_etch_main_source_Sources.decomp, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 select(10, [5 6 7 9], [], NULL, {0, 50}) = 0 (Timeout) stat64(/var/lib/apt/lists/partial/debian-archive.atnf.csiro.au:_atnf_dists_etch_main_source_Sources.decomp, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 select(10, [5 6 7 9], [], NULL, {0, 50}) = 0 (Timeout) ... There were some inconsistencies between the normal sources.list and the truncated one shown above (which leads to differently named files in /var/lib/apt/lists and possibly a 'disappearing' file), but these naming inconsistencies also occur on other hosts that don't show this problem, so I don't think the inconsistency is the source of the problem. Following this, I cleaned up and tried to reproduce the problem. * I made the two files consistent wrt the naming of the proxy host (so the files in /var/lib/apt/lists will be consistent too) * killed all the apt-get processes (just kill $pid, not kill -9) * removed all the files in /var/lib/apt/lists and below (but I left the partial/ dir itself) * apt-get clean * apt-get update * apt-get upgrade (nothing to do) * ran the cron job. This time the script executed correctly. I checked other machines running etch. I noticed that there was one other with an apt-get process in the same state (infinite loop on a .decomp file). For a time I thought this might be related to libc6; the ones with the problem all had the latest libc6, while the others (also running etch) had an earlier version (2.3.6.ds1-13 vs 2.3.6.ds1-13etch2). However I first noticed this problem with a process that was started on 20 Aug 2007 and possibly it was occurring as early as 12 Aug 2007. On that host, libc6 was only updated to -13etch2 on 29 Aug 2007. I also wondered if it was related to the recently corrected memory corruption issue