Bug#442189: (infinite(?) loop during update, possible race condition)

2009-01-05 Thread Eugene V. Lyubimkin
Vincent McIntyre wrote:
 I also would be glad if you can check latest apt (0.7.18) whether it has 
 this bug or not.
 
 I'll try to look at this in the coming weeks.
 Do you mean me to try a backport of this version or just try on a
 system running lenny?
Don't matters for me. I'd suggest to set up a lenny chroot for this (via 
debootstrap) to
don't touch your system.

-- 
Eugene V. Lyubimkin aka JackYF, JID: jackyf.devel(maildog)gmail.com
Ukrainian C++ developer, Debian Maintainer, APT contributor



signature.asc
Description: OpenPGP digital signature


Bug#442189: (infinite(?) loop during update, possible race condition)

2009-01-05 Thread Eugene V. Lyubimkin
Vincent McIntyre wrote:
[snip]
 Possibly the problem is that we aren't using pdiffs in our repo?
 On the test machine /etc/apt/apt.conf has them turned off -
Acquire {
  Pdiffsfalse;
};
 so I'm at a bit of a loss why apt would even ask for a Packages.decomp.
 Or maybe I'm just misunderstanding this.
This is normal. Packages.decomp is just unarchived version of 
Packages.{bz2,gz}, apt
should ask for new Packages if PDiffs are no there or this option is turned off 
in config.

Since some time in the past, I've looked into code and, possibly, found the race
condition. Bad news that it wouldn't be easy to fix it.

-- 
Eugene V. Lyubimkin aka JackYF, JID: jackyf.devel(maildog)gmail.com
Ukrainian C++ developer, Debian Maintainer, APT contributor



signature.asc
Description: OpenPGP digital signature


Bug#442189: (infinite(?) loop during update, possible race condition)

2009-01-04 Thread Vincent McIntyre

I forgot to list the other stuck processes:

# ps -fade|grep apt
root 18380 18327  0 Nov14 ?00:00:00 /usr/bin/apt-get update
root 18382 18380  0 Nov14 ?00:00:00 /usr/lib/apt/methods/http
root 18384 18380  0 Nov14 ?00:00:00 /usr/lib/apt/methods/gpgv
root 18389 18380  0 Nov14 ?00:00:00 /usr/lib/apt/methods/bzip2
root 18409 18380  0 Nov14 ?00:00:00 /usr/lib/apt/methods/gzip

and then I thought let's just check

# strace -s 2048 -f -p 18380
Process 18380 attached - interrupt to quit
select(10, [5 6 7 9], [], NULL, {0, 468000}) = 0 (Timeout)
stat64(/var/lib/apt/lists/partial/debian-archive:_atnf_dists_etch_main_binary-i386_Packages.decomp,
 
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0
select(10, [5 6 7 9], [], NULL, {0, 50}) = 0 (Timeout)
stat64(/var/lib/apt/lists/partial/debian-archive:_atnf_dists_etch_main_binary-i386_Packages.decomp,
 
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0
select(10, [5 6 7 9], [], NULL, {0, 50}) = 0 (Timeout)
stat64(/var/lib/apt/lists/partial/debian-archive:_atnf_dists_etch_main_binary-i386_Packages.decomp,
 
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0
select(10, [5 6 7 9], [], NULL, {0, 50} unfinished ...
Process 18380 detached

aha. It's stuck trying to get the proxied versions of our local repo's 
Packages file. Maybe Joey's on to something.


the other child processes are stuck too.

# strace -s 2048 -f -p 18382
Process 18382 attached - interrupt to quit
select(1, [0], NULL, NULL, NULL unfinished ...
Process 18382 detached

# strace -s 2048 -f -p 18389
Process 18389 attached - interrupt to quit
select(1, [0], NULL, NULL, NULL unfinished ...
Process 18389 detached

I have in fact having some problems with apt-proxy noticing that the
local package repo has new packages. I end up having to go and delete
the relevant Packages.gz in the apt-proxy cache directories.

Possibly the problem is that we aren't using pdiffs in our repo?
On the test machine /etc/apt/apt.conf has them turned off -
   Acquire {
 Pdiffsfalse;
   };
so I'm at a bit of a loss why apt would even ask for a Packages.decomp.
Or maybe I'm just misunderstanding this.




-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#442189: (infinite(?) loop during update, possible race condition)

2009-01-04 Thread Vincent McIntyre
(For the bug report - I sent this to Eugene on 23rd December)

Hi Eugene, thanks for your interest.

 can you also show strace output of
 gpgv and gzip processes when apt goes to infinite loop?

they both seem to be stuck doing nothing...

# ps -fade|grep gpgv
root 18384 18380  0 Nov14 ?00:00:00 /usr/lib/apt/methods/gpgv
# strace -f -s2048 -o /tmp/gpgv.strace -p 18384
Process 18384 attached - interrupt to quit
Process 18384 detached
# more /tmp/gpgv.strace
18384 select(1, [0], NULL, NULL, NULL unfinished ...

# ps -fade|grep gzip
root 18409 18380  0 Nov14 ?00:00:00 /usr/lib/apt/methods/gzip
# strace -f -s2048 -o /tmp/gzip.strace -p 18409
Process 18409 attached - interrupt to quit
Process 18409 detached
sachin:/u/mci156# cat /tmp/gzip.strace
18409 select(1, [0], NULL, NULL, NULL unfinished ...

note the date of the process...

# stat /var/lib/apt/lists/lock
   File: `/var/lib/apt/lists/lock'
   Size: 0   Blocks: 0  IO Block: 4096   regular empty 
file
Device: fe02h/65026dInode: 215126  Links: 1
Access: (0640/-rw-r-)  Uid: (0/root)   Gid: (0/root)
Access: 2007-10-22 08:26:27.0 +1000
Modify: 2008-12-23 10:36:26.0 +1100
Change: 2008-12-23 10:36:26.0 +1100

The modify time is recent because I noticed the machine was stuck
when I tried to do an update -
  # aptitude update
  E: Could not get lock /var/lib/apt/lists/lock - open (11 Resource
 temporarily unavailable)
  E: Couldn't lock list directory..are you root?
  # id -u
  0

Regarding JoeyK's comments -
I am using apt-proxy sources exclusively in /etc/apt/sources.list.

One of these sources I am proxying is a local package repo, which is 
managed with reprepro.

 I also would be glad if you can check latest apt (0.7.18) whether it has 
 this bug or not.

I'll try to look at this in the coming weeks.
Do you mean me to try a backport of this version or just try on a
system running lenny?




-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#442189: infinite(?) loop during update, possible race condition

2007-09-13 Thread Vincent McIntyre
Package: apt
Version: 0.6.46.4-0.1
Severity: important

*** Please type your report below this line ***
Hi,

I could not see an existing report of this problem, e.g. I don't think
this is #409336.

Background
--
I have a nightly cron job that installs security updates only, using a
short sources list file:
  deb http://debian-archive.atnf.csiro.au:/security/  etch/updates main 
contrib non-free
  deb-src http://debian-archive.atnf.csiro.au:/security/  etch/updates main 
contrib non-free
I'm using an apt-proxy to cache the packages  spare your mirrors.
This system has worked reliably since midway through the woody release.

The script is breaking on one of the machines I have that is running etch.
I have one other machine that exhibited the same underlying problem though
(see below). It is working fine on our sarge machines.


The sequence of events in the job is (leaving out various sanity checks):
 1.  /usr/bin/apt-get 
-o=Dir::Etc::SourceList=/etc/apt/sources.list.security-only clean
 2.  /usr/bin/apt-get 
-o=Dir::Etc::SourceList=/etc/apt/sources.list.security-only update || \
 (sleep 5; /usr/bin/apt-get 
-o=Dir::Etc::SourceList=/etc/apt/sources.list.security-only update)
 3.  [ if there are any packages to do, filter out certain classes of package
  such as kernels and database, then continue as follows ]
 4. /usr/bin/apt-get 
-o=Dir::Etc::SourceList=/etc/apt/sources.list.security-only upgrade
 5. /usr/bin/apt-get 
-o=Dir::Etc::SourceList=/etc/apt/sources.list.security-only clean
 6. /usr/bin/apt-get update


The first 'update' call (step 2) is failing with status of 100.
The error messages given are:
  E: Could not get lock /var/lib/apt/lists/lock - open (11 Resource temporarily 
unavailable)
  E: Unable to lock the list directory
  E: Could not get lock /var/lib/apt/lists/lock - open (11 Resource temporarily 
unavailable)
  E: Unable to lock the list directory


Analysis


The failure above seems to be because of another apt-get process that is
running, apparently indefinitely. I have seen one running for at least 8 days.
Attaching to one such process with strace (it had been running for 2 days),
the process appears to be in an infinite loop:
  # strace -p 9809
  Process 9809 attached - interrupt to quit
  select(10, [5 6 7 9], [], NULL, {0, 168000}) = 0 (Timeout)
  
stat64(/var/lib/apt/lists/partial/debian-archive:_atnf_dists_etch_non-free_binary-i386_Packages.decomp,
 0xbfce3fa0) = -1 ENOENT (No such file or directory)
  select(10, [5 6 7 9], [], NULL, {0, 50}) = 0 (Timeout)
  
stat64(/var/lib/apt/lists/partial/debian-archive:_atnf_dists_etch_non-free_binary-i386_Packages.decomp,
 0xbfce3fa0) = -1 ENOENT (No such file or directory)
  select(10, [5 6 7 9], [], NULL, {0, 50} unfinished ...
  Process 9809 detached
  #

I was able to get another apt-get process into this state, and saw the same
output, but for a different file:
  # strace -p 30990
  Process 30990 attached - interrupt to quit
  select(10, [5 6 7 9], [], NULL, {0, 12000}) = 0 (Timeout)
  
stat64(/var/lib/apt/lists/partial/debian-archive.atnf.csiro.au:_atnf_dists_etch_main_source_Sources.decomp,
 {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
  select(10, [5 6 7 9], [], NULL, {0, 50}) = 0 (Timeout)
  
stat64(/var/lib/apt/lists/partial/debian-archive.atnf.csiro.au:_atnf_dists_etch_main_source_Sources.decomp,
 {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
select(10, [5 6 7 9], [], NULL, {0, 50}) = 0 (Timeout)
...

There were some inconsistencies between the normal sources.list and the
truncated one shown above (which leads to differently named files in
/var/lib/apt/lists and possibly  a 'disappearing' file), but these naming
inconsistencies also occur on other hosts that don't show this problem, so
I don't think the inconsistency is the source of the problem.

Following this, I cleaned up and tried to reproduce the problem.
 * I made the two files consistent wrt the naming of the proxy host
   (so the files in /var/lib/apt/lists will be consistent too)
 * killed all the apt-get processes
   (just kill $pid, not kill -9)
 * removed all the files in /var/lib/apt/lists and below
   (but I left the partial/ dir itself)
 * apt-get clean
 * apt-get update
 * apt-get upgrade (nothing to do)
 * ran the cron job.
This time the script executed correctly.

I checked other machines running etch. I noticed that there was one
other with an apt-get process in the same state (infinite loop on a
.decomp file).

For a time I thought this might be related to libc6; the ones with the
problem all had the latest libc6, while the others (also running etch)
had an earlier version (2.3.6.ds1-13 vs 2.3.6.ds1-13etch2).
However I first noticed this problem with a process that was started on
20 Aug 2007 and possibly it was occurring as early as 12 Aug 2007.
On that host, libc6 was only updated to -13etch2 on 29 Aug 2007.

I also wondered if it was related to the recently corrected memory
corruption issue