Re: [Bug 126204] Re: Batch jobs intermittently fail to leave "="queue when complete

N7DR Wed, 18 Jul 2007 08:35:12 -0700

OK, I can at least show an example of this bug happening now.

I instrumented the machine to print the following information once per
minute:


the time
the output of "atq"
the start of the output of "top"

So here is an example of three minutes of normal operation (i.e., when
everything is working OK):

----

Wed Jul 18 07:33:01 MDT 2007

4232    Wed Jul 18 07:29:00 2007 = n7dr
4233    Wed Jul 18 07:30:00 2007 = n7dr

top - 07:33:01 up 14 days,  1:12,  2 users,  load average: 2.37, 2.11, 2.04
Tasks: 243 total,   3 running, 238 sleeping,   0 stopped,   2 zombie
Cpu(s):  3.8% us,  1.2% sy, 80.0% ni, 14.1% id,  0.4% wa,  0.2% hi,  0.3% si
Mem:   4046572k total,  4014428k used,    32144k free,   564304k buffers
Swap:  1020116k total,  1005376k used,    14740k free,   548508k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27683 n7dr      39  14 28428  15m 1740 R   50  0.4   2:54.83 model2
27594 n7dr      39  14 32152  19m 1740 R   49  0.5   3:53.68 model2
27906 n7dr      15   0 10696 1316  880 R    1  0.0   0:00.01 top
    1 root      18   0  2640  504  468 S    0  0.0   0:04.61 init
    2 root      RT   0     0    0    0 S    0  0.0   0:03.38 migration/0
    3 root      34  19     0    0    0 S    0  0.0   0:00.21 ksoftirqd/0
    4 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
    5 root      RT   0     0    0    0 S    0  0.0   0:03.18 migration/1


Wed Jul 18 07:34:01 MDT 2007

4232    Wed Jul 18 07:29:00 2007 = n7dr
4234    Wed Jul 18 07:34:00 2007 = n7dr

top - 07:34:02 up 14 days,  1:14,  2 users,  load average: 1.91, 2.04, 2.02
Tasks: 246 total,   3 running, 241 sleeping,   0 stopped,   2 zombie
Cpu(s):  3.8% us,  1.2% sy, 80.0% ni, 14.1% id,  0.4% wa,  0.2% hi,  0.3% si
Mem:   4046572k total,  4006700k used,    39872k free,   564616k buffers
Swap:  1020116k total,  1005372k used,    14744k free,   548916k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27594 n7dr      39  14 36128  22m 1740 R   48  0.6   4:53.11 model2
28014 n7dr      37  14 16476 3864 1740 R   45  0.1   0:00.47 model2
    1 root      16   0  2640  504  468 S    1  0.0   0:04.62 init
28018 n7dr      15   0 10692 1312  880 R    1  0.0   0:00.01 top
    2 root      RT   0     0    0    0 S    0  0.0   0:03.38 migration/0
    3 root      34  19     0    0    0 S    0  0.0   0:00.21 ksoftirqd/0
    4 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
    5 root      RT   0     0    0    0 S    0  0.0   0:03.18 migration/1


Wed Jul 18 07:35:01 MDT 2007

4235    Wed Jul 18 07:35:00 2007 = n7dr
4234    Wed Jul 18 07:34:00 2007 = n7dr

top - 07:35:02 up 14 days,  1:15,  2 users,  load average: 1.91, 2.02, 2.01
Tasks: 246 total,   5 running, 239 sleeping,   0 stopped,   2 zombie
Cpu(s):  3.8% us,  1.2% sy, 80.0% ni, 14.1% id,  0.4% wa,  0.2% hi,  0.3% si
Mem:   4046572k total,  4002512k used,    44060k free,   564988k buffers
Swap:  1020116k total,  1005372k used,    14744k free,   549564k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28014 n7dr      39  14 28824  15m 1740 R   45  0.4   0:58.55 model2
28149 n7dr      39  14 16204 3604 1740 R   42  0.1   0:00.44 model2
28144 n7dr      15   0 10692 1312  880 R    1  0.0   0:00.02 top
28150 n7dr      21   4 44976 2992 2052 S    1  0.1   0:00.01 mail
    1 root      19   0  2640  504  468 S    0  0.0   0:04.62 init
    2 root      RT   0     0    0    0 S    0  0.0   0:03.38 migration/0
    3 root      34  19     0    0    0 S    0  0.0   0:00.21 ksoftirqd/0
    4 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0

----

You can see the CPU is being consumed by the "model2" jobs, which are
the jobs on the running batch queue.

Now look what happens a few minutes later:

----

Wed Jul 18 07:40:02 MDT 2007

4235    Wed Jul 18 07:35:00 2007 = n7dr
4234    Wed Jul 18 07:34:00 2007 = n7dr

top - 07:40:02 up 14 days,  1:20,  2 users,  load average: 2.69, 2.73, 2.34
Tasks: 247 total,   4 running, 240 sleeping,   0 stopped,   3 zombie
Cpu(s):  3.8% us,  1.2% sy, 80.0% ni, 14.1% id,  0.4% wa,  0.2% hi,  0.3% si
Mem:   4046572k total,  4023040k used,    23532k free,   704060k buffers
Swap:  1020116k total,  1001744k used,    18372k free,   195968k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28014 n7dr      39  14 36444  22m 1740 R   50  0.6   5:40.84 model2
26130 n7dr      15   0 10692 1312  880 R    1  0.0   0:00.02 top
    1 root      15   0  2640  504  468 S    0  0.0   0:04.62 init
    2 root      RT   0     0    0    0 S    0  0.0   0:03.38 migration/0
    3 root      34  19     0    0    0 S    0  0.0   0:00.21 ksoftirqd/0
    4 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
    5 root      RT   0     0    0    0 S    0  0.0   0:03.18 migration/1
    6 root      34  19     0    0    0 S    0  0.0   0:00.01 ksoftirqd/1

----

There is now only one model2 job consuming CPU resources, but "atq"
still lists two jobs as on the running queue.

And then, a minute later...

----

Wed Jul 18 07:41:02 MDT 2007

4235    Wed Jul 18 07:35:00 2007 = n7dr
4234    Wed Jul 18 07:34:00 2007 = n7dr

top - 07:41:02 up 14 days,  1:21,  2 users,  load average: 1.95, 2.53, 2.29
Tasks: 243 total,   2 running, 238 sleeping,   0 stopped,   3 zombie
Cpu(s):  3.8% us,  1.2% sy, 80.0% ni, 14.1% id,  0.4% wa,  0.2% hi,  0.3% si
Mem:   4046572k total,  4024852k used,    21720k free,   612560k buffers
Swap:  1020116k total,  1001744k used,    18372k free,   159620k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27744 n7dr      15   0 10696 1316  880 R    1  0.0   0:00.02 top
    1 root      16   0  2640  504  468 S    0  0.0   0:04.62 init
    2 root      RT   0     0    0    0 S    0  0.0   0:03.38 migration/0
    3 root      34  19     0    0    0 S    0  0.0   0:00.21 ksoftirqd/0
    4 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
    5 root      RT   0     0    0    0 S    0  0.0   0:03.18 migration/1
    6 root      34  19     0    0    0 S    0  0.0   0:00.01 ksoftirqd/1
    7 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/1

----

There are no model2 jobs, but "atq" thinks that they are still running.

an out-of-band check confirmed at this point that there were no
"model2" processes on the machine, and that the expected output from
them had been created.

So (when I noticed that we had triggered the bug, a few minutes later)
I had to manually remove jobs 4234 and 4235 from the queue (using
"atrm"), and then everything went back to normal operation again:

----

Wed Jul 18 08:27:01 MDT 2007

4236    Wed Jul 18 08:25:00 2007 = n7dr
4237    Wed Jul 18 08:26:00 2007 = n7dr

top - 08:27:01 up 14 days,  2:07,  2 users,  load average: 2.38, 1.48, 1.16
Tasks: 218 total,   4 running, 213 sleeping,   0 stopped,   1 zombie
Cpu(s):  3.8% us,  1.2% sy, 79.9% ni, 14.3% id,  0.4% wa,  0.2% hi,  0.3% si
Mem:   4046572k total,  2660228k used,  1386344k free,   203016k buffers
Swap:  1020116k total,    99044k used,   921072k free,   815020k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
29473 n7dr      39  14 25460  11m 1740 R   50  0.3   0:56.02 model2
29389 n7dr      39  14 34168  19m 1740 R   45  0.5   1:49.27 model2
28779 n7dr      15   0  323m  59m  33m S    3  1.5   0:17.45 amarokapp
 4814 root      15   0     0    0    0 S    2  0.0   1:54.15 nfsd
29553 n7dr      15   0 10692 1284  880 R    1  0.0   0:00.01 top
    1 root      18   0  2640  504  468 S    0  0.0   0:04.63 init
    2 root      RT   0     0    0    0 S    0  0.0   0:03.40 migration/0
    3 root      34  19     0    0    0 S    0  0.0   0:00.21 ksoftirqd/0

----

-- 
Batch jobs intermittently fail to leave "="queue when complete
https://bugs.launchpad.net/bugs/126204
You received this bug notification because you are a member of Ubuntu
Bugs, which is the bug contact for Ubuntu.

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

Re: [Bug 126204] Re: Batch jobs intermittently fail to leave "="queue when complete

Reply via email to