It is hard to say for sure. We do see the same jump in time here. However, the
problem in the lkml report looked more like one task going into schedule() and
never getting scheduled again, while this looks like a real locking issue. All
CPUs (except CPU1) appear to be waiting on some spinlock: several on runqueue
locks (which are taken in a fixed order to avoid potential A-B/B-A deadlocks;
see the sketch below), and CPU0 on some block request queue lock. From the
traces alone it is impossible to say, but it feels as if one of the runqueue
locks was never released on some rarely used special path.
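
For reference, the scheduler avoids the A-B/B-A case by always taking a pair
of runqueue locks in address order, as double_rq_lock() does. A minimal
userspace sketch of that pattern (pthread spinlocks standing in for the
kernel's raw spinlocks, struct rq reduced to just the lock):

    #include <pthread.h>

    struct rq {
            pthread_spinlock_t lock;
            /* ... per-CPU runqueue state ... */
    };

    /*
     * Take both runqueue locks without risking an A-B/B-A deadlock:
     * whichever lock has the lower address is always acquired first,
     * so two CPUs locking the same pair can never end up waiting on
     * each other.
     */
    static void double_rq_lock(struct rq *rq1, struct rq *rq2)
    {
            if (rq1 == rq2) {
                    pthread_spin_lock(&rq1->lock);
            } else if (rq1 < rq2) {
                    pthread_spin_lock(&rq1->lock);
                    pthread_spin_lock(&rq2->lock);
            } else {
                    pthread_spin_lock(&rq2->lock);
                    pthread_spin_lock(&rq1->lock);
            }
    }

    static void double_rq_unlock(struct rq *rq1, struct rq *rq2)
    {
            pthread_spin_unlock(&rq1->lock);
            if (rq1 != rq2)
                    pthread_spin_unlock(&rq2->lock);
    }

If some path unlocks only one of the pair, or returns early with a lock still
held, every CPU that later tries to take that runqueue lock spins forever,
which would match the traces here.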
I hope the kernels with lock debugging enabled will show something here. At
the very least, new stack traces taken with those kernels would be known not
to contain the task_group race, and they will help if we need to come up with
a special way to catch or print information about the problem. In any further
debug kernels I will try to include the change mentioned on lkml, to see
whether it also avoids the time jumps and lockups. Another suspect, though,
could be a higher rate of reading task info from /proc (which is what one of
the CPUs is doing); that might be something a database does more often than
any other workload. A sketch of that kind of polling follows below.
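
To illustrate what I mean: a hypothetical poller (pid and interval made up
for illustration) that rereads a task's /proc stat file in a loop. Every
iteration makes the kernel read that task's state under the appropriate
locks, far more often than a typical workload would:

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
            const pid_t pid = 1234;   /* made-up pid, for illustration */
            char path[64], buf[512];

            snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);

            /* Poll the task's stat line the way a database's monitoring
             * thread might. */
            for (;;) {
                    FILE *f = fopen(path, "r");
                    if (f) {
                            if (fgets(buf, sizeof(buf), f)) {
                                    /* fields would be parsed here */
                            }
                            fclose(f);
                    }
                    usleep(10 * 1000);   /* assumed 10ms interval */
            }
    }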

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1011792

Title:
  Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1011792/+subscriptions
