This gives me some headaches. So I tried to figure out what would make
sense to pick from the newer spinlock code. The current code (our ec2
topic branch) has at least one potentially dangerous place, in
xen_spin_kick: it only checks whether another CPU is spinning on the
same lock at the top level. If I understand the newer code right, it
keeps a chain, so it can handle a CPU that spins on one lock with
interrupts enabled and then, in an interrupt section on the same CPU
(which runs with interrupts disabled), gets to spin on another one.
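To make the difference concrete, here is a minimal user-space sketch of a
top-level-only check versus a check that walks a per-CPU chain. This is
not the Xen code: struct spinning, cpu_spinning, spin_enter, spin_exit,
waits_top_only and waits_chain are all names I made up for illustration,
and there is no real locking or interrupt handling in it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4

/* Hypothetical record: one per lock a CPU is currently spinning on. */
struct spinning {
    const void *lock;       /* lock this record is waiting for */
    struct spinning *prev;  /* interrupted outer spin on the same CPU, or NULL */
};

/* Innermost spin record per CPU; NULL while the CPU is not spinning. */
static struct spinning *cpu_spinning[NR_CPUS];

/* Push a record when a CPU starts spinning (possibly nested in an IRQ). */
static void spin_enter(int cpu, struct spinning *rec, const void *lock)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    rec->lock = lock;
    rec->prev = cpu_spinning[cpu];
    cpu_spinning[cpu] = rec;
}

/* Pop the innermost record when the CPU stops spinning on that lock. */
static void spin_exit(int cpu, struct spinning *rec)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    assert(cpu_spinning[cpu] == rec);
    cpu_spinning[cpu] = rec->prev;
}

/* Top-level-only check: sees just the innermost lock of that CPU. */
static bool waits_top_only(int cpu, const void *lock)
{
    const struct spinning *s = cpu_spinning[cpu];
    return s != NULL && s->lock == lock;
}

/* Chain check: also finds an outer spin that an interrupt pre-empted. */
static bool waits_chain(int cpu, const void *lock)
{
    const struct spinning *s;

    for (s = cpu_spinning[cpu]; s != NULL; s = s->prev)
        if (s->lock == lock)
            return true;
    return false;
}
```

With CPU 1 spinning on lock A and then, nested, on lock B, the
top-level-only check no longer reports CPU 1 as a waiter for A, so an
unlock of A would have nobody to kick; the chain walk still finds it.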

However while trying to understand that whole thing, I realized that the
new code also defines a different raw_spinlock_t and in there is the
following comment:

/*
 * Xen versions prior to 3.2.x have a race condition with HYPERVISOR_poll().
 */

There is some #define magic that checks for a XEN_COMPAT greater than or
equal to 3.2 and otherwise basically turns off *all* the ticket spinlock
code, to stay compatible with earlier hypervisors. And we compile with
3.0.2 compatibility. So if we used that new code, spinlocks would be done
as real spinlocks again (meaning no tickets and no hypervisor / unlock
interrupt optimization).
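Roughly, I read that #define magic as having the following shape. This is
only a sketch to show the effect of the compat level: DEMO_XEN_COMPAT,
DEMO_TICKET_LOCKS, demo_spinlock_t and demo_uses_ticket_locks are made-up
names, and encoding 3.0.2 as 0x030002 and 3.2 as 0x030200 is my
assumption, not something taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical compat level; assumed encoding: 0x030002 means 3.0.2. */
#ifndef DEMO_XEN_COMPAT
#define DEMO_XEN_COMPAT 0x030002
#endif

#if DEMO_XEN_COMPAT >= 0x030200
/* Hypervisor is new enough: ticket lock layout gets compiled in. */
#define DEMO_TICKET_LOCKS 1
typedef struct {
    unsigned short head;  /* ticket currently being served */
    unsigned short tail;  /* next ticket to hand out */
} demo_spinlock_t;
#else
/* Older hypervisor allowed: fall back to a plain lock word. */
#define DEMO_TICKET_LOCKS 0
typedef struct {
    volatile int locked;  /* 0 = free, 1 = held */
} demo_spinlock_t;
#endif

/* Reports which variant this build ended up with. */
static bool demo_uses_ticket_locks(void)
{
    return DEMO_TICKET_LOCKS != 0;
}
```

Built with the default level (3.0.2 in this sketch), the ticket variant is
compiled out entirely; only a build with -DDEMO_XEN_COMPAT=0x030200 or
higher would get it.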

Now, if that is true, then the observed hangs should all have happened
on hosts running Xen older than 3.2. If not, well, the amount of changes
in there still leaves plenty of room for the bug to be in there. And
then this also opens up several paths.

a) Try to figure out the minimal change, which will likely require a few
   iterations to get right.
b) Take the complete new code related to spinlocks. However, that will
   result in dropping ticket locks for as long as we need to be
   compatible with Xen <3.2, and I have at least seen instances running
   on such hosts in the past. So we could as well
c) just pick the non-ticket implementation. Of course, that could cause
   some performance regressions.
d) Make sure no AWS host is running Xen <3.2 anymore and pick a compat
   level of 3.2 (or at least pick the spinlock code in a way that uses
   ticket locking, because I am not really confident that changing the
   compat level overall would not have side effects).

But anyway I'd be quite interested in finding out whether the hangs are
on Xen before 3.2 or not.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/929941

Title:
  Kernel deadlock in scheduler on m2.{2,4}xlarge EC2 instance

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/929941/+subscriptions
