On 06/02/2020 11:05, Sergey Dyasli wrote:
> On 06/02/2020 09:57, Jürgen Groß wrote:
>> On 05.02.20 17:03, Sergey Dyasli wrote:
>>> Hello,
>>>
>>> I'm currently investigating a Live-Patch application failure in
>>> core-scheduling mode and this is an example of what I usually get
>>> (it's easily reproducible):
>>>
>>>      (XEN) [  342.528305] livepatch: lp: CPU8 - IPIing the other 15 CPUs
>>>      (XEN) [  342.558340] livepatch: lp: Timed out on semaphore in CPU quiesce phase 13/15
>>>      (XEN) [  342.558343] bad cpus: 6 9
>>>
>>>      (XEN) [  342.559293] CPU:    6
>>>      (XEN) [  342.559562] Xen call trace:
>>>      (XEN) [  342.559565]    [<ffff82d08023f304>] R common/schedule.c#sched_wait_rendezvous_in+0xa4/0x270
>>>      (XEN) [  342.559568]    [<ffff82d08023f8aa>] F common/schedule.c#schedule+0x17a/0x260
>>>      (XEN) [  342.559571]    [<ffff82d080240d5a>] F common/softirq.c#__do_softirq+0x5a/0x90
>>>      (XEN) [  342.559574]    [<ffff82d080278ec5>] F arch/x86/domain.c#guest_idle_loop+0x35/0x60
>>>
>>>      (XEN) [  342.559761] CPU:    9
>>>      (XEN) [  342.560026] Xen call trace:
>>>      (XEN) [  342.560029]    [<ffff82d080241661>] R _spin_lock_irq+0x11/0x40
>>>      (XEN) [  342.560032]    [<ffff82d08023f323>] F common/schedule.c#sched_wait_rendezvous_in+0xc3/0x270
>>>      (XEN) [  342.560036]    [<ffff82d08023f8aa>] F common/schedule.c#schedule+0x17a/0x260
>>>      (XEN) [  342.560039]    [<ffff82d080240d5a>] F common/softirq.c#__do_softirq+0x5a/0x90
>>>      (XEN) [  342.560042]    [<ffff82d080279db5>] F arch/x86/domain.c#idle_loop+0x55/0xb0
>>>
>>> The first HT sibling is waiting for the second in the LP-application
>>> context while the second waits for the first in the scheduler context.
>>>
>>> Any suggestions on how to improve this situation are welcome.
>>
>> Can you test the attached patch, please? It is only tested to boot, so
>> I did no livepatch tests with it.
>
> Thank you for the patch! It seems to fix the issue in my manual testing.
> I'm going to submit automatic LP testing for both thread/core modes.

Andrew suggested testing late ucode loading as well, so I did.
It uses stop_machine() to rendezvous CPUs, and it failed with a similar
backtrace for a problematic CPU. In this case, however, the system
crashed because there is no timeout involved:

    (XEN) [  155.025168] Xen call trace:
    (XEN) [  155.040095]    [<ffff82d0802417f2>] R _spin_unlock_irq+0x22/0x30
    (XEN) [  155.069549]    [<ffff82d08023f3c2>] S common/schedule.c#sched_wait_rendezvous_in+0xa2/0x270
    (XEN) [  155.109696]    [<ffff82d08023f728>] F common/schedule.c#sched_slave+0x198/0x260
    (XEN) [  155.145521]    [<ffff82d080240e1a>] F common/softirq.c#__do_softirq+0x5a/0x90
    (XEN) [  155.180223]    [<ffff82d0803716f6>] F x86_64/entry.S#process_softirqs+0x6/0x20

It looks like your patch provides a workaround for the LP case, but
other cases such as stop_machine() remain broken, since the underlying
issue with the scheduler is still there.
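
To make the cycle easier to see, here is a rough toy model in Python.
It is not Xen code: stuck_cpus(), the rendezvous names and the sibling
pairing (6,7)/(8,9) are all assumptions made up for illustration. It
only encodes the idea that a rendezvous releases its CPUs once every
participant has arrived, and that a CPU spinning in one rendezvous
cannot join another.

```python
# Toy model of the rendezvous cycle; illustrative only, not Xen code.

def stuck_cpus(rendezvous, parked):
    """Return the CPUs that can never leave the rendezvous they spin in.

    rendezvous: name -> set of CPUs that must all arrive before release.
    parked:     cpu  -> name of the rendezvous that CPU is spinning in.
    A CPU absent from `parked` is free and assumed able to join any
    rendezvous that is waiting for it.
    """
    parked = dict(parked)  # work on a copy
    progress = True
    while progress:
        progress = False
        for name, members in rendezvous.items():
            waiting = [cpu for cpu in members if parked.get(cpu) == name]
            # Completes only if every member is already here or free to come.
            if waiting and all(parked.get(cpu) in (name, None) for cpu in members):
                for cpu in waiting:
                    del parked[cpu]
                progress = True
    return sorted(parked)


# Assumed sibling pairs (6,7) and (8,9); the rendezvous names are made up.
rendezvous = {
    "stop_machine": {6, 7, 8, 9},  # global: needs every CPU
    "sched-core-a": {6, 7},        # per-core scheduler rendezvous
    "sched-core-b": {8, 9},
}

# One sibling of each core reached the global rendezvous first,
# the other is still waiting for it in the scheduler.
crossed = {7: "stop_machine", 8: "stop_machine",
           6: "sched-core-a", 9: "sched-core-b"}

# Both siblings of core a meet in the scheduler rendezvous first.
ordered = {6: "sched-core-a", 7: "sched-core-a",
           8: "stop_machine", 9: "stop_machine"}

print(stuck_cpus(rendezvous, crossed))  # [6, 7, 8, 9]
print(stuck_cpus(rendezvous, ordered))  # []
```

In the "crossed" case nobody can make progress without a timeout, which
would be consistent with the stop_machine() hang above; in the "ordered"
case the scheduler rendezvous completes first and the global one follows.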

--
Thanks,
Sergey

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
