On Mon, Jun 14, 2021 at 06:01:17PM +0200, Jan Beulich wrote:
> On 14.06.2021 15:27, Roger Pau Monné wrote:
> > On Mon, Jun 14, 2021 at 01:53:09PM +0200, Jan Beulich wrote:
> >> On 14.06.2021 13:15, Igor Druzhinin wrote:
> >>> Hi, Boris, Stephen, Roger,
> >>>
> >>> We have stress-tested the recent changes on staging-4.13, which
> >>> include a backport of the patch named in the subject. Since the
> >>> backport is identical to what is on the master branch and all of the
> >>> prerequisites are in place, we have no reason to believe the issue
> >>> is any different on master.
> >>>
> >>> Here is what we got from heavy stress testing, including multiple
> >>> repeated VM lifecycle operations under storage and network load:
> >>>
> >>>
> >>> Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
> >>> ----[ Xen-4.13.3-10.7-d  x86_64  debug=y   Not tainted ]----
> >>> CPU:    17
> >>> RIP:    e008:[<ffff82d080246b65>] common/timer.c#active_timer+0xc/0x1b
> >>> RFLAGS: 0000000000010046   CONTEXT: hypervisor (d675v0)
> >>> rax: 0000000000000000   rbx: ffff83137a8ed300   rcx: 0000000000000000
> >>> rdx: ffff83303fff7fff   rsi: ffff83303fff2549   rdi: ffff83137a8ed300
> >>> rbp: ffff83303fff7cf8   rsp: ffff83303fff7cf8   r8:  0000000000000001
> >>> r9:  0000000000000000   r10: 0000000000000011   r11: 0000168b0cc08083
> >>> r12: 0000000000000000   r13: ffff82d0805cf300   r14: ffff82d0805cf300
> >>> r15: 0000000000000292   cr0: 0000000080050033   cr4: 00000000000426e0
> >>> cr3: 00000013c1a32000   cr2: 0000000000000000
> >>> fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
> >>> ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> >>> Xen code around <ffff82d080246b65> (common/timer.c#active_timer+0xc/0x1b):
> >>>   0f b6 47 2a 84 c0 75 02 <0f> 0b 3c 04 76 02 0f 0b 3c 02 0f 97 c0 5d c3 55
> >>> Xen stack trace from rsp=ffff83303fff7cf8:
> >>>     ffff83303fff7d48 ffff82d0802479f1 0000168b0192b846 ffff83137a8ed328
> >>>     000000001d0776eb ffff83137a8ed2c0 ffff83133ee47568 ffff83133ee47000
> >>>     ffff83133ee47560 ffff832b1a0cd000 ffff83303fff7d78 ffff82d08031e74e
> >>>     ffff83102d898000 ffff83133ee47000 ffff83102db8d000 0000000000000011
> >>>     ffff83303fff7dc8 ffff82d08027df19 0000000000000000 ffff83133ee47060
> >>>     ffff82d0805d0088 ffff83102d898000 ffff83133ee47000 0000000000000011
> >>>     0000000000000001 0000000000000011 ffff83303fff7e08 ffff82d0802414e0
> >>>     ffff83303fff7df8 0000168b0192b846 ffff83102d8a4660 0000168b0192b846
> >>>     ffff83102d8a4720 0000000000000011 ffff83303fff7ea8 ffff82d080241d6c
> >>>     ffff83133ee47000 ffff831244137a50 ffff83303fff7e48 ffff82d08031b5b8
> >>>     ffff83133ee47000 ffff832b1a0cd000 ffff83303fff7e68 ffff82d080312b65
> >>>     ffff83133ee47000 0000000000000000 ffff83303fff7ee8 ffff83102d8a4678
> >>>     ffff83303fff7ee8 ffff82d0805d6380 ffff82d0805d5b00 ffffffffffffffff
> >>>     ffff83303fff7fff 0000000000000000 ffff83303fff7ed8 ffff82d0802431f5
> >>>     ffff83133ee47000 0000000000000000 0000000000000000 0000000000000000
> >>>     ffff83303fff7ee8 ffff82d08024324a 00007ccfc00080e7 ffff82d08033930b
> >>>     ffffffffb0ebd5a0 000000000000000d 0000000000000062 0000000000000097
> >>>     000000000000001e 000000000000001f ffffffffb0ebd5ad 0000000000000000
> >>>     0000000000000005 000000000003d91d 0000000000000000 0000000000000000
> >>>     00000000000003d5 000000000000001e 0000000000000000 0000beef0000beef
> >>> Xen call trace:
> >>>     [<ffff82d080246b65>] R common/timer.c#active_timer+0xc/0x1b
> >>>     [<ffff82d0802479f1>] F stop_timer+0xf5/0x188
> >>>     [<ffff82d08031e74e>] F pt_save_timer+0x45/0x8a
> >>>     [<ffff82d08027df19>] F context_switch+0xf9/0xee0
> >>>     [<ffff82d0802414e0>] F common/schedule.c#sched_context_switch+0x146/0x151
> >>>     [<ffff82d080241d6c>] F common/schedule.c#schedule+0x28a/0x299
> >>>     [<ffff82d0802431f5>] F common/softirq.c#__do_softirq+0x85/0x90
> >>>     [<ffff82d08024324a>] F do_softirq+0x13/0x15
> >>>     [<ffff82d08033930b>] F vmx_asm_do_vmentry+0x2b/0x30
> >>>
> >>> ****************************************
> >>> Panic on CPU 17:
> >>> Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
> >>> ****************************************
> >>
> >> Since this suggests a timer was found on the list without ever having been
> >> initialized, I've spotted a case where this indeed could now happen. Could
> >> you give the patch below a try?
> >>
> >> Jan
> >>
> >> x86/vpt: fully init timers before putting onto list
> >>
> >> With pt_vcpu_lock() no longer acquiring the pt_migrate lock, parties
> >> iterating the list and acting on the timers of the list entries will no
> >> longer be kept from entering their loops by create_periodic_time()'s
> >> holding of that lock. Therefore at least init_timer() needs calling
> >> ahead of list insertion, but keep this and set_timer() together.
> >>
> >> Fixes: 8113b02f0bf8 ("x86/vpt: do not take pt_migrate rwlock in some cases")
> >> Reported-by: Igor Druzhinin <igor.druzhi...@citrix.com>
> >> Signed-off-by: Jan Beulich <jbeul...@suse.com>
> > 
> > Thanks for looking into this so quickly, and sorry for not realizing
> > it myself when relaxing the locking. Adding the timer to the list
> > before it is fully initialized was a latent issue even when the lock
> > initially still protected it.
> > 
> > Provided testing shows the issue is fixed:
> 
> I guess the change here is needed anyway, even if testing finds there's
> still something amiss?

Indeed. I just wondered whether there might be other instances of a
similar pattern, but I'm not able to spot any.
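
For reference, my understanding of the pattern is roughly the below (a
sketch from memory of xen/arch/x86/hvm/vpt.c, not a verbatim quote, so
field and function names may not match the tree exactly):

    /* Before the fix: the entry is published before being set up. */
    pt_vcpu_lock(v);
    list_add(&pt->list, &v->arch.hvm.tm_list); /* visible to iterators */
    pt_vcpu_unlock(v);
    init_timer(&pt->timer, pt_timer_fn, pt, v->processor);
    set_timer(&pt->timer, pt->scheduled);

    /* After the fix: fully initialize before list insertion. */
    init_timer(&pt->timer, pt_timer_fn, pt, v->processor);
    set_timer(&pt->timer, pt->scheduled);
    pt_vcpu_lock(v);
    list_add(&pt->list, &v->arch.hvm.tm_list);
    pt_vcpu_unlock(v);

With pt_vcpu_lock() no longer taking pt_migrate, nothing keeps another
CPU from walking the list in the window between the list_add() and the
init_timer() in the first variant.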

It might even be better to fix any other issues (if there are any) in a
separate commit.
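
For the archives, the way such a half-initialized entry trips the
assertion in the trace above is, as far as I can tell (again only a
sketch, not verbatim code):

    /* pt_save_timer() walks the per-vCPU list under pt_vcpu_lock()
     * only, stopping the timers of the entries it finds: */
    pt_vcpu_lock(v);
    list_for_each_entry ( pt, &v->arch.hvm.tm_list, list )
        stop_timer(&pt->timer);
    pt_vcpu_unlock(v);

    /* stop_timer() -> active_timer() in common/timer.c then checks: */
    ASSERT(timer->status >= TIMER_STATUS_inactive);

An entry that has been list_add()ed but has not yet passed through
init_timer() can still have status TIMER_STATUS_invalid (0), which
fails that check.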

Thanks.
