Re: [patch V9 21/39] x86/irq: Convey vector as argument and not in ptregs
On Tue, Aug 25 2020 at 16:41, Andy Lutomirski wrote:
> How about this:
>
>   void irq_complete_move(struct irq_cfg *cfg)
>   {
>           __irq_complete_move(cfg, ~get_irq_regs()->orig_ax);
>   }
>
> in arch/x86/kernel/apic/vector.c.

Duh
Re: [patch V9 21/39] x86/irq: Convey vector as argument and not in ptregs
Hi Andy,

On 26.08.20 01:41, Andy Lutomirski wrote:
> On Tue, Aug 25, 2020 at 4:18 PM Alexander Graf wrote:
>> Hi Thomas,
>>
>> On 25.08.20 12:28, Thomas Gleixner wrote:
>>> Alex,
>>>
>>> On Mon, Aug 24 2020 at 19:29, Alexander Graf wrote:
>>>> I'm currently trying to understand a performance regression with
>>>> ScyllaDB on i3en.3xlarge (KVM based VM on Skylake) which we reliably
>>>> bisected down to this commit:
>>>>
>>>>    https://github.com/scylladb/scylla/issues/7036
>>>>
>>>> What we're seeing is that syscalls such as membarrier() take forever
>>>> (0-10 µs would be normal):
>>> ...
>>>> That again seems to stem from a vastly slowed down
>>>> smp_call_function_many_cond():
>>>>
>>>>   Samples: 218K of event 'cpu-clock', 4000 Hz
>>>>   Overhead  Shared Object  Symbol
>>>>     94.51%  [kernel]       [k] smp_call_function_many_cond
>>>>      0.76%  [kernel]       [k] __do_softirq
>>>>      0.32%  [kernel]       [k] native_queued_spin_lock_slowpath
>>>>   [...]
>>>>
>>>> which is stuck in
>>>>
>>>>          │      csd_lock_wait():
>>>>          │        smp_cond_load_acquire(&csd->flags, !(VAL &
>>>>     0.00 │        mov    0x8(%rcx),%edx
>>>>     0.00 │        and    $0x1,%edx
>>>>          │      ↓ je     2b9
>>>>          │      rep_nop():
>>>>     0.70 │2af:    pause
>>>>          │      csd_lock_wait():
>>>>    92.82 │        mov    0x8(%rcx),%edx
>>>>     6.48 │        and    $0x1,%edx
>>>>     0.00 │      ↑ jne    2af
>>>>     0.00 │2b9:  ↑ jmp    282
>>>>
>>>> Given the patch at hand I was expecting lost IPIs, but I can't quite
>>>> see anything getting lost.
>>>
>>> I have no idea how that patch should be related to IPIs and smp
>>> function calls. It changes the way regular device interrupts and
>>> their spurious counterparts are handled, not the way IPIs are
>>> handled. IPIs are handled by direct vectors and do not go through
>>> do_IRQ() at all.
>>>
>>> Aside of that, the commit just changes the way the interrupt vector
>>> of a regular device interrupt is stored and conveyed. The extra read
>>> and write on the cache-hot stack is hardly related to anything IPI.
>>
>> I am as puzzled as you are, but the bisect is very clear: 79b9c183021e
>> works fast and 633260fa1 (as well as mainline) shows the weird
>> behavior above.
>>
>> It gets even better. This small (demonstrative only, mangled) patch
>> on top of 633260fa1 also resolves the performance issue:
>>
>>   diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
>>   index c766936..7e91e9a 100644
>>   --- a/arch/x86/kernel/irq.c
>>   +++ b/arch/x86/kernel/irq.c
>>   @@ -239,6 +239,7 @@ __visible void __irq_entry do_IRQ(struct pt_regs *regs, unsigned long vector)
>>             * lower 8 bits.
>>             */
>>            vector &= 0xFF;
>>   +        regs->orig_ax = ~vector;
>>
>>            /* entering_irq() tells RCU that we're not quiescent. Check it. */
>>            RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
>>
>> To me that sounds like some irq exit code somewhere must still be
>> looking at orig_ax to decide on something - and that something is
>> wrong now that we removed the negation of the vector. It also seems
>> to have an impact on remote function calls.
>>
>> I'll have a deeper look tomorrow to see if I can find any such place,
>> but I wouldn't mind if anyone could point me in the right direction
>> earlier :).
>
> How about this:
>
>   void irq_complete_move(struct irq_cfg *cfg)
>   {
>           __irq_complete_move(cfg, ~get_irq_regs()->orig_ax);
>   }
>
> in arch/x86/kernel/apic/vector.c.

Thanks a lot, I stumbled over the same thing just after I sent the email
and had been trying to see if I could quickly patch it up before I fall
asleep :).

The code path above is used by the APIC vector move (irqbalance) logic,
which explains why not everyone was seeing issues.

So with 633260fa1 applied, we never get out of moving state for our IRQ
because orig_ax is always -1. That means we send an IPI to the cleanup
vector on every single device interrupt, completely occupying the poor
CPU that we moved the IRQ from.

I've confirmed that the patch below fixes the issue and will send a
proper, more complete patch on top of mainline with fancy description
and stable tag tomorrow.

Alex

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e7434cd..a474e6e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -734,7 +734,6 @@ SYM_CODE_START_LOCAL(common_spurious)
 	call	interrupt_entry
 	UNWIND_HINT_REGS indirect=1
 	movq	ORIG_RAX(%rdi), %rsi	/* get vector from stack */
-	movq	$-1, ORIG_RAX(%rdi)	/* no syscall to restart */
 	call	smp_spurious_interrupt	/* rdi points to pt_regs */
 	jmp	ret_from_intr
SYM_CODE_END(common_spurious)
@@ -746,7 +745,6 @@ SYM_CODE_START_LOCAL(common_interrupt)
 	call	interrupt_entry
 	UNWIND_HINT_REGS indirect=1
 	movq	ORIG_RAX(%rdi), %rsi	/* get vector from stack */
-	movq	$-1, ORIG_RAX(%rdi)	/* no syscall to restart */
 	call	do_IRQ
Re: [patch V9 21/39] x86/irq: Convey vector as argument and not in ptregs
On Tue, Aug 25, 2020 at 4:18 PM Alexander Graf wrote:
> Hi Thomas,
>
> On 25.08.20 12:28, Thomas Gleixner wrote:
>> Alex,
>>
>> On Mon, Aug 24 2020 at 19:29, Alexander Graf wrote:
>>> I'm currently trying to understand a performance regression with
>>> ScyllaDB on i3en.3xlarge (KVM based VM on Skylake) which we reliably
>>> bisected down to this commit:
>>>
>>>    https://github.com/scylladb/scylla/issues/7036
>>>
>>> What we're seeing is that syscalls such as membarrier() take forever
>>> (0-10 µs would be normal):
>> ...
>>> That again seems to stem from a vastly slowed down
>>> smp_call_function_many_cond():
>>>
>>>   Samples: 218K of event 'cpu-clock', 4000 Hz
>>>   Overhead  Shared Object  Symbol
>>>     94.51%  [kernel]       [k] smp_call_function_many_cond
>>>      0.76%  [kernel]       [k] __do_softirq
>>>      0.32%  [kernel]       [k] native_queued_spin_lock_slowpath
>>>   [...]
>>>
>>> which is stuck in
>>>
>>>          │      csd_lock_wait():
>>>          │        smp_cond_load_acquire(&csd->flags, !(VAL &
>>>     0.00 │        mov    0x8(%rcx),%edx
>>>     0.00 │        and    $0x1,%edx
>>>          │      ↓ je     2b9
>>>          │      rep_nop():
>>>     0.70 │2af:    pause
>>>          │      csd_lock_wait():
>>>    92.82 │        mov    0x8(%rcx),%edx
>>>     6.48 │        and    $0x1,%edx
>>>     0.00 │      ↑ jne    2af
>>>     0.00 │2b9:  ↑ jmp    282
>>>
>>> Given the patch at hand I was expecting lost IPIs, but I can't quite
>>> see anything getting lost.
>>
>> I have no idea how that patch should be related to IPIs and smp
>> function calls. It changes the way regular device interrupts and
>> their spurious counterparts are handled, not the way IPIs are
>> handled. IPIs are handled by direct vectors and do not go through
>> do_IRQ() at all.
>>
>> Aside of that, the commit just changes the way the interrupt vector
>> of a regular device interrupt is stored and conveyed. The extra read
>> and write on the cache-hot stack is hardly related to anything IPI.
>
> I am as puzzled as you are, but the bisect is very clear: 79b9c183021e
> works fast and 633260fa1 (as well as mainline) shows the weird
> behavior above.
>
> It gets even better. This small (demonstrative only, mangled) patch on
> top of 633260fa1 also resolves the performance issue:
>
>   diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
>   index c766936..7e91e9a 100644
>   --- a/arch/x86/kernel/irq.c
>   +++ b/arch/x86/kernel/irq.c
>   @@ -239,6 +239,7 @@ __visible void __irq_entry do_IRQ(struct pt_regs *regs, unsigned long vector)
>             * lower 8 bits.
>             */
>            vector &= 0xFF;
>   +        regs->orig_ax = ~vector;
>
>            /* entering_irq() tells RCU that we're not quiescent. Check it. */
>            RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
>
> To me that sounds like some irq exit code somewhere must still be
> looking at orig_ax to decide on something - and that something is
> wrong now that we removed the negation of the vector. It also seems
> to have an impact on remote function calls.
>
> I'll have a deeper look tomorrow to see if I can find any such place,
> but I wouldn't mind if anyone could point me in the right direction
> earlier :).

How about this:

  void irq_complete_move(struct irq_cfg *cfg)
  {
          __irq_complete_move(cfg, ~get_irq_regs()->orig_ax);
  }

in arch/x86/kernel/apic/vector.c.
Re: [patch V9 21/39] x86/irq: Convey vector as argument and not in ptregs
Hi Thomas,

On 25.08.20 12:28, Thomas Gleixner wrote:
> Alex,
>
> On Mon, Aug 24 2020 at 19:29, Alexander Graf wrote:
>> I'm currently trying to understand a performance regression with
>> ScyllaDB on i3en.3xlarge (KVM based VM on Skylake) which we reliably
>> bisected down to this commit:
>>
>>    https://github.com/scylladb/scylla/issues/7036
>>
>> What we're seeing is that syscalls such as membarrier() take forever
>> (0-10 µs would be normal):
> ...
>> That again seems to stem from a vastly slowed down
>> smp_call_function_many_cond():
>>
>>   Samples: 218K of event 'cpu-clock', 4000 Hz
>>   Overhead  Shared Object  Symbol
>>     94.51%  [kernel]       [k] smp_call_function_many_cond
>>      0.76%  [kernel]       [k] __do_softirq
>>      0.32%  [kernel]       [k] native_queued_spin_lock_slowpath
>>   [...]
>>
>> which is stuck in
>>
>>          │      csd_lock_wait():
>>          │        smp_cond_load_acquire(&csd->flags, !(VAL &
>>     0.00 │        mov    0x8(%rcx),%edx
>>     0.00 │        and    $0x1,%edx
>>          │      ↓ je     2b9
>>          │      rep_nop():
>>     0.70 │2af:    pause
>>          │      csd_lock_wait():
>>    92.82 │        mov    0x8(%rcx),%edx
>>     6.48 │        and    $0x1,%edx
>>     0.00 │      ↑ jne    2af
>>     0.00 │2b9:  ↑ jmp    282
>>
>> Given the patch at hand I was expecting lost IPIs, but I can't quite
>> see anything getting lost.
>
> I have no idea how that patch should be related to IPIs and smp
> function calls. It changes the way regular device interrupts and their
> spurious counterparts are handled, not the way IPIs are handled. IPIs
> are handled by direct vectors and do not go through do_IRQ() at all.
>
> Aside of that, the commit just changes the way the interrupt vector of
> a regular device interrupt is stored and conveyed. The extra read and
> write on the cache-hot stack is hardly related to anything IPI.

I am as puzzled as you are, but the bisect is very clear: 79b9c183021e
works fast and 633260fa1 (as well as mainline) shows the weird behavior
above.

It gets even better. This small (demonstrative only, mangled) patch on
top of 633260fa1 also resolves the performance issue:

  diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
  index c766936..7e91e9a 100644
  --- a/arch/x86/kernel/irq.c
  +++ b/arch/x86/kernel/irq.c
  @@ -239,6 +239,7 @@ __visible void __irq_entry do_IRQ(struct pt_regs *regs, unsigned long vector)
            * lower 8 bits.
            */
           vector &= 0xFF;
  +        regs->orig_ax = ~vector;

           /* entering_irq() tells RCU that we're not quiescent. Check it. */
           RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");

To me that sounds like some irq exit code somewhere must still be
looking at orig_ax to decide on something - and that something is wrong
now that we removed the negation of the vector. It also seems to have
an impact on remote function calls.

I'll have a deeper look tomorrow to see if I can find any such place,
but I wouldn't mind if anyone could point me in the right direction
earlier :).

Alex


Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Re: [patch V9 21/39] x86/irq: Convey vector as argument and not in ptregs
Alex,

On Mon, Aug 24 2020 at 19:29, Alexander Graf wrote:
> I'm currently trying to understand a performance regression with
> ScyllaDB on i3en.3xlarge (KVM based VM on Skylake) which we reliably
> bisected down to this commit:
>
>    https://github.com/scylladb/scylla/issues/7036
>
> What we're seeing is that syscalls such as membarrier() take forever
> (0-10 µs would be normal):
...
> That again seems to stem from a vastly slowed down
> smp_call_function_many_cond():
>
>   Samples: 218K of event 'cpu-clock', 4000 Hz
>   Overhead  Shared Object  Symbol
>     94.51%  [kernel]       [k] smp_call_function_many_cond
>      0.76%  [kernel]       [k] __do_softirq
>      0.32%  [kernel]       [k] native_queued_spin_lock_slowpath
>   [...]
>
> which is stuck in
>
>          │      csd_lock_wait():
>          │        smp_cond_load_acquire(&csd->flags, !(VAL &
>     0.00 │        mov    0x8(%rcx),%edx
>     0.00 │        and    $0x1,%edx
>          │      ↓ je     2b9
>          │      rep_nop():
>     0.70 │2af:    pause
>          │      csd_lock_wait():
>    92.82 │        mov    0x8(%rcx),%edx
>     6.48 │        and    $0x1,%edx
>     0.00 │      ↑ jne    2af
>     0.00 │2b9:  ↑ jmp    282
>
> Given the patch at hand I was expecting lost IPIs, but I can't quite
> see anything getting lost.

I have no idea how that patch should be related to IPIs and smp function
calls. It changes the way regular device interrupts and their spurious
counterparts are handled, not the way IPIs are handled. IPIs are handled
by direct vectors and do not go through do_IRQ() at all.

Aside of that, the commit just changes the way the interrupt vector of a
regular device interrupt is stored and conveyed. The extra read and
write on the cache-hot stack is hardly related to anything IPI.

Thanks,

        tglx
Re: [patch V9 21/39] x86/irq: Convey vector as argument and not in ptregs
Hi Thomas,

On 21.05.20 22:05, Thomas Gleixner wrote:
> From: Thomas Gleixner
>
> Device interrupts which go through do_IRQ() or the spurious interrupt
> handler have their separate entry code on 64 bit for no good reason.
>
> Both 32 and 64 bit transport the vector number through ORIG_[RE]AX in
> pt_regs. Further, the vector number is forced to fit into a u8 and is
> complemented and offset by 0x80 so it's in the signed character range.
> Otherwise GAS would expand the pushq to a 5 byte instruction for any
> vector > 0x7F.
>
> Treat the vector number like an error code and hand it to the C
> function as an argument. This allows us to get rid of the extra entry
> code in a later step.
>
> Simplify the error code push magic by implementing the pushq imm8 via
> a '.byte 0x6a, vector' sequence so GAS is not able to screw it up.
> Since the pushq imm8 is sign-extending, the resulting error code needs
> to be truncated to 8 bits in C code.
>
> Originally-by: Andy Lutomirski
> Signed-off-by: Thomas Gleixner
> Acked-by: Andy Lutomirski

I'm currently trying to understand a performance regression with
ScyllaDB on i3en.3xlarge (KVM based VM on Skylake) which we reliably
bisected down to this commit:

   https://github.com/scylladb/scylla/issues/7036

What we're seeing is that syscalls such as membarrier() take forever
(0-10 µs would be normal):

  % time     seconds  usecs/call  calls  errors  syscall
  ------  ----------  ----------------------------------
   53.26   12.458881  18595367            membarrier
   15.79    3.693651  17843 20749         read
   11.17    2.613350  6700839             io_pgetevents
   10.89    2.547772  11795 216           timerfd_settime
    6.91    1.616802  11073 146           rt_sigprocmask
    1.39    0.325955  354292              timer_settime
    0.36    0.083691  526 159             io_submit
    0.22    0.051399  53596               write
    0.00    0.000783  3721                sendmsg
    0.00    0.57      318 9               ioctl
  ------  ----------  ----------------------------------
  100.00  23.392341   106158              total

That again seems to stem from a vastly slowed down
smp_call_function_many_cond():

  Samples: 218K of event 'cpu-clock', 4000 Hz
  Overhead  Shared Object  Symbol
    94.51%  [kernel]       [k] smp_call_function_many_cond
     0.76%  [kernel]       [k] __do_softirq
     0.32%  [kernel]       [k] native_queued_spin_lock_slowpath
  [...]

which is stuck in

         │      csd_lock_wait():
         │        smp_cond_load_acquire(&csd->flags, !(VAL &
    0.00 │        mov    0x8(%rcx),%edx
    0.00 │        and    $0x1,%edx
         │      ↓ je     2b9
         │      rep_nop():
    0.70 │2af:    pause
         │      csd_lock_wait():
   92.82 │        mov    0x8(%rcx),%edx
    6.48 │        and    $0x1,%edx
    0.00 │      ↑ jne    2af
    0.00 │2b9:  ↑ jmp    282

Given the patch at hand I was expecting lost IPIs, but I can't quite see
anything getting lost. Do you have any further pointers I could look at?

Thanks,

Alex
Re: [patch V9 21/39] x86/irq: Convey vector as argument and not in ptregs
Thomas Gleixner writes:
> +	.align 8
> +SYM_CODE_START(irq_entries_start)
> +    vector=FIRST_EXTERNAL_VECTOR
> +    pos = .
> +    .rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
> +	UNWIND_HINT_IRET_REGS
> +	.byte	0x6a, vector
> +	jmp	common_interrupt
> +	.nops (pos + 8) - .

Boris pointed out that .nops requires binutils >= 2.31 ...

Sigh! The below delta patch works for the purpose.

Thanks,

        tglx

--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -475,7 +475,9 @@ SYM_CODE_START(irq_entries_start)
 	UNWIND_HINT_IRET_REGS
 	.byte	0x6a, vector
 	jmp	asm_common_interrupt
-	.nops (pos + 8) - .
+	nop
+	/* Ensure that the above is 8 bytes max */
+	. = pos + 8
 	pos=pos+8
 	vector=vector+1
 .endr
@@ -490,7 +492,9 @@ SYM_CODE_START(spurious_entries_start)
 	UNWIND_HINT_IRET_REGS
 	.byte	0x6a, vector
 	jmp	asm_spurious_interrupt
-	.nops (pos + 8) - .
+	nop
+	/* Ensure that the above is 8 bytes max */
+	. = pos + 8
 	pos=pos+8
 	vector=vector+1
 .endr