[tip: irq/core] genirq: Reduce irqdebug cacheline bouncing
The following commit has been merged into the irq/core branch of tip:

Commit-ID:     7c07012eb1be8b4a95d3502fd30795849007a40e
Gitweb:        https://git.kernel.org/tip/7c07012eb1be8b4a95d3502fd30795849007a40e
Author:        Nicholas Piggin
AuthorDate:    Fri, 02 Apr 2021 23:20:37 +10:00
Committer:     Thomas Gleixner
CommitterDate: Sat, 10 Apr 2021 13:35:54 +02:00

genirq: Reduce irqdebug cacheline bouncing

note_interrupt() increments desc->irq_count for each interrupt even for
percpu interrupt handlers, even when they are handled successfully. This
causes cacheline bouncing and limits scalability.

Instead of incrementing irq_count every time, only start incrementing it
after seeing an unhandled irq, which should avoid the cacheline bouncing
in the common path.

This should also give better consistency in handling misbehaving irqs:
instead of the first unhandled irq arriving at an arbitrary point in the
irq_count cycle, its arrival begins the irq_count cycle.

Cédric reports the result of his IPI throughput test (millions of IPIs/s):

                        upstream    upstream     patched
  chips   cpus          default     noirqdebug   default (irqdebug)
  -----   -----         --------------------------------------------
  1       0-15           4.061       4.153        4.084
          0-31           7.937       8.186        8.158
          0-47          11.018      11.392       11.233
          0-63          11.460      13.907       14.022
  2       0-79           8.376      18.105       18.084
          0-95           7.338      22.101       22.266
          0-111          6.716      25.306       25.473
          0-127          6.223      27.814       28.029

Signed-off-by: Nicholas Piggin
Signed-off-by: Thomas Gleixner
Link: https://lore.kernel.org/r/20210402132037.574661-1-npig...@gmail.com
---
 kernel/irq/spurious.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index f865e5f..c481d84 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -403,6 +403,10 @@ void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret)
 		desc->irqs_unhandled -= ok;
 	}
 
+	if (likely(!desc->irqs_unhandled))
+		return;
+
+	/* Now getting into unhandled irq detection */
 	desc->irq_count++;
 	if (likely(desc->irq_count < 10))
 		return;
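The accounting change above can be illustrated with a small userspace toy model. This is a sketch, not the kernel code: `struct fake_desc` and `note_interrupt_sketch` are hypothetical names modelling only the two counters the patch touches, and the real note_interrupt() has additional decay and poll logic that is omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the patched accounting; only irq_count and
 * irqs_unhandled from struct irq_desc are represented. */
struct fake_desc {
	unsigned int irq_count;
	unsigned int irqs_unhandled;
};

static void note_interrupt_sketch(struct fake_desc *desc, bool handled)
{
	if (!handled)
		desc->irqs_unhandled++;

	/* Common path: every irq handled so far, so irq_count (and its
	 * cacheline) is never written by this CPU. */
	if (desc->irqs_unhandled == 0)
		return;

	/* The unhandled-irq detection cycle begins at the first miss. */
	desc->irq_count++;
}
```

A run of successfully handled interrupts leaves `irq_count` untouched (no shared-cacheline write), and counting only starts once an unhandled interrupt is seen, which is the behaviour the changelog describes.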
[tip: sched/core] sched/wait_bit, mm/filemap: Increase page and bit waitqueue hash size
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     873d7c4c6a920d43ff82e44121e54053d4edba93
Gitweb:        https://git.kernel.org/tip/873d7c4c6a920d43ff82e44121e54053d4edba93
Author:        Nicholas Piggin
AuthorDate:    Wed, 17 Mar 2021 17:54:27 +10:00
Committer:     Ingo Molnar
CommitterDate: Wed, 17 Mar 2021 09:32:30 +01:00

sched/wait_bit, mm/filemap: Increase page and bit waitqueue hash size

The page waitqueue hash is a bit small (256 entries) on very big systems.
A 16 socket 1536 thread POWER9 system was found to encounter hash
collisions and excessive time in waitqueue locking at times. This was
intermittent and hard to reproduce easily with the setup we had (very
little real IO capacity). The theory is that sometimes (depending on
allocation luck) important pages would happen to collide a lot in the
hash, slowing down page locking and causing the problem to snowball.

A small test case was made where threads would write and fsync different
pages, generating just a small amount of contention across many pages.

Increasing the page waitqueue hash size to 262144 entries increased
throughput by 182% while also reducing the standard deviation 3x.

perf before the increase:

  36.23%  [k] _raw_spin_lock_irqsave
          |
          |--34.60%--wake_up_page_bit
          |          0
          |          iomap_write_end.isra.38
          |          iomap_write_actor
          |          iomap_apply
          |          iomap_file_buffered_write
          |          xfs_file_buffered_aio_write
          |          new_sync_write

  17.93%  [k] native_queued_spin_lock_slowpath
          |
          |--16.74%--_raw_spin_lock_irqsave
          |          |
          |           --16.44%--wake_up_page_bit
          |                     iomap_write_end.isra.38
          |                     iomap_write_actor
          |                     iomap_apply
          |                     iomap_file_buffered_write
          |                     xfs_file_buffered_aio_write

This patch uses alloc_large_system_hash to allocate a bigger system hash
that scales somewhat with memory size. The bit/var wait-queue is also
changed to keep the code matching, albeit with a smaller scale factor.

A very small CONFIG_BASE_SMALL option is also added, because these are
two of the biggest static objects in the image on very small systems.
This hash could be made per-node, which may help reduce remote accesses
on well localised workloads, but that adds some complexity with indexing
and hotplug, so until we get a less artificial workload to test with,
keep it simple.

Signed-off-by: Nicholas Piggin
Signed-off-by: Ingo Molnar
Acked-by: Peter Zijlstra
Link: https://lore.kernel.org/r/20210317075427.587806-1-npig...@gmail.com
---
 kernel/sched/wait_bit.c | 30 +++---
 mm/filemap.c            | 24 +---
 2 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index 02ce292..dba73de 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -2,19 +2,24 @@
 /*
  * The implementation of the wait_bit*() and related waiting APIs:
  */
+#include
 #include "sched.h"
 
-#define WAIT_TABLE_BITS 8
-#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
-
-static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
+#define BIT_WAIT_TABLE_SIZE (1 << bit_wait_table_bits)
+#if CONFIG_BASE_SMALL
+static const unsigned int bit_wait_table_bits = 3;
+static wait_queue_head_t bit_wait_table[BIT_WAIT_TABLE_SIZE] __cacheline_aligned;
+#else
+static unsigned int bit_wait_table_bits __ro_after_init;
+static wait_queue_head_t *bit_wait_table __ro_after_init;
+#endif
 
 wait_queue_head_t *bit_waitqueue(void *word, int bit)
 {
 	const int shift = BITS_PER_LONG == 32 ? 5 : 6;
 	unsigned long val = (unsigned long)word << shift | bit;
 
-	return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
+	return bit_wait_table + hash_long(val, bit_wait_table_bits);
 }
 EXPORT_SYMBOL(bit_waitqueue);
 
@@ -152,7 +157,7 @@ EXPORT_SYMBOL(wake_up_bit);
 wait_queue_head_t *__var_waitqueue(void *p)
 {
-	return bit_wait_table + hash_ptr(p, WAIT_TABLE_BITS);
+	return bit_wait_table + hash_ptr(p, bit_wait_table_bits);
 }
 EXPORT_SYMBOL(__var_waitqueue);
 
@@ -246,6 +251,17 @@ void __init wait_bit_init(void)
 {
 	int i;
 
-	for (i = 0; i < WAIT_TABLE_SIZE; i++)
+	if (!CONFIG_BASE_SMALL) {
+		bit_wait_table = alloc_large_system_hash("bit waitqueue hash",
+							sizeof(wait_queue_head_t),
+							0,
+							22,
+							0,
+
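The slot selection that bit_waitqueue() performs once the table size is a runtime value can be sketched in plain C. The golden-ratio constant is the 64-bit one from include/linux/hash.h; `hash_long_sketch` and `bit_waitqueue_index` are illustrative helpers for the 64-bit case, not kernel functions.

```c
#include <stdint.h>

/* Golden-ratio multiplicative hashing, as hash_long() does on 64-bit:
 * multiply, then keep the best-mixed high bits of the product. */
#define GOLDEN_RATIO_64 0x61C8864680B583EBull

static unsigned int hash_long_sketch(unsigned long long val, unsigned int bits)
{
	return (unsigned int)((val * GOLDEN_RATIO_64) >> (64 - bits));
}

static unsigned int bit_waitqueue_index(void *word, int bit,
					unsigned int table_bits)
{
	/* Shift out the always-zero low bits of a long-aligned address
	 * (6 on 64-bit) and fold the bit number in, as bit_waitqueue()
	 * does, then hash into [0, 1 << table_bits). */
	unsigned long long val =
		(unsigned long long)(uintptr_t)word << 6 | bit;

	return hash_long_sketch(val, table_bits);
}
```

Because `table_bits` is now a parameter rather than a compile-time constant, the same code serves both the tiny CONFIG_BASE_SMALL table and the large boot-time allocation; the hash stays deterministic and always lands inside the table.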
[tip: locking/core] lockdep: Only trace IRQ edges
The following commit has been merged into the locking/core branch of tip:

Commit-ID:     044d0d6de9f50192f9697583504a382347ee95ca
Gitweb:        https://git.kernel.org/tip/044d0d6de9f50192f9697583504a382347ee95ca
Author:        Nicholas Piggin
AuthorDate:    Thu, 23 Jul 2020 20:56:14 +10:00
Committer:     Peter Zijlstra
CommitterDate: Wed, 26 Aug 2020 12:41:56 +02:00

lockdep: Only trace IRQ edges

Problem:

  raw_local_irq_save();     // software state on
  local_irq_save();         // software state off
  ...
  local_irq_restore();      // software state still off, because we don't enable IRQs
  raw_local_irq_restore();  // software state still off, *whoopsie*

existing instances:

 - lock_acquire()
     raw_local_irq_save()
     __lock_acquire()
       arch_spin_lock(&graph_lock)
         pv_wait() := kvm_wait() (same or worse for Xen/HyperV)
           local_irq_save()

 - trace_clock_global()
     raw_local_irq_save()
     arch_spin_lock()
       pv_wait() := kvm_wait()
         local_irq_save()

 - apic_retrigger_irq()
     raw_local_irq_save()
     apic->send_IPI() := default_send_IPI_single_phys()
       local_irq_save()

Possible solutions:

 A) make it work by enabling the tracing inside raw_*()
 B) make it work by keeping tracing disabled inside raw_*()
 C) call it broken and clean it up now

Now, given that the only reason to use the raw_* variant is that you
don't want tracing, A) seems like a weird option (although it can be
done). C) is tempting, but OTOH it ends up converting a _lot_ of code to
raw just because there is one raw user; this strips the
validation/tracing off for all the other users.

So we pick B) and declare any code that ends up doing:

	raw_local_irq_save()
	local_irq_save()
	lockdep_assert_irqs_disabled();

broken.

AFAICT this problem has existed forever; the only reason it came up now
is that commit 859d069ee1dd ("lockdep: Prepare for NMI IRQ state
tracking") changed IRQ tracing vs lockdep recursion. The first instance
is fairly common, the other cases hardly ever happen.
Signed-off-by: Nicholas Piggin
[rewrote changelog]
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Steven Rostedt (VMware)
Reviewed-by: Thomas Gleixner
Acked-by: Rafael J. Wysocki
Tested-by: Marco Elver
Link: https://lkml.kernel.org/r/20200723105615.1268126-1-npig...@gmail.com
---
 arch/powerpc/include/asm/hw_irq.h | 11 ---
 include/linux/irqflags.h          | 15 +++
 2 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/hw_irq.h b/arch/powerpc/include/asm/hw_irq.h
index 3a0db7b..35060be 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -200,17 +200,14 @@ static inline bool arch_irqs_disabled(void)
 #define powerpc_local_irq_pmu_save(flags)			\
 	 do {							\
 		raw_local_irq_pmu_save(flags);			\
-		trace_hardirqs_off();				\
+		if (!raw_irqs_disabled_flags(flags))		\
+			trace_hardirqs_off();			\
 	} while(0)
 #define powerpc_local_irq_pmu_restore(flags)			\
 	do {							\
-		if (raw_irqs_disabled_flags(flags)) {		\
-			raw_local_irq_pmu_restore(flags);	\
-			trace_hardirqs_off();			\
-		} else {					\
+		if (!raw_irqs_disabled_flags(flags))		\
 			trace_hardirqs_on();			\
-			raw_local_irq_pmu_restore(flags);	\
-		}						\
+		raw_local_irq_pmu_restore(flags);		\
 	} while(0)
 #else
 #define powerpc_local_irq_pmu_save(flags)			\
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 00d553d..3ed4e87 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -191,25 +191,24 @@ do { \
 
 #define local_irq_disable()				\
 	do {						\
+		bool was_disabled = raw_irqs_disabled();\
 		raw_local_irq_disable();		\
-		trace_hardirqs_off();			\
+		if (!was_disabled)			\
+			trace_hardirqs_off();		\
 	} while (0)
 
 #define local_irq_save(flags)				\
 	do {						\
 		raw_local_irq_save(flags);		\
-		trace_hardirqs_off();			\
+		if (!raw_irqs_disabled_flags(flags))	\
+
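The edge-only rule the patch adopts can be modelled in a few lines of userspace C. This is a toy, not kernel code: `hw_irqs_disabled`, `trace_state_off`, `raw_irq_save`, `irq_save`, and `irq_restore` are hypothetical stand-ins for the hardware interrupt state, lockdep's software view, and the raw/traced helpers, with the traced variants flipping the software state only on an actual hardware edge.

```c
#include <stdbool.h>

static bool hw_irqs_disabled;	/* "hardware" interrupt state */
static bool trace_state_off;	/* software (lockdep) view */

static bool raw_irq_save(void)	/* raw_*: never touches tracing */
{
	bool was = hw_irqs_disabled;

	hw_irqs_disabled = true;
	return was;
}

static bool irq_save(void)	/* traced variant */
{
	bool was = raw_irq_save();

	if (!was)		/* trace only the enabled->disabled edge */
		trace_state_off = true;
	return was;
}

static void irq_restore(bool flags)
{
	if (!flags)		/* trace only the disabled->enabled edge */
		trace_state_off = false;
	hw_irqs_disabled = flags;
}
```

With this scheme, a non-raw save/restore pair nested inside a raw section sees "already disabled", produces no edge, and therefore leaves the software state exactly as the raw section left it, which is what makes the "whoopsie" sequence from the changelog harmless under option B.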