[tip: irq/core] genirq: Reduce irqdebug cacheline bouncing

2021-04-10 Thread tip-bot2 for Nicholas Piggin
The following commit has been merged into the irq/core branch of tip:

Commit-ID:     7c07012eb1be8b4a95d3502fd30795849007a40e
Gitweb:        https://git.kernel.org/tip/7c07012eb1be8b4a95d3502fd30795849007a40e
Author:        Nicholas Piggin
AuthorDate:    Fri, 02 Apr 2021 23:20:37 +10:00
Committer:     Thomas Gleixner
CommitterDate: Sat, 10 Apr 2021 13:35:54 +02:00

genirq: Reduce irqdebug cacheline bouncing

note_interrupt() increments desc->irq_count for each interrupt, even for
percpu interrupt handlers and even when they are handled successfully. This
causes cacheline bouncing and limits scalability.

Instead of incrementing irq_count every time, only start incrementing it
after seeing an unhandled irq, which should avoid the cache line
bouncing in the common path.

This actually should give better consistency in handling misbehaving
irqs too, because instead of the first unhandled irq arriving at an
arbitrary point in the irq_count cycle, its arrival will begin the
irq_count cycle.
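
A simplified sketch of the resulting fast path (illustrative only; the
exact hunk is below): the shared desc->irq_count cacheline is not written
on the common, successfully handled path, and the counting cycle now
starts with the first unhandled irq:

	/* sketch, simplified from the patched note_interrupt() */
	if (likely(!desc->irqs_unhandled))
		return;			/* handled: irq_count untouched */

	desc->irq_count++;		/* unhandled irq detection from here on */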

Cédric reports the result of his IPI throughput test:

                   Millions of IPIs/s
   -----------   -----------------------------------------------
                 upstream       upstream       patched
   chips  cpus   default        noirqdebug     default (irqdebug)
   -----------   -----------------------------------------------
   1      0-15   4.061          4.153          4.084
          0-31   7.937          8.186          8.158
          0-47   11.018         11.392         11.233
          0-63   11.460         13.907         14.022
   2      0-79   8.376          18.105         18.084
          0-95   7.338          22.101         22.266
          0-111  6.716          25.306         25.473
          0-127  6.223          27.814         28.029

Signed-off-by: Nicholas Piggin 
Signed-off-by: Thomas Gleixner 
Link: https://lore.kernel.org/r/20210402132037.574661-1-npig...@gmail.com

---
 kernel/irq/spurious.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index f865e5f..c481d84 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -403,6 +403,10 @@ void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret)
 		desc->irqs_unhandled -= ok;
 	}
 
+	if (likely(!desc->irqs_unhandled))
+		return;
+
+	/* Now getting into unhandled irq detection */
 	desc->irq_count++;
 	if (likely(desc->irq_count < 100000))
 		return;


[tip: sched/core] sched/wait_bit, mm/filemap: Increase page and bit waitqueue hash size

2021-03-17 Thread tip-bot2 for Nicholas Piggin
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     873d7c4c6a920d43ff82e44121e54053d4edba93
Gitweb:        https://git.kernel.org/tip/873d7c4c6a920d43ff82e44121e54053d4edba93
Author:        Nicholas Piggin
AuthorDate:    Wed, 17 Mar 2021 17:54:27 +10:00
Committer:     Ingo Molnar
CommitterDate: Wed, 17 Mar 2021 09:32:30 +01:00

sched/wait_bit, mm/filemap: Increase page and bit waitqueue hash size

The page waitqueue hash is a bit small (256 entries) on very big systems. A
16 socket 1536 thread POWER9 system was found to encounter hash collisions
and excessive time in waitqueue locking at times. This was intermittent and
hard to reproduce easily with the setup we had (very little real IO
capacity). The theory is that sometimes (depending on allocation luck)
important pages would happen to collide a lot in the hash, slowing down page
locking, causing the problem to snowball.

A small test case was made where threads would write and fsync different
pages, generating just a small amount of contention across many pages.
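
A hypothetical reconstruction of that kind of microbenchmark (the actual
test is not part of the commit; the file name, thread count and iteration
count here are made up): each thread repeatedly writes one byte to its own
page of a shared file and fsyncs it, spreading light contention across
many page wait queues.

	#include <fcntl.h>
	#include <pthread.h>
	#include <stdio.h>
	#include <sys/types.h>
	#include <unistd.h>

	#define NTHREADS 64
	#define ITERS    10000

	static int fd;

	static void *worker(void *arg)
	{
		long id = (long)arg;
		off_t off = (off_t)id * 4096;	/* one page per thread */
		char c = 'x';

		for (int i = 0; i < ITERS; i++) {
			if (pwrite(fd, &c, 1, off) != 1)
				perror("pwrite");
			fsync(fd);	/* forces writeback and page lock/wait traffic */
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t th[NTHREADS];

		fd = open("testfile", O_CREAT | O_RDWR, 0600);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		for (long i = 0; i < NTHREADS; i++)
			pthread_create(&th[i], NULL, worker, (void *)i);
		for (long i = 0; i < NTHREADS; i++)
			pthread_join(th[i], NULL);
		close(fd);
		return 0;
	}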

Increasing page waitqueue hash size to 262144 entries increased throughput
by 182% while also reducing standard deviation 3x. perf before the increase:

  36.23%  [k] _raw_spin_lock_irqsave-  -
  |
  |--34.60%--wake_up_page_bit
  |  0
  |  iomap_write_end.isra.38
  |  iomap_write_actor
  |  iomap_apply
  |  iomap_file_buffered_write
  |  xfs_file_buffered_aio_write
  |  new_sync_write

  17.93%  [k] native_queued_spin_lock_slowpath  -  -
  |
  |--16.74%--_raw_spin_lock_irqsave
  |  |
  |   --16.44%--wake_up_page_bit
  | iomap_write_end.isra.38
  | iomap_write_actor
  | iomap_apply
  | iomap_file_buffered_write
  | xfs_file_buffered_aio_write

This patch uses alloc_large_system_hash to allocate a bigger system hash
that scales somewhat with memory size. The bit/var wait-queue is also
changed to keep code matching, albeit with a smaller scale factor.

For CONFIG_BASE_SMALL builds a very small static table is kept instead,
because these are two of the biggest static objects in the image on very
small systems.

This hash could be made per-node, which may help reduce remote accesses
on well localised workloads, but that adds some complexity with indexing
and hotplug, so until we get a less artificial workload to test with,
keep it simple.
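
The mm/filemap.c hunk is not reproduced in the (truncated) diff below, but
a rough sketch of the shape of that change (ignoring the CONFIG_BASE_SMALL
case; the names page_wait_table/page_waitqueue/pagecache_init come from
the existing code, while the scale factor 21 is only illustrative):

	static unsigned int page_wait_table_bits __ro_after_init;
	static wait_queue_head_t *page_wait_table __ro_after_init;

	static wait_queue_head_t *page_waitqueue(struct page *page)
	{
		return &page_wait_table[hash_ptr(page, page_wait_table_bits)];
	}

	void __init pagecache_init(void)
	{
		unsigned int i;

		page_wait_table = alloc_large_system_hash("page waitqueue hash",
						sizeof(wait_queue_head_t),
						0,	/* entries: scale with memory */
						21,	/* scale shift (illustrative) */
						0,	/* flags */
						&page_wait_table_bits,
						NULL,	/* no mask needed */
						0, 0);	/* no explicit limits */

		for (i = 0; i < (1U << page_wait_table_bits); i++)
			init_waitqueue_head(&page_wait_table[i]);
	}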

Signed-off-by: Nicholas Piggin 
Signed-off-by: Ingo Molnar 
Acked-by: Peter Zijlstra 
Link: https://lore.kernel.org/r/20210317075427.587806-1-npig...@gmail.com
---
 kernel/sched/wait_bit.c | 30 +++---
 mm/filemap.c            | 24 +---
 2 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index 02ce292..dba73de 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -2,19 +2,24 @@
 /*
  * The implementation of the wait_bit*() and related waiting APIs:
  */
+#include 
 #include "sched.h"
 
-#define WAIT_TABLE_BITS 8
-#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
-
-static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
+#define BIT_WAIT_TABLE_SIZE (1 << bit_wait_table_bits)
+#if CONFIG_BASE_SMALL
+static const unsigned int bit_wait_table_bits = 3;
+static wait_queue_head_t bit_wait_table[BIT_WAIT_TABLE_SIZE] __cacheline_aligned;
+#else
+static unsigned int bit_wait_table_bits __ro_after_init;
+static wait_queue_head_t *bit_wait_table __ro_after_init;
+#endif
 
 wait_queue_head_t *bit_waitqueue(void *word, int bit)
 {
const int shift = BITS_PER_LONG == 32 ? 5 : 6;
unsigned long val = (unsigned long)word << shift | bit;
 
-   return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
+   return bit_wait_table + hash_long(val, bit_wait_table_bits);
 }
 EXPORT_SYMBOL(bit_waitqueue);
 
@@ -152,7 +157,7 @@ EXPORT_SYMBOL(wake_up_bit);
 
 wait_queue_head_t *__var_waitqueue(void *p)
 {
-   return bit_wait_table + hash_ptr(p, WAIT_TABLE_BITS);
+   return bit_wait_table + hash_ptr(p, bit_wait_table_bits);
 }
 EXPORT_SYMBOL(__var_waitqueue);
 
@@ -246,6 +251,17 @@ void __init wait_bit_init(void)
 {
int i;
 
-   for (i = 0; i < WAIT_TABLE_SIZE; i++)
+   if (!CONFIG_BASE_SMALL) {
+   bit_wait_table = alloc_large_system_hash("bit waitqueue hash",
+					sizeof(wait_queue_head_t),
+   0,
+   22,
+   0,
+

[tip: locking/core] lockdep: Only trace IRQ edges

2020-08-27 Thread tip-bot2 for Nicholas Piggin
The following commit has been merged into the locking/core branch of tip:

Commit-ID:     044d0d6de9f50192f9697583504a382347ee95ca
Gitweb:        https://git.kernel.org/tip/044d0d6de9f50192f9697583504a382347ee95ca
Author:        Nicholas Piggin
AuthorDate:    Thu, 23 Jul 2020 20:56:14 +10:00
Committer:     Peter Zijlstra
CommitterDate: Wed, 26 Aug 2020 12:41:56 +02:00

lockdep: Only trace IRQ edges

Problem:

  raw_local_irq_save(); // software state on
  local_irq_save(); // software state off
  ...
  local_irq_restore(); // software state still off, because we don't enable IRQs
  raw_local_irq_restore(); // software state still off, *whoopsie*

existing instances:

 - lock_acquire()
 raw_local_irq_save()
 __lock_acquire()
   arch_spin_lock(_lock)
 pv_wait() := kvm_wait() (same or worse for Xen/HyperV)
   local_irq_save()

 - trace_clock_global()
 raw_local_irq_save()
 arch_spin_lock()
   pv_wait() := kvm_wait()
 local_irq_save()

 - apic_retrigger_irq()
 raw_local_irq_save()
 apic->send_IPI() := default_send_IPI_single_phys()
   local_irq_save()

Possible solutions:

 A) make it work by enabling the tracing inside raw_*()
 B) make it work by keeping tracing disabled inside raw_*()
 C) call it broken and clean it up now

Now, given that the only reason to use the raw_* variant is that you don't
want tracing, A) seems like a weird option (although it can be done). C) is
tempting, but OTOH it ends up converting a _lot_ of code to raw just because
there is one raw user, which strips the validation/tracing off for all the
other users.

So we pick B) and declare any code that ends up doing:

raw_local_irq_save()
local_irq_save()
lockdep_assert_irqs_disabled();

broken. AFAICT this problem has existed forever; the only reason it came
up is that commit 859d069ee1dd ("lockdep: Prepare for NMI IRQ
state tracking") changed IRQ tracing vs lockdep recursion, and the
first instance is fairly common while the other cases hardly ever happen.
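
To illustrate the resulting "trace only on edges" rule, here is a sketch of
the idea for the save/restore pair (simplified, not a verbatim copy of the
final macros): the trace hooks fire only when the IRQ state actually
changes at that point, so a non-raw section nested inside a raw one no
longer flips the software state.

	#define local_irq_save(flags)				\
	do {							\
		raw_local_irq_save(flags);			\
		if (!raw_irqs_disabled_flags(flags))		\
			trace_hardirqs_off();			\
	} while (0)

	#define local_irq_restore(flags)			\
	do {							\
		if (!raw_irqs_disabled_flags(flags))		\
			trace_hardirqs_on();			\
		raw_local_irq_restore(flags);			\
	} while (0)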

Signed-off-by: Nicholas Piggin 
[rewrote changelog]
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Steven Rostedt (VMware) 
Reviewed-by: Thomas Gleixner 
Acked-by: Rafael J. Wysocki 
Tested-by: Marco Elver 
Link: https://lkml.kernel.org/r/20200723105615.1268126-1-npig...@gmail.com
---
 arch/powerpc/include/asm/hw_irq.h | 11 ---
 include/linux/irqflags.h  | 15 +++
 2 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/hw_irq.h b/arch/powerpc/include/asm/hw_irq.h
index 3a0db7b..35060be 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -200,17 +200,14 @@ static inline bool arch_irqs_disabled(void)
 #define powerpc_local_irq_pmu_save(flags)  \
 do {   \
raw_local_irq_pmu_save(flags);  \
-   trace_hardirqs_off();   \
+   if (!raw_irqs_disabled_flags(flags))\
+   trace_hardirqs_off();   \
} while(0)
 #define powerpc_local_irq_pmu_restore(flags)   \
do {\
-   if (raw_irqs_disabled_flags(flags)) {   \
-   raw_local_irq_pmu_restore(flags);   \
-   trace_hardirqs_off();   \
-   } else {\
+   if (!raw_irqs_disabled_flags(flags))\
trace_hardirqs_on();\
-   raw_local_irq_pmu_restore(flags);   \
-   }   \
+   raw_local_irq_pmu_restore(flags);   \
} while(0)
 #else
 #define powerpc_local_irq_pmu_save(flags)  \
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 00d553d..3ed4e87 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -191,25 +191,24 @@ do {  \
 
 #define local_irq_disable()\
do {\
+   bool was_disabled = raw_irqs_disabled();\
raw_local_irq_disable();\
-   trace_hardirqs_off();   \
+   if (!was_disabled)  \
+   trace_hardirqs_off();   \
} while (0)
 
 #define local_irq_save(flags)  \
do {\
raw_local_irq_save(flags);  \
-   trace_hardirqs_off();   \
+   if (!raw_irqs_disabled_flags(flags))\
+