Re: [PATCH v1 3/3] swiotlb: Split up single swiotlb lock

2022-06-29 Thread Chao Gao
On Tue, Jun 28, 2022 at 03:01:34PM +0800, Chao Gao wrote:
>From: Andi Kleen 
>
>Traditionally swiotlb was not performance critical because it was only
>used for slow devices. But in some setups, like TDX confidential
>guests, all IO has to go through swiotlb. Currently swiotlb only has a
>single lock. Under high IO load with multiple CPUs this can lead to
>significant lock contention on the swiotlb lock. We've seen 20+% of
>CPU time spent in locks in some extreme cases.
>
>This patch splits the swiotlb into individual areas which have their
>own lock. Each CPU tries to allocate in its own area first. Only if
>that fails does it search other areas. On freeing, an allocation is
>returned to the area it originally came from.
>
>To avoid doing a full modulo in the main path, the number of swiotlb
>areas is always rounded up to the next power of two. I believe that's
>not really needed anymore on modern CPUs (which have fast enough
>dividers), but it is still a good idea on older parts.
>
>The number of areas can be set using the swiotlb option. But to avoid
>every user having to set this option, the default is the number of
>available CPUs. Unfortunately, on x86 swiotlb is initialized before
>num_possible_cpus() is available, so a custom hook called from the
>early ACPI code is used instead.
>
>Signed-off-by: Andi Kleen 
>[ rebase and fix warnings of checkpatch.pl ]
>Signed-off-by: Chao Gao 

Just noticed that Tianyu already posted a variant of this patch.
Will drop this one from my series.


[PATCH v1 3/3] swiotlb: Split up single swiotlb lock

2022-06-28 Thread Chao Gao
From: Andi Kleen 

Traditionally swiotlb was not performance critical because it was only
used for slow devices. But in some setups, like TDX confidential
guests, all IO has to go through swiotlb. Currently swiotlb only has a
single lock. Under high IO load with multiple CPUs this can lead to
significant lock contention on the swiotlb lock. We've seen 20+% of
CPU time spent in locks in some extreme cases.

This patch splits the swiotlb into individual areas which have their
own lock. Each CPU tries to allocate in its own area first. Only if
that fails does it search other areas. On freeing, an allocation is
returned to the area it originally came from.
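
For illustration, here is a minimal userspace sketch of that allocation
policy. All names here (NAREAS, struct area, alloc_slot) are made up
for this example; the real implementation lives in kernel/dma/swiotlb.c
and takes a per-area spinlock instead of the bare counter used here:

#include <stdio.h>

#define NAREAS         4   /* must be a power of two, see below */
#define SLOTS_PER_AREA 2

struct area {
	int used;          /* the kernel guards this with a per-area spinlock */
};

static struct area areas[NAREAS];

/*
 * Allocate one slot, preferring the CPU's "home" area, then falling
 * back to the other areas in order.
 */
static int alloc_slot(int cpu)
{
	int start = cpu & (NAREAS - 1);
	int i = start;

	do {
		if (areas[i].used < SLOTS_PER_AREA) {
			areas[i].used++;
			return i;  /* a later free must go back to this area */
		}
		i = (i + 1) & (NAREAS - 1);
	} while (i != start);

	return -1;                 /* every area is exhausted */
}

int main(void)
{
	for (int n = 0; n < 10; n++)
		printf("cpu 5 -> area %d\n", alloc_slot(5));
	return 0;
}

The important property is that a freed slot goes back to the area it
came from; otherwise the per-area accounting would drift.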

To avoid doing a full modulo in the main path, the number of swiotlb
areas is always rounded up to the next power of two. I believe that's
not really needed anymore on modern CPUs (which have fast enough
dividers), but it is still a good idea on older parts.
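
As a minimal demonstration of that rounding trick (round_up_pow2() here
is a stand-in for the kernel's roundup_pow_of_two(); the snippet is
illustrative, not the patch's code):

#include <stdio.h>

static unsigned int round_up_pow2(unsigned int x)
{
	unsigned int p = 1;

	while (p < x)
		p <<= 1;
	return p;
}

int main(void)
{
	unsigned int nareas = round_up_pow2(6); /* e.g. 6 CPUs -> 8 areas */
	unsigned int cpu = 13;

	/* The two are equivalent only because nareas is a power of two. */
	printf("%u %u\n", cpu % nareas, cpu & (nareas - 1));
	return 0;
}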

The number of areas can be set using the swiotlb option. But to avoid
every user having to set this option, the default is the number of
available CPUs. Unfortunately, on x86 swiotlb is initialized before
num_possible_cpus() is available, so a custom hook called from the
early ACPI code is used instead.
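
A hedged sketch of how the second integer can be pulled out of the
option string, in the style of setup_io_tlb_npages() in
kernel/dma/swiotlb.c (parse_swiotlb() and the variable names are
illustrative only):

#include <stdio.h>
#include <stdlib.h>

static unsigned long nslabs, nareas;

/*
 * Parse "swiotlb=<slabs>[,<areas>]"; the force/noforce keywords are
 * omitted for brevity.
 */
static void parse_swiotlb(const char *str)
{
	char *end;

	nslabs = strtoul(str, &end, 0);
	if (*end == ',')
		nareas = strtoul(end + 1, NULL, 0); /* must be a power of two */
}

int main(void)
{
	parse_swiotlb("262144,8");
	printf("slabs=%lu areas=%lu\n", nslabs, nareas);
	return 0;
}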

Signed-off-by: Andi Kleen 
[ rebase and fix warnings of checkpatch.pl ]
Signed-off-by: Chao Gao 
---
 .../admin-guide/kernel-parameters.txt |   4 +-
 arch/x86/kernel/acpi/boot.c   |   4 +
 include/linux/swiotlb.h   |  30 +++-
 kernel/dma/swiotlb.c  | 168 --
 4 files changed, 180 insertions(+), 26 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 8090130b544b..5d46271982d5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5869,8 +5869,10 @@
			it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
 
swiotlb=[ARM,IA-64,PPC,MIPS,X86]
-			Format: { <int> | force | noforce }
+			Format: { <int> [,<int>] | force | noforce }
			<int> -- Number of I/O TLB slabs
+			<int> -- Second integer after comma. Number of swiotlb
+				 areas with their own lock. Must be power of 2.
force -- force using of bounce buffers even if they
 wouldn't be automatically used by the kernel
noforce -- Never use bounce buffers (for debugging)
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 907cc98b1938..f05e55b2d50c 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include <linux/swiotlb.h>
 
 #include 
 #include 
@@ -1131,6 +1132,9 @@ static int __init acpi_parse_madt_lapic_entries(void)
return count;
}
 
+   /* This does not take overrides into consideration */
+   swiotlb_hint_cpus(max(count, x2count));
+
x2count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_X2APIC_NMI,
acpi_parse_x2apic_nmi, 0);
count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC_NMI,
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 012a6fde873b..cf36c5cc7584 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -69,7 +69,25 @@ struct io_tlb_slot {
 };
 
 /**
- * struct io_tlb_mem - IO TLB Memory Pool Descriptor
+ * struct io_tlb_area - IO TLB memory area descriptor
+ *
+ * This is a single area with a single lock.
+ *
+ * @used:  The number of used IO TLB blocks.
+ * @free_slots: The free list describing the number of free entries
+ * available from each index.
+ * @lock:  The lock to protect the above data structures in the map and
+ * unmap calls.
+ */
+
+struct io_tlb_area {
+   unsigned long used;
+   struct list_head free_slots;
+   spinlock_t lock;
+};
+
+/**
+ * struct io_tlb_mem - io tlb memory pool descriptor
  *
  * @start: The start address of the swiotlb memory pool. Used to do a quick
  * range check to see if the memory was in fact allocated by this
@@ -87,8 +105,6 @@ struct io_tlb_slot {
  * @index: The index to start searching in the next round.
  * @orig_addr: The original address corresponding to a mapped entry.
 * @alloc_size: Size of the allocated buffer.
- * @lock:  The lock to protect the above data structures in the map and
- * unmap calls.
  * @debugfs:   The dentry to debugfs.
  * @late_alloc:%true if allocated using the page allocator
  * @force_bounce: %true if swiotlb bouncing is forced
@@ -101,13 +117,11 @@ struct io_tlb_mem {
phys_addr_t end;
void *vaddr;
unsigned long nslabs;
-   unsigned long used;
-   struct list_head free_slots;
-