Re: [PATCH v6 0/9] Allow dynamic allocation of software IO TLB bounce buffers

2023-08-01 Thread Petr Tesarik
On 7/31/2023 9:46 PM, Petr Tesařík wrote:
> On Mon, 31 Jul 2023 18:04:09 +0200,
> Christoph Hellwig wrote:
> 
>> I was just going to apply this, but patch 1 seems to have a non-trivial
>> conflict with the is_swiotlb_active removal in pci-dma.c.  Can you resend
>> against the current dma-mapping for-next tree?
> 
> Sure thing, will re-send tomorrow morning.

After commit f9a38ea5172a ("x86: always initialize xen-swiotlb when
xen-pcifront is enabling") removed that call to swiotlb_init_late(),
there is nothing to patch, and the hunk can be dropped.

I have just sent v7.

Petr T




Re: [PATCH v6 0/9] Allow dynamic allocation of software IO TLB bounce buffers

2023-07-31 Thread Petr Tesařík
On Mon, 31 Jul 2023 18:04:09 +0200,
Christoph Hellwig wrote:

> I was just going to apply this, but patch 1 seems to have a non-trivial
> conflict with the is_swiotlb_active removal in pci-dma.c.  Can you resend
> against the current dma-mapping for-next tree?

Sure thing, will re-send tomorrow morning.

Petr T



Re: [PATCH v6 0/9] Allow dynamic allocation of software IO TLB bounce buffers

2023-07-31 Thread Christoph Hellwig
I was just going to apply this, but patch 1 seems to have a non-trivial
conflict with the is_swiotlb_active removal in pci-dma.c.  Can you resend
against the current dma-mapping for-next tree?



[PATCH v6 0/9] Allow dynamic allocation of software IO TLB bounce buffers

2023-07-27 Thread Petr Tesarik
From: Petr Tesarik 

Motivation
==========

The software IO TLB was designed with these assumptions:

1) It would not be used much. Small systems (little RAM) don't need it, and
   big systems (lots of RAM) would have modern DMA controllers and an IOMMU
   chip to handle legacy devices.
2) A small fixed memory area (64 MiB by default) is sufficient to
   handle the few cases which require a bounce buffer.
3) 64 MiB is small enough that it has no impact on the rest of the
   system.
4) Bounce buffers require large contiguous chunks of low memory. Such
   memory is precious and can be allocated only early at boot.

It turns out that these assumptions do not always hold:

1) Embedded systems may have more than 4 GiB of RAM but no IOMMU, only
   legacy 32-bit peripheral buses and/or DMA controllers.
2) Confidential computing (CoCo) VMs use bounce buffers for all I/O but may
   need substantially more than 64 MiB.
3) Embedded developers put as many features as possible into the available
   memory. A few dozen "missing" megabytes may limit what features can be
   implemented.
4) If CMA is available, it can allocate large contiguous chunks even after
   the system has run for some time.

Goals
=====

The goal of this work is to start with a small software IO TLB at boot and
expand it later when/if needed.

Design
==

This version of the patch series retains the current slot allocation
algorithm with multiple areas to reduce lock contention, but additional
slots can be added when necessary.
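
For reference, here is a minimal sketch of the multi-area idea, under the
assumption that the number of areas is a power of two (the type and function
names are illustrative, not taken from the patches)::

  #include <linux/smp.h>
  #include <linux/spinlock.h>

  /* Each pool is split into several areas, each with its own lock and its
   * own share of slots.  Requests from different CPUs usually start their
   * search in different areas, which keeps lock contention low.
   */
  struct io_tlb_area_sketch {
          spinlock_t lock;        /* protects this area's slots */
          unsigned long used;     /* slots currently in use here */
  };

  /* Pick the area where the current CPU starts searching for free slots. */
  static unsigned int sketch_start_area(unsigned int nareas)
  {
          return raw_smp_processor_id() & (nareas - 1);
  }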

These alternatives have been considered:

- Allocate and free buffers as needed using the direct DMA API. This works
  quite well, except in CoCo VMs where each allocation/free requires
  decrypting/encrypting memory, which is a very expensive operation.

- Allocate a very large software IO TLB at boot, but allow migrating pages
  to/from it (like CMA does). For systems with CMA, this would mean two big
  allocations at boot. Finding the right balance between CMA, SWIOTLB and
  the rest of the available RAM can be challenging. More importantly, there
  is no clear benefit compared to allocating SWIOTLB memory pools from CMA.

Implementation Constraints
==========================

These constraints have been taken into account:

1) Minimize impact on devices which do not benefit from the change.
2) Minimize the number of memory decryption/encryption operations.
3) Avoid contention on a lock or atomic variable to preserve parallel
   scalability.

Additionally, the software IO TLB code is used to implement restricted
DMA pools. These pools are restricted to a pre-defined physical memory
region and must not use any other memory. In other words, dynamic
allocation of memory pools must be disabled for restricted DMA pools.

Data Structures
===============

The existing struct io_tlb_mem is the central type for a SWIOTLB allocator,
but it now contains multiple memory pools::

  io_tlb_mem
  +---------+   io_tlb_pool
  | SWIOTLB |   +-------+   +-------+   +-------+
  |allocator|-->|default|-->|dynamic|-->|dynamic|-->...
  |         |   |memory |   |memory |   |memory |
  +---------+   | pool  |   | pool  |   | pool  |
                +-------+   +-------+   +-------+

The allocator structure contains global state (such as flags and counters)
and structures needed to schedule new allocations. Each memory pool
contains the actual buffer slots and metadata. The first memory pool in the
list is the default memory pool allocated statically at early boot.
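
As a rough orientation, the relationship can be sketched in C along these
lines (the type and field names are simplified stand-ins, not the exact
definitions from the patches)::

  #include <linux/list.h>
  #include <linux/spinlock.h>
  #include <linux/types.h>
  #include <linux/workqueue.h>

  /* One memory pool: the actual bounce-buffer slots plus their metadata. */
  struct io_tlb_pool_sketch {
          phys_addr_t start;        /* physical address of the first slot */
          unsigned long nslabs;     /* number of IO TLB slots in this pool */
          struct list_head node;    /* linkage in the allocator's pool list */
          bool transient;           /* freed when its last buffer is unmapped */
          /* ... per-area locks, slot metadata, ... */
  };

  /* The allocator: global state plus the list of memory pools. */
  struct io_tlb_mem_sketch {
          struct io_tlb_pool_sketch defpool; /* static pool from early boot */
          struct list_head pools;            /* default + dynamic pools */
          spinlock_t lock;                   /* protects the pool list */
          struct work_struct dyn_alloc;      /* defers new pool allocation */
          bool can_grow;                     /* false for restricted DMA pools */
  };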

New memory pools are allocated from a kernel worker thread. That's because
bounce buffers are allocated when mapping a DMA buffer, which may happen in
interrupt context where large atomic allocations would probably fail.
Allocation from process context is much more likely to succeed, especially
if it can use CMA.
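
A minimal sketch of this deferral, reusing the illustrative types sketched
above (the helper names are assumptions made up for this sketch, not the
functions added by the series)::

  /* Called from the mapping path when the existing pools are full.  This
   * may run in interrupt context, so only kick the worker here; the large
   * (possibly CMA-backed) allocation happens later in process context.
   */
  static void sketch_request_growth(struct io_tlb_mem_sketch *mem)
  {
          if (mem->can_grow)
                  schedule_work(&mem->dyn_alloc);
  }

  /* Worker body: runs in process context, where large allocations and CMA
   * are available, and links the new pool into the allocator's list.
   */
  static void sketch_grow_worker(struct work_struct *work)
  {
          struct io_tlb_mem_sketch *mem =
                  container_of(work, struct io_tlb_mem_sketch, dyn_alloc);
          struct io_tlb_pool_sketch *pool;
          unsigned long flags;

          pool = sketch_alloc_pool(GFP_KERNEL);   /* hypothetical helper */
          if (!pool)
                  return;

          spin_lock_irqsave(&mem->lock, flags);
          list_add_tail(&pool->node, &mem->pools);
          spin_unlock_irqrestore(&mem->lock, flags);
  }

In such a scheme the work item would be set up once at initialization time,
e.g. with INIT_WORK(&mem->dyn_alloc, sketch_grow_worker).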

Nonetheless, the onset of a load spike may fill up the SWIOTLB before the
worker has a chance to run. In that case, the mapping code tries to allocate
a small transient memory pool to accommodate the request. If memory is
encrypted and the device cannot do DMA to encrypted memory, this buffer is
allocated from the
coherent atomic DMA memory pool. Reducing the size of SWIOTLB may therefore
require increasing the size of the coherent pool with the "coherent_pool"
command-line parameter.
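
The fallback could be pictured roughly like this (again only an illustrative
sketch; the two helpers are made-up names standing in for the real
allocation paths)::

  /* Map-time fallback when no pool has a free slot and the worker has not
   * yet had a chance to add one.  Must not sleep.
   */
  static struct io_tlb_pool_sketch *
  sketch_alloc_transient(struct device *dev, size_t size)
  {
          struct io_tlb_pool_sketch *pool;

          if (force_dma_unencrypted(dev)) {
                  /* The device cannot reach encrypted memory: take the
                   * buffer from the coherent atomic DMA pool, which is
                   * already set up for unencrypted DMA.
                   */
                  pool = sketch_pool_from_atomic_pool(dev, size);
          } else {
                  /* Otherwise a plain atomic page allocation is enough. */
                  pool = sketch_pool_from_pages(size, GFP_ATOMIC);
          }

          if (pool)
                  pool->transient = true; /* freed after its last unmap */
          return pool;
  }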

Performance
===========

All testing compared a vanilla v6.4-rc6 kernel with a fully patched
kernel. The kernel was booted with "swiotlb=force" to allow stress-testing
the software IO TLB on a high-performance device that would otherwise not
need it. CONFIG_DEBUG_FS was set to 'y' to match the configuration of
popular distribution kernels; it is understood that parallel workloads
suffer from contention on the recently added debugfs atomic counters.

These benchmarks were run:

- small: single-threaded I/O of 4 KiB blocks,
- big: single-threaded I/O of 64 KiB blocks,
- 4way: 4-way parallel I/O of 4 KiB blocks.

In all tested cases, the default 64 MiB SWIOTLB would be sufficient (but
wasteful). The "default" pair of columns shows performance impact when
booted