** Description changed:
+ SRU Justification:
+
+ [Impact]
+
+ * With the introduction of c76c067e488c "s390/pci: Use dma-iommu layer"
+ (upstream since kernel v6.7-rc1), s390x (and only s390x) moved to a
+ different dma-iommu implementation.
+
+ * And with 92bce97f0c34 "s390/pci: Fix reset of IOMMU software counters"
+ (again upstream since v6.7-rc1), the IOMMU_DEFAULT_DMA_LAZY kernel config
+ option should now be set to 'yes' by default for s390x.
+
+ * Since CONFIG_IOMMU_DEFAULT_DMA_STRICT and IOMMU_DEFAULT_DMA_LAZY
+ are related to each other, CONFIG_IOMMU_DEFAULT_DMA_STRICT needs to be
+ set to "no" by default, which was done upstream by b2b97a62f055
+ "Revert "s390: update defconfigs"".
+
+ * These changes are all upstream, but were not picked up by the Ubuntu
+ kernel config.
+
+ * And not having these config options set properly is causing significant
+ PCI-related network throughput degradation (up to -72%).
+
+ * This shows up for almost all workloads and numbers of connections,
+ and deteriorates as the number of connections increases.
+
+ * The drop is especially drastic for a high number of parallel connections
+ (50 and 250) and for small and medium-size transactional workloads.
+ However, the degradation is also clearly visible for streaming-type
+ workloads (up to 48%).
+
+ [Fix]
+
+ * The (upstream) fix is to set
+ IOMMU_DEFAULT_DMA_STRICT=no
+ and
+ IOMMU_DEFAULT_DMA_LAZY=y
+ (which is needed for the changed DMA IOMMU implementation since v6.7).
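+
+ * With the fix applied, the two options can be quickly verified on an
+ installed kernel (a sketch, assuming the usual Ubuntu /boot/config-<version>
+ layout; the expected output is shown below the command):
+
+ $ grep -E 'CONFIG_IOMMU_DEFAULT_DMA_(STRICT|LAZY)' /boot/config-$(uname -r)
+ # CONFIG_IOMMU_DEFAULT_DMA_STRICT is not set
+ CONFIG_IOMMU_DEFAULT_DMA_LAZY=y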
+
+ [Test Case]
+
+ * Set up two Ubuntu Server 24.04 LPARs (with kernel 6.8),
+ one acting as server and one as client,
+ that have (PCIe-attached) RoCE Express devices
+ and that are connected to each other.
+
+ * Sample workload rr1c-200x1000-250 with rr1c-200x1000-250.xml:
+ <?xml version="1.0"?>
+ <profile name="TCP_RR">
+ <group nprocs="250">
+ <transaction iterations="1">
+ <flowop type="connect" options="remotehost=<remote IP> protocol=tcp tcp_nodelay" />
+ </transaction>
+ <transaction duration="300">
+ <flowop type="write" options="size=200"/>
+ <flowop type="read" options="size=1000"/>
+ </transaction>
+ <transaction iterations="1">
+ <flowop type="disconnect" />
+ </transaction>
+ </group>
+ </profile>
+
+ * Install uperf on both systems, client and server.
+
+ * Start uperf at server: uperf -s
+
+ * Start uperf at client: uperf -vai 5 -m uperf-profile.xml
+
+ * Switch from strict to lazy mode,
+ either by using the new kernel (or the test build below)
+ or by booting with the kernel command-line parameter iommu.strict=0
+ (see the sketch below).
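+
+ A minimal sketch of the command-line route (assuming the kernel parameters
+ of these s390x LPARs are maintained in /etc/zipl.conf; on GRUB-based setups,
+ GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub plus update-grub is the
+ equivalent):
+
+ # append iommu.strict=0 to the parameters= line, then rewrite the boot record
+ $ sudoedit /etc/zipl.conf
+ $ sudo zipl
+ $ sudo reboot
+ # after the reboot, confirm the parameter is active
+ $ grep -o 'iommu.strict=0' /proc/cmdline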
+
+ * Restart uperf on server and client, like before.
+
+ * Verification will be performed by IBM.
+
+ [Regression Potential]
+
+ * There is a certain regression potential, since the behavior with
+ the two modified kernel config options will change significantly.
+
+ * This may solve the (network) throughput issue with PCI devices,
+ but may also come with side effects on other PCIe-based devices
+ (the old compression adapters or the new NVMe carrier cards).
+
+ [Other]
+
+ * CCW devices are not affected.
+
+ * This is s390x-specific, hence it will not affect any other
+ architecture.
+
+ __________
+
Symptom:
Comparing Ubuntu 24.04 (kernel version 6.8.0-31-generic) against Ubuntu
22.04, all of our PCI-related network measurements on LPAR show massive
throughput degradations (up to -72%). This shows up for almost all workloads
and numbers of connections, deteriorating as the number of connections
increases. Especially drastic is the drop for a high number of parallel
connections (50 and 250) and for small and medium-size transactional workloads.
However, the degradation is also clearly visible for streaming-type workloads
(up to 48% degradation).
Problem:
With the kernel config setting CONFIG_IOMMU_DEFAULT_DMA_STRICT=y, the IOMMU DMA
mode changed from lazy to strict, causing these massive degradations.
The behavior can also be changed with a kernel command-line parameter
(iommu.strict) for easy verification.
The issue is known and was quickly fixed upstream in December 2023, after
being present for a little less than two weeks.
Upstream fix:
https://github.com/torvalds/linux/commit/b2b97a62f055dd638f7f02087331a8380d8f139a
Repro:
rr1c-200x1000-250 with rr1c-200x1000-250.xml:
<?xml version="1.0"?>
<profile name="TCP_RR">
- <group nprocs="250">
- <transaction iterations="1">
- <flowop type="connect" options="remotehost=<remote IP> protocol=tcp tcp_nodelay" />
- </transaction>
- <transaction duration="300">
- <flowop type="write" options="size=200"/>
- <flowop type="read" options="size=1000"/>
- </transaction>
- <transaction iterations="1">
- <flowop type="disconnect" />
- </transaction>
- </group>
+ <group nprocs="250">
+ <transaction iterations="1">
+ <flowop type="connect" options="remotehost=<remote IP> protocol=tcp tcp_nodelay" />
+ </transaction>
+ <transaction duration="300">
+ <flowop type="write" options="size=200"/>
+ <flowop type="read" options="size=1000"/>
+ </transaction>
+ <transaction iterations="1">
+ <flowop type="disconnect" />
+ </transaction>
+ </group>
</profile>
0) Install uperf on both systems, client and server.
1) Start uperf at server: uperf -s
2) Start uperf at client: uperf -vai 5 -m uperf-profile.xml
3) Switch from strict to lazy mode using the kernel command-line parameter
iommu.strict=0.
4) Repeat steps 1) and 2).
Example:
For the following example, we chose the workload named above
(rr1c-200x1000-250):
iommu.strict=1 (strict): 233464.914 TPS
iommu.strict=0 (lazy): 835123.193 TPS
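To double-check which mode is actually active for a given run, the default
domain type exposed per IOMMU group in sysfs can be inspected (assuming the
running kernel provides /sys/kernel/iommu_groups/<N>/type, as recent kernels
do); "DMA" corresponds to strict and "DMA-FQ" to lazy mode:
$ cat /sys/kernel/iommu_groups/*/type
DMA-FQ
(one line per IOMMU group; DMA-FQ is expected after switching to lazy mode)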
** Changed in: ubuntu-z-systems
Status: New => In Progress
** Changed in: linux (Ubuntu)
Status: New => In Progress
--
https://bugs.launchpad.net/bugs/2071471
Title:
[UBUNTU 24.04] IOMMU DMA mode changed in kernel config causes massive
throughput degradation for PCI-related network workloads