Public bug reported:
1)
=== System Metadata ===
OS Release:
Description: Ubuntu 26.04 LTS
Release: 26.04
Kernel Version: 7.0.0-22-generic
CPU Model: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
Loaded iscsi modules: iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
Network controller: 5e:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
5e:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
2)
# apt-cache policy linux-image-generic
linux-image-generic:
Installed: 7.0.0-22.22
3)
Expected same or better performance over iscsi LUN volume mounts.
4)
Performance is 40% to 50% worse than the baseline test on Ubuntu 22.04.5 LTS
with kernel 5.15.
# Kernel Bug Report: iscsi_tcp Sequential Throughput Regression (5.15 → 6.8 / 7.0)
## Summary
A significant sequential I/O throughput regression has been identified
in the `iscsi_tcp` kernel module when comparing Ubuntu kernel 5.15
(Ubuntu 22.04) against kernels 6.8 (Ubuntu 24.04) and 7.0 (Ubuntu
26.04). Sequential read throughput drops by 33-52%, depending on kernel
and block size, under identical hardware, network, and storage backend
conditions. The externally tunable parameters listed under Eliminated
Causes were tested and ruled out as the cause.
## Affected Versions
| Distribution | Kernel Version | Status |
|-------------|---------------|--------|
| Ubuntu 22.04 LTS | 5.15.0-143-generic | Working (baseline) |
| Ubuntu 24.04 LTS | 6.8.0-88-generic | **Regressed** |
| Ubuntu 26.04 LTS | 7.0.0-22-generic | **Regressed** |
## Hardware Configuration (identical across all hosts)
- **CPU**: 96 cores (dual-socket Intel/AMD server)
- **Memory**: 750 GiB
- **NIC**: Mellanox ConnectX (25 GbE, dual-port, LACP bond, mlx5 driver)
- **Storage Backend**: NetApp ONTAP SAN (iSCSI, ontap-san-economy driver via
Trident CSI 25.06)
- **Network**: Jumbo frames (MTU 9000), VLAN-tagged storage network
- **Multipath**: dm-multipath with ALUA, service-time path selector, 2 paths
per LUN
## Test Environment
- **Volume**: 100 GiB iSCSI LUN provisioned via Trident CSI (PVC with
`Filesystem` volumeMode)
- **Test Pod**: Kubernetes pod with fio 3.6, volume mounted at `/mnt/data`
- **fio Parameters**: `--direct=1 --ioengine=libaio --iodepth=64 --numjobs=1
--runtime=30 --time_based --group_reporting --directory=/mnt/data`
- **Block Sizes Tested**: 128K, 256K, 1M
## Results
### Sequential Read Throughput (256K block size)
| Kernel | Bandwidth | IOPS | Avg Latency |
|--------|-----------|------|-------------|
| 5.15.0-143 | **1542 MiB/s** | 6168 | 10.2 ms |
| 6.8.0-88 | **740 MiB/s** | 2958 | 21.3 ms |
| 7.0.0-22 | **1034 MiB/s** | 4136 | 15.4 ms |
### Sequential Read Throughput (1M block size)
| Kernel | Bandwidth | IOPS | Avg Latency |
|--------|-----------|------|-------------|
| 5.15.0-143 | **1444 MiB/s** | 1444 | 43.6 ms |
| 6.8.0-88 | **815 MiB/s** | 814 | 77.3 ms |
| 7.0.0-22 | **851 MiB/s** | 851 | 73.9 ms |
### Regression Magnitude
- **Kernel 6.8 vs 5.15**: 44-52% throughput reduction (sequential reads)
- **Kernel 7.0 vs 5.15**: 33-41% throughput reduction (sequential reads)
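
The ranges above follow directly from the bandwidth tables. A minimal sketch of the arithmetic, assuming only `awk` is available (the helper name `pct_drop` is illustrative and not part of any tool):

```bash
#!/usr/bin/env bash
# pct_drop BASELINE MEASURED
# Prints how far MEASURED falls below BASELINE, in percent (one decimal).
# Hypothetical helper name, used here only to make the arithmetic reproducible.
pct_drop() {
  awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.1f\n", (base - cur) / base * 100 }'
}

# Bandwidth pairs (MiB/s) taken from the 256K and 1M tables above:
# 5.15 vs 6.8 and 5.15 vs 7.0.
for pair in "1542 740" "1444 815" "1542 1034" "1444 851"; do
  set -- $pair
  echo "$1 -> $2 MiB/s: $(pct_drop "$1" "$2")% lower"
done
```

The four values come out at 52.0, 43.6, 32.9 and 41.1 percent, which is where the 44-52% and 33-41% ranges come from.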
## iSCSI / SCSI Parameters (verified identical across all kernels)
| Parameter | Value |
|-----------|-------|
| `can_queue` (scsi_host) | 113 |
| `cmd_per_lun` (scsi_host) | 32 |
| `sg_tablesize` (scsi_host) | 4096 |
| `queue_depth` (per LUN) | 32 |
| `max_hw_sectors_kb` | 32767 |
| iSCSI `MaxRecvDataSegmentLength` | 262144 |
| iSCSI `FirstBurstLength` | 65536 (negotiated by target) |
| iSCSI `MaxBurstLength` | 1048576 |
| iSCSI `MaxOutstandingR2T` | 1 |
| iSCSI `ImmediateData` | Yes |
| iSCSI `InitialR2T` | Yes (negotiated by target) |
| TCP congestion control | BBR |
| MTU | 9000 |
| TCP rmem_max / wmem_max | 134217728 |
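
To make the "verified identical" claim easy to re-check on another host, the block-layer and SCSI host values can be dumped to a text file per kernel and compared with `diff`. The following is a sketch only: the function name `snapshot_queue_params`, the host name `host3`, and the optional sysfs root argument are assumptions for illustration; the sysfs attribute names themselves are the standard ones.

```bash
#!/usr/bin/env bash
# snapshot_queue_params DEV [SCSI_HOST] [SYSFS_ROOT]
# Prints key=value lines for the queue and SCSI host attributes of DEV.
# SYSFS_ROOT defaults to /sys; it is a parameter only so the function can be
# pointed at a copied or fake tree. Hypothetical helper, not an existing tool.
snapshot_queue_params() {
  local dev="$1" host="${2:-}" root="${3:-/sys}" f
  for f in "block/$dev/device/queue_depth" \
           "block/$dev/queue/max_hw_sectors_kb" \
           "block/$dev/queue/max_sectors_kb" \
           "block/$dev/queue/nr_requests" \
           "block/$dev/queue/scheduler"; do
    [ -r "$root/$f" ] && printf '%s=%s\n' "$f" "$(cat "$root/$f")"
  done
  if [ -n "$host" ]; then
    for f in can_queue cmd_per_lun sg_tablesize; do
      [ -r "$root/class/scsi_host/$host/$f" ] &&
        printf 'scsi_host/%s/%s=%s\n' "$host" "$f" "$(cat "$root/class/scsi_host/$host/$f")"
    done
  fi
  return 0
}

# Example (run on each host, then diff the two files):
#   snapshot_queue_params sdb host3 > "params-$(uname -r).txt"
```

The iSCSI session parameters in the table (for example `MaxRecvDataSegmentLength`) are negotiated per session and are not covered by this sketch.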
## Eliminated Causes
The following parameters/settings were systematically tuned on the 7.0
kernel with no measurable impact on throughput:
| Tuning Attempted | Result |
|-----------------|--------|
| Disable WBT (`wbt_lat_usec=0`) | No change |
| Increase read-ahead (`read_ahead_kb=16384`) | No change (expected: direct IO) |
| IO scheduler: `none` (passthrough) | No change |
| IO scheduler: `mq-deadline` (match 5.15 default) | No change |
| Reduce `max_sectors_kb` to 64 (match 5.15 value) | No change |
| Increase `nr_requests` to 512 | No change |
| Enable `recv_from_iscsi_q=Y` (kernel 7.0 parameter) | No change |
| Increase `netdev_budget` (1200→4800) and `netdev_budget_usecs` (8000→32000) | No change |
| Renice `iscsid` process to -20 | No change |
| Enable RPS on storage VLAN interface (`rps_cpus=ffffffff`) | No change |
| Enable RFS (`rps_sock_flow_entries=32768`) | No change |
| Enable `quickack` on iSCSI storage routes | No change |
| Set `tcp_low_latency=1` | No change |
| Increase `gro_max_size` / `gso_max_size` | Failed (not supported on VLAN interface) |
| Multiple fio jobs (numjobs=2) | No change / slightly worse |
## Analysis
### Per-IO Latency Comparison
With `queue_depth=32` and `iodepth=64` (saturating the device queue),
throughput follows Little's law:
```
IOPS = queue_depth / avg_latency_per_IO
throughput = IOPS × block_size
```
For 256K sequential reads:
- **Kernel 5.15**: 32 / 0.0052s = ~6150 IOPS → 1538 MiB/s (matches observed)
- **Kernel 6.8**: 32 / 0.0108s = ~2962 IOPS → 741 MiB/s (matches observed)
- **Kernel 7.0**: 32 / 0.0077s = ~4156 IOPS → 1039 MiB/s (matches observed)
The per-IO latency for the same 256K read operation is:
- **5.15**: ~5.2 ms average
- **6.8**: ~10.8 ms average (2.1x higher)
- **7.0**: ~7.7 ms average (1.5x higher)
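
The queue-depth relation used in this section can be checked with a few lines of shell. This is a sketch under the report's own assumptions (device queue depth of 32, 256 KiB blocks); the function names are illustrative.

```bash
#!/usr/bin/env bash
# iops_for QUEUE_DEPTH LATENCY_MS
# IOPS sustained when QUEUE_DEPTH requests are always in flight and each
# takes LATENCY_MS on average (Little's law). Hypothetical helper name.
iops_for() {
  awk -v qd="$1" -v lat="$2" 'BEGIN { printf "%.0f\n", qd / (lat / 1000) }'
}

# mibps_for QUEUE_DEPTH LATENCY_MS BLOCK_KIB
# Same relation, converted to MiB/s for a given block size in KiB.
mibps_for() {
  awk -v qd="$1" -v lat="$2" -v bs="$3" \
    'BEGIN { printf "%.0f\n", qd / (lat / 1000) * bs / 1024 }'
}

# Per-IO latencies (ms) quoted above for kernels 5.15, 6.8 and 7.0.
for lat in 5.2 10.8 7.7; do
  echo "latency ${lat} ms -> $(iops_for 32 "$lat") IOPS, $(mibps_for 32 "$lat" 256) MiB/s"
done
```

The results (about 1538, 741 and 1039 MiB/s) line up with the measured 1542, 740 and 1034 MiB/s, which is why per-IO latency is treated as the governing quantity.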
### TCP Connection Health (not the bottleneck)
TCP socket statistics captured during testing confirm the network path
is not limiting:
- All connections show healthy `cwnd`, full-speed `delivery_rate`, and
sub-0.2ms RTT
- The 25 GbE NIC is operating well below capacity (~6-12 Gbps observed vs 25
Gbps available)
- No retransmissions or congestion events during testing
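
For anyone repeating the socket check, the relevant fields can be pulled out of `ss -tin` output with a small filter. This is a sketch, assuming the usual `ss` info-line layout (`rtt:<avg>/<var>`, `cwnd:<n>`, `delivery_rate <value>`, and `retrans:<cur>/<total>` only when retransmits occurred); the function name is illustrative.

```bash
#!/usr/bin/env bash
# parse_ss_tcp_info
# Reads `ss -tin` output on stdin and prints one summary line per socket
# info line. Hypothetical helper; field layout is assumed as described above.
parse_ss_tcp_info() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^rtt:/)          { split(substr($i, 5), a, "/"); rtt = a[1] }
      if ($i ~ /^cwnd:/)         { cwnd = substr($i, 6) }
      if ($i ~ /^retrans:/)      { split(substr($i, 9), b, "/"); retrans = b[2] }
      if ($i == "delivery_rate") { rate = $(i + 1) }
    }
    if (rtt != "") {
      printf "rtt_ms=%s cwnd=%s delivery_rate=%s retrans_total=%s\n",
             rtt, cwnd, rate, (retrans == "" ? 0 : retrans)
      rtt = cwnd = rate = retrans = ""
    }
  }'
}

# Example, filtering on the iSCSI port used in this report:
#   ss -tin '( dport = :3260 )' | parse_ss_tcp_info
```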
### CPU Utilization (not the bottleneck)
- Softirq CPU usage remains low during testing
- RPS/multi-queue distribution does not improve throughput
- The `iscsi_tcp` workqueue threads run at nice -20 (highest priority)
- Switching between softirq and workqueue processing (`recv_from_iscsi_q`) has
no effect
### Conclusion
The regression is internal to the `iscsi_tcp` / `libiscsi` kernel module
data path. The per-SCSI-command processing latency is 50-110% higher on
kernels 6.8 and 7.0 compared to 5.15, for identical iSCSI PDU sizes,
network conditions, and queue depths. This suggests changes in the iSCSI
receive/transmit path, SCSI mid-layer command completion, or interaction
with the block layer's multi-queue infrastructure introduced between
5.15 and 6.8 are adding overhead per I/O operation.
## Steps to Reproduce
### Prerequisites
- Two bare-metal hosts: one running kernel 5.15 (Ubuntu 22.04), one running
kernel 6.8+ (Ubuntu 24.04 or 26.04)
- iSCSI target (e.g., NetApp ONTAP, LIO, or targetcli) accessible via 10GbE+
network with jumbo frames
- `open-iscsi` package installed with default configuration
- `dm-multipath` configured (or single-path is sufficient to reproduce)
- Kubernetes with Trident CSI is NOT required; direct iSCSI LUN attachment
reproduces the issue
### Reproduction Steps
1. **Provision a 100 GiB iSCSI LUN** on the target and present it to
both hosts.
2. **Discover and login** on both hosts:
```bash
iscsiadm -m discovery -t sendtargets -p <target_ip>:3260
iscsiadm -m node --login
```
3. **Identify the device**:
```bash
# For multipath:
multipath -ll
# Note the dm-X device
# For single path:
lsblk --scsi
```
4. **Create a filesystem and mount**:
```bash
mkfs.ext4 /dev/dm-0 # or /dev/sdX for single path
mkdir /mnt/iscsi-test
mount /dev/dm-0 /mnt/iscsi-test
```
5. **Run fio benchmark** (identical on both hosts):
```bash
# 256K Sequential Read
fio --name=seq-read-256k \
--rw=read \
--bs=256k \
--size=1G \
--numjobs=1 \
--iodepth=64 \
--direct=1 \
--ioengine=libaio \
--runtime=30 \
--time_based \
--group_reporting \
--directory=/mnt/iscsi-test
# 1M Sequential Read
fio --name=seq-read-1M \
--rw=read \
--bs=1M \
--size=1G \
--numjobs=1 \
--iodepth=64 \
--direct=1 \
--ioengine=libaio \
--runtime=30 \
--time_based \
--group_reporting \
--directory=/mnt/iscsi-test
```
6. **Compare results**: The host running kernel 6.8+ will show 40-50%
lower sequential read bandwidth compared to the host running kernel
5.15.
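
To compare runs without reading the full fio report, the aggregate read bandwidth can be extracted from the saved output. This sketch assumes fio's human-readable run-status line has the form `READ: bw=<value> ...`, as fio 3.x prints it; the function name and log file name are illustrative.

```bash
#!/usr/bin/env bash
# fio_read_bw
# Reads fio's normal (non-JSON) output on stdin and prints the aggregate
# READ bandwidth token, for example "1542MiB/s". Hypothetical helper.
fio_read_bw() {
  sed -n 's/^ *READ: bw=\([^ ]*\).*/\1/p'
}

# Example: keep the full log per kernel and print only the headline number.
#   fio --name=seq-read-256k ... | tee "fio-$(uname -r).log" | fio_read_bw
```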
### Verification
Confirm the test is hitting the iSCSI device (not page cache or overlay):
```bash
# During the test, verify disk utilization:
iostat -x 1 | grep dm-0 # Should show ~99% util
# Verify the mount is on iSCSI:
lsblk -o NAME,TYPE,TRAN,SIZE,MOUNTPOINT | grep -A5 dm-0
```
## Additional Diagnostics
### 1. Transparent Huge Pages (THP)
**Default state**: `enabled=madvise`, `defrag=madvise`
| THP Setting | 256K Seq Read (kernel 7.0) | Change |
|-------------|---------------------------|--------|
| madvise (default) | 1034 MiB/s | baseline |
| never | 882 MiB/s | -15% (worse) |
**Conclusion**: THP is not contributing to the regression. Disabling it
reduces throughput by about 15%, likely due to losing THP benefits
for fio's memory allocations.
### 2. NUMA Locality Impact
**Hardware topology**:
- 96 cores: node 0 = even CPUs (0,2,4,...,94), node 1 = odd CPUs (1,3,5,...,95)
- NIC `ens3f0np0` (active iSCSI bond slave): **NUMA node 0**
- NIC `ens5f1np1` (secondary bond slave): **NUMA node 1**
**NUMA-pinned fio results** (256K sequential read, direct on
`/dev/dm-0`):
| CPU Pinning | Throughput | Avg Latency | Delta |
|-------------|-----------|-------------|-------|
| Node 0 (NIC-local) | **1148 MiB/s** | 13.9 ms | baseline |
| Node 1 (cross-socket) | **943 MiB/s** | 16.9 ms | -18% |
**Finding**: The iSCSI MSI-X interrupt (IRQ 338, `mlx5_comp62`) is
pinned to **CPU29 (NUMA node 1)** despite the NIC residing on **NUMA
node 0**. The server has two 25 GbE Mellanox NICs, one per NUMA node,
bonded via 802.3ad LACP (`xmit_hash_policy=layer2`). The storage VLAN
runs on top of this bond. Due to LACP hashing, iSCSI traffic flows
through the node-0 NIC but the receive interrupt is processed on a
node-1 CPU, adding ~3ms per-IO latency from cross-socket memory access.
However, even the optimally-pinned node-0 result (1148 MiB/s) is still
**25% below kernel 5.15** (1542 MiB/s), confirming the regression is not
solely NUMA-related.
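
The IRQ-to-NUMA mismatch described above can be detected by comparing the CPU that services the interrupt with the CPU list of the NIC's NUMA node. A minimal sketch: the helper `cpu_in_list` is hypothetical, while the sysfs and procfs paths in the commented usage are the standard locations (interface name and IRQ number are the ones from this report).

```bash
#!/usr/bin/env bash
# cpu_in_list CPU CPULIST
# Succeeds if CPU is contained in a sysfs-style CPU list such as
# "0-3,8,10-11". Hypothetical helper name.
cpu_in_list() {
  local cpu="$1" range lo hi
  local IFS=','
  for range in $2; do
    lo="${range%-*}"   # "8" stays "8"; "10-11" becomes "10"
    hi="${range#*-}"   # "8" stays "8"; "10-11" becomes "11"
    if [ "$cpu" -ge "$lo" ] && [ "$cpu" -le "$hi" ]; then
      return 0
    fi
  done
  return 1
}

# Example for the host in this report:
#   nic_node=$(cat /sys/class/net/ens3f0np0/device/numa_node)
#   node_cpus=$(cat "/sys/devices/system/node/node${nic_node}/cpulist")
#   cat /proc/irq/338/smp_affinity_list      # shows the servicing CPU(s)
#   cpu_in_list 29 "$node_cpus" || echo "IRQ 338 runs off-node"
```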
### 3. perf Profiling (Kernel CPU Time Breakdown)
Captured via `perf record -a -g -- sleep 10` during 256K sequential
reads at iodepth=64:
```
Top-level call stack (kernel 7.0.0-22-generic):
47.97% ret_from_fork_asm
└─ 47.96% kthread
   └─ 46.05% worker_thread
      └─ 44.97% process_one_work
         ├─ 42.77% iscsi_xmitworker ← TX path
         │  └─ 42.74% iscsi_data_xmit
         │     └─ 42.24% iscsi_xmit_task
         │        └─ 41.79% iscsi_tcp_task_xmit
         │           └─ 41.74% iscsi_sw_tcp_pdu_xmit
         │              └─ 41.62% iscsi_sw_tcp_xmit_segment
         │                 └─ 41.34% sock_sendmsg
         │                    └─ 41.08% tcp_sendmsg
         │                       ├─ 33.73% release_sock ← CRITICAL
         │                       │  └─ 33.57% __release_sock
         │                       │     └─ 33.42% tcp_v4_do_rcv
         │                       │        └─ 33.29% tcp_rcv_established
         │                       │           └─ 32.24% tcp_data_queue
         │                       │              └─ 31.84% tcp_data_ready
         │                       │                 └─ 31.78% iscsi_sw_tcp_data_ready
         │                       │                    ├─ 28.03% tcp_read_sock
         │                       │                    │  └─ 23.45% iscsi_sw_tcp_recv
         │                       │                    │     └─ 23.15% iscsi_tcp_recv_skb
         │                       │                    │        └─ 2.35% iscsi_tcp_segment_recv
         │                       │                    └─ 4.32% native_queued_spin_lock_slowpath ← LOCK CONTENTION
         │                       └─ (tcp_sendmsg_locked, etc.)
         └─ 1.78% blk_mq_run_work_fn
```
**Critical Findings**:
1. **TX/RX path serialization via `release_sock`**: 33.73% of total CPU
time is spent inside `release_sock()` during the transmit path. When the
xmit worker sends a SCSI command via `tcp_sendmsg`, the socket lock
release triggers processing of all queued incoming data in the **same
thread context** — this includes `iscsi_sw_tcp_data_ready` →
`iscsi_tcp_recv_skb` (23.15% of CPU).
2. **Spinlock contention**: `native_queued_spin_lock_slowpath` accounts
for **4.32%** of total CPU time — indicating measurable contention on
the socket lock between the transmit workqueue and network softirq
receive path.
3. **Single-threaded bottleneck**: The entire iSCSI IO path (TX command
+ RX data) is serialized through a single `iscsi_xmitworker` workqueue
thread. Data receive happens as a side-effect of the transmit path's
lock release, not in parallel.
### 4. Interrupt Footprint
During a 5-second sample of active 256K sequential reads:
| IRQ | Queue | CPU | Delta (5s) | Rate |
|-----|-------|-----|-----------|------|
| 338 | `mlx5_comp62@0000:5e:00.0` | CPU29 | +15,979 | 3,196/s |
| 325 | `mlx5_comp49@0000:5e:00.0` | CPU3 | +8 | ~0/s |
**Finding**: All iSCSI receive traffic is concentrated on a single MSI-X
vector (IRQ 338 → CPU29). This is expected for a single TCP flow (RSS
hashes to one queue), but confirms that the entire iSCSI data path is
single-CPU-bound. The interrupt is NOT bottlenecking on CPU0 — it's
correctly distributed via MSI-X, but still limited to one core.
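
The per-IRQ rates in the table can be reproduced from two readings of `/proc/interrupts`. The following is a sketch with illustrative function names; it sums the per-CPU count columns for one IRQ line and divides the delta by the sampling interval.

```bash
#!/usr/bin/env bash
# irq_total IRQ [FILE]
# Sums the per-CPU counters of IRQ in a /proc/interrupts-format file
# (defaults to /proc/interrupts). Hypothetical helper name.
irq_total() {
  awk -v irq="$1:" '$1 == irq {
    # Count columns are purely numeric; stop at the first descriptive field.
    for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) sum += $i
    print sum + 0
  }' "${2:-/proc/interrupts}"
}

# irq_rate BEFORE AFTER SECONDS
# Interrupts per second between two irq_total readings.
irq_rate() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.0f\n", (b - a) / s }'
}

# Example for the vector discussed above:
#   before=$(irq_total 338); sleep 5; after=$(irq_total 338)
#   irq_rate "$before" "$after" 5
```

With the delta reported above (+15,979 over 5 seconds), `irq_rate` gives 3196 per second.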
### 5. Lock Contention (lockstat)
`CONFIG_LOCK_STAT` is **not enabled** in the Ubuntu 7.0.0-22-generic
kernel. This data point is not available without a custom kernel build
with `CONFIG_LOCK_STAT=y`.
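
Whether a given kernel build has lock statistics can be checked against its shipped config file. A small sketch, assuming the Ubuntu convention of `/boot/config-<release>`; the function name is illustrative.

```bash
#!/usr/bin/env bash
# has_kconfig OPTION [CONFIG_FILE]
# Succeeds if OPTION is built in (=y) according to CONFIG_FILE, which
# defaults to the running kernel's /boot/config file. Hypothetical helper.
has_kconfig() {
  grep -qx "$1=y" "${2:-/boot/config-$(uname -r)}"
}

# Example:
#   has_kconfig CONFIG_LOCK_STAT && echo "lock_stat available via /proc/lock_stat" \
#     || echo "CONFIG_LOCK_STAT not enabled; custom kernel build needed"
```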
### 6. blktrace / iostat Latency Decomposition
Captured via `blktrace` and `iostat -xmt` on dm-0 and underlying sdb
during active 256K sequential reads:
**Raw blktrace** (10s capture on `/dev/dm-0`): Completions arrive at
30-40 µs intervals on the SCSI device, confirming fast backend response.
**iostat latency breakdown** (steady-state averages over 10 samples):
| Device | r_await (ms) | aqu-sz | Throughput |
|--------|-------------|--------|-----------|
| dm-0 (multipath) | **13.3 ms** | 243 | ~1.15 GB/s |
| sdb (SCSI/iSCSI) | **1.7 ms** | 31 | ~1.15 GB/s |
**Latency decomposition**:
| Component | Time | % of Total |
|-----------|------|-----------|
| DM/block queue wait (Q2D) | **11.6 ms** | 87% |
| SCSI device service time (D2C) | **1.7 ms** | 13% |
| **Total r_await** | **13.3 ms** | 100% |
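
The decomposition in this table is a subtraction of the two `r_await` values from iostat: the multipath device's latency minus the underlying SCSI device's latency is attributed to queue wait. A minimal sketch of that calculation (the function name is illustrative, and the Q2D/D2C labels are approximated from iostat rather than taken from blktrace timings):

```bash
#!/usr/bin/env bash
# decompose_await DM_AWAIT_MS SD_AWAIT_MS
# Splits the dm-level r_await into queue wait (dm minus sd) and device
# service time (sd), with each share of the total. Hypothetical helper.
decompose_await() {
  awk -v dm="$1" -v sd="$2" 'BEGIN {
    printf "queue_wait_ms=%.1f (%.0f%%) service_ms=%.1f (%.0f%%)\n",
           dm - sd, (dm - sd) / dm * 100, sd, sd / dm * 100
  }'
}

# Values from the iostat table above (kernel 7.0):
decompose_await 13.3 1.7
```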
**Interpretation**: The actual iSCSI network + storage backend time is
only 1.7 ms per 64KB read. The remaining 87% of per-IO latency is spent
**waiting in the device-mapper queue** because the SCSI device queue
depth is limited to 32 (`cmd_per_lun=32`). With `iodepth=64` from fio,
the DM layer queues 243 outstanding IOs but can only dispatch 32 at a
time to the underlying SCSI device.
On kernel 5.15 the same `cmd_per_lun=32` limit applies, yet throughput reaches 1542 MiB/s
at 256K, implying either:
- The DM/blk-mq queue management has higher overhead in 6.8+ (longer Q2D time
per IO)
- The SCSI device service time was lower on 5.15 (different TCP/socket handling
in `release_sock`)
- Or both, as the perf profile showing 33% of time in `release_sock` correlates
with the serialized TX/RX pattern adding per-IO overhead
**Comparative iostat from kernel 5.15 (U22) under identical test**:
| Metric | U22 (kernel 5.15) | U26 (kernel 7.0) | Ratio |
|--------|-------------------|-------------------|-------|
| IOPS (256K) | **~49,000** | ~18,200 | U22 2.7× more |
| Throughput | **~3,050 MiB/s** | ~1,150 MiB/s | U22 2.7× faster |
| r_await (dm) | **5.5 ms** | 13.3 ms | U26 2.4× slower per IO |
| aqu-sz (dm) | 268-286 | 243 | Similar queue depth |
The per-IO latency through the DM layer is **2.4× higher on kernel 7.0**
(13.3 ms vs 5.5 ms), directly explaining the throughput difference. Both
kernels saturate at 100% device utilization with similar average queue
sizes (aqu-sz of roughly 240-290), which points to per-IO processing
efficiency in the kernel iSCSI/SCSI stack, not device or network
capacity, as the bottleneck.
---
## Suggested Investigation Areas
1. **`release_sock()` in `tcp_sendmsg` processing RX data inline (kernel
7.0)**: The perf profile shows that 33.73% of time is spent in
`release_sock` → `tcp_v4_do_rcv` → `iscsi_sw_tcp_data_ready` during the
**transmit** path. This serializes TX and RX. Investigate whether kernel
5.15 handled the receive callback differently (e.g., via softirq/tasklet
rather than inline in `release_sock`).
2. **Socket lock contention in `iscsi_sw_tcp_data_ready`**: The 4.32%
`native_queued_spin_lock_slowpath` indicates the socket lock is
contended. In kernel 5.15, the receive path may have used
`sk->sk_data_ready` differently or with less contention.
3. **`iscsi_xmitworker` single-threaded design**: All SCSI command
dispatch and completion happens through one workqueue worker. If kernel
6.8+ changed workqueue scheduling (e.g., unbound → bound, or different
WQ flags), this would add latency.
4. **SCSI mid-layer `blk-mq` tag allocation on NUMA**: On 96-core dual-socket
systems, cross-NUMA blk-mq tag allocation adds measurable
latency. The 18% NUMA penalty observed may be amplified by changes in
how the SCSI mid-layer allocates tags in 6.8+.
5. **TCP small-queue (TSQ) or pacing changes**: `tcp_sendmsg` in newer
kernels may hold the socket lock longer due to TSQ or pacing changes,
increasing the window during which `release_sock` processes RX data
inline.
## Environment Details
```
# Modules involved:
iscsi_tcp 24576
libiscsi_tcp 32768
libiscsi 81920
scsi_transport_iscsi 176128
# Module parameters (kernel 7.0):
iscsi_tcp: max_lun, recv_from_iscsi_q, debug_iscsi_tcp
```
** Affects: linux (Ubuntu)
Importance: Undecided
Status: New
** Tags: block-layer iscsi kernel-da-key regression-update
** Attachment added: "lspci-vnvn.log"
https://bugs.launchpad.net/bugs/2157924/+attachment/5978472/+files/lspci-vnvn.log
Title:
iscsi_tcp 40-50% sequential read performance regression (5.15 vs
6.8/7.0) due to release_sock serialization bottleneck