[Bug 2157924] [NEW] iscsi_tcp 40-50% sequential read performance regression (5.15 vs 6.8/7.0) due to release_sock serialization bottleneck

Juan Morete Mon, 22 Jun 2026 15:25:38 -0700

Public bug reported:

1)
=== System Metadata ===
OS Release:
Description:    Ubuntu 26.04 LTS
Release:        26.04
Kernel Version: 7.0.0-22-generic
CPU Model: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
Loaded iscsi modules: iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
Network controller: 5e:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
                    5e:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
2)
# apt-cache policy linux-image-generic
linux-image-generic:
  Installed: 7.0.0-22.22


3)
Expected same or better performance over iscsi LUN volume mounts.

4)
Performance is 40% to 50% worse than the baseline test on Ubuntu 22.04.5 LTS
with kernel 5.15.


# Kernel Bug Report: iscsi_tcp Sequential Throughput Regression (5.15 → 6.8 / 7.0)

## Summary

A significant sequential I/O throughput regression has been identified
in the `iscsi_tcp` kernel module when comparing Ubuntu kernel 5.15
(Ubuntu 22.04) against kernels 6.8 (Ubuntu 24.04) and 7.0 (Ubuntu
26.04). Sequential read throughput drops by approximately 40-50% on the
newer kernels under identical hardware, network, and storage backend
conditions. All externally-tunable parameters have been exhaustively
tested and eliminated as the cause.

## Affected Versions

| Distribution | Kernel Version | Status |
|-------------|---------------|--------|
| Ubuntu 22.04 LTS | 5.15.0-143-generic | Working (baseline) |
| Ubuntu 24.04 LTS | 6.8.0-88-generic | **Regressed** |
| Ubuntu 26.04 LTS | 7.0.0-22-generic | **Regressed** |

## Hardware Configuration (identical across all hosts)

- **CPU**: 96 CPUs (dual-socket Intel Xeon Gold 6248R @ 3.00GHz)
- **Memory**: 750 GiB
- **NIC**: Mellanox ConnectX (25 GbE, dual-port, LACP bond, mlx5 driver)
- **Storage Backend**: NetApp ONTAP SAN (iSCSI, ontap-san-economy driver via 
Trident CSI 25.06)
- **Network**: Jumbo frames (MTU 9000), VLAN-tagged storage network
- **Multipath**: dm-multipath with ALUA, service-time path selector, 2 paths 
per LUN

## Test Environment

- **Volume**: 100 GiB iSCSI LUN provisioned via Trident CSI (PVC with 
`Filesystem` volumeMode)
- **Test Pod**: Kubernetes pod with fio 3.6, volume mounted at `/mnt/data`
- **fio Parameters**: `--direct=1 --ioengine=libaio --iodepth=64 --numjobs=1 
--runtime=30 --time_based --group_reporting --directory=/mnt/data`
- **Block Sizes Tested**: 128K, 256K, 1M

## Results

### Sequential Read Throughput (256K block size)

| Kernel | Bandwidth | IOPS | Avg Latency |
|--------|-----------|------|-------------|
| 5.15.0-143 | **1542 MiB/s** | 6168 | 10.2 ms |
| 6.8.0-88 | **740 MiB/s** | 2958 | 21.3 ms |
| 7.0.0-22 | **1034 MiB/s** | 4136 | 15.4 ms |

### Sequential Read Throughput (1M block size)

| Kernel | Bandwidth | IOPS | Avg Latency |
|--------|-----------|------|-------------|
| 5.15.0-143 | **1444 MiB/s** | 1444 | 43.6 ms |
| 6.8.0-88 | **815 MiB/s** | 814 | 77.3 ms |
| 7.0.0-22 | **851 MiB/s** | 851 | 73.9 ms |

### Regression Magnitude

- **Kernel 6.8 vs 5.15**: 44-52% throughput reduction (sequential reads)
- **Kernel 7.0 vs 5.15**: 33-41% throughput reduction (sequential reads)
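
The ranges above can be recomputed from the bandwidth columns of the two Results tables. A minimal sketch of that arithmetic follows; `pct_drop` is a hypothetical helper introduced here for illustration and is not part of fio or any other tool used in this report.

```bash
# pct_drop BASELINE MEASURED
# Prints the throughput reduction relative to BASELINE, in percent.
pct_drop() {
    awk -v base="$1" -v cur="$2" \
        'BEGIN { printf "%.1f\n", (base - cur) / base * 100 }'
}

pct_drop 1542 740     # 6.8 vs 5.15, 256K
pct_drop 1444 815     # 6.8 vs 5.15, 1M
pct_drop 1542 1034    # 7.0 vs 5.15, 256K
pct_drop 1444 851     # 7.0 vs 5.15, 1M
```

Rounded to whole percentages, the four values reproduce the 44-52% and 33-41% ranges quoted above.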

## iSCSI / SCSI Parameters (verified identical across all kernels)

| Parameter | Value |
|-----------|-------|
| `can_queue` (scsi_host) | 113 |
| `cmd_per_lun` (scsi_host) | 32 |
| `sg_tablesize` (scsi_host) | 4096 |
| `queue_depth` (per LUN) | 32 |
| `max_hw_sectors_kb` | 32767 |
| iSCSI `MaxRecvDataSegmentLength` | 262144 |
| iSCSI `FirstBurstLength` | 65536 (negotiated by target) |
| iSCSI `MaxBurstLength` | 1048576 |
| iSCSI `MaxOutstandingR2T` | 1 |
| iSCSI `ImmediateData` | Yes |
| iSCSI `InitialR2T` | Yes (negotiated by target) |
| TCP congestion control | BBR |
| MTU | 9000 |
| TCP rmem_max / wmem_max | 134217728 |

## Eliminated Causes

The following parameters/settings were systematically tuned on the 7.0
kernel with no measurable impact on throughput:

| Tuning Attempted | Result |
|-----------------|--------|
| Disable WBT (`wbt_lat_usec=0`) | No change |
| Increase read-ahead (`read_ahead_kb=16384`) | No change (expected: direct IO) |
| IO scheduler: `none` (passthrough) | No change |
| IO scheduler: `mq-deadline` (match 5.15 default) | No change |
| Reduce `max_sectors_kb` to 64 (match 5.15 value) | No change |
| Increase `nr_requests` to 512 | No change |
| Enable `recv_from_iscsi_q=Y` (kernel 7.0 parameter) | No change |
| Increase `netdev_budget` (1200→4800) and `netdev_budget_usecs` (8000→32000) | No change |
| Renice `iscsid` process to -20 | No change |
| Enable RPS on storage VLAN interface (`rps_cpus=ffffffff`) | No change |
| Enable RFS (`rps_sock_flow_entries=32768`) | No change |
| Enable `quickack` on iSCSI storage routes | No change |
| Set `tcp_low_latency=1` | No change |
| Increase `gro_max_size` / `gso_max_size` | Failed (not supported on VLAN interface) |
| Multiple fio jobs (numjobs=2) | No change / slightly worse |

## Analysis

### Per-IO Latency Comparison

With `queue_depth=32` and `iodepth=64` (saturating the device queue),
throughput is governed by:

```
throughput = queue_depth / avg_latency_per_IO
```

For 256K sequential reads:
- **Kernel 5.15**: 32 / 0.0052s = ~6150 IOPS → 1538 MiB/s (matches observed)
- **Kernel 6.8**: 32 / 0.0108s = ~2962 IOPS → 741 MiB/s (matches observed)
- **Kernel 7.0**: 32 / 0.0077s = ~4156 IOPS → 1039 MiB/s (matches observed)

The per-IO latency for the same 256K read operation is:
- **5.15**: ~5.2 ms average
- **6.8**: ~10.8 ms average (2.1x higher)
- **7.0**: ~7.7 ms average (1.5x higher)
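
The relation above is Little's law applied to the SCSI device queue. A minimal sketch of the calculation, using a hypothetical helper `littles_law` and the queue depth, latencies, and 256 KiB block size quoted in this section:

```bash
# littles_law QUEUE_DEPTH LATENCY_MS BLOCK_KIB
# Prints the IOPS and MiB/s implied by a full queue and a given per-IO latency.
littles_law() {
    awk -v qd="$1" -v lat_ms="$2" -v bs_kib="$3" 'BEGIN {
        iops = qd / (lat_ms / 1000)
        printf "%.0f IOPS, %.0f MiB/s\n", iops, iops * bs_kib / 1024
    }'
}

littles_law 32 5.2  256    # kernel 5.15
littles_law 32 10.8 256    # kernel 6.8
littles_law 32 7.7  256    # kernel 7.0
```

The average latencies that fio reports in the Results tables are roughly twice these values. That appears consistent with fio measuring against its own `iodepth=64` rather than the device queue depth of 32.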

### TCP Connection Health (not the bottleneck)

TCP socket statistics captured during testing confirm the network path
is not limiting:

- All connections show healthy `cwnd`, full-speed `delivery_rate`, and 
sub-0.2ms RTT
- The 25 GbE NIC is operating well below capacity (~6-12 Gbps observed vs 25 
Gbps available)
- No retransmissions or congestion events during testing
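
The socket statistics referred to above can be captured with `ss -tin`, which prints per-connection TCP info such as `rtt:`, `cwnd:`, `retrans:` and `delivery_rate`. The filter below is a sketch for pulling out only those fields; `tcp_health` is a hypothetical name, and the exact set of fields `ss` emits varies with the iproute2 version.

```bash
# tcp_health
# Reads `ss -tin` output on stdin and prints only the rtt, cwnd, retrans
# and delivery_rate fields, one per line.
tcp_health() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^(rtt|cwnd|retrans):/)
                print $i
            else if ($i == "delivery_rate")
                print $i, $(i + 1)
        }
    }'
}
```

Typical use during a fio run would be `ss -tin 'dport = :3260' | tcp_health`, where 3260 is the iSCSI port used in the reproduction steps.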

### CPU Utilization (not the bottleneck)

- Softirq CPU usage remains low during testing
- RPS/multi-queue distribution does not improve throughput
- The `iscsi_tcp` workqueue threads run at nice -20 (highest priority)
- Switching between softirq and workqueue processing (`recv_from_iscsi_q`) has 
no effect

### Conclusion

The regression is internal to the `iscsi_tcp` / `libiscsi` kernel module
data path. The per-SCSI-command processing latency is 50-110% higher on
kernels 6.8 and 7.0 compared to 5.15, for identical iSCSI PDU sizes,
network conditions, and queue depths. This suggests changes in the iSCSI
receive/transmit path, SCSI mid-layer command completion, or interaction
with the block layer's multi-queue infrastructure introduced between
5.15 and 6.8 are adding overhead per I/O operation.

## Steps to Reproduce

### Prerequisites

- Two bare-metal hosts: one running kernel 5.15 (Ubuntu 22.04), one running 
kernel 6.8+ (Ubuntu 24.04 or 26.04)
- iSCSI target (e.g., NetApp ONTAP, LIO, or targetcli) accessible via 10GbE+ 
network with jumbo frames
- `open-iscsi` package installed with default configuration
- `dm-multipath` configured (or single-path is sufficient to reproduce)
- Kubernetes with Trident CSI is NOT required; direct iSCSI LUN attachment 
reproduces the issue

### Reproduction Steps

1. **Provision a 100 GiB iSCSI LUN** on the target and present it to
both hosts.

2. **Discover and login** on both hosts:
   ```bash
   iscsiadm -m discovery -t sendtargets -p <target_ip>:3260
   iscsiadm -m node --login
   ```

3. **Identify the device**:
   ```bash
   # For multipath:
   multipath -ll
   # Note the dm-X device

   # For single path:
   lsblk --scsi
   ```

4. **Create a filesystem and mount**:
   ```bash
   mkfs.ext4 /dev/dm-0   # or /dev/sdX for single path
   mkdir /mnt/iscsi-test
   mount /dev/dm-0 /mnt/iscsi-test
   ```

5. **Run fio benchmark** (identical on both hosts):
   ```bash
   # 256K Sequential Read
   fio --name=seq-read-256k \
       --rw=read \
       --bs=256k \
       --size=1G \
       --numjobs=1 \
       --iodepth=64 \
       --direct=1 \
       --ioengine=libaio \
       --runtime=30 \
       --time_based \
       --group_reporting \
       --directory=/mnt/iscsi-test

   # 1M Sequential Read
   fio --name=seq-read-1M \
       --rw=read \
       --bs=1M \
       --size=1G \
       --numjobs=1 \
       --iodepth=64 \
       --direct=1 \
       --ioengine=libaio \
       --runtime=30 \
       --time_based \
       --group_reporting \
       --directory=/mnt/iscsi-test
   ```

6. **Compare results**: The host running kernel 6.8+ will show 40-50%
lower sequential read bandwidth compared to the host running kernel
5.15.
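
Step 5 can be repeated for every block size listed under Test Environment (128K, 256K, 1M) with a small loop. The sketch below only prints the fio command lines so they can be reviewed first; `fio_seq_read_cmd` is a hypothetical helper, and the options are the ones already used in step 5.

```bash
# fio_seq_read_cmd BLOCK_SIZE [DIRECTORY]
# Prints the sequential-read fio command line for one block size.
fio_seq_read_cmd() {
    local bs="$1" dir="${2:-/mnt/iscsi-test}"
    printf 'fio --name=seq-read-%s --rw=read --bs=%s --size=1G --numjobs=1 --iodepth=64 --direct=1 --ioengine=libaio --runtime=30 --time_based --group_reporting --directory=%s\n' \
        "$bs" "$bs" "$dir"
}

# Dry run: print one command per block size.
for bs in 128k 256k 1M; do
    fio_seq_read_cmd "$bs"
done
```

Piping the loop output to `sh` runs the three benchmarks back to back on the mounted LUN.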

### Verification

Confirm the test is hitting the iSCSI device (not page cache or overlay):
```bash
# During the test, verify disk utilization:
iostat -x 1 | grep dm-0  # Should show ~99% util

# Verify the mount is on iSCSI:
lsblk -o NAME,TYPE,TRAN,SIZE,MOUNTPOINT | grep -A5 dm-0
```

## Additional Diagnostics

### 1. Transparent Huge Pages (THP)

**Default state**: `enabled=madvise`, `defrag=madvise`

| THP Setting | 256K Seq Read (kernel 7.0) | Change |
|-------------|---------------------------|--------|
| madvise (default) | 1034 MiB/s | baseline |
| never | 882 MiB/s | -15% (worse) |

**Conclusion**: THP is not contributing to the regression. Disabling it
reduces throughput by about 15%, likely due to losing THP benefits
for fio's memory allocations.

### 2. NUMA Locality Impact

**Hardware topology**:
- 96 cores: node 0 = even CPUs (0,2,4,...,94), node 1 = odd CPUs (1,3,5,...,95)
- NIC `ens3f0np0` (active iSCSI bond slave): **NUMA node 0**
- NIC `ens5f1np1` (secondary bond slave): **NUMA node 1**

**NUMA-pinned fio results** (256K sequential read, direct on
`/dev/dm-0`):

| CPU Pinning | Throughput | Avg Latency | Delta |
|-------------|-----------|-------------|-------|
| Node 0 (NIC-local) | **1148 MiB/s** | 13.9 ms | baseline |
| Node 1 (cross-socket) | **943 MiB/s** | 16.9 ms | -18% |

**Finding**: The iSCSI MSI-X interrupt (IRQ 338, `mlx5_comp62`) is
pinned to **CPU29 (NUMA node 1)** despite the NIC residing on **NUMA
node 0**. The server has two 25 GbE Mellanox NICs, one per NUMA node,
bonded via 802.3ad LACP (`xmit_hash_policy=layer2`). The storage VLAN
runs on top of this bond. Due to LACP hashing, iSCSI traffic flows
through the node-0 NIC but the receive interrupt is processed on a
node-1 CPU, adding ~3ms per-IO latency from cross-socket memory access.
However, even the optimally-pinned node-0 result (1148 MiB/s) is still
**25% below kernel 5.15** (1542 MiB/s), confirming the regression is not
solely NUMA-related.
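
The IRQ-to-NUMA mismatch described above can be checked from sysfs and procfs. The sketch below looks up the NUMA node of a CPU through the `nodeN` link that the kernel exposes under `/sys/devices/system/cpu/cpuM/`; `cpu_numa_node` is a hypothetical helper, and the interface name and IRQ number in the comments are the ones reported in this section.

```bash
# cpu_numa_node CPU [SYSROOT]
# Prints the NUMA node that owns CPU, based on the nodeN entry in its sysfs dir.
cpu_numa_node() {
    local cpu="$1" sys="${2:-/sys}" d
    for d in "$sys/devices/system/cpu/cpu$cpu"/node*; do
        if [ -e "$d" ]; then
            echo "${d##*node}"
            return 0
        fi
    done
    return 1
}

# Related checks (not executed here):
#   cat /sys/class/net/ens3f0np0/device/numa_node   # NUMA node of the NIC
#   cat /proc/irq/338/smp_affinity_list             # CPUs allowed to service IRQ 338
#   cpu_numa_node 29                                # node of the CPU taking the IRQ
```

Writing a node-0 CPU list to `/proc/irq/338/smp_affinity_list` as root would move the interrupt, although a running `irqbalance` may later override the setting.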

### 3. perf Profiling (Kernel CPU Time Breakdown)

Captured via `perf record -a -g -- sleep 10` during 256K sequential
reads at iodepth=64:

```
Top-level call stack (kernel 7.0.0-22-generic):

47.97%  ret_from_fork_asm
  └─ 47.96% kthread
       └─ 46.05% worker_thread
             └─ 44.97% process_one_work
                   ├─ 42.77% iscsi_xmitworker           ← TX path
                   │    └─ 42.74% iscsi_data_xmit
                   │         └─ 42.24% iscsi_xmit_task
                   │              └─ 41.79% iscsi_tcp_task_xmit
                   │                   └─ 41.74% iscsi_sw_tcp_pdu_xmit
                   │                        └─ 41.62% iscsi_sw_tcp_xmit_segment
                   │                             └─ 41.34% sock_sendmsg
                   │                                  └─ 41.08% tcp_sendmsg
                   │                                       ├─ 33.73% release_sock    ← CRITICAL
                   │                                       │    └─ 33.57% __release_sock
                   │                                       │         └─ 33.42% tcp_v4_do_rcv
                   │                                       │              └─ 33.29% tcp_rcv_established
                   │                                       │                   └─ 32.24% tcp_data_queue
                   │                                       │                        └─ 31.84% tcp_data_ready
                   │                                       │                             └─ 31.78% iscsi_sw_tcp_data_ready
                   │                                       │                                  ├─ 28.03% tcp_read_sock
                   │                                       │                                  │    └─ 23.45% iscsi_sw_tcp_recv
                   │                                       │                                  │         └─ 23.15% iscsi_tcp_recv_skb
                   │                                       │                                  │              └─ 2.35% iscsi_tcp_segment_recv
                   │                                       │                                  └─ 4.32% native_queued_spin_lock_slowpath ← LOCK CONTENTION
                   │                                       └─ (tcp_sendmsg_locked, etc.)
                   └─ 1.78% blk_mq_run_work_fn
```

**Critical Findings**:

1. **TX/RX path serialization via `release_sock`**: 33.73% of total CPU
time is spent inside `release_sock()` during the transmit path. When the
xmit worker sends a SCSI command via `tcp_sendmsg`, the socket lock
release triggers processing of all queued incoming data in the **same
thread context** — this includes `iscsi_sw_tcp_data_ready` →
`iscsi_tcp_recv_skb` (23.15% of CPU).

2. **Spinlock contention**: `native_queued_spin_lock_slowpath` accounts
for **4.32%** of total CPU time — indicating measurable contention on
the socket lock between the transmit workqueue and network softirq
receive path.

3. **Single-threaded bottleneck**: The entire iSCSI IO path (TX command
+ RX data) is serialized through a single `iscsi_xmitworker` workqueue
thread. Data receive happens as a side-effect of the transmit path's
lock release, not in parallel.

### 4. Interrupt Footprint

During a 5-second sample of active 256K sequential reads:

| IRQ | Queue | CPU | Delta (5s) | Rate |
|-----|-------|-----|-----------|------|
| 338 | `mlx5_comp62@0000:5e:00.0` | CPU29 | +15,979 | 3,196/s |
| 325 | `mlx5_comp49@0000:5e:00.0` | CPU3 | +8 | ~0/s |

**Finding**: All iSCSI receive traffic is concentrated on a single MSI-X
vector (IRQ 338 → CPU29). This is expected for a single TCP flow (RSS
hashes to one queue), but confirms that the entire iSCSI data path is
single-CPU-bound. The interrupt is NOT bottlenecking on CPU0 — it's
correctly distributed via MSI-X, but still limited to one core.
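
The per-IRQ deltas in the table can be reproduced by sampling `/proc/interrupts` twice and subtracting. The sketch below sums the per-CPU counters on one IRQ line; `irq_count` and `irq_rate` are hypothetical helpers, and the optional file argument exists only so the parser can be exercised on a saved copy.

```bash
# irq_count IRQ [FILE]
# Sums the per-CPU counters for IRQ in FILE (default /proc/interrupts).
irq_count() {
    awk -v irq="$1:" '$1 == irq {
        # Counters are the leading all-digit fields after the "N:" label.
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
            sum += $i
        print sum + 0
    }' "${2:-/proc/interrupts}"
}

# irq_rate IRQ [SECONDS]
# Prints interrupts per second over the sampling window (integer division).
irq_rate() {
    local irq="$1" secs="${2:-5}" before after
    before=$(irq_count "$irq")
    sleep "$secs"
    after=$(irq_count "$irq")
    echo $(( (${after:-0} - ${before:-0}) / secs ))
}
```

For example, `irq_rate 338 5` during an active fio run would sample the vector identified above over the same 5-second window.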

### 5. Lock Contention (lockstat)

`CONFIG_LOCK_STAT` is **not enabled** in the Ubuntu 7.0.0-22-generic
kernel. This data point is not available without a custom kernel build
with `CONFIG_LOCK_STAT=y`.

### 6. blktrace / iostat Latency Decomposition

Captured via `blktrace` and `iostat -xmt` on dm-0 and underlying sdb
during active 256K sequential reads:

**Raw blktrace** (10s capture on `/dev/dm-0`): Completions arrive at
30-40 µs intervals on the SCSI device, confirming fast backend response.

**iostat latency breakdown** (steady-state averages over 10 samples):

| Device | r_await (ms) | aqu-sz | Throughput |
|--------|-------------|--------|-----------|
| dm-0 (multipath) | **13.3 ms** | 243 | ~1.15 GB/s |
| sdb (SCSI/iSCSI) | **1.7 ms** | 31 | ~1.15 GB/s |

**Latency decomposition**:

| Component | Time | % of Total |
|-----------|------|-----------|
| DM/block queue wait (Q2D) | **11.6 ms** | 87% |
| SCSI device service time (D2C) | **1.7 ms** | 13% |
| **Total r_await** | **13.3 ms** | 100% |

**Interpretation**: The actual iSCSI network + storage backend time is
only 1.7 ms per 64KB read. The remaining 87% of per-IO latency is spent
**waiting in the device-mapper queue** because the SCSI device queue
depth is limited to 32 (`cmd_per_lun=32`). With `iodepth=64` from fio,
the DM layer queues 243 outstanding IOs but can only dispatch 32 at a
time to the underlying SCSI device.
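
The decomposition above is a subtraction of the two `r_await` values. A minimal sketch, assuming (as this section does) that the `r_await` of the underlying SCSI device approximates service time and that the remainder of the dm-level `r_await` is queue wait; `latency_split` is a hypothetical helper.

```bash
# latency_split DM_AWAIT_MS SD_AWAIT_MS
# Splits the dm-level r_await into queue wait and device service time.
latency_split() {
    awk -v dm="$1" -v sd="$2" 'BEGIN {
        qwait = dm - sd
        printf "queue wait %.1f ms (%.0f%%), device service %.1f ms (%.0f%%)\n",
               qwait, qwait / dm * 100, sd, sd / dm * 100
    }'
}

latency_split 13.3 1.7
```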

On kernel 5.15, the same `cmd_per_lun=32` limit exists but achieves 1542 MiB/s 
at 256K, implying either:
- The DM/blk-mq queue management has higher overhead in 6.8+ (longer Q2D time 
per IO)
- The SCSI device service time was lower on 5.15 (different TCP/socket handling 
in `release_sock`)
- Or both, as the perf profile showing 33% of time in `release_sock` correlates 
with the serialized TX/RX pattern adding per-IO overhead

**Comparative iostat from kernel 5.15 (U22) under identical test**:

| Metric | U22 (kernel 5.15) | U26 (kernel 7.0) | Ratio |
|--------|-------------------|-------------------|-------|
| IOPS (256K) | **~49,000** | ~18,200 | U22 2.7× more |
| Throughput | **~3,050 MiB/s** | ~1,150 MiB/s | U22 2.7× faster |
| r_await (dm) | **5.5 ms** | 13.3 ms | U26 2.4× slower per IO |
| aqu-sz (dm) | 268-286 | 243 | Similar queue depth |

The per-IO latency through the DM layer is **2.4× higher on kernel 7.0**
(13.3 ms vs 5.5 ms), directly explaining the throughput difference. Both
kernels saturate at 100% device utilization with similar application
queue depths (~260-280), confirming the bottleneck is per-IO processing
efficiency in the kernel iSCSI/SCSI stack, not device or network
capacity.
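
As a sanity check on the iostat figures, Little's law predicts that the average queue size should be close to IOPS multiplied by the average wait. The sketch below applies that to the dm-level numbers in the comparison table; `expected_queue` is a hypothetical helper.

```bash
# expected_queue IOPS AWAIT_MS
# Prints the average number of in-flight IOs implied by Little's law.
expected_queue() {
    awk -v iops="$1" -v await_ms="$2" \
        'BEGIN { printf "%.1f\n", iops * await_ms / 1000 }'
}

expected_queue 18200 13.3    # kernel 7.0
expected_queue 49000 5.5     # kernel 5.15
```

The results come out near 242 for kernel 7.0 and near 270 for kernel 5.15, in line with the reported `aqu-sz` of 243 and 268-286. This suggests the IOPS and `r_await` columns are internally consistent.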

---

## Suggested Investigation Areas

1. **`release_sock()` in `tcp_sendmsg` processing RX data inline (kernel
7.0)**: The perf profile shows that 33.73% of time is spent in
`release_sock` → `tcp_v4_do_rcv` → `iscsi_sw_tcp_data_ready` during the
**transmit** path. This serializes TX and RX. Investigate whether kernel
5.15 handled the receive callback differently (e.g., via softirq/tasklet
rather than inline in `release_sock`).

2. **Socket lock contention in `iscsi_sw_tcp_data_ready`**: The 4.32%
`native_queued_spin_lock_slowpath` indicates the socket lock is
contended. In kernel 5.15, the receive path may have used
`sk->sk_data_ready` differently or with less contention.

3. **`iscsi_xmitworker` single-threaded design**: All SCSI command
dispatch and completion happens through one workqueue worker. If kernel
6.8+ changed workqueue scheduling (e.g., unbound → bound, or different
WQ flags), this would add latency.

4. **SCSI mid-layer `blk-mq` tag allocation on NUMA**: On 96-core dual-
socket systems, cross-NUMA blk-mq tag allocation adds measurable
latency. The 18% NUMA penalty observed may be amplified by changes in
how the SCSI mid-layer allocates tags in 6.8+.

5. **TCP small-queue (TSQ) or pacing changes**: `tcp_sendmsg` in newer
kernels may hold the socket lock longer due to TSQ or pacing changes,
increasing the window during which `release_sock` processes RX data
inline.

## Environment Details

```
# Modules involved:
iscsi_tcp              24576
libiscsi_tcp           32768
libiscsi               81920
scsi_transport_iscsi   176128

# Module parameters (kernel 7.0):
iscsi_tcp: max_lun, recv_from_iscsi_q, debug_iscsi_tcp
```

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: block-layer iscsi kernel-da-key regression-update

** Attachment added: "lspci-vnvn.log"
   
https://bugs.launchpad.net/bugs/2157924/+attachment/5978472/+files/lspci-vnvn.log

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2157924

Title:
  iscsi_tcp 40-50% sequential read performance regression (5.15 vs
  6.8/7.0) due to release_sock serialization bottleneck

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2157924/+subscriptions


