Public bug reported:

A regression in the mm/maple_tree.c subsystem causes recurring kernel oops
on Ubuntu 25.10 (questing) on multiple revisions of the 6.17.0-x-generic
kernel under sustained memory-intensive workload. The bug manifests on both
6.17.0-29-generic (more aggressive: 17 oops events in ~11h, RIP at
mas_walk+0x1dc/0x490 during page fault VMA lookup) and 6.17.0-23-generic
(less aggressive: 3 oops events in similar window, RIP at
mast_fill_bnode+0x42/0x590 during maple tree node insertion). Reverting to
-23 reduces the oops frequency by roughly 80% (17 events down to 3 in a
comparable window) but does NOT eliminate the bug.

The system is a workstation-class AI/Inference host (MSI MPG Z890 CARBON
WIFI, Intel Core Ultra 9 285 Arrow Lake-S 24c, 128 GB DDR5, NVIDIA RTX 5080
16 GB) running a stack of Docker containers (RAGflow, Ollama, ComfyUI,
JupyterLab, ~15-20 active SCIAs simultaneously). The triggering workloads
share a common pattern: heavy fork/mmap/exec/brk activity from:
  - runc (Docker container spawn for new tasks)
  - sh (subprocess forking)
  - ollama (LLM worker process spawn for model inference)
  - python3 (multi-process embeddings, batch RAG ingestion, ML pipelines)
  - node + tokio-runtime-w (memos / Node.js workloads with thread pools)
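
The access pattern described above (several processes creating, faulting
in, and tearing down mappings at the same time) can be approximated with a
minimal Python sketch. This is NOT a verified reproducer for this bug; the
function names, worker count, and sizes are hypothetical and only
illustrate the kind of VMA churn involved.

```python
import mmap
import multiprocessing as mp


def churn_mappings(iterations: int, size: int = 1 << 20) -> int:
    """Map, touch, and unmap anonymous memory; returns pages touched."""
    touched = 0
    for _ in range(iterations):
        region = mmap.mmap(-1, size)      # anonymous mapping (a new VMA on Linux)
        for offset in range(0, size, mmap.PAGESIZE):
            region[offset] = 1            # first write faults the page in
            touched += 1
        region.close()                    # unmap: the mapping is removed again
    return touched


def run_workers(workers: int = 5, iterations: int = 50) -> list[int]:
    """Run the churn loop in several processes at once (hypothetical defaults)."""
    with mp.Pool(processes=workers) as pool:
        return pool.map(churn_mappings, [iterations] * workers)


if __name__ == "__main__":
    print(run_workers())
```

The real workloads also involve fork/exec and brk, which this sketch does
not exercise.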

When 5+ such processes simultaneously call into the maple tree subsystem
(page fault VMA lookup or VMA range insertion), at least one hits an invalid
pointer state and oopses. With kernel.panic=0 and kernel.hung_task_panic=0,
the host does NOT auto-reboot; instead the UI freezes (the desktop stays on
screen but Wayland/Xorg stops responding) and a physical reset is required.
This occurred 3 times in 24h on -29 before the revert (2026-05-24 at 02:08,
02:12, and 13:07); since the revert, -23 has produced 3 oops events without
a full freeze.
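
Whether a host reboots by itself after an oops depends on the sysctls
above. The following Python sketch reads them from /proc/sys/kernel and
applies the usual rule. It additionally consults kernel.panic_on_oops,
which this report does not mention; treating that setting as relevant here
is an assumption, and the helper names are hypothetical.

```python
from pathlib import Path

SYSCTL_DIR = Path("/proc/sys/kernel")


def read_sysctl(name: str, default: int = 0) -> int:
    """Read an integer sysctl from /proc/sys/kernel, or default if unreadable."""
    try:
        return int((SYSCTL_DIR / name).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return default


def reboots_after_oops(panic: int, panic_on_oops: int) -> bool:
    """True if an oops escalates to a panic and that panic leads to a reboot.

    panic_on_oops=1 turns an oops into a panic. kernel.panic=0 makes the
    kernel wait forever after a panic; any non-zero value leads to a reboot.
    """
    return panic_on_oops == 1 and panic != 0


if __name__ == "__main__":
    panic = read_sysctl("panic")
    on_oops = read_sysctl("panic_on_oops")
    print(f"kernel.panic={panic} kernel.panic_on_oops={on_oops}")
    print("auto-reboot after oops:", reboots_after_oops(panic, on_oops))
```

With kernel.panic=0, as on this host, the function returns False regardless
of panic_on_oops, which matches the observed need for a physical reset.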

The bug is in the kernel MM core, not in any out-of-tree module. The only
OOT module loaded is NVIDIA (nvidia, nvidia_uvm, nvidia_modeset, nvidia_drm
via DKMS 580.126.20), but stack traces show kernel-space RIP within
mm/maple_tree.c with no NVIDIA frames involved.
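
The Tainted: field in the oops headers quoted further down records the
module state at the time of each oops. A small Python decoder for the
letters that appear there (plus two common ones) is sketched below; the
table covers only a subset of the kernel's taint flags, and the function
name is hypothetical.

```python
# Subset of the kernel taint letters; descriptions paraphrase the kernel docs.
TAINT_FLAGS = {
    "G": "only GPL-compatible modules loaded",
    "P": "proprietary module loaded",
    "D": "kernel has already died (earlier oops or BUG)",
    "O": "out-of-tree module loaded",
    "E": "unsigned module loaded",
    "W": "kernel issued a warning",
}


def decode_taint(field: str) -> list[str]:
    """Translate a Tainted: field such as 'G D OE' into descriptions."""
    return [
        TAINT_FLAGS.get(ch, f"flag {ch!r} not in this table")
        for ch in field
        if not ch.isspace()
    ]


if __name__ == "__main__":
    for field in ("G D OE", "G OE"):
        print(field, "->", decode_taint(field))
```

The D flag is present in the -29 header (oops [#2]) and absent from the -23
header (oops [#1]), which is consistent with D being set by an earlier oops
in the same boot.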

The bug appears specific either to this hardware platform (Arrow Lake-S +
Z890 chipset, BIOS AMI 1.A90 12/29/2025) or to this kernel revision range:
the system ran kernel 6.17.0-22-generic from 2026-05-02 to 2026-05-05
without observed freezes (1.5 days of uptime before the upgrade chain).

Workaround applied: GRUB_DEFAULT pinned to 6.17.0-23-generic,
unattended-upgrades disabled, and 92 kernel/NVIDIA/CUDA/python packages held
via apt-mark. This has stopped the full freezes so far, but the oops still
occurs on -23 at reduced frequency, so a kernel fix is needed.

Severity rationale (HIGH): this is a production AI workstation used for
technical PDF processing, model fine-tuning, and embeddings indexing for an
acoustic consulting business. Every freeze requires a physical reset (the
workstation is in a different room from the operator), risks data
corruption in the stopped containers (MinIO/MySQL/Elasticsearch volumes),
and disrupts client-facing deliverables. A reliable fix is required to
safely restore the normal workload (RAGflow + concurrent multi-process
Python pipelines).

First oops stack trace verbatim, kernel 6.17.0-29-generic, oops [#2] at
2026-05-24 13:01:33 CEST (signature: 5 distinct processes hit mas_walk
simultaneously, indicating a kernel bug rather than an application bug):

Oops: invalid opcode: 0000 [#2] SMP NOPTI
CPU: 13 UID: 0 PID: 945947 Comm: python3 Tainted: G D OE 6.17.0-29-generic #29-Ubuntu PREEMPT(voluntary)
Hardware name: Micro-Star International Co., Ltd. MS-7E17/MPG Z890 CARBON WIFI (MS-7E17), BIOS 1.A90 12/29/2025
RIP: 0010:mas_walk+0x1dc/0x490
Call Trace:
 <TASK>
 lock_vma_under_rcu+0x60/0x230
 do_user_addr_fault+0x1ec/0x6c0
 exc_page_fault+0x7f/0x1b0
 asm_exc_page_fault+0x27/0x30
 </TASK>

First oops stack trace verbatim, kernel 6.17.0-23-generic, oops [#1] at
2026-05-24 23:03:51 CEST (different code path, same maple_tree subsystem):

Oops: Oops: 0002 [#1] SMP NOPTI
CPU: 13 UID: 0 PID: 137359 Comm: runc Tainted: G OE 6.17.0-23-generic #23-Ubuntu PREEMPT(voluntary)
Hardware name: Micro-Star International Co., Ltd. MS-7E17/MPG Z890 CARBON WIFI (MS-7E17), BIOS 1.A90 12/29/2025
RIP: 0010:mast_fill_bnode+0x42/0x590
Call Trace:
 <TASK>
 mas_split+0x551/0xcd0
 ? xas_load+0x11/0x100
 mas_wr_bnode+0x7e/0x130
 </TASK>

Full stack traces for all 17 oops events on -29 and all 3 oops events on
-23, the complete hardware inventory (dmidecode + lspci + nvidia-smi +
smartctl + DKMS status + /proc/cmdline + /etc/default/grub + dpkg + apt-mark
holds), victim process tables, and frequency comparison data are available
as attachments. Please ask and I will attach them to this bug.

Attachments prepared (~190 KB total):
  - dmesg_oops_kernel-6.17.0-29_boot-2.log (121 KB / 1205 lines, verbatim, 17 events)
  - dmesg_oops_kernel-6.17.0-23_boot0.log (47 KB / 333 lines, verbatim, 3 events)
  - hardware_inventory.txt (19 KB / complete system inventory)
  - oops_summary.txt (2.4 KB / events table + victim processes + frequency comparison)
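
A per-process and per-symbol tally like the one in oops_summary.txt can be
derived from the dmesg logs with a short Python script. The sketch below is
not the script used to produce the attachment; the regular expressions
assume the header format shown in the traces above (a Comm: field on the
CPU line and a kernel-mode RIP: 0010: line), and the function name is
hypothetical.

```python
import re
from collections import Counter

COMM_RE = re.compile(r"\bComm: (\S+)")
# 0010 is the kernel code segment, so user-space RIP lines are skipped.
RIP_RE = re.compile(r"\bRIP: 0010:([A-Za-z_][\w.]*)\+0x")


def summarize_oops(log_text: str) -> tuple[Counter, Counter]:
    """Count victim process names and faulting kernel symbols in a dmesg dump."""
    comms: Counter = Counter()
    symbols: Counter = Counter()
    for line in log_text.splitlines():
        if (m := COMM_RE.search(line)):
            comms[m.group(1)] += 1
        if (m := RIP_RE.search(line)):
            symbols[m.group(1)] += 1
    return comms, symbols


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        by_comm, by_symbol = summarize_oops(fh.read())
    print("victim processes:", by_comm.most_common())
    print("faulting symbols:", by_symbol.most_common())
```

The script takes a log file path as its only argument. Because it counts
matching lines rather than delimiting events, warnings or other dumps that
print a Comm: field would also be counted.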

Reporter: Kim ([email protected]), I+D NoMasRuido (acoustic engineering).
Hardware platform: workstation BMetal (192.168.1.220), MSI Z890 + RTX 5080.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: apport-bug arrow-lake hwe-24.04 kernel-oops maple-tree mm needs-triage ollama questing regression runc vma z890

https://bugs.launchpad.net/bugs/2154186

Title:
  Kernel oops in mm/maple_tree.c (mas_walk, mast_fill_bnode, mas_split)
  under fork-heavy workload on Ubuntu 25.10, affects 6.17.0-29 and
  6.17.0-23 kernels
