Public bug reported:

A regression in the kernel's maple tree code (lib/maple_tree.c, which backs VMA tracking in mm) causes recurring kernel oopses on Ubuntu 25.10 (questing) across multiple revisions of the 6.17.0-x-generic kernel under sustained memory-intensive workloads. The bug manifests on both:

- 6.17.0-29-generic (more aggressive): 17 oops events in ~11h, RIP at mas_walk+0x1dc/0x490 during page fault VMA lookup
- 6.17.0-23-generic (less aggressive): 3 oops events in a similar window, RIP at mast_fill_bnode+0x42/0x590 during maple tree node insertion

Reverting to -23 reduces the oops frequency by roughly 80% (17 to 3 events in a comparable window) but does NOT eliminate the bug.

The system is a workstation-class AI/inference host (MSI MPG Z890 CARBON WIFI, Intel Core Ultra 9 285 Arrow Lake-S 24c, 128 GB DDR5, NVIDIA RTX 5080 16 GB) running a stack of Docker containers (RAGflow, Ollama, ComfyUI, JupyterLab, ~15-20 active SCIAs simultaneously).

The triggering workloads share a common pattern: heavy fork/mmap/exec/brk activity from:

- runc (Docker container spawn for new tasks)
- sh (subprocess forking)
- ollama (LLM worker process spawn for model inference)
- python3 (multi-process embeddings, batch RAG ingestion, ML pipelines)
- node + tokio-runtime-w (memos / Node.js workloads with thread pools)

When 5+ such processes simultaneously call into the maple tree code (page fault VMA lookup or VMA range insertion), at least one hits an invalid pointer state and oopses. With kernel.panic=0 and kernel.hung_task_panic=0, the host does NOT auto-reboot; instead the UI freezes (Wayland/Xorg unresponsive, with the desktop still rendered but frozen) and a physical reset is required. This occurred 3 times in 24h (2026-05-24 at 02:08, 02:12, and 13:07) on -29 prior to the revert; since the revert, -23 has produced 3 oopses without a full freeze.

The bug is in the kernel MM core, not in any out-of-tree module. The only OOT module loaded is NVIDIA (nvidia, nvidia_uvm, nvidia_modeset, nvidia_drm via DKMS 580.126.20), but the stack traces show a kernel-space RIP within the maple tree code (lib/maple_tree.c) with no NVIDIA frames involved.

The bug appears specific either to this hardware platform (Arrow Lake-S + Z890 chipset, BIOS AMI 1.A90 12/29/2025) or to this specific kernel revision range: the system ran kernel 6.17.0-22-generic from 2026-05-02 to 2026-05-05 without observed freezes (1.5 days of uptime before the upgrade chain).

Workaround applied:

- GRUB_DEFAULT pinned to 6.17.0-23-generic
- unattended-upgrades disabled
- 92 kernel/NVIDIA/CUDA/python packages held via apt-mark

This stabilizes the system, but the bug persists at reduced frequency on -23, so a kernel fix is needed.
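
The triggering pattern described above (several processes concurrently forking and then creating, touching, and dropping mappings) could be approximated with a small stress script. The following is a minimal sketch only: the function names churn_vmas and run_workers, the mapping size, and the worker and iteration counts are hypothetical choices for illustration, and it has not been confirmed to reproduce the oops on the affected host.

```python
import mmap
import os


def churn_vmas(iterations: int, map_size: int = 1 << 16) -> int:
    """Create, touch, and drop anonymous mappings; return the number of touches."""
    touched = 0
    for _ in range(iterations):
        m = mmap.mmap(-1, map_size)  # anonymous mapping (new VMA in this process)
        m[0] = 1                     # first touch of the first page (page fault)
        m[map_size - 1] = 1          # first touch of the last page (page fault)
        touched += 2
        m.close()                    # unmap (VMA removal)
    return touched


def run_workers(workers: int, iterations: int) -> list[int]:
    """Fork `workers` children that churn mappings concurrently; return exit codes."""
    pids = []
    for _ in range(workers):
        pid = os.fork()
        if pid == 0:
            # Child: do the work and exit without running parent cleanup handlers.
            try:
                churn_vmas(iterations)
                os._exit(0)
            except BaseException:
                os._exit(1)
        pids.append(pid)

    exit_codes = []
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        exit_codes.append(os.waitstatus_to_exitcode(status))
    return exit_codes
```

A call such as run_workers(5, 100_000) mirrors the "5+ processes" threshold mentioned above; the iteration count is an arbitrary starting point. A negative exit code in the returned list means the child was killed by a signal, which is what a kernel oops in the faulting task would typically look like from user space.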

Severity rationale (HIGH): this is a production AI workstation used for technical PDF processing, model fine-tuning, and embeddings indexing for an acoustic consulting business. Every freeze requires a physical reset (the workstation is in a different room from the operator), risks data corruption in the volumes of abruptly stopped containers (MinIO/MySQL/Elasticsearch), and disrupts client-facing deliverables. A reliable fix is required to safely restore the normal workload (RAGflow + concurrent multi-process Python pipelines).

First oops stack trace verbatim, kernel 6.17.0-29-generic, oops [#2] at 2026-05-24 13:01:33 CEST (signature: 5 distinct processes hit mas_walk simultaneously, indicating a kernel bug rather than an application bug):

  Oops: invalid opcode: 0000 [#2] SMP NOPTI
  CPU: 13 UID: 0 PID: 945947 Comm: python3 Tainted: G D OE 6.17.0-29-generic #29-Ubuntu PREEMPT(voluntary)
  Hardware name: Micro-Star International Co., Ltd. MS-7E17/MPG Z890 CARBON WIFI (MS-7E17), BIOS 1.A90 12/29/2025
  RIP: 0010:mas_walk+0x1dc/0x490
  Call Trace:
   <TASK>
   lock_vma_under_rcu+0x60/0x230
   do_user_addr_fault+0x1ec/0x6c0
   exc_page_fault+0x7f/0x1b0
   asm_exc_page_fault+0x27/0x30
   </TASK>

First oops stack trace verbatim, kernel 6.17.0-23-generic, oops [#1] at 2026-05-24 23:03:51 CEST (different code path, same maple tree code):

  Oops: Oops: 0002 [#1] SMP NOPTI
  CPU: 13 UID: 0 PID: 137359 Comm: runc Tainted: G OE 6.17.0-23-generic #23-Ubuntu PREEMPT(voluntary)
  Hardware name: Micro-Star International Co., Ltd. MS-7E17/MPG Z890 CARBON WIFI (MS-7E17), BIOS 1.A90 12/29/2025
  RIP: 0010:mast_fill_bnode+0x42/0x590
  Call Trace:
   <TASK>
   mas_split+0x551/0xcd0
   ? xas_load+0x11/0x100
   mas_wr_bnode+0x7e/0x130
   </TASK>

Full stack traces for all 17 oops events on -29 and all 3 oops events on -23, the complete hardware inventory (dmidecode + lspci + nvidia-smi + smartctl + DKMS status + /proc/cmdline + /etc/default/grub + dpkg + apt-mark holds), victim process tables, and frequency comparison data are available as attachments.
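
For triage, per-symbol event counts like those quoted in this report can be derived mechanically from a dmesg capture. The sketch below is one possible way to do it, not the method used to produce the attached summary: the function name count_oops_by_symbol is hypothetical, and it assumes one log record per line, an "Oops: ... [#N]" header per event (as in the two traces above), and that the first kernel-symbol RIP line after a header belongs to that event.

```python
import re
from collections import Counter

# Oops header carrying the per-boot die counter, e.g. "Oops: Oops: 0002 [#1] SMP NOPTI".
OOPS_RE = re.compile(r"Oops: .*\[#\d+\]")
# Kernel-symbol RIP line, e.g. "RIP: 0010:mas_walk+0x1dc/0x490"; captures the symbol.
RIP_RE = re.compile(r"RIP: [0-9a-f]{4}:([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def count_oops_by_symbol(log_text: str) -> Counter:
    """Count oops events per faulting symbol in dmesg-style text."""
    counts: Counter = Counter()
    awaiting_rip = False
    for line in log_text.splitlines():
        if OOPS_RE.search(line):
            # New event: attribute it to the next symbolic RIP line we see.
            awaiting_rip = True
        elif awaiting_rip:
            match = RIP_RE.search(line)
            if match:
                counts[match.group(1)] += 1
                awaiting_rip = False
    return counts


# Sample built from the two trace excerpts quoted in this report.
SAMPLE = """\
Oops: invalid opcode: 0000 [#2] SMP NOPTI
CPU: 13 UID: 0 PID: 945947 Comm: python3
RIP: 0010:mas_walk+0x1dc/0x490
Oops: Oops: 0002 [#1] SMP NOPTI
CPU: 13 UID: 0 PID: 137359 Comm: runc
RIP: 0010:mast_fill_bnode+0x42/0x590
"""

sample_counts = count_oops_by_symbol(SAMPLE)
```

Against a real capture, the same function could be fed the contents of a file such as dmesg_oops_kernel-6.17.0-29_boot-2.log; user-space RIP lines (raw addresses without a symbol) are skipped by the pattern.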

Please request them and I will attach them to this bug. Attachments prepared (~190 KB total):

- dmesg_oops_kernel-6.17.0-29_boot-2.log (121 KB / 1205 lines, verbatim, 17 events)
- dmesg_oops_kernel-6.17.0-23_boot0.log (47 KB / 333 lines, verbatim, 3 events)
- hardware_inventory.txt (19 KB / complete system inventory)
- oops_summary.txt (2.4 KB / events table + victim processes + frequency comparison)

Reporter: Kim ([email protected]), I+D NoMasRuido (acoustic engineering).
Hardware platform: workstation BMetal (192.168.1.220), MSI Z890 + RTX 5080.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Tags: apport-bug arrow-lake hwe-24.04 kernel-oops maple-tree mm needs-triage ollama questing regression runc vma z890

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2154186

Title:
  Kernel oops in mm/maple_tree.c (mas_walk, mast_fill_bnode, mas_split) under fork-heavy workload on Ubuntu 25.10, affects 6.17.0-29 and 6.17.0-23 kernels

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2154186/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
