full system hang

Jorge Mera Sat, 20 Jun 2026 08:25:43 -0700

Public bug reported:

On an Ubuntu 24.04.4 server running the HWE kernel 6.17.0-35-generic, a normal
userspace process (mattermost, a Go binary that does NOT use the GPU) triggered
a kernel Oops ("unable to handle page fault") in zap_present_ptes while the
process was being torn down on exit. The faulting address is in the vmemmap
(struct page) region and is "not-present", i.e. a bad/stale struct page pointer
was dereferenced during PTE zapping.


The dying task was holding a page-table lock and was inside an RCU read-side
critical section. The kernel printed "Fixing recursive fault but reboot is
needed!", then "BUG: scheduling while atomic" and a WARNING "Voluntary context
switch within RCU read-side critical section!" (kernel/rcu/tree_plugin.h:332).
Because the task died without releasing the RCU read lock and the page-table
lock, RCU grace periods stalled indefinitely and kcompactd0 entered a permanent
soft lockup spinning on the orphaned page-table lock
(native_queued_spin_lock_slowpath under compact_zone -> migrate_pages ->
page_vma_mapped_walk -> map_pte).

The system degraded over ~2.5 hours of RCU stalls followed by ~37 minutes of
soft lockup on CPU#3, then became fully unresponsive: new SSH logins never
completed and Docker health checks timed out. It had to be hard power-reset. The
OOM killer never fired (51 GB RAM free, disks 4-10% used): this was an orphaned
kernel lock, not memory exhaustion.

Single occurrence so far (only one such event in the persistent journal since
2026-05-29); NOT reproducible on demand.
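
To check a journal for recurrence, the four messages seen in this incident can
be counted mechanically. The following is a minimal sketch, not the method used
for this report: the signature strings are copied from the traces in this
report, while the function name count_signatures and the sample lines are
illustrative assumptions.

```python
from collections import Counter
from typing import Iterable

# Signature strings copied from the traces in this report.
SIGNATURES = (
    "BUG: unable to handle page fault",
    "BUG: scheduling while atomic",
    "Voluntary context switch within RCU read-side critical section!",
    "BUG: soft lockup",
)


def count_signatures(lines: Iterable[str]) -> Counter:
    """Count how many log lines contain each known signature (hypothetical helper)."""
    counts = Counter({sig: 0 for sig in SIGNATURES})
    for line in lines:
        for sig in SIGNATURES:
            if sig in line:
                counts[sig] += 1
    return counts


# Sample lines taken from the traces in this report.
sample = [
    "BUG: unable to handle page fault for address: fffffaab0877acc8",
    "BUG: scheduling while atomic: mattermost/3290405/0x00000000",
    "watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kcompactd0:219]",
]
result = count_signatures(sample)
for sig, n in result.items():
    print(f"{n:4d}  {sig}")
```

In practice the lines could come from a saved kernel journal (for example the
output of "journalctl -k -b -1" written to a file); a non-zero count for the
first signature on a later boot would indicate a recurrence.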

ENVIRONMENT
- Ubuntu 24.04.4 LTS
- Kernel: 6.17.0-35-generic #35~24.04.1-Ubuntu (linux-generic-hwe-24.04)
- CPU: Intel Core i9-14900K (32 threads)
- RAM: 64 GB
- Board/BIOS: Gigabyte B760M DS3H WIFI6E GEN5, BIOS F3 (2025-09-18)
- GPU: NVIDIA RTX 5070 Ti, proprietary driver 595.71.05 (nvidia/nvidia_uvm OOT)
- Taint: G D W OE after the incident (G OE at the time of the Oops, see TRACE 1).
  NOTE: there are NO nvidia frames in the crash; it is pure core MM. The crashing
  app does not use the GPU. nvidia modules are merely loaded.
- Affected task: mattermost (Docker container, UID 2000), PID 3290405, on CPU#23

IMPACT
Server degraded and then fully hung for ~13 hours in total (from the Oops at
16:28 until the manual reset the next morning). No clean shutdown recorded in
wtmp/last.

----------------------------------------------------------------
TRACE 1 - primary Oops (root cause), during process teardown
----------------------------------------------------------------
{{{
BUG: unable to handle page fault for address: fffffaab0877acc8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 10bfbc2067 P4D 10bfbc2067 PUD 10bfbc0067 PMD 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 23 UID: 2000 PID: 3290405 Comm: mattermost Tainted: G           OE       6.17.0-35-generic #35~24.04.1-Ubuntu PREEMPT(voluntary)
Hardware name: Gigabyte Technology Co., Ltd. B760M DS3H WIFI6E GEN5, BIOS F3 09/18/2025
RIP: 0010:zap_present_ptes.constprop.0+0x43/0x800
RAX: fffffaab0877acc0 R14: fffffaab0877acc0 CR2: fffffaab0877acc8
Call Trace:
 <TASK>
 zap_pte_range+0x198/0x5a0
 zap_pmd_range.isra.0+0xfc/0x240
 unmap_page_range+0x24d/0x3f0
 unmap_single_vma.isra.0+0x78/0xd0
 unmap_vmas+0x9a/0x180
 exit_mmap+0xf9/0x3f0
 __mmput+0x41/0x150
 mmput+0x31/0x40
 exit_mm+0xe0/0x140
 do_exit+0x1c4/0x480
 do_group_exit+0x34/0x90
 get_signal+0x835/0x840
 arch_do_signal_or_restart+0x41/0x200
 exit_to_user_mode_loop+0x91/0x170
 do_syscall_64+0x198/0xa20
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
 </TASK>
}}}
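
The relationship between the registers in TRACE 1 can be checked with simple
arithmetic. This is a sketch under one stated assumption: that
sizeof(struct page) is 64 bytes, the usual value on x86-64, which has not been
verified against this exact kernel build. The helper name describe_fault is
illustrative.

```python
# Register values copied from TRACE 1.
RAX = 0xFFFFFAAB0877ACC0
CR2 = 0xFFFFFAAB0877ACC8

# ASSUMPTION: sizeof(struct page) is 64 bytes (typical on x86-64, not
# confirmed for this kernel build).
STRUCT_PAGE_SIZE = 64


def describe_fault(base: int, fault: int, elem_size: int = STRUCT_PAGE_SIZE) -> dict:
    """Relate the faulting address to the register that likely held the pointer."""
    return {
        "offset_into_object": fault - base,
        "base_is_aligned": base % elem_size == 0,
    }


info = describe_fault(RAX, CR2)
print(info)
```

Under that assumption, CR2 is RAX + 8 and RAX is aligned to the struct page
size, which is consistent with a read of a field at offset 8 of a struct page
whose pointer was in RAX/R14. Which field that is cannot be determined from the
trace alone.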

----------------------------------------------------------------
TRACE 2 - recursive fault + scheduling while atomic
----------------------------------------------------------------
{{{
note: mattermost[3290405] exited with irqs disabled
note: mattermost[3290405] exited with preempt_count 1
Fixing recursive fault but reboot is needed!
BUG: scheduling while atomic: mattermost/3290405/0x00000000
Call Trace:
 <TASK>
 __schedule_bug+0x64/0x80
 __schedule+0x685/0x7a0
 do_task_dead+0x4a/0x60
 make_task_dead+0x136/0x140
 rewind_stack_and_make_dead+0x16/0x20
 </TASK>
}}}

----------------------------------------------------------------
TRACE 3 - RCU warning (the lock that was never released)
----------------------------------------------------------------
{{{
------------[ cut here ]------------
Voluntary context switch within RCU read-side critical section!
WARNING: CPU: 23 PID: 3290405 at kernel/rcu/tree_plugin.h:332 rcu_note_context_switch+0x2b1/0x2d0
Call Trace:
 <TASK>
 __schedule+0xed/0x7a0
 do_task_dead+0x4a/0x60
 make_task_dead+0x136/0x140
 rewind_stack_and_make_dead+0x16/0x20
 </TASK>
}}}

----------------------------------------------------------------
TRACE 4 - resulting kcompactd soft lockup (cause of the hang)
From 19:01:22, repeating every ~26s, counter 22s -> 2108s until the log stops
at 19:38:48 (system frozen).
----------------------------------------------------------------
{{{
watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kcompactd0:219]
RIP: 0010:native_queued_spin_lock_slowpath+0x81/0x300
Call Trace:
 <TASK>
 _raw_spin_lock+0x3f/0x60
 map_pte+0x74/0x150
 page_vma_mapped_walk+0x318/0x840
 migrate_pages_batch+0x162/0x840
 migrate_pages_sync+0x83/0x1e0
 migrate_pages+0x38d/0x4c0
 compact_zone+0x43a/0x720
 compact_node+0xaf/0x130
 kcompactd+0x38d/0x4f0
 </TASK>
}}}
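
The durations quoted in this report can be cross-checked from the timestamps.
This is a small sketch only: the Oops time is given as 16:28 without seconds,
so ":00" is an assumption, and all three timestamps are assumed to fall on the
same day. The helper name seconds_between is illustrative.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def seconds_between(start: str, end: str) -> int:
    """Seconds from start to end, both HH:MM:SS on the same day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# ASSUMPTION: the Oops is reported only as 16:28, so :00 seconds is assumed.
oops_to_lockup = seconds_between("16:28:00", "19:01:22")
# First soft-lockup report to the last line in the log (from TRACE 4).
lockup_window = seconds_between("19:01:22", "19:38:48")

print(f"Oops to first soft lockup: {oops_to_lockup / 3600:.2f} h")
print(f"Soft lockup window:        {lockup_window / 60:.1f} min")
```

This gives 9202 s (about 2.5 hours) between the Oops and the first soft-lockup
report, and 2246 s (about 37 minutes) of soft-lockup reports before the log
stops.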

----------------------------------------------------------------
ANALYSIS (causal chain)
----------------------------------------------------------------
1. mattermost (PID 3290405) exits; during address-space teardown
   (exit_mmap -> zap_pte_range -> zap_present_ptes) the kernel dereferences an
   invalid struct page (vmemmap 0xfffffaab0877acc8, not-present) -> Oops.
   Suggests corruption or a race in 6.17's batched PTE zapping
   (zap_present_ptes.constprop.0).
2. The Oops happens with a page-table lock held and inside an RCU read-side
   section (irqs disabled, preempt_count 1). The attempt to kill the task
   recurses: "Fixing recursive fault but reboot is needed!".
3. make_task_dead -> do_task_dead -> __schedule with preemption disabled ->
   "scheduling while atomic" + the RCU read-side WARNING. The task dies WITHOUT
   releasing the RCU read lock or the PTL.
4. Consequence: RCU grace periods stall indefinitely (rcu_preempt stalls on the
   dead PID for ~2.5h), and kcompactd0 spins forever on the orphaned page-table
   lock during page migration -> soft lockup on CPU#3 (22s -> 2108s).
5. With CPU#3 monopolized and MM/scheduler degraded, new fork()s (SSH login) and
   Docker health checks stop progressing; the box is unusable until a hard
   reset.

Root-cause hypothesis: a regression/race in the 6.17 memory-unmap path
(zap_present_ptes batched zapping). The crash is pure core MM (no nvidia
frames), although out-of-tree nvidia modules are loaded (taint OE) - maintainers
may ask to reproduce without nvidia.

RULED OUT (with evidence)
- OOM: no oom-kill; 51 GB available.
- Disk full / read-only FS: / 4%, /var/lib/docker 10%; no ext4/NVMe errors.
- GPU hardware fault: no Xid/NVRM; mattermost does not use the GPU.
- Network: the "[UFW BLOCK] ... ff02::1 DPT=10001" lines are harmless IPv6
  multicast, unrelated.

ATTACHMENT
Full kernel journal of the incident boot is attached:
incidente-2026-06-19-kernel.log  (5241 lines).

No kernel vmcore exists for this occurrence (kdump was installed afterwards);
kdump is now enabled (crashkernel reserved) so any recurrence will produce a
vmcore.
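
Whether a crash kernel is actually armed can be verified from two standard
kernel interfaces: the crashkernel= parameter in /proc/cmdline and the flag in
/sys/kernel/kexec_crash_loaded. The following is a minimal sketch, not the
check that was run on this server; the helper names kdump_armed and
read_or_empty are illustrative.

```python
from pathlib import Path


def kdump_armed(cmdline: str, crash_loaded: str) -> bool:
    """True if memory is reserved for a crash kernel and one is loaded."""
    has_reservation = any(tok.startswith("crashkernel=") for tok in cmdline.split())
    return has_reservation and crash_loaded.strip() == "1"


def read_or_empty(path: str) -> str:
    """Read a file, returning an empty string if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    armed = kdump_armed(
        read_or_empty("/proc/cmdline"),
        read_or_empty("/sys/kernel/kexec_crash_loaded"),
    )
    print("kdump armed:", armed)
```

A result of False after a reboot would mean a recurrence might still not
produce a vmcore, even though the kdump packages are installed.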

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: kernel-oops linux-hwe-6.17 noble regression-update.

** Attachment added: "incidente-2026-06-19-kernel.log"
   https://bugs.launchpad.net/bugs/2157705/+attachment/5978182/+files/incidente-2026-06-19-kernel.log

https://bugs.launchpad.net/bugs/2157705

Title:
  linux-hwe-6.17 6.17.0-35: kernel page-fault Oops in zap_present_ptes
  during exit_mmap; dying task leaves page-table lock + RCU read-side
  held -> RCU stalls + kcompactd soft lockup -> full system hang
