[ClusterLabs] Antw: heads up: series of core dumps in SLES15 SP3 ("kernel: BUG: Bad rss-counter state mm:00000000d1a9d1f5 idx:1 val:4")
Hi!

I want to keep you updated: the problem still isn't fixed, so I'm running this simple script via cron to avoid an uncontrolled kernel panic:

---snip---
#!/usr/bin/sh
# Detect RAM corruption. If detected log a message and reboot
# to prevent kernel panic

# cron jobs need a PATH
PATH=/sbin:/usr/sbin:/usr/bin:/bin
if journalctl -b -g 'Code: Bad RIP value|BUG: Bad rss-counter state mm:' >/dev/null
then
    MSG='RAM corruption detected, starting pro-active reboot'
    logger -t reboot-before-panic -p local0.notice "$MSG"
    shutdown -r +1 "$MSG"
fi
---

I still suspect it might be related to snapshots being made. After a few days of running, the problems started again like this:

Mar 26 23:00:01 h19 systemd[1]: Started Timeline of Snapper Snapshots.
Mar 26 23:00:01 h19 dbus-daemon[5700]: [system] Activating via systemd: service name='org.opensuse.Snapper' unit='snapperd.service' requested by ':1.343' (uid=0 pid=11200 comm="/usr/lib/snapper/systemd-helper --timeline ")
Mar 26 23:00:01 h19 systemd[1]: Starting DBus interface for snapper...
Mar 26 23:00:01 h19 dbus-daemon[5700]: [system] Successfully activated service 'org.opensuse.Snapper'
Mar 26 23:00:01 h19 systemd[1]: Started DBus interface for snapper.
Mar 26 23:00:01 h19 systemd[1]: snapper-timeline.service: Succeeded.
Mar 26 23:00:01 h19 systemd[1]: Created slice Slice /system/systemd-coredump.
Mar 26 23:00:01 h19 systemd[1]: Started Process Core Dump (PID 11227/UID 0).
Mar 26 23:00:01 h19 systemd-coredump[11231]: Process 11226 (run-crons) of user 0 dumped core.
Stack trace of thread 11226:
    #0  0x7f89ff9dacdb raise (libc.so.6 + 0x4acdb)
    #1  0x7f89ff9dc324 abort (libc.so.6 + 0x4c324)
    #2  0x7f89ffa20b07 __libc_message (libc.so.6 + 0x90b07)
    #3  0x7f89ffa28b8a malloc_printerr (libc.so.6 + 0x98b8a)
    #4  0x7f89ffa2a634 _int_free (libc.so.6 + 0x9a634)
    #5  0x55c998de3963 command_substitute (bash + 0x9f963)
    #6  0x55c998ddb380 n/a (bash + 0x97380)
    #7  0x55c998ddda57 n/a (bash + 0x99a57)
    #8  0x55c998ddcb94 n/a (bash + 0x98b94)
    #9  0x55c998dc8955 n/a (bash + 0x84955)
    #10 0x55c998dc756d execute_command_internal (bash + 0x8356d)
    #11 0x55c998dc86e1 execute_command (bash + 0x846e1)
    #12 0x55c998dc76fd execute_command_internal (bash + 0x836fd)
    #13 0x55c998dc86e1 execute_command (bash + 0x846e1)
    #14 0x55c998dc8516 execute_command_internal (bash + 0x84516)
    #15 0x55c998dc773c execute_command_internal (bash + 0x8373c)
    #16 0x55c998dc86e1 execute_command (bash + 0x846e1)
    #17 0x55c998dc8007 execute_command_internal (bash + 0x84007)
    #18 0x55c998dc86e1 execute_command (bash + 0x846e1)
    #19 0x55c998dbce2b reader_loop (bash + 0x78e2b)
    #20 0x55c998dbcabc main (bash + 0x78abc)
    #21 0x7f89ff9c52bd __libc_start_main (libc.so.6 + 0x352bd)
    #22 0x55c998df729a _start (bash + 0xb329a)
Mar 26 23:00:01 h19 systemd[1]: systemd-coredump@0-11227-0.service: Succeeded.
Mar 26 23:00:01 h19 kernel: BUG: Bad rss-counter state mm:acc74328 idx:1 val:14
Mar 26 23:01:01 h19 systemd[1]: snapperd.service: Succeeded.
Mar 26 23:05:01 h19 reboot-before-panic[12356]: RAM corruption detected, starting pro-active reboot

Regards,
Ulrich

___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
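The detection step of the cron script can be tried out without touching the journal. The following is only a minimal sketch: the helper names `corruption_seen` and `corruption_count` are hypothetical and not part of the original script, and it assumes that `grep -E` treats the pattern the same way `journalctl -g` does (which should hold for this simple alternation). It reads log text from stdin, so sample lines from the logs above can be fed in directly.

```shell
#!/bin/sh
# Same pattern as in the cron script above.
PATTERN='Code: Bad RIP value|BUG: Bad rss-counter state mm:'

# Hypothetical helper: succeeds if the log text on stdin contains
# one of the two messages the cron job looks for.
corruption_seen() {
    grep -E -q "$PATTERN"
}

# Hypothetical helper: prints the number of matching lines on stdin.
corruption_count() {
    grep -E -c "$PATTERN"
}

# Two sample lines copied from the log excerpt in this thread.
sample='Mar 26 23:00:01 h19 kernel: BUG: Bad rss-counter state mm:acc74328 idx:1 val:14
Mar 26 23:01:01 h19 systemd[1]: snapperd.service: Succeeded.'

if printf '%s\n' "$sample" | corruption_seen
then
    echo "corruption pattern found"
fi
```

On a live system the input would presumably come from something like `journalctl -b -k` piped into `corruption_seen`, instead of the sample text.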
[ClusterLabs] Antw: heads up: series of core dumps in SLES15 SP3 ("kernel: BUG: Bad rss-counter state mm:00000000d1a9d1f5 idx:1 val:4")
Hi!

You may be getting bored, but since the last message we had at least two more kernel crashes, one of them on a machine with the latest updates (firmware and SLES), and the vendor diagnostics found no hardware problem.

The latest kdump (20 minutes ago) was:
---
[259832.062351] BTRFS info (device dm-0): qgroup scan completed (inconsistency flag cleared)
[263221.648397] BUG: Bad rss-counter state mm:276fd188 idx:1 val:2
[266675.397157] BUG: Bad rss-counter state mm:3a0ed3fb idx:1 val:235
[272206.129167] BUG: kernel NULL pointer dereference, address: 0008
[272206.129183] #PF: supervisor write access in kernel mode
[272206.129187] #PF: error_code(0x0002) - not-present page
[272206.129190] PGD 0 P4D 0
[272206.129196] Oops: 0002 [#1] SMP NOPTI
[272206.129202] CPU: 2 PID: 417 Comm: kswapd0 Tainted: G X 5.3.18-150300.59.49-default #1 SLE15-SP3
[272206.129209] Hardware name: Dell Inc. PowerEdge R7415/07YXFK, BIOS 1.17.0 07/30/2021
[272206.129219] RIP: e030:down_read_trylock+0x18/0x50
[272206.129224] Code: 20 48 c7 47 08 00 00 00 00 c7 47 10 00 00 00 00 c3 90 0f 1f 44 00 00 31 c0 48 b9 07 00 00 00 00 00 00 80 48 8d 90 00 01 00 00 48 0f b1 17 75 21 48 8b 47 08 65 48 8b 14 25 c0 8b 01 00 83 e0
[272206.129233] RSP: e02b:c90040cafc10 EFLAGS: 00010246
[272206.129237] RAX: RBX: 88812435e4d1 RCX: 8007
[272206.129241] RDX: 0100 RSI: c90040cafca8 RDI: 0008
[272206.129245] RBP: ea0009608000 R08: 0067 R09: 000557878d4d
[272206.129250] R10: R11: 0003 R12: 88812435e4d0
[272206.129254] R13: 0008 R14: ea0009608008 R15: ea0009608000
[272206.129263] FS: () GS:88839148() knlGS:
[272206.129267] CS: e030 DS: ES: CR0: 80050033
[272206.129271] CR2: 0008 CR3: 00033e414000 CR4: 00050660
[272206.129276] Call Trace:
[272206.129286] page_lock_anon_vma_read+0x48/0xe0
[272206.129292] rmap_walk_anon+0x16c/0x250
[272206.129297] page_referenced+0xd5/0x170
[272206.129301] ? rmap_walk_anon+0x250/0x250
[272206.129305] ? page_get_anon_vma+0x80/0x80
[272206.129311] shrink_active_list+0x2dd/0x490
[272206.129316] balance_pgdat+0x50b/0x630
[272206.129321] kswapd+0x14b/0x3d0
[272206.129326] ? wait_woken+0x80/0x80
[272206.129330] ? balance_pgdat+0x630/0x630
[272206.129335] kthread+0x10d/0x130
[272206.129339] ? kthread_park+0xa0/0xa0
[272206.129345] ret_from_fork+0x22/0x40
...
[272206.129468] Supported: Yes, External
[272206.129474] CR2: 0008
[272206.129508] ---[ end trace 92e4c491ab44e169 ]---
[272206.129512] RIP: e030:down_read_trylock+0x18/0x50
[272206.129516] Code: 20 48 c7 47 08 00 00 00 00 c7 47 10 00 00 00 00 c3 90 0f 1f 44 00 00 31 c0 48 b9 07 00 00 00 00 00 00 80 48 8d 90 00 01 00 00 48 0f b1 17 75 21 48 8b 47 08 65 48 8b 14 25 c0 8b 01 00 83 e0
[272206.129525] RSP: e02b:c90040cafc10 EFLAGS: 00010246
[272206.129528] RAX: RBX: 88812435e4d1 RCX: 8007
[272206.129533] RDX: 0100 RSI: c90040cafca8 RDI: 0008
[272206.129537] RBP: ea0009608000 R08: 0067 R09: 000557878d4d
[272206.129541] R10: R11: 0003 R12: 88812435e4d0
[272206.129545] R13: 0008 R14: ea0009608008 R15: ea0009608000
[272206.129552] FS: () GS:88839148() knlGS:
[272206.129557] CS: e030 DS: ES: CR0: 80050033
[272206.129561] CR2: 0008 CR3: 00033e414000 CR4: 00050660
[272206.129566] Kernel panic - not syncing: Fatal exception
[272206.896249] Kernel Offset: disabled
---

Regards,
Ulrich

>>> Ulrich Windl wrote on 01.03.2022 at 09:48 in message <621A.166 : 161 : 60728>:
> Hi!
>
> I want to give an update on this issue (support is working on it):
>
> First I recommend everyone using Xen and a Dell PowerEdge R7415 _not_ to
> upgrade to SLES15 SP3, as we have about one crash per node and week.
> We had one last night a few minutes after BtrFS balance had finished.
> Meanwhile we also had a few crash dumps (kdump), and the call stacks are
> like this:
>
> (5.3.18-150300.59.43-default)
> [1175886.947081] ocfs2: Mounting device (9,10) on (node 116, slot 1) with
> ordered data mode.
> [1175905.783132] general protection fault: [#1] SMP NOPTI
> [1175905.785704] RIP: e030:down_read_trylock+0x18/0x50
> [1175905.798982] Call Trace:
> [1175905.800305] page_lock_anon_vma_read+0x48/0xe0
> [1175905.801659] rmap_walk_anon+0x16c/0x250
> [1175905.803021] page_referenced+0xd5/0x170
> [1175905.804254] ? rmap_walk_anon+0x250/0x250
> [1175905.805377] ? page_get_anon_vma+0x80/0x80
> [1175905.806593] shrink_active_list+0x2dd/0x490
> [1175905.807920] shrink_lruvec+0x4aa/0x6e0
> [1175905.809253] ? free_unref_page_list+0x16f/0x180
> [1175905.810460] ?
[ClusterLabs] Antw: heads up: series of core dumps in SLES15 SP3 ("kernel: BUG: Bad rss-counter state mm:00000000d1a9d1f5 idx:1 val:4")
Hi!

I want to give an update on this issue (support is working on it):

First, I recommend that everyone using Xen and a Dell PowerEdge R7415 _not_ upgrade to SLES15 SP3, as we have about one crash per node per week.
We had one last night, a few minutes after a BtrFS balance had finished.
Meanwhile we also got a few crash dumps (kdump), and the call stacks look like this:

(5.3.18-150300.59.43-default)
[1175886.947081] ocfs2: Mounting device (9,10) on (node 116, slot 1) with ordered data mode.
[1175905.783132] general protection fault: [#1] SMP NOPTI
[1175905.785704] RIP: e030:down_read_trylock+0x18/0x50
[1175905.798982] Call Trace:
[1175905.800305] page_lock_anon_vma_read+0x48/0xe0
[1175905.801659] rmap_walk_anon+0x16c/0x250
[1175905.803021] page_referenced+0xd5/0x170
[1175905.804254] ? rmap_walk_anon+0x250/0x250
[1175905.805377] ? page_get_anon_vma+0x80/0x80
[1175905.806593] shrink_active_list+0x2dd/0x490
[1175905.807920] shrink_lruvec+0x4aa/0x6e0
[1175905.809253] ? free_unref_page_list+0x16f/0x180
[1175905.810460] ? free_unref_page_list+0x16f/0x180
[1175905.811024] ? shrink_node+0x143/0x600
[1175905.811593] shrink_node+0x143/0x600
[1175905.812149] balance_pgdat+0x28a/0x630
[1175905.812716] kswapd+0x14b/0x3d0
[1175905.813263] ? wait_woken+0x80/0x80
[1175905.813793] ? balance_pgdat+0x630/0x630
[1175905.814319] kthread+0x10d/0x130
[1175905.814847] ? kthread_park+0xa0/0xa0
[1175905.815386] ret_from_fork+0x22/0x40

(5.3.18-150300.59.49-default)
[27926.595977] BUG: kernel NULL pointer dereference, address: 0007
[27926.597124] #PF: supervisor write access in kernel mode
[27926.598197] #PF: error_code(0x0002) - not-present page
[27926.599265] PGD 0 P4D 0
[27926.600322] Oops: 0002 [#1] SMP NOPTI
[27926.603924] RIP: e030:down_read_trylock+0x18/0x50
[27926.618618] Call Trace:
[27926.619138] page_lock_anon_vma_read+0x48/0xe0
[27926.619665] rmap_walk_anon+0x16c/0x250
[27926.620182] page_referenced+0xd5/0x170
[27926.620705] ? rmap_walk_anon+0x250/0x250
[27926.621214] ? page_get_anon_vma+0x80/0x80
[27926.621772] shrink_active_list+0x2dd/0x490
[27926.622280] balance_pgdat+0x50b/0x630
[27926.622793] kswapd+0x14b/0x3d0
[27926.623308] ? wait_woken+0x80/0x80
[27926.623822] ? balance_pgdat+0x630/0x630
[27926.624329] kthread+0x10d/0x130
[27926.624843] ? kthread_park+0xa0/0xa0
[27926.625577] ret_from_fork+0x22/0x40

(5.3.18-150300.59.49-default)
[566428.257264] BTRFS info (device dm-0): scrub: finished on devid 1 with status: 0
[571252.379396] ping[3707]: segfault at 0 ip sp 7fffbae8dc10 error 14 in bash[55a3a9d26000+f1000]
[571252.379410] Code: Bad RIP value.
[571252.948920] BUG: Bad rss-counter state mm:4d568db6 idx:1 val:4
[571262.985375] general protection fault: [#1] SMP NOPTI
[571262.985410] RIP: e030:down_read_trylock+0x18/0x50
[571262.985470] Call Trace:
[571262.985481] page_lock_anon_vma_read+0x48/0xe0
[571262.985487] rmap_walk_anon+0x16c/0x250
[571262.985492] page_referenced+0xd5/0x170
[571262.985496] ? rmap_walk_anon+0x250/0x250
[571262.985500] ? page_get_anon_vma+0x80/0x80
[571262.985506] shrink_active_list+0x2dd/0x490
[571262.985512] shrink_lruvec+0x4aa/0x6e0
[571262.985517] ? free_unref_page_list+0x16f/0x180
[571262.985522] ? free_unref_page_list+0x16f/0x180
[571262.985526] ? shrink_node+0x143/0x600
[571262.985529] shrink_node+0x143/0x600
[571262.985534] balance_pgdat+0x28a/0x630
[571262.985539] kswapd+0x14b/0x3d0
[571262.985544] ? wait_woken+0x80/0x80
[571262.985548] ? balance_pgdat+0x630/0x630
[571262.985553] kthread+0x10d/0x130
[571262.985557] ? kthread_park+0xa0/0xa0
[571262.985563] ret_from_fork+0x22/0x40

(5.3.18-150300.59.49-default)
[22707.270890] #PF: supervisor write access in kernel mode
[22707.271539] #PF: error_code(0x0002) - not-present page
[22707.272159] PGD 0 P4D 0
[22707.272786] Oops: 0002 [#1] SMP NOPTI
[22707.274680] RIP: e030:down_read_trylock+0x18/0x50
[22707.282129] Call Trace:
[22707.282682] page_lock_anon_vma_read+0x48/0xe0
[22707.283225] rmap_walk_anon+0x16c/0x250
[22707.283772] ? uncharge_batch+0xe3/0x180
[22707.284307] try_to_unmap+0x93/0xf0
[22707.284835] ? page_remove_rmap+0x2c0/0x2c0
[22707.285375] ? page_not_mapped+0x20/0x20
[22707.285906] ? page_get_anon_vma+0x80/0x80
[22707.286428] ? invalid_mkclean_vma+0x20/0x20
[22707.286974] migrate_pages+0x857/0xb50
[22707.287494] ? isolate_freepages_block+0x370/0x370
[22707.288013] ? move_freelist_tail+0xd0/0xd0
[22707.288538] compact_zone+0x775/0xd90
[22707.289059] kcompactd_do_work+0xfe/0x2a0
[22707.289576] ? xen_load_sp0+0x7a/0x160
[22707.290096] ? __set_cpus_allowed_ptr+0xb5/0x1e0
[22707.290623] ? kcompactd_do_work+0x2a0/0x2a0
[22707.291158] ? kcompactd+0x84/0x1e0
[22707.291703] kcompactd+0x84/0x1e0
[22707.292243] ? wait_woken+0x80/0x80
[22707.292788] kthread+0x10d/0x130
[22707.293329] ? kthread_park+0xa0/0xa0
[22707.293868] ret_from_fork+0x22/0x40

All those dumps happened with BIOS 1.17.0; today I realized that