[ClusterLabs] Antw: heads up: series of core dumps in SLES15 SP3 ("kernel: BUG: Bad rss-counter state mm:00000000d1a9d1f5 idx:1 val:4")

2022-03-28 Thread Ulrich Windl
Hi!

I want to keep you updated: the problem still isn't fixed, so I'm running this
simple script via cron to avoid an uncontrolled kernel panic:
---snip---
#!/usr/bin/sh
# Detect RAM corruption. If detected, log a message and reboot
# to prevent a kernel panic.

# cron jobs need a PATH
PATH=/sbin:/usr/sbin:/usr/bin:/bin
# Search the current boot's journal for either kernel message.
if journalctl -b -g 'Code: Bad RIP value|BUG: Bad rss-counter state mm:' >/dev/null
then
    MSG='RAM corruption detected, starting pro-active reboot'
    logger -t reboot-before-panic -p local0.notice "$MSG"
    shutdown -r +1 "$MSG"
fi
---
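
To check the pattern without waiting for a real incident, the same alternation
can be run against saved log text. This is only a minimal sketch: it assumes
the pattern behaves the same as a POSIX extended regular expression under
`grep -E` (journalctl -g does its own matching), and the function name
`has_corruption_marker` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if stdin contains one of the two kernel
# messages the cron script looks for. Uses grep -E instead of journalctl -g
# so it can be run against saved log text.
has_corruption_marker() {
    grep -Eq 'Code: Bad RIP value|BUG: Bad rss-counter state mm:'
}

# Example: a line copied from the kernel log matches, so this prints "match".
if printf '%s\n' 'kernel: BUG: Bad rss-counter state mm:acc74328 idx:1 val:14' | has_corruption_marker
then
    echo match
fi
```

Ordinary journal lines, such as the snapper service messages, do not match and
make the function return a non-zero status.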

I still suspect it might be related to snapshots being made. After a few days
of running, the problems started again like this:
Mar 26 23:00:01 h19 systemd[1]: Started Timeline of Snapper Snapshots.
Mar 26 23:00:01 h19 dbus-daemon[5700]: [system] Activating via systemd: service name='org.opensuse.Snapper' unit='snapperd.service' requested by ':1.343' (uid=0 pid=11200 comm="/usr/lib/snapper/systemd-helper --timeline ")
Mar 26 23:00:01 h19 systemd[1]: Starting DBus interface for snapper...
Mar 26 23:00:01 h19 dbus-daemon[5700]: [system] Successfully activated service 'org.opensuse.Snapper'
Mar 26 23:00:01 h19 systemd[1]: Started DBus interface for snapper.
Mar 26 23:00:01 h19 systemd[1]: snapper-timeline.service: Succeeded.
Mar 26 23:00:01 h19 systemd[1]: Created slice Slice /system/systemd-coredump.
Mar 26 23:00:01 h19 systemd[1]: Started Process Core Dump (PID 11227/UID 0).
Mar 26 23:00:01 h19 systemd-coredump[11231]: Process 11226 (run-crons) of user 0 dumped core.

  Stack trace of thread 11226:
  #0  0x7f89ff9dacdb raise (libc.so.6 + 0x4acdb)
  #1  0x7f89ff9dc324 abort (libc.so.6 + 0x4c324)
  #2  0x7f89ffa20b07 __libc_message (libc.so.6 + 0x90b07)
  #3  0x7f89ffa28b8a malloc_printerr (libc.so.6 + 0x98b8a)
  #4  0x7f89ffa2a634 _int_free (libc.so.6 + 0x9a634)
  #5  0x55c998de3963 command_substitute (bash + 0x9f963)
  #6  0x55c998ddb380 n/a (bash + 0x97380)
  #7  0x55c998ddda57 n/a (bash + 0x99a57)
  #8  0x55c998ddcb94 n/a (bash + 0x98b94)
  #9  0x55c998dc8955 n/a (bash + 0x84955)
  #10 0x55c998dc756d execute_command_internal (bash + 0x8356d)
  #11 0x55c998dc86e1 execute_command (bash + 0x846e1)
  #12 0x55c998dc76fd execute_command_internal (bash + 0x836fd)
  #13 0x55c998dc86e1 execute_command (bash + 0x846e1)
  #14 0x55c998dc8516 execute_command_internal (bash + 0x84516)
  #15 0x55c998dc773c execute_command_internal (bash + 0x8373c)
  #16 0x55c998dc86e1 execute_command (bash + 0x846e1)
  #17 0x55c998dc8007 execute_command_internal (bash + 0x84007)
  #18 0x55c998dc86e1 execute_command (bash + 0x846e1)
  #19 0x55c998dbce2b reader_loop (bash + 0x78e2b)
  #20 0x55c998dbcabc main (bash + 0x78abc)
  #21 0x7f89ff9c52bd __libc_start_main (libc.so.6 + 0x352bd)
  #22 0x55c998df729a _start (bash + 0xb329a)
Mar 26 23:00:01 h19 systemd[1]: systemd-coredump@0-11227-0.service: Succeeded.
Mar 26 23:00:01 h19 kernel: BUG: Bad rss-counter state mm:acc74328 idx:1 val:14
Mar 26 23:01:01 h19 systemd[1]: snapperd.service: Succeeded.
Mar 26 23:05:01 h19 reboot-before-panic[12356]: RAM corruption detected, starting pro-active reboot

Regards,
Ulrich




[ClusterLabs] Antw: heads up: series of core dumps in SLES15 SP3 ("kernel: BUG: Bad rss-counter state mm:00000000d1a9d1f5 idx:1 val:4")

2022-03-08 Thread Ulrich Windl
Hi!

You may be getting bored, but since the last message we have had at least two
more kernel crashes, one of them on a machine with the latest updates (firmware
and SLES), and the vendor diagnostics found no hardware problem.
The latest kdump (20 minutes ago) was:


---
[259832.062351] BTRFS info (device dm-0): qgroup scan completed (inconsistency flag cleared)
[263221.648397] BUG: Bad rss-counter state mm:276fd188 idx:1 val:2
[266675.397157] BUG: Bad rss-counter state mm:3a0ed3fb idx:1 val:235
[272206.129167] BUG: kernel NULL pointer dereference, address: 0008
[272206.129183] #PF: supervisor write access in kernel mode
[272206.129187] #PF: error_code(0x0002) - not-present page
[272206.129190] PGD 0 P4D 0
[272206.129196] Oops: 0002 [#1] SMP NOPTI
[272206.129202] CPU: 2 PID: 417 Comm: kswapd0 Tainted: G   X 5.3.18-150300.59.49-default #1 SLE15-SP3
[272206.129209] Hardware name: Dell Inc. PowerEdge R7415/07YXFK, BIOS 1.17.0 07/30/2021
[272206.129219] RIP: e030:down_read_trylock+0x18/0x50
[272206.129224] Code: 20 48 c7 47 08 00 00 00 00 c7 47 10 00 00 00 00 c3 90 0f 1f 44 00 00 31 c0 48 b9 07 00 00 00 00 00 00 80 48 8d 90 00 01 00 00  48 0f b1 17 75 21 48 8b 47 08 65 48 8b 14 25 c0 8b 01 00 83 e0
[272206.129233] RSP: e02b:c90040cafc10 EFLAGS: 00010246
[272206.129237] RAX:  RBX: 88812435e4d1 RCX: 8007
[272206.129241] RDX: 0100 RSI: c90040cafca8 RDI: 0008
[272206.129245] RBP: ea0009608000 R08: 0067 R09: 000557878d4d
[272206.129250] R10:  R11: 0003 R12: 88812435e4d0
[272206.129254] R13: 0008 R14: ea0009608008 R15: ea0009608000
[272206.129263] FS:  () GS:88839148() knlGS:
[272206.129267] CS:  e030 DS:  ES:  CR0: 80050033
[272206.129271] CR2: 0008 CR3: 00033e414000 CR4: 00050660
[272206.129276] Call Trace:
[272206.129286]  page_lock_anon_vma_read+0x48/0xe0
[272206.129292]  rmap_walk_anon+0x16c/0x250
[272206.129297]  page_referenced+0xd5/0x170
[272206.129301]  ? rmap_walk_anon+0x250/0x250
[272206.129305]  ? page_get_anon_vma+0x80/0x80
[272206.129311]  shrink_active_list+0x2dd/0x490
[272206.129316]  balance_pgdat+0x50b/0x630
[272206.129321]  kswapd+0x14b/0x3d0
[272206.129326]  ? wait_woken+0x80/0x80
[272206.129330]  ? balance_pgdat+0x630/0x630
[272206.129335]  kthread+0x10d/0x130
[272206.129339]  ? kthread_park+0xa0/0xa0
[272206.129345]  ret_from_fork+0x22/0x40
...
[272206.129468] Supported: Yes, External
[272206.129474] CR2: 0008
[272206.129508] ---[ end trace 92e4c491ab44e169 ]---
[272206.129512] RIP: e030:down_read_trylock+0x18/0x50
[272206.129516] Code: 20 48 c7 47 08 00 00 00 00 c7 47 10 00 00 00 00 c3 90 0f 1f 44 00 00 31 c0 48 b9 07 00 00 00 00 00 00 80 48 8d 90 00 01 00 00  48 0f b1 17 75 21 48 8b 47 08 65 48 8b 14 25 c0 8b 01 00 83 e0
[272206.129525] RSP: e02b:c90040cafc10 EFLAGS: 00010246
[272206.129528] RAX:  RBX: 88812435e4d1 RCX: 8007
[272206.129533] RDX: 0100 RSI: c90040cafca8 RDI: 0008
[272206.129537] RBP: ea0009608000 R08: 0067 R09: 000557878d4d
[272206.129541] R10:  R11: 0003 R12: 88812435e4d0
[272206.129545] R13: 0008 R14: ea0009608008 R15: ea0009608000
[272206.129552] FS:  () GS:88839148() knlGS:
[272206.129557] CS:  e030 DS:  ES:  CR0: 80050033
[272206.129561] CR2: 0008 CR3: 00033e414000 CR4: 00050660
[272206.129566] Kernel panic - not syncing: Fatal exception
[272206.896249] Kernel Offset: disabled
---
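
In this dump the first "Bad rss-counter" message is logged roughly two and a
half hours before the panic. A rough sketch for measuring that gap from a saved
dmesg excerpt follows; the function name `rss_to_panic_gap` and reading the log
from stdin are assumptions, not part of any existing tool.

```shell
#!/bin/sh
# Hypothetical helper: print the whole seconds between the first
# "Bad rss-counter" message and the "Kernel panic" line, based on the
# [seconds.microseconds] stamps at the start of each dmesg line.
rss_to_panic_gap() {
    awk '
        # Strip the surrounding brackets from the first field to get the time.
        { t = $1; gsub(/\[|\]/, "", t) }
        # Remember only the earliest rss-counter message.
        /BUG: Bad rss-counter state/ && first == "" { first = t }
        # On the panic line, print the difference and stop.
        /Kernel panic/ && first != "" { printf "%d\n", t - first; exit }
    '
}
```

Usage would be something like `rss_to_panic_gap < saved-dmesg.txt`, where the
file name is a placeholder. It prints nothing if either message is missing.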

Regards,
Ulrich

>>> Ulrich Windl wrote on 01.03.2022 at 09:48 in message <621A.166 : 161 : 60728>:
> Hi!
> 
> I want to give an update on this issue (support is working on it):
> 
> First I recommend everyone using Xen and a Dell PowerEdge R7415 _not_ to 
> upgrade to SLES15 SP3, as we have about one crash per node and week.
> We had one last night a few minutes after BtrFS balance had finished.
> Meanwhile we also had a few crash dumps (kdump), and the call stacks are 
> like this:
> 
> (5.3.18-150300.59.43-default)
> [1175886.947081] ocfs2: Mounting device (9,10) on (node 116, slot 1) with 
> ordered data mode.
> [1175905.783132] general protection fault:  [#1] SMP NOPTI
> [1175905.785704] RIP: e030:down_read_trylock+0x18/0x50
> [1175905.798982] Call Trace:
> [1175905.800305]  page_lock_anon_vma_read+0x48/0xe0
> [1175905.801659]  rmap_walk_anon+0x16c/0x250
> [1175905.803021]  page_referenced+0xd5/0x170
> [1175905.804254]  ? rmap_walk_anon+0x250/0x250
> [1175905.805377]  ? page_get_anon_vma+0x80/0x80
> [1175905.806593]  shrink_active_list+0x2dd/0x490
> [1175905.807920]  shrink_lruvec+0x4aa/0x6e0
> [1175905.809253]  ? free_unref_page_list+0x16f/0x180
> [1175905.810460]  ? 

[ClusterLabs] Antw: heads up: series of core dumps in SLES15 SP3 ("kernel: BUG: Bad rss-counter state mm:00000000d1a9d1f5 idx:1 val:4")

2022-03-01 Thread Ulrich Windl
Hi!

I want to give an update on this issue (support is working on it):

First, I recommend that everyone using Xen and a Dell PowerEdge R7415 _not_
upgrade to SLES15 SP3, as we have about one crash per node per week.
We had one last night, a few minutes after a Btrfs balance had finished.
Meanwhile we also got a few crash dumps (kdump), and the call stacks look like
this:

(5.3.18-150300.59.43-default)
[1175886.947081] ocfs2: Mounting device (9,10) on (node 116, slot 1) with ordered data mode.
[1175905.783132] general protection fault:  [#1] SMP NOPTI
[1175905.785704] RIP: e030:down_read_trylock+0x18/0x50
[1175905.798982] Call Trace:
[1175905.800305]  page_lock_anon_vma_read+0x48/0xe0
[1175905.801659]  rmap_walk_anon+0x16c/0x250
[1175905.803021]  page_referenced+0xd5/0x170
[1175905.804254]  ? rmap_walk_anon+0x250/0x250
[1175905.805377]  ? page_get_anon_vma+0x80/0x80
[1175905.806593]  shrink_active_list+0x2dd/0x490
[1175905.807920]  shrink_lruvec+0x4aa/0x6e0
[1175905.809253]  ? free_unref_page_list+0x16f/0x180
[1175905.810460]  ? free_unref_page_list+0x16f/0x180
[1175905.811024]  ? shrink_node+0x143/0x600
[1175905.811593]  shrink_node+0x143/0x600
[1175905.812149]  balance_pgdat+0x28a/0x630
[1175905.812716]  kswapd+0x14b/0x3d0
[1175905.813263]  ? wait_woken+0x80/0x80
[1175905.813793]  ? balance_pgdat+0x630/0x630
[1175905.814319]  kthread+0x10d/0x130
[1175905.814847]  ? kthread_park+0xa0/0xa0
[1175905.815386]  ret_from_fork+0x22/0x40

(5.3.18-150300.59.49-default)
[27926.595977] BUG: kernel NULL pointer dereference, address: 0007
[27926.597124] #PF: supervisor write access in kernel mode
[27926.598197] #PF: error_code(0x0002) - not-present page
[27926.599265] PGD 0 P4D 0
[27926.600322] Oops: 0002 [#1] SMP NOPTI
[27926.603924] RIP: e030:down_read_trylock+0x18/0x50
[27926.618618] Call Trace:
[27926.619138]  page_lock_anon_vma_read+0x48/0xe0
[27926.619665]  rmap_walk_anon+0x16c/0x250
[27926.620182]  page_referenced+0xd5/0x170
[27926.620705]  ? rmap_walk_anon+0x250/0x250
[27926.621214]  ? page_get_anon_vma+0x80/0x80
[27926.621772]  shrink_active_list+0x2dd/0x490
[27926.622280]  balance_pgdat+0x50b/0x630
[27926.622793]  kswapd+0x14b/0x3d0
[27926.623308]  ? wait_woken+0x80/0x80
[27926.623822]  ? balance_pgdat+0x630/0x630
[27926.624329]  kthread+0x10d/0x130
[27926.624843]  ? kthread_park+0xa0/0xa0
[27926.625577]  ret_from_fork+0x22/0x40

(5.3.18-150300.59.49-default)
[566428.257264] BTRFS info (device dm-0): scrub: finished on devid 1 with status: 0
[571252.379396] ping[3707]: segfault at 0 ip  sp 7fffbae8dc10 error 14 in bash[55a3a9d26000+f1000]
[571252.379410] Code: Bad RIP value.
[571252.948920] BUG: Bad rss-counter state mm:4d568db6 idx:1 val:4
[571262.985375] general protection fault:  [#1] SMP NOPTI
[571262.985410] RIP: e030:down_read_trylock+0x18/0x50
[571262.985470] Call Trace:
[571262.985481]  page_lock_anon_vma_read+0x48/0xe0
[571262.985487]  rmap_walk_anon+0x16c/0x250
[571262.985492]  page_referenced+0xd5/0x170
[571262.985496]  ? rmap_walk_anon+0x250/0x250
[571262.985500]  ? page_get_anon_vma+0x80/0x80
[571262.985506]  shrink_active_list+0x2dd/0x490
[571262.985512]  shrink_lruvec+0x4aa/0x6e0
[571262.985517]  ? free_unref_page_list+0x16f/0x180
[571262.985522]  ? free_unref_page_list+0x16f/0x180
[571262.985526]  ? shrink_node+0x143/0x600
[571262.985529]  shrink_node+0x143/0x600
[571262.985534]  balance_pgdat+0x28a/0x630
[571262.985539]  kswapd+0x14b/0x3d0
[571262.985544]  ? wait_woken+0x80/0x80
[571262.985548]  ? balance_pgdat+0x630/0x630
[571262.985553]  kthread+0x10d/0x130
[571262.985557]  ? kthread_park+0xa0/0xa0
[571262.985563]  ret_from_fork+0x22/0x40

(5.3.18-150300.59.49-default)
[22707.270890] #PF: supervisor write access in kernel mode
[22707.271539] #PF: error_code(0x0002) - not-present page
[22707.272159] PGD 0 P4D 0
[22707.272786] Oops: 0002 [#1] SMP NOPTI
[22707.274680] RIP: e030:down_read_trylock+0x18/0x50
[22707.282129] Call Trace:
[22707.282682]  page_lock_anon_vma_read+0x48/0xe0
[22707.283225]  rmap_walk_anon+0x16c/0x250
[22707.283772]  ? uncharge_batch+0xe3/0x180
[22707.284307]  try_to_unmap+0x93/0xf0
[22707.284835]  ? page_remove_rmap+0x2c0/0x2c0
[22707.285375]  ? page_not_mapped+0x20/0x20
[22707.285906]  ? page_get_anon_vma+0x80/0x80
[22707.286428]  ? invalid_mkclean_vma+0x20/0x20
[22707.286974]  migrate_pages+0x857/0xb50
[22707.287494]  ? isolate_freepages_block+0x370/0x370
[22707.288013]  ? move_freelist_tail+0xd0/0xd0
[22707.288538]  compact_zone+0x775/0xd90
[22707.289059]  kcompactd_do_work+0xfe/0x2a0
[22707.289576]  ? xen_load_sp0+0x7a/0x160
[22707.290096]  ? __set_cpus_allowed_ptr+0xb5/0x1e0
[22707.290623]  ? kcompactd_do_work+0x2a0/0x2a0
[22707.291158]  ? kcompactd+0x84/0x1e0
[22707.291703]  kcompactd+0x84/0x1e0
[22707.292243]  ? wait_woken+0x80/0x80
[22707.292788]  kthread+0x10d/0x130
[22707.293329]  ? kthread_park+0xa0/0xa0
[22707.293868]  ret_from_fork+0x22/0x40
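
All four stacks above fault in the same function. To confirm that across a
larger set of saved logs, the "RIP:" lines can be tallied. This is a minimal
sketch, assuming the logs are fed on stdin; `count_rip_functions` is a
hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: count how often each faulting function appears in the
# "RIP:" lines of saved kernel logs read from stdin.
count_rip_functions() {
    # Drop everything up to the code-segment prefix (e.g. "e030:"), then the
    # "+offset/size" suffix, leaving only the symbol name; then tally.
    sed -n 's/.*RIP: [0-9a-f]*:\([A-Za-z0-9_.]*\)+.*/\1/p' | sort | uniq -c
}
```

Note that a full oops prints the RIP line twice (once before and once after the
"end trace" marker), so the numbers count log lines, not crashes.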

All those dumps happened with BIOS 1.17.0; today I realized that