Bug#1036644: linux-image-6.1.0-9-amd64: System crashes. Netconsole reports CPUs not responding to MCE broadcast

2023-08-09 Thread Ben Hutchings
On Wed, 2023-08-09 at 13:35 +0200, kolafl...@kolahilft.de wrote:
> I was able to resolve the issue with my ASRock J4105-ITX board (Celeron 
> J4105) on Debian-12. I just had to
>   apt install intel-microcode
> and reboot. After 4 weeks of running the system 23/7 I had no further 
> crashes.
> 
> Without that intel-microcode package
>   /sys/devices/system/cpu/cpu*/microcode/version
> is 0x26 (38 decimal).
> And with intel-microcode-3.20230512.1 the version is 0x3e (decimal 62).
> 
> 
> I'm not sure why intel-microcode was not installed by default. All my 
> other Debian computers have that package installed automatically.
[...]

We didn't use to install intel-microcode by default because it's non-
free.  Starting with Debian 12, non-free firmware and microcode are
installed on systems where they are useful, but upgrades from older
versions won't change this.

But I'm fairly sure what you've found is a different issue from what
Olivier originally reported.

Ben.

-- 
Ben Hutchings
[W]e found...that it wasn't as easy to get programs right as we had
thought. I realized that a large part of my life from then on was going
to be spent in finding mistakes in my own programs.
 - Maurice Wilkes, 1949






2023-08-09 Thread kolafl...@kolahilft.de
I was able to resolve the issue with my ASRock J4105-ITX board (Celeron 
J4105) on Debian-12. I just had to

  apt install intel-microcode
and reboot. After 4 weeks of running the system 23/7 I had no further 
crashes.


Without that intel-microcode package
  /sys/devices/system/cpu/cpu*/microcode/version
is 0x26 (38 decimal).
And with intel-microcode-3.20230512.1 the version is 0x3e (decimal 62).
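
For anyone wanting to compare these values on their own machine, here is a minimal Python sketch that reads the same sysfs files and prints each revision in hex and decimal. Only the sysfs path comes from the text above; the function names are made up for illustration.

```python
import glob
import os


def read_microcode_versions(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: revision} for every CPU exposing microcode/version."""
    versions = {}
    pattern = os.path.join(sysfs_root, "cpu*", "microcode", "version")
    for path in sorted(glob.glob(pattern)):
        # Path looks like <root>/cpu0/microcode/version, so [-3] is "cpu0".
        cpu = path.split(os.sep)[-3]
        with open(path) as fh:
            # The kernel reports the revision as a hex string such as "0x26".
            versions[cpu] = int(fh.read().strip(), 16)
    return versions


def format_revision(rev):
    """Render a revision the way it is quoted in this thread."""
    return f"{rev:#x} ({rev} decimal)"
```

Calling read_microcode_versions() with no argument reads the live system; passing each value to format_revision() gives strings such as "0x26 (38 decimal)".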


I'm not sure why intel-microcode was not installed by default. All my 
other Debian computers have that package installed automatically. I had 
some very rare crashes with Debian-11, so maybe Debian-12 just worsened 
an issue that already existed before the upgrade.


In theory a BIOS update might also do the job. I run BIOS 1.40 and there 
is a 1.60 version from 2021. But 1.60 does not list a microcode update.

https://www.asrock.com/mb/Intel/J4105-ITX/#BIOS
(haven't updated to BIOS 1.60 yet)


Kind regards,
kolAflash





2023-06-12 Thread Olivier Berger
On Mon, Jun 12, 2023 at 12:49:25PM +0200, Olivier Berger wrote:
> Hi.
> 
> I can confirm the reproduction of the same kind of crash, this time without 
> wifi activated.
> 
> It seems to occur whenever I'm away from the machine for a while, probably 
> linked to the screen saver kicking in.
> 

For the record, the video card is reported as "00:02.0 VGA compatible 
controller: Intel Corporation TigerLake-LP GT2 [Iris Xe Graphics] (rev 01)". 
The laptop (an HP ProBook) has an HDMI port, to which I connect a screen 
via an HDMI-to-DisplayPort adapter.

I suspect some kind of weird corner case linked to that adapter, which is the 
"HDMI to DisplayPort Adapter - 4K Ready" from Cable Matters 
(https://www.cablematters.com/pc-825-139-hdmi-to-displayport-adapter-4k-ready.aspx).

Just my 2 more cents,

-- 
Olivier BERGER
https://www-public.imtbs-tsp.eu/~berger_o/ - OpenPGP 2048R/0xF9EAE3A65819D7E8
Ingenieur Recherche - Dept INF
Institut Mines-Telecom, Telecom SudParis, Evry (France)




2023-06-12 Thread Olivier Berger
Hi.

I can confirm the reproduction of the same kind of crash, this time without 
wifi activated.

It seems to occur whenever I'm away from the machine for a while, probably 
linked to the screen saver kicking in.

Hope this helps,

On Fri, Jun 09, 2023 at 08:58:52AM +0200, Olivier Berger wrote:
> 
> As a follow-up, I got another crash, this time with netconsole on, and
> captured a bunch of traces (see the attached logs).
> 
> Hope this helps identify the culprit... probably i915/drm?
> 
> The title of the bug report should be changed, but I'm not sure how best to 
> retitle.
> 

-- 
Olivier BERGER
https://www-public.imtbs-tsp.eu/~berger_o/ - OpenPGP 2048R/0xF9EAE3A65819D7E8
Ingenieur Recherche - Dept INF
Institut Mines-Telecom, Telecom SudParis, Evry (France)
[ 1192.922330] netpoll: netconsole: local port 
[ 1192.922341] netpoll: netconsole: local IPv4 address 192.168.1.32
[ 1192.922345] netpoll: netconsole: interface 'enp2s0'
[ 1192.922347] netpoll: netconsole: remote port 
[ 1192.922350] netpoll: netconsole: remote IPv4 address 192.168.1.25
[ 1192.922352] netpoll: netconsole: remote ethernet address 38:2c:4a:b1:63:94
[ 1192.922461] printk: console [netcon0] enabled
[ 1192.922468] netconsole: network logging started
[ 1793.154776] mce: CPU#1: Unexpected int18 (Machine Check)
[ 1793.154809] mce: CPU#5: Unexpected int18 (Machine Check)
[ 1794.400586] ------------[ cut here ]------------
[ 1794.400600] DPLL 0 assertion failure (expected on, current off)
[ 1794.400763] WARNING: CPU: 1 PID: 1163 at 
drivers/gpu/drm/i915/display/intel_dpll_mgr.c:191 
assert_shared_dpll+0x10a/0x120 [i915]
[ 1794.400977] Modules linked in: netconsole xt_conntrack nft_chain_nat 
xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink 
br_netfilter bridge stp llc vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) ctr ccm 
rfcomm snd_seq_dummy snd_hrtimer snd_seq cmac algif_hash algif_skcipher af_alg 
qrtr cpufreq_ondemand cpufreq_conservative cpufreq_powersave overlay squashfs 
cpufreq_userspace bnep binfmt_misc snd_ctl_led snd_soc_skl_hda_dsp 
snd_soc_intel_hda_dsp_common snd_soc_hdac_hdmi snd_sof_probes 
snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio 
snd_soc_dmic snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel 
soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci 
snd_sof_xtensa_dsp iwlmvm snd_sof snd_sof_utils snd_soc_hdac_hda 
x86_pkg_temp_thermal snd_hda_ext_core intel_powerclamp snd_soc_acpi_intel_match 
coretemp mac80211 snd_soc_acpi snd_soc_core
[ 1794.401079]  mei_hdcp snd_compress soundwire_bus intel_rapl_msr btusb 
libarc4 btrtl pmt_telemetry pmt_class btbcm btintel btmtk bluetooth kvm_intel 
snd_hda_intel snd_intel_dspcfg iwlwifi kvm uvcvideo jitterentropy_rng 
snd_intel_sdw_acpi irqbypass cfg80211 snd_usb_audio videobuf2_vmalloc 
snd_hda_codec videobuf2_memops drbg snd_usbmidi_lib videobuf2_v4l2 ansi_cprng 
snd_hda_core hp_wmi processor_thermal_device_pci_legacy rapl snd_rawmidi 
processor_thermal_device videobuf2_common nls_ascii platform_profile 
ecdh_generic snd_hwdep iTCO_wdt snd_seq_device processor_thermal_rfim 
intel_cstate ucsi_acpi intel_uncore snd_pcm videodev snd_timer typec_ucsi 
pcspkr nls_cp437 processor_thermal_mbox processor_thermal_rapl intel_pmc_bxt 
vfat mei_me snd roles intel_rapl_common iTCO_vendor_support fat wmi_bmof ee1004 
mc int3403_thermal watchdog soundcore ecc mei rfkill intel_vsec typec joydev 
igen6_edac intel_soc_dts_iosf int340x_thermal_zone ac intel_hid int3400_thermal 
intel_pmc_core acpi_thermal_rel
[ 1794.401185]  sparse_keymap acpi_pad hid_multitouch evdev serio_raw nfsd 
auth_rpcgss msr parport_pc nfs_acl ppdev lockd lp grace parport fuse loop 
dm_mod efi_pstore configfs sunrpc ip_tables x_tables autofs4 ext4 crc16 mbcache 
jbd2 hid_logitech_hidpp hid_logitech_dj usbhid btrfs blake2b_generic 
zstd_compress efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq 
async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath 
linear md_mod i915 nvme drm_buddy crc32_pclmul i2c_algo_bit crc32c_intel 
nvme_core drm_display_helper t10_pi hid_generic xhci_pci ghash_clmulni_intel 
xhci_hcd cec crc64_rocksoft_generic rc_core rtsx_pci_sdmmc crc64_rocksoft 
crc_t10dif ttm crct10dif_generic mmc_core usbcore i2c_i801 drm_kms_helper r8169 
intel_lpss_pci intel_lpss i2c_hid_acpi realtek i2c_hid crct10dif_pclmul crc64 
mdio_devres aesni_intel drm crypto_simd cryptd libphy rtsx_pci i2c_smbus 
crct10dif_common idma64 vmd usb_common battery hid video wmi button sha512_ssse3
[ 1794.401312]  sha512_generic
[ 1794.401328] CPU: 1 PID: 1163 Comm: Xorg Tainted: G   OE  
6.1.0-9-amd64 #1  Debian 6.1.27-1
[ 1794.401339] Hardware name: HP HP ProBook 450 G8 Notebook PC/87E1, BIOS T70 
Ver. 01.13.01 03/30/2023
[ 1794.401346] RIP: 0010:assert_shared_dpll+0x10a/0x120 [i915]
[ 1794.401579] Code: ed 48 


2023-06-09 Thread Olivier Berger
Hi.

As a follow-up, I got another crash, this time with netconsole on, and
captured a bunch of traces (see the attached logs).

Hope this helps identify the culprit... probably i915/drm?

The title of the bug report should be changed, but I'm not sure how best to 
retitle.

Best regards,

On Wed, May 24, 2023 at 01:35:31PM +0200, Olivier Berger wrote:
> The i915 hint is interesting.
> 
> Salvatore Bonaccorso  writes:
> 
> >
> > Would you be able to bisect the changes between 6.1.20 and 6.1.27 to
> > identify the culprit, even though the crash is not instantly triggerable?
> > Maybe focusing around the i915 changes; I stumbled over a2b6e99d8a62
> > ("drm/i915: Disable DC states for all commits") which was backported
> > to 6.1.23.
> >

-- 
Olivier BERGER
https://www-public.imtbs-tsp.eu/~berger_o/ - OpenPGP 2048R/0xF9EAE3A65819D7E8
Ingenieur Recherche - Dept INF
Institut Mines-Telecom, Telecom SudParis, Evry (France)
[  118.158855] netpoll: netconsole: local port 
[  118.158865] netpoll: netconsole: local IPv4 address 192.168.0.35
[  118.158870] netpoll: netconsole: interface 'wlp0s20f3'
[  118.158872] netpoll: netconsole: remote port 
[  118.158874] netpoll: netconsole: remote IPv4 address 192.168.0.47
[  118.158877] netpoll: netconsole: remote ethernet address 38:2c:4a:b1:63:94
[  118.159010] ------------[ cut here ]------------
[  118.159012] WARNING: CPU: 3 PID: 3290 at net/mac80211/tx.c:3723 
ieee80211_tx_dequeue+0xcb3/0xd30 [mac80211]
[  118.159102] Modules linked in: netconsole(+) xt_conntrack nft_chain_nat 
xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink 
br_netfilter bridge stp llc vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) ctr ccm 
rfcomm snd_seq_dummy snd_hrtimer snd_seq cmac algif_hash algif_skcipher af_alg 
squashfs cpufreq_ondemand qrtr cpufreq_conservative cpufreq_powersave overlay 
bnep cpufreq_userspace hid_logitech_hidpp binfmt_misc nls_ascii nls_cp437 vfat 
fat snd_ctl_led snd_soc_skl_hda_dsp snd_soc_intel_hda_dsp_common 
snd_soc_hdac_hdmi snd_sof_probes snd_hda_codec_hdmi hid_logitech_dj 
snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_soc_dmic 
snd_sof_pci_intel_tgl iwlmvm snd_sof_intel_hda_common btusb btrtl btbcm btintel 
soundwire_intel btmtk soundwire_generic_allocation mac80211 soundwire_cadence 
snd_sof_intel_hda snd_sof_pci bluetooth snd_sof_xtensa_dsp snd_sof 
snd_usb_audio snd_sof_utils
[  118.159163]  snd_soc_hdac_hda libarc4 snd_hda_ext_core 
snd_soc_acpi_intel_match snd_usbmidi_lib snd_soc_acpi snd_rawmidi 
x86_pkg_temp_thermal intel_powerclamp usbhid snd_seq_device snd_soc_core 
iwlwifi coretemp snd_compress jitterentropy_rng soundwire_bus joydev drbg 
snd_hda_intel mei_hdcp kvm_intel snd_intel_dspcfg snd_intel_sdw_acpi 
pmt_telemetry snd_hda_codec intel_rapl_msr pmt_class ansi_cprng uvcvideo 
cfg80211 kvm snd_hda_core hp_wmi videobuf2_vmalloc snd_hwdep platform_profile 
irqbypass ecdh_generic videobuf2_memops videobuf2_v4l2 snd_pcm 
processor_thermal_device_pci_legacy processor_thermal_device rapl 
processor_thermal_rfim videobuf2_common processor_thermal_mbox snd_timer 
iTCO_wdt intel_cstate processor_thermal_rapl ucsi_acpi intel_uncore videodev 
typec_ucsi intel_pmc_bxt snd iTCO_vendor_support roles mei_me intel_rapl_common 
mc pcspkr ecc wmi_bmof ee1004 watchdog soundcore mei rfkill intel_vsec 
igen6_edac typec intel_soc_dts_iosf int3403_thermal int340x_thermal_zone
[  118.159218]  int3400_thermal acpi_thermal_rel intel_hid sparse_keymap 
intel_pmc_core acpi_pad ac hid_multitouch serio_raw evdev nfsd msr parport_pc 
auth_rpcgss ppdev nfs_acl lockd lp grace parport fuse loop dm_mod efi_pstore 
configfs sunrpc ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs 
blake2b_generic zstd_compress efivarfs raid10 raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic 
raid1 raid0 multipath linear md_mod i915 drm_buddy i2c_algo_bit crc32_pclmul 
drm_display_helper nvme crc32c_intel nvme_core cec hid_generic rc_core 
rtsx_pci_sdmmc t10_pi ghash_clmulni_intel mmc_core ttm i2c_hid_acpi 
crc64_rocksoft_generic crc64_rocksoft drm_kms_helper r8169 crc_t10dif realtek 
xhci_pci mdio_devres crct10dif_generic i2c_hid aesni_intel xhci_hcd 
intel_lpss_pci crct10dif_pclmul crypto_simd i2c_i801 intel_lpss crc64 cryptd 
drm usbcore i2c_smbus libphy rtsx_pci crct10dif_common idma64 usb_common vmd 
hid battery video wmi button
[  118.159289]  sha512_ssse3 sha512_generic
[  118.159292] CPU: 3 PID: 3290 Comm: modprobe Tainted: G   OE  
6.1.0-9-amd64 #1  Debian 6.1.27-1
[  118.159298] Hardware name: HP HP ProBook 450 G8 Notebook PC/87E1, BIOS T70 
Ver. 01.13.01 03/30/2023
[  118.159300] RIP: 0010:ieee80211_tx_dequeue+0xcb3/0xd30 [mac80211]
[  118.159374] Code: ff ff 01 ce 48 89 ef 29 d6 e8 09 ab 35 d5 48 85 c0 0f 84 
23 f8 ff ff 0f b7 85 b8 00 00 00 48 03 85 c8 00 00 00 e9 f5 f7 ff ff <0f> 0b e9 
ab f3 ff ff 


2023-05-24 Thread Olivier Berger
Hi.

I'm afraid this would be well beyond my abilities, sorry.

The i915 hint is interesting.

Thanks.

Best regards,

Salvatore Bonaccorso  writes:

> Control: tags -1 + moreinfo
>
> Hi Olivier,
>
>
> Would you be able to bisect the changes between 6.1.20 and 6.1.27 to
> identify the culprit, even though the crash is not instantly triggerable?
> Maybe focusing around the i915 changes; I stumbled over a2b6e99d8a62
> ("drm/i915: Disable DC states for all commits") which was backported
> to 6.1.23.
>
> Regards,
> Salvatore
>

-- 
Olivier BERGER
https://www-public.imtbs-tsp.eu/~berger_o/ - OpenPGP 2048R/0xF9EAE3A65819D7E8
Ingenieur Recherche - Dept INF
Institut Mines-Telecom, Telecom SudParis, Evry (France)




2023-05-24 Thread Salvatore Bonaccorso
Control: tags -1 + moreinfo

Hi Olivier,

On Tue, May 23, 2023 at 06:49:00PM +0200, Olivier Berger wrote:
> Package: src:linux
> Version: 6.1.27-1
> Severity: normal
> 
> Hi.
> 
> I'm experiencing crashes (computer reset or completely shutting down) without 
> much detail available on why. It used to work fine with 6.1.0-7 but has had 
> problems with the 2 later updates of the testing kernel.
> 
> I've managed to get a log of the kernel panic with netconsole (otherwise I 
> wouldn't get any hints whatsoever in the logs on disk after restarting), below.
> 
> I guess this is nasty as being close to the freeze. I've had the issue for a 
> few days now, but only managed to test a netconsole remote log today.
> 
> It seems to me that the crash mainly happens when I'm away from the laptop for 
> several minutes, so maybe related to some kind of energy saving stuff...
> 
> Hope this provides enough details to help.
> 
> [...]


2023-05-24 Thread Olivier Berger
Hi.

Diederik de Haas  writes:

>
> The stack traces should be useful for someone who understands those (which
> isn't me), but I did notice several other items:
>
> - [  465.284645] GPT: Use GNU Parted to correct GPT errors
> That happened after you plugged in a USB drive?
> I would follow that advice, but it would be useful to get that USB drive
> 'out of the equation'.
> Does the issue also occur when that USB drive isn't used?
> The kernel seems to assign both sda and sdb before settling on sda(1)?
> Not sure what to make of that, but it doesn't look good
>

I guess the USB drive has nothing to do with the issue, AFAIU. Actually,
I just wanted to be sure that netconsole was indeed capturing kernel
events, as suggested by a howto on remote debugging of kernel panics
with netconsole. And FYI, this is a USB key that embeds an SD card
reader, hence the 2 drives that pop up... as for GPT, dunno, maybe a
formatting mistake.
In any case, the laptop has also crashed in the past when no such USB key
was plugged in.
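
For reference, the receiving side of such a netconsole setup can be as simple as a UDP listener. The following is a minimal Python sketch, not the tool actually used here: the function names are made up, and the port is left as a parameter because it must match whatever was passed to the netconsole module on the laptop.

```python
import socket


def open_listener(bind_addr, port):
    """Bind a UDP socket on which netconsole datagrams can be received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def receive_messages(sock, max_datagrams, timeout=None):
    """Return up to max_datagrams payloads as text; stop early on timeout."""
    sock.settimeout(timeout)
    messages = []
    for _ in range(max_datagrams):
        try:
            data, _sender = sock.recvfrom(65535)
        except socket.timeout:
            break
        # Kernel messages are plain text; replace any undecodable bytes.
        messages.append(data.decode("utf-8", errors="replace"))
    return messages
```

Plugging in a USB key, as I did, is then an easy way to check that messages really arrive before waiting for the next crash.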

> - [  535.857315] EXT4-fs (dm-0): recovery complete
> I can understand a FS recovery when you're dealing with a freeze/crash,
> but I find the timing a 'bit' unusual. After 9.5 minutes, I doubt it's the
> primary/boot drive (and we had the USB drive before that), so where
> is that coming from?
>

That's a LUKS partition that I mounted after a while; it holds secrets
stored on the hard drive in a dedicated partition. As the laptop crashed
during the previous session with the partition mounted, that explains the
FS recovery at mount time.

Nothing strange here either.

> - [  543.576681] systemd-journald[428]: Sent WATCHDOG=1 notification
> I'm not really sure what that means, but afaik a watchdog is used to
> (automatically) reboot the machine if the system hangs.
> So seeing that message numerous times is worrisome. And it looks like it
> doesn't do its actual job?
>

I booted with 'debug ignore_loglevel' as kernel arguments... maybe that
explains the occurrence of such logs... dunno exactly if this is
worrisome.

> - BIOS T70 Ver. 01.13.01 03/30/2023
> Can you check whether there is a newer BIOS version available?
> I believe 'NMI' is BIOS related, so it may have an effect.

I updated the HP BIOS to the latest available version within the last day,
but crashes were occurring before that too... maybe related, but nothing
more can be updated for the moment, at least from what the Windows HP
Support Assistant shows.

Thanks for your help.

Best regards,

-- 
Olivier BERGER
https://www-public.imtbs-tsp.eu/~berger_o/ - OpenPGP 2048R/0xF9EAE3A65819D7E8
Ingenieur Recherche - Dept INF
Institut Mines-Telecom, Telecom SudParis, Evry (France)




2023-05-23 Thread Diederik de Haas
Control: found -1 6.1.25-1
Control: retitle -1 Kernel panic - not syncing: Timeout: Not all CPUs entered 
broadcast exception handler

On Tuesday, 23 May 2023 18:49:00 CEST Olivier Berger wrote:
> It used to work fine with 6.1.0-7 but has had problems with the 2 later
> updates of the testing kernel.

The stack traces should be useful for someone who understands those (which
isn't me), but I did notice several other items:

- [  465.284645] GPT: Use GNU Parted to correct GPT errors
That happened after you plugged in a USB drive?
I would follow that advice, but it would be useful to get that USB drive
'out of the equation'.
Does the issue also occur when that USB drive isn't used?
The kernel seems to assign both sda and sdb before settling on sda(1)?
Not sure what to make of that, but it doesn't look good

- [  535.857315] EXT4-fs (dm-0): recovery complete
I can understand a FS recovery when you're dealing with a freeze/crash,
but I find the timing a 'bit' unusual. After 9.5 minutes, I doubt it's the
primary/boot drive (and we had the USB drive before that), so where
is that coming from?

- [  543.576681] systemd-journald[428]: Sent WATCHDOG=1 notification
I'm not really sure what that means, but afaik a watchdog is used to
(automatically) reboot the machine if the system hangs.
So seeing that message numerous times is worrisome. And it looks like it
doesn't do its actual job?

- BIOS T70 Ver. 01.13.01 03/30/2023
Can you check whether there is a newer BIOS version available?
I believe 'NMI' is BIOS related, so it may have an effect.
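
The bracketed numbers in the log lines listed above are kernel timestamps in seconds since boot. A minimal Python sketch for turning them into minutes follows; the function names are hypothetical, and it only handles lines that start with a "[ seconds.micros]" prefix.

```python
import re

# Matches the "[  535.857315] message" prefix used in the log lines above.
_LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def parse_kernel_line(line):
    """Return (seconds_since_boot, message), or None if there is no timestamp."""
    match = _LINE_RE.match(line)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


def minutes_since_boot(line):
    """Convert the timestamp of a kernel log line to minutes since boot."""
    parsed = parse_kernel_line(line)
    return None if parsed is None else parsed[0] / 60.0
```

For the EXT4 line, this puts the recovery at roughly 8.9 minutes after boot.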




2023-05-23 Thread Olivier Berger
Hi.

To provide a bit more useful detail: the last version that worked fine is 
linux-image-6.1.0-7-amd64 (6.1.20-2).

Sorry about the lack of clarity in the initial report.

On Tue, May 23, 2023 at 06:49:00PM +0200, Olivier Berger wrote:
> 
> I'm experiencing crashes (computer reset or completely shutting down) without 
> much detail available on why. It used to work fine with 6.1.0-7 but has had 
> problems with the 2 later updates of the testing kernel.
> 
> I've managed to get a log of the kernel panic with netconsole (otherwise I 
> wouldn't get any hints whatsoever in the logs on disk after restarting), below.
> 
> I guess this is nasty as being close to the freeze. I've had the issue for a 
> few days now, but only managed to test a netconsole remote log today.
> 
> It seems to me that the crash mainly happens when I'm away from the laptop for 
> several minutes, so maybe related to some kind of energy saving stuff...
> 
> Hope this provides enough details to help.
> 

-- 
Olivier BERGER
https://www-public.imtbs-tsp.eu/~berger_o/ - OpenPGP 2048R/0xF9EAE3A65819D7E8
Ingenieur Recherche - Dept INF
Institut Mines-Telecom, Telecom SudParis, Evry (France)




2023-05-23 Thread Olivier Berger
Package: src:linux
Version: 6.1.27-1
Severity: normal

Hi.

I'm experiencing crashes (computer reset or completely shutting down) without 
much detail available on why. It used to work fine with 6.1.0-7 but has had 
problems with the 2 later updates of the testing kernel.

I've managed to get a log of the kernel panic with netconsole (otherwise I 
wouldn't get any hints whatsoever in the logs on disk after restarting), below.

I guess this is nasty as being close to the freeze. I've had the issue for a 
few days now, but only managed to test a netconsole remote log today.

It seems to me that the crash mainly happens when I'm away from the laptop for 
several minutes, so maybe related to some kind of energy saving stuff...

Hope this provides enough details to help.

[  394.735702] netpoll: netconsole: local port 
[  394.735711] netpoll: netconsole: local IPv4 address 192.168.0.23
[  394.735715] netpoll: netconsole: interface 'enp2s0'
[  394.735717] netpoll: netconsole: remote port 
[  394.735719] netpoll: netconsole: remote IPv4 address 192.168.0.47
[  394.735722] netpoll: netconsole: remote ethernet address 38:2c:4a:b1:63:94
[  394.735819] printk: console [netcon0] enabled
[  394.735825] netconsole: network logging started
[  463.655009] usb 3-6: new high-speed USB device number 8 using xhci_hcd
[  463.659448] systemd-journald[428]: Sent WATCHDOG=1 notification.
[  463.943099] usb 3-6: New USB device found, idVendor=1307, idProduct=0190, 
bcdDevice= 1.00
[  463.943133] usb 3-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[  463.943144] usb 3-6: Product: USB Mass Storage Device
[  463.943153] usb 3-6: Manufacturer: USBest Technology
[  463.943160] usb 3-6: SerialNumber: 00027F
[  463.974560] systemd-journald[428]: Successfully sent stream file descriptor 
to service manager.
[  463.974717] systemd-journald[428]: Successfully sent stream file descriptor 
to service manager.
[  463.987184] SCSI subsystem initialized
[  463.990687] usb-storage 3-6:1.0: USB Mass Storage device detected
[  463.990771] scsi host0: usb-storage 3-6:1.0
[  463.990859] usbcore: registered new interface driver usb-storage
[  463.992482] usbcore: registered new interface driver uas
[  464.995952] scsi 0:0:0:0: Direct-Access Ut190USB2FlashStorage 0.00 
PQ: 0 ANSI: 2
[  464.996613] scsi 0:0:0:1: Direct-Access Ut190SD0StorageDevice 0.00 
PQ: 0 ANSI: 2
[  465.008300] scsi 0:0:0:0: Attached scsi generic sg0 type 0
[  465.008343] scsi 0:0:0:1: Attached scsi generic sg1 type 0
[  465.014353] sd 0:0:0:0: [sda] 7897088 512-byte logical blocks: (4.04 GB/3.77 
GiB)
[  465.014619] sd 0:0:0:1: [sdb] Media removed, stopped polling
[  465.014756] sd 0:0:0:0: [sda] Write Protect is off
[  465.014764] sd 0:0:0:0: [sda] Mode Sense: 00 00 00 00
[  465.014804] sd 0:0:0:1: [sdb] Attached SCSI removable disk
[  465.014951] sd 0:0:0:0: [sda] Asking for cache data failed
[  465.014957] sd 0:0:0:0: [sda] Assuming drive cache: write through
[  465.284600] GPT:Primary header thinks Alt. header is not at the end of the 
disk.
[  465.284627] GPT:2590719 != 7897087
[  465.284634] GPT:Alternate GPT header not at the end of the disk.
[  465.284640] GPT:2590719 != 7897087
[  465.284645] GPT: Use GNU Parted to correct GPT errors.
[  465.284659]  sda: sda1
[  465.285144] sd 0:0:0:0: [sda] Attached SCSI removable disk
[  474.111368] systemd-journald[428]: Successfully sent stream file descriptor 
to service manager.
[  497.264500] sda: detected capacity change from 7897088 to 0
[  502.045711] usb 3-6: USB disconnect, device number 8
[  519.695345] systemd-journald[428]: Successfully sent stream file descriptor 
to service manager.
[  535.857315] EXT4-fs (dm-0): recovery complete
[  535.858056] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Quota 
mode: none.
[  543.576681] systemd-journald[428]: Sent WATCHDOG=1 notification.
[  551.263395] systemd-journald[428]: Successfully sent stream file descriptor 
to service manager.
[  634.375963] systemd-journald[428]: Sent WATCHDOG=1 notification.
[  725.578095] systemd-journald[428]: Sent WATCHDOG=1 notification.
[  845.577721] systemd-journald[428]: Sent WATCHDOG=1 notification.
[  871.117193] systemd-journald[428]: Successfully sent stream file descriptor 
to service manager.
[  905.577391] systemd-journald[428]: Sent WATCHDOG=1 notification.
[  905.620289] systemd-journald[428]: Successfully sent stream file descriptor 
to service manager.
[  905.623541] systemd-journald[428]: Successfully sent stream file descriptor 
to service manager.
[  995.577111] systemd-journald[428]: Sent WATCHDOG=1 notification.
[ 1085.576193] systemd-journald[428]: Sent WATCHDOG=1 notification.
[ 1205.575316] systemd-journald[428]: Sent WATCHDOG=1 notification.
[ 1265.574866] systemd-journald[428]: Sent WATCHDOG=1 notification.
[ 1305.267119] mce: CPUs not responding to MCE broadcast (may include false 
positives): 0-1,3-5,7
[ 1305.267121] mce: CPUs not responding to MCE broadcast (may include