[Kernel-packages] [Bug 1922387] Re: BUG: kernel NULL pointer dereference, address: 0000000000000050

2021-04-14 Ian
Also worth mentioning: we are only seeing this on the A100. Neither our
automated testing nor our manual testing of ftrace saw any issues on
DGX2.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1922387

Title:
  BUG: kernel NULL pointer dereference, address: 0000000000000050

Status in linux package in Ubuntu:
  Incomplete
Status in linux source package in Focal:
  Confirmed
Status in linux source package in Groovy:
  Incomplete
Status in linux source package in Hirsute:
  Incomplete

Bug description:
  I observed the following kernel panic with the 5.4.0-71.79-generic
  kernel while running kernel selftests:

  blanka login: [ 1671.958400] mmiotrace: Error taking CPU253 down: -28
  [ 1672.118199] mmiotrace: Error taking CPU254 down: -28
  [ 1672.230306] mmiotrace: Error taking CPU255 down: -28
  [ 2503.359753] BUG: kernel NULL pointer dereference, address: 0000000000000050
  [ 2503.367527] #PF: supervisor read access in kernel mode
  [ 2503.373257] #PF: error_code(0x0000) - not-present page
  [ 2503.378989] PGD 0 P4D 0
  [ 2503.381812] Oops: 0000 [#1] SMP NOPTI
  [ 2503.385896] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G   OE 5.4.0-71-generic #79-Ubuntu
  [ 2503.395795] Hardware name: NVIDIA DGXA100 920-23687-2530-000/DGXA100, BIOS 0.33 01/19/2021
  [ 2503.405027] RIP: 0010:trace_event_raw_event_wbt_timer+0x6f/0x100
  [ 2503.411728] Code: 59 80 e5 02 0f 85 8f 00 00 00 4c 89 e6 ba 34 00 00 00 48 8d 7d a0 e8 d0 a4 ca ff 49 89 c4 48 85 c0 74 37 49 8b 87 b8 03 00 00 <48> 8b 70 50 48 85 f6 74 45 49 8d 7c 24 08 ba 20 00 00 00 e8 59 91
  [ 2503.432683] RSP: 0018:a8d6c0003d90 EFLAGS: 00010286
  [ 2503.438513] RAX:  RBX:  RCX: 8100
  [ 2503.446474] RDX: 9968a228f418 RSI: 0100 RDI: 9968a228f414
  [ 2503.454436] RBP: a8d6c0003df8 R08: 9968a228f414 R09: 0100
  [ 2503.462394] R10: 0007 R11: 0007 R12: 9968a228f418
  [ 2503.470353] R13: fffa R14: 0003 R15: 9a686f9b3000
  [ 2503.478316] FS:  () GS:99690cc0() knlGS:
  [ 2503.487342] CS:  0010 DS:  ES:  CR0: 80050033
  [ 2503.493752] CR2: 0000000000000050 CR3: 007e08ad6000 CR4: 00340ef0
  [ 2503.501712] Call Trace:
  [ 2503.504438]  <IRQ>
  [ 2503.506682]  wb_timer_fn+0x1d6/0x3c0
  [ 2503.510672]  ? blk_stat_free_callback_rcu+0x30/0x30
  [ 2503.516112]  blk_stat_timer_fn+0x134/0x140
  [ 2503.520683]  call_timer_fn+0x32/0x130
  [ 2503.524768]  __run_timers.part.0+0x180/0x280
  [ 2503.529535]  ? trace_event_raw_event_softirq+0x5d/0xa0
  [ 2503.535267]  run_timer_softirq+0x2a/0x50
  [ 2503.539644]  __do_softirq+0xe1/0x2d6
  [ 2503.543629]  irq_exit+0xae/0xb0
  [ 2503.547132]  smp_apic_timer_interrupt+0x7b/0x140
  [ 2503.552280]  apic_timer_interrupt+0xf/0x20
  [ 2503.556848]  </IRQ>
  [ 2503.559187] RIP: 0010:native_safe_halt+0xe/0x10
  [ 2503.564239] Code: 7b ff ff ff eb bd 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d 66 dd 52 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 56 dd 52 00 fb f4  90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 e8 cd cd 63 ff 65
  [ 2503.585191] RSP: 0018:94803e18 EFLAGS: 0202 ORIG_RAX: ff13
  [ 2503.593635] RAX: 0001e7c0 RBX: 996849080de8 RCX: 00149022
  [ 2503.601595] RDX: 00149022 RSI:  RDI: 948c5ba0
  [ 2503.609556] RBP: 94803e38 R08: 02a8 R09: 9968a228f000
  [ 2503.617516] R10:  R11: 0002 R12:
  [ 2503.625475] R13:  R14:  R15:
  [ 2503.633440]  ? default_idle+0x20/0x140
  [ 2503.637623]  arch_cpu_idle+0x15/0x20
  [ 2503.641608]  default_idle_call+0x23/0x30
  [ 2503.645984]  do_idle+0x1fb/0x270
  [ 2503.649583]  cpu_startup_entry+0x20/0x30
  [ 2503.653960]  rest_init+0xae/0xb0
  [ 2503.657563]  arch_call_rest_init+0xe/0x1b
  [ 2503.662025]  start_kernel+0x549/0x56a
  [ 2503.666108]  x86_64_start_reservations+0x24/0x26
  [ 2503.671258]  x86_64_start_kernel+0x75/0x79
  [ 2503.675828]  secondary_startup_64+0xa4/0xb0
  [ 2503.680493] Modules linked in: sch_etf sch_fq dccp_ipv6 dccp_ipv4 dccp 
ip6table_nat iptable_nat xt_nat nf_nat algif_hash af_alg ip6table_filter 
xt_conntrack nf_conntrack nf_defrag_ipv4 ip6_tables nf_defrag_ipv6 ip_vti 
ip6_vti fou6 sit ipip tunnel4 geneve act_mirred cls_basic esp6 authenc echainiv 
iptable_filter xt_policy bpfilter veth esp4_offload esp4 xfrm_user xfrm_algo 
macsec fou vxlan ip6_udp_tunnel udp_tunnel vrf 8021q garp mrp bridge stp llc 
ip6_gre ip6_tunnel tunnel6 ip_gre ip_tunnel gre cls_u32 sch_htb dummy 
binfmt_misc nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua 
amd64_edac_mod edac_mce_amd kvm_amd kvm ipmi_ssif input_leds cdc_ether usbnet 
mii ccp k10temp ipmi_si ipmi_devintf 

[Kernel-packages] [Bug 1922387] Re: BUG: kernel NULL pointer dereference, address: 0000000000000050

2021-04-14 Ian
Here are the steps I used to reproduce:

# If using the kernel from the -proposed pocket, enable it first:
https://wiki.ubuntu.com/Testing/EnableProposed

# deb-src entries must be enabled for -proposed/-updates for this to work
$ sudo apt update
$ sudo apt-get source linux

# After the source is pulled, build and run the ftrace selftests
$ sudo make -C linux-5.4.0/tools/testing/selftests TARGETS=ftrace run_tests

I also tested Ubuntu-5.4.0-70.78 and saw similar behavior with soft
lockups, but have yet to replicate the crash there. Even so, I don't
feel I have evidence to indicate this is a kernel regression.
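The crash does not occur on every run. One option is to repeat the steps above in a loop and count soft lockup and NULL dereference messages in dmesg after each pass. This is a minimal POSIX sh sketch. The helper names, the default of five runs, the per-run log file names and the grep patterns are assumptions, not part of the original report.

```shell
#!/bin/sh
# Sketch only: helper names, run count, log file names and patterns are
# assumptions chosen for illustration.

# Print how many lines of the given text look like a soft lockup report
# or a NULL pointer dereference.
count_kernel_errors() {
    printf '%s\n' "$1" | grep -c -E 'soft lockup|BUG: kernel NULL pointer dereference'
}

# Run the ftrace selftests RUNS times (default 5) from the directory that
# contains the linux-5.4.0 source tree. The kernel ring buffer is cleared
# before each pass so every count covers a single run.
run_ftrace_loop() {
    runs=${1:-5}
    i=1
    while [ "$i" -le "$runs" ]; do
        sudo dmesg --clear
        sudo make -C linux-5.4.0/tools/testing/selftests TARGETS=ftrace run_tests \
            > "ftrace-run-$i.log" 2>&1
        errors=$(count_kernel_errors "$(sudo dmesg)")
        echo "run $i: $errors soft lockup / NULL dereference line(s) in dmesg"
        i=$((i + 1))
    done
}
```

For example, after sourcing the file, `run_ftrace_loop 10` runs ten passes and keeps each pass's selftest output in `ftrace-run-N.log`. If the machine panics, the loop stops as well, so the oops itself still has to be captured from the console.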

[Kernel-packages] [Bug 1922387] Re: BUG: kernel NULL pointer dereference, address: 0000000000000050

2021-04-14 Ian
I did some manual ubuntu_kernel_selftests ftrace testing on the
5.4.0-71.79-generic kernel. I was able to replicate the panic, though not
on every run. Even on runs without a panic, dmesg reported several soft
lockups.

After removing the MOFED DKMS modules, I was unable to replicate the panic
or any of the soft lockups seen previously. I don't currently have
evidence as to which MOFED module might be triggering the problem.
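The `Tainted: G   OE` field in the oops is consistent with this finding. In the kernel's taint-flag documentation for 5.4, O means an externally built (out-of-tree) module was loaded and E means an unsigned module was loaded. That combination is what DKMS-built modules such as MOFED would produce, but the flags do not identify which module is responsible. Bit 14 (L) is set once a soft lockup has occurred, so it also records the soft lockups mentioned above. The sketch below decodes the numeric value from `/proc/sys/kernel/tainted` into the same letters. `decode_taint` is a hypothetical helper, not an existing tool, and it only handles bits 0 to 17.

```shell
#!/bin/sh
# Sketch: decode a numeric kernel taint value into flag letters.
# decode_taint is a hypothetical helper; only bits 0-17 are handled.

decode_taint() {
    taint=$1
    # Flag letter for each bit, from bit 0 (P) to bit 17 (T).
    letters=PFSRMBUDAWCIOELKXT
    bit=0
    out=
    while [ "$bit" -lt 18 ]; do
        if [ $(( (taint >> bit) & 1 )) -eq 1 ]; then
            # cut -c is 1-based, so bit N is character N+1 of $letters.
            out="$out$(printf '%s' "$letters" | cut -c "$((bit + 1))")"
        fi
        bit=$((bit + 1))
    done
    # Print "none" when no taint bit is set.
    printf '%s\n' "${out:-none}"
}
```

For example, `decode_taint 12288` prints `OE`. On a test machine, `decode_taint "$(cat /proc/sys/kernel/tainted)"` together with `dkms status` shows whether out-of-tree modules are still in play after the MOFED removal. Taint flags persist until reboot, so the check is most meaningful on a freshly booted system.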

[Kernel-packages] [Bug 1922387] Re: BUG: kernel NULL pointer dereference, address: 0000000000000050

2021-04-02 Francis Ginther
This panic occurred while running the ubuntu_kernel_selftests suite. The
last lines of the log were:

13:33:20 DEBUG| [stdout] # selftests: ftrace: ftracetest
13:33:20 DEBUG| [stdout] # === Ftrace unit tests ===
13:33:28 DEBUG| [stdout] # [1] Basic trace file check [PASS]
13:37:04 DEBUG| [stdout] # [2] Basic test for tracers [PASS]
13:39:48 DEBUG| [stdout] # [3] Basic trace clock test [PASS]
13:39:56 DEBUG| [stdout] # [4] Basic event tracing check [PASS]
13:40:04 DEBUG| [stdout] # [5] Change the ringbuffer size [PASS]
13:40:20 DEBUG| [stdout] # [6] Snapshot and tracing setting [PASS]
13:40:35 DEBUG| [stdout] # [7] trace_pipe and trace_marker [PASS]
13:40:51 DEBUG| [stdout] # [8] Generic dynamic event - add/remove kprobe events [PASS]
13:41:07 DEBUG| [stdout] # [9] Generic dynamic event - add/remove synthetic events [PASS]
13:41:14 DEBUG| [stdout] # [10] Generic dynamic event - selective clear (compatibility) [PASS]
13:41:22 DEBUG| [stdout] # [11] Generic dynamic event - generic clear event [PASS]
13:41:46 DEBUG| [stdout] # [12] event tracing - enable/disable with event level files [PASS]
13:42:17 DEBUG| [stdout] # [13] event tracing - restricts events based on pid [PASS]
13:42:41 DEBUG| [stdout] # [14] event tracing - enable/disable with subsystem level files [PASS]
13:43:05 DEBUG| [stdout] # [15] event tracing - enable/disable with top level files [PASS]
13:43:14 DEBUG| [stdout] # [16] Test trace_printk from module [PASS]
13:43:56 DEBUG| [stdout] # [17] ftrace - function graph filters with stack tracer [PASS]
13:44:29 DEBUG| [stdout] # [18] ftrace - function graph filters [PASS]
13:45:49 DEBUG| [stdout] # [19] ftrace - function pid filters [PASS]
13:46:06 DEBUG| [stdout] # [20] ftrace - stacktrace filter command [PASS]
13:46:38 DEBUG| [stdout] # [21] ftrace - function trace with cpumask [PASS]
13:47:13 DEBUG| [stdout] # [22] ftrace - test for function event triggers [PASS]
13:47:21 DEBUG| [stdout] # [23] ftrace - function trace on module [PASS]
13:47:31 DEBUG| [stdout] # [24] ftrace - function profiling [PASS]
13:48:07 DEBUG| [stdout] # [25] ftrace - function profiler with function tracing [PASS]
13:48:25 DEBUG| [stdout] # [26] ftrace - test reading of set_ftrace_filter [PASS]
 END OF MESSAGES 

This job was run twice. The prior run also hung before completing, but
we don't have a console log for that time period, so it's unclear whether
it also panicked. Its last messages were:

04:44:27 DEBUG| [stdout] # selftests: timers: nsleep-lat
04:44:48 DEBUG| [stdout] # nsleep latency CLOCK_REALTIME [OK]
04:45:09 DEBUG| [stdout] # nsleep latency CLOCK_MONOTONIC [OK]
04:45:09 DEBUG| [stdout] # nsleep latency CLOCK_MONOTONIC_RAW [UNSUPPORTED]
04:45:09 DEBUG| [stdout] # nsleep latency CLOCK_REALTIME_COARSE [UNSUPPORTED]
04:45:09 DEBUG| [stdout] # nsleep latency CLOCK_MONOTONIC_COARSE [UNSUPPORTED]
04:45:30 DEBUG| [stdout] # nsleep latency CLOCK_BOOTTIME [OK]
04:45:52 DEBUG| [stdout] # nsleep latency CLOCK_REALTIME_ALARM [OK]
04:46:13 DEBUG| [stdout] # nsleep latency CLOCK_BOOTTIME_ALARM [OK]
04:46:34 DEBUG| [stdout] # nsleep latency CLOCK_TAI [OK]
04:46:34 DEBUG| [stdout] # # Pass 0 Fail 0 Xfail 0 Xpass 0 Skip 0 Error 0
04:46:34 DEBUG| [stdout] ok 3 selftests: timers: nsleep-lat
04:46:34 DEBUG| [stdout] # selftests: timers: set-timer-lat

The job can be found here:
http://10.246.72.4:8080/view/nvidia%20a100%20-%20blanka/job/focal-linux-generic-amd64-5.4.0-blanka-ubuntu_kernel_selftests/
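When a run hangs or panics, the last numbered ftracetest result in the autotest log shows how far the suite got (test 26 in the excerpt above). This is a small sketch that assumes result lines contain `# [N]` as in that excerpt; `last_ftrace_test` is a hypothetical helper. The oops was raised from a timer softirq (`run_timer_softirq` in the call trace), so the next test in the sequence is only a candidate and not a proven trigger.

```shell
#!/bin/sh
# Sketch: print the number of the last "# [N]" ftracetest result found in
# the given log text, or nothing if there is none.
# last_ftrace_test is a hypothetical helper, not part of the test suite.

last_ftrace_test() {
    # Keep only the "# [N]" markers, take the last one, strip non-digits.
    printf '%s\n' "$1" | grep -o '# \[[0-9][0-9]*]' | tail -n 1 | sed 's/[^0-9]//g'
}
```

For example, `last_ftrace_test "$(cat debug.log)"` prints `26` for the log quoted above. The log file name is an assumption and depends on how the job stores autotest output.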

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1922387

Title:
  BUG: kernel NULL pointer dereference, address: 0050

Status in linux package in Ubuntu:
  New
Status in linux source package in Focal:
  Confirmed
Status in linux source package in Groovy:
  New
Status in linux source package in Hirsute:
  New
