Re: [cfs_trace_lock_tcd] BUG: unable to handle kernel NULL pointer dereference at 00000050
Hi James,

On Wed, Apr 18, 2018 at 02:59:15PM +0100, James Simmons wrote:
>
>> Hello,
>>
>> FYI this happens in mainline kernel 4.17.0-rc1.
>> It looks like a new regression.
>>
>> [7.587002] lnet_selftest_init+0x2c4/0x5d9:
>>     lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:134
>> [7.587002] ? lnet_selftest_exit+0x8d/0x8d:
>>     lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:90
>
> Are you running lnet selftest ?

Perhaps yes -- it's a randconfig boot test and the .config does include
CONFIG_LNET_SELFTEST:

        CONFIG_LNET=y
        CONFIG_LNET_MAX_PAYLOAD=1048576
==>     CONFIG_LNET_SELFTEST=y
        CONFIG_LNET_XPRT_IB=y

> Is this a UMP setup?

Yes, .config has:

        # CONFIG_SMP is not set

> The reason I ask is that there is an SMP handling bug in lnet selftest.
> If you look at the mailing list I pushed an SMP patch series. Can you
> try that series and tell me if it works for you.

So it looks like your fixup patch is not for this case?

Anyway, the reproduce-* script attached in the previous email should be
fairly straightforward to try out for reproducing the bug.

Thanks,
Fengguang
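For anyone checking their own build against the options discussed above, here is a minimal sketch of reading option states out of a kernel .config. It relies only on the standard .config conventions visible in this mail (`CONFIG_FOO=value` for set options, `# CONFIG_FOO is not set` for disabled ones). The helper name `config_state` and the `SAMPLE` text are made up for illustration; the sample just repeats the lines quoted in this message.

```python
def config_state(config_text, option):
    """Return the value of `option` in kernel .config text, or None if unset.

    Hypothetical helper for illustration; not part of any kernel tooling.
    """
    for raw in config_text.splitlines():
        line = raw.strip()
        # Disabled options are recorded as a comment of this exact form.
        if line == f"# {option} is not set":
            return None
        # Enabled options are recorded as NAME=value (y, m, a number, a string).
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    # Option not mentioned at all: treat it as unset.
    return None


# The option lines quoted in this message, as they would appear in .config.
SAMPLE = """\
CONFIG_LNET=y
CONFIG_LNET_MAX_PAYLOAD=1048576
CONFIG_LNET_SELFTEST=y
CONFIG_LNET_XPRT_IB=y
# CONFIG_SMP is not set
"""

if __name__ == "__main__":
    for name in ("CONFIG_LNET_SELFTEST", "CONFIG_SMP"):
        print(name, "->", config_state(SAMPLE, name))
```

With this sample, CONFIG_LNET_SELFTEST is reported as set and CONFIG_SMP as unset, which matches the uniprocessor randconfig build described above.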
Re: [cfs_trace_lock_tcd] BUG: unable to handle kernel NULL pointer dereference at 00000050
> Hello,
>
> FYI this happens in mainline kernel 4.17.0-rc1.
> It looks like a new regression.
>
> [7.587002] lnet_selftest_init+0x2c4/0x5d9:
>     lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:134
> [7.587002] ? lnet_selftest_exit+0x8d/0x8d:
>     lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:90

Are you running lnet selftest? Is this a UMP setup? The reason I ask is
that there is an SMP handling bug in lnet selftest. If you look at the
mailing list I pushed an SMP patch series. Can you try that series and tell
me if it works for you.

Thanks
[cfs_trace_lock_tcd] BUG: unable to handle kernel NULL pointer dereference at 00000050
Hello,

FYI this happens in mainline kernel 4.17.0-rc1.
It looks like a new regression.

It occurs in 5 out of 5 boots.

[6.524361] ledtrig-cpu: registered to indicate activity on CPUs
[6.527658] NET: Registered protocol family 4
[6.528191] comedi: version 0.7.76 - http://www.comedi.org
[6.528851] LNetError: 1:0:(module.c:546:libcfs_init()) misc_register: error -16
[7.220272] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[7.586283] BUG: unable to handle kernel NULL pointer dereference at 0050
[7.586962] *pdpt = *pde = f000ff53f000ff53
[7.587002] Oops: [#1] PREEMPT
[7.587002] CPU: 0 PID: 1 Comm: swapper Not tainted 4.17.0-rc1 #1
[7.587002] EIP: cfs_trace_lock_tcd+0xb/0xa0:
    cfs_trace_lock_tcd at drivers/staging/lustre/lnet/libcfs/linux/linux-tracefile.c:149
[7.587002] EFLAGS: 00210246 CPU: 0
[7.587002] EAX: EBX: ECX: 81fcb588 EDX:
[7.587002] ESI: 1800 EDI: 8f5d1e08 EBP: 8f5d1d7c ESP: 8f5d1d70
[7.587002] DS: 007b ES: 007b FS: GS: 00e0 SS: 0068
[7.587002] CR0: 80050033 CR2: 0050 CR3: 022f CR4: 06b0
[7.587002] Call Trace:
[7.587002] libcfs_debug_vmsg2+0x8f/0x82f:
    libcfs_debug_vmsg2 at drivers/staging/lustre/lnet/libcfs/tracefile.c:317
[7.587002] ? trace_irq_enable_rcuidle+0x25/0x62:
    static_key_false at include/linux/jump_label.h:206
    (inlined by) trace_irq_enable_rcuidle at include/trace/events/preemptirq.h:40
[7.587002] ? slob_free+0x249/0x251:
    slob_free at mm/slob.c:421
[7.587002] libcfs_debug_msg+0x19/0x1b:
    libcfs_debug_msg at drivers/staging/lustre/lnet/libcfs/tracefile.c:287
[7.587002] ksocknal_startup+0xe77/0x12b2:
    ksocknal_startup at drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c:2845 (discriminator 3)
[7.587002] ? lock_release+0x135/0x1ec:
    lock_release at kernel/locking/lockdep.c:3942
[7.587002] ? _raw_spin_unlock+0x3c/0x4b:
    __raw_spin_unlock at include/linux/spinlock_api_smp.h:152
    (inlined by) _raw_spin_unlock at kernel/locking/spinlock.c:176
[7.587002] lnet_startup_lndni+0x4cd/0x9ec:
    lnet_startup_lndni at drivers/staging/lustre/lnet/lnet/api-ni.c:1304
[7.587002] LNetNIInit+0x880/0xa00:
    lnet_startup_lndnis at drivers/staging/lustre/lnet/lnet/api-ni.c:1385
    (inlined by) LNetNIInit at drivers/staging/lustre/lnet/lnet/api-ni.c:1543
[7.587002] ? read_seqcount_retry+0x1b/0x22:
    read_seqcount_retry at include/linux/seqlock.h:222
[7.587002] srpc_startup+0x84/0x381:
    srpc_startup at drivers/staging/lustre/lnet/selftest/rpc.c:1613
[7.587002] lnet_selftest_init+0x2c4/0x5d9:
    lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:134
[7.587002] ? lnet_selftest_exit+0x8d/0x8d:
    lnet_selftest_init at drivers/staging/lustre/lnet/selftest/module.c:90
[7.587002] do_one_initcall+0x76/0x1d7:
    __read_once_size at include/linux/compiler.h:188
    (inlined by) arch_atomic_read at arch/x86/include/asm/atomic.h:31
    (inlined by) atomic_read at include/asm-generic/atomic-instrumented.h:22
    (inlined by) static_key_count at include/linux/jump_label.h:194
    (inlined by) static_key_false at include/linux/jump_label.h:206
    (inlined by) trace_initcall_finish at include/trace/events/initcall.h:44
    (inlined by) do_one_initcall at init/main.c:884
[7.587002] ?
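A note on reading the fault address: the oops reports 0x50 rather than 0, which is typically what shows up when code reads a structure member through a NULL base pointer, since the faulting address is then the member's offset within the structure. The sketch below illustrates that arithmetic only. The layout of `HypotheticalTcd` is invented so that one field lands at offset 0x50; the real structure touched at linux-tracefile.c:149 is not shown in this report and may look nothing like it.

```python
import ctypes


class HypotheticalTcd(ctypes.Structure):
    """Invented layout for illustration; NOT the real libcfs structure."""
    _fields_ = [
        # 0x50 bytes of unrelated members before the field of interest.
        ("padding", ctypes.c_uint8 * 0x50),
        # The member the faulting code is assumed to read.
        ("flag", ctypes.c_uint16),
    ]


def fault_address(base_pointer, field):
    """Address the CPU would access for `field` given a struct base pointer."""
    return base_pointer + field.offset


NULL = 0

if __name__ == "__main__":
    # With a NULL base pointer, the access lands at the member offset itself.
    print(hex(fault_address(NULL, HypotheticalTcd.flag)))
```

Under these assumptions the computed address equals the field offset, consistent with a NULL structure pointer being passed down to cfs_trace_lock_tcd, though the trace alone does not prove that.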