from:"Fengguang Wu"


On Sun, Nov 12, 2017 at 09:23:52AM +0800, Alexei Starovoitov wrote:

On 11/12/17 9:18 AM, Fengguang Wu wrote:

On Sun, Nov 12, 2017 at 09:14:14AM +0800, Alexei Starovoitov wrote:

On 11/12/17 8:23 AM, kbuild test robot wrote:

tree:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
master
head:   7c5556decd0a629e9ee02e93653f75ba7b7da03c
commit: 638f5b90d46016372a8e3e0a434f199cc5e12b8c [488/665] bpf:
reduce verifier memory consumption
config: mips-64r6el_defconfig (attached as .config)
compiler: mips64el-linux-gnuabi64-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
git checkout 638f5b90d46016372a8e3e0a434f199cc5e12b8c
# save the attached .config to linux build tree
make.cross ARCH=mips

All errors (new ones prefixed by >>):

   kernel/bpf/verifier.o: In function `realloc_verifier_state.isra.19':

verifier.c:(.text+0x36fc): undefined reference to `__multi3'


that's a known issue with gcc 7 on mips that is "optimizing"
normal 64-bit multiply into 128-bit variant.
Nothing to fix on the kernel side.


Good to know that! Do you think it a good idea to blacklist __multi3
errors in mips builds?


I would do so. yes.


OK.


Though digging further this function was added to
arch/sparc/lib/multi3.S
since gcc doing the same "optimization" there.
Adding asm code doesn't look right to me. I'd rather push
gcc folks to avoid such codegen.


Sure, I just forwarded the original report to GCC list.

Thanks,
Fengguang

[net-next:master 488/665] verifier.c:undefined reference to `__multi3'

CC gcc list. According to Alexei:

 This is a known issue with gcc 7 on mips that is "optimizing"
 normal 64-bit multiply into 128-bit variant.
 Nothing to fix on the kernel side.

 Digging further, this function was added to
 arch/sparc/lib/multi3.S
 since gcc doing the same "optimization" there.
 Adding asm code doesn't look right to me. I'd rather push
 gcc folks to avoid such codegen.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   7c5556decd0a629e9ee02e93653f75ba7b7da03c
commit: 638f5b90d46016372a8e3e0a434f199cc5e12b8c [488/665] bpf: reduce verifier 
memory consumption
config: mips-64r6el_defconfig (attached as .config)
compiler: mips64el-linux-gnuabi64-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
 wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
 chmod +x ~/bin/make.cross
 git checkout 638f5b90d46016372a8e3e0a434f199cc5e12b8c
 # save the attached .config to linux build tree
 make.cross ARCH=mips 

All errors (new ones prefixed by >>):

kernel/bpf/verifier.o: In function `realloc_verifier_state.isra.19':
>> verifier.c:(.text+0x36fc): undefined reference to `__multi3'
crypto/scompress.o: In function `.L82':
scompress.c:(.text+0x55c): undefined reference to `__multi3'
lib/mpi/generic_mpih-mul1.o: In function `.L2':
generic_mpih-mul1.c:(.text+0x60): undefined reference to `__multi3'
lib/mpi/generic_mpih-mul2.o: In function `.L2':
generic_mpih-mul2.c:(.text+0x5c): undefined reference to `__multi3'
lib/mpi/generic_mpih-mul3.o: In function `.L2':
generic_mpih-mul3.c:(.text+0x5c): undefined reference to `__multi3'
lib/mpi/mpih-div.o:mpih-div.c:(.text+0x1b8): more undefined references to 
`__multi3' follow

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip
___
kbuild-all mailing list
kbuild-...@lists.01.org
https://lists.01.org/mailman/listinfo/kbuild-all

Re: [net-next:master 488/665] verifier.c:undefined reference to `__multi3'


On Sun, Nov 12, 2017 at 09:14:14AM +0800, Alexei Starovoitov wrote:

On 11/12/17 8:23 AM, kbuild test robot wrote:

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   7c5556decd0a629e9ee02e93653f75ba7b7da03c
commit: 638f5b90d46016372a8e3e0a434f199cc5e12b8c [488/665] bpf: reduce verifier 
memory consumption
config: mips-64r6el_defconfig (attached as .config)
compiler: mips64el-linux-gnuabi64-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
git checkout 638f5b90d46016372a8e3e0a434f199cc5e12b8c
# save the attached .config to linux build tree
make.cross ARCH=mips

All errors (new ones prefixed by >>):

   kernel/bpf/verifier.o: In function `realloc_verifier_state.isra.19':

verifier.c:(.text+0x36fc): undefined reference to `__multi3'


that's a known issue with gcc 7 on mips that is "optimizing"
normal 64-bit multiply into 128-bit variant.
Nothing to fix on the kernel side.


Good to know that! Do you think it a good idea to blacklist __multi3
errors in mips builds?

Thanks,
Fengguang


   crypto/scompress.o: In function `.L82':
   scompress.c:(.text+0x55c): undefined reference to `__multi3'
   lib/mpi/generic_mpih-mul1.o: In function `.L2':
   generic_mpih-mul1.c:(.text+0x60): undefined reference to `__multi3'
   lib/mpi/generic_mpih-mul2.o: In function `.L2':
   generic_mpih-mul2.c:(.text+0x5c): undefined reference to `__multi3'
   lib/mpi/generic_mpih-mul3.o: In function `.L2':
   generic_mpih-mul3.c:(.text+0x5c): undefined reference to `__multi3'
   lib/mpi/mpih-div.o:mpih-div.c:(.text+0x1b8): more undefined references to 
`__multi3' follow

Re: [run_timer_softirq] BUG: unable to handle kernel paging request at 0000000000010007

On Fri, Nov 10, 2017 at 10:29:59PM +0100, Thomas Gleixner wrote:

On Fri, 10 Nov 2017, Linus Torvalds wrote:

On Wed, Nov 8, 2017 at 9:19 PM, Fengguang Wu <fengguang...@intel.com> wrote:
>
> Yes it's accessing the list. Here is the faddr2line output.

Ok, so it's a corrupted timer list. Which is not a big surprise.

It's

next->pprev = pprev;

in __hlist_del(), and the trapping instruction decodes as

mov%rdx,0x8(%rax)

with %rax having the value dead0200,

Which is just LIST_POISON2.

So we've deleted that entry twice - LIST_POISON2 is what hlist_del()
sets pprev to after already deleting it once.

Although in this case it might not be hlist_del(), because
detach_timer() also sets entry->next to LIST_POISON2.

Which is pretty bogus, we are supposed to use LIST_POISON1 for the
"next" pointer. Oh well. Nobody cares, except for the list entry
debugging code, which isn't run on the hlist cases.

Adding Thomas Gleixner to the cc. It should not be possible to delete
the same timer twice.

Right, it shouldn't.

Fengguang, can you please enable:

CONFIG_DEBUG_OBJECTS
CONFIG_DEBUG_OBJECTS_TIMERS

and try to reproduce? Debugobject should catch that hopefully.

Sure. However I've not got any results until now -- it's rather hard
to reproduce. I'll check possible results tomorrow.

Regards,
Fengguang

Re: [Patch net] vlan: fix a use-after-free in vlan_device_event()

2017-11-10 Thread Fengguang Wu


It works, thank you for fixing this ancient bug!

Tested-by: Fengguang Wu <fengguang...@intel.com>

Re: [vlan_device_event] BUG: unable to handle kernel paging request at 6b6b6ccf


On Thu, Nov 09, 2017 at 02:55:10PM +0800, Fengguang Wu wrote:

On Wed, Nov 08, 2017 at 10:34:10PM -0800, Cong Wang wrote:

On Wed, Nov 8, 2017 at 7:12 PM, Fengguang Wu <fengguang...@intel.com> wrote:

Hi Alex,


So looking over the trace the panic seems to be happening after a
decnet interface is getting deleted. Is there any chance we could try
compiling the kernel without decnet support to see if that is the
source of these issues? I don't know if anyone on the Intel Wired Lan
team is testing with that enabled so if we can eliminate that as a
possible cause that would be useful.



Sure and thank you for the suggestion!

It looks disabling DECNET still triggers the vlan_device_event BUG.
However when looking at the dmesgs, I find another warning just before
the vlan_device_event BUG. Not sure if it's related one or independent
now-fixed issue.


Those decnet symbols are probably noises.


Yes it's not related to CONFIG_DECNET.


How do you reproduce it? And what is your setup? Vlan device on
top of your eth0 (e1000)?


It can basically be reproduced in one of our test machines --
lkp-wsx03, which is a Westmere EX server.


Anyway if you'd like to try, here are the steps. It'll auto download
the images and run QEMU.

   apt-get install lib32gcc-7-dev # or lib32gcc-6-dev
   git clone https://github.com/intel/lkp-tests.git
   cd lkp-tests
   bin/lkp qemu -k  job-script  # job-script is attached in this 
email

Note that even in our lkp-wsx03 machine, the chance of reproducing it
is only 3% (3 out of 100 boots).

Thanks,
Fengguang
#!/bin/sh

export_top_env()
{
export suite='trinity'
export testcase='trinity'
export runtime=300
export 
job_origin='/lkp/lkp/src/allot/rand/vm-lkp-wsx03-openwrt-i386/trinity.yaml'
export testbox='vm-lkp-wsx03-openwrt-i386-5'
export tbox_group='vm-lkp-wsx03-openwrt-i386'
export kconfig='i386-randconfig-b0-11061302-CONFIG_DRM_BOCHS'
export compiler='gcc-6'
export queue='wfg'
export branch='linus/master'
export commit='c470abd4fde40ea6a0846a2beab642a578c0b8cd'
export submit_id='5a03a4550b9a93f7c99708b0'
export 
job_file='/lkp/scheduled/vm-lkp-wsx03-openwrt-i386-5/trinity-300s-openwrt-i386-2016-03-16.cgz-c470abd4fde40ea6a0846a2beab642a578c0b8cd-20171109-63433-kf9gj3-wait_kernel-0.yaml'
export id='181954ca4367d475b88dc8de99b2d52ab533a5e1'
export model='qemu-system-i386 -enable-kvm'
export nr_vm=32
export nr_cpu=1
export memory='320M'
export rootfs='openwrt-i386-2016-03-16.cgz'
export hdd_partitions='/dev/vda'
export swap_partitions='/dev/vdb'
export need_kconfig='CONFIG_KVM_GUEST=y'
export enqueue_time='2017-11-09 08:41:58 +0800'
export _id='5a03a4560b9a93f7c99708bb'
export 
_rt='/result/trinity/300s/vm-lkp-wsx03-openwrt-i386/openwrt-i386-2016-03-16.cgz/i386-randconfig-b0-11061302-CONFIG_DRM_BOCHS/gcc-6/c470abd4fde40ea6a0846a2beab642a578c0b8cd'
export user='lkp'
export 
result_root='/result/trinity/300s/vm-lkp-wsx03-openwrt-i386/openwrt-i386-2016-03-16.cgz/i386-randconfig-b0-11061302-CONFIG_DRM_BOCHS/gcc-6/c470abd4fde40ea6a0846a2beab642a578c0b8cd/0'
export LKP_SERVER='inn'
export max_uptime=1500
export initrd='/osimage/openwrt/openwrt-i386-2016-03-16.cgz'
export bootloader_append='root=/dev/ram0
user=lkp
job=/lkp/scheduled/vm-lkp-wsx03-openwrt-i386-5/trinity-300s-openwrt-i386-2016-03-16.cgz-c470abd4fde40ea6a0846a2beab642a578c0b8cd-20171109-63433-kf9gj3-wait_kernel-0.yaml
ARCH=i386
kconfig=i386-randconfig-b0-11061302-CONFIG_DRM_BOCHS
branch=linus/master
commit=c470abd4fde40ea6a0846a2beab642a578c0b8cd
BOOT_IMAGE=/pkg/linux/i386-randconfig-b0-11061302-CONFIG_DRM_BOCHS/gcc-6/c470abd4fde40ea6a0846a2beab642a578c0b8cd/vmlinuz-4.10.0
max_uptime=1500
RESULT_ROOT=/result/trinity/300s/vm-lkp-wsx03-openwrt-i386/openwrt-i386-2016-03-16.cgz/i386-randconfig-b0-11061302-CONFIG_DRM_BOCHS/gcc-6/c470abd4fde40ea6a0846a2beab642a578c0b8cd/0
LKP_SERVER=inn
debug
apic=debug
sysrq_always_enabled
rcupdate.rcu_cpu_stall_timeout=100
net.ifnames=0
printk.devkmsg=on
panic=-1
softlockup_panic=1
nmi_watchdog=panic
oops=panic
load_ramdisk=2
prompt_ramdisk=0
drbd.minor_count=8
systemd.log_level=err
ignore_loglevel
console=tty0
earlyprintk=ttyS0,115200
console=ttyS0,115200
vga=normal
rw'
export lkp_initrd='/lkp/lkp/lkp-i386.cgz'
export bm_initrd='/osimage/pkg/static/trinity-i386.cgz'
export site='inn'
export LKP_CGI_PORT=80
export LKP_CIFS_PORT=139
export 
vmlinux_file='/pkg/linux/i386-randconfig-b0-11061302-CONFIG_DRM_BOCHS/gcc-6/c470abd4fde40ea6a0846a2beab642a578c0b8cd/vmlinux'
export 
kernel='/pkg/linux/i386-randconfig-b0-11061302-CONFIG_DRM_BOCHS/gcc-6/c470abd4fde40ea6a0846a2beab642a578c0b8cd/vmlinuz-4.10.0'
export dequeue_time='2017-11-09 09:06:15 +0800'
export 
job_initrd

Re: [vlan_device_event] BUG: unable to handle kernel paging request at 6b6b6ccf


On Thu, Nov 09, 2017 at 12:09:45PM +0800, Fengguang Wu wrote:

On Thu, Nov 09, 2017 at 11:12:06AM +0800, Fengguang Wu wrote:

Hi Alex,


So looking over the trace the panic seems to be happening after a
decnet interface is getting deleted. Is there any chance we could try
compiling the kernel without decnet support to see if that is the
source of these issues? I don't know if anyone on the Intel Wired Lan
team is testing with that enabled so if we can eliminate that as a
possible cause that would be useful.


Sure and thank you for the suggestion!

It looks disabling DECNET still triggers the vlan_device_event BUG.
However when looking at the dmesgs, I find another warning just before
the vlan_device_event BUG. Not sure if it's related one or independent
now-fixed issue.

Please press Enter to activate this console.
[ 1291.938326] Writes:  Total: 2  Max/Min: 0/0   Fail: 0
[ 1297.731690] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: 
RX
[ 1297.828227] 8021q: adding VLAN 0 to HW filter on device eth0
[ 1300.506245] 8021q: adding VLAN 0 to HW filter on device eth0
[ 1302.467460] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: 
RX
LKP: HOSTNAME vm-lkp-wsx03-openwrt-i386-10, MAC , kernel 4.13.0 1, serial 
console /dev/ttyS0
[ 1304.161688] Kernel tests: Boot OK!
[ 1306.558532] 8021q: adding VLAN 0 to HW filter on device eth0
[ 1308.507499] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: 
RX
[ 1310.526380] 8021q: adding VLAN 0 to HW filter on device eth0
[ 1311.246017] LKP: waiting for network...
[ 1312.543432] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: 
RX
[ 1313.985807]
[ 1313.991541] =
[ 1314.002398] WARNING: bad unlock balance detected!
[ 1314.013154] 4.13.0 #1 Not tainted
[ 1314.021549] -
[ 1314.032505] procd/1244 is trying to release lock (rcu_preempt_state) at:
[ 1314.047216] [] rcu_read_unlock_special+0x580/0x5b0
[ 1314.059825] but there are no more locks to release!
[ 1314.070546]
[ 1314.070546] other info that might help us debug this:
[ 1314.085941] 2 locks held by procd/1244:
[ 1314.095139]  #0:  (>cred_guard_mutex){..}, at: [] 
prepare_bprm_creds+0x28/0xc0
[ 1314.114616]  #1:  (rcu_read_lock){..}, at: [] 
path_init+0x490/0x6f0
[ 1314.132155]
[ 1314.132155] stack backtrace:
[ 1314.144402] CPU: 0 PID: 1244 Comm: procd Not tainted 4.13.0 #1
[ 1314.160197] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[ 1314.179404] Call Trace:
[ 1314.186768]  dump_stack+0x16/0x1c
[ 1314.195387]  print_unlock_imbalance_bug+0xb9/0xd0
[ 1314.205753]  ? rcu_read_unlock_special+0x580/0x5b0
[ 1314.216381]  ? rcu_read_unlock_special+0x580/0x5b0
[ 1314.226982]  lock_release+0x1cc/0x490
[ 1314.235602]  ? rcu_gp_kthread_wake+0x34/0x50
[ 1314.245262]  ? rcu_read_unlock_special+0x580/0x5b0
[ 1314.255724]  rt_mutex_unlock+0x1e/0xb0
[ 1314.264610]  rcu_read_unlock_special+0x580/0x5b0
[ 1314.274814]  __rcu_read_unlock+0xa7/0xb0
[ 1314.283954]  unlazy_walk+0xcf/0x1f0
[ 1314.292409]  trailing_symlink+0x349/0x4e0
[ 1314.301583]  path_openat+0x333/0x1280
[ 1314.310197]  do_filp_open+0x67/0x140
[ 1314.318696]  ? getname_kernel+0x23/0x1e0
[ 1314.327766]  ? cache_alloc_debugcheck_after+0x13a/0x2a0
[ 1314.340076]  ? getname_kernel+0x23/0x1e0
[ 1314.349179]  do_open_execat+0xab/0x2a0
[ 1314.358063]  open_exec+0x57/0x80
[ 1314.366128]  load_script+0x33c/0x3d0
[ 1314.374556]  ? kvm_sched_clock_read+0x9/0x20
[ 1314.384219]  ? sched_clock+0x9/0x10
[ 1314.392611]  ? sched_clock_cpu+0x1a/0x1e0
[ 1314.401875]  ? _raw_read_unlock+0x55/0x90
[ 1314.411080]  search_binary_handler+0xd9/0x160
[ 1314.420799]  do_execveat_common+0x8f6/0xb10
[ 1314.430334]  SyS_execve+0x1f/0x30
[ 1314.438458]  do_int80_syscall_32+0x95/0x1b0
[ 1314.447956]  entry_INT80_32+0x2f/0x2f
[ 1314.456606] EIP: 0xb7e9ab07
[ 1314.464062] EFLAGS: 0296 CPU: 0
[ 1314.472421] EAX: ffda EBX: 0807b584 ECX: bfb0fd70 EDX: 08061250
[ 1314.485257] ESI: 0807b584 EDI:  EBP: bfb0fd58 ESP: bfb0fd28
[ 1314.498024]  DS: 007b ES: 007b FS:  GS: 0033 SS: 007b
[ 1314.613681] hotplug-call (1244) used greatest stack depth: 6384 bytes left
[ 1314.957636] 8021q: adding VLAN 0 to HW filter on device eth0
[ 1316.955154] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: 
RX
[ 1318.197800] 8021q: adding VLAN 0 to HW filter on device eth0
[ 1320.222754] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: 
RX
[ 1321.409456] BUG: unable to handle kernel paging request at 6b6b6f4f
[ 1321.421942] IP: vlan_device_event+0x7f5/0xa40
[ 1321.431239] *pde = 


Note that this call trace is different from the ones posted in earlier
emails.

[ 1321.431267]
[ 1321.443356] Oops:  [#1] PREEMPT
[ 1321.451390] CPU: 0 PID: 798 Comm: netifd Not tainted 4.13.0 #1
[ 1321.462802] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[ 1321.479743] task: cf8ae

Re: [vlan_device_event] BUG: unable to handle kernel paging request at 6b6b6ccf


On Wed, Nov 08, 2017 at 10:34:10PM -0800, Cong Wang wrote:

On Wed, Nov 8, 2017 at 7:12 PM, Fengguang Wu <fengguang...@intel.com> wrote:

Hi Alex,


So looking over the trace the panic seems to be happening after a
decnet interface is getting deleted. Is there any chance we could try
compiling the kernel without decnet support to see if that is the
source of these issues? I don't know if anyone on the Intel Wired Lan
team is testing with that enabled so if we can eliminate that as a
possible cause that would be useful.



Sure and thank you for the suggestion!

It looks disabling DECNET still triggers the vlan_device_event BUG.
However when looking at the dmesgs, I find another warning just before
the vlan_device_event BUG. Not sure if it's related one or independent
now-fixed issue.


Those decnet symbols are probably noises.


Yes it's not related to CONFIG_DECNET.


How do you reproduce it? And what is your setup? Vlan device on
top of your eth0 (e1000)?


It can basically be reproduced in one of our test machines --
lkp-wsx03, which is a Westmere EX server.

The test boots an openwrt image in QEMU and run trinity for minutes.

Here is the openwrt image:

https://github.com/0day-ci/lkp-qemu/blob/master/osimage/openwrt/openwrt-i386-2016-03-16.cgz

I don't know how openwrt deals with VLAN. :)

Thanks,
Fengguang

Re: [vlan_device_event] BUG: unable to handle kernel paging request at 6b6b6ccf

On Thu, Nov 09, 2017 at 10:43:08AM +0800, Fengguang Wu wrote:

Of course, if it's bisectable, that would be great too.

Yes, bisect is on the way. So far it's bisecting in the 4.12 commits.

The bisect was unsuccessful due to an unrelated DRM_BOCHS oops in 4.11.
Disabling the buggy driver, I managed to reproduce the
vlan_device_event bug in 4.11. However only to find the older kernels
suffer from different kind of oops, which make the bisect troublesome.

% grep -E 'dmesg.(BUG|EIP)' v4.*/matrix.json
v4.10/matrix.json: "dmesg.BUG:unable_to_handle_kernel": [
v4.10/matrix.json: "dmesg.EIP:kobject_get": [
v4.10/matrix.json: "dmesg.BUG:kernel_reboot-without-warning_in_test_stage": [
v4.11/matrix.json: "dmesg.BUG:unable_to_handle_kernel": [
v4.11/matrix.json: "dmesg.EIP:vlan_device_event": [
v4.8/matrix.json:
"dmesg.BUG:kernel_reboot-without-warning_in_early-boot_stage,last_printk:Decompressing_Linux":
[
v4.9/matrix.json: "dmesg.BUG:key_not_in.data": [

I'll try tuning kconfig to get a good bisect base.

Hopefully we'll make the coming RC releases free from such bisect
stoppers with better regression tracking and bisect automation.

However if the situation turns out to be not that optimistic, we might
track "bisect stopper" kconfig options over kernel releases. For
bisect scripts to auto pick them up.

An example kconfig bisect-blacklist file would be:

$ cat tools/testing/configs/bisect-4.12.config
# CONFIG_DRM_BOCHS is not set

That would mean:

CONFIG_DRM_BOCHS will reliably stop kernel from booting before 4.12.
If ever bisecting into pre-4.12 kernels, better disable it.

The best fit would be the uncommon drivers and optional features.

Warnings are typically not bisect stoppers so can be ignored. Kernel
panics that happen rare enough to impact bisect may also be ignored.

Thanks,
Fengguang

Re: [vlan_device_event] BUG: unable to handle kernel paging request at 6b6b6ccf


On Thu, Nov 09, 2017 at 11:12:06AM +0800, Fengguang Wu wrote:

Hi Alex,


So looking over the trace the panic seems to be happening after a
decnet interface is getting deleted. Is there any chance we could try
compiling the kernel without decnet support to see if that is the
source of these issues? I don't know if anyone on the Intel Wired Lan
team is testing with that enabled so if we can eliminate that as a
possible cause that would be useful.


Sure and thank you for the suggestion!

It looks disabling DECNET still triggers the vlan_device_event BUG.
However when looking at the dmesgs, I find another warning just before
the vlan_device_event BUG. Not sure if it's related one or independent
now-fixed issue.

Please press Enter to activate this console.
[ 1291.938326] Writes:  Total: 2  Max/Min: 0/0   Fail: 0
[ 1297.731690] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: 
RX
[ 1297.828227] 8021q: adding VLAN 0 to HW filter on device eth0
[ 1300.506245] 8021q: adding VLAN 0 to HW filter on device eth0
[ 1302.467460] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: 
RX
LKP: HOSTNAME vm-lkp-wsx03-openwrt-i386-10, MAC , kernel 4.13.0 1, serial 
console /dev/ttyS0
[ 1304.161688] Kernel tests: Boot OK!
[ 1306.558532] 8021q: adding VLAN 0 to HW filter on device eth0
[ 1308.507499] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: 
RX
[ 1310.526380] 8021q: adding VLAN 0 to HW filter on device eth0
[ 1311.246017] LKP: waiting for network...
[ 1312.543432] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: 
RX
[ 1313.985807]
[ 1313.991541] =
[ 1314.002398] WARNING: bad unlock balance detected!
[ 1314.013154] 4.13.0 #1 Not tainted
[ 1314.021549] -
[ 1314.032505] procd/1244 is trying to release lock (rcu_preempt_state) at:
[ 1314.047216] [] rcu_read_unlock_special+0x580/0x5b0
[ 1314.059825] but there are no more locks to release!
[ 1314.070546]
[ 1314.070546] other info that might help us debug this:
[ 1314.085941] 2 locks held by procd/1244:
[ 1314.095139]  #0:  (>cred_guard_mutex){..}, at: [] 
prepare_bprm_creds+0x28/0xc0
[ 1314.114616]  #1:  (rcu_read_lock){..}, at: [] 
path_init+0x490/0x6f0
[ 1314.132155]
[ 1314.132155] stack backtrace:
[ 1314.144402] CPU: 0 PID: 1244 Comm: procd Not tainted 4.13.0 #1
[ 1314.160197] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[ 1314.179404] Call Trace:
[ 1314.186768]  dump_stack+0x16/0x1c
[ 1314.195387]  print_unlock_imbalance_bug+0xb9/0xd0
[ 1314.205753]  ? rcu_read_unlock_special+0x580/0x5b0
[ 1314.216381]  ? rcu_read_unlock_special+0x580/0x5b0
[ 1314.226982]  lock_release+0x1cc/0x490
[ 1314.235602]  ? rcu_gp_kthread_wake+0x34/0x50
[ 1314.245262]  ? rcu_read_unlock_special+0x580/0x5b0
[ 1314.255724]  rt_mutex_unlock+0x1e/0xb0
[ 1314.264610]  rcu_read_unlock_special+0x580/0x5b0
[ 1314.274814]  __rcu_read_unlock+0xa7/0xb0
[ 1314.283954]  unlazy_walk+0xcf/0x1f0
[ 1314.292409]  trailing_symlink+0x349/0x4e0
[ 1314.301583]  path_openat+0x333/0x1280
[ 1314.310197]  do_filp_open+0x67/0x140
[ 1314.318696]  ? getname_kernel+0x23/0x1e0
[ 1314.327766]  ? cache_alloc_debugcheck_after+0x13a/0x2a0
[ 1314.340076]  ? getname_kernel+0x23/0x1e0
[ 1314.349179]  do_open_execat+0xab/0x2a0
[ 1314.358063]  open_exec+0x57/0x80
[ 1314.366128]  load_script+0x33c/0x3d0
[ 1314.374556]  ? kvm_sched_clock_read+0x9/0x20
[ 1314.384219]  ? sched_clock+0x9/0x10
[ 1314.392611]  ? sched_clock_cpu+0x1a/0x1e0
[ 1314.401875]  ? _raw_read_unlock+0x55/0x90
[ 1314.411080]  search_binary_handler+0xd9/0x160
[ 1314.420799]  do_execveat_common+0x8f6/0xb10
[ 1314.430334]  SyS_execve+0x1f/0x30
[ 1314.438458]  do_int80_syscall_32+0x95/0x1b0
[ 1314.447956]  entry_INT80_32+0x2f/0x2f
[ 1314.456606] EIP: 0xb7e9ab07
[ 1314.464062] EFLAGS: 0296 CPU: 0
[ 1314.472421] EAX: ffda EBX: 0807b584 ECX: bfb0fd70 EDX: 08061250
[ 1314.485257] ESI: 0807b584 EDI:  EBP: bfb0fd58 ESP: bfb0fd28
[ 1314.498024]  DS: 007b ES: 007b FS:  GS: 0033 SS: 007b
[ 1314.613681] hotplug-call (1244) used greatest stack depth: 6384 bytes left
[ 1314.957636] 8021q: adding VLAN 0 to HW filter on device eth0
[ 1316.955154] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: 
RX
[ 1318.197800] 8021q: adding VLAN 0 to HW filter on device eth0
[ 1320.222754] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: 
RX
[ 1321.409456] BUG: unable to handle kernel paging request at 6b6b6f4f
[ 1321.421942] IP: vlan_device_event+0x7f5/0xa40
[ 1321.431239] *pde = 


Note that this call trace is different from the ones posted in earlier
emails.

[ 1321.431267]
[ 1321.443356] Oops:  [#1] PREEMPT
[ 1321.451390] CPU: 0 PID: 798 Comm: netifd Not tainted 4.13.0 #1
[ 1321.462802] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[ 1321.479743] task: cf8ae5c0 task.stack: cf114000
[ 1321.489345] EIP: vlan_device_event+0x7f5/0x

Re: [vlan_device_event] BUG: unable to handle kernel paging request at 6b6b6ccf


Of course, if it's bisectable, that would be great too.


Yes, bisect is on the way. So far it's bisecting in the 4.12 commits.


The bisect was unsuccessful due to an unrelated DRM_BOCHS oops in 4.11.
Disabling the buggy driver, I managed to reproduce the
vlan_device_event bug in 4.11. However only to find the older kernels
suffer from different kind of oops, which make the bisect troublesome.

% grep -E 'dmesg.(BUG|EIP)' v4.*/matrix.json
v4.10/matrix.json:  "dmesg.BUG:unable_to_handle_kernel": [
v4.10/matrix.json:  "dmesg.EIP:kobject_get": [
v4.10/matrix.json:  "dmesg.BUG:kernel_reboot-without-warning_in_test_stage": [
v4.11/matrix.json:  "dmesg.BUG:unable_to_handle_kernel": [
v4.11/matrix.json:  "dmesg.EIP:vlan_device_event": [
v4.8/matrix.json:  
"dmesg.BUG:kernel_reboot-without-warning_in_early-boot_stage,last_printk:Decompressing_Linux":
 [
v4.9/matrix.json:  "dmesg.BUG:key_not_in.data": [

I'll try tuning kconfig to get a good bisect base.

Hopefully we'll make the coming RC releases free from such bisect
stoppers with better regression tracking and bisect automation.

Regards,
Fengguang

Re: [vlan_device_event] BUG: unable to handle kernel paging request at 6b6b6ccf


[...]

I notice that this trace shows no additional inline files at all.
Is it because I did some kconfig option wrong, so that inline info is
lost? Eg.

CONFIG_OPTIMIZE_INLINING=y (reading lib/Kconfig.debug, it looks better set to N)
CONFIG_DEBUG_INFO_REDUCED=y
CONFIG_DEBUG_INFO_SPLIT=y
(full .config attached)


FYI it's randconfig testing, so may well contain absurd settings --
unless we add rules to force enable/disable some specific options to
make life easier.

Regards,
Fengguang


[  745.719623] BUG: unable to handle kernel paging request at 6b6b6f4f
[  745.732871] IP: vlan_device_event+0x7f5/0xa40:
vlan_device_event at net/8021q/vlan.h:60
[  745.742106] *pde = 
[  745.748587] Oops:  [#1] PREEMPT
[  745.756104] CPU: 0 PID: 786 Comm: netifd Not tainted 4.14.0-rc8 #1
[  745.769171] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[  745.786791] task: cf768780 task.stack: d187a000
[  745.796485] EIP: vlan_device_event+0x7f5/0xa40:
vlan_device_event at net/8021q/vlan.h:60
[  745.805877] EFLAGS: 00010206 CPU: 0
[  745.813237] EAX: 00f9 EBX: 0002 ECX:  EDX: 6b6b6b6b
[  745.825774] ESI: 02f9 EDI: d1de3700 EBP: d187bdd8 ESP: d187bda4
[  745.838871]  DS: 007b ES: 007b FS:  GS: 0033 SS: 0068
[  745.850218] CR0: 80050033 CR2: 6b6b6f4f CR3: 0f4c8000 CR4: 0690
[  745.862750] Call Trace:
[  745.868650]  ? dn_dev_delete+0x138/0x1d0:
dn_dev_delete at net/decnet/dn_dev.c:1224
[  745.876751]  ? dn_dev_down+0x69/0x80:
dn_dev_down at net/decnet/dn_dev.c:1240
[  745.885084]  notifier_call_chain+0x4e/0xa0:
notifier_call_chain at kernel/notifier.c:95 (discriminator 1)
[  745.894254]  raw_notifier_call_chain+0xc/0x10:
raw_notifier_call_chain at kernel/notifier.c:402
[  745.903979]  call_netdevice_notifiers_info+0x59/0x90:
call_netdevice_notifiers_info at net/core/dev.c:1672
[  745.914670]  __dev_notify_flags+0xea/0x130:
__dev_notify_flags at net/core/dev.c:1687
[  745.923446]  dev_change_flags+0x60/0x70:
dev_change_flags at net/core/dev.c:6813
[  745.931679]  dev_ifsioc+0x49b/0x610:
dev_ifsioc at net/core/dev_ioctl.c:257
[  745.939102]  ? mutex_lock_nested+0x14/0x20:
mutex_lock_nested at kernel/locking/mutex.c:909
[  745.948173]  dev_ioctl+0x36f/0xb20:
dev_ioctl at net/core/dev_ioctl.c:566
[  745.956154]  sock_ioctl+0x1cd/0x350:
sock_ioctl at net/socket.c:968
[  745.964313]  ? sock_fasync+0xb0/0xb0:
sock_ioctl at net/socket.c:984
[  745.972512]  vfs_ioctl+0x33/0x70:
vfs_ioctl at fs/ioctl.c:47
[  745.979867]  do_vfs_ioctl+0x8d/0xc60:
do_vfs_ioctl at fs/ioctl.c:690
[  745.987782]  ? kmem_cache_free+0x186/0x290:
kmem_cache_free at include/linux/rcupdate.h:777
[  745.996138]  ? putname+0x9f/0xe0:
putname at fs/namei.c:259
[  746.003434]  ? putname+0x9f/0xe0:
putname at fs/namei.c:259
[  746.011240]  ? do_sys_open+0x28d/0x420:
do_sys_open at fs/open.c:1069
[  746.019728]  ? __fget_light+0xb7/0xf0:
__fget_light at fs/file.c:739 (discriminator 2)
bad symbol size: base: 0xc1277040 end: 0xc1277040
[  746.029292]  SyS_ioctl+0x98/0xb0
[  746.036680]  do_int80_syscall_32+0x95/0x290:
do_int80_syscall_32 at arch/x86/entry/common.c:329
[  746.045685]  entry_INT80_32+0x2f/0x2f:
restore_all at arch/x86/entry/entry_32.S:536
[  746.053427] EIP: 0xb7e97648
[  746.059907] EFLAGS: 0246 CPU: 0
[  746.068336] EAX: ffda EBX: 0005 ECX: 8914 EDX: bfcaa350
[  746.081238] ESI: bfcaa350 EDI: bfcaa370 EBP: bfcaa388 ESP: bfcaa31c
[  746.093449]  DS: 007b ES: 007b FS:  GS: 0033 SS: 007b
[  746.103600] Code: 8d e0 db b4 c4 8d 56 01 89 14 8d e0 db b4 c4 0f 85 03 02 00 00 
89 7d d4 31 f6 8b 7d d8 e9 84 00 00 00 8d 74 26 00 25 ff 01 00 00 <8b> 1c 82 31 
d2 85 db 0f 95 c2 8b 04 95 cc db b4 c4 83 c0 01 85
[  746.140089] EIP: vlan_device_event+0x7f5/0xa40:
vlan_device_event at net/8021q/vlan.h:60 SS:ESP: 0068:d187bda4
[  746.153505] CR2: 6b6b6f4f
[  746.413297] Kernel tests: Boot OK!
[  748.237463] ---[ end trace 40505af7d815b57d ]---
[  748.281157] Kernel panic - not syncing: Fatal exception


Anyway, it does all mean that the use-after-free almost certainly is
that "array[]" access, which was the main suspect to begin with.

I'm adding Jeff Kirsher and the e1000 list to the cc, since apparently
the e1000 driver is involved. Of course, all the stack traces are in
generic networking code, so it is possibly a generic networking issue,
but it would probably be good to have a few people look at it.

But all of this code is ancient. So it's almost certainly not a new
bug, whatever it is.

But as a test-case for your faddr2line improvements, this is excellent!

  Linus



Regards,
Fengguang



#
# Automatically generated file; DO NOT EDIT.
# Linux/i386 4.14.0-rc8 Kernel Configuration
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf32-i386"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/i386_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=8

Re: [vlan_device_event] BUG: unable to handle kernel paging request at 6b6b6ccf


Hi Linus,

On Wed, Nov 08, 2017 at 08:20:38AM -0800, Linus Torvalds wrote:

On Wed, Nov 8, 2017 at 1:48 AM, Fengguang Wu <fengguang...@intel.com> wrote:


Now I got the faddr2line output. :)


Thank you, but this also shows that you then compress the output too
much for convenience.


[  745.719623] BUG: unable to handle kernel paging request at 6b6b6f4f
[  745.732871] IP: vlan_device_event at net/8021q/vlan.h:60


So this looks more legible than "vlan_device_event+0x7f5/0xa40" (or
whatever), but in fact it's not.

The reason faddr2line is great is because it gives inlining
information, so that you can see exactly which access it is, even if
it's inside some helper inline function (macros still obviously are
going to be problematic).

And you've cut the whole information down, to the point where there is
_less_ information than there used to be.

So the full faddr2line output is actually important, and should look
something like this:

   __schedule+0x315/0x840:
   rq_sched_info_arrive at kernel/sched/stats.h:13
(inlined by) sched_info_arrive at kernel/sched/stats.h:99
(inlined by) __sched_info_switch at kernel/sched/stats.h:151
(inlined by) sched_info_switch at kernel/sched/stats.h:158
(inlined by) prepare_task_switch at kernel/sched/core.c:2582
(inlined by) context_switch at kernel/sched/core.c:2755
(inlined by) __schedule at kernel/sched/core.c:3366

which is admittedly a fairly extreme case, but it shows how involved
things can be.


Indeed! I wasn't aware that faddr2line could have so informative
output! After checking several of its outputs, I decided to pipe its
output to |tail -1 :/  That's clearly due to my lacking of first hand
experience on oops analyzing!

IMHO this kind of informative demo in your email could be very good
candidates to put in Documentation/. If you search faddr2line under
Documentation there's no single mention of it (addr2line does show up
7 times).


Note how in my example above the access itself is from a header file
(that inlined rq_sched_info_arrive() function), and I happened to pick
the

   rq->rq_sched_info.pcount++;

line. But Many inline functions are inlined from different points,
often many times in the same top-level function (think of the atomic
helper inlines, for example).

In your case, that net/8021q/vlan.h:60 information is literally not
helping, because it only shows what I could already figure out from
looking at the constants in the disassembly: the code comes from the

   vlan_group_for_each_dev(...) {
   ..

loop setup (it's the inline __vlan_group_get_device() function, and
the constants I could recognize are the VLAN_GROUP_ARRAY_PART_LEN
things.

But exactly like the constants didn't tell me *which* invocation of
that loop it was, your "net/8021q/vlan.h:60" doesn't tell me which one
it is.

There's at least _eight_ different "vlan_group_for_each_dev() {" loops
in vlan_device_event(), and maybe it's not all that meaningful which
of the eight it is in this case, in other cases it's critical.

And since you've actually replaced the "vlan_device_event+0x7f5/0xa40"
with that "net/8021q/vlan.h:60" entirely, all the offset information
that *could* maybe have been used to pick one case over another (but
likely not) is also gone.

That's why that inlining chain is important - so that we can see how
it got to that access in __vlan_group_get_device().

So please do keep _all_ of the faddr2line output, at least for the
"IP:" line (the stack traces might not be worth it).


OK. Here is the original faddr2line output:

$ ~/linux/scripts/faddr2line vmlinux vlan_device_event+0x7f5/0xa40
vlan_device_event+0x7f5/0xa40:
vlan_device_event at net/8021q/vlan.h:60

And below is call trace embedded with full faddr2line output.

I notice that this trace shows no additional inline files at all.
Is it because I did some kconfig option wrong, so that inline info is
lost? Eg.

CONFIG_OPTIMIZE_INLINING=y (reading lib/Kconfig.debug, it looks better set to N)
CONFIG_DEBUG_INFO_REDUCED=y
CONFIG_DEBUG_INFO_SPLIT=y
(full .config attached)

[  745.719623] BUG: unable to handle kernel paging request at 6b6b6f4f
[  745.732871] IP: vlan_device_event+0x7f5/0xa40:
vlan_device_event at net/8021q/vlan.h:60
[  745.742106] *pde = 
[  745.748587] Oops:  [#1] PREEMPT
[  745.756104] CPU: 0 PID: 786 Comm: netifd Not tainted 4.14.0-rc8 #1
[  745.769171] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[  745.786791] task: cf768780 task.stack: d187a000
[  745.796485] EIP: vlan_device_event+0x7f5/0xa40:
vlan_device_event at net/8021q/vlan.h:60
[  745.805877] EFLAGS: 00010206 CPU: 0
[  745.813237] EAX: 00f9 EBX: 0002 ECX:  EDX: 6b6b6b6b
[  745.825774] ESI: 02f9 EDI: d1de3700 EBP: d187bdd8 ESP: d187bda4
[  745.838871]  DS: 007b ES: 007b FS:  GS: 0033 SS: 0068
[  745.850218] CR0: 80050033 CR2: 6b6b6f4f

Re: [mdiobus_free] BUG: KASAN: slab-out-of-bounds in _copy_from_user+0x5d/0x8f


On Tue, Nov 07, 2017 at 08:04:25AM -0800, Linus Torvalds wrote:

On Tue, Nov 7, 2017 at 4:15 AM, Fengguang Wu <fengguang...@intel.com> wrote:

Sorry please ignore this report -- according to Andrey, old gcc may
well generate false KASAN reports.


Oh wow, this gcc is even older than the other one that caused slob problems..


Linux version 4.14.0-rc8 (kbuild@intel11) (gcc version 4.6.4 (Debian 4.6.4-7))


Yeah, please do upgrade your compilers to something from this decade.


Sure. I'll drop gcc 4.4, 4.6, 4.8 testing.

Fengguang

Re: [vlan_device_event] BUG: unable to handle kernel paging request at 6b6b6ccf


On Tue, Nov 07, 2017 at 08:25:03AM -0800, Linus Torvalds wrote:

On Tue, Nov 7, 2017 at 2:21 AM, Fengguang Wu <fengguang...@intel.com> wrote:


FYI this happens in v4.14-rc8 -- it's not necessarily a new bug.


Probably not.

Looks like a use-after-free bug in vlan_device_event() judging by the
base pointer:

   ECX: 6b6b6b6b

this is one of those circumstances where having the faddr2line output
for that EIP would make it much easier to see exactly which access it
is that causes problems. There's lots of inlining going on, so without
that it's a pain to figure out.

The code is

  0: 31 c0xor%eax,%eax
  2: 8d 76 00  lea0x0(%esi),%esi
  5: 89 c2mov%eax,%edx
  7: 89 c3mov%eax,%ebx
  9: 81 e2 ff 0f 00 00and$0xfff,%edx
  f: 89 d1mov%edx,%ecx
 11: c1 fb 0c  sar$0xc,%ebx
 14: c1 e9 09  shr$0x9,%ecx
 17: 8d 0c d9  lea(%ecx,%ebx,8),%ecx
 1a: 8b 4c 8e 10  mov0x10(%esi,%ecx,4),%ecx
 1e: 85 c9test   %ecx,%ecx
 20: 74 34je 0x56
 22: 81 e2 ff 01 00 00and$0x1ff,%edx
 28:* 8b 14 91  mov(%ecx,%edx,4),%edx <-- trapping instruction
 2b: 85 d2test   %edx,%edx
 2d: 74 27je 0x56
 2f: f6 82 30 01 00 00 01 testb  $0x1,0x130(%edx)
 36: 74 1eje 0x56

and just by going by the constants in question (0xfff and 0x1ff), I
can see that it's one of

   vlan_group_for_each_dev(..) {
   ...
   }

things, but that's pretty much all I can tell.

Apparently we'll get that faddr2line output soon. In the meantime, I
think this is a real bug report but I don't see enough information to
really go on.


Got it. I should be able to get faddr2line output tomorrow.


Of course, if it's bisectable, that would be great too.


It looks reproducible enough to be bisectable. I'll try.

Regards,
Fengguang

Re: [sock_def_readable] WARNING: bad unlock balance detected!


Sorry please ignore this report -- according to Peter:

This is fixed by commit:

 02a7c234e540 ("rcu: Suppress lockdep false-positive ->boost_mtx complaints")

The problem is that RCU boosting was mixing futex and !futex rt_mutex
ops.

On Tue, Nov 07, 2017 at 12:34:15PM +0800, Fengguang Wu wrote:

FYI, this warning shows up in both v4.14-rc8 and v4.13:

[  144.578809] br-lan: port 1(eth0) entered disabled state
[  144.581360] device eth0 left promiscuous mode
[  144.582699] br-lan: port 1(eth0) entered disabled state
[  144.685012]
[  144.685370] =
[  144.686093] WARNING: bad unlock balance detected!
[  144.686816] 4.14.0-rc8 #158 Not tainted
[  144.687420] -
[  144.688132] ubus/19703 is trying to release lock (rcu_preempt_state) at:
[  144.688911] [] rcu_read_unlock_special+0x5f8/0x620
[  144.689621] but there are no more locks to release!
[  144.690374]
[  144.690374] other info that might help us debug this:
[  144.691407] 1 lock held by ubus/19703:
[  144.692009]  #0:  (rcu_read_lock){}, at: [] 
sock_def_readable+0x0/0x100
[  144.693212]
[  144.693212] stack backtrace:
[  144.693980] CPU: 0 PID: 19703 Comm: ubus Not tainted 4.14.0-rc8 #158
[  144.694923] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[  144.696168] Call Trace:
[  144.696632]  dump_stack+0x16/0x1c
[  144.697195]  print_unlock_imbalance_bug+0xb9/0xd0
[  144.697940]  ? rcu_read_unlock_special+0x5f8/0x620
[  144.698673]  ? rcu_read_unlock_special+0x5f8/0x620
[  144.699390]  lock_release+0x1cc/0x490
[  144.60]  ? rcu_gp_kthread_wake+0x34/0x50
[  144.700678]  ? rcu_read_unlock_special+0x5f8/0x620
[  144.701431]  rt_mutex_unlock+0x1e/0xb0
[  144.702048]  rcu_read_unlock_special+0x5f8/0x620
[  144.702774]  __rcu_read_unlock+0xa7/0xb0
[  144.703422]  sock_def_readable+0xd1/0x100
[  144.704073]  unix_stream_sendmsg+0x2d1/0x4d0
[  144.704755]  sock_sendmsg+0x2d/0x70
[  144.705334]  ___sys_sendmsg+0x390/0x3a0
[  144.705963]  ? kvm_clock_read+0x31/0x80
[  144.706569]  ? kvm_sched_clock_read+0x9/0x20
[  144.707250]  ? sched_clock+0x9/0x10
[  144.707834]  ? sched_clock_cpu+0x1a/0x1e0
[  144.708473]  ? pvclock_clocksource_read+0xd5/0x230
[  144.709216]  ? kvm_clock_read+0x31/0x80
[  144.709850]  ? kvm_sched_clock_read+0x9/0x20
[  144.710526]  ? sched_clock+0x9/0x10
[  144.711082]  ? sched_clock_cpu+0x1a/0x1e0
[  144.711714]  ? __fget_light+0xb7/0xf0
[  144.712301]  ? sockfd_lookup_light+0xd3/0x120
[  144.713002]  __sys_sendmsg+0x59/0x90
[  144.713585]  SyS_socketcall+0x9e4/0xb20
[  144.714204]  do_int80_syscall_32+0x95/0x290
[  144.714883]  entry_INT80_32+0x2f/0x2f
[  144.715491] EIP: 0xb7f3e595
[  144.715985] EFLAGS: 0292 CPU: 0
[  144.716566] EAX: ffda EBX: 0010 ECX: b678 EDX: b7f83ff4
[  144.717474] ESI: b678 EDI: b6c4 EBP: b708 ESP: b668
[  144.718374]  DS: 007b ES: 007b FS:  GS: 0033 SS: 007b
procd: - reboot -
[  145.174700] Unregister pv shared memory for cpu 0

Here is another one, called from unix_stream_connect():

Please press Enter to activate this console.
[   46.373661] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: 
RX
[   46.376639] 8021q: adding VLAN 0 to HW filter on device eth0
[   46.507035]
[   46.507329] =
[   46.507940] WARNING: bad unlock balance detected!
[   46.508543] 4.14.0-rc8 #158 Not tainted
[   46.509048] -
[   46.509648] fw3/1185 is trying to release lock (rcu_preempt_state) at:
[   46.510452] [] rcu_read_unlock_special+0x5f8/0x620
[   46.511147] but there are no more locks to release!
[   46.511766]
[   46.511766] other info that might help us debug this:
[   46.512591] 1 lock held by fw3/1185:
[   46.513070]  #0:  (rcu_read_lock){}, at: [] 
sock_def_readable+0x0/0x100
[   46.514074]
[   46.514074] stack backtrace:
[   46.514684] CPU: 0 PID: 1185 Comm: fw3 Not tainted 4.14.0-rc8 #158
[   46.515440] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[   46.516463] Call Trace:
[   46.516831]  dump_stack+0x16/0x1c
[   46.517288]  print_unlock_imbalance_bug+0xb9/0xd0
[   46.517896]  ? rcu_read_unlock_special+0x5f8/0x620
[   46.518512]  ? rcu_read_unlock_special+0x5f8/0x620
[   46.519113]  lock_release+0x1cc/0x490
[   46.519604]  ? rcu_read_unlock_special+0x5f8/0x620
[   46.520208]  ? _raw_spin_unlock_irqrestore+0x86/0xd0
[   46.520840]  rt_mutex_unlock+0x1e/0xb0
[   46.521341]  rcu_read_unlock_special+0x5f8/0x620
[   46.521936]  __rcu_read_unlock+0xa7/0xb0
[   46.522460]  sock_def_readable+0xd1/0x100
[   46.522993]  unix_stream_connect+0x633/0x680
[   46.523552]  SYSC_connect+0x107/0x120
[   46.524046]  SyS_socketcall+0x384/0xb20
[   46.524562]  ? SyS_fcntl64+0x103/0x2e0
[   46.525060]  do_int80_syscall_32+0x95/0x290
[   46.525610]  entry_INT80_32+0x2f/0x2f
[   46.526099] EIP: 0xb7e97384
[   46.526505] EFLAGS: 0296 CPU: 0
[   4

[rht_deferred_worker] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 33s!

FYI another workqueue lockup on rhashtable testing.

[  330.583179]   Deleting 5 keys
[  339.302251]   Duration of test: 26623682510 ns
[  339.310019] Average test time: 21144321486
[  339.311680] Testing concurrent rhashtable access from 10 threads
[  367.586868] Writes:  Total: 2  Max/Min: 0/0   Fail: 0 
[  406.501055] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 
stuck for 33s!
[  406.503708] Showing busy workqueues and worker pools:
[  406.510695] workqueue events: flags=0x0
[  406.517288]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
[  406.519526] in-flight: 3:rht_deferred_worker
[  406.520454] pending: vmstat_shepherd, rht_deferred_worker

Attached the full dmesg and kconfig.

Thanks,
Fengguang
early console in setup code
[0.00] Linux version 4.14.0-rc8 (kbuild@lkp-wsm-ep2) (gcc version 6.2.0 
20160901 (Debian 6.2.0-3)) #88 SMP Tue Nov 7 06:44:05 CST 2017
[0.00] Command line: ip=vm-lkp-os-openwrt-ia32-11::dhcp 
root=/dev/ram0 user=lkp 
job=/lkp/scheduled/vm-lkp-os-openwrt-ia32-11/boot-1-openwrt-i386-2016-03-16.cgz-39dae59d66acd86d1de24294bd2f343fd5e7a625-20171107-3579-gel5jo-0.yaml
 ARCH=x86_64 kconfig=x86_64-randconfig-ws0-11070630 branch=linus/master 
commit=39dae59d66acd86d1de24294bd2f343fd5e7a625 
BOOT_IMAGE=/pkg/linux/x86_64-randconfig-ws0-11070630/gcc-6/39dae59d66acd86d1de24294bd2f343fd5e7a625/vmlinuz-4.14.0-rc8
 max_uptime=600 
RESULT_ROOT=/result/boot/1/vm-lkp-os-openwrt-ia32/openwrt-i386-2016-03-16.cgz/x86_64-randconfig-ws0-11070630/gcc-6/39dae59d66acd86d1de24294bd2f343fd5e7a625/0
 LKP_SERVER=inn debug apic=debug sysrq_always_enabled 
rcupdate.rcu_cpu_stall_timeout=100 net.ifnames=0 printk.devkmsg=on panic=-1 
softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 
prompt_ramdisk=0 drbd.minor_count=8 systemd.log_level=err ignore_loglevel 
console=tty0 earlyprintk=ttyS0,115200 console=ttyS0,115200 vga=normal rw 
drbd.minor_c
[0.00] x86/fpu: x87 FPU will use FXSAVE
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x1a3d] usable
[0.00] BIOS-e820: [mem 0x1a3e-0x1a3f] reserved
[0.00] BIOS-e820: [mem 0xfffc-0x] reserved
[0.00] debug: ignoring loglevel setting.
[0.00] NX (Execute Disable) protection: active
[0.00] random: fast init done
[0.00] SMBIOS 2.8 present.
[0.00] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 
04/01/2014
[0.00] tsc: Unable to calibrate against PIT
[0.00] tsc: No reference (HPET/PMTIMER) available
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] e820: last_pfn = 0x1a3e0 max_arch_pfn = 0x4
[0.00] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC  
[0.00] Scan for SMP in [mem 0x-0x03ff]
[0.00] Scan for SMP in [mem 0x0009fc00-0x0009]
[0.00] Scan for SMP in [mem 0x000f-0x000f]
[0.00] found SMP MP-table at [mem 0x000f6aa0-0x000f6aaf] mapped at 
[ff200aa0]
[0.00]   mpc: f6ab0-f6b80
[0.00] Base memory trampoline at [88099000] 99000 size 24576
[0.00] BRK [0x15a08000, 0x15a08fff] PGTABLE
[0.00] BRK [0x15a09000, 0x15a09fff] PGTABLE
[0.00] BRK [0x15a0a000, 0x15a0afff] PGTABLE
[0.00] BRK [0x15a0b000, 0x15a0bfff] PGTABLE
[0.00] RAMDISK: [mem 0x1a11b000-0x1a3d]
[0.00] ACPI: Early table checksum verification disabled
[0.00] ACPI: RSDP 0x000F68D0 14 (v00 BOCHS )
[0.00] ACPI: RSDT 0x1A3E15CE 30 (v01 BOCHS  BXPCRSDT 
0001 BXPC 0001)
[0.00] ACPI: FACP 0x1A3E142A 74 (v01 BOCHS  BXPCFACP 
0001 BXPC 0001)
[0.00] ACPI: DSDT 0x1A3E0040 0013EA (v01 BOCHS  BXPCDSDT 
0001 BXPC 0001)
[0.00] ACPI: FACS 0x1A3E 40
[0.00] ACPI: APIC 0x1A3E151E 78 (v01 BOCHS  BXPCAPIC 
0001 BXPC 0001)
[0.00] ACPI: HPET 0x1A3E1596 38 (v01 BOCHS  BXPCHPET 
0001 BXPC 0001)
[0.00] ACPI: Local APIC address 0xfee0
[0.00] mapped APIC to ff5fd000 (fee0)
[0.00] No NUMA configuration found
[0.00] Faking a node at [mem 0x-0x1a3d]
[0.00] NODE_DATA(0) allocated [mem 0x1a117000-0x1a11afff]
[0.00] Zone ranges:
[0.00]   DMA32[mem 0x1000-0x1a3d]
[0.00]   Normal   empty
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[

[rht_deferred_worker] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1:15]