Re: NULL pointer dereference during memory hotremove
On Fri 17-05-19 13:33:25, Pavel Tatashin wrote:
> On Fri, May 17, 2019 at 1:24 PM Pavel Tatashin wrote:
> >
> > On Fri, May 17, 2019 at 1:22 PM Pavel Tatashin wrote:
> > >
> > > On Fri, May 17, 2019 at 10:38 AM Michal Hocko wrote:
> > > >
> > > > On Fri 17-05-19 10:20:38, Pavel Tatashin wrote:
> > > > > This panic is unrelated to circular lock issue that I reported in a
> > > > > separate thread, that also happens during memory hotremove.
> > > > >
> > > > > xakep ~/x/linux$ git describe
> > > > > v5.1-12317-ga6a4b66bd8f4
> > > >
> > > > Does this happen on 5.0 as well?
> > >
> > > Yes, just reproduced it on 5.0 as well. Unfortunately, I do not have a
> > > script, and have to do it manually, also it does not happen every
> > > time, it happened on 3rd time for me.
> >
> > Actually, sorry, I have not tested 5.0, I compiled 5.0, but my script
> > still tested v5.1-12317-ga6a4b66bd8f4 build. I will report later if I
> > am able to reproduce it on 5.0.
>
> OK, confirmed on 5.0 as well, took 4 tries to reproduce:

What is the last version that survives? Can you bisect?
--
Michal Hocko
SUSE Labs

Re: NULL pointer dereference during memory hotremove
On Fri, May 17, 2019 at 1:24 PM Pavel Tatashin wrote:
>
> On Fri, May 17, 2019 at 1:22 PM Pavel Tatashin wrote:
> >
> > On Fri, May 17, 2019 at 10:38 AM Michal Hocko wrote:
> > >
> > > On Fri 17-05-19 10:20:38, Pavel Tatashin wrote:
> > > > This panic is unrelated to circular lock issue that I reported in a
> > > > separate thread, that also happens during memory hotremove.
> > > >
> > > > xakep ~/x/linux$ git describe
> > > > v5.1-12317-ga6a4b66bd8f4
> > >
> > > Does this happen on 5.0 as well?
> >
> > Yes, just reproduced it on 5.0 as well. Unfortunately, I do not have a
> > script, and have to do it manually, also it does not happen every
> > time, it happened on 3rd time for me.
>
> Actually, sorry, I have not tested 5.0, I compiled 5.0, but my script
> still tested v5.1-12317-ga6a4b66bd8f4 build. I will report later if I
> am able to reproduce it on 5.0.

OK, confirmed on 5.0 as well, took 4 tries to reproduce:

(qemu) [ 17.104486] Offlined Pages 32768
[ 17.105543] Built 1 zonelists, mobility grouping on. Total pages: 1515892
[ 17.106475] Policy zone: Normal
[ 17.107029] BUG: unable to handle kernel NULL pointer dereference at 0698
[ 17.107645] #PF error: [normal kernel read fault]
[ 17.108038] PGD 0 P4D 0
[ 17.108287] Oops: [#1] SMP PTI
[ 17.108557] CPU: 5 PID: 313 Comm: kworker/u16:5 Not tainted 5.0.0_pt_pmem1 #2
[ 17.109128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
[ 17.109910] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[ 17.110323] RIP: 0010:__remove_pages+0x2f/0x520
[ 17.110674] Code: 41 56 41 55 49 89 fd 41 54 55 53 48 89 d3 48 83 ec 68 48 89 4c 24 08 65 48 8b 04 25 28 00 00 00 48 89 44 24 60 31 c0 48 89 f8 <48> 2b 47 58 48 3d 00 19 00 00 0f 85 7f 03 00 00 48 85 c9 0f 84 df
[ 17.112114] RSP: 0018:b43b815f3ca8 EFLAGS: 00010246
[ 17.112518] RAX: 0640 RBX: 0004 RCX:
[ 17.113073] RDX: 0004 RSI: 0024 RDI: 0640
[ 17.113615] RBP: 00024000 R08: R09: 4000
[ 17.114186] R10: 4000 R11: 00024000 R12: e382c900
[ 17.114743] R13: 0640 R14: 0004 R15: 0024
[ 17.115288] FS: () GS:979539b4() knlGS:
[ 17.115911] CS: 0010 DS: ES: CR0: 80050033
[ 17.116356] CR2: 0698 CR3: 000133c22004 CR4: 00360ee0
[ 17.116913] DR0: DR1: DR2:
[ 17.117467] DR3: DR6: fffe0ff0 DR7: 0400
[ 17.118016] Call Trace:
[ 17.118214] ? memblock_isolate_range+0xc4/0x139
[ 17.118570] ? firmware_map_remove+0x48/0x90
[ 17.118908] arch_remove_memory+0x7b/0xc0
[ 17.119216] __remove_memory+0x93/0xc0
[ 17.119528] acpi_memory_device_remove+0x67/0xe0
[ 17.119890] acpi_bus_trim+0x50/0x90
[ 17.120167] acpi_device_hotplug+0x2fc/0x460
[ 17.120498] acpi_hotplug_work_fn+0x15/0x20
[ 17.120834] process_one_work+0x2a0/0x650
[ 17.121146] worker_thread+0x34/0x3d0
[ 17.121432] ? process_one_work+0x650/0x650
[ 17.121772] kthread+0x118/0x130
[ 17.122032] ? kthread_create_on_node+0x60/0x60
[ 17.122413] ret_from_fork+0x3a/0x50
[ 17.122727] Modules linked in:
[ 17.122983] CR2: 0698
[ 17.123250] ---[ end trace 389c4034f6d42e6f ]---
[ 17.123618] RIP: 0010:__remove_pages+0x2f/0x520
[ 17.123979] Code: 41 56 41 55 49 89 fd 41 54 55 53 48 89 d3 48 83 ec 68 48 89 4c 24 08 65 48 8b 04 25 28 00 00 00 48 89 44 24 60 31 c0 48 89 f8 <48> 2b 47 58 48 3d 00 19 00 00 0f 85 7f 03 00 00 48 85 c9 0f 84 df
[ 17.125410] RSP: 0018:b43b815f3ca8 EFLAGS: 00010246
[ 17.125818] RAX: 0640 RBX: 0004 RCX:
[ 17.126359] RDX: 0004 RSI: 0024 RDI: 0640
[ 17.126906] RBP: 00024000 R08: R09: 4000
[ 17.127453] R10: 4000 R11: 00024000 R12: e382c900
[ 17.128008] R13: 0640 R14: 0004 R15: 0024
[ 17.128555] FS: () GS:979539b4() knlGS:
[ 17.129182] CS: 0010 DS: ES: CR0: 80050033
[ 17.129627] CR2: 0698 CR3: 000133c22004 CR4: 00360ee0
[ 17.130182] DR0: DR1: DR2:
[ 17.130744] DR3: DR6: fffe0ff0 DR7: 0400
[ 17.131293] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:34
[ 17.132050] in_atomic(): 0, irqs_disabled(): 1, pid: 313, name: kworker/u16:5
[ 17.132596] INFO: lockdep is turned off.
[ 17.132908] irq event stamp: 14046
[ 17.133175] hardirqs last enabled at (14045): [] kfree+0xba/0x230
[ 17.133777] hardirqs last disabled at (14046): []

Re: NULL pointer dereference during memory hotremove
On Fri, May 17, 2019 at 1:22 PM Pavel Tatashin wrote:
>
> On Fri, May 17, 2019 at 10:38 AM Michal Hocko wrote:
> >
> > On Fri 17-05-19 10:20:38, Pavel Tatashin wrote:
> > > This panic is unrelated to circular lock issue that I reported in a
> > > separate thread, that also happens during memory hotremove.
> > >
> > > xakep ~/x/linux$ git describe
> > > v5.1-12317-ga6a4b66bd8f4
> >
> > Does this happen on 5.0 as well?
>
> Yes, just reproduced it on 5.0 as well. Unfortunately, I do not have a
> script, and have to do it manually, also it does not happen every
> time, it happened on 3rd time for me.

Actually, sorry, I have not tested 5.0, I compiled 5.0, but my script
still tested v5.1-12317-ga6a4b66bd8f4 build. I will report later if I
am able to reproduce it on 5.0.

Pasha

Re: NULL pointer dereference during memory hotremove
On Fri, May 17, 2019 at 10:38 AM Michal Hocko wrote:
>
> On Fri 17-05-19 10:20:38, Pavel Tatashin wrote:
> > This panic is unrelated to circular lock issue that I reported in a
> > separate thread, that also happens during memory hotremove.
> >
> > xakep ~/x/linux$ git describe
> > v5.1-12317-ga6a4b66bd8f4
>
> Does this happen on 5.0 as well?

Yes, just reproduced it on 5.0 as well. Unfortunately, I do not have a
script, and have to do it manually, also it does not happen every
time, it happened on 3rd time for me.

Pasha

Re: NULL pointer dereference during memory hotremove
On 17.05.19 16:38, Michal Hocko wrote:
> On Fri 17-05-19 10:20:38, Pavel Tatashin wrote:
>> This panic is unrelated to circular lock issue that I reported in a
>> separate thread, that also happens during memory hotremove.
>>
>> xakep ~/x/linux$ git describe
>> v5.1-12317-ga6a4b66bd8f4
>
> Does this happen on 5.0 as well?
>

We have on the list

[PATCH V3 1/4] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()

Can that help?

--
Thanks,

David / dhildenb

Re: NULL pointer dereference during memory hotremove
On Fri 17-05-19 10:20:38, Pavel Tatashin wrote:
> This panic is unrelated to circular lock issue that I reported in a
> separate thread, that also happens during memory hotremove.
>
> xakep ~/x/linux$ git describe
> v5.1-12317-ga6a4b66bd8f4

Does this happen on 5.0 as well?
--
Michal Hocko
SUSE Labs

NULL pointer dereference during memory hotremove
This panic is unrelated to circular lock issue that I reported in a
separate thread, that also happens during memory hotremove.

xakep ~/x/linux$ git describe
v5.1-12317-ga6a4b66bd8f4

Config is attached, qemu script is following:

qemu-system-x86_64 \
        -enable-kvm \
        -cpu host \
        -parallel none \
        -echr 1 \
        -serial none \
        -chardev stdio,id=console,signal=off,mux=on \
        -serial chardev:console \
        -mon chardev=console \
        -vga none \
        -display none \
        -kernel pmem/native/arch/x86/boot/bzImage \
        -m 8G,slots=1,maxmem=16G \
        -smp 8 \
        -fsdev local,id=virtfs1,path=/,security_model=none \
        -device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs \
        -append 'earlyprintk=serial,ttyS0,115200 console=ttyS0 TERM=xterm ip=dhcp memmap=2G!6G loglevel=7'

The unusual case with this script is that 2G reserved for pmem device:
memmap=2G!6G. Otherwise, it is a normal layout. Unfortunately, it does
not happen every time, but I have hit it a couple times.

# QEMU 4.0.0 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
# echo online_movable > /sys/devices/system/memory/memory79/state
[ 40.219090] Built 1 zonelists, mobility grouping on. Total pages: 1529279
[ 40.223258] Policy zone: Normal
# (qemu) device_del dimm1
(qemu) [ 49.624600] Offlined Pages 32768
[ 49.625796] Built 1 zonelists, mobility grouping on. Total pages: 1516352
[ 49.627841] Policy zone: Normal
[ 49.630932] BUG: kernel NULL pointer dereference, address: 0698
[ 49.633704] #PF: supervisor read access in kernel mode
[ 49.635689] #PF: error_code(0x) - not-present page
[ 49.637620] PGD 800236b59067 P4D 800236b59067 PUD 2358fe067 PMD 0
[ 49.640163] Oops: [#1] SMP PTI
[ 49.641223] CPU: 0 PID: 7 Comm: kworker/u16:0 Not tainted 5.1.0_pt_pmem #38
[ 49.643183] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
[ 49.645858] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[ 49.647101] RIP: 0010:__remove_pages+0x1a/0x460
[ 49.648165] Code: e9 bb a9 fd ff 0f 0b 66 0f 1f 84 00 00 00 00 00 41 57 48 89 f8 49 89 ff 41 56 49 89 f6 41 55 41 54 55 53 48 89 d3 48 83 ec 50 <48> 2b 47 58 48 89 4c 24 48 48 3d 00 19 00 00 75 09 48 85 c9 0f 85
[ 49.651925] RSP: 0018:bd1000c8fcb8 EFLAGS: 00010286
[ 49.652857] RAX: 0640 RBX: 0004 RCX:
[ 49.654139] RDX: 0004 RSI: 0024 RDI: 0640
[ 49.655393] RBP: R08: R09: 4000
[ 49.656523] R10: 4000 R11: 00024000 R12: 4000
[ 49.657654] R13: 00024000 R14: 0024 R15: 0640
[ 49.658828] FS: () GS:9b4bf980() knlGS:
[ 49.660178] CS: 0010 DS: ES: CR0: 80050033
[ 49.661033] CR2: 0698 CR3: 0002382e0006 CR4: 00360ef0
[ 49.662114] DR0: DR1: DR2:
[ 49.663172] DR3: DR6: fffe0ff0 DR7: 0400
[ 49.664243] Call Trace:
[ 49.664622] ? memblock_isolate_range+0xc4/0x139
[ 49.665290] ? firmware_map_add_hotplug+0x7e/0xde
[ 49.665908] ? memblock_remove_region+0x30/0x74
[ 49.666498] arch_remove_memory+0x6f/0xa0
[ 49.667012] __remove_memory+0xab/0x130
[ 49.667492] ? walk_memory_range+0xa1/0xe0
[ 49.668008] acpi_memory_device_remove+0x67/0xe0
[ 49.668595] acpi_bus_trim+0x50/0x90
[ 49.669051] acpi_device_hotplug+0x2fa/0x3e0
[ 49.669590] acpi_hotplug_work_fn+0x15/0x20
[ 49.670116] process_one_work+0x2a0/0x650
[ 49.670577] worker_thread+0x34/0x3d0
[ 49.670997] ? process_one_work+0x650/0x650
[ 49.671503] kthread+0x118/0x130
[ 49.671879] ? kthread_create_on_node+0x60/0x60
[ 49.672411] ret_from_fork+0x3a/0x50
[ 49.672836] Modules linked in:
[ 49.673190] CR2: 0698
[ 49.673583] ---[ end trace 6b727d3a8ce48aa1 ]---
[ 49.674120] RIP: 0010:__remove_pages+0x1a/0x460
[ 49.674624] Code: e9 bb a9 fd ff 0f 0b 66 0f 1f 84 00 00 00 00 00 41 57 48 89 f8 49 89 ff 41 56 49 89 f6 41 55 41 54
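
Editorial note on reading the oops: in both reports the faulting bytes marked in the Code: line are <48> 2b 47 58, and the register dump shows RDI: 0640 with CR2: 0698. The sketch below decodes just that one instruction form and checks that the base register plus the displacement matches the reported fault address. It assumes the archive stripped leading zeros from the register values (so 0640 means 0x640 and 0698 means 0x698); the helper name decode_sub_r64_rm64_disp8 is made up for this illustration and is not a kernel or tooling API.

```python
# Values copied from the oops; ASSUMPTION: leading zeros were stripped by the archive.
RDI = 0x640  # register dump: "RDI: 0640"
CR2 = 0x698  # faulting address: "CR2: 0698"

# Faulting instruction bytes, starting at the <48> marker in the Code: line.
insn = bytes([0x48, 0x2B, 0x47, 0x58])

# Low three bits of the ModRM fields select one of these registers (no REX.R/REX.B set).
REG_NAMES = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_sub_r64_rm64_disp8(b):
    """Decode only 'REX.W 2B /r' with an 8-bit displacement (hypothetical helper)."""
    rex, opcode, modrm, disp8 = b
    if rex != 0x48 or opcode != 0x2B:
        raise ValueError("expected REX.W SUB r64, r/m64")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b01 or rm == 0b100:
        raise ValueError("expected [base + disp8] addressing without a SIB byte")
    disp = disp8 - 256 if disp8 >= 0x80 else disp8  # disp8 is sign-extended
    return REG_NAMES[reg], REG_NAMES[rm], disp


dst, base, disp = decode_sub_r64_rm64_disp8(insn)
fault_addr = RDI + disp  # base register here is rdi

print(f"sub {disp:#x}(%{base}),%{dst}  reads memory at {fault_addr:#x}")
print("matches CR2" if fault_addr == CR2 else "does not match CR2")
```

Under those assumptions the instruction is a read at offset 0x58 from whatever RDI held, which is consistent with the pointer in RDI being a small bogus value (0x640) rather than exactly NULL. RDI carries the first integer argument under the x86-64 System V calling convention, but whether it still held the function's first argument at this point is not something the log alone proves.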