Re: [v5 0/3] "Hotremove" persistent memory
On 17.05.19 16:09, Pavel Tatashin wrote:
>>
>> I would think that ACPI hotplug would have a similar problem, but it does
>> this:
>>
>>     acpi_unbind_memory_blocks(info);
>>     __remove_memory(nid, info->start_addr, info->length);
>
> ACPI does have exactly the same problem, so this is not a bug for this
> series. I will submit a new version of my series with the review comments
> addressed, but without a fix for this issue.
>
> I was able to reproduce this issue on the current mainline kernel, and I
> have been thinking more about how to fix it: there is no easy fix without
> a major hotplug redesign. Basically, we have to remove the sysfs memory
> entries either before or after memory is hotplugged/hotremoved, but we
> also have to guarantee that hotplug/hotremove will succeed or reinstate
> the sysfs entries.
>
> Qemu script:
>
> qemu-system-x86_64 \
>     -enable-kvm \
>     -cpu host \
>     -parallel none \
>     -echr 1 \
>     -serial none \
>     -chardev stdio,id=console,signal=off,mux=on \
>     -serial chardev:console \
>     -mon chardev=console \
>     -vga none \
>     -display none \
>     -kernel pmem/native/arch/x86/boot/bzImage \
>     -m 8G,slots=1,maxmem=16G \
>     -smp 8 \
>     -fsdev local,id=virtfs1,path=/,security_model=none \
>     -device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs \
>     -append 'earlyprintk=serial,ttyS0,115200 console=ttyS0 TERM=xterm ip=dhcp loglevel=7'
>
> Config is attached.
>
> Steps to reproduce:
>
> # QEMU 4.0.0 monitor - type 'help' for more information
> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
> (qemu)
>
> # echo online_movable > /sys/devices/system/memory/memory79/state
> [   23.029552] Built 1 zonelists, mobility grouping on.  Total pages: 2045370
> [   23.032591] Policy zone: Normal
> # (qemu) device_del dimm1
> (qemu) [   32.013950] Offlined Pages 32768
> [   32.014307] Built 1 zonelists, mobility grouping on.  Total pages: 2031022
> [   32.014843] Policy zone: Normal
> [   32.015733]
> [   32.015881] ======================================================
> [   32.016390] WARNING: possible circular locking dependency detected
> [   32.016881] 5.1.0_pt_pmem #38 Not tainted
> [   32.017202] ------------------------------------------------------
> [   32.017680] kworker/u16:4/380 is trying to acquire lock:
> [   32.018096] 675cc7e1 (kn->count#18){}, at:
> kernfs_remove_by_name_ns+0x3b/0x80
> [   32.018745]
> [   32.018745] but task is already holding lock:
> [   32.019201] 53e50a99 (mem_sysfs_mutex){+.+.}, at:
> unregister_memory_section+0x1d/0xa0
> [   32.019859]
> [   32.019859] which lock already depends on the new lock.
> [   32.019859]
> [   32.020499]
> [   32.020499] the existing dependency chain (in reverse order) is:
> [   32.021080]
> [   32.021080] -> #4 (mem_sysfs_mutex){+.+.}:
> [   32.021522]        __mutex_lock+0x8b/0x900
> [   32.021843]        hotplug_memory_register+0x26/0xa0
> [   32.022231]        __add_pages+0xe7/0x160
> [   32.022545]        add_pages+0xd/0x60
> [   32.022835]        add_memory_resource+0xc3/0x1d0
> [   32.023207]        __add_memory+0x57/0x80
> [   32.023530]        acpi_memory_device_add+0x13a/0x2d0
> [   32.023928]        acpi_bus_attach+0xf1/0x200
> [   32.024272]        acpi_bus_scan+0x3e/0x90
> [   32.024597]        acpi_device_hotplug+0x284/0x3e0
> [   32.024972]        acpi_hotplug_work_fn+0x15/0x20
> [   32.025342]        process_one_work+0x2a0/0x650
> [   32.025755]        worker_thread+0x34/0x3d0
> [   32.026077]        kthread+0x118/0x130
> [   32.026442]        ret_from_fork+0x3a/0x50
> [   32.026766]
> [   32.026766] -> #3 (mem_hotplug_lock.rw_sem){}:
> [   32.027261]        get_online_mems+0x39/0x80
> [   32.027600]        kmem_cache_create_usercopy+0x29/0x2c0
> [   32.028019]        kmem_cache_create+0xd/0x10
> [   32.028367]        ptlock_cache_init+0x1b/0x23
> [   32.028724]        start_kernel+0x1d2/0x4b8
> [   32.029060]        secondary_startup_64+0xa4/0xb0
> [   32.029447]
> [   32.029447] -> #2 (cpu_hotplug_lock.rw_sem){}:
> [   32.030007]        cpus_read_lock+0x39/0x80
> [   32.030360]        __offline_pages+0x32/0x790
> [   32.030709]        memory_subsys_offline+0x3a/0
Re: [v5 0/3] "Hotremove" persistent memory
>
> I would think that ACPI hotplug would have a similar problem, but it does
> this:
>
>     acpi_unbind_memory_blocks(info);
>     __remove_memory(nid, info->start_addr, info->length);

ACPI does have exactly the same problem, so this is not a bug for this
series. I will submit a new version of my series with the review comments
addressed, but without a fix for this issue.

I was able to reproduce this issue on the current mainline kernel, and I
have been thinking more about how to fix it: there is no easy fix without
a major hotplug redesign. Basically, we have to remove the sysfs memory
entries either before or after memory is hotplugged/hotremoved, but we
also have to guarantee that hotplug/hotremove will succeed or reinstate
the sysfs entries.

Qemu script:

qemu-system-x86_64 \
    -enable-kvm \
    -cpu host \
    -parallel none \
    -echr 1 \
    -serial none \
    -chardev stdio,id=console,signal=off,mux=on \
    -serial chardev:console \
    -mon chardev=console \
    -vga none \
    -display none \
    -kernel pmem/native/arch/x86/boot/bzImage \
    -m 8G,slots=1,maxmem=16G \
    -smp 8 \
    -fsdev local,id=virtfs1,path=/,security_model=none \
    -device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs \
    -append 'earlyprintk=serial,ttyS0,115200 console=ttyS0 TERM=xterm ip=dhcp loglevel=7'

Config is attached.

Steps to reproduce:

# QEMU 4.0.0 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
(qemu)

# echo online_movable > /sys/devices/system/memory/memory79/state
[   23.029552] Built 1 zonelists, mobility grouping on.  Total pages: 2045370
[   23.032591] Policy zone: Normal
# (qemu) device_del dimm1
(qemu) [   32.013950] Offlined Pages 32768
[   32.014307] Built 1 zonelists, mobility grouping on.  Total pages: 2031022
[   32.014843] Policy zone: Normal
[   32.015733]
[   32.015881] ======================================================
[   32.016390] WARNING: possible circular locking dependency detected
[   32.016881] 5.1.0_pt_pmem #38 Not tainted
[   32.017202] ------------------------------------------------------
[   32.017680] kworker/u16:4/380 is trying to acquire lock:
[   32.018096] 675cc7e1 (kn->count#18){}, at:
kernfs_remove_by_name_ns+0x3b/0x80
[   32.018745]
[   32.018745] but task is already holding lock:
[   32.019201] 53e50a99 (mem_sysfs_mutex){+.+.}, at:
unregister_memory_section+0x1d/0xa0
[   32.019859]
[   32.019859] which lock already depends on the new lock.
[   32.019859]
[   32.020499]
[   32.020499] the existing dependency chain (in reverse order) is:
[   32.021080]
[   32.021080] -> #4 (mem_sysfs_mutex){+.+.}:
[   32.021522]        __mutex_lock+0x8b/0x900
[   32.021843]        hotplug_memory_register+0x26/0xa0
[   32.022231]        __add_pages+0xe7/0x160
[   32.022545]        add_pages+0xd/0x60
[   32.022835]        add_memory_resource+0xc3/0x1d0
[   32.023207]        __add_memory+0x57/0x80
[   32.023530]        acpi_memory_device_add+0x13a/0x2d0
[   32.023928]        acpi_bus_attach+0xf1/0x200
[   32.024272]        acpi_bus_scan+0x3e/0x90
[   32.024597]        acpi_device_hotplug+0x284/0x3e0
[   32.024972]        acpi_hotplug_work_fn+0x15/0x20
[   32.025342]        process_one_work+0x2a0/0x650
[   32.025755]        worker_thread+0x34/0x3d0
[   32.026077]        kthread+0x118/0x130
[   32.026442]        ret_from_fork+0x3a/0x50
[   32.026766]
[   32.026766] -> #3 (mem_hotplug_lock.rw_sem){}:
[   32.027261]        get_online_mems+0x39/0x80
[   32.027600]        kmem_cache_create_usercopy+0x29/0x2c0
[   32.028019]        kmem_cache_create+0xd/0x10
[   32.028367]        ptlock_cache_init+0x1b/0x23
[   32.028724]        start_kernel+0x1d2/0x4b8
[   32.029060]        secondary_startup_64+0xa4/0xb0
[   32.029447]
[   32.029447] -> #2 (cpu_hotplug_lock.rw_sem){}:
[   32.030007]        cpus_read_lock+0x39/0x80
[   32.030360]        __offline_pages+0x32/0x790
[   32.030709]        memory_subsys_offline+0x3a/0x60
[   32.031089]        device_offline+0x7e/0xb0
[   32.031425]        acpi_bus_offline+0xd8/0x140
[   32.031821]        acpi_device_hotplug+0x1b2/0x3e0
[   32.032202]        acpi_hotplug_work_fn+0x15/0x20
[   32.032576]        process_o
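The reproduction above onlines the new memory block by hand with `echo online_movable > .../state`. As a rough illustration of that sysfs interface, here is a small Python sketch that onlines every offline memory block; the helper name `online_all_blocks` and the `root` parameter are my own additions (the parameter exists only so the logic can be exercised against a fake directory tree rather than a live kernel):

```python
# Sketch (not from the thread): online every offline memory block by
# writing to its sysfs 'state' file, as done by hand in the reproducer.
# 'root' is an illustrative knob; on a real system it would be
# "/sys/devices/system/memory".
from pathlib import Path

def online_all_blocks(root="/sys/devices/system/memory", mode="online_movable"):
    """Write `mode` into the state file of every offline memory block.

    Returns the names of the blocks that were transitioned.
    """
    onlined = []
    for state in sorted(Path(root).glob("memory*/state")):
        if state.read_text().strip() == "offline":
            state.write_text(mode)  # equivalent of: echo online_movable > .../state
            onlined.append(state.parent.name)
    return onlined
```

On a real kernel the write can fail (e.g. with EBUSY) when pages cannot be offlined or migrated, and, as this thread shows, during hotremove the `state` files themselves may be in the middle of being torn down while a writer still holds them open.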
Re: [v5 0/3] "Hotremove" persistent memory
On 16.05.19 02:42, Dan Williams wrote:
> On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin
> wrote:
>>
>>> Hi Pavel,
>>>
>>> I am working on adding this sort of a workflow into a new daxctl command
>>> (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
>>> dax device to kmem, online the resulting memory, and with your patches,
>>> also attempt to offline the memory, and change back to device-dax.
>>>
>>> In running with these patches, and testing the offlining part, I ran
>>> into the following lockdep below.
>>>
>>> This is with just these three patches on top of -rc7.
>>>
>>> [  +0.004886] ======================================================
>>> [  +0.001576] WARNING: possible circular locking dependency detected
>>> [  +0.001506] 5.1.0-rc7+ #13 Tainted: G O
>>> [  +0.000929] ------------------------------------------------------
>>> [  +0.000708] daxctl/22950 is trying to acquire lock:
>>> [  +0.000548] f4d397f7 (kn->count#424){}, at:
>>> kernfs_remove_by_name_ns+0x40/0x80
>>> [  +0.000922]
>>> but task is already holding lock:
>>> [  +0.000657] 2aa52a9f (mem_sysfs_mutex){+.+.}, at:
>>> unregister_memory_section+0x22/0xa0
>>
>> I have studied this issue, and now have a clear understanding why it
>> happens. I am not yet sure how to fix it, so suggestions are welcomed :)
>
> I would think that ACPI hotplug would have a similar problem, but it does
> this:
>
>     acpi_unbind_memory_blocks(info);
>     __remove_memory(nid, info->start_addr, info->length);
>
> I wonder if that ordering prevents going too deep into the
> device_unregister() call stack that you highlighted below.

If that doesn't help, after we have

  [PATCH v2 0/8] mm/memory_hotplug: Factor out memory block device handling

we could probably pull the memory device removal phase out from the
mem_hotplug_lock protection and let it be protected by the
device_hotplug_lock only. Might require some more work, though.

-- 
Thanks,

David / dhildenb
Re: [v5 0/3] "Hotremove" persistent memory
On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin wrote:
>
> > Hi Pavel,
> >
> > I am working on adding this sort of a workflow into a new daxctl command
> > (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> > dax device to kmem, online the resulting memory, and with your patches,
> > also attempt to offline the memory, and change back to device-dax.
> >
> > In running with these patches, and testing the offlining part, I ran
> > into the following lockdep below.
> >
> > This is with just these three patches on top of -rc7.
> >
> > [  +0.004886] ======================================================
> > [  +0.001576] WARNING: possible circular locking dependency detected
> > [  +0.001506] 5.1.0-rc7+ #13 Tainted: G O
> > [  +0.000929] ------------------------------------------------------
> > [  +0.000708] daxctl/22950 is trying to acquire lock:
> > [  +0.000548] f4d397f7 (kn->count#424){}, at:
> > kernfs_remove_by_name_ns+0x40/0x80
> > [  +0.000922]
> > but task is already holding lock:
> > [  +0.000657] 2aa52a9f (mem_sysfs_mutex){+.+.}, at:
> > unregister_memory_section+0x22/0xa0
>
> I have studied this issue, and now have a clear understanding why it
> happens. I am not yet sure how to fix it, so suggestions are welcomed :)

I would think that ACPI hotplug would have a similar problem, but it does
this:

    acpi_unbind_memory_blocks(info);
    __remove_memory(nid, info->start_addr, info->length);

I wonder if that ordering prevents going too deep into the
device_unregister() call stack that you highlighted below.

> Here is the problem:
>
> When we offline pages we have the following call stack:
>
> # echo offline > /sys/devices/system/memory/memory8/state
> ksys_write
>  vfs_write
>   __vfs_write
>    kernfs_fop_write
>     kernfs_get_active
>      lock_acquire              kn->count#122 (lock for "memory8/state" kn)
>     sysfs_kf_write
>      dev_attr_store
>       state_store
>        device_offline
>         memory_subsys_offline
>          memory_block_action
>           offline_pages
>            __offline_pages
>             percpu_down_write
>              down_write
>               lock_acquire     mem_hotplug_lock.rw_sem
>
> When we unbind dax0.0 we have the following stack:
>
> # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
> drv_attr_store
>  unbind_store
>   device_driver_detach
>    device_release_driver_internal
>     dev_dax_kmem_remove
>      remove_memory             device_hotplug_lock
>       try_remove_memory        mem_hotplug_lock.rw_sem
>        arch_remove_memory
>         __remove_pages
>          __remove_section
>           unregister_memory_section
>            remove_memory_section     mem_sysfs_mutex
>             unregister_memory
>              device_unregister
>               device_del
>                device_remove_attrs
>                 sysfs_remove_groups
>                  sysfs_remove_group
>                   remove_files
>                    kernfs_remove_by_name
>                     kernfs_remove_by_name_ns
>                      __kernfs_remove         kn->count#122
>
> So, lockdep found the ordering issue with the above two stacks:
>
> 1. kn->count#122 -> mem_hotplug_lock.rw_sem
> 2. mem_hotplug_lock.rw_sem -> kn->count#122
Re: [v5 0/3] "Hotremove" persistent memory
> Hi Pavel,
>
> I am working on adding this sort of a workflow into a new daxctl command
> (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> dax device to kmem, online the resulting memory, and with your patches,
> also attempt to offline the memory, and change back to device-dax.
>
> In running with these patches, and testing the offlining part, I ran
> into the following lockdep below.
>
> This is with just these three patches on top of -rc7.
>
> [  +0.004886] ======================================================
> [  +0.001576] WARNING: possible circular locking dependency detected
> [  +0.001506] 5.1.0-rc7+ #13 Tainted: G O
> [  +0.000929] ------------------------------------------------------
> [  +0.000708] daxctl/22950 is trying to acquire lock:
> [  +0.000548] f4d397f7 (kn->count#424){}, at:
> kernfs_remove_by_name_ns+0x40/0x80
> [  +0.000922]
> but task is already holding lock:
> [  +0.000657] 2aa52a9f (mem_sysfs_mutex){+.+.}, at:
> unregister_memory_section+0x22/0xa0

I have studied this issue, and now have a clear understanding why it
happens. I am not yet sure how to fix it, so suggestions are welcomed :)

Here is the problem:

When we offline pages we have the following call stack:

# echo offline > /sys/devices/system/memory/memory8/state
ksys_write
 vfs_write
  __vfs_write
   kernfs_fop_write
    kernfs_get_active
     lock_acquire              kn->count#122 (lock for "memory8/state" kn)
    sysfs_kf_write
     dev_attr_store
      state_store
       device_offline
        memory_subsys_offline
         memory_block_action
          offline_pages
           __offline_pages
            percpu_down_write
             down_write
              lock_acquire     mem_hotplug_lock.rw_sem

When we unbind dax0.0 we have the following stack:

# echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
drv_attr_store
 unbind_store
  device_driver_detach
   device_release_driver_internal
    dev_dax_kmem_remove
     remove_memory             device_hotplug_lock
      try_remove_memory        mem_hotplug_lock.rw_sem
       arch_remove_memory
        __remove_pages
         __remove_section
          unregister_memory_section
           remove_memory_section     mem_sysfs_mutex
            unregister_memory
             device_unregister
              device_del
               device_remove_attrs
                sysfs_remove_groups
                 sysfs_remove_group
                  remove_files
                   kernfs_remove_by_name
                    kernfs_remove_by_name_ns
                     __kernfs_remove         kn->count#122

So, lockdep found the ordering issue with the above two stacks:

1. kn->count#122 -> mem_hotplug_lock.rw_sem
2. mem_hotplug_lock.rw_sem -> kn->count#122
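The two orderings above form a classic ABBA inversion, and lockdep catches it by recording, for every acquisition, which locks are already held. The following toy sketch is my own illustration of that idea, not kernel code (only the lock names come from the report): record a held-to-wanted edge on every acquisition and flag any new edge that closes a cycle in the dependency graph.

```python
# Toy sketch of lockdep-style ordering checks (heavily simplified, not
# kernel code): record a "held -> wanted" edge per acquisition and
# complain when a new edge would create a cycle.
class LockOrderTracker:
    def __init__(self):
        self.edges = {}  # lock -> set of locks acquired while it was held

    def _reaches(self, src, dst, seen=None):
        # Depth-first reachability over recorded ordering edges.
        if seen is None:
            seen = set()
        if src == dst:
            return True
        seen.add(src)
        return any(self._reaches(n, dst, seen)
                   for n in self.edges.get(src, ()) if n not in seen)

    def acquire(self, held, wanted):
        """Record that `wanted` is taken while `held` is held.

        Returns True if this inverts an already-recorded ordering,
        i.e. `wanted` can (transitively) be held while waiting for `held`.
        """
        inversion = self._reaches(wanted, held)
        self.edges.setdefault(held, set()).add(wanted)
        return inversion

# The two stacks from the report, reduced to their endpoint locks:
t = LockOrderTracker()
t.acquire("kn->count#122", "mem_hotplug_lock.rw_sem")         # offline path: no cycle yet
print(t.acquire("mem_hotplug_lock.rw_sem", "kn->count#122"))  # unbind path: prints True
```

In the real report the cycle runs through intermediate locks (mem_sysfs_mutex, cpu_hotplug_lock), which is why the reachability check has to be transitive rather than comparing single pairs.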
Re: [v5 0/3] "Hotremove" persistent memory
On Thu, 2019-05-02 at 14:43 -0400, Pavel Tatashin wrote:
> The series of operations look like this:
>
> 1. After boot, restore /dev/pmem0 to ramdisk to be consumed by apps,
>    and free the ramdisk.
> 2. Convert raw pmem0 to devdax
>    ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
> 3. Hotadd to System RAM
>    echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>    echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>    echo online_movable > /sys/devices/system/memoryXXX/state
> 4. Before reboot, hotremove the device-dax memory from System RAM
>    echo offline > /sys/devices/system/memoryXXX/state
>    echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind

Hi Pavel,

I am working on adding this sort of a workflow into a new daxctl command
(daxctl-reconfigure-device)- this will allow changing the 'mode' of a
dax device to kmem, online the resulting memory, and with your patches,
also attempt to offline the memory, and change back to device-dax.

In running with these patches, and testing the offlining part, I ran
into the following lockdep below.

This is with just these three patches on top of -rc7.

[  +0.004886] ======================================================
[  +0.001576] WARNING: possible circular locking dependency detected
[  +0.001506] 5.1.0-rc7+ #13 Tainted: G O
[  +0.000929] ------------------------------------------------------
[  +0.000708] daxctl/22950 is trying to acquire lock:
[  +0.000548] f4d397f7 (kn->count#424){}, at:
kernfs_remove_by_name_ns+0x40/0x80
[  +0.000922]
but task is already holding lock:
[  +0.000657] 2aa52a9f (mem_sysfs_mutex){+.+.}, at:
unregister_memory_section+0x22/0xa0
[  +0.000960]
which lock already depends on the new lock.

[  +0.001001]
the existing dependency chain (in reverse order) is:
[  +0.000837]
-> #3 (mem_sysfs_mutex){+.+.}:
[  +0.000631]        __mutex_lock+0x82/0x9a0
[  +0.000477]        unregister_memory_section+0x22/0xa0
[  +0.000582]        __remove_pages+0xe9/0x520
[  +0.000489]        arch_remove_memory+0x81/0xc0
[  +0.000510]        devm_memremap_pages_release+0x180/0x270
[  +0.000633]        release_nodes+0x234/0x280
[  +0.000483]        device_release_driver_internal+0xf4/0x1d0
[  +0.000701]        bus_remove_device+0xfc/0x170
[  +0.000529]        device_del+0x16a/0x380
[  +0.000459]        unregister_dev_dax+0x23/0x50
[  +0.000526]        release_nodes+0x234/0x280
[  +0.000487]        device_release_driver_internal+0xf4/0x1d0
[  +0.000646]        unbind_store+0x9b/0x130
[  +0.000467]        kernfs_fop_write+0xf0/0x1a0
[  +0.000510]        vfs_write+0xba/0x1c0
[  +0.000438]        ksys_write+0x5a/0xe0
[  +0.000521]        do_syscall_64+0x60/0x210
[  +0.000489]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000637]
-> #2 (mem_hotplug_lock.rw_sem){}:
[  +0.000717]        get_online_mems+0x3e/0x80
[  +0.000491]        kmem_cache_create_usercopy+0x2e/0x270
[  +0.000609]        kmem_cache_create+0x12/0x20
[  +0.000507]        ptlock_cache_init+0x20/0x28
[  +0.000506]        start_kernel+0x240/0x4d0
[  +0.000480]        secondary_startup_64+0xa4/0xb0
[  +0.000539]
-> #1 (cpu_hotplug_lock.rw_sem){}:
[  +0.000784]        cpus_read_lock+0x3e/0x80
[  +0.000511]        online_pages+0x37/0x310
[  +0.000469]        memory_subsys_online+0x34/0x60
[  +0.000611]        device_online+0x60/0x80
[  +0.000611]        state_store+0x66/0xd0
[  +0.000552]        kernfs_fop_write+0xf0/0x1a0
[  +0.000649]        vfs_write+0xba/0x1c0
[  +0.000487]        ksys_write+0x5a/0xe0
[  +0.000459]        do_syscall_64+0x60/0x210
[  +0.000482]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000646]
-> #0 (kn->count#424){}:
[  +0.000669]        lock_acquire+0x9e/0x180
[  +0.000471]        __kernfs_remove+0x26a/0x310
[  +0.000518]        kernfs_remove_by_name_ns+0x40/0x80
[  +0.000583]        remove_files.isra.1+0x30/0x70
[  +0.000555]        sysfs_remove_group+0x3d/0x80
[  +0.000524]        sysfs_remove_groups+0x29/0x40
[  +0.000532]        device_remove_attrs+0x42/0x80
[  +0.000522]        device_del+0x162/0x380
[  +0.000464]        device_unregister+0x16/0x60
[  +0.000505]        unregister_memory_section+0x6e/0xa0
[  +0.000591]        __remove_pages+0xe9/0x520
[  +0.000492]        arch_remove_memory+0x81/0xc0
[  +0.000568]        try_remove_memory+0xba/0xd0
[  +0.000510]        remove_memory+0x23/0x40
[  +0.000483]        dev_dax_kmem_remove+0x29/0x57 [kmem]
[  +0.000608]        device_release_driver_internal+0xe4/0x1d0
[  +0.000637]        unbind_store+0x9b/0x130
[  +0.000464]        kernfs_fop_write+0xf0/0x1a0
[  +0.000685]        vfs_write+0xba/0x1c0
[  +0.000594]        ksys_write+0x5a/0xe0
[  +0.000449]        do_syscall_64+0x60/0x210
[  +0.000481]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000619]
other info that might help us debug this:

[  +0.000889] Chain exists of: