Re: [Question] Is there a race window between swapoff vs synchronous swap_readpage
On 2021/3/30 15:27, Yu Zhao wrote:
> On Tue, Mar 30, 2021 at 12:57 AM Huang, Ying wrote:
>>
>> Yu Zhao writes:
>>
>>> On Mon, Mar 29, 2021 at 9:44 PM Huang, Ying wrote:
>>>>
>>>> Miaohe Lin writes:
>>>>
>>>>> On 2021/3/30 9:57, Huang, Ying wrote:
>>>>>> Hi, Miaohe,
>>>>>>
>>>>>> Miaohe Lin writes:
>>>>>>
>>>>>>> Hi all,
>>>>>>> I am investigating the swap code, and I found the possible race window
>>>>>>> below:
>>>>>>>
>>>>>>> CPU 1                                   CPU 2
>>>>>>> -----                                   -----
>>>>>>> do_swap_page
>>>>>>>   skip swapcache case (synchronous swap_readpage)
>>>>>>>     alloc_page_vma
>>>>>>>                                         swapoff
>>>>>>>                                           release swap_file, bdev, or ...
>>>>>>>       swap_readpage
>>>>>>>         check sis->flags is ok
>>>>>>>           access swap_file, bdev or ...[oops!]
>>>>>>>                                             si->flags = 0
>>>>>>>
>>>>>>> The swapcache case is OK because swapoff will wait on the page_lock of
>>>>>>> the swapcache page.
>>>>>>> Can this really happen, or am I missing something?
>>>>>>> Any reply would be much appreciated. Thanks! :)
>>>>>>
>>>>>> This appears possible. Even for the swapcache case, we can't guarantee that the
>>>>>
>>>>> Many thanks for the reply!
>>>>>
>>>>>> swap entry gotten from the page table is always valid either. The
>>>>>
>>>>> The page table may change at any time, and we may thus do some useless
>>>>> work. But the pte_same() check could handle these races correctly if
>>>>> they do not result in an oops.
>>>>>
>>>>>> underlying swap device can be swapped off at the same time. So we use
>>>>>> get/put_swap_device() for that. Maybe we need something similar here.
>>>>>
>>>>> Using get/put_swap_device() to guard against swapoff for swap_readpage()
>>>>> sounds really bad, as swap_readpage() may take a really long time. Also,
>>>>> such a race may not be really harmful because swapoff is usually done
>>>>> only at system shutdown. I cannot figure out a simple and stable fix for
>>>>> this. Any suggestions, or could anyone help get rid of this race?
>>>>
>>>> Some reference counting on the swap device can prevent the swap device
>>>> from being swapped off. To reduce the performance overhead on the hot
>>>> path as much as possible, it appears we can use percpu_ref.
>>>
>>> Hi,
>>>
>>> I've been seeing crashes when testing the latest kernels with
>>>   stress-ng --class vm -a 20 -t 600s --temp-path /tmp
>>>
>>> I haven't had time to look into them yet:
>>>
>>> DEBUG_VM:
>>>   BUG: unable to handle page fault for address: 905c33c9a000
>>>   Call Trace:
>>>    get_swap_pages+0x278/0x590
>>>    get_swap_page+0x1ab/0x280
>>>    add_to_swap+0x7d/0x130
>>>    shrink_page_list+0xf84/0x25f0
>>>    reclaim_pages+0x313/0x430
>>>    madvise_cold_or_pageout_pte_range+0x95c/0xaa0
>>
>> If my understanding is correct, two bugs are reported? One above and
>> one below? If so, and the one above was reported first, can you share
>> the full bug message reported in dmesg?
>
> No, they are from two different kernel configs. I saw the first crash
> and didn't know what to look for. So I turned on KASAN to see if it
> gives more clues. Unfortunately I haven't had time to spend more time
> on it.
>
>> Can you convert the call trace to source lines? And the commit of the
>> kernel? Or the full kconfig? So I can build it myself.
>
> It seems to be very reproducible if you enable these three options, on
> 5.12, 5.11 and 5.10, which is where I gave up trying.
>
>>> CONFIG_MEMCG_SWAP=y
>>> CONFIG_THP_SWAP=y
>>> CONFIG_ZSWAP=y
>
> I'll dig into the log and see if I could at least give you the line
> numbers. Kernel config attached. Thanks!
>

Maybe we could try to fix this issue here with more detailed info. Thanks.

> And the command line I used, which is nothing fancy:
>
>>> stress-ng --class vm -a 20 -t 600s --temp-path /tmp
>
>>> KASAN:
>>> ==
>>> BUG: KASAN: slab-out-of-bounds in __frontswap_store+0xc9/0x2e0
>>> Read of size 8 at addr 88901f646f18 by task stress-ng-mrema/31329
>>> CPU: 2 PID: 31329 Comm: stress-ng-mrema Tainted: G SI L 5.12.0-smp-DEV #2
>>> Call Trace:
>>>  dump_stack+0xff/0x165
>>>  print_address_description+0x81/0x390
>>>  __kasan_report+0x154/0x1b0
>>>  ? __frontswap_store+0xc9/0x2e0
>>>  ? __frontswap_store+0xc9/0x2e0
>>>  kasan_report+0x47/0x60
>>>  kasan_check_range+0x2f3/0x340
>>>  __kasan_check_read+0x11/0x20
>>>  __frontswap_store+0xc9/0x2e0
>>>  swap_writepage+0x52/0x80
>>>  pageout+0x489/0x7f0
>>>  shrink_page_list+0x1b11/0x2c90
>>>  reclaim_pages+0x6ca/0x930
>>>  madvise_cold_or_pageout_pte_range+0x1260/0x13a0
>>>
>>> Allocated by task 16813:
>>>  kasan_kmalloc+0xb0/0xe0
>>>
Re: [Question] Is there a race window between swapoff vs synchronous swap_readpage
On 2021/3/30 11:44, Huang, Ying wrote:
> Miaohe Lin writes:
>
>> On 2021/3/30 9:57, Huang, Ying wrote:
>>> Hi, Miaohe,
>>>
>>> Miaohe Lin writes:
>>>
>>>> Hi all,
>>>> I am investigating the swap code, and I found the possible race window
>>>> below:
>>>>
>>>> CPU 1                                   CPU 2
>>>> -----                                   -----
>>>> do_swap_page
>>>>   skip swapcache case (synchronous swap_readpage)
>>>>     alloc_page_vma
>>>>                                         swapoff
>>>>                                           release swap_file, bdev, or ...
>>>>       swap_readpage
>>>>         check sis->flags is ok
>>>>           access swap_file, bdev or ...[oops!]
>>>>                                             si->flags = 0
>>>>
>>>> The swapcache case is OK because swapoff will wait on the page_lock of
>>>> the swapcache page.
>>>> Can this really happen, or am I missing something?
>>>> Any reply would be much appreciated. Thanks! :)
>>>
>>> This appears possible. Even for the swapcache case, we can't guarantee that the
>>
>> Many thanks for the reply!
>>
>>> swap entry gotten from the page table is always valid either. The
>>
>> The page table may change at any time, and we may thus do some useless
>> work. But the pte_same() check could handle these races correctly if
>> they do not result in an oops.
>>
>>> underlying swap device can be swapped off at the same time. So we use
>>> get/put_swap_device() for that. Maybe we need something similar here.
>>
>> Using get/put_swap_device() to guard against swapoff for swap_readpage()
>> sounds really bad, as swap_readpage() may take a really long time. Also,
>> such a race may not be really harmful because swapoff is usually done
>> only at system shutdown. I cannot figure out a simple and stable fix for
>> this. Any suggestions, or could anyone help get rid of this race?
>
> Some reference counting on the swap device can prevent the swap device
> from being swapped off. To reduce the performance overhead on the hot
> path as much as possible, it appears we can use percpu_ref.
>

Sounds like a good idea. Many thanks for your suggestion. :)

> Best Regards,
> Huang, Ying
>
Re: [Question] Is there a race window between swapoff vs synchronous swap_readpage
Yu Zhao writes:

> On Mon, Mar 29, 2021 at 9:44 PM Huang, Ying wrote:
>>
>> Miaohe Lin writes:
>>
>> > On 2021/3/30 9:57, Huang, Ying wrote:
>> >> Hi, Miaohe,
>> >>
>> >> Miaohe Lin writes:
>> >>
>> >>> Hi all,
>> >>> I am investigating the swap code, and I found the possible race window
>> >>> below:
>> >>>
>> >>> CPU 1                                   CPU 2
>> >>> -----                                   -----
>> >>> do_swap_page
>> >>>   skip swapcache case (synchronous swap_readpage)
>> >>>     alloc_page_vma
>> >>>                                         swapoff
>> >>>                                           release swap_file, bdev, or ...
>> >>>       swap_readpage
>> >>>         check sis->flags is ok
>> >>>           access swap_file, bdev or ...[oops!]
>> >>>                                             si->flags = 0
>> >>>
>> >>> The swapcache case is OK because swapoff will wait on the page_lock of
>> >>> the swapcache page.
>> >>> Can this really happen, or am I missing something?
>> >>> Any reply would be much appreciated. Thanks! :)
>> >>
>> >> This appears possible. Even for the swapcache case, we can't guarantee that the
>> >
>> > Many thanks for the reply!
>> >
>> >> swap entry gotten from the page table is always valid either. The
>> >
>> > The page table may change at any time, and we may thus do some useless
>> > work. But the pte_same() check could handle these races correctly if
>> > they do not result in an oops.
>> >
>> >> underlying swap device can be swapped off at the same time. So we use
>> >> get/put_swap_device() for that. Maybe we need something similar here.
>> >
>> > Using get/put_swap_device() to guard against swapoff for swap_readpage()
>> > sounds really bad, as swap_readpage() may take a really long time. Also,
>> > such a race may not be really harmful because swapoff is usually done
>> > only at system shutdown. I cannot figure out a simple and stable fix for
>> > this. Any suggestions, or could anyone help get rid of this race?
>>
>> Some reference counting on the swap device can prevent the swap device
>> from being swapped off. To reduce the performance overhead on the hot
>> path as much as possible, it appears we can use percpu_ref.
>
> Hi,
>
> I've been seeing crashes when testing the latest kernels with
>   stress-ng --class vm -a 20 -t 600s --temp-path /tmp
>
> I haven't had time to look into them yet:
>
> DEBUG_VM:
>   BUG: unable to handle page fault for address: 905c33c9a000
>   Call Trace:
>    get_swap_pages+0x278/0x590
>    get_swap_page+0x1ab/0x280
>    add_to_swap+0x7d/0x130
>    shrink_page_list+0xf84/0x25f0
>    reclaim_pages+0x313/0x430
>    madvise_cold_or_pageout_pte_range+0x95c/0xaa0

If my understanding is correct, two bugs are reported? One above and
one below? If so, and the one above was reported first, can you share
the full bug message reported in dmesg?

Can you convert the call trace to source lines? And the commit of the
kernel? Or the full kconfig? So I can build it myself.

Best Regards,
Huang, Ying

> KASAN:
> ==
> BUG: KASAN: slab-out-of-bounds in __frontswap_store+0xc9/0x2e0
> Read of size 8 at addr 88901f646f18 by task stress-ng-mrema/31329
> CPU: 2 PID: 31329 Comm: stress-ng-mrema Tainted: G SI L 5.12.0-smp-DEV #2
> Call Trace:
>  dump_stack+0xff/0x165
>  print_address_description+0x81/0x390
>  __kasan_report+0x154/0x1b0
>  ? __frontswap_store+0xc9/0x2e0
>  ? __frontswap_store+0xc9/0x2e0
>  kasan_report+0x47/0x60
>  kasan_check_range+0x2f3/0x340
>  __kasan_check_read+0x11/0x20
>  __frontswap_store+0xc9/0x2e0
>  swap_writepage+0x52/0x80
>  pageout+0x489/0x7f0
>  shrink_page_list+0x1b11/0x2c90
>  reclaim_pages+0x6ca/0x930
>  madvise_cold_or_pageout_pte_range+0x1260/0x13a0
>
> Allocated by task 16813:
>  kasan_kmalloc+0xb0/0xe0
>  __kasan_kmalloc+0x9/0x10
>  __kmalloc_node+0x52/0x70
>  kvmalloc_node+0x50/0x90
>  __se_sys_swapon+0x353a/0x4860
>  __x64_sys_swapon+0x5b/0x70
>
> The buggy address belongs to the object at 88901f64
>  which belongs to the cache kmalloc-32k of size 32768
> The buggy address is located 28440 bytes inside of
>  32768-byte region [88901f64, 88901f648000)
> The buggy address belongs to the page:
> page:32d23e33 refcount:1 mapcount:0 mapping: index:0x0 pfn:0x101f640
> head:32d23e33 order:4 compound_mapcount:0 compound_pincount:0
> flags: 0x4010200(slab|head)
> raw: 04010200 ea00062b8408 ea000a6e9008 888100040300
> raw: 88901f64 00010001 000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>  88901f646e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  88901f646e80: 00 00 00
Re: [Question] Is there a race window between swapoff vs synchronous swap_readpage
On Mon, Mar 29, 2021 at 9:44 PM Huang, Ying wrote:
>
> Miaohe Lin writes:
>
> > On 2021/3/30 9:57, Huang, Ying wrote:
> >> Hi, Miaohe,
> >>
> >> Miaohe Lin writes:
> >>
> >>> Hi all,
> >>> I am investigating the swap code, and I found the possible race window
> >>> below:
> >>>
> >>> CPU 1                                   CPU 2
> >>> -----                                   -----
> >>> do_swap_page
> >>>   skip swapcache case (synchronous swap_readpage)
> >>>     alloc_page_vma
> >>>                                         swapoff
> >>>                                           release swap_file, bdev, or ...
> >>>       swap_readpage
> >>>         check sis->flags is ok
> >>>           access swap_file, bdev or ...[oops!]
> >>>                                             si->flags = 0
> >>>
> >>> The swapcache case is OK because swapoff will wait on the page_lock of
> >>> the swapcache page.
> >>> Can this really happen, or am I missing something?
> >>> Any reply would be much appreciated. Thanks! :)
> >>
> >> This appears possible. Even for the swapcache case, we can't guarantee that the
> >
> > Many thanks for the reply!
> >
> >> swap entry gotten from the page table is always valid either. The
> >
> > The page table may change at any time, and we may thus do some useless
> > work. But the pte_same() check could handle these races correctly if
> > they do not result in an oops.
> >
> >> underlying swap device can be swapped off at the same time. So we use
> >> get/put_swap_device() for that. Maybe we need something similar here.
> >
> > Using get/put_swap_device() to guard against swapoff for swap_readpage()
> > sounds really bad, as swap_readpage() may take a really long time. Also,
> > such a race may not be really harmful because swapoff is usually done
> > only at system shutdown. I cannot figure out a simple and stable fix for
> > this. Any suggestions, or could anyone help get rid of this race?
>
> Some reference counting on the swap device can prevent the swap device
> from being swapped off. To reduce the performance overhead on the hot
> path as much as possible, it appears we can use percpu_ref.

Hi,

I've been seeing crashes when testing the latest kernels with
  stress-ng --class vm -a 20 -t 600s --temp-path /tmp

I haven't had time to look into them yet:

DEBUG_VM:
  BUG: unable to handle page fault for address: 905c33c9a000
  Call Trace:
   get_swap_pages+0x278/0x590
   get_swap_page+0x1ab/0x280
   add_to_swap+0x7d/0x130
   shrink_page_list+0xf84/0x25f0
   reclaim_pages+0x313/0x430
   madvise_cold_or_pageout_pte_range+0x95c/0xaa0

KASAN:
==
BUG: KASAN: slab-out-of-bounds in __frontswap_store+0xc9/0x2e0
Read of size 8 at addr 88901f646f18 by task stress-ng-mrema/31329
CPU: 2 PID: 31329 Comm: stress-ng-mrema Tainted: G SI L 5.12.0-smp-DEV #2
Call Trace:
 dump_stack+0xff/0x165
 print_address_description+0x81/0x390
 __kasan_report+0x154/0x1b0
 ? __frontswap_store+0xc9/0x2e0
 ? __frontswap_store+0xc9/0x2e0
 kasan_report+0x47/0x60
 kasan_check_range+0x2f3/0x340
 __kasan_check_read+0x11/0x20
 __frontswap_store+0xc9/0x2e0
 swap_writepage+0x52/0x80
 pageout+0x489/0x7f0
 shrink_page_list+0x1b11/0x2c90
 reclaim_pages+0x6ca/0x930
 madvise_cold_or_pageout_pte_range+0x1260/0x13a0

Allocated by task 16813:
 kasan_kmalloc+0xb0/0xe0
 __kasan_kmalloc+0x9/0x10
 __kmalloc_node+0x52/0x70
 kvmalloc_node+0x50/0x90
 __se_sys_swapon+0x353a/0x4860
 __x64_sys_swapon+0x5b/0x70

The buggy address belongs to the object at 88901f64
 which belongs to the cache kmalloc-32k of size 32768
The buggy address is located 28440 bytes inside of
 32768-byte region [88901f64, 88901f648000)
The buggy address belongs to the page:
page:32d23e33 refcount:1 mapcount:0 mapping: index:0x0 pfn:0x101f640
head:32d23e33 order:4 compound_mapcount:0 compound_pincount:0
flags: 0x4010200(slab|head)
raw: 04010200 ea00062b8408 ea000a6e9008 888100040300
raw: 88901f64 00010001 000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 88901f646e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 88901f646e80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>88901f646f00: 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc
                        ^
 88901f646f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 88901f647000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==

Relevant config options I could think of:

CONFIG_MEMCG_SWAP=y
CONFIG_THP_SWAP=y
CONFIG_ZSWAP=y
Re: [Question] Is there a race window between swapoff vs synchronous swap_readpage
Miaohe Lin writes:

> On 2021/3/30 9:57, Huang, Ying wrote:
>> Hi, Miaohe,
>>
>> Miaohe Lin writes:
>>
>>> Hi all,
>>> I am investigating the swap code, and I found the possible race window
>>> below:
>>>
>>> CPU 1                                   CPU 2
>>> -----                                   -----
>>> do_swap_page
>>>   skip swapcache case (synchronous swap_readpage)
>>>     alloc_page_vma
>>>                                         swapoff
>>>                                           release swap_file, bdev, or ...
>>>       swap_readpage
>>>         check sis->flags is ok
>>>           access swap_file, bdev or ...[oops!]
>>>                                             si->flags = 0
>>>
>>> The swapcache case is OK because swapoff will wait on the page_lock of
>>> the swapcache page.
>>> Can this really happen, or am I missing something?
>>> Any reply would be much appreciated. Thanks! :)
>>
>> This appears possible. Even for the swapcache case, we can't guarantee that the
>
> Many thanks for the reply!
>
>> swap entry gotten from the page table is always valid either. The
>
> The page table may change at any time, and we may thus do some useless
> work. But the pte_same() check could handle these races correctly if
> they do not result in an oops.
>
>> underlying swap device can be swapped off at the same time. So we use
>> get/put_swap_device() for that. Maybe we need something similar here.
>
> Using get/put_swap_device() to guard against swapoff for swap_readpage()
> sounds really bad, as swap_readpage() may take a really long time. Also,
> such a race may not be really harmful because swapoff is usually done
> only at system shutdown. I cannot figure out a simple and stable fix for
> this. Any suggestions, or could anyone help get rid of this race?

Some reference counting on the swap device can prevent the swap device
from being swapped off. To reduce the performance overhead on the hot
path as much as possible, it appears we can use percpu_ref.

Best Regards,
Huang, Ying
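The reference-counting idea in this message can be sketched in user space. What follows is a minimal model under stated assumptions, not the kernel implementation: it uses one shared C11 atomic counter rather than the per-CPU counters that make percpu_ref cheap on the hot path, and every name prefixed with model_ is hypothetical. In the kernel, the corresponding percpu_ref operations would be percpu_ref_tryget_live(), percpu_ref_put() and percpu_ref_kill(). The point of the sketch is the ordering: swapoff marks the device as dying first, then waits for the reader count to drain before releasing anything.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a guard on one swap device. */
struct model_swap_ref {
    atomic_int users;   /* readers currently inside the guarded section */
    atomic_bool dying;  /* set by swapoff before it releases anything */
};

void model_ref_init(struct model_swap_ref *ref)
{
    atomic_init(&ref->users, 0);
    atomic_init(&ref->dying, false);
}

/*
 * Reader side: announce ourselves first, then check for swapoff.
 * With sequentially consistent atomics, either swapoff sees our
 * increment and waits, or we see "dying" and back out.
 */
bool model_get_swap_device(struct model_swap_ref *ref)
{
    atomic_fetch_add(&ref->users, 1);
    if (atomic_load(&ref->dying)) {
        atomic_fetch_sub(&ref->users, 1);
        return false;   /* device is going away; do not touch it */
    }
    return true;
}

void model_put_swap_device(struct model_swap_ref *ref)
{
    int prev = atomic_fetch_sub(&ref->users, 1);
    assert(prev > 0);   /* every put must pair with a successful get */
    (void)prev;
}

/* swapoff, step 1: refuse new readers. */
void model_kill(struct model_swap_ref *ref)
{
    atomic_store(&ref->dying, true);
}

/* swapoff, step 2: resources may be released only once this is true. */
bool model_drained(struct model_swap_ref *ref)
{
    return atomic_load(&ref->users) == 0;
}
```

A reader would wrap the whole synchronous read in a model_get_swap_device()/model_put_swap_device() pair, and swapoff would call model_kill() and then wait until model_drained() returns true. A long-running read delays swapoff but is never left holding a released swap_file or bdev.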
Re: [Question] Is there a race window between swapoff vs synchronous swap_readpage
On 2021/3/30 9:57, Huang, Ying wrote:
> Hi, Miaohe,
>
> Miaohe Lin writes:
>
>> Hi all,
>> I am investigating the swap code, and I found the possible race window
>> below:
>>
>> CPU 1                                   CPU 2
>> -----                                   -----
>> do_swap_page
>>   skip swapcache case (synchronous swap_readpage)
>>     alloc_page_vma
>>                                         swapoff
>>                                           release swap_file, bdev, or ...
>>       swap_readpage
>>         check sis->flags is ok
>>           access swap_file, bdev or ...[oops!]
>>                                             si->flags = 0
>>
>> The swapcache case is OK because swapoff will wait on the page_lock of
>> the swapcache page.
>> Can this really happen, or am I missing something?
>> Any reply would be much appreciated. Thanks! :)
>
> This appears possible. Even for the swapcache case, we can't guarantee that the

Many thanks for the reply!

> swap entry gotten from the page table is always valid either. The

The page table may change at any time, and we may thus do some useless
work. But the pte_same() check could handle these races correctly if
they do not result in an oops.

> underlying swap device can be swapped off at the same time. So we use
> get/put_swap_device() for that. Maybe we need something similar here.

Using get/put_swap_device() to guard against swapoff for swap_readpage()
sounds really bad, as swap_readpage() may take a really long time. Also,
such a race may not be really harmful because swapoff is usually done
only at system shutdown. I cannot figure out a simple and stable fix for
this. Any suggestions, or could anyone help get rid of this race?

Anyway, thanks again!

>
> Best Regards,
> Huang, Ying
>
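The pte_same() argument in this message follows a snapshot-and-recheck pattern, which can be illustrated with a small user-space sketch. It is a toy model under stated assumptions: model_pte_t, model_pte_same() and model_finish_fault() are hypothetical stand-ins, and the locking that the kernel performs around the recheck is only indicated in a comment. It shows why stale work is merely wasted: the result is installed only if the page-table slot still holds the value seen at the start of the fault. As the message notes, this only helps if nothing crashed during the unprotected work in between.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t model_pte_t;   /* hypothetical stand-in for pte_t */

/* Stand-in for pte_same(): two entries are equal if their bits match. */
static bool model_pte_same(model_pte_t a, model_pte_t b)
{
    return a == b;
}

enum model_fault_result {
    MODEL_FAULT_INSTALLED,  /* slot unchanged, new entry written */
    MODEL_FAULT_RETRY       /* slot changed under us, work discarded */
};

/*
 * slot:    the live page-table slot
 * orig:    snapshot of the slot taken when the fault started
 * new_pte: entry to install after the (slow) page read finished
 */
enum model_fault_result model_finish_fault(model_pte_t *slot,
                                           model_pte_t orig,
                                           model_pte_t new_pte)
{
    /* In the kernel this recheck happens under the page table lock. */
    if (!model_pte_same(*slot, orig))
        return MODEL_FAULT_RETRY;   /* useless work, but harmless */

    *slot = new_pte;
    return MODEL_FAULT_INSTALLED;
}
```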
Re: [Question] Is there a race window between swapoff vs synchronous swap_readpage
Hi, Miaohe,

Miaohe Lin writes:

> Hi all,
> I am investigating the swap code, and I found the possible race window
> below:
>
> CPU 1                                   CPU 2
> -----                                   -----
> do_swap_page
>   skip swapcache case (synchronous swap_readpage)
>     alloc_page_vma
>                                         swapoff
>                                           release swap_file, bdev, or ...
>       swap_readpage
>         check sis->flags is ok
>           access swap_file, bdev or ...[oops!]
>                                             si->flags = 0
>
> The swapcache case is OK because swapoff will wait on the page_lock of
> the swapcache page.
> Can this really happen, or am I missing something?
> Any reply would be much appreciated. Thanks! :)

This appears possible. Even for the swapcache case, we can't guarantee
that the swap entry gotten from the page table is always valid either.
The underlying swap device can be swapped off at the same time. So we
use get/put_swap_device() for that. Maybe we need something similar
here.

Best Regards,
Huang, Ying
[Question] Is there a race window between swapoff vs synchronous swap_readpage
Hi all,
I am investigating the swap code, and I found the possible race window
below:

CPU 1                                   CPU 2
-----                                   -----
do_swap_page
  skip swapcache case (synchronous swap_readpage)
    alloc_page_vma
                                        swapoff
                                          release swap_file, bdev, or ...
      swap_readpage
        check sis->flags is ok
          access swap_file, bdev or ...[oops!]
                                            si->flags = 0

The swapcache case is OK because swapoff will wait on the page_lock of
the swapcache page.
Can this really happen, or am I missing something?
Any reply would be much appreciated. Thanks! :)
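The interleaving described in this message can be replayed deterministically in user space. The sketch below is a toy model under stated assumptions, not kernel code: struct model_swap_info, MODEL_SWP_VALID and the model_* helpers are hypothetical stand-ins for swap_info_struct and the steps in the diagram. It runs the two CPUs' steps in the order shown and reports whether the reader passed the flags check yet would go on to touch a resource that swapoff had already released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SWP_VALID 0x1u    /* hypothetical flag bit, not the kernel's value */

/* Toy stand-in for the kernel's swap_info_struct; only what the race needs. */
struct model_swap_info {
    unsigned int flags;
    const char *swap_file;      /* NULL once swapoff has released it */
};

/* CPU 1: the "check sis->flags is ok" step. */
static bool model_flags_ok(const struct model_swap_info *si)
{
    return (si->flags & MODEL_SWP_VALID) != 0;
}

/* CPU 2: swapoff releases the backing resources first ... */
static void model_swapoff_release(struct model_swap_info *si)
{
    si->swap_file = NULL;
}

/* ... and only clears the flags at the very end. */
static void model_swapoff_clear_flags(struct model_swap_info *si)
{
    si->flags = 0;
}

/*
 * Replay the interleaving from the diagram in program order.
 * Returns true when CPU 1 passed the flags check but would then
 * touch a resource that CPU 2 had already released.
 */
bool model_replay_race(void)
{
    struct model_swap_info si = { MODEL_SWP_VALID, "swapfile" };

    model_swapoff_release(&si);                 /* CPU 2 */
    bool check_passed = model_flags_ok(&si);    /* CPU 1 */
    bool would_oops = (si.swap_file == NULL);   /* CPU 1: access step */
    model_swapoff_clear_flags(&si);             /* CPU 2 */

    assert(si.flags == 0);
    return check_passed && would_oops;
}
```

In this model the window exists because the flags are cleared only after the resources are gone, so a flags check alone cannot tell the reader that the device is already being torn down.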