Re: Very slow unlockall()
On Wed, 10 Feb 2021, Michal Hocko wrote:
> On Wed 10-02-21 17:57:29, Michal Hocko wrote:
> > On Wed 10-02-21 16:18:50, Vlastimil Babka wrote:
> [...]
> > > And the munlock (munlock_vma_pages_range()) is slow, because it uses
> > > follow_page_mask() in a loop incrementing addresses by PAGE_SIZE, so
> > > that's always traversing all levels of page tables from scratch.
> > > Funnily enough, speeding this up was my first linux-mm series years
> > > ago. But the speedup only works if pte's are present, which is not
> > > the case for unpopulated PROT_NONE areas. That use case was
> > > unexpected back then. We should probably convert this code to a
> > > proper page table walk. If there are large areas with unpopulated
> > > pmd entries (or even higher levels) we would traverse them very
> > > quickly.
> >
> > Yes, this is a good idea. I suspect it will be little bit tricky without
> > duplicating a large part of gup page table walker.
>
> Thinking about it some more, unmap_page_range would be a better model
> for this operation.

Could do, I suppose; but I thought it was just a matter of going back to
using follow_page_mask() in munlock_vma_pages_range() (whose fear of THP
split looks overwrought, since an extra reference now prevents splitting);
and enhancing follow_page_mask() to let the no_page_table() FOLL_DUMP case
set ctx->page_mask appropriately (or perhaps it can be preset at a higher
level, without having to pass ctx so far down, dunno).

Nice little job, but I couldn't quite spare the time to do it: needs a bit
more care than I could afford (I suspect the page_increm business at the
end of munlock_vma_pages_range() is good enough while THP tails are
skipped one by one, but will need to be fixed to apply page_mask correctly
to the start - __get_user_pages()'s page_increm-entation looks superior).

Hugh
Re: Very slow unlockall()
On Wed 10-02-21 17:57:29, Michal Hocko wrote:
> On Wed 10-02-21 16:18:50, Vlastimil Babka wrote:
[...]
> > And the munlock (munlock_vma_pages_range()) is slow, because it uses
> > follow_page_mask() in a loop incrementing addresses by PAGE_SIZE, so
> > that's always traversing all levels of page tables from scratch.
> > Funnily enough, speeding this up was my first linux-mm series years
> > ago. But the speedup only works if pte's are present, which is not the
> > case for unpopulated PROT_NONE areas. That use case was unexpected
> > back then. We should probably convert this code to a proper page table
> > walk. If there are large areas with unpopulated pmd entries (or even
> > higher levels) we would traverse them very quickly.
>
> Yes, this is a good idea. I suspect it will be little bit tricky without
> duplicating a large part of gup page table walker.

Thinking about it some more, unmap_page_range would be a better model
for this operation.
--
Michal Hocko
SUSE Labs
Re: Very slow unlockall()
On Wed 10-02-21 16:18:50, Vlastimil Babka wrote:
> On 2/1/21 8:19 PM, Milan Broz wrote:
> > On 01/02/2021 19:55, Vlastimil Babka wrote:
> > > On 2/1/21 7:00 PM, Milan Broz wrote:
> > > > On 01/02/2021 14:08, Vlastimil Babka wrote:
> > > > > On 1/8/21 3:39 PM, Milan Broz wrote:
> > > > > > On 08/01/2021 14:41, Michal Hocko wrote:
> > > > > > > On Wed 06-01-21 16:20:15, Milan Broz wrote:
[...]
> > > So, this is 2TB memory area, but PROT_NONE means it's never actually
> > > populated, although mlockall(MCL_CURRENT) should do that. Once you
> > > put PROT_READ | PROT_WRITE there, the mlockall() starts taking ages.
> > >
> > > So does that reflect your use case? munlockall() with large
> > > PROT_NONE areas? If so, munlock_vma_pages_range() is indeed not
> > > optimized for that, but I would expect such scenario to be uncommon,
> > > so better clarify first.
> >
> > It is just a simple reproducer of the underlying problem, as suggested here
> > https://gitlab.com/cryptsetup/cryptsetup/-/issues/617#note_478342301
> >
> > We use mlockall() in cryptsetup and with hardened malloc it slows down
> > unlock significantly.
> > (For the real case problem please read the whole issue report above.)
>
> OK, finally read through the bug report, and learned two things:
>
> 1) the PROT_NONE is indeed intentional part of the reproducer
> 2) Linux mailing lists still have a bad reputation and people avoid
>    them. That's sad :( Well, thanks for overcoming that :)
>
> Daniel there says "I think the Linux kernel implementation of mlockall
> is quite broken and tries to lock all the reserved PROT_NONE regions in
> advance which doesn't make any sense."
>
> From my testing this doesn't seem to be the case, as the mlockall() part
> is very fast, so I don't think it faults in and mlocks PROT_NONE areas.
> It only starts to be slow when changed to PROT_READ|PROT_WRITE. But the
> munlockall() part is slow even with PROT_NONE as we don't skip the
> PROT_NONE areas there. We probably can't just skip them, as they might
> actually contain mlocked pages if those were faulted first with
> PROT_READ/PROT_WRITE and only then changed to PROT_NONE.

Mlock code is quite easy to misunderstand but IIRC the mlock part should
be rather straightforward. It will mark VMAs as locked, do some
merging/splitting where appropriate and finally populate the range by
gup. This should fail because VMA doesn't allow neither read nor write,
right? And mlock should report that. mlockall will not bother because it
will ignore errors on population. So there is no page table walk
happening.

> And the munlock (munlock_vma_pages_range()) is slow, because it uses
> follow_page_mask() in a loop incrementing addresses by PAGE_SIZE, so
> that's always traversing all levels of page tables from scratch. Funnily
> enough, speeding this up was my first linux-mm series years ago. But the
> speedup only works if pte's are present, which is not the case for
> unpopulated PROT_NONE areas. That use case was unexpected back then. We
> should probably convert this code to a proper page table walk. If there
> are large areas with unpopulated pmd entries (or even higher levels) we
> would traverse them very quickly.

Yes, this is a good idea. I suspect it will be little bit tricky without
duplicating a large part of gup page table walker.
--
Michal Hocko
SUSE Labs
Re: Very slow unlockall()
On 2/1/21 8:19 PM, Milan Broz wrote:
> On 01/02/2021 19:55, Vlastimil Babka wrote:
>> On 2/1/21 7:00 PM, Milan Broz wrote:
>>> On 01/02/2021 14:08, Vlastimil Babka wrote:
>>>> On 1/8/21 3:39 PM, Milan Broz wrote:
>>>>> On 08/01/2021 14:41, Michal Hocko wrote:
>>>>>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
[...]
>> So, this is 2TB memory area, but PROT_NONE means it's never actually
>> populated, although mlockall(MCL_CURRENT) should do that. Once you put
>> PROT_READ | PROT_WRITE there, the mlockall() starts taking ages.
>>
>> So does that reflect your use case? munlockall() with large PROT_NONE
>> areas? If so, munlock_vma_pages_range() is indeed not optimized for
>> that, but I would expect such scenario to be uncommon, so better
>> clarify first.
>
> It is just a simple reproducer of the underlying problem, as suggested here
> https://gitlab.com/cryptsetup/cryptsetup/-/issues/617#note_478342301
>
> We use mlockall() in cryptsetup and with hardened malloc it slows down
> unlock significantly.
> (For the real case problem please read the whole issue report above.)

OK, finally read through the bug report, and learned two things:

1) the PROT_NONE is indeed intentional part of the reproducer
2) Linux mailing lists still have a bad reputation and people avoid them.
   That's sad :( Well, thanks for overcoming that :)

Daniel there says "I think the Linux kernel implementation of mlockall is
quite broken and tries to lock all the reserved PROT_NONE regions in
advance which doesn't make any sense."

From my testing this doesn't seem to be the case, as the mlockall() part
is very fast, so I don't think it faults in and mlocks PROT_NONE areas.
It only starts to be slow when changed to PROT_READ|PROT_WRITE. But the
munlockall() part is slow even with PROT_NONE as we don't skip the
PROT_NONE areas there. We probably can't just skip them, as they might
actually contain mlocked pages if those were faulted first with
PROT_READ/PROT_WRITE and only then changed to PROT_NONE.

And the munlock (munlock_vma_pages_range()) is slow, because it uses
follow_page_mask() in a loop incrementing addresses by PAGE_SIZE, so
that's always traversing all levels of page tables from scratch. Funnily
enough, speeding this up was my first linux-mm series years ago. But the
speedup only works if pte's are present, which is not the case for
unpopulated PROT_NONE areas. That use case was unexpected back then. We
should probably convert this code to a proper page table walk. If there
are large areas with unpopulated pmd entries (or even higher levels) we
would traverse them very quickly.
Re: Very slow unlockall()
On 01/02/2021 19:55, Vlastimil Babka wrote:
> On 2/1/21 7:00 PM, Milan Broz wrote:
>> On 01/02/2021 14:08, Vlastimil Babka wrote:
>>> On 1/8/21 3:39 PM, Milan Broz wrote:
>>>> On 08/01/2021 14:41, Michal Hocko wrote:
>>>>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
[...]
>>>>>>         void *p  = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>
> So, this is 2TB memory area, but PROT_NONE means it's never actually
> populated, although mlockall(MCL_CURRENT) should do that. Once you put
> PROT_READ | PROT_WRITE there, the mlockall() starts taking ages.
>
> So does that reflect your use case? munlockall() with large PROT_NONE
> areas? If so, munlock_vma_pages_range() is indeed not optimized for
> that, but I would expect such scenario to be uncommon, so better
> clarify first.

It is just a simple reproducer of the underlying problem, as suggested here
https://gitlab.com/cryptsetup/cryptsetup/-/issues/617#note_478342301

We use mlockall() in cryptsetup and with hardened malloc it slows down
unlock significantly.
(For the real case problem please read the whole issue report above.)

m.
Re: Very slow unlockall()
On 2/1/21 7:00 PM, Milan Broz wrote:
> On 01/02/2021 14:08, Vlastimil Babka wrote:
>> On 1/8/21 3:39 PM, Milan Broz wrote:
>>> On 08/01/2021 14:41, Michal Hocko wrote:
>>>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
[...]
>>>>>         void *p  = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

So, this is 2TB memory area, but PROT_NONE means it's never actually
populated, although mlockall(MCL_CURRENT) should do that. Once you put
PROT_READ | PROT_WRITE there, the mlockall() starts taking ages.

So does that reflect your use case? munlockall() with large PROT_NONE
areas? If so, munlock_vma_pages_range() is indeed not optimized for that,
but I would expect such scenario to be uncommon, so better clarify first.

[...]
Re: Very slow unlockall()
On 01/02/2021 14:08, Vlastimil Babka wrote:
> On 1/8/21 3:39 PM, Milan Broz wrote:
>> On 08/01/2021 14:41, Michal Hocko wrote:
>>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
[...]
>> Today's Linus git - 5.11.0-rc2+ in my testing x86_64 VM (no extensive
>> kernel debug options):
>>
>> # time ./lock
>> locked
>> unlocked
>>
>> real    0m4.172s
>> user    0m0.000s
>> sys     0m4.172s
>
> The perf report would be more interesting from this configuration.

ok, I cannot run perf on that particular VM but tried the latest Fedora
stable kernel without debug options - 5.10.12-200.fc33.x86_64

This is the report running reproducer above:

time:
real    0m6.123s
user    0m0.099s
sys     0m5.310s

perf:

# Total Lost Samples: 0
#
# Samples: 20K of event 'cycles'
# Event count (approx.): 20397603279
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ............................
#
    47.26%  lock     [kernel.kallsyms]  [k] follow_page_mask
    20.43%  lock     [kernel.kallsyms]  [k] munlock_vma_pages_range
    15.92%  lock     [kernel.kallsyms]  [k] follow_page
     7.40%  lock     [kernel.kallsyms]  [k] rcu_all_qs
     5.87%  lock     [kernel.kallsyms]  [k] _cond_resched
     3.08%  lock     [kernel.kallsyms]  [k] follow_huge_addr
     0.01%  lock     [kernel.kallsyms]  [k] __update_load_avg_cfs_rq
     0.01%  lock     [kernel.kallsyms]  [k] fput
     0.01%  lock     [kernel.kallsyms]  [k] rmap_walk_file
     0.00%  lock     [kernel.kallsyms]  [k] page_mapped
     0.00%  lock     [kernel.kallsyms]  [k] native_irq_return_iret
     0.00%  lock     [kernel.kallsyms]  [k] _raw_spin_lock_irq
     0.00%  lock     [kernel.kallsyms]  [k] perf_iterate_ctx
     0.00%  lock     [kernel.kallsyms]  [k] finish_task_switch
     0.00%  perf     [kernel.kallsyms]  [k] native_sched_clock
     0.00%  lock     [kernel.kallsyms]  [k] native_write_msr
     0.00%  perf     [kernel.kallsyms]  [k] native_write_msr

m.
Re: Very slow unlockall()
On 1/8/21 3:39 PM, Milan Broz wrote:
> On 08/01/2021 14:41, Michal Hocko wrote:
>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
[...]
>>> Or with perf, I see
>>>
>>> # Overhead  Command  Shared Object      Symbol
>>> # ........  .......  .................  ............................
>>> #
>>>     48.18%  lock     [kernel.kallsyms]  [k] lock_is_held_type
>>>     11.67%  lock     [kernel.kallsyms]  [k] ___might_sleep
>>>     10.65%  lock     [kernel.kallsyms]  [k] follow_page_mask
>>>      9.17%  lock     [kernel.kallsyms]  [k] debug_lockdep_rcu_enabled
>>>      6.73%  lock     [kernel.kallsyms]  [k] munlock_vma_pages_range
>>> ...

This seems to be from the debug kernel, as there's lockdep enabled.
That's expected to be very slow.

>>> Could please anyone check what's wrong here with the memory locking
>>> code? Running it on my notebook I can effectively DoS the system :)
>>>
>>> Original report is https://gitlab.com/cryptsetup/cryptsetup/-/issues/617
>>> but this is apparently a kernel issue, just amplified by usage of
>>> munlockall().
>>
>> Which kernel version do you see this with? Have older releases worked
>> better?
>
> Hi,
>
> I tried 5.10 stable and randomly few kernels I have built on testing VM
> (5.3 was the oldest), it seems to be very similar run time, so the
> problem is apparently old... (I can test some specific kernel version if
> it make any sense).
>
> For mainline (reproducer above):
>
> With 5.11.0-0.rc2.20210106git36bbbd0e234d.117.fc34.x86_64 (latest Fedora
> rawhide kernel build - many debug options are on)

From that, the amount of debugging seems to be rather excessive in the
Fedora rawhide kernel. Is that a special debug flavour?

> # time ./lock
> locked
> unlocked
>
> real    0m32.287s
> user    0m0.001s
> sys     0m32.126s
>
> Today's Linus git - 5.11.0-rc2+ in my testing x86_64 VM (no extensive
> kernel debug options):
>
> # time ./lock
> locked
> unlocked
>
> real    0m4.172s
> user    0m0.000s
> sys     0m4.172s

The perf report would be more interesting from this configuration.

> m.
Re: Very slow unlockall()
On 08/01/2021 15:39, Milan Broz wrote:
> On 08/01/2021 14:41, Michal Hocko wrote:
>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
[...]
>> Which kernel version do you see this with? Have older releases worked
>> better?
>
> I tried 5.10 stable and randomly few kernels I have built on testing VM
> (5.3 was the oldest), it seems to be very similar run time, so the
> problem is apparently old... (I can test some specific kernel version if
> it make any sense).
>
> For mainline (reproducer above):
>
> With 5.11.0-0.rc2.20210106git36bbbd0e234d.117.fc34.x86_64 (latest Fedora
> rawhide kernel build - many debug options are on)
>
> # time ./lock
> locked
> unlocked
>
> real    0m32.287s
> user    0m0.001s
> sys     0m32.126s
>
> Today's Linus git - 5.11.0-rc2+ in my testing x86_64 VM (no extensive
> kernel debug options):
>
> # time ./lock
> locked
> unlocked
>
> real    0m4.172s
> user    0m0.000s
> sys     0m4.172s

Hi,

so because there is no response, is this expected behavior of memory
management subsystem then?

Thanks,
Milan
Re: Very slow unlockall()
On 08/01/2021 14:41, Michal Hocko wrote:
> On Wed 06-01-21 16:20:15, Milan Broz wrote:
[...]
>> Original report is https://gitlab.com/cryptsetup/cryptsetup/-/issues/617
>> but this is apparently a kernel issue, just amplified by usage of
>> munlockall().
>
> Which kernel version do you see this with? Have older releases worked
> better?

Hi,

I tried 5.10 stable and randomly few kernels I have built on testing VM
(5.3 was the oldest), it seems to be very similar run time, so the
problem is apparently old... (I can test some specific kernel version if
it make any sense).

For mainline (reproducer above):

With 5.11.0-0.rc2.20210106git36bbbd0e234d.117.fc34.x86_64 (latest Fedora
rawhide kernel build - many debug options are on)

# time ./lock
locked
unlocked

real    0m32.287s
user    0m0.001s
sys     0m32.126s

Today's Linus git - 5.11.0-rc2+ in my testing x86_64 VM (no extensive
kernel debug options):

# time ./lock
locked
unlocked

real    0m4.172s
user    0m0.000s
sys     0m4.172s

m.
Re: Very slow unlockall()
On Wed 06-01-21 16:20:15, Milan Broz wrote:
> Hi,
>
> we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup
> code and someone tried to use it with hardened memory allocator library.
>
> Execution time was increased to extreme (minutes) and as we found, the
> problem is in munlockall().
[...]
> Could please anyone check what's wrong here with the memory locking code?
> Running it on my notebook I can effectively DoS the system :)
>
> Original report is https://gitlab.com/cryptsetup/cryptsetup/-/issues/617
> but this is apparently a kernel issue, just amplified by usage of
> munlockall().

Which kernel version do you see this with? Have older releases worked
better?
--
Michal Hocko
SUSE Labs
Very slow unlockall()
Hi,

we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code
and someone tried to use it with hardened memory allocator library.

Execution time was increased to extreme (minutes) and as we found, the
problem is in munlockall().

Here is a plain reproducer for the core without any external code - it
takes unlocking on Fedora rawhide kernel more than 30 seconds!
I can reproduce it on 5.10 kernels and Linus' git.

The reproducer below tries to mmap large amount memory with PROT_NONE
(later never used).
The real code of course does something more useful but the problem is the
same.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main (int argc, char *argv[])
{
        void *p  = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED) return 1;

        if (mlockall(MCL_CURRENT | MCL_FUTURE)) return 1;
        printf("locked\n");

        if (munlockall()) return 1;
        printf("unlocked\n");

        return 0;
}

In traceback I see that time is spent in munlock_vma_pages_range.

[ 2962.006813] Call Trace:
[ 2962.006814]  ? munlock_vma_pages_range+0xe7/0x4b0
[ 2962.006814]  ? vma_merge+0xf3/0x3c0
[ 2962.006815]  ? mlock_fixup+0x111/0x190
[ 2962.006815]  ? apply_mlockall_flags+0xa7/0x110
[ 2962.006816]  ? __do_sys_munlockall+0x2e/0x60
[ 2962.006816]  ? do_syscall_64+0x33/0x40
...

Or with perf, I see

# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ............................
#
    48.18%  lock     [kernel.kallsyms]  [k] lock_is_held_type
    11.67%  lock     [kernel.kallsyms]  [k] ___might_sleep
    10.65%  lock     [kernel.kallsyms]  [k] follow_page_mask
     9.17%  lock     [kernel.kallsyms]  [k] debug_lockdep_rcu_enabled
     6.73%  lock     [kernel.kallsyms]  [k] munlock_vma_pages_range
...

Could please anyone check what's wrong here with the memory locking code?
Running it on my notebook I can effectively DoS the system :)

Original report is https://gitlab.com/cryptsetup/cryptsetup/-/issues/617
but this is apparently a kernel issue, just amplified by usage of
munlockall().

Thanks,
Milan