Re: Very slow unlockall()

2021-02-10 Thread Hugh Dickins
On Wed, 10 Feb 2021, Michal Hocko wrote:
> On Wed 10-02-21 17:57:29, Michal Hocko wrote:
> > On Wed 10-02-21 16:18:50, Vlastimil Babka wrote:
> [...]
> > > And the munlock (munlock_vma_pages_range()) is slow, because it uses
> > > follow_page_mask() in a loop incrementing addresses by PAGE_SIZE, so that's
> > > always traversing all levels of page tables from scratch. Funnily enough,
> > > speeding this up was my first linux-mm series years ago. But the speedup only
> > > works if pte's are present, which is not the case for unpopulated PROT_NONE
> > > areas. That use case was unexpected back then. We should probably convert this
> > > code to a proper page table walk. If there are large areas with unpopulated pmd
> > > entries (or even higher levels) we would traverse them very quickly.
> > 
> > Yes, this is a good idea. I suspect it will be a little bit tricky without
> > duplicating a large part of the gup page table walker.
> 
> Thinking about it some more, unmap_page_range would be a better model
> for this operation.

Could do, I suppose; but I thought it was just a matter of going back to
using follow_page_mask() in munlock_vma_pages_range() (whose fear of THP
split looks overwrought, since an extra reference now prevents splitting);
and enhancing follow_page_mask() to let the no_page_table() FOLL_DUMP
case set ctx->page_mask appropriately (or perhaps it can be preset
at a higher level, without having to pass ctx so far down, dunno).

Nice little job, but I couldn't quite spare the time to do it: needs a
bit more care than I could afford (I suspect the page_increm business at
the end of munlock_vma_pages_range() is good enough while THP tails are
skipped one by one, but will need to be fixed to apply page_mask correctly
to the start - __get_user_pages()'s page_increm-entation looks superior).

Hugh


Re: Very slow unlockall()

2021-02-10 Thread Michal Hocko
On Wed 10-02-21 17:57:29, Michal Hocko wrote:
> On Wed 10-02-21 16:18:50, Vlastimil Babka wrote:
[...]
> > And the munlock (munlock_vma_pages_range()) is slow, because it uses
> > follow_page_mask() in a loop incrementing addresses by PAGE_SIZE, so that's
> > always traversing all levels of page tables from scratch. Funnily enough,
> > speeding this up was my first linux-mm series years ago. But the speedup only
> > works if pte's are present, which is not the case for unpopulated PROT_NONE
> > areas. That use case was unexpected back then. We should probably convert this
> > code to a proper page table walk. If there are large areas with unpopulated pmd
> > entries (or even higher levels) we would traverse them very quickly.
> 
> Yes, this is a good idea. I suspect it will be a little bit tricky without
> duplicating a large part of the gup page table walker.

Thinking about it some more, unmap_page_range would be a better model
for this operation.
-- 
Michal Hocko
SUSE Labs
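
[Editor's note: a rough, untested sketch of the walker-based munlock the thread
converges on, using the mm_walk_ops callbacks of the 5.x era. Helper names
(walk_page_range, vm_normal_page, munlock_vma_page) are the real API of that
time, but THP, locking and accounting details are deliberately omitted; this is
pseudocode illustrating the idea, not a patch.]

```
/* Pseudocode sketch: munlock via walk_page_range(), so unpopulated
 * pmd/pud/p4d ranges are skipped in one step instead of being probed
 * page by page through follow_page_mask(). */

static int munlock_pte_entry(pte_t *pte, unsigned long addr,
                             unsigned long next, struct mm_walk *walk)
{
        struct page *page;

        if (!pte_present(*pte))
                return 0;               /* nothing mlocked here */
        page = vm_normal_page(walk->vma, addr, *pte);
        if (page)
                munlock_vma_page(page); /* clear PG_mlocked, return to LRU */
        return 0;
}

static const struct mm_walk_ops munlock_walk_ops = {
        .pte_entry = munlock_pte_entry,
        /* no .pte_hole: empty upper-level entries are skipped for free */
};

static void munlock_vma_pages_range_walk(struct vm_area_struct *vma,
                                         unsigned long start, unsigned long end)
{
        vma->vm_flags &= ~VM_LOCKED;    /* needs mmap_lock held for write */
        walk_page_range(vma->vm_mm, start, end, &munlock_walk_ops, NULL);
}
```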


Re: Very slow unlockall()

2021-02-10 Thread Michal Hocko
On Wed 10-02-21 16:18:50, Vlastimil Babka wrote:
> On 2/1/21 8:19 PM, Milan Broz wrote:
> > On 01/02/2021 19:55, Vlastimil Babka wrote:
> >> On 2/1/21 7:00 PM, Milan Broz wrote:
> >>> On 01/02/2021 14:08, Vlastimil Babka wrote:
>  On 1/8/21 3:39 PM, Milan Broz wrote:
> > On 08/01/2021 14:41, Michal Hocko wrote:
> >> On Wed 06-01-21 16:20:15, Milan Broz wrote:
> >>> Hi,
> >>>
> >>> we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code
> >>> and someone tried to use it with a hardened memory allocator library.
> >>>
> >>> Execution time was increased to extreme (minutes) and as we found, the problem
> >>> is in munlockall().
> >>>
> >>> Here is a plain reproducer for the core without any external code - it takes
> >>> unlocking on Fedora rawhide kernel more than 30 seconds!
> >>> I can reproduce it on 5.10 kernels and Linus' git.
> >>>
> >>> The reproducer below tries to mmap a large amount of memory with PROT_NONE
> >>> (later never used).
> >>> The real code of course does something more useful but the problem is the same.
> >>>
> >>> #include <stdio.h>
> >>> #include <stdlib.h>
> >>> #include <unistd.h>
> >>> #include <sys/mman.h>
> >>>
> >>> int main (int argc, char *argv[])
> >>> {
> >>> void *p  = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >> 
> >> So, this is a 2TB memory area, but PROT_NONE means it's never actually populated,
> >> although mlockall(MCL_CURRENT) should do that. Once you put PROT_READ |
> >> PROT_WRITE there, the mlockall() starts taking ages.
> >> 
> >> So does that reflect your use case? munlockall() with large PROT_NONE areas? If
> >> so, munlock_vma_pages_range() is indeed not optimized for that, but I would
> >> expect such a scenario to be uncommon, so better clarify first.
> > 
> > It is just a simple reproducer of the underlying problem, as suggested here
> > https://gitlab.com/cryptsetup/cryptsetup/-/issues/617#note_478342301
> > 
> > We use mlockall() in cryptsetup and with hardened malloc it slows down
> > unlock significantly.
> > (For the real case problem please read the whole issue report above.)
> 
> OK, finally read through the bug report, and learned two things:
> 
> 1) the PROT_NONE is indeed an intentional part of the reproducer
> 2) Linux mailing lists still have a bad reputation and people avoid them. That's
> sad :( Well, thanks for overcoming that :)
> 
> Daniel there says "I think the Linux kernel implementation of mlockall is quite
> broken and tries to lock all the reserved PROT_NONE regions in advance which
> doesn't make any sense."
> 
> From my testing this doesn't seem to be the case, as the mlockall() part is very
> fast, so I don't think it faults in and mlocks PROT_NONE areas. It only starts
> to be slow when changed to PROT_READ|PROT_WRITE. But the munlockall() part is
> slow even with PROT_NONE as we don't skip the PROT_NONE areas there. We probably
> can't just skip them, as they might actually contain mlocked pages if those were
> faulted first with PROT_READ/PROT_WRITE and only then changed to PROT_NONE.

Mlock code is quite easy to misunderstand but IIRC the mlock part
should be rather straightforward. It will mark VMAs as locked, do some
merging/splitting where appropriate and finally populate the range by
gup. This should fail because the VMA allows neither read nor write,
right? And mlock should report that. mlockall will not bother because it
will ignore errors on population. So there is no page table walk
happening.

> And the munlock (munlock_vma_pages_range()) is slow, because it uses
> follow_page_mask() in a loop incrementing addresses by PAGE_SIZE, so that's
> always traversing all levels of page tables from scratch. Funnily enough,
> speeding this up was my first linux-mm series years ago. But the speedup only
> works if pte's are present, which is not the case for unpopulated PROT_NONE
> areas. That use case was unexpected back then. We should probably convert this
> code to a proper page table walk. If there are large areas with unpopulated pmd
> entries (or even higher levels) we would traverse them very quickly.

Yes, this is a good idea. I suspect it will be a little bit tricky without
duplicating a large part of the gup page table walker.
-- 
Michal Hocko
SUSE Labs


Re: Very slow unlockall()

2021-02-10 Thread Vlastimil Babka
On 2/1/21 8:19 PM, Milan Broz wrote:
> On 01/02/2021 19:55, Vlastimil Babka wrote:
>> On 2/1/21 7:00 PM, Milan Broz wrote:
>>> On 01/02/2021 14:08, Vlastimil Babka wrote:
 On 1/8/21 3:39 PM, Milan Broz wrote:
> On 08/01/2021 14:41, Michal Hocko wrote:
>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
>>> Hi,
>>>
>>> we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code
>>> and someone tried to use it with a hardened memory allocator library.
>>>
>>> Execution time was increased to extreme (minutes) and as we found, the problem
>>> is in munlockall().
>>>
>>> Here is a plain reproducer for the core without any external code - it takes
>>> unlocking on Fedora rawhide kernel more than 30 seconds!
>>> I can reproduce it on 5.10 kernels and Linus' git.
>>>
>>> The reproducer below tries to mmap a large amount of memory with PROT_NONE
>>> (later never used).
>>> The real code of course does something more useful but the problem is the same.
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <unistd.h>
>>> #include <sys/mman.h>
>>>
>>> int main (int argc, char *argv[])
>>> {
>>> void *p  = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> 
>> So, this is a 2TB memory area, but PROT_NONE means it's never actually populated,
>> although mlockall(MCL_CURRENT) should do that. Once you put PROT_READ |
>> PROT_WRITE there, the mlockall() starts taking ages.
>> 
>> So does that reflect your use case? munlockall() with large PROT_NONE areas? If
>> so, munlock_vma_pages_range() is indeed not optimized for that, but I would
>> expect such a scenario to be uncommon, so better clarify first.
> 
> It is just a simple reproducer of the underlying problem, as suggested here 
> https://gitlab.com/cryptsetup/cryptsetup/-/issues/617#note_478342301
> 
> We use mlockall() in cryptsetup and with hardened malloc it slows down unlock
> significantly.
> (For the real case problem please read the whole issue report above.)

OK, finally read through the bug report, and learned two things:

1) the PROT_NONE is indeed an intentional part of the reproducer
2) Linux mailing lists still have a bad reputation and people avoid them. That's
sad :( Well, thanks for overcoming that :)

Daniel there says "I think the Linux kernel implementation of mlockall is quite
broken and tries to lock all the reserved PROT_NONE regions in advance which
doesn't make any sense."

From my testing this doesn't seem to be the case, as the mlockall() part is very
fast, so I don't think it faults in and mlocks PROT_NONE areas. It only starts
to be slow when changed to PROT_READ|PROT_WRITE. But the munlockall() part is
slow even with PROT_NONE as we don't skip the PROT_NONE areas there. We probably
can't just skip them, as they might actually contain mlocked pages if those were
faulted first with PROT_READ/PROT_WRITE and only then changed to PROT_NONE.

And the munlock (munlock_vma_pages_range()) is slow, because it uses
follow_page_mask() in a loop incrementing addresses by PAGE_SIZE, so that's
always traversing all levels of page tables from scratch. Funnily enough,
speeding this up was my first linux-mm series years ago. But the speedup only
works if pte's are present, which is not the case for unpopulated PROT_NONE
areas. That use case was unexpected back then. We should probably convert this
code to a proper page table walk. If there are large areas with unpopulated pmd
entries (or even higher levels) we would traverse them very quickly.



Re: Very slow unlockall()

2021-02-01 Thread Milan Broz
On 01/02/2021 19:55, Vlastimil Babka wrote:
> On 2/1/21 7:00 PM, Milan Broz wrote:
>> On 01/02/2021 14:08, Vlastimil Babka wrote:
>>> On 1/8/21 3:39 PM, Milan Broz wrote:
 On 08/01/2021 14:41, Michal Hocko wrote:
> On Wed 06-01-21 16:20:15, Milan Broz wrote:
>> Hi,
>>
>> we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code
>> and someone tried to use it with a hardened memory allocator library.
>>
>> Execution time was increased to extreme (minutes) and as we found, the problem
>> is in munlockall().
>>
>> Here is a plain reproducer for the core without any external code - it takes
>> unlocking on Fedora rawhide kernel more than 30 seconds!
>> I can reproduce it on 5.10 kernels and Linus' git.
>>
>> The reproducer below tries to mmap a large amount of memory with PROT_NONE
>> (later never used).
>> The real code of course does something more useful but the problem is the same.
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <sys/mman.h>
>>
>> int main (int argc, char *argv[])
>> {
>> void *p  = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
> So, this is a 2TB memory area, but PROT_NONE means it's never actually populated,
> although mlockall(MCL_CURRENT) should do that. Once you put PROT_READ |
> PROT_WRITE there, the mlockall() starts taking ages.
> 
> So does that reflect your use case? munlockall() with large PROT_NONE areas? If
> so, munlock_vma_pages_range() is indeed not optimized for that, but I would
> expect such a scenario to be uncommon, so better clarify first.

It is just a simple reproducer of the underlying problem, as suggested here 
https://gitlab.com/cryptsetup/cryptsetup/-/issues/617#note_478342301

We use mlockall() in cryptsetup and with hardened malloc it slows down unlock
significantly.
(For the real case problem please read the whole issue report above.)

m.


Re: Very slow unlockall()

2021-02-01 Thread Vlastimil Babka
On 2/1/21 7:00 PM, Milan Broz wrote:
> On 01/02/2021 14:08, Vlastimil Babka wrote:
>> On 1/8/21 3:39 PM, Milan Broz wrote:
>>> On 08/01/2021 14:41, Michal Hocko wrote:
 On Wed 06-01-21 16:20:15, Milan Broz wrote:
> Hi,
>
> we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code
> and someone tried to use it with a hardened memory allocator library.
>
> Execution time was increased to extreme (minutes) and as we found, the problem
> is in munlockall().
>
> Here is a plain reproducer for the core without any external code - it takes
> unlocking on Fedora rawhide kernel more than 30 seconds!
> I can reproduce it on 5.10 kernels and Linus' git.
>
> The reproducer below tries to mmap a large amount of memory with PROT_NONE
> (later never used).
> The real code of course does something more useful but the problem is the same.
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
>
> int main (int argc, char *argv[])
> {
> void *p  = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

So, this is a 2TB memory area, but PROT_NONE means it's never actually populated,
although mlockall(MCL_CURRENT) should do that. Once you put PROT_READ |
PROT_WRITE there, the mlockall() starts taking ages.

So does that reflect your use case? munlockall() with large PROT_NONE areas? If
so, munlock_vma_pages_range() is indeed not optimized for that, but I would
expect such a scenario to be uncommon, so better clarify first.

>
> if (p == MAP_FAILED) return 1;
>
> if (mlockall(MCL_CURRENT | MCL_FUTURE)) return 1;
> printf("locked\n");
>
> if (munlockall()) return 1;
> printf("unlocked\n");
>
> return 0;
> }
> 
> ...
> 
>>> Today's Linus git - 5.11.0-rc2+ in my testing x86_64 VM (no extensive kernel
>>> debug options):
>>>
>>> # time ./lock
>>> locked
>>> unlocked
>>>
>>> real    0m4.172s
>>> user    0m0.000s
>>> sys     0m4.172s
>> 
>> The perf report would be more interesting from this configuration.
> 
> ok, I cannot run perf on that particular VM but tried the latest Fedora stable
> kernel without debug options  - 5.10.12-200.fc33.x86_64
> 
> This is the report running reproducer above:
> 
> time:
> real    0m6.123s
> user    0m0.099s
> sys     0m5.310s
> 
> perf:
> 
> # Total Lost Samples: 0
> #
> # Samples: 20K of event 'cycles'
> # Event count (approx.): 20397603279
> #
> # Overhead  Command  Shared Object  Symbol  
> #   ...  .  
> #
> 47.26%  lock [kernel.kallsyms]  [k] follow_page_mask
> 20.43%  lock [kernel.kallsyms]  [k] munlock_vma_pages_range
> 15.92%  lock [kernel.kallsyms]  [k] follow_page
>  7.40%  lock [kernel.kallsyms]  [k] rcu_all_qs
>  5.87%  lock [kernel.kallsyms]  [k] _cond_resched
>  3.08%  lock [kernel.kallsyms]  [k] follow_huge_addr
>  0.01%  lock [kernel.kallsyms]  [k] __update_load_avg_cfs_rq
>  0.01%  lock [kernel.kallsyms]  [k] fput
>  0.01%  lock [kernel.kallsyms]  [k] rmap_walk_file
>  0.00%  lock [kernel.kallsyms]  [k] page_mapped
>  0.00%  lock [kernel.kallsyms]  [k] native_irq_return_iret
>  0.00%  lock [kernel.kallsyms]  [k] _raw_spin_lock_irq
>  0.00%  lock [kernel.kallsyms]  [k] perf_iterate_ctx
>  0.00%  lock [kernel.kallsyms]  [k] finish_task_switch
>  0.00%  perf [kernel.kallsyms]  [k] native_sched_clock
>  0.00%  lock [kernel.kallsyms]  [k] native_write_msr
>  0.00%  perf [kernel.kallsyms]  [k] native_write_msr
> 
> 
> m.
> 



Re: Very slow unlockall()

2021-02-01 Thread Milan Broz
On 01/02/2021 14:08, Vlastimil Babka wrote:
> On 1/8/21 3:39 PM, Milan Broz wrote:
>> On 08/01/2021 14:41, Michal Hocko wrote:
>>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
 Hi,

 we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code
 and someone tried to use it with a hardened memory allocator library.
 
 Execution time was increased to extreme (minutes) and as we found, the problem
 is in munlockall().
 
 Here is a plain reproducer for the core without any external code - it takes
 unlocking on Fedora rawhide kernel more than 30 seconds!
 I can reproduce it on 5.10 kernels and Linus' git.
 
 The reproducer below tries to mmap a large amount of memory with PROT_NONE
 (later never used).
 The real code of course does something more useful but the problem is the same.
 
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <sys/mman.h>
 
 int main (int argc, char *argv[])
 {
 void *p  = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

 if (p == MAP_FAILED) return 1;

 if (mlockall(MCL_CURRENT | MCL_FUTURE)) return 1;
 printf("locked\n");

 if (munlockall()) return 1;
 printf("unlocked\n");

 return 0;
 }

...

>> Today's Linus git - 5.11.0-rc2+ in my testing x86_64 VM (no extensive kernel
>> debug options):
>>
>> # time ./lock
>> locked
>> unlocked
>>
>> real    0m4.172s
>> user    0m0.000s
>> sys     0m4.172s
> 
> The perf report would be more interesting from this configuration.

ok, I cannot run perf on that particular VM but tried the latest Fedora stable
kernel without debug options  - 5.10.12-200.fc33.x86_64

This is the report running reproducer above:

time:
real    0m6.123s
user    0m0.099s
sys     0m5.310s

perf:

# Total Lost Samples: 0
#
# Samples: 20K of event 'cycles'
# Event count (approx.): 20397603279
#
# Overhead  Command  Shared Object  Symbol  
#   ...  .  
#
47.26%  lock [kernel.kallsyms]  [k] follow_page_mask
20.43%  lock [kernel.kallsyms]  [k] munlock_vma_pages_range
15.92%  lock [kernel.kallsyms]  [k] follow_page
 7.40%  lock [kernel.kallsyms]  [k] rcu_all_qs
 5.87%  lock [kernel.kallsyms]  [k] _cond_resched
 3.08%  lock [kernel.kallsyms]  [k] follow_huge_addr
 0.01%  lock [kernel.kallsyms]  [k] __update_load_avg_cfs_rq
 0.01%  lock [kernel.kallsyms]  [k] fput
 0.01%  lock [kernel.kallsyms]  [k] rmap_walk_file
 0.00%  lock [kernel.kallsyms]  [k] page_mapped
 0.00%  lock [kernel.kallsyms]  [k] native_irq_return_iret
 0.00%  lock [kernel.kallsyms]  [k] _raw_spin_lock_irq
 0.00%  lock [kernel.kallsyms]  [k] perf_iterate_ctx
 0.00%  lock [kernel.kallsyms]  [k] finish_task_switch
 0.00%  perf [kernel.kallsyms]  [k] native_sched_clock
 0.00%  lock [kernel.kallsyms]  [k] native_write_msr
 0.00%  perf [kernel.kallsyms]  [k] native_write_msr


m.


Re: Very slow unlockall()

2021-02-01 Thread Vlastimil Babka
On 1/8/21 3:39 PM, Milan Broz wrote:
> On 08/01/2021 14:41, Michal Hocko wrote:
>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
>>> Hi,
>>>
>>> we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code
>>> and someone tried to use it with a hardened memory allocator library.
>>>
>>> Execution time was increased to extreme (minutes) and as we found, the problem
>>> is in munlockall().
>>>
>>> Here is a plain reproducer for the core without any external code - it takes
>>> unlocking on Fedora rawhide kernel more than 30 seconds!
>>> I can reproduce it on 5.10 kernels and Linus' git.
>>>
>>> The reproducer below tries to mmap a large amount of memory with PROT_NONE
>>> (later never used).
>>> The real code of course does something more useful but the problem is the same.
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <unistd.h>
>>> #include <sys/mman.h>
>>>
>>> int main (int argc, char *argv[])
>>> {
>>> void *p  = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>
>>> if (p == MAP_FAILED) return 1;
>>>
>>> if (mlockall(MCL_CURRENT | MCL_FUTURE)) return 1;
>>> printf("locked\n");
>>>
>>> if (munlockall()) return 1;
>>> printf("unlocked\n");
>>>
>>> return 0;
>>> }
>>>
>>> In traceback I see that time is spent in munlock_vma_pages_range.
>>>
>>> [ 2962.006813] Call Trace:
>>> [ 2962.006814]  ? munlock_vma_pages_range+0xe7/0x4b0
>>> [ 2962.006814]  ? vma_merge+0xf3/0x3c0
>>> [ 2962.006815]  ? mlock_fixup+0x111/0x190
>>> [ 2962.006815]  ? apply_mlockall_flags+0xa7/0x110
>>> [ 2962.006816]  ? __do_sys_munlockall+0x2e/0x60
>>> [ 2962.006816]  ? do_syscall_64+0x33/0x40
>>> ...
>>>
>>> Or with perf, I see
>>>
>>> # Overhead  Command  Shared Object  Symbol
>>> #   ...  .  .
>>> #
>>> 48.18%  lock [kernel.kallsyms]  [k] lock_is_held_type
>>> 11.67%  lock [kernel.kallsyms]  [k] ___might_sleep
>>> 10.65%  lock [kernel.kallsyms]  [k] follow_page_mask
>>>  9.17%  lock [kernel.kallsyms]  [k] debug_lockdep_rcu_enabled
>>>  6.73%  lock [kernel.kallsyms]  [k] munlock_vma_pages_range
>>> ...

This seems to be from the debug kernel, as there's lockdep enabled. That's
expected to be very slow.

>>>
>>> Could please anyone check what's wrong here with the memory locking code?
>>> Running it on my notebook I can effectively DoS the system :)
>>>
>>> Original report is https://gitlab.com/cryptsetup/cryptsetup/-/issues/617
>>> but this is apparently a kernel issue, just amplified by usage of munlockall().
>> 
>> Which kernel version do you see this with? Have older releases worked
>> better?
> 
> Hi,
> 
> I tried 5.10 stable and randomly few kernels I have built on testing VM (5.3 
> was the oldest),
> it seems to be very similar run time, so the problem is apparently old... (I
> can test some specific kernel version if it makes any sense).
> 
> For mainline (reproducer above):
> 
> With 5.11.0-0.rc2.20210106git36bbbd0e234d.117.fc34.x86_64 (latest Fedora 
> rawhide kernel build - many debug options are on)

From that, the amount of debugging seems to be rather excessive in the Fedora
rawhide kernel. Is that a special debug flavour?

> # time ./lock 
> locked
> unlocked
> 
> real    0m32.287s
> user    0m0.001s
> sys     0m32.126s
> 
> 
> Today's Linus git - 5.11.0-rc2+ in my testing x86_64 VM (no extensive kernel 
> debug options):
> 
> # time ./lock
> locked
> unlocked
> 
> real    0m4.172s
> user    0m0.000s
> sys     0m4.172s

The perf report would be more interesting from this configuration.

> m.
> 



Re: Very slow unlockall()

2021-01-31 Thread Milan Broz
On 08/01/2021 15:39, Milan Broz wrote:
> On 08/01/2021 14:41, Michal Hocko wrote:
>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
>>> Hi,
>>>
>>> we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code
>>> and someone tried to use it with a hardened memory allocator library.
>>>
>>> Execution time was increased to extreme (minutes) and as we found, the problem
>>> is in munlockall().
>>>
>>> Here is a plain reproducer for the core without any external code - it takes
>>> unlocking on Fedora rawhide kernel more than 30 seconds!
>>> I can reproduce it on 5.10 kernels and Linus' git.
>>>
>>> The reproducer below tries to mmap a large amount of memory with PROT_NONE
>>> (later never used).
>>> The real code of course does something more useful but the problem is the same.
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <unistd.h>
>>> #include <sys/mman.h>
>>>
>>> int main (int argc, char *argv[])
>>> {
>>> void *p  = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>
>>> if (p == MAP_FAILED) return 1;
>>>
>>> if (mlockall(MCL_CURRENT | MCL_FUTURE)) return 1;
>>> printf("locked\n");
>>>
>>> if (munlockall()) return 1;
>>> printf("unlocked\n");
>>>
>>> return 0;
>>> }
>>>
>>> In traceback I see that time is spent in munlock_vma_pages_range.
>>>
>>> [ 2962.006813] Call Trace:
>>> [ 2962.006814]  ? munlock_vma_pages_range+0xe7/0x4b0
>>> [ 2962.006814]  ? vma_merge+0xf3/0x3c0
>>> [ 2962.006815]  ? mlock_fixup+0x111/0x190
>>> [ 2962.006815]  ? apply_mlockall_flags+0xa7/0x110
>>> [ 2962.006816]  ? __do_sys_munlockall+0x2e/0x60
>>> [ 2962.006816]  ? do_syscall_64+0x33/0x40
>>> ...
>>>
>>> Or with perf, I see
>>>
>>> # Overhead  Command  Shared Object  Symbol
>>> #   ...  .  .
>>> #
>>> 48.18%  lock [kernel.kallsyms]  [k] lock_is_held_type
>>> 11.67%  lock [kernel.kallsyms]  [k] ___might_sleep
>>> 10.65%  lock [kernel.kallsyms]  [k] follow_page_mask
>>>  9.17%  lock [kernel.kallsyms]  [k] debug_lockdep_rcu_enabled
>>>  6.73%  lock [kernel.kallsyms]  [k] munlock_vma_pages_range
>>> ...
>>>
>>>
>>> Could please anyone check what's wrong here with the memory locking code?
>>> Running it on my notebook I can effectively DoS the system :)
>>>
>>> Original report is https://gitlab.com/cryptsetup/cryptsetup/-/issues/617
>>> but this is apparently a kernel issue, just amplified by usage of munlockall().
>>
>> Which kernel version do you see this with? Have older releases worked
>> better?
> 
> Hi,
> 
> I tried 5.10 stable and randomly few kernels I have built on testing VM (5.3 
> was the oldest),
> it seems to be very similar run time, so the problem is apparently old... (I
> can test some specific kernel version if it makes any sense).
> 
> For mainline (reproducer above):
> 
> With 5.11.0-0.rc2.20210106git36bbbd0e234d.117.fc34.x86_64 (latest Fedora 
> rawhide kernel build - many debug options are on)
> 
> # time ./lock 
> locked
> unlocked
> 
> real    0m32.287s
> user    0m0.001s
> sys     0m32.126s
> 
> 
> Today's Linus git - 5.11.0-rc2+ in my testing x86_64 VM (no extensive kernel 
> debug options):
> 
> # time ./lock
> locked
> unlocked
> 
> real    0m4.172s
> user    0m0.000s
> sys     0m4.172s
> 
> m.

Hi,

so because there is no response, is this expected behavior of the memory management
subsystem then?

Thanks,
Milan






Re: Very slow unlockall()

2021-01-08 Thread Milan Broz
On 08/01/2021 14:41, Michal Hocko wrote:
> On Wed 06-01-21 16:20:15, Milan Broz wrote:
>> Hi,
>>
>> we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code
>> and someone tried to use it with a hardened memory allocator library.
>>
>> Execution time was increased to extreme (minutes) and as we found, the problem
>> is in munlockall().
>>
>> Here is a plain reproducer for the core without any external code - it takes
>> unlocking on Fedora rawhide kernel more than 30 seconds!
>> I can reproduce it on 5.10 kernels and Linus' git.
>>
>> The reproducer below tries to mmap a large amount of memory with PROT_NONE
>> (later never used).
>> The real code of course does something more useful but the problem is the same.
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <sys/mman.h>
>>
>> int main (int argc, char *argv[])
>> {
>> void *p  = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>
>> if (p == MAP_FAILED) return 1;
>>
>> if (mlockall(MCL_CURRENT | MCL_FUTURE)) return 1;
>> printf("locked\n");
>>
>> if (munlockall()) return 1;
>> printf("unlocked\n");
>>
>> return 0;
>> }
>>
>> In traceback I see that time is spent in munlock_vma_pages_range.
>>
>> [ 2962.006813] Call Trace:
>> [ 2962.006814]  ? munlock_vma_pages_range+0xe7/0x4b0
>> [ 2962.006814]  ? vma_merge+0xf3/0x3c0
>> [ 2962.006815]  ? mlock_fixup+0x111/0x190
>> [ 2962.006815]  ? apply_mlockall_flags+0xa7/0x110
>> [ 2962.006816]  ? __do_sys_munlockall+0x2e/0x60
>> [ 2962.006816]  ? do_syscall_64+0x33/0x40
>> ...
>>
>> Or with perf, I see
>>
>> # Overhead  Command  Shared Object  Symbol   
>> #   ...  .  .
>> #
>> 48.18%  lock [kernel.kallsyms]  [k] lock_is_held_type
>> 11.67%  lock [kernel.kallsyms]  [k] ___might_sleep
>> 10.65%  lock [kernel.kallsyms]  [k] follow_page_mask
>>  9.17%  lock [kernel.kallsyms]  [k] debug_lockdep_rcu_enabled
>>  6.73%  lock [kernel.kallsyms]  [k] munlock_vma_pages_range
>> ...
>>
>>
>> Could please anyone check what's wrong here with the memory locking code?
>> Running it on my notebook I can effectively DoS the system :)
>>
>> Original report is https://gitlab.com/cryptsetup/cryptsetup/-/issues/617
>> but this is apparently a kernel issue, just amplified by usage of munlockall().
> 
> Which kernel version do you see this with? Have older releases worked
> better?

Hi,

I tried 5.10 stable and randomly few kernels I have built on testing VM (5.3 
was the oldest),
it seems to be very similar run time, so the problem is apparently old... (I can
test some specific kernel version if it makes any sense).

For mainline (reproducer above):

With 5.11.0-0.rc2.20210106git36bbbd0e234d.117.fc34.x86_64 (latest Fedora 
rawhide kernel build - many debug options are on)

# time ./lock 
locked
unlocked

real    0m32.287s
user    0m0.001s
sys     0m32.126s


Today's Linus git - 5.11.0-rc2+ in my testing x86_64 VM (no extensive kernel 
debug options):

# time ./lock
locked
unlocked

real    0m4.172s
user    0m0.000s
sys     0m4.172s

m.


Re: Very slow unlockall()

2021-01-08 Thread Michal Hocko
On Wed 06-01-21 16:20:15, Milan Broz wrote:
> Hi,
> 
> we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code
> and someone tried to use it with a hardened memory allocator library.
> 
> Execution time was increased to extreme (minutes) and as we found, the problem
> is in munlockall().
> 
> Here is a plain reproducer for the core without any external code - it takes
> unlocking on Fedora rawhide kernel more than 30 seconds!
> I can reproduce it on 5.10 kernels and Linus' git.
> 
> The reproducer below tries to mmap a large amount of memory with PROT_NONE
> (later never used).
> The real code of course does something more useful but the problem is the same.
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
> 
> int main (int argc, char *argv[])
> {
> void *p  = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
> if (p == MAP_FAILED) return 1;
> 
> if (mlockall(MCL_CURRENT | MCL_FUTURE)) return 1;
> printf("locked\n");
> 
> if (munlockall()) return 1;
> printf("unlocked\n");
> 
> return 0;
> }
> 
> In traceback I see that time is spent in munlock_vma_pages_range.
> 
> [ 2962.006813] Call Trace:
> [ 2962.006814]  ? munlock_vma_pages_range+0xe7/0x4b0
> [ 2962.006814]  ? vma_merge+0xf3/0x3c0
> [ 2962.006815]  ? mlock_fixup+0x111/0x190
> [ 2962.006815]  ? apply_mlockall_flags+0xa7/0x110
> [ 2962.006816]  ? __do_sys_munlockall+0x2e/0x60
> [ 2962.006816]  ? do_syscall_64+0x33/0x40
> ...
> 
> Or with perf, I see
> 
> # Overhead  Command  Shared Object  Symbol   
> #   ...  .  .
> #
> 48.18%  lock [kernel.kallsyms]  [k] lock_is_held_type
> 11.67%  lock [kernel.kallsyms]  [k] ___might_sleep
> 10.65%  lock [kernel.kallsyms]  [k] follow_page_mask
>  9.17%  lock [kernel.kallsyms]  [k] debug_lockdep_rcu_enabled
>  6.73%  lock [kernel.kallsyms]  [k] munlock_vma_pages_range
> ...
> 
> 
> Could please anyone check what's wrong here with the memory locking code?
> Running it on my notebook I can effectively DoS the system :)
> 
> Original report is https://gitlab.com/cryptsetup/cryptsetup/-/issues/617
> but this is apparently a kernel issue, just amplified by usage of munlockall().

Which kernel version do you see this with? Have older releases worked
better?
-- 
Michal Hocko
SUSE Labs


Very slow unlockall()

2021-01-06 Thread Milan Broz
Hi,

we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code
and someone tried to use it with a hardened memory allocator library.

Execution time was increased to extreme (minutes) and as we found, the problem
is in munlockall().

Here is a plain reproducer for the core without any external code - it takes
unlocking on Fedora rawhide kernel more than 30 seconds!
I can reproduce it on 5.10 kernels and Linus' git.

The reproducer below tries to mmap a large amount of memory with PROT_NONE (later
never used).
The real code of course does something more useful but the problem is the same.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main (int argc, char *argv[])
{
void *p  = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

if (p == MAP_FAILED) return 1;

if (mlockall(MCL_CURRENT | MCL_FUTURE)) return 1;
printf("locked\n");

if (munlockall()) return 1;
printf("unlocked\n");

return 0;
}

In traceback I see that time is spent in munlock_vma_pages_range.

[ 2962.006813] Call Trace:
[ 2962.006814]  ? munlock_vma_pages_range+0xe7/0x4b0
[ 2962.006814]  ? vma_merge+0xf3/0x3c0
[ 2962.006815]  ? mlock_fixup+0x111/0x190
[ 2962.006815]  ? apply_mlockall_flags+0xa7/0x110
[ 2962.006816]  ? __do_sys_munlockall+0x2e/0x60
[ 2962.006816]  ? do_syscall_64+0x33/0x40
...

Or with perf, I see

# Overhead  Command  Shared Object  Symbol   
#   ...  .  .
#
48.18%  lock [kernel.kallsyms]  [k] lock_is_held_type
11.67%  lock [kernel.kallsyms]  [k] ___might_sleep
10.65%  lock [kernel.kallsyms]  [k] follow_page_mask
 9.17%  lock [kernel.kallsyms]  [k] debug_lockdep_rcu_enabled
 6.73%  lock [kernel.kallsyms]  [k] munlock_vma_pages_range
...


Could please anyone check what's wrong here with the memory locking code?
Running it on my notebook I can effectively DoS the system :)

Original report is https://gitlab.com/cryptsetup/cryptsetup/-/issues/617
but this is apparently a kernel issue, just amplified by usage of munlockall().

Thanks,
Milan