Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-29 Thread Michal Hocko
On Tue 28-11-17 21:14:23, John Hubbard wrote:
> On 11/28/2017 12:12 AM, Michal Hocko wrote:
> > On Mon 27-11-17 15:26:27, John Hubbard wrote:
> > [...]
> >> Let me add a belated report, then: we ran into this limit while 
> >> implementing 
> >> an early version of Unified Memory[1], back in 2013. The implementation
> >> at the time depended on tracking that assumed "one allocation == one vma".
> > 
> > And you tried hard to make those VMAs really separate? E.g. with
> > prot_none gaps?
> 
> We didn't do that, and in fact I'm probably failing to grasp the underlying 
> design idea that you have in mind there...hints welcome...

mmap code tries to merge vmas very aggressively so you have to try to
make too many vmas. One way to separate different vmas is to mprotect
holes to trap potential {over,under}flows.

> What we did was to hook into the mmap callbacks in the kernel driver, after 
> userspace mmap'd a region (via a custom allocator API). And we had an ioctl
> in there, to connect up other allocation attributes that couldn't be passed
> through via mmap. Again, this was for regions of memory that were to be
> migrated between CPU and device (GPU).

Or maybe your driver made the vma merging impossible by requesting
explicit ranges which are not adjacent.

> >> So, with only 64K vmas, we quickly ran out, and changed the design to work
> >> around that. (And later, the design was *completely* changed to use a 
> >> separate
> >> tracking system altogether). exag
> >>
> >> The existing limit seems rather too low, at least from my perspective. 
> >> Maybe
> >> it would be better, if expressed as a function of RAM size?
> > 
> > Dunno. Whenever we tried to do RAM scaling it turned out a bad idea
> > after years when memory grown much more than the code author expected.
> > Just look how we scaled hash table sizes... But maybe you can come up
> > with something clever. In any case tuning this from the userspace is a
> > trivial thing to do and I am somehow skeptical that any early boot code
> > would trip over the limit.
> > 
> 
> I agree that this is not a limit that boot code is likely to hit. And maybe 
> tuning from userspace really is the right approach here, considering that
> there is a real cost to going too large. 
> 
> Just philosophically here, hard limits like this seem a little awkward if 
> they 
> are set once in, say, 1999 (gross exaggeration here, for effect) and then not
> updated to stay with the times, right? In other words, one should not 
> routinely 
> need to tune most things. That's why I was wondering if something crude and 
> silly
> would work, such as just a ratio of RAM to vma count. (I'm more just trying to
> understand the "rules" here, than to debate--I don't have a strong opinion 
> on this.)

Well, rlimits are in general not very usable. Yet I do not think we
should simply wipe them out.

> The fact that this apparently failed with hash tables is interesting, I'd
> love to read more if you have any notes or links. I spotted a 2014 LWN article
> ( https://lwn.net/Articles/612100 ) about hash table resizing, and some 
> commits
> that fixed resizing bugs, such as
> 
> 12311959ecf8a ("rhashtable: fix shift by 64 when shrinking")
> 
> ...was it just a storm of bugs that showed up?

No, it was just that large (TB) machines allocated insanely large hash
tables for things which will never have any way to fill them up. See
9017217b6f45 ("mm: adaptive hash table scaling").

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-29 Thread Michal Hocko
On Tue 28-11-17 21:14:23, John Hubbard wrote:
> On 11/28/2017 12:12 AM, Michal Hocko wrote:
> > On Mon 27-11-17 15:26:27, John Hubbard wrote:
> > [...]
> >> Let me add a belated report, then: we ran into this limit while 
> >> implementing 
> >> an early version of Unified Memory[1], back in 2013. The implementation
> >> at the time depended on tracking that assumed "one allocation == one vma".
> > 
> > And you tried hard to make those VMAs really separate? E.g. with
> > prot_none gaps?
> 
> We didn't do that, and in fact I'm probably failing to grasp the underlying 
> design idea that you have in mind there...hints welcome...

mmap code tries to merge vmas very aggressively so you have to try to
make too many vmas. One way to separate different vmas is to mprotect
holes to trap potential {over,under}flows.

> What we did was to hook into the mmap callbacks in the kernel driver, after 
> userspace mmap'd a region (via a custom allocator API). And we had an ioctl
> in there, to connect up other allocation attributes that couldn't be passed
> through via mmap. Again, this was for regions of memory that were to be
> migrated between CPU and device (GPU).

Or maybe your driver made the vma merging impossible by requesting
explicit ranges which are not adjacent.

> >> So, with only 64K vmas, we quickly ran out, and changed the design to work
> >> around that. (And later, the design was *completely* changed to use a 
> >> separate
> >> tracking system altogether). exag
> >>
> >> The existing limit seems rather too low, at least from my perspective. 
> >> Maybe
> >> it would be better, if expressed as a function of RAM size?
> > 
> > Dunno. Whenever we tried to do RAM scaling it turned out a bad idea
> > after years when memory grown much more than the code author expected.
> > Just look how we scaled hash table sizes... But maybe you can come up
> > with something clever. In any case tuning this from the userspace is a
> > trivial thing to do and I am somehow skeptical that any early boot code
> > would trip over the limit.
> > 
> 
> I agree that this is not a limit that boot code is likely to hit. And maybe 
> tuning from userspace really is the right approach here, considering that
> there is a real cost to going too large. 
> 
> Just philosophically here, hard limits like this seem a little awkward if 
> they 
> are set once in, say, 1999 (gross exaggeration here, for effect) and then not
> updated to stay with the times, right? In other words, one should not 
> routinely 
> need to tune most things. That's why I was wondering if something crude and 
> silly
> would work, such as just a ratio of RAM to vma count. (I'm more just trying to
> understand the "rules" here, than to debate--I don't have a strong opinion 
> on this.)

Well, rlimits are in general not very usable. Yet I do not think we
should simply wipe them out.

> The fact that this apparently failed with hash tables is interesting, I'd
> love to read more if you have any notes or links. I spotted a 2014 LWN article
> ( https://lwn.net/Articles/612100 ) about hash table resizing, and some 
> commits
> that fixed resizing bugs, such as
> 
> 12311959ecf8a ("rhashtable: fix shift by 64 when shrinking")
> 
> ...was it just a storm of bugs that showed up?

No, it was just that large (TB) machines allocated insanely large hash
tables for things which will never have any way to fill them up. See
9017217b6f45 ("mm: adaptive hash table scaling").

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-28 Thread John Hubbard
On 11/28/2017 12:12 AM, Michal Hocko wrote:
> On Mon 27-11-17 15:26:27, John Hubbard wrote:
> [...]
>> Let me add a belated report, then: we ran into this limit while implementing 
>> an early version of Unified Memory[1], back in 2013. The implementation
>> at the time depended on tracking that assumed "one allocation == one vma".
> 
> And you tried hard to make those VMAs really separate? E.g. with
> prot_none gaps?

We didn't do that, and in fact I'm probably failing to grasp the underlying 
design idea that you have in mind there...hints welcome...

What we did was to hook into the mmap callbacks in the kernel driver, after 
userspace mmap'd a region (via a custom allocator API). And we had an ioctl
in there, to connect up other allocation attributes that couldn't be passed
through via mmap. Again, this was for regions of memory that were to be
migrated between CPU and device (GPU).

> 
>> So, with only 64K vmas, we quickly ran out, and changed the design to work
>> around that. (And later, the design was *completely* changed to use a 
>> separate
>> tracking system altogether). exag
>>
>> The existing limit seems rather too low, at least from my perspective. Maybe
>> it would be better, if expressed as a function of RAM size?
> 
> Dunno. Whenever we tried to do RAM scaling it turned out a bad idea
> after years when memory grown much more than the code author expected.
> Just look how we scaled hash table sizes... But maybe you can come up
> with something clever. In any case tuning this from the userspace is a
> trivial thing to do and I am somehow skeptical that any early boot code
> would trip over the limit.
> 

I agree that this is not a limit that boot code is likely to hit. And maybe 
tuning from userspace really is the right approach here, considering that
there is a real cost to going too large. 

Just philosophically here, hard limits like this seem a little awkward if they 
are set once in, say, 1999 (gross exaggeration here, for effect) and then not
updated to stay with the times, right? In other words, one should not routinely 
need to tune most things. That's why I was wondering if something crude and 
silly
would work, such as just a ratio of RAM to vma count. (I'm more just trying to
understand the "rules" here, than to debate--I don't have a strong opinion 
on this.)

The fact that this apparently failed with hash tables is interesting, I'd
love to read more if you have any notes or links. I spotted a 2014 LWN article
( https://lwn.net/Articles/612100 ) about hash table resizing, and some commits
that fixed resizing bugs, such as

12311959ecf8a ("rhashtable: fix shift by 64 when shrinking")

...was it just a storm of bugs that showed up?


thanks,
John Hubbard


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-28 Thread John Hubbard
On 11/28/2017 12:12 AM, Michal Hocko wrote:
> On Mon 27-11-17 15:26:27, John Hubbard wrote:
> [...]
>> Let me add a belated report, then: we ran into this limit while implementing 
>> an early version of Unified Memory[1], back in 2013. The implementation
>> at the time depended on tracking that assumed "one allocation == one vma".
> 
> And you tried hard to make those VMAs really separate? E.g. with
> prot_none gaps?

We didn't do that, and in fact I'm probably failing to grasp the underlying 
design idea that you have in mind there...hints welcome...

What we did was to hook into the mmap callbacks in the kernel driver, after 
userspace mmap'd a region (via a custom allocator API). And we had an ioctl
in there, to connect up other allocation attributes that couldn't be passed
through via mmap. Again, this was for regions of memory that were to be
migrated between CPU and device (GPU).

> 
>> So, with only 64K vmas, we quickly ran out, and changed the design to work
>> around that. (And later, the design was *completely* changed to use a 
>> separate
>> tracking system altogether). exag
>>
>> The existing limit seems rather too low, at least from my perspective. Maybe
>> it would be better, if expressed as a function of RAM size?
> 
> Dunno. Whenever we tried to do RAM scaling it turned out a bad idea
> after years when memory grown much more than the code author expected.
> Just look how we scaled hash table sizes... But maybe you can come up
> with something clever. In any case tuning this from the userspace is a
> trivial thing to do and I am somehow skeptical that any early boot code
> would trip over the limit.
> 

I agree that this is not a limit that boot code is likely to hit. And maybe 
tuning from userspace really is the right approach here, considering that
there is a real cost to going too large. 

Just philosophically here, hard limits like this seem a little awkward if they 
are set once in, say, 1999 (gross exaggeration here, for effect) and then not
updated to stay with the times, right? In other words, one should not routinely 
need to tune most things. That's why I was wondering if something crude and 
silly
would work, such as just a ratio of RAM to vma count. (I'm more just trying to
understand the "rules" here, than to debate--I don't have a strong opinion 
on this.)

The fact that this apparently failed with hash tables is interesting, I'd
love to read more if you have any notes or links. I spotted a 2014 LWN article
( https://lwn.net/Articles/612100 ) about hash table resizing, and some commits
that fixed resizing bugs, such as

12311959ecf8a ("rhashtable: fix shift by 64 when shrinking")

...was it just a storm of bugs that showed up?


thanks,
John Hubbard


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-28 Thread Michal Hocko
On Mon 27-11-17 15:26:27, John Hubbard wrote:
[...]
> Let me add a belated report, then: we ran into this limit while implementing 
> an early version of Unified Memory[1], back in 2013. The implementation
> at the time depended on tracking that assumed "one allocation == one vma".

And you tried hard to make those VMAs really separate? E.g. with
prot_none gaps?

> So, with only 64K vmas, we quickly ran out, and changed the design to work
> around that. (And later, the design was *completely* changed to use a separate
> tracking system altogether). 
> 
> The existing limit seems rather too low, at least from my perspective. Maybe
> it would be better, if expressed as a function of RAM size?

Dunno. Whenever we tried to do RAM scaling it turned out a bad idea
after years when memory grown much more than the code author expected.
Just look how we scaled hash table sizes... But maybe you can come up
with something clever. In any case tuning this from the userspace is a
trivial thing to do and I am somehow skeptical that any early boot code
would trip over the limit.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-28 Thread Michal Hocko
On Mon 27-11-17 15:26:27, John Hubbard wrote:
[...]
> Let me add a belated report, then: we ran into this limit while implementing 
> an early version of Unified Memory[1], back in 2013. The implementation
> at the time depended on tracking that assumed "one allocation == one vma".

And you tried hard to make those VMAs really separate? E.g. with
prot_none gaps?

> So, with only 64K vmas, we quickly ran out, and changed the design to work
> around that. (And later, the design was *completely* changed to use a separate
> tracking system altogether). 
> 
> The existing limit seems rather too low, at least from my perspective. Maybe
> it would be better, if expressed as a function of RAM size?

Dunno. Whenever we tried to do RAM scaling it turned out a bad idea
after years when memory grown much more than the code author expected.
Just look how we scaled hash table sizes... But maybe you can come up
with something clever. In any case tuning this from the userspace is a
trivial thing to do and I am somehow skeptical that any early boot code
would trip over the limit.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread John Hubbard
On 11/27/2017 11:52 AM, Michal Hocko wrote:
> On Mon 27-11-17 20:18:00, Mikael Pettersson wrote:
>> On Mon, Nov 27, 2017 at 11:12 AM, Michal Hocko  wrote:
 I've kept the kernel tunable to not break the API towards user-space,
 but it's a no-op now.  Also the distinction between split_vma() and
 __split_vma() disappears, so they are merged.
>>>
>>> Could you be more explicit about _why_ we need to remove this tunable?
>>> I am not saying I disagree, the removal simplifies the code but I do not
>>> really see any justification here.
>>
>> In principle you don't "need" to, as those that know about it can bump it
>> to some insanely high value and get on with life.  Meanwhile those that don't
>> (and I was one of them until fairly recently, and I'm no newcomer to Unix or
>> Linux) get to scratch their heads and wonder why the kernel says ENOMEM
>> when one has loads of free RAM.
> 
> I agree that our error reporting is more than suboptimal in this regard.
> These are all historical mistakes and we have much more of those. The
> thing is that we have means to debug these issues (check
> /proc//maps e.g.).
> 
>> But what _is_ the justification for having this arbitrary limit?
>> There might have been historical reasons, but at least ELF core dumps
>> are no longer a problem.
> 
> Andi has already mentioned the the resource consumption. You can create
> a lot of unreclaimable memory and there should be some cap. Whether our
> default is good is questionable. Whether we can remove it altogether is
> a different thing.
> 
> As I've said I am not a great fan of the limit but "I've just notice it
> breaks on me" doesn't sound like a very good justification. You still
> have an option to increase it. Considering we do not have too many
> reports suggests that this is not such a big deal for most users.
> 

Let me add a belated report, then: we ran into this limit while implementing 
an early version of Unified Memory[1], back in 2013. The implementation
at the time depended on tracking that assumed "one allocation == one vma".
So, with only 64K vmas, we quickly ran out, and changed the design to work
around that. (And later, the design was *completely* changed to use a separate
tracking system altogether). 

The existing limit seems rather too low, at least from my perspective. Maybe
it would be better, if expressed as a function of RAM size?


[1] https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/

This is a way to automatically (via page faulting) migrate memory
between CPUs and devices (GPUs, here). This is before HMM, of course.

thanks,
John Hubbard
  


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread John Hubbard
On 11/27/2017 11:52 AM, Michal Hocko wrote:
> On Mon 27-11-17 20:18:00, Mikael Pettersson wrote:
>> On Mon, Nov 27, 2017 at 11:12 AM, Michal Hocko  wrote:
 I've kept the kernel tunable to not break the API towards user-space,
 but it's a no-op now.  Also the distinction between split_vma() and
 __split_vma() disappears, so they are merged.
>>>
>>> Could you be more explicit about _why_ we need to remove this tunable?
>>> I am not saying I disagree, the removal simplifies the code but I do not
>>> really see any justification here.
>>
>> In principle you don't "need" to, as those that know about it can bump it
>> to some insanely high value and get on with life.  Meanwhile those that don't
>> (and I was one of them until fairly recently, and I'm no newcomer to Unix or
>> Linux) get to scratch their heads and wonder why the kernel says ENOMEM
>> when one has loads of free RAM.
> 
> I agree that our error reporting is more than suboptimal in this regard.
> These are all historical mistakes and we have much more of those. The
> thing is that we have means to debug these issues (check
> /proc//maps e.g.).
> 
>> But what _is_ the justification for having this arbitrary limit?
>> There might have been historical reasons, but at least ELF core dumps
>> are no longer a problem.
> 
> Andi has already mentioned the the resource consumption. You can create
> a lot of unreclaimable memory and there should be some cap. Whether our
> default is good is questionable. Whether we can remove it altogether is
> a different thing.
> 
> As I've said I am not a great fan of the limit but "I've just notice it
> breaks on me" doesn't sound like a very good justification. You still
> have an option to increase it. Considering we do not have too many
> reports suggests that this is not such a big deal for most users.
> 

Let me add a belated report, then: we ran into this limit while implementing 
an early version of Unified Memory[1], back in 2013. The implementation
at the time depended on tracking that assumed "one allocation == one vma".
So, with only 64K vmas, we quickly ran out, and changed the design to work
around that. (And later, the design was *completely* changed to use a separate
tracking system altogether). 

The existing limit seems rather too low, at least from my perspective. Maybe
it would be better, if expressed as a function of RAM size?


[1] https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/

This is a way to automatically (via page faulting) migrate memory
between CPUs and devices (GPUs, here). This is before HMM, of course.

thanks,
John Hubbard
  


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Michal Hocko
On Mon 27-11-17 12:21:21, Andi Kleen wrote:
> On Mon, Nov 27, 2017 at 08:57:32PM +0100, Michal Hocko wrote:
> > On Mon 27-11-17 19:32:18, Michal Hocko wrote:
> > > On Mon 27-11-17 09:25:16, Andi Kleen wrote:
> > [...]
> > > > The reason the limit was there originally because it allows a DoS
> > > > attack against the kernel by filling all unswappable memory up with 
> > > > VMAs.
> > > 
> > > We can reduce the effect by accounting vmas to memory cgroups.
> > 
> > As it turned out we already do.
> > vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
> 
> That only helps if you have memory cgroups enabled. It would be a regression
> to break the accounting on all the systems that don't.

I agree. And I didn't say we should remove the existing limit. I am just
saying that we can reduce existing problems by increasing the limit and
relying on memcg accounting where possible.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Michal Hocko
On Mon 27-11-17 12:21:21, Andi Kleen wrote:
> On Mon, Nov 27, 2017 at 08:57:32PM +0100, Michal Hocko wrote:
> > On Mon 27-11-17 19:32:18, Michal Hocko wrote:
> > > On Mon 27-11-17 09:25:16, Andi Kleen wrote:
> > [...]
> > > > The reason the limit was there originally because it allows a DoS
> > > > attack against the kernel by filling all unswappable memory up with 
> > > > VMAs.
> > > 
> > > We can reduce the effect by accounting vmas to memory cgroups.
> > 
> > As it turned out we already do.
> > vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
> 
> That only helps if you have memory cgroups enabled. It would be a regression
> to break the accounting on all the systems that don't.

I agree. And I didn't say we should remove the existing limit. I am just
saying that we can reduce existing problems by increasing the limit and
relying on memcg accounting where possible.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Andi Kleen
On Mon, Nov 27, 2017 at 08:57:32PM +0100, Michal Hocko wrote:
> On Mon 27-11-17 19:32:18, Michal Hocko wrote:
> > On Mon 27-11-17 09:25:16, Andi Kleen wrote:
> [...]
> > > The reason the limit was there originally because it allows a DoS
> > > attack against the kernel by filling all unswappable memory up with VMAs.
> > 
> > We can reduce the effect by accounting vmas to memory cgroups.
> 
> As it turned out we already do.
>   vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);

That only helps if you have memory cgroups enabled. It would be a regression
to break the accounting on all the systems that don't.

-Andi


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Andi Kleen
On Mon, Nov 27, 2017 at 08:57:32PM +0100, Michal Hocko wrote:
> On Mon 27-11-17 19:32:18, Michal Hocko wrote:
> > On Mon 27-11-17 09:25:16, Andi Kleen wrote:
> [...]
> > > The reason the limit was there originally because it allows a DoS
> > > attack against the kernel by filling all unswappable memory up with VMAs.
> > 
> > We can reduce the effect by accounting vmas to memory cgroups.
> 
> As it turned out we already do.
>   vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);

That only helps if you have memory cgroups enabled. It would be a regression
to break the accounting on all the systems that don't.

-Andi


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Michal Hocko
On Mon 27-11-17 19:32:18, Michal Hocko wrote:
> On Mon 27-11-17 09:25:16, Andi Kleen wrote:
[...]
> > The reason the limit was there originally because it allows a DoS
> > attack against the kernel by filling all unswappable memory up with VMAs.
> 
> We can reduce the effect by accounting vmas to memory cgroups.

As it turned out we already do.
vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Michal Hocko
On Mon 27-11-17 19:32:18, Michal Hocko wrote:
> On Mon 27-11-17 09:25:16, Andi Kleen wrote:
[...]
> > The reason the limit was there originally because it allows a DoS
> > attack against the kernel by filling all unswappable memory up with VMAs.
> 
> We can reduce the effect by accounting vmas to memory cgroups.

As it turned out we already do.
vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Michal Hocko
On Mon 27-11-17 20:18:00, Mikael Pettersson wrote:
> On Mon, Nov 27, 2017 at 11:12 AM, Michal Hocko  wrote:
> > > I've kept the kernel tunable to not break the API towards user-space,
> > > but it's a no-op now.  Also the distinction between split_vma() and
> > > __split_vma() disappears, so they are merged.
> >
> > Could you be more explicit about _why_ we need to remove this tunable?
> > I am not saying I disagree, the removal simplifies the code but I do not
> > really see any justification here.
> 
> In principle you don't "need" to, as those that know about it can bump it
> to some insanely high value and get on with life.  Meanwhile those that don't
> (and I was one of them until fairly recently, and I'm no newcomer to Unix or
> Linux) get to scratch their heads and wonder why the kernel says ENOMEM
> when one has loads of free RAM.

I agree that our error reporting is more than suboptimal in this regard.
These are all historical mistakes and we have much more of those. The
thing is that we have means to debug these issues (check
/proc//maps e.g.).

> But what _is_ the justification for having this arbitrary limit?
> There might have been historical reasons, but at least ELF core dumps
> are no longer a problem.

Andi has already mentioned the the resource consumption. You can create
a lot of unreclaimable memory and there should be some cap. Whether our
default is good is questionable. Whether we can remove it altogether is
a different thing.

As I've said I am not a great fan of the limit but "I've just notice it
breaks on me" doesn't sound like a very good justification. You still
have an option to increase it. Considering we do not have too many
reports suggests that this is not such a big deal for most users.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Michal Hocko
On Mon 27-11-17 20:18:00, Mikael Pettersson wrote:
> On Mon, Nov 27, 2017 at 11:12 AM, Michal Hocko  wrote:
> > > I've kept the kernel tunable to not break the API towards user-space,
> > > but it's a no-op now.  Also the distinction between split_vma() and
> > > __split_vma() disappears, so they are merged.
> >
> > Could you be more explicit about _why_ we need to remove this tunable?
> > I am not saying I disagree, the removal simplifies the code but I do not
> > really see any justification here.
> 
> In principle you don't "need" to, as those that know about it can bump it
> to some insanely high value and get on with life.  Meanwhile those that don't
> (and I was one of them until fairly recently, and I'm no newcomer to Unix or
> Linux) get to scratch their heads and wonder why the kernel says ENOMEM
> when one has loads of free RAM.

I agree that our error reporting is more than suboptimal in this regard.
These are all historical mistakes and we have much more of those. The
thing is that we have means to debug these issues (check
/proc//maps e.g.).

> But what _is_ the justification for having this arbitrary limit?
> There might have been historical reasons, but at least ELF core dumps
> are no longer a problem.

Andi has already mentioned the the resource consumption. You can create
a lot of unreclaimable memory and there should be some cap. Whether our
default is good is questionable. Whether we can remove it altogether is
a different thing.

As I've said I am not a great fan of the limit but "I've just notice it
breaks on me" doesn't sound like a very good justification. You still
have an option to increase it. Considering we do not have too many
reports suggests that this is not such a big deal for most users.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Alexey Dobriyan
[resent because I can't type]

> vm.max_map_count

I always thought it is some kind of algorithmic complexity limiter and
kernel memory limiter. VMAs are under SLAB_ACCOUNT nowadays but ->mmap
list stays:

$ chgrep -e 'for.*vma = vma->vm_next' | wc -l
41

In particular readdir on /proc/*/map_files .

I'm saying you can not simply remove this sysctl.


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Alexey Dobriyan
[resent because I can't type]

> vm.max_map_count

I always thought it is some kind of algorithmic complexity limiter and
kernel memory limiter. VMAs are under SLAB_ACCOUNT nowadays but ->mmap
list stays:

$ chgrep -e 'for.*vma = vma->vm_next' | wc -l
41

In particular readdir on /proc/*/map_files .

I'm saying you can not simply remove this sysctl.


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Alexey Dobriyan
> vm.max_map_count

I always thought it is some kind of algorithmic complexity limiter and
kernel memory limiter. VMAs are under SLAB_ACCOUNT nowadays but ->mmap
list stays:

$ chgrep -e 'for.*vma = vma->vm_next' | wc -l
41

In particular readdir on /proc/*/map_files .

I'm saying you can not simply remove this sysctl.


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Alexey Dobriyan
> vm.max_map_count

I always thought it is some kind of algorithmic complexity limiter and
kernel memory limiter. VMAs are under SLAB_ACCOUNT nowadays but ->mmap
list stays:

$ chgrep -e 'for.*vma = vma->vm_next' | wc -l
41

In particular readdir on /proc/*/map_files .

I'm saying you can not simply remove this sysctl.


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Mikael Pettersson
On Mon, Nov 27, 2017 at 6:25 PM, Andi Kleen  wrote:
> It's an arbitrary scaling limit on the how many mappings the process
> has. The more memory you have the bigger a problem it is. We've
> ran into this problem too on larger systems.
>
> The reason the limit was there originally because it allows a DoS
> attack against the kernel by filling all unswappable memory up with VMAs.
>
> The old limit was designed for much smaller systems than we have
> today.
>
> There needs to be some limit, but it should be on the number of memory
> pinned by the VMAs, and needs to scale with the available memory,
> so that large systems are not penalized.

Fully agreed.  One problem with the current limit is that number of VMAs
is only weakly related to the amount of memory one has mapped, and is
also prone to grow due to memory fragmentation.  I've seen processes
differ by 3X number of VMAs, even though they ran the same code and
had similar memory sizes; they only differed on how long they had been
running and which servers they ran on (and how long those had been up).

> Unfortunately just making it part of the existing mlock limit could
> break some existing setups which max out the mlock limit with something
> else. Maybe we need a new rlimit for this?
>
> -Andi


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Mikael Pettersson
On Mon, Nov 27, 2017 at 6:25 PM, Andi Kleen  wrote:
> It's an arbitrary scaling limit on the how many mappings the process
> has. The more memory you have the bigger a problem it is. We've
> ran into this problem too on larger systems.
>
> The reason the limit was there originally because it allows a DoS
> attack against the kernel by filling all unswappable memory up with VMAs.
>
> The old limit was designed for much smaller systems than we have
> today.
>
> There needs to be some limit, but it should be on the number of memory
> pinned by the VMAs, and needs to scale with the available memory,
> so that large systems are not penalized.

Fully agreed.  One problem with the current limit is that number of VMAs
is only weakly related to the amount of memory one has mapped, and is
also prone to grow due to memory fragmentation.  I've seen processes
differ by 3X number of VMAs, even though they ran the same code and
had similar memory sizes; they only differed on how long they had been
running and which servers they ran on (and how long those had been up).

> Unfortunately just making it part of the existing mlock limit could
> break some existing setups which max out the mlock limit with something
> else. Maybe we need a new rlimit for this?
>
> -Andi


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Mikael Pettersson
On Mon, Nov 27, 2017 at 5:22 PM, Matthew Wilcox  wrote:
>> Could you be more explicit about _why_ we need to remove this tunable?
>> I am not saying I disagree, the removal simplifies the code but I do not
>> really see any justification here.
>
> I imagine he started seeing random syscalls failing with ENOMEM and
> eventually tracked it down to this stupid limit we used to need.

Exactly, except the origin (mmap() failing) was hidden behind layers upon layers
of user-space memory management code (not ours), which just said "failed to
allocate N bytes" (with N about 0.001% of the free RAM).  And it
wasn't reproducible.


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Mikael Pettersson
On Mon, Nov 27, 2017 at 5:22 PM, Matthew Wilcox  wrote:
>> Could you be more explicit about _why_ we need to remove this tunable?
>> I am not saying I disagree, the removal simplifies the code but I do not
>> really see any justification here.
>
> I imagine he started seeing random syscalls failing with ENOMEM and
> eventually tracked it down to this stupid limit we used to need.

Exactly, except the origin (mmap() failing) was hidden behind layers upon layers
of user-space memory management code (not ours), which just said "failed to
allocate N bytes" (with N about 0.001% of the free RAM).  And it
wasn't reproducible.


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Mikael Pettersson
On Mon, Nov 27, 2017 at 11:12 AM, Michal Hocko  wrote:
> > I've kept the kernel tunable to not break the API towards user-space,
> > but it's a no-op now.  Also the distinction between split_vma() and
> > __split_vma() disappears, so they are merged.
>
> Could you be more explicit about _why_ we need to remove this tunable?
> I am not saying I disagree, the removal simplifies the code but I do not
> really see any justification here.

In principle you don't "need" to, as those that know about it can bump it
to some insanely high value and get on with life.  Meanwhile those that don't
(and I was one of them until fairly recently, and I'm no newcomer to Unix or
Linux) get to scratch their heads and wonder why the kernel says ENOMEM
when one has loads of free RAM.

But what _is_ the justification for having this arbitrary limit?
There might have
been historical reasons, but at least ELF core dumps are no longer a problem.

/Mikael


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Mikael Pettersson
On Mon, Nov 27, 2017 at 11:12 AM, Michal Hocko  wrote:
> > I've kept the kernel tunable to not break the API towards user-space,
> > but it's a no-op now.  Also the distinction between split_vma() and
> > __split_vma() disappears, so they are merged.
>
> Could you be more explicit about _why_ we need to remove this tunable?
> I am not saying I disagree, the removal simplifies the code but I do not
> really see any justification here.

In principle you don't "need" to, as those that know about it can bump it
to some insanely high value and get on with life.  Meanwhile those that don't
(and I was one of them until fairly recently, and I'm no newcomer to Unix or
Linux) get to scratch their heads and wonder why the kernel says ENOMEM
when one has loads of free RAM.

But what _is_ the justification for having this arbitrary limit?
There might have
been historical reasons, but at least ELF core dumps are no longer a problem.

/Mikael


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Michal Hocko
On Mon 27-11-17 09:25:16, Andi Kleen wrote:
> Michal Hocko  writes:
> >
> > Could you be more explicit about _why_ we need to remove this tunable?
> > I am not saying I disagree, the removal simplifies the code but I do not
> > really see any justification here.
> 
> It's an arbitrary scaling limit on the how many mappings the process
> has. The more memory you have the bigger a problem it is. We've
> ran into this problem too on larger systems.

Why cannot you increase the limit?

> The reason the limit was there originally because it allows a DoS
> attack against the kernel by filling all unswappable memory up with VMAs.

We can reduce the effect by accounting vmas to memory cgroups.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Michal Hocko
On Mon 27-11-17 09:25:16, Andi Kleen wrote:
> Michal Hocko  writes:
> >
> > Could you be more explicit about _why_ we need to remove this tunable?
> > I am not saying I disagree, the removal simplifies the code but I do not
> > really see any justification here.
> 
> It's an arbitrary scaling limit on the how many mappings the process
> has. The more memory you have the bigger a problem it is. We've
> ran into this problem too on larger systems.

Why cannot you increase the limit?

> The reason the limit was there originally because it allows a DoS
> attack against the kernel by filling all unswappable memory up with VMAs.

We can reduce the effect by accounting vmas to memory cgroups.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Andi Kleen
Michal Hocko  writes:
>
> Could you be more explicit about _why_ we need to remove this tunable?
> I am not saying I disagree, the removal simplifies the code but I do not
> really see any justification here.

It's an arbitrary scaling limit on the how many mappings the process
has. The more memory you have the bigger a problem it is. We've
ran into this problem too on larger systems.

The reason the limit was there originally because it allows a DoS
attack against the kernel by filling all unswappable memory up with VMAs.

The old limit was designed for much smaller systems than we have
today.

There needs to be some limit, but it should be on the number of memory
pinned by the VMAs, and needs to scale with the available memory,
so that large systems are not penalized.

Unfortunately just making it part of the existing mlock limit could
break some existing setups which max out the mlock limit with something
else. Maybe we need a new rlimit for this?

-Andi


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Andi Kleen
Michal Hocko  writes:
>
> Could you be more explicit about _why_ we need to remove this tunable?
> I am not saying I disagree, the removal simplifies the code but I do not
> really see any justification here.

It's an arbitrary scaling limit on the how many mappings the process
has. The more memory you have the bigger a problem it is. We've
ran into this problem too on larger systems.

The reason the limit was there originally because it allows a DoS
attack against the kernel by filling all unswappable memory up with VMAs.

The old limit was designed for much smaller systems than we have
today.

There needs to be some limit, but it should be on the number of memory
pinned by the VMAs, and needs to scale with the available memory,
so that large systems are not penalized.

Unfortunately just making it part of the existing mlock limit could
break some existing setups which max out the mlock limit with something
else. Maybe we need a new rlimit for this?

-Andi


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Matthew Wilcox
On Mon, Nov 27, 2017 at 11:12:32AM +0100, Michal Hocko wrote:
> On Sun 26-11-17 17:09:32, Mikael Pettersson wrote:
> > - Reaching the limit causes various memory management system calls to
> >   fail with ENOMEM, which is a lie.  Combined with the unpredictability
> >   of the number of mappings in a process, especially when non-trivial
> >   memory management or heavy file mapping is used, it can be difficult
> >   to reproduce these events and debug them.  It's also confusing to get
> >   ENOMEM when you know you have lots of free RAM.

[snip]

> Could you be more explicit about _why_ we need to remove this tunable?
> I am not saying I disagree, the removal simplifies the code but I do not
> really see any justification here.

I imagine he started seeing random syscalls failing with ENOMEM and
eventually tracked it down to this stupid limit we used to need.


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Matthew Wilcox
On Mon, Nov 27, 2017 at 11:12:32AM +0100, Michal Hocko wrote:
> On Sun 26-11-17 17:09:32, Mikael Pettersson wrote:
> > - Reaching the limit causes various memory management system calls to
> >   fail with ENOMEM, which is a lie.  Combined with the unpredictability
> >   of the number of mappings in a process, especially when non-trivial
> >   memory management or heavy file mapping is used, it can be difficult
> >   to reproduce these events and debug them.  It's also confusing to get
> >   ENOMEM when you know you have lots of free RAM.

[snip]

> Could you be more explicit about _why_ we need to remove this tunable?
> I am not saying I disagree, the removal simplifies the code but I do not
> really see any justification here.

I imagine he started seeing random syscalls failing with ENOMEM and
eventually tracked it down to this stupid limit we used to need.


Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Michal Hocko
On Sun 26-11-17 17:09:32, Mikael Pettersson wrote:
> The `vm.max_map_count' sysctl limit is IMO useless and confusing, so
> this patch disables it.
> 
> - Old ELF had a limit of 64K segments, making core dumps from processes
>   with more mappings than that problematic, but that was fixed in March
>   2010 ("elf coredump: add extended numbering support").
> 
> - There are no internal data structures sized by this limit, making it
>   entirely artificial.

each mapping has its vma structure and that in turn can be tracked by
other data structures so this is not entirely true.

> - When viewed as a limit on memory consumption, it is ineffective since
>   the number of mappings does not correspond directly to the amount of
>   memory consumed, since each mapping is variable-length.
> 
> - Reaching the limit causes various memory management system calls to
>   fail with ENOMEM, which is a lie.  Combined with the unpredictability
>   of the number of mappings in a process, especially when non-trivial
>   memory management or heavy file mapping is used, it can be difficult
>   to reproduce these events and debug them.  It's also confusing to get
>   ENOMEM when you know you have lots of free RAM.
> 
> This limit was apparently introduced in the 2.1.80 kernel (first as a
> compile-time constant, later changed to a sysctl), but I haven't been
> able to find any description for it in Git or the LKML archives, so I
> don't know what the original motivation was.
> 
> I've kept the kernel tunable to not break the API towards user-space,
> but it's a no-op now.  Also the distinction between split_vma() and
> __split_vma() disappears, so they are merged.

Could you be more explicit about _why_ we need to remove this tunable?
I am not saying I disagree, the removal simplifies the code but I do not
really see any justification here.

> Tested on x86_64 with Fedora 26 user-space.  Also built an ARM NOMMU
> kernel to make sure NOMMU compiles and links cleanly.
> 
> Signed-off-by: Mikael Pettersson 
> ---
>  Documentation/sysctl/vm.txt   | 17 +-
>  Documentation/vm/ksm.txt  |  4 
>  Documentation/vm/remap_file_pages.txt |  4 
>  fs/binfmt_elf.c   |  4 
>  include/linux/mm.h| 23 ---
>  kernel/sysctl.c   |  3 +++
>  mm/madvise.c  | 12 ++
>  mm/mmap.c | 42 
> ++-
>  mm/mremap.c   |  7 --
>  mm/nommu.c|  3 ---
>  mm/util.c |  1 -
>  11 files changed, 13 insertions(+), 107 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index b920423f88cb..0fcb511d07e6 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -35,7 +35,7 @@ Currently, these files are in /proc/sys/vm:
>  - laptop_mode
>  - legacy_va_layout
>  - lowmem_reserve_ratio
> -- max_map_count
> +- max_map_count (unused, kept for backwards compatibility)
>  - memory_failure_early_kill
>  - memory_failure_recovery
>  - min_free_kbytes
> @@ -400,21 +400,6 @@ The minimum value is 1 (1/1 -> 100%).
>  
>  ==
>  
> -max_map_count:
> -
> -This file contains the maximum number of memory map areas a process
> -may have. Memory map areas are used as a side-effect of calling
> -malloc, directly by mmap, mprotect, and madvise, and also when loading
> -shared libraries.
> -
> -While most applications need less than a thousand maps, certain
> -programs, particularly malloc debuggers, may consume lots of them,
> -e.g., up to one or two maps per allocation.
> -
> -The default value is 65536.
> -
> -=
> -
>  memory_failure_early_kill:
>  
>  Control how to kill processes when uncorrected memory error (typically
> diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
> index 6686bd267dc9..4a917f88cb11 100644
> --- a/Documentation/vm/ksm.txt
> +++ b/Documentation/vm/ksm.txt
> @@ -38,10 +38,6 @@ the range for whenever the KSM daemon is started; even if 
> the range
>  cannot contain any pages which KSM could actually merge; even if
>  MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
>  
> -If a region of memory must be split into at least one new MADV_MERGEABLE
> -or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
> -will exceed vm.max_map_count (see Documentation/sysctl/vm.txt).
> -
>  Like other madvise calls, they are intended for use on mapped areas of
>  the user address space: they will report ENOMEM if the specified range
>  includes unmapped gaps (though working on the intervening mapped areas),
> diff --git a/Documentation/vm/remap_file_pages.txt 
> b/Documentation/vm/remap_file_pages.txt
> index f609142f406a..85985a89f05d 100644
> 

Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

2017-11-27 Thread Michal Hocko
On Sun 26-11-17 17:09:32, Mikael Pettersson wrote:
> The `vm.max_map_count' sysctl limit is IMO useless and confusing, so
> this patch disables it.
> 
> - Old ELF had a limit of 64K segments, making core dumps from processes
>   with more mappings than that problematic, but that was fixed in March
>   2010 ("elf coredump: add extended numbering support").
> 
> - There are no internal data structures sized by this limit, making it
>   entirely artificial.

each mapping has its vma structure and that in turn can be tracked by
other data structures so this is not entirely true.

> - When viewed as a limit on memory consumption, it is ineffective since
>   the number of mappings does not correspond directly to the amount of
>   memory consumed, since each mapping is variable-length.
> 
> - Reaching the limit causes various memory management system calls to
>   fail with ENOMEM, which is a lie.  Combined with the unpredictability
>   of the number of mappings in a process, especially when non-trivial
>   memory management or heavy file mapping is used, it can be difficult
>   to reproduce these events and debug them.  It's also confusing to get
>   ENOMEM when you know you have lots of free RAM.
> 
> This limit was apparently introduced in the 2.1.80 kernel (first as a
> compile-time constant, later changed to a sysctl), but I haven't been
> able to find any description for it in Git or the LKML archives, so I
> don't know what the original motivation was.
> 
> I've kept the kernel tunable to not break the API towards user-space,
> but it's a no-op now.  Also the distinction between split_vma() and
> __split_vma() disappears, so they are merged.

Could you be more explicit about _why_ we need to remove this tunable?
I am not saying I disagree, the removal simplifies the code but I do not
really see any justification here.

> Tested on x86_64 with Fedora 26 user-space.  Also built an ARM NOMMU
> kernel to make sure NOMMU compiles and links cleanly.
> 
> Signed-off-by: Mikael Pettersson 
> ---
>  Documentation/sysctl/vm.txt   | 17 +-
>  Documentation/vm/ksm.txt  |  4 
>  Documentation/vm/remap_file_pages.txt |  4 
>  fs/binfmt_elf.c   |  4 
>  include/linux/mm.h| 23 ---
>  kernel/sysctl.c   |  3 +++
>  mm/madvise.c  | 12 ++
>  mm/mmap.c | 42 
> ++-
>  mm/mremap.c   |  7 --
>  mm/nommu.c|  3 ---
>  mm/util.c |  1 -
>  11 files changed, 13 insertions(+), 107 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index b920423f88cb..0fcb511d07e6 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -35,7 +35,7 @@ Currently, these files are in /proc/sys/vm:
>  - laptop_mode
>  - legacy_va_layout
>  - lowmem_reserve_ratio
> -- max_map_count
> +- max_map_count (unused, kept for backwards compatibility)
>  - memory_failure_early_kill
>  - memory_failure_recovery
>  - min_free_kbytes
> @@ -400,21 +400,6 @@ The minimum value is 1 (1/1 -> 100%).
>  
>  ==
>  
> -max_map_count:
> -
> -This file contains the maximum number of memory map areas a process
> -may have. Memory map areas are used as a side-effect of calling
> -malloc, directly by mmap, mprotect, and madvise, and also when loading
> -shared libraries.
> -
> -While most applications need less than a thousand maps, certain
> -programs, particularly malloc debuggers, may consume lots of them,
> -e.g., up to one or two maps per allocation.
> -
> -The default value is 65536.
> -
> -=
> -
>  memory_failure_early_kill:
>  
>  Control how to kill processes when uncorrected memory error (typically
> diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
> index 6686bd267dc9..4a917f88cb11 100644
> --- a/Documentation/vm/ksm.txt
> +++ b/Documentation/vm/ksm.txt
> @@ -38,10 +38,6 @@ the range for whenever the KSM daemon is started; even if 
> the range
>  cannot contain any pages which KSM could actually merge; even if
>  MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
>  
> -If a region of memory must be split into at least one new MADV_MERGEABLE
> -or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
> -will exceed vm.max_map_count (see Documentation/sysctl/vm.txt).
> -
>  Like other madvise calls, they are intended for use on mapped areas of
>  the user address space: they will report ENOMEM if the specified range
>  includes unmapped gaps (though working on the intervening mapped areas),
> diff --git a/Documentation/vm/remap_file_pages.txt 
> b/Documentation/vm/remap_file_pages.txt
> index f609142f406a..85985a89f05d 100644
> ---