Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Tue 28-11-17 21:14:23, John Hubbard wrote: > On 11/28/2017 12:12 AM, Michal Hocko wrote: > > On Mon 27-11-17 15:26:27, John Hubbard wrote: > > [...] > >> Let me add a belated report, then: we ran into this limit while > >> implementing > >> an early version of Unified Memory[1], back in 2013. The implementation > >> at the time depended on tracking that assumed "one allocation == one vma". > > > > And you tried hard to make those VMAs really separate? E.g. with > > prot_none gaps? > > We didn't do that, and in fact I'm probably failing to grasp the underlying > design idea that you have in mind there...hints welcome... mmap code tries to merge vmas very aggressively so you have to try to make too many vmas. One way to separate different vmas is to mprotect holes to trap potential {over,under}flows. > What we did was to hook into the mmap callbacks in the kernel driver, after > userspace mmap'd a region (via a custom allocator API). And we had an ioctl > in there, to connect up other allocation attributes that couldn't be passed > through via mmap. Again, this was for regions of memory that were to be > migrated between CPU and device (GPU). Or maybe your driver made the vma merging impossible by requesting explicit ranges which are not adjacent. > >> So, with only 64K vmas, we quickly ran out, and changed the design to work > >> around that. (And later, the design was *completely* changed to use a > >> separate > >> tracking system altogether). exag > >> > >> The existing limit seems rather too low, at least from my perspective. > >> Maybe > >> it would be better, if expressed as a function of RAM size? > > > > Dunno. Whenever we tried to do RAM scaling it turned out a bad idea > > after years when memory grown much more than the code author expected. > > Just look how we scaled hash table sizes... But maybe you can come up > > with something clever. In any case tuning this from the userspace is a > > trivial thing to do and I am somehow skeptical that any early boot code > > would trip over the limit. > > > > I agree that this is not a limit that boot code is likely to hit. And maybe > tuning from userspace really is the right approach here, considering that > there is a real cost to going too large. > > Just philosophically here, hard limits like this seem a little awkward if > they > are set once in, say, 1999 (gross exaggeration here, for effect) and then not > updated to stay with the times, right? In other words, one should not > routinely > need to tune most things. That's why I was wondering if something crude and > silly > would work, such as just a ratio of RAM to vma count. (I'm more just trying to > understand the "rules" here, than to debate--I don't have a strong opinion > on this.) Well, rlimits are in general not very usable. Yet I do not think we should simply wipe them out. > The fact that this apparently failed with hash tables is interesting, I'd > love to read more if you have any notes or links. I spotted a 2014 LWN article > ( https://lwn.net/Articles/612100 ) about hash table resizing, and some > commits > that fixed resizing bugs, such as > > 12311959ecf8a ("rhashtable: fix shift by 64 when shrinking") > > ...was it just a storm of bugs that showed up? No, it was just that large (TB) machines allocated insanely large hash tables for things which will never have any way to fill them up. See 9017217b6f45 ("mm: adaptive hash table scaling"). -- Michal Hocko SUSE Labs
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Tue 28-11-17 21:14:23, John Hubbard wrote: > On 11/28/2017 12:12 AM, Michal Hocko wrote: > > On Mon 27-11-17 15:26:27, John Hubbard wrote: > > [...] > >> Let me add a belated report, then: we ran into this limit while > >> implementing > >> an early version of Unified Memory[1], back in 2013. The implementation > >> at the time depended on tracking that assumed "one allocation == one vma". > > > > And you tried hard to make those VMAs really separate? E.g. with > > prot_none gaps? > > We didn't do that, and in fact I'm probably failing to grasp the underlying > design idea that you have in mind there...hints welcome... mmap code tries to merge vmas very aggressively so you have to try to make too many vmas. One way to separate different vmas is to mprotect holes to trap potential {over,under}flows. > What we did was to hook into the mmap callbacks in the kernel driver, after > userspace mmap'd a region (via a custom allocator API). And we had an ioctl > in there, to connect up other allocation attributes that couldn't be passed > through via mmap. Again, this was for regions of memory that were to be > migrated between CPU and device (GPU). Or maybe your driver made the vma merging impossible by requesting explicit ranges which are not adjacent. > >> So, with only 64K vmas, we quickly ran out, and changed the design to work > >> around that. (And later, the design was *completely* changed to use a > >> separate > >> tracking system altogether). exag > >> > >> The existing limit seems rather too low, at least from my perspective. > >> Maybe > >> it would be better, if expressed as a function of RAM size? > > > > Dunno. Whenever we tried to do RAM scaling it turned out a bad idea > > after years when memory grown much more than the code author expected. > > Just look how we scaled hash table sizes... But maybe you can come up > > with something clever. In any case tuning this from the userspace is a > > trivial thing to do and I am somehow skeptical that any early boot code > > would trip over the limit. > > > > I agree that this is not a limit that boot code is likely to hit. And maybe > tuning from userspace really is the right approach here, considering that > there is a real cost to going too large. > > Just philosophically here, hard limits like this seem a little awkward if > they > are set once in, say, 1999 (gross exaggeration here, for effect) and then not > updated to stay with the times, right? In other words, one should not > routinely > need to tune most things. That's why I was wondering if something crude and > silly > would work, such as just a ratio of RAM to vma count. (I'm more just trying to > understand the "rules" here, than to debate--I don't have a strong opinion > on this.) Well, rlimits are in general not very usable. Yet I do not think we should simply wipe them out. > The fact that this apparently failed with hash tables is interesting, I'd > love to read more if you have any notes or links. I spotted a 2014 LWN article > ( https://lwn.net/Articles/612100 ) about hash table resizing, and some > commits > that fixed resizing bugs, such as > > 12311959ecf8a ("rhashtable: fix shift by 64 when shrinking") > > ...was it just a storm of bugs that showed up? No, it was just that large (TB) machines allocated insanely large hash tables for things which will never have any way to fill them up. See 9017217b6f45 ("mm: adaptive hash table scaling"). -- Michal Hocko SUSE Labs
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On 11/28/2017 12:12 AM, Michal Hocko wrote: > On Mon 27-11-17 15:26:27, John Hubbard wrote: > [...] >> Let me add a belated report, then: we ran into this limit while implementing >> an early version of Unified Memory[1], back in 2013. The implementation >> at the time depended on tracking that assumed "one allocation == one vma". > > And you tried hard to make those VMAs really separate? E.g. with > prot_none gaps? We didn't do that, and in fact I'm probably failing to grasp the underlying design idea that you have in mind there...hints welcome... What we did was to hook into the mmap callbacks in the kernel driver, after userspace mmap'd a region (via a custom allocator API). And we had an ioctl in there, to connect up other allocation attributes that couldn't be passed through via mmap. Again, this was for regions of memory that were to be migrated between CPU and device (GPU). > >> So, with only 64K vmas, we quickly ran out, and changed the design to work >> around that. (And later, the design was *completely* changed to use a >> separate >> tracking system altogether). exag >> >> The existing limit seems rather too low, at least from my perspective. Maybe >> it would be better, if expressed as a function of RAM size? > > Dunno. Whenever we tried to do RAM scaling it turned out a bad idea > after years when memory grown much more than the code author expected. > Just look how we scaled hash table sizes... But maybe you can come up > with something clever. In any case tuning this from the userspace is a > trivial thing to do and I am somehow skeptical that any early boot code > would trip over the limit. > I agree that this is not a limit that boot code is likely to hit. And maybe tuning from userspace really is the right approach here, considering that there is a real cost to going too large. Just philosophically here, hard limits like this seem a little awkward if they are set once in, say, 1999 (gross exaggeration here, for effect) and then not updated to stay with the times, right? In other words, one should not routinely need to tune most things. That's why I was wondering if something crude and silly would work, such as just a ratio of RAM to vma count. (I'm more just trying to understand the "rules" here, than to debate--I don't have a strong opinion on this.) The fact that this apparently failed with hash tables is interesting, I'd love to read more if you have any notes or links. I spotted a 2014 LWN article ( https://lwn.net/Articles/612100 ) about hash table resizing, and some commits that fixed resizing bugs, such as 12311959ecf8a ("rhashtable: fix shift by 64 when shrinking") ...was it just a storm of bugs that showed up? thanks, John Hubbard
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On 11/28/2017 12:12 AM, Michal Hocko wrote: > On Mon 27-11-17 15:26:27, John Hubbard wrote: > [...] >> Let me add a belated report, then: we ran into this limit while implementing >> an early version of Unified Memory[1], back in 2013. The implementation >> at the time depended on tracking that assumed "one allocation == one vma". > > And you tried hard to make those VMAs really separate? E.g. with > prot_none gaps? We didn't do that, and in fact I'm probably failing to grasp the underlying design idea that you have in mind there...hints welcome... What we did was to hook into the mmap callbacks in the kernel driver, after userspace mmap'd a region (via a custom allocator API). And we had an ioctl in there, to connect up other allocation attributes that couldn't be passed through via mmap. Again, this was for regions of memory that were to be migrated between CPU and device (GPU). > >> So, with only 64K vmas, we quickly ran out, and changed the design to work >> around that. (And later, the design was *completely* changed to use a >> separate >> tracking system altogether). exag >> >> The existing limit seems rather too low, at least from my perspective. Maybe >> it would be better, if expressed as a function of RAM size? > > Dunno. Whenever we tried to do RAM scaling it turned out a bad idea > after years when memory grown much more than the code author expected. > Just look how we scaled hash table sizes... But maybe you can come up > with something clever. In any case tuning this from the userspace is a > trivial thing to do and I am somehow skeptical that any early boot code > would trip over the limit. > I agree that this is not a limit that boot code is likely to hit. And maybe tuning from userspace really is the right approach here, considering that there is a real cost to going too large. Just philosophically here, hard limits like this seem a little awkward if they are set once in, say, 1999 (gross exaggeration here, for effect) and then not updated to stay with the times, right? In other words, one should not routinely need to tune most things. That's why I was wondering if something crude and silly would work, such as just a ratio of RAM to vma count. (I'm more just trying to understand the "rules" here, than to debate--I don't have a strong opinion on this.) The fact that this apparently failed with hash tables is interesting, I'd love to read more if you have any notes or links. I spotted a 2014 LWN article ( https://lwn.net/Articles/612100 ) about hash table resizing, and some commits that fixed resizing bugs, such as 12311959ecf8a ("rhashtable: fix shift by 64 when shrinking") ...was it just a storm of bugs that showed up? thanks, John Hubbard
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon 27-11-17 15:26:27, John Hubbard wrote: [...] > Let me add a belated report, then: we ran into this limit while implementing > an early version of Unified Memory[1], back in 2013. The implementation > at the time depended on tracking that assumed "one allocation == one vma". And you tried hard to make those VMAs really separate? E.g. with prot_none gaps? > So, with only 64K vmas, we quickly ran out, and changed the design to work > around that. (And later, the design was *completely* changed to use a separate > tracking system altogether). > > The existing limit seems rather too low, at least from my perspective. Maybe > it would be better, if expressed as a function of RAM size? Dunno. Whenever we tried to do RAM scaling it turned out a bad idea after years when memory grown much more than the code author expected. Just look how we scaled hash table sizes... But maybe you can come up with something clever. In any case tuning this from the userspace is a trivial thing to do and I am somehow skeptical that any early boot code would trip over the limit. -- Michal Hocko SUSE Labs
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon 27-11-17 15:26:27, John Hubbard wrote: [...] > Let me add a belated report, then: we ran into this limit while implementing > an early version of Unified Memory[1], back in 2013. The implementation > at the time depended on tracking that assumed "one allocation == one vma". And you tried hard to make those VMAs really separate? E.g. with prot_none gaps? > So, with only 64K vmas, we quickly ran out, and changed the design to work > around that. (And later, the design was *completely* changed to use a separate > tracking system altogether). > > The existing limit seems rather too low, at least from my perspective. Maybe > it would be better, if expressed as a function of RAM size? Dunno. Whenever we tried to do RAM scaling it turned out a bad idea after years when memory grown much more than the code author expected. Just look how we scaled hash table sizes... But maybe you can come up with something clever. In any case tuning this from the userspace is a trivial thing to do and I am somehow skeptical that any early boot code would trip over the limit. -- Michal Hocko SUSE Labs
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On 11/27/2017 11:52 AM, Michal Hocko wrote: > On Mon 27-11-17 20:18:00, Mikael Pettersson wrote: >> On Mon, Nov 27, 2017 at 11:12 AM, Michal Hockowrote: I've kept the kernel tunable to not break the API towards user-space, but it's a no-op now. Also the distinction between split_vma() and __split_vma() disappears, so they are merged. >>> >>> Could you be more explicit about _why_ we need to remove this tunable? >>> I am not saying I disagree, the removal simplifies the code but I do not >>> really see any justification here. >> >> In principle you don't "need" to, as those that know about it can bump it >> to some insanely high value and get on with life. Meanwhile those that don't >> (and I was one of them until fairly recently, and I'm no newcomer to Unix or >> Linux) get to scratch their heads and wonder why the kernel says ENOMEM >> when one has loads of free RAM. > > I agree that our error reporting is more than suboptimal in this regard. > These are all historical mistakes and we have much more of those. The > thing is that we have means to debug these issues (check > /proc//maps e.g.). > >> But what _is_ the justification for having this arbitrary limit? >> There might have been historical reasons, but at least ELF core dumps >> are no longer a problem. > > Andi has already mentioned the the resource consumption. You can create > a lot of unreclaimable memory and there should be some cap. Whether our > default is good is questionable. Whether we can remove it altogether is > a different thing. > > As I've said I am not a great fan of the limit but "I've just notice it > breaks on me" doesn't sound like a very good justification. You still > have an option to increase it. Considering we do not have too many > reports suggests that this is not such a big deal for most users. > Let me add a belated report, then: we ran into this limit while implementing an early version of Unified Memory[1], back in 2013. The implementation at the time depended on tracking that assumed "one allocation == one vma". So, with only 64K vmas, we quickly ran out, and changed the design to work around that. (And later, the design was *completely* changed to use a separate tracking system altogether). The existing limit seems rather too low, at least from my perspective. Maybe it would be better, if expressed as a function of RAM size? [1] https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/ This is a way to automatically (via page faulting) migrate memory between CPUs and devices (GPUs, here). This is before HMM, of course. thanks, John Hubbard
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On 11/27/2017 11:52 AM, Michal Hocko wrote: > On Mon 27-11-17 20:18:00, Mikael Pettersson wrote: >> On Mon, Nov 27, 2017 at 11:12 AM, Michal Hocko wrote: I've kept the kernel tunable to not break the API towards user-space, but it's a no-op now. Also the distinction between split_vma() and __split_vma() disappears, so they are merged. >>> >>> Could you be more explicit about _why_ we need to remove this tunable? >>> I am not saying I disagree, the removal simplifies the code but I do not >>> really see any justification here. >> >> In principle you don't "need" to, as those that know about it can bump it >> to some insanely high value and get on with life. Meanwhile those that don't >> (and I was one of them until fairly recently, and I'm no newcomer to Unix or >> Linux) get to scratch their heads and wonder why the kernel says ENOMEM >> when one has loads of free RAM. > > I agree that our error reporting is more than suboptimal in this regard. > These are all historical mistakes and we have much more of those. The > thing is that we have means to debug these issues (check > /proc//maps e.g.). > >> But what _is_ the justification for having this arbitrary limit? >> There might have been historical reasons, but at least ELF core dumps >> are no longer a problem. > > Andi has already mentioned the the resource consumption. You can create > a lot of unreclaimable memory and there should be some cap. Whether our > default is good is questionable. Whether we can remove it altogether is > a different thing. > > As I've said I am not a great fan of the limit but "I've just notice it > breaks on me" doesn't sound like a very good justification. You still > have an option to increase it. Considering we do not have too many > reports suggests that this is not such a big deal for most users. > Let me add a belated report, then: we ran into this limit while implementing an early version of Unified Memory[1], back in 2013. The implementation at the time depended on tracking that assumed "one allocation == one vma". So, with only 64K vmas, we quickly ran out, and changed the design to work around that. (And later, the design was *completely* changed to use a separate tracking system altogether). The existing limit seems rather too low, at least from my perspective. Maybe it would be better, if expressed as a function of RAM size? [1] https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/ This is a way to automatically (via page faulting) migrate memory between CPUs and devices (GPUs, here). This is before HMM, of course. thanks, John Hubbard
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon 27-11-17 12:21:21, Andi Kleen wrote: > On Mon, Nov 27, 2017 at 08:57:32PM +0100, Michal Hocko wrote: > > On Mon 27-11-17 19:32:18, Michal Hocko wrote: > > > On Mon 27-11-17 09:25:16, Andi Kleen wrote: > > [...] > > > > The reason the limit was there originally because it allows a DoS > > > > attack against the kernel by filling all unswappable memory up with > > > > VMAs. > > > > > > We can reduce the effect by accounting vmas to memory cgroups. > > > > As it turned out we already do. > > vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT); > > That only helps if you have memory cgroups enabled. It would be a regression > to break the accounting on all the systems that don't. I agree. And I didn't say we should remove the existing limit. I am just saying that we can reduce existing problems by increasing the limit and relying on memcg accounting where possible. -- Michal Hocko SUSE Labs
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon 27-11-17 12:21:21, Andi Kleen wrote: > On Mon, Nov 27, 2017 at 08:57:32PM +0100, Michal Hocko wrote: > > On Mon 27-11-17 19:32:18, Michal Hocko wrote: > > > On Mon 27-11-17 09:25:16, Andi Kleen wrote: > > [...] > > > > The reason the limit was there originally because it allows a DoS > > > > attack against the kernel by filling all unswappable memory up with > > > > VMAs. > > > > > > We can reduce the effect by accounting vmas to memory cgroups. > > > > As it turned out we already do. > > vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT); > > That only helps if you have memory cgroups enabled. It would be a regression > to break the accounting on all the systems that don't. I agree. And I didn't say we should remove the existing limit. I am just saying that we can reduce existing problems by increasing the limit and relying on memcg accounting where possible. -- Michal Hocko SUSE Labs
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon, Nov 27, 2017 at 08:57:32PM +0100, Michal Hocko wrote: > On Mon 27-11-17 19:32:18, Michal Hocko wrote: > > On Mon 27-11-17 09:25:16, Andi Kleen wrote: > [...] > > > The reason the limit was there originally because it allows a DoS > > > attack against the kernel by filling all unswappable memory up with VMAs. > > > > We can reduce the effect by accounting vmas to memory cgroups. > > As it turned out we already do. > vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT); That only helps if you have memory cgroups enabled. It would be a regression to break the accounting on all the systems that don't. -Andi
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon, Nov 27, 2017 at 08:57:32PM +0100, Michal Hocko wrote: > On Mon 27-11-17 19:32:18, Michal Hocko wrote: > > On Mon 27-11-17 09:25:16, Andi Kleen wrote: > [...] > > > The reason the limit was there originally because it allows a DoS > > > attack against the kernel by filling all unswappable memory up with VMAs. > > > > We can reduce the effect by accounting vmas to memory cgroups. > > As it turned out we already do. > vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT); That only helps if you have memory cgroups enabled. It would be a regression to break the accounting on all the systems that don't. -Andi
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon 27-11-17 19:32:18, Michal Hocko wrote: > On Mon 27-11-17 09:25:16, Andi Kleen wrote: [...] > > The reason the limit was there originally because it allows a DoS > > attack against the kernel by filling all unswappable memory up with VMAs. > > We can reduce the effect by accounting vmas to memory cgroups. As it turned out we already do. vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT); -- Michal Hocko SUSE Labs
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon 27-11-17 19:32:18, Michal Hocko wrote: > On Mon 27-11-17 09:25:16, Andi Kleen wrote: [...] > > The reason the limit was there originally because it allows a DoS > > attack against the kernel by filling all unswappable memory up with VMAs. > > We can reduce the effect by accounting vmas to memory cgroups. As it turned out we already do. vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT); -- Michal Hocko SUSE Labs
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon 27-11-17 20:18:00, Mikael Pettersson wrote: > On Mon, Nov 27, 2017 at 11:12 AM, Michal Hockowrote: > > > I've kept the kernel tunable to not break the API towards user-space, > > > but it's a no-op now. Also the distinction between split_vma() and > > > __split_vma() disappears, so they are merged. > > > > Could you be more explicit about _why_ we need to remove this tunable? > > I am not saying I disagree, the removal simplifies the code but I do not > > really see any justification here. > > In principle you don't "need" to, as those that know about it can bump it > to some insanely high value and get on with life. Meanwhile those that don't > (and I was one of them until fairly recently, and I'm no newcomer to Unix or > Linux) get to scratch their heads and wonder why the kernel says ENOMEM > when one has loads of free RAM. I agree that our error reporting is more than suboptimal in this regard. These are all historical mistakes and we have much more of those. The thing is that we have means to debug these issues (check /proc//maps e.g.). > But what _is_ the justification for having this arbitrary limit? > There might have been historical reasons, but at least ELF core dumps > are no longer a problem. Andi has already mentioned the the resource consumption. You can create a lot of unreclaimable memory and there should be some cap. Whether our default is good is questionable. Whether we can remove it altogether is a different thing. As I've said I am not a great fan of the limit but "I've just notice it breaks on me" doesn't sound like a very good justification. You still have an option to increase it. Considering we do not have too many reports suggests that this is not such a big deal for most users. -- Michal Hocko SUSE Labs
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon 27-11-17 20:18:00, Mikael Pettersson wrote: > On Mon, Nov 27, 2017 at 11:12 AM, Michal Hocko wrote: > > > I've kept the kernel tunable to not break the API towards user-space, > > > but it's a no-op now. Also the distinction between split_vma() and > > > __split_vma() disappears, so they are merged. > > > > Could you be more explicit about _why_ we need to remove this tunable? > > I am not saying I disagree, the removal simplifies the code but I do not > > really see any justification here. > > In principle you don't "need" to, as those that know about it can bump it > to some insanely high value and get on with life. Meanwhile those that don't > (and I was one of them until fairly recently, and I'm no newcomer to Unix or > Linux) get to scratch their heads and wonder why the kernel says ENOMEM > when one has loads of free RAM. I agree that our error reporting is more than suboptimal in this regard. These are all historical mistakes and we have much more of those. The thing is that we have means to debug these issues (check /proc//maps e.g.). > But what _is_ the justification for having this arbitrary limit? > There might have been historical reasons, but at least ELF core dumps > are no longer a problem. Andi has already mentioned the the resource consumption. You can create a lot of unreclaimable memory and there should be some cap. Whether our default is good is questionable. Whether we can remove it altogether is a different thing. As I've said I am not a great fan of the limit but "I've just notice it breaks on me" doesn't sound like a very good justification. You still have an option to increase it. Considering we do not have too many reports suggests that this is not such a big deal for most users. -- Michal Hocko SUSE Labs
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
[resent because I can't type] > vm.max_map_count I always thought it is some kind of algorithmic complexity limiter and kernel memory limiter. VMAs are under SLAB_ACCOUNT nowadays but ->mmap list stays: $ chgrep -e 'for.*vma = vma->vm_next' | wc -l 41 In particular readdir on /proc/*/map_files . I'm saying you can not simply remove this sysctl.
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
[resent because I can't type] > vm.max_map_count I always thought it is some kind of algorithmic complexity limiter and kernel memory limiter. VMAs are under SLAB_ACCOUNT nowadays but ->mmap list stays: $ chgrep -e 'for.*vma = vma->vm_next' | wc -l 41 In particular readdir on /proc/*/map_files . I'm saying you can not simply remove this sysctl.
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
> vm.max_map_count I always thought it is some kind of algorithmic complexity limiter and kernel memory limiter. VMAs are under SLAB_ACCOUNT nowadays but ->mmap list stays: $ chgrep -e 'for.*vma = vma->vm_next' | wc -l 41 In particular readdir on /proc/*/map_files . I'm saying you can not simply remove this sysctl.
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
> vm.max_map_count I always thought it is some kind of algorithmic complexity limiter and kernel memory limiter. VMAs are under SLAB_ACCOUNT nowadays but ->mmap list stays: $ chgrep -e 'for.*vma = vma->vm_next' | wc -l 41 In particular readdir on /proc/*/map_files . I'm saying you can not simply remove this sysctl.
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon, Nov 27, 2017 at 6:25 PM, Andi Kleenwrote: > It's an arbitrary scaling limit on the how many mappings the process > has. The more memory you have the bigger a problem it is. We've > ran into this problem too on larger systems. > > The reason the limit was there originally because it allows a DoS > attack against the kernel by filling all unswappable memory up with VMAs. > > The old limit was designed for much smaller systems than we have > today. > > There needs to be some limit, but it should be on the number of memory > pinned by the VMAs, and needs to scale with the available memory, > so that large systems are not penalized. Fully agreed. One problem with the current limit is that number of VMAs is only weakly related to the amount of memory one has mapped, and is also prone to grow due to memory fragmentation. I've seen processes differ by 3X number of VMAs, even though they ran the same code and had similar memory sizes; they only differed on how long they had been running and which servers they ran on (and how long those had been up). > Unfortunately just making it part of the existing mlock limit could > break some existing setups which max out the mlock limit with something > else. Maybe we need a new rlimit for this? > > -Andi
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon, Nov 27, 2017 at 6:25 PM, Andi Kleen wrote: > It's an arbitrary scaling limit on the how many mappings the process > has. The more memory you have the bigger a problem it is. We've > ran into this problem too on larger systems. > > The reason the limit was there originally because it allows a DoS > attack against the kernel by filling all unswappable memory up with VMAs. > > The old limit was designed for much smaller systems than we have > today. > > There needs to be some limit, but it should be on the number of memory > pinned by the VMAs, and needs to scale with the available memory, > so that large systems are not penalized. Fully agreed. One problem with the current limit is that number of VMAs is only weakly related to the amount of memory one has mapped, and is also prone to grow due to memory fragmentation. I've seen processes differ by 3X number of VMAs, even though they ran the same code and had similar memory sizes; they only differed on how long they had been running and which servers they ran on (and how long those had been up). > Unfortunately just making it part of the existing mlock limit could > break some existing setups which max out the mlock limit with something > else. Maybe we need a new rlimit for this? > > -Andi
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon, Nov 27, 2017 at 5:22 PM, Matthew Wilcoxwrote: >> Could you be more explicit about _why_ we need to remove this tunable? >> I am not saying I disagree, the removal simplifies the code but I do not >> really see any justification here. > > I imagine he started seeing random syscalls failing with ENOMEM and > eventually tracked it down to this stupid limit we used to need. Exactly, except the origin (mmap() failing) was hidden behind layers upon layers of user-space memory management code (not ours), which just said "failed to allocate N bytes" (with N about 0.001% of the free RAM). And it wasn't reproducible.
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon, Nov 27, 2017 at 5:22 PM, Matthew Wilcox wrote: >> Could you be more explicit about _why_ we need to remove this tunable? >> I am not saying I disagree, the removal simplifies the code but I do not >> really see any justification here. > > I imagine he started seeing random syscalls failing with ENOMEM and > eventually tracked it down to this stupid limit we used to need. Exactly, except the origin (mmap() failing) was hidden behind layers upon layers of user-space memory management code (not ours), which just said "failed to allocate N bytes" (with N about 0.001% of the free RAM). And it wasn't reproducible.
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon, Nov 27, 2017 at 11:12 AM, Michal Hockowrote: > > I've kept the kernel tunable to not break the API towards user-space, > > but it's a no-op now. Also the distinction between split_vma() and > > __split_vma() disappears, so they are merged. > > Could you be more explicit about _why_ we need to remove this tunable? > I am not saying I disagree, the removal simplifies the code but I do not > really see any justification here. In principle you don't "need" to, as those that know about it can bump it to some insanely high value and get on with life. Meanwhile those that don't (and I was one of them until fairly recently, and I'm no newcomer to Unix or Linux) get to scratch their heads and wonder why the kernel says ENOMEM when one has loads of free RAM. But what _is_ the justification for having this arbitrary limit? There might have been historical reasons, but at least ELF core dumps are no longer a problem. /Mikael
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon, Nov 27, 2017 at 11:12 AM, Michal Hocko wrote: > > I've kept the kernel tunable to not break the API towards user-space, > > but it's a no-op now. Also the distinction between split_vma() and > > __split_vma() disappears, so they are merged. > > Could you be more explicit about _why_ we need to remove this tunable? > I am not saying I disagree, the removal simplifies the code but I do not > really see any justification here. In principle you don't "need" to, as those that know about it can bump it to some insanely high value and get on with life. Meanwhile those that don't (and I was one of them until fairly recently, and I'm no newcomer to Unix or Linux) get to scratch their heads and wonder why the kernel says ENOMEM when one has loads of free RAM. But what _is_ the justification for having this arbitrary limit? There might have been historical reasons, but at least ELF core dumps are no longer a problem. /Mikael
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon 27-11-17 09:25:16, Andi Kleen wrote: > Michal Hockowrites: > > > > Could you be more explicit about _why_ we need to remove this tunable? > > I am not saying I disagree, the removal simplifies the code but I do not > > really see any justification here. > > It's an arbitrary scaling limit on the how many mappings the process > has. The more memory you have the bigger a problem it is. We've > ran into this problem too on larger systems. Why cannot you increase the limit? > The reason the limit was there originally because it allows a DoS > attack against the kernel by filling all unswappable memory up with VMAs. We can reduce the effect by accounting vmas to memory cgroups. -- Michal Hocko SUSE Labs
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon 27-11-17 09:25:16, Andi Kleen wrote: > Michal Hocko writes: > > > > Could you be more explicit about _why_ we need to remove this tunable? > > I am not saying I disagree, the removal simplifies the code but I do not > > really see any justification here. > > It's an arbitrary scaling limit on the how many mappings the process > has. The more memory you have the bigger a problem it is. We've > ran into this problem too on larger systems. Why cannot you increase the limit? > The reason the limit was there originally because it allows a DoS > attack against the kernel by filling all unswappable memory up with VMAs. We can reduce the effect by accounting vmas to memory cgroups. -- Michal Hocko SUSE Labs
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
Michal Hockowrites: > > Could you be more explicit about _why_ we need to remove this tunable? > I am not saying I disagree, the removal simplifies the code but I do not > really see any justification here. It's an arbitrary scaling limit on the how many mappings the process has. The more memory you have the bigger a problem it is. We've ran into this problem too on larger systems. The reason the limit was there originally because it allows a DoS attack against the kernel by filling all unswappable memory up with VMAs. The old limit was designed for much smaller systems than we have today. There needs to be some limit, but it should be on the number of memory pinned by the VMAs, and needs to scale with the available memory, so that large systems are not penalized. Unfortunately just making it part of the existing mlock limit could break some existing setups which max out the mlock limit with something else. Maybe we need a new rlimit for this? -Andi
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
Michal Hocko writes: > > Could you be more explicit about _why_ we need to remove this tunable? > I am not saying I disagree, the removal simplifies the code but I do not > really see any justification here. It's an arbitrary scaling limit on the how many mappings the process has. The more memory you have the bigger a problem it is. We've ran into this problem too on larger systems. The reason the limit was there originally because it allows a DoS attack against the kernel by filling all unswappable memory up with VMAs. The old limit was designed for much smaller systems than we have today. There needs to be some limit, but it should be on the number of memory pinned by the VMAs, and needs to scale with the available memory, so that large systems are not penalized. Unfortunately just making it part of the existing mlock limit could break some existing setups which max out the mlock limit with something else. Maybe we need a new rlimit for this? -Andi
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon, Nov 27, 2017 at 11:12:32AM +0100, Michal Hocko wrote: > On Sun 26-11-17 17:09:32, Mikael Pettersson wrote: > > - Reaching the limit causes various memory management system calls to > > fail with ENOMEM, which is a lie. Combined with the unpredictability > > of the number of mappings in a process, especially when non-trivial > > memory management or heavy file mapping is used, it can be difficult > > to reproduce these events and debug them. It's also confusing to get > > ENOMEM when you know you have lots of free RAM. [snip] > Could you be more explicit about _why_ we need to remove this tunable? > I am not saying I disagree, the removal simplifies the code but I do not > really see any justification here. I imagine he started seeing random syscalls failing with ENOMEM and eventually tracked it down to this stupid limit we used to need.
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Mon, Nov 27, 2017 at 11:12:32AM +0100, Michal Hocko wrote: > On Sun 26-11-17 17:09:32, Mikael Pettersson wrote: > > - Reaching the limit causes various memory management system calls to > > fail with ENOMEM, which is a lie. Combined with the unpredictability > > of the number of mappings in a process, especially when non-trivial > > memory management or heavy file mapping is used, it can be difficult > > to reproduce these events and debug them. It's also confusing to get > > ENOMEM when you know you have lots of free RAM. [snip] > Could you be more explicit about _why_ we need to remove this tunable? > I am not saying I disagree, the removal simplifies the code but I do not > really see any justification here. I imagine he started seeing random syscalls failing with ENOMEM and eventually tracked it down to this stupid limit we used to need.
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Sun 26-11-17 17:09:32, Mikael Pettersson wrote: > The `vm.max_map_count' sysctl limit is IMO useless and confusing, so > this patch disables it. > > - Old ELF had a limit of 64K segments, making core dumps from processes > with more mappings than that problematic, but that was fixed in March > 2010 ("elf coredump: add extended numbering support"). > > - There are no internal data structures sized by this limit, making it > entirely artificial. each mapping has its vma structure and that in turn can be tracked by other data structures so this is not entirely true. > - When viewed as a limit on memory consumption, it is ineffective since > the number of mappings does not correspond directly to the amount of > memory consumed, since each mapping is variable-length. > > - Reaching the limit causes various memory management system calls to > fail with ENOMEM, which is a lie. Combined with the unpredictability > of the number of mappings in a process, especially when non-trivial > memory management or heavy file mapping is used, it can be difficult > to reproduce these events and debug them. It's also confusing to get > ENOMEM when you know you have lots of free RAM. > > This limit was apparently introduced in the 2.1.80 kernel (first as a > compile-time constant, later changed to a sysctl), but I haven't been > able to find any description for it in Git or the LKML archives, so I > don't know what the original motivation was. > > I've kept the kernel tunable to not break the API towards user-space, > but it's a no-op now. Also the distinction between split_vma() and > __split_vma() disappears, so they are merged. Could you be more explicit about _why_ we need to remove this tunable? I am not saying I disagree, the removal simplifies the code but I do not really see any justification here. > Tested on x86_64 with Fedora 26 user-space. Also built an ARM NOMMU > kernel to make sure NOMMU compiles and links cleanly. > > Signed-off-by: Mikael Pettersson> --- > Documentation/sysctl/vm.txt | 17 +- > Documentation/vm/ksm.txt | 4 > Documentation/vm/remap_file_pages.txt | 4 > fs/binfmt_elf.c | 4 > include/linux/mm.h| 23 --- > kernel/sysctl.c | 3 +++ > mm/madvise.c | 12 ++ > mm/mmap.c | 42 > ++- > mm/mremap.c | 7 -- > mm/nommu.c| 3 --- > mm/util.c | 1 - > 11 files changed, 13 insertions(+), 107 deletions(-) > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > index b920423f88cb..0fcb511d07e6 100644 > --- a/Documentation/sysctl/vm.txt > +++ b/Documentation/sysctl/vm.txt > @@ -35,7 +35,7 @@ Currently, these files are in /proc/sys/vm: > - laptop_mode > - legacy_va_layout > - lowmem_reserve_ratio > -- max_map_count > +- max_map_count (unused, kept for backwards compatibility) > - memory_failure_early_kill > - memory_failure_recovery > - min_free_kbytes > @@ -400,21 +400,6 @@ The minimum value is 1 (1/1 -> 100%). > > == > > -max_map_count: > - > -This file contains the maximum number of memory map areas a process > -may have. Memory map areas are used as a side-effect of calling > -malloc, directly by mmap, mprotect, and madvise, and also when loading > -shared libraries. > - > -While most applications need less than a thousand maps, certain > -programs, particularly malloc debuggers, may consume lots of them, > -e.g., up to one or two maps per allocation. > - > -The default value is 65536. > - > -= > - > memory_failure_early_kill: > > Control how to kill processes when uncorrected memory error (typically > diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt > index 6686bd267dc9..4a917f88cb11 100644 > --- a/Documentation/vm/ksm.txt > +++ b/Documentation/vm/ksm.txt > @@ -38,10 +38,6 @@ the range for whenever the KSM daemon is started; even if > the range > cannot contain any pages which KSM could actually merge; even if > MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE. > > -If a region of memory must be split into at least one new MADV_MERGEABLE > -or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process > -will exceed vm.max_map_count (see Documentation/sysctl/vm.txt). > - > Like other madvise calls, they are intended for use on mapped areas of > the user address space: they will report ENOMEM if the specified range > includes unmapped gaps (though working on the intervening mapped areas), > diff --git a/Documentation/vm/remap_file_pages.txt > b/Documentation/vm/remap_file_pages.txt > index f609142f406a..85985a89f05d 100644 >
Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit
On Sun 26-11-17 17:09:32, Mikael Pettersson wrote: > The `vm.max_map_count' sysctl limit is IMO useless and confusing, so > this patch disables it. > > - Old ELF had a limit of 64K segments, making core dumps from processes > with more mappings than that problematic, but that was fixed in March > 2010 ("elf coredump: add extended numbering support"). > > - There are no internal data structures sized by this limit, making it > entirely artificial. each mapping has its vma structure and that in turn can be tracked by other data structures so this is not entirely true. > - When viewed as a limit on memory consumption, it is ineffective since > the number of mappings does not correspond directly to the amount of > memory consumed, since each mapping is variable-length. > > - Reaching the limit causes various memory management system calls to > fail with ENOMEM, which is a lie. Combined with the unpredictability > of the number of mappings in a process, especially when non-trivial > memory management or heavy file mapping is used, it can be difficult > to reproduce these events and debug them. It's also confusing to get > ENOMEM when you know you have lots of free RAM. > > This limit was apparently introduced in the 2.1.80 kernel (first as a > compile-time constant, later changed to a sysctl), but I haven't been > able to find any description for it in Git or the LKML archives, so I > don't know what the original motivation was. > > I've kept the kernel tunable to not break the API towards user-space, > but it's a no-op now. Also the distinction between split_vma() and > __split_vma() disappears, so they are merged. Could you be more explicit about _why_ we need to remove this tunable? I am not saying I disagree, the removal simplifies the code but I do not really see any justification here. > Tested on x86_64 with Fedora 26 user-space. Also built an ARM NOMMU > kernel to make sure NOMMU compiles and links cleanly. > > Signed-off-by: Mikael Pettersson > --- > Documentation/sysctl/vm.txt | 17 +- > Documentation/vm/ksm.txt | 4 > Documentation/vm/remap_file_pages.txt | 4 > fs/binfmt_elf.c | 4 > include/linux/mm.h| 23 --- > kernel/sysctl.c | 3 +++ > mm/madvise.c | 12 ++ > mm/mmap.c | 42 > ++- > mm/mremap.c | 7 -- > mm/nommu.c| 3 --- > mm/util.c | 1 - > 11 files changed, 13 insertions(+), 107 deletions(-) > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > index b920423f88cb..0fcb511d07e6 100644 > --- a/Documentation/sysctl/vm.txt > +++ b/Documentation/sysctl/vm.txt > @@ -35,7 +35,7 @@ Currently, these files are in /proc/sys/vm: > - laptop_mode > - legacy_va_layout > - lowmem_reserve_ratio > -- max_map_count > +- max_map_count (unused, kept for backwards compatibility) > - memory_failure_early_kill > - memory_failure_recovery > - min_free_kbytes > @@ -400,21 +400,6 @@ The minimum value is 1 (1/1 -> 100%). > > == > > -max_map_count: > - > -This file contains the maximum number of memory map areas a process > -may have. Memory map areas are used as a side-effect of calling > -malloc, directly by mmap, mprotect, and madvise, and also when loading > -shared libraries. > - > -While most applications need less than a thousand maps, certain > -programs, particularly malloc debuggers, may consume lots of them, > -e.g., up to one or two maps per allocation. > - > -The default value is 65536. > - > -= > - > memory_failure_early_kill: > > Control how to kill processes when uncorrected memory error (typically > diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt > index 6686bd267dc9..4a917f88cb11 100644 > --- a/Documentation/vm/ksm.txt > +++ b/Documentation/vm/ksm.txt > @@ -38,10 +38,6 @@ the range for whenever the KSM daemon is started; even if > the range > cannot contain any pages which KSM could actually merge; even if > MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE. > > -If a region of memory must be split into at least one new MADV_MERGEABLE > -or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process > -will exceed vm.max_map_count (see Documentation/sysctl/vm.txt). > - > Like other madvise calls, they are intended for use on mapped areas of > the user address space: they will report ENOMEM if the specified range > includes unmapped gaps (though working on the intervening mapped areas), > diff --git a/Documentation/vm/remap_file_pages.txt > b/Documentation/vm/remap_file_pages.txt > index f609142f406a..85985a89f05d 100644 > ---