Re: [PATCH 0/6 v3] kvmalloc
Is there anything more to be done before this can get merged? I would really like to target this to the next merge window. I already have some more changes which depend on this.
--
Michal Hocko
SUSE Labs
Re: [PATCH 0/6 v3] kvmalloc
On 01/30/2017 05:28 PM, Michal Hocko wrote:
> On Mon 30-01-17 17:15:08, Daniel Borkmann wrote:
> > On 01/30/2017 08:56 AM, Michal Hocko wrote:
> > > On Fri 27-01-17 21:12:26, Daniel Borkmann wrote:
> > > > On 01/27/2017 11:05 AM, Michal Hocko wrote:
> > > > > On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
> > > [...]
> > > > > > So to answer your second email with the bpf and netfilter hunks, why
> > > > > > not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
> > > > > > big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> > > > > > is not harmful though has only /partial/ effect right now and that full
> > > > > > support needs to be implemented in future. That would still be better
> > > > > > than not having it, imo, and the FIXME would make expectations clear
> > > > > > to anyone reading that code.
> > > > >
> > > > > Well, we can do that, I just would like to prevent from this (ab)use
> > > > > if there is no _real_ and _sensible_ usecase for it. Having a real bug
> > > >
> > > > Understandable.
> > > >
> > > > > report or a fallback mechanism you are mentioning above would justify
> > > > > the (ab)use IMHO. But that abuse would be documented properly and have a
> > > > > real reason to exist. That sounds like a better approach to me.
> > > > >
> > > > > But if you absolutely _insist_ I can change that.
> > > >
> > > > Yeah, please do (with a big FIXME comment as mentioned), this originally
> > > > came from a real bug report. Anyway, feel free to add my Acked-by then.
> > >
> > > Thanks! I will repost the whole series today.
> >
> > Looks like I got only Cc'ed on the cover letter of your v3 from today
> > (should have been v4 actually?).
>
> Yes
>
> > Anyway, I looked up the last patch
> > on lkml [1] and it seems you forgot the __GFP_NORETRY we talked about?
>
> I misread your response. I thought you were OK with the FIXME
> explanation.
>
> > At least that was what was discussed above (insisting on __GFP_NORETRY
> > plus FIXME comment) for providing my Acked-by then. Can you still fix
> > that up in a final respin?
>
> I will probably just drop that last patch instead. I am not convinced
> that we should bend the new API over and let people mimic that
> throughout the code. I have just seen too many examples of this pattern
> already.
>
> I would also like to prevent the next rebase, unless there are any
> issues with some patches of course.

Ok, I'm fine with that as well.

Thanks,
Daniel
Re: [PATCH 0/6 v3] kvmalloc
On 01/30/2017 08:56 AM, Michal Hocko wrote:
> On Fri 27-01-17 21:12:26, Daniel Borkmann wrote:
> > On 01/27/2017 11:05 AM, Michal Hocko wrote:
> > > On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
> [...]
> > > > So to answer your second email with the bpf and netfilter hunks, why
> > > > not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
> > > > big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> > > > is not harmful though has only /partial/ effect right now and that full
> > > > support needs to be implemented in future. That would still be better
> > > > than not having it, imo, and the FIXME would make expectations clear
> > > > to anyone reading that code.
> > >
> > > Well, we can do that, I just would like to prevent from this (ab)use
> > > if there is no _real_ and _sensible_ usecase for it. Having a real bug
> >
> > Understandable.
> >
> > > report or a fallback mechanism you are mentioning above would justify
> > > the (ab)use IMHO. But that abuse would be documented properly and have a
> > > real reason to exist. That sounds like a better approach to me.
> > >
> > > But if you absolutely _insist_ I can change that.
> >
> > Yeah, please do (with a big FIXME comment as mentioned), this originally
> > came from a real bug report. Anyway, feel free to add my Acked-by then.
>
> Thanks! I will repost the whole series today.

Looks like I got only Cc'ed on the cover letter of your v3 from today
(should have been v4 actually?). Anyway, I looked up the last patch
on lkml [1] and it seems you forgot the __GFP_NORETRY we talked about?

At least that was what was discussed above (insisting on __GFP_NORETRY
plus FIXME comment) for providing my Acked-by then. Can you still fix
that up in a final respin?

Thanks again,
Daniel

[1] https://lkml.org/lkml/2017/1/30/129
Re: [PATCH 0/6 v3] kvmalloc
On Mon 30-01-17 17:15:08, Daniel Borkmann wrote:
> On 01/30/2017 08:56 AM, Michal Hocko wrote:
> > On Fri 27-01-17 21:12:26, Daniel Borkmann wrote:
> > > On 01/27/2017 11:05 AM, Michal Hocko wrote:
> > > > On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
> > [...]
> > > > > So to answer your second email with the bpf and netfilter hunks, why
> > > > > not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
> > > > > big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> > > > > is not harmful though has only /partial/ effect right now and that full
> > > > > support needs to be implemented in future. That would still be better
> > > > > than not having it, imo, and the FIXME would make expectations clear
> > > > > to anyone reading that code.
> > > >
> > > > Well, we can do that, I just would like to prevent from this (ab)use
> > > > if there is no _real_ and _sensible_ usecase for it. Having a real bug
> > >
> > > Understandable.
> > >
> > > > report or a fallback mechanism you are mentioning above would justify
> > > > the (ab)use IMHO. But that abuse would be documented properly and have a
> > > > real reason to exist. That sounds like a better approach to me.
> > > >
> > > > But if you absolutely _insist_ I can change that.
> > >
> > > Yeah, please do (with a big FIXME comment as mentioned), this originally
> > > came from a real bug report. Anyway, feel free to add my Acked-by then.
> >
> > Thanks! I will repost the whole series today.
>
> Looks like I got only Cc'ed on the cover letter of your v3 from today
> (should have been v4 actually?).

Yes

> Anyway, I looked up the last patch
> on lkml [1] and it seems you forgot the __GFP_NORETRY we talked about?

I misread your response. I thought you were OK with the FIXME
explanation.

> At least that was what was discussed above (insisting on __GFP_NORETRY
> plus FIXME comment) for providing my Acked-by then. Can you still fix
> that up in a final respin?

I will probably just drop that last patch instead. I am not convinced
that we should bend the new API over and let people mimic that
throughout the code. I have just seen too many examples of this pattern
already.

I would also like to prevent the next rebase, unless there are any
issues with some patches of course.
--
Michal Hocko
SUSE Labs
[PATCH 0/6 v3] kvmalloc
Hi,
this has been previously posted here [1] and it received quite some
feedback. As a result the number of patches has grown again. We are at
9 patches right now. I have rebased the series on top of the current
next-20170130. There were some changes since the last posting, namely
a7f6c1b63b86 ("AppArmor: Use GFP_KERNEL for __aa_kvmalloc().") which
dropped GFP_NOIO from __aa_kvmalloc and d407bd25a204 ("bpf: don't
trigger OOM killer under pressure with map alloc") which has created a
kvmalloc alternative for bpf code. Both have been changed to use the mm
kvmalloc but it is worth noting this dependency during the merge window.

I hope there are no further obstacles to have this merged into the mmotm
tree and go in in the next merge window.

Original cover:

There are many open coded kmalloc with vmalloc fallback instances in
the tree. Most of them are not careful enough or simply do not care
about the underlying semantic of the kmalloc/page allocator which means
that
a) some vmalloc fallbacks are basically unreachable because the kmalloc
   part will keep retrying until it succeeds
b) the page allocator can invoke really disruptive steps like the OOM
   killer to move forward which doesn't sound appropriate when we
   consider that the vmalloc fallback is available.

As it can be seen implementing kvmalloc requires quite an intimate
knowledge of the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers, I could find, have been converted to use the helper
instead. This is patch 5. There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well and Eric Dumazet
was not opposed [2] to convert them as well.

[1] http://lkml.kernel.org/r/20170112153717.28943-1-mho...@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.ca...@edumazet-glaptop3.roam.corp.google.com

Michal Hocko (9):
      mm: introduce kv[mz]alloc helpers
      mm: support __GFP_REPEAT in kvmalloc_node for >32kB
      rhashtable: simplify a strange allocation pattern
      ila: simplify a strange allocation pattern
      treewide: use kv[mz]alloc* rather than opencoded variants
      net: use kvmalloc with __GFP_REPEAT rather than open coded variant
      md: use kvmalloc rather than opencoded variant
      bcache: use kvmalloc
      net, bpf: use kvzalloc helper

 arch/s390/kvm/kvm-s390.c                           | 10 +---
 arch/x86/kvm/lapic.c                               | 4 +-
 arch/x86/kvm/page_track.c                          | 4 +-
 arch/x86/kvm/x86.c                                 | 4 +-
 crypto/lzo.c                                       | 4 +-
 drivers/acpi/apei/erst.c                           | 8 +--
 drivers/char/agp/generic.c                         | 8 +--
 drivers/gpu/drm/nouveau/nouveau_gem.c              | 4 +-
 drivers/md/bcache/super.c                          | 8 +--
 drivers/md/bcache/util.h                           | 12 +
 drivers/md/dm-ioctl.c                              | 13 ++---
 drivers/md/dm-stats.c                              | 7 +--
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h    | 3 --
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c | 29 ++-
 drivers/net/ethernet/chelsio/cxgb3/l2t.c           | 8 +--
 drivers/net/ethernet/chelsio/cxgb3/l2t.h           | 1 -
 drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c      | 12 ++---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h         | 3 --
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 10 ++--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c | 8 +--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c    | 31 ++--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c  | 13 +++--
 drivers/net/ethernet/chelsio/cxgb4/l2t.c           | 2 +-
 drivers/net/ethernet/chelsio/cxgb4/sched.c         | 12 ++---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c         | 9 ++--
 drivers/net/ethernet/mellanox/mlx4/mr.c            | 9 ++--
 drivers/nvdimm/dimm_devs.c                         | 5 +-
 .../staging/lustre/lnet/libcfs/linux/linux-mem.c   | 11 +
 drivers/vhost/net.c                                | 9 ++--
 drivers/vhost/vhost.c                              | 15 ++
 drivers/vhost/vsock.c                              | 9 ++--
 drivers/xen/evtchn.c                               | 14 +-
 fs/btrfs/ctree.c                                   | 9 ++--
 fs/btrfs/ioctl.c                                   | 9 ++--
 fs/btrfs/send.c                                    | 27 --
 fs/ceph/file.c                                     | 9 ++--
 fs/ext4/mballoc.c                                  | 2 +-
 fs/ext4/super.c                                    | 4 +-
 fs/f2fs/f2fs.h                                     | 20
 fs/f2fs/file.c                                     | 4 +-
 fs/f2fs/segment.c                                  | 14 +++---
 fs/select.c
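
The fallback logic described in the cover letter can be sketched in
userspace C. This is a model only and not the code from the series:
malloc() stands in for both kmalloc() and vmalloc(), and every model_* /
MODEL_* name, the flag values, the one-page threshold and the
contig_limit knob are assumptions made up for illustration. It tries to
show the two points (a) and (b) above: when a fallback exists, the
kmalloc attempt for larger sizes should give up early instead of
retrying or invoking disruptive reclaim, and the fallback must then
actually be reachable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096u

/* Hypothetical flag bits for this model; NOT the kernel's gfp_t values. */
enum model_gfp {
	MODEL_GFP_KERNEL  = 1u << 0,
	MODEL_GFP_NORETRY = 1u << 1,	/* give up early, no OOM killer */
	MODEL_GFP_NOWARN  = 1u << 2,	/* failure is expected, stay quiet */
};

/* Bookkeeping so a caller can observe which path was taken. */
struct model_stats {
	unsigned int kmalloc_calls;
	unsigned int vmalloc_calls;
	unsigned int last_kmalloc_flags;
};

static struct model_stats stats;

/* Pretend physically contiguous memory above this size is unavailable. */
static size_t contig_limit = 8 * MODEL_PAGE_SIZE;

/* Stand-in for kmalloc(): fails for "too large" contiguous requests. */
static void *model_kmalloc(size_t size, unsigned int flags)
{
	stats.kmalloc_calls++;
	stats.last_kmalloc_flags = flags;
	if (size > contig_limit)
		return NULL;
	return malloc(size);
}

/* Stand-in for vmalloc(): needs no contiguous memory in this model. */
static void *model_vmalloc(size_t size)
{
	stats.vmalloc_calls++;
	return malloc(size);
}

/* Try the contiguous allocator first, fall back for multi-page sizes. */
void *model_kvmalloc(size_t size, unsigned int flags)
{
	unsigned int kmalloc_flags = flags;
	void *p;

	assert(flags & MODEL_GFP_KERNEL);

	/* A fallback exists, so do not let the first attempt try hard. */
	if (size > MODEL_PAGE_SIZE)
		kmalloc_flags |= MODEL_GFP_NORETRY | MODEL_GFP_NOWARN;

	p = model_kmalloc(size, kmalloc_flags);

	/* Sub-page requests gain nothing from the fallback path. */
	if (p || size <= MODEL_PAGE_SIZE)
		return p;

	return model_vmalloc(size);
}

/* In this model both paths used malloc(), so one free() covers both. */
void model_kvfree(void *p)
{
	free(p);
}
```

A caller would pair model_kvmalloc(size, MODEL_GFP_KERNEL) with
model_kvfree() and never needs to know which of the two paths served
the request, which is the point of moving the pattern into one helper.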
Re: [PATCH 0/6 v3] kvmalloc
On Fri 27-01-17 21:12:26, Daniel Borkmann wrote:
> On 01/27/2017 11:05 AM, Michal Hocko wrote:
> > On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
[...]
> > > So to answer your second email with the bpf and netfilter hunks, why
> > > not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
> > > big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> > > is not harmful though has only /partial/ effect right now and that full
> > > support needs to be implemented in future. That would still be better
> > > than not having it, imo, and the FIXME would make expectations clear
> > > to anyone reading that code.
> >
> > Well, we can do that, I just would like to prevent from this (ab)use
> > if there is no _real_ and _sensible_ usecase for it. Having a real bug
>
> Understandable.
>
> > report or a fallback mechanism you are mentioning above would justify
> > the (ab)use IMHO. But that abuse would be documented properly and have a
> > real reason to exist. That sounds like a better approach to me.
> >
> > But if you absolutely _insist_ I can change that.
>
> Yeah, please do (with a big FIXME comment as mentioned), this originally
> came from a real bug report. Anyway, feel free to add my Acked-by then.

Thanks! I will repost the whole series today.
--
Michal Hocko
SUSE Labs
Re: [PATCH 0/6 v3] kvmalloc
On 01/27/2017 11:05 AM, Michal Hocko wrote:
> On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
> > On 01/26/2017 02:40 PM, Michal Hocko wrote:
> [...]
> > > But realistically, how big is this problem really? Is it really worth
> > > it? You said this is an admin only interface and admin can kill the
> > > machine by OOM and other means already.
> > >
> > > Moreover and I should probably mention it explicitly, your d407bd25a204b
> > > reduced the likelihood of oom for other reason. kmalloc used GFP_USER
> > > previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
> > > could indeed hit the OOM e.g. due to memory fragmentation. It would be
> > > much harder to hit the OOM killer from vmalloc which doesn't issue
> > > higher order allocation requests. Or have you ever seen the OOM killer
> > > pointing to the vmalloc fallback path?
> >
> > The case I was concerned about was from vmalloc() path, not kmalloc().
> > That was where the stack trace indicating OOM pointed to. As an example,
> > there could be really large allocation requests for maps where the map
> > has pre-allocated memory for its elements. Thus, if we get to the point
> > where we need to kill others due to shortage of mem for satisfying this,
> > I'd much much rather prefer to just not let vmalloc() work really hard
> > and fail early on instead.
>
> I see, but as already mentioned, chances are that by the time you get
> close to the OOM somebody else will hit the OOM before the vmalloc path
> manages to free the allocated memory.
>
> > In my (crafted) test case, I was connected
> > via ssh and it each time reliably killed my connection, which is really
> > suboptimal.
> >
> > F.e., I could also imagine a buggy or miscalculated map definition for
> > a prog that is provisioned to multiple places, which then accidentally
> > triggers this. Or if large on purpose, but we crossed the line, it
> > could be handled more gracefully, f.e. I could imagine an option to
> > falling back to a non-pre-allocated map flavor from the application
> > loading the program. Trade-off for sure, but still allowing it to
> > operate up to a certain extent. Granted, if vmalloc() succeeded without
> > trying hard and we then OOM elsewhere, too bad, but we don't have much
> > control over that one anyway, only about our own request. Reason I
> > asked above was whether having __GFP_NORETRY in would be fatal
> > somewhere down the path, but seems not as you say.
> >
> > So to answer your second email with the bpf and netfilter hunks, why
> > not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
> > big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> > is not harmful though has only /partial/ effect right now and that full
> > support needs to be implemented in future. That would still be better
> > than not having it, imo, and the FIXME would make expectations clear
> > to anyone reading that code.
>
> Well, we can do that, I just would like to prevent from this (ab)use
> if there is no _real_ and _sensible_ usecase for it. Having a real bug

Understandable.

> report or a fallback mechanism you are mentioning above would justify
> the (ab)use IMHO. But that abuse would be documented properly and have a
> real reason to exist. That sounds like a better approach to me.
>
> But if you absolutely _insist_ I can change that.

Yeah, please do (with a big FIXME comment as mentioned), this originally
came from a real bug report. Anyway, feel free to add my Acked-by then.

Thanks again,
Daniel
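
The graceful-degradation idea from the message above (let the large
pre-allocation fail early, then have the loading application retry with
a non-pre-allocated map flavor) can be modeled in userspace C. This is
a sketch under stated assumptions and not bpf code: struct model_map,
model_map_create(), loader_create_map(), alloc_fail_early() and the
easy_budget limit are all hypothetical names invented here, and
malloc() stands in for the kernel allocators.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical limit: memory obtainable without "trying hard". */
static size_t easy_budget = 1u << 20;

/* Model of an allocation that fails early instead of pushing to OOM. */
static void *alloc_fail_early(size_t size)
{
	if (size > easy_budget)
		return NULL;
	return malloc(size);
}

/* Hypothetical map object; not a real bpf structure. */
struct model_map {
	void *elems;		/* NULL when elements are not pre-allocated */
	bool preallocated;
	size_t max_entries;
	size_t elem_size;
};

/* Returns 0 on success, -1 on bad arguments or allocation failure. */
int model_map_create(struct model_map *map, size_t max_entries,
		     size_t elem_size, bool prealloc)
{
	if (!map || !max_entries || !elem_size ||
	    max_entries > SIZE_MAX / elem_size)
		return -1;

	map->max_entries = max_entries;
	map->elem_size = elem_size;
	map->elems = NULL;
	map->preallocated = false;

	if (prealloc) {
		/* Fail early; the caller decides how to degrade. */
		map->elems = alloc_fail_early(max_entries * elem_size);
		if (!map->elems)
			return -1;
		map->preallocated = true;
	}
	return 0;
}

/* Loader side: prefer pre-allocation, degrade to on-demand elements. */
int loader_create_map(struct model_map *map, size_t max_entries,
		      size_t elem_size)
{
	if (model_map_create(map, max_entries, elem_size, true) == 0)
		return 0;
	return model_map_create(map, max_entries, elem_size, false);
}

void model_map_destroy(struct model_map *map)
{
	free(map->elems);
	map->elems = NULL;
	map->preallocated = false;
}
```

The trade-off is the one named in the message: the degraded map still
operates, only without pre-allocated elements, and nothing else on the
system had to be killed to satisfy the request.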
Re: [PATCH 0/6 v3] kvmalloc
On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
> On 01/26/2017 02:40 PM, Michal Hocko wrote:
[...]
> > But realistically, how big is this problem really? Is it really worth
> > it? You said this is an admin only interface and admin can kill the
> > machine by OOM and other means already.
> >
> > Moreover and I should probably mention it explicitly, your d407bd25a204b
> > reduced the likelihood of oom for other reason. kmalloc used GFP_USER
> > previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
> > could indeed hit the OOM e.g. due to memory fragmentation. It would be
> > much harder to hit the OOM killer from vmalloc which doesn't issue
> > higher order allocation requests. Or have you ever seen the OOM killer
> > pointing to the vmalloc fallback path?
>
> The case I was concerned about was from vmalloc() path, not kmalloc().
> That was where the stack trace indicating OOM pointed to. As an example,
> there could be really large allocation requests for maps where the map
> has pre-allocated memory for its elements. Thus, if we get to the point
> where we need to kill others due to shortage of mem for satisfying this,
> I'd much much rather prefer to just not let vmalloc() work really hard
> and fail early on instead.

I see, but as already mentioned, chances are that by the time you get
close to the OOM somebody else will hit the OOM before the vmalloc path
manages to free the allocated memory.

> In my (crafted) test case, I was connected
> via ssh and it each time reliably killed my connection, which is really
> suboptimal.
>
> F.e., I could also imagine a buggy or miscalculated map definition for
> a prog that is provisioned to multiple places, which then accidentally
> triggers this. Or if large on purpose, but we crossed the line, it
> could be handled more gracefully, f.e. I could imagine an option to
> falling back to a non-pre-allocated map flavor from the application
> loading the program. Trade-off for sure, but still allowing it to
> operate up to a certain extent. Granted, if vmalloc() succeeded without
> trying hard and we then OOM elsewhere, too bad, but we don't have much
> control over that one anyway, only about our own request. Reason I
> asked above was whether having __GFP_NORETRY in would be fatal
> somewhere down the path, but seems not as you say.
>
> So to answer your second email with the bpf and netfilter hunks, why
> not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
> big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> is not harmful though has only /partial/ effect right now and that full
> support needs to be implemented in future. That would still be better
> than not having it, imo, and the FIXME would make expectations clear
> to anyone reading that code.

Well, we can do that, I just would like to prevent from this (ab)use
if there is no _real_ and _sensible_ usecase for it. Having a real bug
report or a fallback mechanism you are mentioning above would justify
the (ab)use IMHO. But that abuse would be documented properly and have a
real reason to exist. That sounds like a better approach to me.

But if you absolutely _insist_ I can change that.
--
Michal Hocko
SUSE Labs
Re: [PATCH 0/6 v3] kvmalloc
On 01/26/2017 02:40 PM, Michal Hocko wrote:
> On Thu 26-01-17 14:10:06, Daniel Borkmann wrote:
> > On 01/26/2017 12:58 PM, Michal Hocko wrote:
> > > On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
> > > > On 01/26/2017 11:08 AM, Michal Hocko wrote:
> > > [...]
> > > > > If you disagree I can drop the bpf part of course...
> > > >
> > > > If we could consolidate these spots with kvmalloc() eventually, I'm
> > > > all for it. But even if __GFP_NORETRY is not covered down to all
> > > > possible paths, it kind of does have an effect already of saying
> > > > 'don't try too hard', so would it be harmful to still keep that for
> > > > now? If it's not, I'd personally prefer to just leave it as is until
> > > > there's some form of support by kvmalloc() and friends.
> > >
> > > Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
> > > disallowed. It is not _supported_ which means that if it doesn't work as
> > > you expect you are on your own. Which is actually the situation right
> > > now as well. But I still think that this is just not right thing to do.
> > > Even though it might happen to work in some cases it gives a false
> > > impression of a solution. So I would rather go with
> >
> > Hmm. 'On my own' means, we could potentially BUG somewhere down the
> > vmalloc implementation, etc, presumably? So it might in-fact be
> > harmful to pass that, right?
>
> No it would mean that it might eventually hit the behavior which you are
> trying to avoid - in other words it may invoke OOM killer even though
> __GFP_NORETRY means giving up before any system wide disruptive actions
> are taken.

Ok, thanks for clarifying, more on that further below.

> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index 8697f43cf93c..a6dc4d596f14 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
> > >
> > >  void *bpf_map_area_alloc(size_t size)
> > >  {
> > > +	/*
> > > +	 * FIXME: we would really like to not trigger the OOM killer and rather
> > > +	 * fail instead. This is not supported right now. Please nag MM people
> > > +	 * if these OOM start bothering people.
> > > +	 */
> >
> > Ok, I know this is out of scope for this series, but since i) this
> > is _not_ the _only_ spot right now which has such a construct and ii)
> > I am already kind of nagging a bit ;), my question would be, what
> > would it take to start supporting it?
>
> propagate gfp mask all the way down from vmalloc to all places which
> might allocate down the path and especially page table allocation
> function are PITA because they are really deep. This is a lot of work...
>
> But realistically, how big is this problem really? Is it really worth
> it? You said this is an admin only interface and admin can kill the
> machine by OOM and other means already.
>
> Moreover and I should probably mention it explicitly, your d407bd25a204b
> reduced the likelihood of oom for other reason. kmalloc used GFP_USER
> previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
> could indeed hit the OOM e.g. due to memory fragmentation. It would be
> much harder to hit the OOM killer from vmalloc which doesn't issue
> higher order allocation requests. Or have you ever seen the OOM killer
> pointing to the vmalloc fallback path?

The case I was concerned about was from vmalloc() path, not kmalloc().
That was where the stack trace indicating OOM pointed to. As an example,
there could be really large allocation requests for maps where the map
has pre-allocated memory for its elements. Thus, if we get to the point
where we need to kill others due to shortage of mem for satisfying this,
I'd much much rather prefer to just not let vmalloc() work really hard
and fail early on instead. In my (crafted) test case, I was connected
via ssh and it each time reliably killed my connection, which is really
suboptimal.

F.e., I could also imagine a buggy or miscalculated map definition for
a prog that is provisioned to multiple places, which then accidentally
triggers this. Or if large on purpose, but we crossed the line, it
could be handled more gracefully, f.e. I could imagine an option to
falling back to a non-pre-allocated map flavor from the application
loading the program. Trade-off for sure, but still allowing it to
operate up to a certain extent. Granted, if vmalloc() succeeded without
trying hard and we then OOM elsewhere, too bad, but we don't have much
control over that one anyway, only about our own request. Reason I
asked above was whether having __GFP_NORETRY in would be fatal
somewhere down the path, but seems not as you say.

So to answer your second email with the bpf and netfilter hunks, why
not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
is not harmful though has only /partial/ effect right now and that full
support needs to be implemented in future. That would still be better
than not having it, imo, and the FIXME would make expectations clear
to anyone reading that code.

Thanks,
Daniel
Re: [PATCH 0/6 v3] kvmalloc
On Thu 26-01-17 14:40:04, Michal Hocko wrote:
> On Thu 26-01-17 14:10:06, Daniel Borkmann wrote:
> > On 01/26/2017 12:58 PM, Michal Hocko wrote:
> > > On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
> > > > On 01/26/2017 11:08 AM, Michal Hocko wrote:
> > > [...]
> > > > > If you disagree I can drop the bpf part of course...
> > > >
> > > > If we could consolidate these spots with kvmalloc() eventually, I'm
> > > > all for it. But even if __GFP_NORETRY is not covered down to all
> > > > possible paths, it kind of does have an effect already of saying
> > > > 'don't try too hard', so would it be harmful to still keep that for
> > > > now? If it's not, I'd personally prefer to just leave it as is until
> > > > there's some form of support by kvmalloc() and friends.
> > >
> > > Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
> > > disallowed. It is not _supported_ which means that if it doesn't work as
> > > you expect you are on your own. Which is actually the situation right
> > > now as well. But I still think that this is just not right thing to do.
> > > Even though it might happen to work in some cases it gives a false
> > > impression of a solution. So I would rather go with
> >
> > Hmm. 'On my own' means, we could potentially BUG somewhere down the
> > vmalloc implementation, etc, presumably? So it might in-fact be
> > harmful to pass that, right?
>
> No it would mean that it might eventually hit the behavior which you are
> trying to avoid - in other words it may invoke OOM killer even though
> __GFP_NORETRY means giving up before any system wide disruptive actions
> are taken.

I will separate both bpf and netfilter hunks into its own patch with the
clarification. Does the following look better?
---
From ab6b2d724228e4abcc69c44f5ab1ce91009aa91d Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Thu, 26 Jan 2017 14:59:21 +0100
Subject: [PATCH] net, bpf: use kvzalloc helper

both bpf_map_area_alloc and xt_alloc_table_info try really hard to
play nicely with large memory requests which can be triggered from
the userspace (by an admin). See 5bad87348c70 ("netfilter: x_tables:
avoid warn and OOM killer on vmalloc call") resp. d407bd25a204 ("bpf:
don't trigger OOM killer under pressure with map alloc").

The current allocation pattern strongly resembles kvmalloc helper except
for one thing __GFP_NORETRY is not used for the vmalloc fallback. The
main reason why kvmalloc doesn't really support __GFP_NORETRY is
because vmalloc doesn't support this flag properly and it is far from
straightforward to make it understand it because there are some hard
coded GFP_KERNEL allocation deep in the call chains. This patch simply
replaces the open coded variants with kvmalloc and puts a note to
push on MM people to support __GFP_NORETRY in kvmalloc if this turns out
to be really needed along with OOM report pointing at vmalloc.

If there is an immediate need and no full support yet then
	kvmalloc(size, gfp | __GFP_NORETRY)
will work as good as __vmalloc(gfp | __GFP_NORETRY) - in other words it
might trigger the OOM in some cases.

Cc: Daniel Borkmann
Cc: Alexei Starovoitov
Cc: Andrey Konovalov
Cc: Marcelo Ricardo Leitner
Cc: Pablo Neira Ayuso
Signed-off-by: Michal Hocko
---
 kernel/bpf/syscall.c     | 19 +--
 net/netfilter/x_tables.c | 16 ++--
 2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 19b6129eab23..a6dc4d596f14 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -53,21 +53,12 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
 
 void *bpf_map_area_alloc(size_t size)
 {
-	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
-	 * trigger under memory pressure as we really just want to
-	 * fail instead.
+	/*
+	 * FIXME: we would really like to not trigger the OOM killer and rather
+	 * fail instead. This is not supported right now. Please nag MM people
+	 * if these OOM start bothering people.
 	 */
-	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
-	void *area;
-
-	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
-		area = kmalloc(size, GFP_USER | flags);
-		if (area != NULL)
-			return area;
-	}
-
-	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
-			 PAGE_KERNEL);
+	return kvzalloc(size, GFP_USER);
 }
 
 void bpf_map_area_free(void *area)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index d529989f5791..ba8ba633da72 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -995,16 +995,12 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
 	if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
 		return NULL;
 
-	if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
-		info = kmalloc(sz,
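For readers who want to poke at the allocation pattern outside the kernel, here is a rough userspace sketch of the open-coded "try the contiguous allocator for small sizes, then fall back" logic that the patch above removes from bpf_map_area_alloc. Everything in it is an assumption made for illustration: malloc() stands in for both kmalloc() and __vmalloc(), the SKETCH_* constants and sketch_* names are made up, and the failure switch exists only so the fallback can be exercised. It is not how kvzalloc() is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-ins, not kernel definitions. */
#define SKETCH_PAGE_SIZE    4096UL
#define SKETCH_COSTLY_ORDER 3	/* mirrors PAGE_ALLOC_COSTLY_ORDER */

/* Which path satisfied the request, recorded so callers can observe it. */
enum sketch_path { PATH_NONE, PATH_CONTIG, PATH_FALLBACK };

/* Hypothetical knob to simulate the contiguous allocator failing. */
static int contig_should_fail;

/* Stand-in for kmalloc(): may fail on demand. */
static void *contig_alloc(size_t size)
{
	if (contig_should_fail)
		return NULL;
	return malloc(size);
}

/* Stand-in for the __vmalloc() fallback. */
static void *fallback_alloc(size_t size)
{
	return malloc(size);
}

/*
 * Zeroed allocation: try the contiguous path only for requests up to
 * PAGE_SIZE << COSTLY_ORDER, otherwise (or on failure) use the fallback.
 */
static void *sketch_kvzalloc(size_t size, enum sketch_path *path)
{
	void *area = NULL;

	*path = PATH_NONE;
	if (size <= (SKETCH_PAGE_SIZE << SKETCH_COSTLY_ORDER)) {
		area = contig_alloc(size);
		if (area)
			*path = PATH_CONTIG;
	}
	if (!area) {
		area = fallback_alloc(size);
		if (area)
			*path = PATH_FALLBACK;
	}
	if (area)
		memset(area, 0, size);
	return area;
}
```

Memory returned by the sketch is released with plain free(); in the kernel the counterpart would be kvfree(), which works out which allocator the pointer came from.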
Re: [PATCH 0/6 v3] kvmalloc
On Thu 26-01-17 14:10:06, Daniel Borkmann wrote:
> On 01/26/2017 12:58 PM, Michal Hocko wrote:
> > On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
> > > On 01/26/2017 11:08 AM, Michal Hocko wrote:
> > [...]
> > > > If you disagree I can drop the bpf part of course...
> > >
> > > If we could consolidate these spots with kvmalloc() eventually, I'm
> > > all for it. But even if __GFP_NORETRY is not covered down to all
> > > possible paths, it kind of does have an effect already of saying
> > > 'don't try too hard', so would it be harmful to still keep that for
> > > now? If it's not, I'd personally prefer to just leave it as is until
> > > there's some form of support by kvmalloc() and friends.
> >
> > Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
> > disallowed. It is not _supported_ which means that if it doesn't work as
> > you expect you are on your own. Which is actually the situation right
> > now as well. But I still think that this is just not right thing to do.
> > Even though it might happen to work in some cases it gives a false
> > impression of a solution. So I would rather go with
>
> Hmm. 'On my own' means, we could potentially BUG somewhere down the
> vmalloc implementation, etc, presumably? So it might in-fact be
> harmful to pass that, right?

No it would mean that it might eventually hit the behavior which you are
trying to avoid - in other words it may invoke OOM killer even though
__GFP_NORETRY means giving up before any system wide disruptive actions
are taken.

>
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 8697f43cf93c..a6dc4d596f14 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
> >
> >  void *bpf_map_area_alloc(size_t size)
> >  {
> > +	/*
> > +	 * FIXME: we would really like to not trigger the OOM killer and rather
> > +	 * fail instead. This is not supported right now. Please nag MM people
> > +	 * if these OOM start bothering people.
> > +	 */
>
> Ok, I know this is out of scope for this series, but since i) this
> is _not_ the _only_ spot right now which has such a construct and ii)
> I am already kind of nagging a bit ;), my question would be, what
> would it take to start supporting it?

propagate gfp mask all the way down from vmalloc to all places which
might allocate down the path and especially page table allocation
function are PITA because they are really deep. This is a lot of work...

But realistically, how big is this problem really? Is it really worth
it? You said this is an admin only interface and admin can kill the
machine by OOM and other means already.

Moreover and I should probably mention it explicitly, your d407bd25a204b
reduced the likelihood of oom for other reason. kmalloc used GFP_USER
previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
could indeed hit the OOM e.g. due to memory fragmentation. It would be
much harder to hit the OOM killer from vmalloc which doesn't issue
higher order allocation requests. Or have you ever seen the OOM killer
pointing to the vmalloc fallback path?
-- 
Michal Hocko
SUSE Labs
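To make the "order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER" range above a bit more concrete, the following is a small userspace sketch that maps a request size to a page order and checks whether it falls into that range. It assumes 4 KiB pages and a costly order of 3; the sketch_* helpers are illustrative names, not kernel functions (the kernel's own helper for this is get_order()).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration: 4 KiB pages, costly order of 3. */
#define SKETCH_PAGE_SHIFT   12
#define SKETCH_PAGE_SIZE    (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_COSTLY_ORDER 3

/* Smallest order such that (PAGE_SIZE << order) covers @size bytes. */
static unsigned int sketch_get_order(size_t size)
{
	size_t pages = (size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
	unsigned int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

/*
 * Non-zero when a contiguous request of @size bytes lands in the range
 * named in the mail: higher-order, but not above the costly order.
 */
static int sketch_in_oom_prone_range(size_t size)
{
	unsigned int order = sketch_get_order(size);

	return order > 0 && order <= SKETCH_COSTLY_ORDER;
}
```

Under these assumptions, requests between 4097 and 32768 bytes fall into the range, while single-page requests and anything above 32 KiB do not.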
Re: [PATCH 0/6 v3] kvmalloc
On 01/26/2017 12:58 PM, Michal Hocko wrote:
> On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
> > On 01/26/2017 11:08 AM, Michal Hocko wrote:
> [...]
> > > If you disagree I can drop the bpf part of course...
> >
> > If we could consolidate these spots with kvmalloc() eventually, I'm
> > all for it. But even if __GFP_NORETRY is not covered down to all
> > possible paths, it kind of does have an effect already of saying
> > 'don't try too hard', so would it be harmful to still keep that for
> > now? If it's not, I'd personally prefer to just leave it as is until
> > there's some form of support by kvmalloc() and friends.
>
> Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
> disallowed. It is not _supported_ which means that if it doesn't work as
> you expect you are on your own. Which is actually the situation right
> now as well. But I still think that this is just not right thing to do.
> Even though it might happen to work in some cases it gives a false
> impression of a solution. So I would rather go with

Hmm. 'On my own' means, we could potentially BUG somewhere down the
vmalloc implementation, etc, presumably? So it might in-fact be
harmful to pass that, right?

> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 8697f43cf93c..a6dc4d596f14 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
>
>  void *bpf_map_area_alloc(size_t size)
>  {
> +	/*
> +	 * FIXME: we would really like to not trigger the OOM killer and rather
> +	 * fail instead. This is not supported right now. Please nag MM people
> +	 * if these OOM start bothering people.
> +	 */

Ok, I know this is out of scope for this series, but since i) this
is _not_ the _only_ spot right now which has such a construct and ii)
I am already kind of nagging a bit ;), my question would be, what
would it take to start supporting it?

>  	return kvzalloc(size, GFP_USER);
>  }

Thanks,
Daniel
Re: [PATCH 0/6 v3] kvmalloc
On Thu 26-01-17 04:14:37, Joe Perches wrote:
> On Thu, 2017-01-26 at 11:32 +0100, Michal Hocko wrote:
> > So I have folded the following to the patch 1. It is in line with
> > kvmalloc and hopefully at least tells more than the current code.
> []
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> []
> > @@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> >   * Allocate enough pages to cover @size from the page level
> >   * allocator with @gfp_mask flags. Map them into contiguous
> >   * kernel virtual space, using a pagetable protection of @prot.
> > + *
> > + * Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
> > + * and __GFP_NOFAIL are not supported
>
> Maybe add a BUILD_BUG or a WARN_ON_ONCE to catch new occurrences?

I would really like to not touch vmalloc in this series.
--
Michal Hocko
SUSE Labs
Re: [PATCH 0/6 v3] kvmalloc
On Thu, 2017-01-26 at 11:32 +0100, Michal Hocko wrote:
> So I have folded the following to the patch 1. It is in line with
> kvmalloc and hopefully at least tells more than the current code.
[]
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
[]
> @@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>   * Allocate enough pages to cover @size from the page level
>   * allocator with @gfp_mask flags. Map them into contiguous
>   * kernel virtual space, using a pagetable protection of @prot.
> + *
> + * Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
> + * and __GFP_NOFAIL are not supported

Maybe add a BUILD_BUG or a WARN_ON_ONCE to catch new occurrences?
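As a rough idea of what the suggested runtime check could look like, here is a userspace sketch of a warn-once test for the three reclaim modifiers named in the comment. It is not the kernel's WARN_ON_ONCE(): the SK_* bit values and the function name are invented for this example, and the real __GFP_* bits differ between kernel versions. A compile-time check in the BUILD_BUG style would only help where the mask is a compile-time constant, which is why the sketch tests at run time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SK_GFP_NORETRY 0x1u
#define SK_GFP_REPEAT  0x2u
#define SK_GFP_NOFAIL  0x4u
#define SK_RECLAIM_MODIFIERS (SK_GFP_NORETRY | SK_GFP_REPEAT | SK_GFP_NOFAIL)

/*
 * Returns true whenever the mask carries an unsupported reclaim modifier,
 * but prints the warning only the first time that happens.
 */
static bool warn_once_on_reclaim_modifiers(unsigned int gfp_mask, const char *caller)
{
	static bool warned;
	bool bad = (gfp_mask & SK_RECLAIM_MODIFIERS) != 0;

	assert(caller != NULL);
	if (bad && !warned) {
		warned = true;
		fprintf(stderr, "%s: unsupported reclaim modifier in gfp mask %#x\n",
			caller, gfp_mask);
	}
	return bad;
}
```

The warn-once behaviour keeps a repeat offender from flooding the log, at the cost of reporting only the first caller that trips the check.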
Re: [PATCH 0/6 v3] kvmalloc
On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
> On 01/26/2017 11:08 AM, Michal Hocko wrote:
[...]
> > If you disagree I can drop the bpf part of course...
>
> If we could consolidate these spots with kvmalloc() eventually, I'm
> all for it. But even if __GFP_NORETRY is not covered down to all
> possible paths, it kind of does have an effect already of saying
> 'don't try too hard', so would it be harmful to still keep that for
> now? If it's not, I'd personally prefer to just leave it as is until
> there's some form of support by kvmalloc() and friends.

Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
disallowed. It is not _supported_, which means that if it doesn't work as
you expect you are on your own. Which is actually the situation right
now as well. But I still think that this is just not the right thing to do.
Even though it might happen to work in some cases it gives a false
impression of a solution. So I would rather go with

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8697f43cf93c..a6dc4d596f14 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)

 void *bpf_map_area_alloc(size_t size)
 {
+	/*
+	 * FIXME: we would really like to not trigger the OOM killer and rather
+	 * fail instead. This is not supported right now. Please nag MM people
+	 * if these OOM start bothering people.
+	 */
 	return kvzalloc(size, GFP_USER);
 }
--
Michal Hocko
SUSE Labs
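For readers unfamiliar with the pattern that kvzalloc() consolidates, the following userspace model shows the general "try the contiguous allocator first, fall back to the page-granular one" shape. It is only a sketch under loud assumptions: calloc() stands in for both allocators, the contiguous stand-in is made to fail above an arbitrary limit to mimic fragmentation, and every sk_* name is hypothetical. It does not reproduce the gfp handling of the actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SK_PAGE_SIZE 4096UL
/* Hypothetical limit above which the "contiguous" stand-in pretends to fail. */
#define SK_CONTIG_LIMIT (8 * SK_PAGE_SIZE)

enum sk_path { SK_PATH_NONE, SK_PATH_CONTIG, SK_PATH_FALLBACK };

/* Stand-in for kmalloc(): fails for large sizes, as if memory were fragmented. */
static void *sk_contig_alloc(size_t size)
{
	return size <= SK_CONTIG_LIMIT ? calloc(1, size) : NULL;
}

/* Stand-in for vmalloc(): no physical contiguity requirement. */
static void *sk_fallback_alloc(size_t size)
{
	return calloc(1, size);
}

/* Try the contiguous path first; fall back only for multi-page requests. */
static void *sk_kvzalloc(size_t size, enum sk_path *path)
{
	void *p;

	assert(path != NULL);
	*path = SK_PATH_NONE;

	p = sk_contig_alloc(size);
	if (p) {
		*path = SK_PATH_CONTIG;
		return p;
	}
	if (size <= SK_PAGE_SIZE)
		return NULL;	/* a fallback cannot help a sub-page request */

	p = sk_fallback_alloc(size);
	if (p)
		*path = SK_PATH_FALLBACK;
	return p;
}
```

The point of having one helper is that callers such as bpf_map_area_alloc() no longer open-code this sequence, each with its own slightly different flag expectations.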
Re: [PATCH 0/6 v3] kvmalloc
On Thu 26-01-17 12:04:13, Daniel Borkmann wrote:
> On 01/26/2017 11:32 AM, Michal Hocko wrote:
> > On Thu 26-01-17 11:08:02, Michal Hocko wrote:
> > > On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
> > > > On 01/26/2017 08:43 AM, Michal Hocko wrote:
> > > > > On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> > > [...]
> > > > > > I assume that kvzalloc() is still the same from [1], right? If so, then
> > > > > > it would unfortunately (partially) reintroduce the issue that was fixed.
> > > > > > If you look above at flags, they're also passed to __vmalloc() to not
> > > > > > trigger OOM in these situations I've experienced.
> > > > >
> > > > > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > > > > think it would. It can still trigger the OOM killer because the flags
> > > > > are not propagated all the way down to all allocation requests (e.g.
> > > > > page tables). This is the same reason why GFP_NOFS is not supported in
> > > > > vmalloc.
> > > >
> > > > Ok, good to know, is that somewhere clearly documented (like for the
> > > > case with kmalloc())?
> > >
> > > I am afraid that we really suck on this front. I will add something.
> >
> > So I have folded the following to the patch 1. It is in line with
> > kvmalloc and hopefully at least tells more than the current code.
> > ---
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index d89034a393f2..6c1aa2c68887 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> >   * Allocate enough pages to cover @size from the page level
> >   * allocator with @gfp_mask flags. Map them into contiguous
> >   * kernel virtual space, using a pagetable protection of @prot.
> > + *
> > + * Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
> > + * and __GFP_NOFAIL are not supported
>
> We could probably also mention that __GFP_ZERO in @gfp_mask is
> supported, though.

There are others which would be supported, so I would rather stay with
listing the explicitly unsupported ones.

> > + * Any use of gfp flags outside of GFP_KERNEL should be consulted
> > + * with mm people.
>
> Just a question: should that read 'GFP_KERNEL | __GFP_HIGHMEM' as
> that is what vmalloc() resp. vzalloc() and others pass as flags?

Yes, even though I think that specifying __GFP_HIGHMEM shouldn't be
really necessary. Are there any users who would really insist on vmalloc
pages in lowmem? Anyway, this made me recheck the kvmalloc_node
implementation and I am not adding this flag, which would mean a
regression from the current state. Will fix it up.
--
Michal Hocko
SUSE Labs
Re: [PATCH 0/6 v3] kvmalloc
On 01/26/2017 11:08 AM, Michal Hocko wrote:
> On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
> > On 01/26/2017 08:43 AM, Michal Hocko wrote:
> > > On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> [...]
> > > > I assume that kvzalloc() is still the same from [1], right? If so, then
> > > > it would unfortunately (partially) reintroduce the issue that was fixed.
> > > > If you look above at flags, they're also passed to __vmalloc() to not
> > > > trigger OOM in these situations I've experienced.
> > >
> > > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > > think it would. It can still trigger the OOM killer because the flags
> > > are not propagated all the way down to all allocation requests (e.g.
> > > page tables). This is the same reason why GFP_NOFS is not supported in
> > > vmalloc.
> >
> > Ok, good to know, is that somewhere clearly documented (like for the
> > case with kmalloc())?
>
> I am afraid that we really suck on this front. I will add something.

Thanks for doing that, much appreciated!

> > If not, could we do that for non-mm folks, or
> > at least add a similar WARN_ON_ONCE() as you did for kvmalloc() to make
> > it obvious to users that a given flag combination is not supported all
> > the way down?
>
> I am not sure that triggering a warning that somebody has used
> __GFP_NOWARN is very helpful ;). I also do not think that covering all the
> supported flags is really feasible. Most of them will not have bad side
> effects. I have added the warning because this API is new and I wanted
> to catch new abusers. Old ones would have to die slowly.

Okay, makes sense then. Just the kdoc comment from your other mail
should help fine already.

> > > > This is effectively the
> > > > same requirement as in other networking areas f.e. that 5bad87348c70
> > > > ("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
> > > > In your comment in kvzalloc() you eventually say that some of the above
> > > > modifiers are not supported. So there would be two options, i) just leave
> > > > out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
> > > > it later (along with similar code from 5bad87348c70), or ii) implement
> > > > support for these modifiers as well to your original set. I guess it's not
> > > > too urgent, so we could also proceed with i) if that is easier for you to
> > > > proceed (I don't mind either way).
> > >
> > > Could you clarify why the oom killer in vmalloc matters actually?
> >
> > For both mentioned commits, (privileged) user space can potentially
> > create large allocation requests, where we thus switch to vmalloc()
> > flavor eventually and then OOM starts killing processes to try to
> > satisfy the allocation request. This is bad, because we want the
> > request to just fail instead as it's non-critical and f.e. not kill
> > ssh connection et al. Failing is totally fine in this case, whereas
> > triggering OOM is not.
>
> I see your intention but does it really make any real difference?
> Consider you would back off right before you would have OOMed. Any
> parallel request would just hit the OOM for you. You are (almost) never
> doing an allocation in isolation.
>
> > In my testing, __GFP_NORETRY did satisfy this
> > just fine, but as you say it seems it's not enough.
>
> Yeah, ptes have been most probably populated already.
>
> > Given there are
> > multiple places like these in the kernel, could we instead add an
> > option such as __GFP_NOOOM, or just make __GFP_NORETRY supported?
>
> As said above I do not really think that suppressing the OOM killer
> makes any difference because it might be just somebody else doing that
> for you. Also the OOM killer is the MM internal implementation "detail"
> users shouldn't really care about. I agree that callers should have a way
> to say they do not want to try really hard and that is not that simple for
> vmalloc unfortunately.
>
> The main problem here is that gfp mask propagation is not that easy to
> fix without a lot of code churn as some of those hardcoded allocation
> requests are deep in call chains.

I see, that's unfortunate. I understand that there are requests in
parallel and that we might end up with OOM eventually if we're unlucky,
but having some way to tell vmalloc to just not try as hard as usual
would be nice.

> I know this sucks and it would be great to support __GFP_NORETRY to
> [k]vmalloc and maybe we will get there eventually. But for the meantime
> I really think that using kvmalloc wherever possible is much better than
> open-coded variants with expectations which do not hold sometimes.

I totally agree with you that having kvmalloc() as helper is awesome
and probably long overdue as well. :)

> If you disagree I can drop the bpf part of course...

If we could consolidate these spots with kvmalloc() eventually, I'm
all for it. But even if __GFP_NORETRY is not covered down to all
possible paths, it kind of does have an effect already of saying
'don't try too hard', so would it be harmful to still keep that for
now? If it's not, I'd personally prefer to just leave it as is until
there's some form of support by kvmalloc() and friends.

Thanks for your input, Michal!

Cheers,
Daniel
Re: [PATCH 0/6 v3] kvmalloc
On 01/26/2017 11:32 AM, Michal Hocko wrote:
> On Thu 26-01-17 11:08:02, Michal Hocko wrote:
> > On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
> > > On 01/26/2017 08:43 AM, Michal Hocko wrote:
> > > > On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> > [...]
> > > > > I assume that kvzalloc() is still the same from [1], right? If so, then
> > > > > it would unfortunately (partially) reintroduce the issue that was fixed.
> > > > > If you look above at flags, they're also passed to __vmalloc() to not
> > > > > trigger OOM in these situations I've experienced.
> > > >
> > > > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > > > think it would. It can still trigger the OOM killer because the flags
> > > > are not propagated all the way down to all allocation requests (e.g.
> > > > page tables). This is the same reason why GFP_NOFS is not supported in
> > > > vmalloc.
> > >
> > > Ok, good to know, is that somewhere clearly documented (like for the
> > > case with kmalloc())?
> >
> > I am afraid that we really suck on this front. I will add something.
>
> So I have folded the following to the patch 1. It is in line with
> kvmalloc and hopefully at least tells more than the current code.
> ---
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d89034a393f2..6c1aa2c68887 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>   * Allocate enough pages to cover @size from the page level
>   * allocator with @gfp_mask flags. Map them into contiguous
>   * kernel virtual space, using a pagetable protection of @prot.
> + *
> + * Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
> + * and __GFP_NOFAIL are not supported

We could probably also mention that __GFP_ZERO in @gfp_mask is
supported, though.

> + * Any use of gfp flags outside of GFP_KERNEL should be consulted
> + * with mm people.

Just a question: should that read 'GFP_KERNEL | __GFP_HIGHMEM' as
that is what vmalloc() resp. vzalloc() and others pass as flags?

> + *
>   */

Sounds good otherwise, thanks Michal!

>  static void *__vmalloc_node(unsigned long size, unsigned long align,
>  			    gfp_t gfp_mask, pgprot_t prot,
Re: [PATCH 0/6 v3] kvmalloc
On Thu 26-01-17 11:08:02, Michal Hocko wrote:
> On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
> > On 01/26/2017 08:43 AM, Michal Hocko wrote:
> > > On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> [...]
> > > > I assume that kvzalloc() is still the same from [1], right? If so, then
> > > > it would unfortunately (partially) reintroduce the issue that was fixed.
> > > > If you look above at flags, they're also passed to __vmalloc() to not
> > > > trigger OOM in these situations I've experienced.
> > >
> > > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > > think it would. It can still trigger the OOM killer because the flags
> > > are not propagated all the way down to all allocation requests (e.g.
> > > page tables). This is the same reason why GFP_NOFS is not supported in
> > > vmalloc.
> >
> > Ok, good to know, is that somewhere clearly documented (like for the
> > case with kmalloc())?
>
> I am afraid that we really suck on this front. I will add something.

So I have folded the following to the patch 1. It is in line with
kvmalloc and hopefully at least tells more than the current code.
---
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d89034a393f2..6c1aa2c68887 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
  * Allocate enough pages to cover @size from the page level
  * allocator with @gfp_mask flags. Map them into contiguous
  * kernel virtual space, using a pagetable protection of @prot.
+ *
+ * Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
+ * and __GFP_NOFAIL are not supported
+ *
+ * Any use of gfp flags outside of GFP_KERNEL should be consulted
+ * with mm people.
+ *
  */
 static void *__vmalloc_node(unsigned long size, unsigned long align,
 			    gfp_t gfp_mask, pgprot_t prot,
--
Michal Hocko
SUSE Labs
Re: [PATCH 0/6 v3] kvmalloc
On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
> On 01/26/2017 08:43 AM, Michal Hocko wrote:
> > On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
[...]
> > > I assume that kvzalloc() is still the same from [1], right? If so, then
> > > it would unfortunately (partially) reintroduce the issue that was fixed.
> > > If you look above at flags, they're also passed to __vmalloc() to not
> > > trigger OOM in these situations I've experienced.
> >
> > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > think it would. It can still trigger the OOM killer because the flags
> > are not propagated all the way down to all allocation requests (e.g.
> > page tables). This is the same reason why GFP_NOFS is not supported in
> > vmalloc.
>
> Ok, good to know, is that somewhere clearly documented (like for the
> case with kmalloc())?

I am afraid that we really suck on this front. I will add something.

> If not, could we do that for non-mm folks, or
> at least add a similar WARN_ON_ONCE() as you did for kvmalloc() to make
> it obvious to users that a given flag combination is not supported all
> the way down?

I am not sure that triggering a warning that somebody has used
__GFP_NOWARN is very helpful ;). I also do not think that covering all
the supported flags is really feasible. Most of them will not have bad
side effects. I have added the warning because this API is new and I
wanted to catch new abusers. Old ones would have to die slowly.

> > > This is effectively the
> > > same requirement as in other networking areas f.e. that 5bad87348c70
> > > ("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
> > > In your comment in kvzalloc() you eventually say that some of the above
> > > modifiers are not supported. So there would be two options, i) just leave
> > > out the kvzalloc() chunk for BPF area to avoid the merge conflict and
> > > tackle it later (along with similar code from 5bad87348c70), or ii)
> > > implement support for these modifiers as well to your original set. I
> > > guess it's not too urgent, so we could also proceed with i) if that is
> > > easier for you to proceed (I don't mind either way).
> >
> > Could you clarify why the oom killer in vmalloc matters actually?
>
> For both mentioned commits, (privileged) user space can potentially
> create large allocation requests, where we thus switch to vmalloc()
> flavor eventually and then OOM starts killing processes to try to
> satisfy the allocation request. This is bad, because we want the
> request to just fail instead as it's non-critical and f.e. not kill
> ssh connection et al. Failing is totally fine in this case, whereas
> triggering OOM is not.

I see your intention but does it really make any real difference?
Consider you would back off right before you would have OOMed. Any
parallel request would just hit the OOM for you. You are (almost) never
doing an allocation in isolation.

> In my testing, __GFP_NORETRY did satisfy this
> just fine, but as you say it seems it's not enough.

Yeah, ptes have most probably been populated already.

> Given there are
> multiple places like these in the kernel, could we instead add an
> option such as __GFP_NOOOM, or just make __GFP_NORETRY supported?

As said above, I do not really think that suppressing the OOM killer
makes any difference because it might be just somebody else doing that
for you. Also the OOM killer is an MM internal implementation "detail"
that users shouldn't really care about. I agree that callers should have
a way to say they do not want to try really hard, and that is
unfortunately not that simple for vmalloc. The main problem here is that
gfp mask propagation is not that easy to fix without a lot of code churn
as some of those hardcoded allocation requests are deep in call chains.

I know this sucks and it would be great to support __GFP_NORETRY in
[k]vmalloc and maybe we will get there eventually. But in the meantime
I really think that using kvmalloc wherever possible is much better than
open coded variants with expectations which sometimes do not hold.

If you disagree I can drop the bpf part of course...
--
Michal Hocko
SUSE Labs
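To illustrate the propagation problem described in the message above, here is a minimal userspace C sketch. It is a toy model under stated assumptions, not kernel code: every name in it (the M_GFP_* bits, m_alloc_page(), m_alloc_page_table(), m_vmalloc()) is hypothetical, and the real call chain is considerably deeper. It only shows why a reclaim modifier passed by the caller cannot be relied on when some nested allocation takes no flags argument and uses a hardcoded mask instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this model only; not the kernel's gfp_t values. */
#define M_GFP_KERNEL  0x1u
#define M_GFP_NORETRY 0x2u

#define M_MAX_TRACE 8

/* Records the flags each simulated page-allocator request was made with. */
struct m_alloc_trace {
	unsigned flags[M_MAX_TRACE];
	size_t count;
};

/* Stand-in for the page allocator: it only records the request. */
static void m_alloc_page(struct m_alloc_trace *trace, unsigned flags)
{
	assert(trace->count < M_MAX_TRACE);
	trace->flags[trace->count++] = flags;
}

/*
 * Stand-in for a page-table allocation buried in the call chain.
 * It takes no flags parameter, so it cannot honour the caller's mask.
 */
static void m_alloc_page_table(struct m_alloc_trace *trace)
{
	m_alloc_page(trace, M_GFP_KERNEL);
}

/*
 * Stand-in for a vmalloc-style allocation: the data pages use the
 * caller's mask, the mapping step does not.
 */
static void m_vmalloc(struct m_alloc_trace *trace, size_t nr_pages,
		      unsigned flags)
{
	for (size_t i = 0; i < nr_pages; i++)
		m_alloc_page(trace, flags);
	m_alloc_page_table(trace);
}

/* True only if every recorded request carried M_GFP_NORETRY. */
static bool m_all_noretry(const struct m_alloc_trace *trace)
{
	for (size_t i = 0; i < trace->count; i++)
		if (!(trace->flags[i] & M_GFP_NORETRY))
			return false;
	return true;
}
```

In this model, calling m_vmalloc() with M_GFP_KERNEL | M_GFP_NORETRY for three pages records four requests, and the last one (the page-table stand-in) lacks M_GFP_NORETRY, so m_all_noretry() returns false. That is the sense in which the modifier has only a partial effect.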
RE: [PATCH 0/6 v3] kvmalloc
From: Daniel Borkmann
> Sent: 26 January 2017 09:37
...
> >> I assume that kvzalloc() is still the same from [1], right? If so, then
> >> it would unfortunately (partially) reintroduce the issue that was fixed.
> >> If you look above at flags, they're also passed to __vmalloc() to not
> >> trigger OOM in these situations I've experienced.
> >
> > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > think it would. It can still trigger the OOM killer because the flags
> > are not propagated all the way down to all allocation requests (e.g.
> > page tables). This is the same reason why GFP_NOFS is not supported in
> > vmalloc.
>
> Ok, good to know, is that somewhere clearly documented (like for the
> case with kmalloc())? If not, could we do that for non-mm folks, or
> at least add a similar WARN_ON_ONCE() as you did for kvmalloc() to make
> it obvious to users that a given flag combination is not supported all
> the way down?

ISTM that requests for the relatively small memory blocks needed for
page tables aren't really likely to invoke the OOM killer when it isn't
already being invoked by other actions. So that isn't really a problem.

More of a problem is that requests that you really don't mind failing
can use the last 'reasonably available' memory. This will cause the next
allocation to fail when it would be better for the earlier one to fail
instead.

David
Re: [PATCH 0/6 v3] kvmalloc
On 01/26/2017 08:43 AM, Michal Hocko wrote:
> On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> > On 01/25/2017 07:14 PM, Alexei Starovoitov wrote:
> > > On Wed, Jan 25, 2017 at 5:21 AM, Michal Hocko wrote:
> > > > On Wed 25-01-17 14:10:06, Michal Hocko wrote:
> > > > > On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
> [...]
> > > > > > > Are there any more comments? I would really appreciate to hear from
> > > > > > > networking folks before I resubmit the series.
> > > > > >
> > > > > > while this patchset was baking the bpf side switched to use
> > > > > > bpf_map_area_alloc()
> > > > > > which fixes the issue with missing __GFP_NORETRY that we had to fix
> > > > > > quickly.
> > > > > > See commit d407bd25a204 ("bpf: don't trigger OOM killer under
> > > > > > pressure with map alloc")
> > > > > > it covers all kmalloc/vmalloc pairs instead of just one place as in
> > > > > > this set.
> > > > > > So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
> > > > >
> > > > > OK, will do. Thanks for the heads up.
> > > >
> > > > Just for the record, I will fold the following into the patch 1
> > > > ---
> > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > > index 19b6129eab23..8697f43cf93c 100644
> > > > --- a/kernel/bpf/syscall.c
> > > > +++ b/kernel/bpf/syscall.c
> > > > @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
> > > >
> > > >  void *bpf_map_area_alloc(size_t size)
> > > >  {
> > > > -        /* We definitely need __GFP_NORETRY, so OOM killer doesn't
> > > > -         * trigger under memory pressure as we really just want to
> > > > -         * fail instead.
> > > > -         */
> > > > -        const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
> > > > -        void *area;
> > > > -
> > > > -        if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> > > > -                area = kmalloc(size, GFP_USER | flags);
> > > > -                if (area != NULL)
> > > > -                        return area;
> > > > -        }
> > > > -
> > > > -        return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
> > > > -                         PAGE_KERNEL);
> > > > +        return kvzalloc(size, GFP_USER);
> > > >  }
> > > >
> > > >  void bpf_map_area_free(void *area)
> > >
> > > Looks fine by me.
> > > Daniel, thoughts?
> >
> > I assume that kvzalloc() is still the same from [1], right? If so, then
> > it would unfortunately (partially) reintroduce the issue that was fixed.
> > If you look above at flags, they're also passed to __vmalloc() to not
> > trigger OOM in these situations I've experienced.
>
> Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> think it would. It can still trigger the OOM killer because the flags
> are not propagated all the way down to all allocation requests (e.g.
> page tables). This is the same reason why GFP_NOFS is not supported in
> vmalloc.

Ok, good to know, is that somewhere clearly documented (like for the
case with kmalloc())? If not, could we do that for non-mm folks, or
at least add a similar WARN_ON_ONCE() as you did for kvmalloc() to make
it obvious to users that a given flag combination is not supported all
the way down?

> > This is effectively the
> > same requirement as in other networking areas f.e. that 5bad87348c70
> > ("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
> > In your comment in kvzalloc() you eventually say that some of the above
> > modifiers are not supported. So there would be two options, i) just leave
> > out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
> > it later (along with similar code from 5bad87348c70), or ii) implement
> > support for these modifiers as well to your original set. I guess it's not
> > too urgent, so we could also proceed with i) if that is easier for you to
> > proceed (I don't mind either way).
>
> Could you clarify why the oom killer in vmalloc matters actually?

For both mentioned commits, (privileged) user space can potentially
create large allocation requests, where we thus switch to vmalloc()
flavor eventually and then OOM starts killing processes to try to
satisfy the allocation request. This is bad, because we want the
request to just fail instead as it's non-critical and f.e. not kill
ssh connection et al. Failing is totally fine in this case, whereas
triggering OOM is not. In my testing, __GFP_NORETRY did satisfy this
just fine, but as you say it seems it's not enough. Given there are
multiple places like these in the kernel, could we instead add an
option such as __GFP_NOOOM, or just make __GFP_NORETRY supported?

Thanks,
Daniel
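For readers who want to see the open-coded pattern under discussion outside the kernel, the following userspace C sketch mirrors the control flow of the bpf_map_area_alloc() body that the quoted diff removes: try a contiguous allocation for sizes up to PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER, then fall back to a second allocator. It is a rough model only. All fb_* names are hypothetical, calloc() stands in for both kernel allocators, the 4096-byte page size is an assumption, and the contig_available parameter is an invented knob that simulates whether contiguous memory can be had. It does not model reclaim or the OOM killer, which is the actual point of contention in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define FB_PAGE_SIZE        4096u  /* assumed page size for the model */
#define FB_COSTLY_ORDER     3u     /* mirrors PAGE_ALLOC_COSTLY_ORDER */
#define FB_CONTIG_THRESHOLD ((size_t)FB_PAGE_SIZE << FB_COSTLY_ORDER)

/* Which simulated allocator ended up serving the request. */
enum fb_backend { FB_NONE, FB_CONTIG, FB_VIRT };

struct fb_area {
	void *ptr;
	enum fb_backend backend;
};

/* Stand-in for the kmalloc() attempt; fails when the knob says so. */
static void *fb_contig_zalloc(size_t size, int contig_available)
{
	return contig_available ? calloc(1, size) : NULL;
}

/* Stand-in for the __vmalloc() fallback. */
static void *fb_virt_zalloc(size_t size)
{
	return calloc(1, size);
}

/* Same shape as the removed code: contiguous first, then fall back. */
static struct fb_area fb_map_area_alloc(size_t size, int contig_available)
{
	struct fb_area area = { NULL, FB_NONE };

	assert(size > 0);

	if (size <= FB_CONTIG_THRESHOLD) {
		area.ptr = fb_contig_zalloc(size, contig_available);
		if (area.ptr != NULL) {
			area.backend = FB_CONTIG;
			return area;
		}
	}

	area.ptr = fb_virt_zalloc(size);
	if (area.ptr != NULL)
		area.backend = FB_VIRT;
	return area;
}
```

A small request with contiguous memory available is served by the first allocator. The same request with contig_available set to 0, or any request above the threshold, ends up on the fallback path. Whoever receives the area releases it with free() in this model, where the kernel code uses bpf_map_area_free().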
Re: [PATCH 0/6 v3] kvmalloc
On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> On 01/25/2017 07:14 PM, Alexei Starovoitov wrote:
> > On Wed, Jan 25, 2017 at 5:21 AM, Michal Hocko wrote:
> > > On Wed 25-01-17 14:10:06, Michal Hocko wrote:
> > > > On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
> [...]
> > > > > > Are there any more comments? I would really appreciate to hear from
> > > > > > networking folks before I resubmit the series.
> > > > >
> > > > > while this patchset was baking the bpf side switched to use
> > > > > bpf_map_area_alloc()
> > > > > which fixes the issue with missing __GFP_NORETRY that we had to fix
> > > > > quickly.
> > > > > See commit d407bd25a204 ("bpf: don't trigger OOM killer under
> > > > > pressure with map alloc")
> > > > > it covers all kmalloc/vmalloc pairs instead of just one place as in
> > > > > this set.
> > > > > So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
> > > >
> > > > OK, will do. Thanks for the heads up.
> > >
> > > Just for the record, I will fold the following into the patch 1
> > > ---
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index 19b6129eab23..8697f43cf93c 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
> > >
> > >  void *bpf_map_area_alloc(size_t size)
> > >  {
> > > -        /* We definitely need __GFP_NORETRY, so OOM killer doesn't
> > > -         * trigger under memory pressure as we really just want to
> > > -         * fail instead.
> > > -         */
> > > -        const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
> > > -        void *area;
> > > -
> > > -        if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> > > -                area = kmalloc(size, GFP_USER | flags);
> > > -                if (area != NULL)
> > > -                        return area;
> > > -        }
> > > -
> > > -        return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
> > > -                         PAGE_KERNEL);
> > > +        return kvzalloc(size, GFP_USER);
> > >  }
> > >
> > >  void bpf_map_area_free(void *area)
> >
> > Looks fine by me.
> > Daniel, thoughts?
>
> I assume that kvzalloc() is still the same from [1], right? If so, then
> it would unfortunately (partially) reintroduce the issue that was fixed.
> If you look above at flags, they're also passed to __vmalloc() to not
> trigger OOM in these situations I've experienced.

Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
think it would. It can still trigger the OOM killer because the flags
are not propagated all the way down to all allocation requests (e.g.
page tables). This is the same reason why GFP_NOFS is not supported in
vmalloc.

> This is effectively the
> same requirement as in other networking areas f.e. that 5bad87348c70
> ("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
> In your comment in kvzalloc() you eventually say that some of the above
> modifiers are not supported. So there would be two options, i) just leave
> out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
> it later (along with similar code from 5bad87348c70), or ii) implement
> support for these modifiers as well to your original set. I guess it's not
> too urgent, so we could also proceed with i) if that is easier for you to
> proceed (I don't mind either way).

Could you clarify why the oom killer in vmalloc matters actually?
--
Michal Hocko
SUSE Labs
Re: [PATCH 0/6 v3] kvmalloc
On 01/25/2017 07:14 PM, Alexei Starovoitov wrote:
> On Wed, Jan 25, 2017 at 5:21 AM, Michal Hocko wrote:
> > On Wed 25-01-17 14:10:06, Michal Hocko wrote:
> > > On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
[...]
> > > > > Are there any more comments? I would really appreciate to hear from
> > > > > networking folks before I resubmit the series.
> > > >
> > > > while this patchset was baking the bpf side switched to use
> > > > bpf_map_area_alloc()
> > > > which fixes the issue with missing __GFP_NORETRY that we had to fix
> > > > quickly.
> > > > See commit d407bd25a204 ("bpf: don't trigger OOM killer under
> > > > pressure with map alloc")
> > > > it covers all kmalloc/vmalloc pairs instead of just one place as in
> > > > this set.
> > > > So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
> > >
> > > OK, will do. Thanks for the heads up.
> >
> > Just for the record, I will fold the following into the patch 1
> > ---
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 19b6129eab23..8697f43cf93c 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
> >
> >  void *bpf_map_area_alloc(size_t size)
> >  {
> > -        /* We definitely need __GFP_NORETRY, so OOM killer doesn't
> > -         * trigger under memory pressure as we really just want to
> > -         * fail instead.
> > -         */
> > -        const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
> > -        void *area;
> > -
> > -        if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> > -                area = kmalloc(size, GFP_USER | flags);
> > -                if (area != NULL)
> > -                        return area;
> > -        }
> > -
> > -        return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
> > -                         PAGE_KERNEL);
> > +        return kvzalloc(size, GFP_USER);
> >  }
> >
> >  void bpf_map_area_free(void *area)
>
> Looks fine by me.
> Daniel, thoughts?

I assume that kvzalloc() is still the same from [1], right? If so, then
it would unfortunately (partially) reintroduce the issue that was fixed.
If you look above at flags, they're also passed to __vmalloc() to not
trigger OOM in these situations I've experienced. This is effectively the
same requirement as in other networking areas f.e. that 5bad87348c70
("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
In your comment in kvzalloc() you eventually say that some of the above
modifiers are not supported. So there would be two options, i) just leave
out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
it later (along with similar code from 5bad87348c70), or ii) implement
support for these modifiers as well to your original set. I guess it's not
too urgent, so we could also proceed with i) if that is easier for you to
proceed (I don't mind either way).

Thanks a lot,
Daniel

[1] https://lkml.org/lkml/2017/1/12/442
Re: [PATCH 0/6 v3] kvmalloc
On Wed, Jan 25, 2017 at 5:21 AM, Michal Hockowrote: > On Wed 25-01-17 14:10:06, Michal Hocko wrote: >> On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote: >> > On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote: >> > > On Thu 12-01-17 16:37:11, Michal Hocko wrote: >> > > > Hi, >> > > > this has been previously posted as a single patch [1] but later on more >> > > > built on top. It turned out that there are users who would like to have >> > > > __GFP_REPEAT semantic. This is currently implemented for costly >64B >> > > > requests. Doing the same for smaller requests would require to redefine >> > > > __GFP_REPEAT semantic in the page allocator which is out of scope of >> > > > this series. >> > > > >> > > > There are many open coded kmalloc with vmalloc fallback instances in >> > > > the tree. Most of them are not careful enough or simply do not care >> > > > about the underlying semantic of the kmalloc/page allocator which means >> > > > that a) some vmalloc fallbacks are basically unreachable because the >> > > > kmalloc part will keep retrying until it succeeds b) the page allocator >> > > > can invoke a really disruptive steps like the OOM killer to move >> > > > forward >> > > > which doesn't sound appropriate when we consider that the vmalloc >> > > > fallback is available. >> > > > >> > > > As it can be seen implementing kvmalloc requires quite an intimate >> > > > knowledge if the page allocator and the memory reclaim internals which >> > > > strongly suggests that a helper should be implemented in the memory >> > > > subsystem proper. >> > > > >> > > > Most callers I could find have been converted to use the helper >> > > > instead. >> > > > This is patch 5. There are some more relying on __GFP_REPEAT in the >> > > > networking stack which I have converted as well but considering we do >> > > > not have a support for __GFP_REPEAT for requests smaller than 64kB I >> > > > have marked it RFC. >> > > >> > > Are there any more comments? 
I would really appreciate to hear from >> > > networking folks before I resubmit the series. >> > >> > while this patchset was baking the bpf side switched to use >> > bpf_map_area_alloc() >> > which fixes the issue with missing __GFP_NORETRY that we had to fix >> > quickly. >> > See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure >> > with map alloc") >> > it covers all kmalloc/vmalloc pairs instead of just one place as in this >> > set. >> > So please rebase and switch bpf_map_area_alloc() to use kvmalloc(). >> >> OK, will do. Thanks for the heads up. > > Just for the record, I will fold the following into the patch 1 > --- > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index 19b6129eab23..8697f43cf93c 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl) > > void *bpf_map_area_alloc(size_t size) > { > - /* We definitely need __GFP_NORETRY, so OOM killer doesn't > -* trigger under memory pressure as we really just want to > -* fail instead. > -*/ > - const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO; > - void *area; > - > - if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { > - area = kmalloc(size, GFP_USER | flags); > - if (area != NULL) > - return area; > - } > - > - return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags, > -PAGE_KERNEL); > + return kvzalloc(size, GFP_USER); > } > > void bpf_map_area_free(void *area) Looks fine by me. Daniel, thoughts?
Re: [PATCH 0/6 v3] kvmalloc
On Wed 25-01-17 14:10:06, Michal Hocko wrote:
> On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
> > On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote:
> > > On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> > > > Hi,
> > > > this has been previously posted as a single patch [1] but later on more
> > > > built on top. It turned out that there are users who would like to have
> > > > __GFP_REPEAT semantic. This is currently implemented for costly >64kB
> > > > requests. Doing the same for smaller requests would require to redefine
> > > > __GFP_REPEAT semantic in the page allocator which is out of scope of
> > > > this series.
> > > >
> > > > There are many open coded kmalloc with vmalloc fallback instances in
> > > > the tree. Most of them are not careful enough or simply do not care
> > > > about the underlying semantic of the kmalloc/page allocator which means
> > > > that a) some vmalloc fallbacks are basically unreachable because the
> > > > kmalloc part will keep retrying until it succeeds b) the page allocator
> > > > can invoke really disruptive steps like the OOM killer to move forward
> > > > which doesn't sound appropriate when we consider that the vmalloc
> > > > fallback is available.
> > > >
> > > > As it can be seen implementing kvmalloc requires quite an intimate
> > > > knowledge of the page allocator and the memory reclaim internals which
> > > > strongly suggests that a helper should be implemented in the memory
> > > > subsystem proper.
> > > >
> > > > Most callers I could find have been converted to use the helper instead.
> > > > This is patch 5. There are some more relying on __GFP_REPEAT in the
> > > > networking stack which I have converted as well but considering we do
> > > > not have a support for __GFP_REPEAT for requests smaller than 64kB I
> > > > have marked it RFC.
> > >
> > > Are there any more comments? I would really appreciate to hear from
> > > networking folks before I resubmit the series.
> >
> > while this patchset was baking the bpf side switched to use
> > bpf_map_area_alloc()
> > which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
> > See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with
> > map alloc")
> > it covers all kmalloc/vmalloc pairs instead of just one place as in this
> > set.
> > So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
>
> OK, will do. Thanks for the heads up.

Just for the record, I will fold the following into the patch 1
---
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 19b6129eab23..8697f43cf93c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)

 void *bpf_map_area_alloc(size_t size)
 {
-	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
-	 * trigger under memory pressure as we really just want to
-	 * fail instead.
-	 */
-	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
-	void *area;
-
-	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
-		area = kmalloc(size, GFP_USER | flags);
-		if (area != NULL)
-			return area;
-	}
-
-	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
-			 PAGE_KERNEL);
+	return kvzalloc(size, GFP_USER);
 }

 void bpf_map_area_free(void *area)
-- 
Michal Hocko
SUSE Labs
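For readers following the removed hunk: the open-coded version only attempted kmalloc() when the request was at most PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER bytes, and went straight to __vmalloc() otherwise. The following is a minimal userspace sketch of that size check only, not kernel code. The SKETCH_* macros and both functions are hypothetical names, the 4 KiB page size is an assumption (it is architecture dependent), and the order of 3 mirrors the mainline definition of PAGE_ALLOC_COSTLY_ORDER as far as I know.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4 KiB pages; the real PAGE_SIZE depends on the architecture. */
#define SKETCH_PAGE_SIZE     4096UL
/* Assumption: mirrors PAGE_ALLOC_COSTLY_ORDER (3 in mainline, to my knowledge). */
#define SKETCH_COSTLY_ORDER  3

static_assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0,
              "page size must be a power of two");

/* Largest request, in bytes, that the removed code still handed to kmalloc(). */
static inline size_t sketch_costly_limit(void)
{
	return SKETCH_PAGE_SIZE << SKETCH_COSTLY_ORDER;
}

/* Hypothetical helper: true when the removed code would have tried kmalloc() first. */
static inline bool sketch_try_kmalloc_first(size_t size)
{
	return size <= sketch_costly_limit();
}
```

Under these assumptions the cut-off is 32 KiB: anything larger skipped the kmalloc() attempt entirely in the old code.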
Re: [PATCH 0/6 v3] kvmalloc
On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
> On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote:
> > On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> > > Hi,
> > > this has been previously posted as a single patch [1] but later on more
> > > built on top. It turned out that there are users who would like to have
> > > __GFP_REPEAT semantic. This is currently implemented for costly >64kB
> > > requests. Doing the same for smaller requests would require to redefine
> > > __GFP_REPEAT semantic in the page allocator which is out of scope of
> > > this series.
> > >
> > > There are many open coded kmalloc with vmalloc fallback instances in
> > > the tree. Most of them are not careful enough or simply do not care
> > > about the underlying semantic of the kmalloc/page allocator which means
> > > that a) some vmalloc fallbacks are basically unreachable because the
> > > kmalloc part will keep retrying until it succeeds b) the page allocator
> > > can invoke really disruptive steps like the OOM killer to move forward
> > > which doesn't sound appropriate when we consider that the vmalloc
> > > fallback is available.
> > >
> > > As it can be seen implementing kvmalloc requires quite an intimate
> > > knowledge of the page allocator and the memory reclaim internals which
> > > strongly suggests that a helper should be implemented in the memory
> > > subsystem proper.
> > >
> > > Most callers I could find have been converted to use the helper instead.
> > > This is patch 5. There are some more relying on __GFP_REPEAT in the
> > > networking stack which I have converted as well but considering we do
> > > not have a support for __GFP_REPEAT for requests smaller than 64kB I
> > > have marked it RFC.
> >
> > Are there any more comments? I would really appreciate to hear from
> > networking folks before I resubmit the series.
>
> while this patchset was baking the bpf side switched to use
> bpf_map_area_alloc()
> which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
> See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with
> map alloc")
> it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
> So please rebase and switch bpf_map_area_alloc() to use kvmalloc().

OK, will do. Thanks for the heads up.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH 0/6 v3] kvmalloc
On Tue 24-01-17 08:00:26, Eric Dumazet wrote:
> On Tue, 2017-01-24 at 16:17 +0100, Michal Hocko wrote:
> > On Thu 12-01-17 16:37:11, Michal Hocko wrote:
>
> > Are there any more comments? I would really appreciate to hear from
> > networking folks before I resubmit the series.
>
> I do not see any issues right now.
>
> I am happy to see this thing finally coming, after years of
> resistance ;)

OK, so I will repost the series and ask Andrew for inclusion
after it passes my compile test battery after the rebase.

Thanks!
-- 
Michal Hocko
SUSE Labs
Re: [PATCH 0/6 v3] kvmalloc
On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote:
> On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> > Hi,
> > this has been previously posted as a single patch [1] but later on more
> > built on top. It turned out that there are users who would like to have
> > __GFP_REPEAT semantic. This is currently implemented for costly >64kB
> > requests. Doing the same for smaller requests would require to redefine
> > __GFP_REPEAT semantic in the page allocator which is out of scope of
> > this series.
> >
> > There are many open coded kmalloc with vmalloc fallback instances in
> > the tree. Most of them are not careful enough or simply do not care
> > about the underlying semantic of the kmalloc/page allocator which means
> > that a) some vmalloc fallbacks are basically unreachable because the
> > kmalloc part will keep retrying until it succeeds b) the page allocator
> > can invoke really disruptive steps like the OOM killer to move forward
> > which doesn't sound appropriate when we consider that the vmalloc
> > fallback is available.
> >
> > As it can be seen implementing kvmalloc requires quite an intimate
> > knowledge of the page allocator and the memory reclaim internals which
> > strongly suggests that a helper should be implemented in the memory
> > subsystem proper.
> >
> > Most callers I could find have been converted to use the helper instead.
> > This is patch 5. There are some more relying on __GFP_REPEAT in the
> > networking stack which I have converted as well but considering we do
> > not have a support for __GFP_REPEAT for requests smaller than 64kB I
> > have marked it RFC.
>
> Are there any more comments? I would really appreciate to hear from
> networking folks before I resubmit the series.

while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
Thanks
Re: [PATCH 0/6 v3] kvmalloc
On Tue, 2017-01-24 at 16:17 +0100, Michal Hocko wrote:
> On Thu 12-01-17 16:37:11, Michal Hocko wrote:

> Are there any more comments? I would really appreciate to hear from
> networking folks before I resubmit the series.

I do not see any issues right now.

I am happy to see this thing finally coming, after years of
resistance ;)
Re: [PATCH 0/6 v3] kvmalloc
On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> Hi,
> this has been previously posted as a single patch [1] but later on more
> built on top. It turned out that there are users who would like to have
> __GFP_REPEAT semantic. This is currently implemented for costly >64kB
> requests. Doing the same for smaller requests would require to redefine
> __GFP_REPEAT semantic in the page allocator which is out of scope of
> this series.
>
> There are many open coded kmalloc with vmalloc fallback instances in
> the tree. Most of them are not careful enough or simply do not care
> about the underlying semantic of the kmalloc/page allocator which means
> that a) some vmalloc fallbacks are basically unreachable because the
> kmalloc part will keep retrying until it succeeds b) the page allocator
> can invoke really disruptive steps like the OOM killer to move forward
> which doesn't sound appropriate when we consider that the vmalloc
> fallback is available.
>
> As it can be seen implementing kvmalloc requires quite an intimate
> knowledge of the page allocator and the memory reclaim internals which
> strongly suggests that a helper should be implemented in the memory
> subsystem proper.
>
> Most callers I could find have been converted to use the helper instead.
> This is patch 5. There are some more relying on __GFP_REPEAT in the
> networking stack which I have converted as well but considering we do
> not have a support for __GFP_REPEAT for requests smaller than 64kB I
> have marked it RFC.

Are there any more comments? I would really appreciate to hear from
networking folks before I resubmit the series.

Thanks!

> [1] http://lkml.kernel.org/r/20170102133700.1734-1-mho...@kernel.org
>
-- 
Michal Hocko
SUSE Labs
[PATCH 0/6 v3] kvmalloc
Hi,
this has been previously posted as a single patch [1], but later on more
was built on top. It turned out that there are users who would like to
have __GFP_REPEAT semantic. This is currently implemented only for costly
(>64kB) requests. Doing the same for smaller requests would require
redefining the __GFP_REPEAT semantic in the page allocator, which is out
of scope of this series.

There are many open coded kmalloc with vmalloc fallback instances in
the tree. Most of them are not careful enough or simply do not care
about the underlying semantic of the kmalloc/page allocator, which means
that a) some vmalloc fallbacks are basically unreachable because the
kmalloc part will keep retrying until it succeeds and b) the page
allocator can invoke really disruptive steps like the OOM killer to move
forward, which doesn't sound appropriate when we consider that the
vmalloc fallback is available.

As can be seen, implementing kvmalloc requires quite an intimate
knowledge of the page allocator and the memory reclaim internals, which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers I could find have been converted to use the helper instead.
This is patch 5. There are some more relying on __GFP_REPEAT in the
networking stack which I have converted as well, but considering we do
not have support for __GFP_REPEAT for requests smaller than 64kB I
have marked it RFC.

[1] http://lkml.kernel.org/r/20170102133700.1734-1-mho...@kernel.org
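The fallback pattern described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions and not the helper from the series: every name in it (kv_alloc_sketch, kv_free_sketch, contig_alloc_noretry, virt_alloc, the backend enum) is hypothetical, malloc() stands in for both kernel allocators, and a boolean parameter simulates the contiguous allocation failing. The point it tries to show is the ordering: the first attempt gives up quickly instead of retrying or escalating, so the second path is actually reachable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
enum backend { BACKEND_NONE, BACKEND_CONTIG, BACKEND_VIRT };

struct kv_buf {
	void *ptr;
	enum backend from;	/* which path satisfied the request */
};

/* Models a kmalloc() attempt that fails fast rather than retrying. */
static void *contig_alloc_noretry(size_t size, bool fragmented)
{
	return fragmented ? NULL : malloc(size);
}

/* Models the vmalloc() fallback, which does not need contiguous memory. */
static void *virt_alloc(size_t size)
{
	return malloc(size);
}

/* Try the contiguous path once, then fall back to the virtual one. */
struct kv_buf kv_alloc_sketch(size_t size, bool fragmented)
{
	struct kv_buf buf = { NULL, BACKEND_NONE };

	buf.ptr = contig_alloc_noretry(size, fragmented);
	if (buf.ptr) {
		buf.from = BACKEND_CONTIG;
		return buf;
	}

	buf.ptr = virt_alloc(size);
	if (buf.ptr)
		buf.from = BACKEND_VIRT;
	return buf;
}

/* One free routine for both paths, so callers need not track the origin. */
void kv_free_sketch(struct kv_buf *buf)
{
	assert(buf != NULL);
	free(buf->ptr);
	buf->ptr = NULL;
	buf->from = BACKEND_NONE;
}
```

If the first attempt were allowed to retry indefinitely, the fallback branch would never run, which is problem a) from the cover letter.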