Re: [PATCH 0/6 v3] kvmalloc

2017-02-05 Thread Michal Hocko

Is there anything more to be done before this can get merged? I would
really like to target this to the next merge window. I already have some
more changes which depend on this.

-- 
Michal Hocko
SUSE Labs

Re: [PATCH 0/6 v3] kvmalloc

2017-01-30 Thread Daniel Borkmann


On 01/30/2017 05:28 PM, Michal Hocko wrote:
> On Mon 30-01-17 17:15:08, Daniel Borkmann wrote:
> > On 01/30/2017 08:56 AM, Michal Hocko wrote:
> > > On Fri 27-01-17 21:12:26, Daniel Borkmann wrote:
> > > > On 01/27/2017 11:05 AM, Michal Hocko wrote:
> > > > > On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
> > > [...]
> > > > > > So to answer your second email with the bpf and netfilter hunks, why
> > > > > > not replace them with kvmalloc() and __GFP_NORETRY flag and add that
> > > > > > big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> > > > > > is not harmful though has only /partial/ effect right now and that full
> > > > > > support needs to be implemented in future. That would still be better
> > > > > > than not having it, imo, and the FIXME would make expectations clear
> > > > > > to anyone reading that code.
> > > > >
> > > > > Well, we can do that, I just would like to prevent this (ab)use
> > > > > if there is no _real_ and _sensible_ usecase for it. Having a real bug
> > > >
> > > > Understandable.
> > > >
> > > > > report or a fallback mechanism you are mentioning above would justify
> > > > > the (ab)use IMHO. But that abuse would be documented properly and have a
> > > > > real reason to exist. That sounds like a better approach to me.
> > > > >
> > > > > But if you absolutely _insist_ I can change that.
> > > >
> > > > Yeah, please do (with a big FIXME comment as mentioned), this originally
> > > > came from a real bug report. Anyway, feel free to add my Acked-by then.
> > >
> > > Thanks! I will repost the whole series today.
> >
> > Looks like I got only Cc'ed on the cover letter of your v3 from today
> > (should have been v4 actually?).
>
> Yes
>
> > Anyway, I looked up the last patch
> > on lkml [1] and it seems you forgot the __GFP_NORETRY we talked about?
>
> I misread your response. I thought you were OK with the FIXME
> explanation.
>
> > At least that was what was discussed above (insisting on __GFP_NORETRY
> > plus FIXME comment) for providing my Acked-by then. Can you still fix
> > that up in a final respin?
>
> I will probably just drop that last patch instead. I am not convinced
> that we should bend the new API over and let people mimic that
> throughout the code. I have just seen too many examples of this pattern
> already.
>
> I would also like to prevent the next rebase, unless there are any issues
> with some patches of course.

Ok, I'm fine with that as well.

Thanks,
Daniel

Re: [PATCH 0/6 v3] kvmalloc

2017-01-30 Thread Daniel Borkmann


On 01/30/2017 08:56 AM, Michal Hocko wrote:
> On Fri 27-01-17 21:12:26, Daniel Borkmann wrote:
> > On 01/27/2017 11:05 AM, Michal Hocko wrote:
> > > On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
> [...]
> > > > So to answer your second email with the bpf and netfilter hunks, why
> > > > not replace them with kvmalloc() and __GFP_NORETRY flag and add that
> > > > big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> > > > is not harmful though has only /partial/ effect right now and that full
> > > > support needs to be implemented in future. That would still be better
> > > > than not having it, imo, and the FIXME would make expectations clear
> > > > to anyone reading that code.
> > >
> > > Well, we can do that, I just would like to prevent this (ab)use
> > > if there is no _real_ and _sensible_ usecase for it. Having a real bug
> >
> > Understandable.
> >
> > > report or a fallback mechanism you are mentioning above would justify
> > > the (ab)use IMHO. But that abuse would be documented properly and have a
> > > real reason to exist. That sounds like a better approach to me.
> > >
> > > But if you absolutely _insist_ I can change that.
> >
> > Yeah, please do (with a big FIXME comment as mentioned), this originally
> > came from a real bug report. Anyway, feel free to add my Acked-by then.
>
> Thanks! I will repost the whole series today.


Looks like I got only Cc'ed on the cover letter of your v3 from today
(should have been v4 actually?). Anyway, I looked up the last patch
on lkml [1] and it seems you forgot the __GFP_NORETRY we talked about?
At least that was what was discussed above (insisting on __GFP_NORETRY
plus FIXME comment) for providing my Acked-by then. Can you still fix
that up in a final respin?

Thanks again,
Daniel

  [1] https://lkml.org/lkml/2017/1/30/129

Re: [PATCH 0/6 v3] kvmalloc

2017-01-30 Thread Michal Hocko

On Mon 30-01-17 17:15:08, Daniel Borkmann wrote:
> On 01/30/2017 08:56 AM, Michal Hocko wrote:
> > On Fri 27-01-17 21:12:26, Daniel Borkmann wrote:
> > > On 01/27/2017 11:05 AM, Michal Hocko wrote:
> > > > On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
> > [...]
> > > > > > So to answer your second email with the bpf and netfilter hunks, why
> > > > > > not replace them with kvmalloc() and __GFP_NORETRY flag and add that
> > > > > > big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> > > > > > is not harmful though has only /partial/ effect right now and that full
> > > > > > support needs to be implemented in future. That would still be better
> > > > > > than not having it, imo, and the FIXME would make expectations clear
> > > > > > to anyone reading that code.
> > > > >
> > > > > Well, we can do that, I just would like to prevent this (ab)use
> > > > > if there is no _real_ and _sensible_ usecase for it. Having a real bug
> > > 
> > > Understandable.
> > > 
> > > > report or a fallback mechanism you are mentioning above would justify
> > > > the (ab)use IMHO. But that abuse would be documented properly and have a
> > > > real reason to exist. That sounds like a better approach to me.
> > > > 
> > > > But if you absolutely _insist_ I can change that.
> > > 
> > > Yeah, please do (with a big FIXME comment as mentioned), this originally
> > > came from a real bug report. Anyway, feel free to add my Acked-by then.
> > 
> > Thanks! I will repost the whole series today.
> 
> Looks like I got only Cc'ed on the cover letter of your v3 from today
> (should have been v4 actually?).

Yes

> Anyway, I looked up the last patch
> on lkml [1] and it seems you forgot the __GFP_NORETRY we talked about?

I misread your response. I thought you were OK with the FIXME
explanation.

> At least that was what was discussed above (insisting on __GFP_NORETRY
> plus FIXME comment) for providing my Acked-by then. Can you still fix
> that up in a final respin?

I will probably just drop that last patch instead. I am not convinced
that we should bend the new API over and let people mimic that
throughout the code. I have just seen too many examples of this pattern
already.

I would also like to prevent the next rebase, unless there are any issues
with some patches of course.
-- 
Michal Hocko
SUSE Labs

[PATCH 0/6 v3] kvmalloc

2017-01-30 Thread Michal Hocko

Hi,
this has been previously posted here [1] and it received quite some
feedback. As a result the number of patches has grown again. We are at
9 patches right now. I have rebased the series on top of the current
next-20170130. There were some changes since the last posting, namely
a7f6c1b63b86 ("AppArmor: Use GFP_KERNEL for __aa_kvmalloc().") which
dropped GFP_NOIO from __aa_kvmalloc and d407bd25a204 ("bpf: don't
trigger OOM killer under pressure with map alloc") which has created a
kvmalloc alternative for bpf code. Both have been changed to use the mm
kvmalloc but it is worth noting this dependency during the merge window.

I hope there are no further obstacles to have this merged into the mmotm
tree and go in in the next merge window.

Original cover:

There are many open coded kmalloc with vmalloc fallback instances in
the tree.  Most of them are not careful enough or simply do not care
about the underlying semantic of the kmalloc/page allocator which means
that a) some vmalloc fallbacks are basically unreachable because the
kmalloc part will keep retrying until it succeeds b) the page allocator
can invoke really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.

As can be seen, implementing kvmalloc requires quite an intimate
knowledge of the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.
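
To make the kmalloc-with-vmalloc-fallback pattern described above concrete,
here is a minimal userspace sketch in C. It is only a model under stated
assumptions: model_kmalloc(), model_vmalloc(), model_kvmalloc() and the
MODEL_* flags are hypothetical stand-ins and not kernel API, and the rule of
making the first attempt cheap to fail for multi-page requests is an
assumption about how such a helper would behave, not a quote from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for kernel concepts; none of this is kernel API. */
#define MODEL_PAGE_SIZE 4096u
#define MODEL_NORETRY   0x1u	/* models "give up instead of retrying" */
#define MODEL_NOWARN    0x2u	/* models "stay quiet on failure" */

struct model_buf {
	void *ptr;
	bool from_vmalloc;	/* true when the fallback supplied the memory */
};

/* Largest request the pretend contiguous allocator can satisfy right now. */
static size_t model_contig_limit = 2 * MODEL_PAGE_SIZE;

/*
 * Models kmalloc(): fails when the request needs more contiguous memory
 * than is available. The flags only document intent here, because this
 * model has no reclaim or OOM killer to skip.
 */
static void *model_kmalloc(size_t size, unsigned int flags)
{
	(void)flags;
	if (size > model_contig_limit)
		return NULL;
	return malloc(size);
}

/* Models vmalloc(): built from single pages, so fragmentation is harmless. */
static void *model_vmalloc(size_t size)
{
	return malloc(size);
}

/* Try the contiguous allocator first, then fall back for large requests. */
static struct model_buf model_kvmalloc(size_t size)
{
	struct model_buf buf = { NULL, false };
	unsigned int flags = 0;

	assert(size > 0);

	/* A fallback exists for multi-page requests, so fail the first try cheaply. */
	if (size > MODEL_PAGE_SIZE)
		flags |= MODEL_NORETRY | MODEL_NOWARN;

	buf.ptr = model_kmalloc(size, flags);
	if (buf.ptr || size <= MODEL_PAGE_SIZE)
		return buf;

	buf.ptr = model_vmalloc(size);
	buf.from_vmalloc = buf.ptr != NULL;
	return buf;
}

/* One free routine for both paths, so callers need not track the origin. */
static void model_kvfree(struct model_buf buf)
{
	free(buf.ptr);
}
```

With the contiguous limit set to two pages, a 100 byte request is served by
the first attempt, while a 16 page request goes to the fallback path instead
of retrying the contiguous allocation.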

Most callers I could find have been converted to use the helper
instead.  This is patch 5. There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well; Eric Dumazet
was not opposed [2] to converting them.

[1] http://lkml.kernel.org/r/20170112153717.28943-1-mho...@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.ca...@edumazet-glaptop3.roam.corp.google.com

Michal Hocko (9):
  mm: introduce kv[mz]alloc helpers
  mm: support __GFP_REPEAT in kvmalloc_node for >32kB
  rhashtable: simplify a strange allocation pattern
  ila: simplify a strange allocation pattern
  treewide: use kv[mz]alloc* rather than opencoded variants
  net: use kvmalloc with __GFP_REPEAT rather than open coded variant
  md: use kvmalloc rather than opencoded variant
  bcache: use kvmalloc
  net, bpf: use kvzalloc helper

 arch/s390/kvm/kvm-s390.c                           | 10 +---
 arch/x86/kvm/lapic.c                               |  4 +-
 arch/x86/kvm/page_track.c                          |  4 +-
 arch/x86/kvm/x86.c                                 |  4 +-
 crypto/lzo.c                                       |  4 +-
 drivers/acpi/apei/erst.c                           |  8 +--
 drivers/char/agp/generic.c                         |  8 +--
 drivers/gpu/drm/nouveau/nouveau_gem.c              |  4 +-
 drivers/md/bcache/super.c                          |  8 +--
 drivers/md/bcache/util.h                           | 12 +
 drivers/md/dm-ioctl.c                              | 13 ++---
 drivers/md/dm-stats.c                              |  7 +--
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h    |  3 --
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c | 29 ++-
 drivers/net/ethernet/chelsio/cxgb3/l2t.c           |  8 +--
 drivers/net/ethernet/chelsio/cxgb3/l2t.h           |  1 -
 drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c      | 12 ++---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h         |  3 --
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 10 ++--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c |  8 +--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c    | 31 ++--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c  | 13 +++--
 drivers/net/ethernet/chelsio/cxgb4/l2t.c           |  2 +-
 drivers/net/ethernet/chelsio/cxgb4/sched.c         | 12 ++---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c         |  9 ++--
 drivers/net/ethernet/mellanox/mlx4/mr.c            |  9 ++--
 drivers/nvdimm/dimm_devs.c                         |  5 +-
 .../staging/lustre/lnet/libcfs/linux/linux-mem.c   | 11 +
 drivers/vhost/net.c                                |  9 ++--
 drivers/vhost/vhost.c                              | 15 ++
 drivers/vhost/vsock.c                              |  9 ++--
 drivers/xen/evtchn.c                               | 14 +-
 fs/btrfs/ctree.c                                   |  9 ++--
 fs/btrfs/ioctl.c                                   |  9 ++--
 fs/btrfs/send.c                                    | 27 --
 fs/ceph/file.c                                     |  9 ++--
 fs/ext4/mballoc.c                                  |  2 +-
 fs/ext4/super.c                                    |  4 +-
 fs/f2fs/f2fs.h                                     | 20
 fs/f2fs/file.c                                     |  4 +-
 fs/f2fs/segment.c                                  | 14 +++---
 fs/select.c

Re: [PATCH 0/6 v3] kvmalloc

2017-01-30 Thread Michal Hocko

On Fri 27-01-17 21:12:26, Daniel Borkmann wrote:
> On 01/27/2017 11:05 AM, Michal Hocko wrote:
> > On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
[...]
> > > So to answer your second email with the bpf and netfilter hunks, why
> > > not replace them with kvmalloc() and __GFP_NORETRY flag and add that
> > > big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> > > is not harmful though has only /partial/ effect right now and that full
> > > support needs to be implemented in future. That would still be better
> > > than not having it, imo, and the FIXME would make expectations clear
> > > to anyone reading that code.
> > 
> > Well, we can do that, I just would like to prevent this (ab)use
> > if there is no _real_ and _sensible_ usecase for it. Having a real bug
> 
> Understandable.
> 
> > report or a fallback mechanism you are mentioning above would justify
> > the (ab)use IMHO. But that abuse would be documented properly and have a
> > real reason to exist. That sounds like a better approach to me.
> > 
> > But if you absolutely _insist_ I can change that.
> 
> Yeah, please do (with a big FIXME comment as mentioned), this originally
> came from a real bug report. Anyway, feel free to add my Acked-by then.

Thanks! I will repost the whole series today.

-- 
Michal Hocko
SUSE Labs

Re: [PATCH 0/6 v3] kvmalloc

2017-01-27 Thread Daniel Borkmann


On 01/27/2017 11:05 AM, Michal Hocko wrote:
> On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
> > On 01/26/2017 02:40 PM, Michal Hocko wrote:
> [...]
> > > But realistically, how big is this problem really? Is it really worth
> > > it? You said this is an admin only interface and admin can kill the
> > > machine by OOM and other means already.
> > >
> > > Moreover and I should probably mention it explicitly, your d407bd25a204b
> > > reduced the likelihood of OOM for another reason. kmalloc used GFP_USER
> > > previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
> > > could indeed hit the OOM e.g. due to memory fragmentation. It would be
> > > much harder to hit the OOM killer from vmalloc which doesn't issue
> > > higher order allocation requests. Or have you ever seen the OOM killer
> > > pointing to the vmalloc fallback path?
> >
> > The case I was concerned about was from vmalloc() path, not kmalloc().
> > That was where the stack trace indicating OOM pointed to. As an example,
> > there could be really large allocation requests for maps where the map
> > has pre-allocated memory for its elements. Thus, if we get to the point
> > where we need to kill others due to shortage of mem for satisfying this,
> > I'd much much rather prefer to just not let vmalloc() work really hard
> > and fail early on instead.
>
> I see, but as already mentioned, chances are that by the time you get
> close to the OOM somebody else will hit the OOM before the vmalloc path
> manages to free the allocated memory.
>
> > In my (crafted) test case, I was connected
> > via ssh and it each time reliably killed my connection, which is really
> > suboptimal.
> >
> > F.e., I could also imagine a buggy or miscalculated map definition for
> > a prog that is provisioned to multiple places, which then accidentally
> > triggers this. Or if large on purpose, but we crossed the line, it
> > could be handled more gracefully, f.e. I could imagine an option to
> > fall back to a non-pre-allocated map flavor from the application
> > loading the program. Trade-off for sure, but still allowing it to
> > operate up to a certain extent. Granted, if vmalloc() succeeded without
> > trying hard and we then OOM elsewhere, too bad, but we don't have much
> > control over that one anyway, only about our own request. Reason I
> > asked above was whether having __GFP_NORETRY in would be fatal
> > somewhere down the path, but seems not as you say.
> >
> > So to answer your second email with the bpf and netfilter hunks, why
> > not replace them with kvmalloc() and __GFP_NORETRY flag and add that
> > big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> > is not harmful though has only /partial/ effect right now and that full
> > support needs to be implemented in future. That would still be better
> > than not having it, imo, and the FIXME would make expectations clear
> > to anyone reading that code.
>
> Well, we can do that, I just would like to prevent this (ab)use
> if there is no _real_ and _sensible_ usecase for it. Having a real bug

Understandable.

> report or a fallback mechanism you are mentioning above would justify
> the (ab)use IMHO. But that abuse would be documented properly and have a
> real reason to exist. That sounds like a better approach to me.
>
> But if you absolutely _insist_ I can change that.

Yeah, please do (with a big FIXME comment as mentioned), this originally
came from a real bug report. Anyway, feel free to add my Acked-by then.

Thanks again,
Daniel
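
The graceful handling mentioned in this message (fail the large
pre-allocation early, then let the loading application fall back to a
non-pre-allocated map flavor) could look roughly like the userspace sketch
below. Everything in it is hypothetical and for illustration only:
struct model_map, model_alloc_fail_early(), model_map_create(),
model_load_map() and the 1 MiB budget are assumptions, not the bpf API or
anything taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of a map; not the bpf data structure. */
struct model_map {
	void *elems;		/* NULL when elements are allocated on demand */
	size_t max_entries;
	size_t elem_size;
	bool preallocated;
};

/* Pretend budget above which the fail-early allocator refuses a request. */
static size_t model_budget = 1u << 20;

/* Models an allocation that fails early instead of pushing toward OOM. */
static void *model_alloc_fail_early(size_t bytes)
{
	if (bytes > model_budget)
		return NULL;
	return calloc(1, bytes);
}

/* Create a map; with prealloc set, all element memory is requested up front. */
static bool model_map_create(struct model_map *map, size_t max_entries,
			     size_t elem_size, bool prealloc)
{
	assert(max_entries > 0 && elem_size > 0);

	map->elems = NULL;
	map->max_entries = max_entries;
	map->elem_size = elem_size;
	map->preallocated = false;

	if (!prealloc)
		return true;	/* elements would be allocated one by one later */

	if (max_entries > SIZE_MAX / elem_size)
		return false;	/* the size computation would overflow */

	map->elems = model_alloc_fail_early(max_entries * elem_size);
	if (!map->elems)
		return false;

	map->preallocated = true;
	return true;
}

/* Loader-side policy: prefer the pre-allocated flavor, degrade on failure. */
static bool model_load_map(struct model_map *map, size_t max_entries,
			   size_t elem_size)
{
	if (model_map_create(map, max_entries, elem_size, true))
		return true;
	return model_map_create(map, max_entries, elem_size, false);
}
```

In this model a 64 KiB map is pre-allocated, while a 64 MiB request exceeds
the budget, fails early, and is retried as a non-pre-allocated map. The
trade-off is the one named in the message: the program keeps operating, up
to a certain extent, without the guarantees of pre-allocation.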

Re: [PATCH 0/6 v3] kvmalloc

2017-01-27 Thread Michal Hocko

On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
> On 01/26/2017 02:40 PM, Michal Hocko wrote:
[...]
> > But realistically, how big is this problem really? Is it really worth
> > it? You said this is an admin only interface and admin can kill the
> > machine by OOM and other means already.
> > 
> > Moreover and I should probably mention it explicitly, your d407bd25a204b
> > reduced the likelihood of OOM for another reason. kmalloc used GFP_USER
> > previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
> > could indeed hit the OOM e.g. due to memory fragmentation. It would be
> > much harder to hit the OOM killer from vmalloc which doesn't issue
> > higher order allocation requests. Or have you ever seen the OOM killer
> > pointing to the vmalloc fallback path?
> 
> The case I was concerned about was from vmalloc() path, not kmalloc().
> That was where the stack trace indicating OOM pointed to. As an example,
> there could be really large allocation requests for maps where the map
> has pre-allocated memory for its elements. Thus, if we get to the point
> where we need to kill others due to shortage of mem for satisfying this,
> I'd much much rather prefer to just not let vmalloc() work really hard
> and fail early on instead. 

I see, but as already mentioned, chances are that by the time you get
close to the OOM somebody else will hit the OOM before the vmalloc path
manages to free the allocated memory.

> In my (crafted) test case, I was connected
> via ssh and it each time reliably killed my connection, which is really
> suboptimal.
> 
> F.e., I could also imagine a buggy or miscalculated map definition for
> a prog that is provisioned to multiple places, which then accidentally
> triggers this. Or if large on purpose, but we crossed the line, it
> could be handled more gracefully, f.e. I could imagine an option to
> fall back to a non-pre-allocated map flavor from the application
> loading the program. Trade-off for sure, but still allowing it to
> operate up to a certain extent. Granted, if vmalloc() succeeded without
> trying hard and we then OOM elsewhere, too bad, but we don't have much
> control over that one anyway, only about our own request. Reason I
> asked above was whether having __GFP_NORETRY in would be fatal
> somewhere down the path, but seems not as you say.
> 
> So to answer your second email with the bpf and netfilter hunks, why
> not replace them with kvmalloc() and the __GFP_NORETRY flag and add that
> big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> is not harmful though has only /partial/ effect right now and that full
> support needs to be implemented in future. That would still be better
> than not having it, imo, and the FIXME would make expectations clear
> to anyone reading that code.

Well, we can do that, I would just like to prevent this (ab)use
if there is no _real_ and _sensible_ usecase for it. Having a real bug
report or a fallback mechanism you are mentioning above would justify
the (ab)use IMHO. But that abuse would be documented properly and have a
real reason to exist. That sounds like a better approach to me.

But if you absolutely _insist_ I can change that.
-- 
Michal Hocko
SUSE Labs

Re: [PATCH 0/6 v3] kvmalloc

2017-01-26 Thread Daniel Borkmann


On 01/26/2017 02:40 PM, Michal Hocko wrote:
> On Thu 26-01-17 14:10:06, Daniel Borkmann wrote:
> > On 01/26/2017 12:58 PM, Michal Hocko wrote:
> > > On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
> > > > On 01/26/2017 11:08 AM, Michal Hocko wrote:
> > > [...]
> > > > > If you disagree I can drop the bpf part of course...
> > > >
> > > > If we could consolidate these spots with kvmalloc() eventually, I'm
> > > > all for it. But even if __GFP_NORETRY is not covered down to all
> > > > possible paths, it kind of does have an effect already of saying
> > > > 'don't try too hard', so would it be harmful to still keep that for
> > > > now? If it's not, I'd personally prefer to just leave it as is until
> > > > there's some form of support by kvmalloc() and friends.
> > >
> > > Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
> > > disallowed. It is not _supported_ which means that if it doesn't work as
> > > you expect you are on your own. Which is actually the situation right
> > > now as well. But I still think that this is just not right thing to do.
> > > Even though it might happen to work in some cases it gives a false
> > > impression of a solution. So I would rather go with
> >
> > Hmm. 'On my own' means, we could potentially BUG somewhere down the
> > vmalloc implementation, etc, presumably? So it might in-fact be
> > harmful to pass that, right?
>
> No it would mean that it might eventually hit the behavior which you are
> trying to avoid - in other words it may invoke OOM killer even though
> __GFP_NORETRY means giving up before any system wide disruptive actions
> are taken.

Ok, thanks for clarifying, more on that further below.

> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index 8697f43cf93c..a6dc4d596f14 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
> > >
> > >   void *bpf_map_area_alloc(size_t size)
> > >   {
> > > +   /*
> > > +    * FIXME: we would really like to not trigger the OOM killer and rather
> > > +    * fail instead. This is not supported right now. Please nag MM people
> > > +    * if these OOM start bothering people.
> > > +    */
> >
> > Ok, I know this is out of scope for this series, but since i) this
> > is _not_ the _only_ spot right now which has such a construct and ii)
> > I am already kind of nagging a bit ;), my question would be, what
> > would it take to start supporting it?
>
> Propagate the gfp mask all the way down from vmalloc to all places which
> might allocate down the path; the page table allocation functions
> especially are a PITA because they are really deep. This is a lot of work...
>
> But realistically, how big is this problem really? Is it really worth
> it? You said this is an admin only interface and admin can kill the
> machine by OOM and other means already.
>
> Moreover and I should probably mention it explicitly, your d407bd25a204b
> reduced the likelihood of oom for another reason. kmalloc used GFP_USER
> previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
> could indeed hit the OOM e.g. due to memory fragmentation. It would be
> much harder to hit the OOM killer from vmalloc which doesn't issue
> higher order allocation requests. Or have you ever seen the OOM killer
> pointing to the vmalloc fallback path?


The case I was concerned about was from vmalloc() path, not kmalloc().
That was where the stack trace indicating OOM pointed to. As an example,
there could be really large allocation requests for maps where the map
has pre-allocated memory for its elements. Thus, if we get to the point
where we need to kill others due to shortage of mem for satisfying this,
I'd much much rather prefer to just not let vmalloc() work really hard
and fail early on instead. In my (crafted) test case, I was connected
via ssh and it each time reliably killed my connection, which is really
suboptimal.

F.e., I could also imagine a buggy or miscalculated map definition for
a prog that is provisioned to multiple places, which then accidentally
triggers this. Or if large on purpose, but we crossed the line, it
could be handled more gracefully, f.e. I could imagine an option to
fall back to a non-pre-allocated map flavor from the application
loading the program. Trade-off for sure, but still allowing it to
operate up to a certain extent. Granted, if vmalloc() succeeded without
trying hard and we then OOM elsewhere, too bad, but we don't have much
control over that one anyway, only about our own request. Reason I
asked above was whether having __GFP_NORETRY in would be fatal
somewhere down the path, but seems not as you say.

So to answer your second email with the bpf and netfilter hunks, why
not replace them with kvmalloc() and the __GFP_NORETRY flag and add that
big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
is not harmful though has only /partial/ effect right now and that full
support needs to be implemented in future. That would still be better
than not having it, imo, and the FIXME would make expectations clear
to anyone reading that code.

Thanks,
Daniel
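
The "/partial/ effect" point above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the SKETCH_* flag bits and the sketch_* functions are hypothetical names invented for the illustration, not the kernel's gfp_t values or vmalloc internals. It shows how a caller's "don't retry" request is lost once a nested allocation uses a hard-coded mask.

```c
#include <stddef.h>

/* Hypothetical flag bits for this sketch only; not the kernel's gfp_t values. */
#define SKETCH_GFP_KERNEL  0x1u
#define SKETCH_GFP_NORETRY 0x2u

/* Counts nested allocations that were made without the NORETRY bit. */
static unsigned int sketch_unrestricted_allocs;

static void sketch_note_alloc(unsigned int flags)
{
	if (!(flags & SKETCH_GFP_NORETRY))
		sketch_unrestricted_allocs++;
}

/* Models an allocation deep in the call chain that hard-codes its mask. */
static void sketch_alloc_page_table(void)
{
	sketch_note_alloc(SKETCH_GFP_KERNEL);
}

/* Models the data page allocations, which do honor the caller's mask. */
static void sketch_alloc_data_pages(unsigned int flags, size_t nr_pages)
{
	size_t i;

	for (i = 0; i < nr_pages; i++)
		sketch_note_alloc(flags);
}

/* Models a vmalloc-like call: the mask reaches only part of the work. */
static void sketch_vmalloc(unsigned int flags, size_t nr_pages)
{
	sketch_alloc_data_pages(flags, nr_pages);
	sketch_alloc_page_table();
}
```

In this model a call with SKETCH_GFP_NORETRY still leaves one nested allocation unrestricted, which is the sense in which the flag is honored only partially until the mask is propagated everywhere.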

Re: [PATCH 0/6 v3] kvmalloc

2017-01-26 Thread Michal Hocko

On Thu 26-01-17 14:40:04, Michal Hocko wrote:
> On Thu 26-01-17 14:10:06, Daniel Borkmann wrote:
> > On 01/26/2017 12:58 PM, Michal Hocko wrote:
> > > On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
> > > > On 01/26/2017 11:08 AM, Michal Hocko wrote:
> > > [...]
> > > > > If you disagree I can drop the bpf part of course...
> > > > 
> > > > If we could consolidate these spots with kvmalloc() eventually, I'm
> > > > all for it. But even if __GFP_NORETRY is not covered down to all
> > > > possible paths, it kind of does have an effect already of saying
> > > > 'don't try too hard', so would it be harmful to still keep that for
> > > > now? If it's not, I'd personally prefer to just leave it as is until
> > > > there's some form of support by kvmalloc() and friends.
> > > 
> > > Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
> > > disallowed. It is not _supported_ which means that if it doesn't work as
> > > you expect you are on your own. Which is actually the situation right
> > > now as well. But I still think that this is just not right thing to do.
> > > Even though it might happen to work in some cases it gives a false
> > > impression of a solution. So I would rather go with
> > 
> > Hmm. 'On my own' means, we could potentially BUG somewhere down the
> > vmalloc implementation, etc, presumably? So it might in-fact be
> > harmful to pass that, right?
> 
> No it would mean that it might eventually hit the behavior which you are
> trying to avoid - in other words it may invoke OOM killer even though
> __GFP_NORETRY means giving up before any system wide disruptive actions
> are taken.

I will separate both bpf and netfilter hunks into their own patch with the
clarification. Does the following look better?
---
From ab6b2d724228e4abcc69c44f5ab1ce91009aa91d Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Thu, 26 Jan 2017 14:59:21 +0100
Subject: [PATCH] net, bpf: use kvzalloc helper

Both bpf_map_area_alloc and xt_alloc_table_info try really hard to
play nicely with large memory requests which can be triggered from
userspace (by an admin). See 5bad87348c70 ("netfilter: x_tables:
avoid warn and OOM killer on vmalloc call") resp. d407bd25a204 ("bpf:
don't trigger OOM killer under pressure with map alloc").

The current allocation pattern strongly resembles the kvmalloc helper
except for one thing: __GFP_NORETRY is not used for the vmalloc fallback.
The main reason why kvmalloc doesn't really support __GFP_NORETRY is
because vmalloc doesn't support this flag properly and it is far from
straightforward to make it understand it because there are some hard
coded GFP_KERNEL allocations deep in the call chains. This patch simply
replaces the open coded variants with kvmalloc and puts a note to
push on MM people to support __GFP_NORETRY in kvmalloc if this turns out
to be really needed along with OOM report pointing at vmalloc.

If there is an immediate need and no full support yet then
kvmalloc(size, gfp | __GFP_NORETRY)
will work as well as __vmalloc(gfp | __GFP_NORETRY) - in other words it
might trigger the OOM in some cases.

Cc: Daniel Borkmann 
Cc: Alexei Starovoitov 
Cc: Andrey Konovalov 
Cc: Marcelo Ricardo Leitner 
Cc: Pablo Neira Ayuso 
Signed-off-by: Michal Hocko 
---
 kernel/bpf/syscall.c | 19 +--
 net/netfilter/x_tables.c | 16 ++--
 2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 19b6129eab23..a6dc4d596f14 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -53,21 +53,12 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
 
 void *bpf_map_area_alloc(size_t size)
 {
-   /* We definitely need __GFP_NORETRY, so OOM killer doesn't
-* trigger under memory pressure as we really just want to
-* fail instead.
+   /*
+* FIXME: we would really like to not trigger the OOM killer and rather
+* fail instead. This is not supported right now. Please nag MM people
+* if these OOM start bothering people.
 */
-   const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
-   void *area;
-
-   if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
-   area = kmalloc(size, GFP_USER | flags);
-   if (area != NULL)
-   return area;
-   }
-
-   return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
-PAGE_KERNEL);
+   return kvzalloc(size, GFP_USER);
 }
 
 void bpf_map_area_free(void *area)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index d529989f5791..ba8ba633da72 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -995,16 +995,12 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
return NULL;
 
-   if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
-   info = kmalloc(sz,
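
For readers following the diff, here is a minimal user-space sketch of the open-coded pattern that the patch removes: try a physically contiguous allocation for requests up to PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER, then fall back to a vmalloc-style allocation. The sketch_* names, the failure knob and the use of calloc() are assumptions made for illustration only; they merely model kmalloc() and __vmalloc(), and the 4 KiB page size is assumed.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE    4096UL /* assumed page size */
#define SKETCH_COSTLY_ORDER 3      /* PAGE_ALLOC_COSTLY_ORDER in mainline */

enum sketch_path { SKETCH_NONE, SKETCH_KMALLOC, SKETCH_VMALLOC };

/* Knob for experiments: make the contiguous attempt fail, as fragmentation might. */
static int sketch_fail_contiguous;

/* Hypothetical stand-in for a zeroing kmalloc(); calloc() only models it. */
static void *sketch_kzalloc(size_t size)
{
	return sketch_fail_contiguous ? NULL : calloc(1, size);
}

/* Hypothetical stand-in for a zeroing vmalloc(). */
static void *sketch_vzalloc(size_t size)
{
	return calloc(1, size);
}

/* Contiguous attempt for small sizes, then the virtually contiguous fallback. */
static void *sketch_area_alloc(size_t size, enum sketch_path *path)
{
	void *area;

	*path = SKETCH_NONE;
	if (size <= (SKETCH_PAGE_SIZE << SKETCH_COSTLY_ORDER)) {
		area = sketch_kzalloc(size);
		if (area != NULL) {
			*path = SKETCH_KMALLOC;
			return area;
		}
	}

	area = sketch_vzalloc(size);
	if (area != NULL)
		*path = SKETCH_VMALLOC;
	return area;
}
```

The control flow is the part a kvmalloc-style helper folds into a single call; which gfp flags each branch passes is exactly what the thread is debating and is deliberately not modeled here.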

Re: [PATCH 0/6 v3] kvmalloc

2017-01-26 Thread Michal Hocko

On Thu 26-01-17 14:10:06, Daniel Borkmann wrote:
> On 01/26/2017 12:58 PM, Michal Hocko wrote:
> > On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
> > > On 01/26/2017 11:08 AM, Michal Hocko wrote:
> > [...]
> > > > If you disagree I can drop the bpf part of course...
> > > 
> > > If we could consolidate these spots with kvmalloc() eventually, I'm
> > > all for it. But even if __GFP_NORETRY is not covered down to all
> > > possible paths, it kind of does have an effect already of saying
> > > 'don't try too hard', so would it be harmful to still keep that for
> > > now? If it's not, I'd personally prefer to just leave it as is until
> > > there's some form of support by kvmalloc() and friends.
> > 
> > Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
> > disallowed. It is not _supported_ which means that if it doesn't work as
> > you expect you are on your own. Which is actually the situation right
> > now as well. But I still think that this is just not right thing to do.
> > Even though it might happen to work in some cases it gives a false
> > impression of a solution. So I would rather go with
> 
> Hmm. 'On my own' means, we could potentially BUG somewhere down the
> vmalloc implementation, etc, presumably? So it might in-fact be
> harmful to pass that, right?

No it would mean that it might eventually hit the behavior which you are
trying to avoid - in other words it may invoke OOM killer even though
__GFP_NORETRY means giving up before any system wide disruptive actions
are taken.

> 
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 8697f43cf93c..a6dc4d596f14 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
> > 
> >   void *bpf_map_area_alloc(size_t size)
> >   {
> > +   /*
> > +* FIXME: we would really like to not trigger the OOM killer and rather
> > +* fail instead. This is not supported right now. Please nag MM people
> > +* if these OOM start bothering people.
> > +*/
> 
> Ok, I know this is out of scope for this series, but since i) this
> is _not_ the _only_ spot right now which has such a construct and ii)
> I am already kind of nagging a bit ;), my question would be, what
> would it take to start supporting it?

Propagate the gfp mask all the way down from vmalloc to all places which
might allocate down the path; the page table allocation functions
especially are a PITA because they are really deep. This is a lot of work...

But realistically, how big is this problem really? Is it really worth
it? You said this is an admin only interface and admin can kill the
machine by OOM and other means already.

Moreover and I should probably mention it explicitly, your d407bd25a204b
reduced the likelihood of oom for another reason. kmalloc used GFP_USER
previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
could indeed hit the OOM e.g. due to memory fragmentation. It would be
much harder to hit the OOM killer from vmalloc which doesn't issue
higher order allocation requests. Or have you ever seen the OOM killer
pointing to the vmalloc fallback path?
-- 
Michal Hocko
SUSE Labs
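
The order range mentioned above (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER) can be made concrete with a small user-space calculation. This is a sketch under assumptions: a 4 KiB page size, the mainline value of 3 for PAGE_ALLOC_COSTLY_ORDER, and hypothetical sketch_* helper names that are not kernel functions.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE    4096UL /* assumed page size */
#define SKETCH_COSTLY_ORDER 3      /* PAGE_ALLOC_COSTLY_ORDER in mainline */

/* Smallest order such that (page size << order) covers the request. */
static unsigned int sketch_order_for_size(size_t size)
{
	unsigned int order = 0;
	size_t span = SKETCH_PAGE_SIZE;

	/* The second condition keeps the shift from overflowing. */
	while (span < size && span <= SIZE_MAX / 2) {
		span <<= 1;
		order++;
	}
	return order;
}

/* True for the range named in the mail: order > 0 && order <= costly order. */
static int sketch_in_costly_window(size_t size)
{
	unsigned int order = sketch_order_for_size(size);

	return order > 0 && order <= SKETCH_COSTLY_ORDER;
}
```

With these assumptions a contiguous request between 4097 and 32768 bytes falls into that range, whereas a vmalloc-style allocation of the same size is built from order-0 pages.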

Re: [PATCH 0/6 v3] kvmalloc

2017-01-26 Thread Daniel Borkmann


On 01/26/2017 12:58 PM, Michal Hocko wrote:

On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:

On 01/26/2017 11:08 AM, Michal Hocko wrote:

[...]

If you disagree I can drop the bpf part of course...


If we could consolidate these spots with kvmalloc() eventually, I'm
all for it. But even if __GFP_NORETRY is not covered down to all
possible paths, it kind of does have an effect already of saying
'don't try too hard', so would it be harmful to still keep that for
now? If it's not, I'd personally prefer to just leave it as is until
there's some form of support by kvmalloc() and friends.


Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
disallowed. It is not _supported_ which means that if it doesn't work as
you expect you are on your own. Which is actually the situation right
now as well. But I still think that this is just not the right thing to do.
Even though it might happen to work in some cases it gives a false
impression of a solution. So I would rather go with


Hmm. 'On my own' means, we could potentially BUG somewhere down the
vmalloc implementation, etc., presumably? So it might in fact be
harmful to pass that, right?


diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8697f43cf93c..a6dc4d596f14 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)

  void *bpf_map_area_alloc(size_t size)
  {
+   /*
+* FIXME: we would really like to not trigger the OOM killer and rather
+* fail instead. This is not supported right now. Please nag MM people
+* if these OOM start bothering people.
+*/


Ok, I know this is out of scope for this series, but since i) this
is _not_ the _only_ spot right now which has such a construct and ii)
I am already kind of nagging a bit ;), my question would be, what
would it take to start supporting it?


return kvzalloc(size, GFP_USER);
  }


Thanks,
Daniel

Re: [PATCH 0/6 v3] kvmalloc

2017-01-26 Thread Michal Hocko

On Thu 26-01-17 04:14:37, Joe Perches wrote:
> On Thu, 2017-01-26 at 11:32 +0100, Michal Hocko wrote:
> > So I have folded the following to the patch 1. It is in line with
> > kvmalloc and hopefully at least tell more than the current code.
> []
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> []
> > @@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> >   * Allocate enough pages to cover @size from the page level
> >   * allocator with @gfp_mask flags.  Map them into contiguous
> >   * kernel virtual space, using a pagetable protection of @prot.
> > + *
> > + * Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
> > + * and __GFP_NOFAIL are not supported
> 
> Maybe add a BUILD_BUG or a WARN_ON_ONCE to catch new occurrences?

I would really like to not touch vmalloc in this series.

-- 
Michal Hocko
SUSE Labs

Re: [PATCH 0/6 v3] kvmalloc

2017-01-26 Thread Joe Perches

On Thu, 2017-01-26 at 11:32 +0100, Michal Hocko wrote:
> So I have folded the following to the patch 1. It is in line with
> kvmalloc and hopefully at least tell more than the current code.
[]
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
[]
> @@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>   *   Allocate enough pages to cover @size from the page level
>   *   allocator with @gfp_mask flags.  Map them into contiguous
>   *   kernel virtual space, using a pagetable protection of @prot.
> + *
> + *   Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
> + *   and __GFP_NOFAIL are not supported

Maybe add a BUILD_BUG or a WARN_ON_ONCE to catch new occurrences?
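
As a rough illustration of the warn-once idea suggested above, here is a user-space sketch under stated assumptions. The flag bit values and all model_* names are hypothetical and do not match the kernel's encoding; fprintf stands in for the kernel's warning machinery. Like WARN_ON_ONCE, the helper returns the condition every time but reports it only on the first hit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical bit values for this model only. */
typedef unsigned int model_gfp_t;
#define MODEL_GFP_NORETRY (1u << 0)
#define MODEL_GFP_REPEAT  (1u << 1)
#define MODEL_GFP_NOFAIL  (1u << 2)
#define MODEL_GFP_ZERO    (1u << 3)

/* The reclaim modifiers the proposed comment lists as unsupported. */
#define MODEL_UNSUPPORTED_RECLAIM \
	(MODEL_GFP_NORETRY | MODEL_GFP_REPEAT | MODEL_GFP_NOFAIL)

/* Returns true whenever the mask carries an unsupported modifier;
 * prints a diagnostic only the first time that happens. */
static bool model_warn_unsupported_once(model_gfp_t mask)
{
	static bool warned;
	bool bad = (mask & MODEL_UNSUPPORTED_RECLAIM) != 0;

	/* Model sanity check: the zeroing flag must never count as unsupported. */
	assert((MODEL_UNSUPPORTED_RECLAIM & MODEL_GFP_ZERO) == 0);

	if (bad && !warned) {
		warned = true;
		fprintf(stderr, "unsupported reclaim modifier in mask %#x\n", mask);
	}
	return bad;
}
```

A compile-time check would only help for masks that are compile-time constants, so a runtime check of this shape is probably the closer fit for catching new callers.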

Re: [PATCH 0/6 v3] kvmalloc

2017-01-26 Thread Michal Hocko

On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
> On 01/26/2017 11:08 AM, Michal Hocko wrote:
[...]
> > If you disagree I can drop the bpf part of course...
> 
> If we could consolidate these spots with kvmalloc() eventually, I'm
> all for it. But even if __GFP_NORETRY is not covered down to all
> possible paths, it kind of does have an effect already of saying
> 'don't try too hard', so would it be harmful to still keep that for
> now? If it's not, I'd personally prefer to just leave it as is until
> there's some form of support by kvmalloc() and friends.

Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
disallowed. It is not _supported_ which means that if it doesn't work as
you expect you are on your own. Which is actually the situation right
now as well. But I still think that this is just not the right thing to do.
Even though it might happen to work in some cases it gives a false
impression of a solution. So I would rather go with
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8697f43cf93c..a6dc4d596f14 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
 
 void *bpf_map_area_alloc(size_t size)
 {
+	/*
+	 * FIXME: we would really like to not trigger the OOM killer and rather
+	 * fail instead. This is not supported right now. Please nag MM people
+	 * if these OOM start bothering people.
+	 */
 	return kvzalloc(size, GFP_USER);
 }
 

-- 
Michal Hocko
SUSE Labs
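
For readers unfamiliar with the helper being discussed, the following is a user-space model of a "try contiguous first, fall back to vmalloc" allocator. It is a sketch under assumptions, not the code from this series: the flag values, the size threshold and all model_* names are hypothetical, malloc() stands in for both kernel allocators, and the choice to add a no-retry hint to the first attempt for multi-page requests is an assumption about how such a helper would typically behave.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bit values and limits for this model only. */
typedef unsigned int model_gfp_t;
#define MODEL_GFP_KERNEL   (1u << 0)
#define MODEL_GFP_NORETRY  (1u << 1)
#define MODEL_GFP_NOWARN   (1u << 2)
#define MODEL_PAGE_SIZE    4096u
/* Above this size the contiguous stand-in fails, mimicking fragmentation. */
#define MODEL_CONTIG_LIMIT (8u * MODEL_PAGE_SIZE)

struct model_alloc {
	void *ptr;
	bool from_fallback; /* true if the vmalloc stand-in served the request */
};

/* Mask used for the first, contiguous attempt (assumed behaviour). */
static model_gfp_t model_first_attempt_mask(size_t size, model_gfp_t mask)
{
	if (size > MODEL_PAGE_SIZE)
		mask |= MODEL_GFP_NORETRY | MODEL_GFP_NOWARN;
	return mask;
}

/* Stand-in for kmalloc(): refuses large requests. */
static void *model_contig_alloc(size_t size, model_gfp_t mask)
{
	(void)mask;
	return size > MODEL_CONTIG_LIMIT ? NULL : malloc(size);
}

/* Stand-in for vmalloc(): receives the caller's original mask. */
static void *model_virt_alloc(size_t size, model_gfp_t mask)
{
	(void)mask;
	return malloc(size);
}

static struct model_alloc model_kvmalloc(size_t size, model_gfp_t mask)
{
	struct model_alloc a = { NULL, false };

	assert(size > 0);
	a.ptr = model_contig_alloc(size, model_first_attempt_mask(size, mask));
	/* Sub-page requests never fall back in this model. */
	if (a.ptr || size <= MODEL_PAGE_SIZE)
		return a;

	a.ptr = model_virt_alloc(size, mask);
	a.from_fallback = a.ptr != NULL;
	return a;
}

/* Convenience wrapper: allocate, note which path served it, then free. */
static bool model_uses_fallback(size_t size)
{
	struct model_alloc a = model_kvmalloc(size, MODEL_GFP_KERNEL);
	bool fallback = a.from_fallback;

	free(a.ptr);
	return fallback;
}
```

The point relevant to this thread is the last step: whatever mask the caller passed is handed to the fallback unchanged, so a caller-supplied __GFP_NORETRY is only as effective as the fallback path makes it.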

Re: [PATCH 0/6 v3] kvmalloc

2017-01-26 Thread Michal Hocko

On Thu 26-01-17 12:04:13, Daniel Borkmann wrote:
> On 01/26/2017 11:32 AM, Michal Hocko wrote:
> > On Thu 26-01-17 11:08:02, Michal Hocko wrote:
> > > On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
> > > > On 01/26/2017 08:43 AM, Michal Hocko wrote:
> > > > > On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> > > [...]
> > > > > > I assume that kvzalloc() is still the same from [1], right? If so, then
> > > > > > it would unfortunately (partially) reintroduce the issue that was fixed.
> > > > > > If you look above at flags, they're also passed to __vmalloc() to not
> > > > > > trigger OOM in these situations I've experienced.
> > > > > 
> > > > > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > > > > think it would. It can still trigger the OOM killer because the flags
> > > > > are not propagated all the way down to all allocation requests (e.g.
> > > > > page tables). This is the same reason why GFP_NOFS is not supported in
> > > > > vmalloc.
> > > > 
> > > > Ok, good to know, is that somewhere clearly documented (like for the
> > > > case with kmalloc())?
> > > 
> > > I am afraid that we really suck on this front. I will add something.
> > 
> > So I have folded the following to the patch 1. It is in line with
> > kvmalloc and hopefully at least tell more than the current code.
> > ---
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index d89034a393f2..6c1aa2c68887 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> >   * Allocate enough pages to cover @size from the page level
> >   * allocator with @gfp_mask flags.  Map them into contiguous
> >   * kernel virtual space, using a pagetable protection of @prot.
> > + *
> > + * Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
> > + * and __GFP_NOFAIL are not supported
> 
> We could probably also mention that __GFP_ZERO in @gfp_mask is
> supported, though.

There are others which would be supported so I would rather stay with
explicit unsupported.

> 
> > + * Any use of gfp flags outside of GFP_KERNEL should be consulted
> > + * with mm people.
> 
> Just a question: should that read 'GFP_KERNEL | __GFP_HIGHMEM' as
> that is what vmalloc() resp. vzalloc() and others pass as flags?

Yes, even though I think that specifying __GFP_HIGHMEM shouldn't really be
necessary. Are there any users who would really insist on vmalloc
pages in lowmem? Anyway, this made me recheck the kvmalloc_node
implementation and I am not adding this flag, which would mean a
regression from the current state. Will fix it up.
-- 
Michal Hocko
SUSE Labs

Re: [PATCH 0/6 v3] kvmalloc

2017-01-26 Thread Daniel Borkmann


On 01/26/2017 11:08 AM, Michal Hocko wrote:

On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:

On 01/26/2017 08:43 AM, Michal Hocko wrote:

On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:

[...]

I assume that kvzalloc() is still the same from [1], right? If so, then
it would unfortunately (partially) reintroduce the issue that was fixed.
If you look above at flags, they're also passed to __vmalloc() to not
trigger OOM in these situations I've experienced.


Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
think it would. It can still trigger the OOM killer because the flags
are not propagated all the way down to all allocation requests (e.g.
page tables). This is the same reason why GFP_NOFS is not supported in
vmalloc.


Ok, good to know, is that somewhere clearly documented (like for the
case with kmalloc())?


I am afraid that we really suck on this front. I will add something.


Thanks for doing that, much appreciated!


If not, could we do that for non-mm folks, or
at least add a similar WARN_ON_ONCE() as you did for kvmalloc() to make
it obvious to users that a given flag combination is not supported all
the way down?


I am not sure that triggering a warning that somebody has used
__GFP_NOWARN is very helpful ;). I also do not think that covering all the
supported flags is really feasible. Most of them will not have bad side
effects. I have added the warning because this API is new and I wanted
to catch new abusers. Old ones would have to die slowly.


Okay, makes sense then. Just the kdoc comment from your other
mail should help fine already.


This is effectively the
same requirement as in other networking areas f.e. that 5bad87348c70
("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
In your comment in kvzalloc() you eventually say that some of the above
modifiers are not supported. So there would be two options, i) just leave
out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
it later (along with similar code from 5bad87348c70), or ii) implement
support for these modifiers as well to your original set. I guess it's not
too urgent, so we could also proceed with i) if that is easier for you to
proceed (I don't mind either way).


Could you clarify why the oom killer in vmalloc matters actually?


For both mentioned commits, (privileged) user space can potentially
create large allocation requests, where we thus switch to vmalloc()
flavor eventually and then OOM starts killing processes to try to
satisfy the allocation request. This is bad, because we want the
request to just fail instead as it's non-critical and f.e. not kill
ssh connection et al. Failing is totally fine in this case, whereas
triggering OOM is not.


I see your intention but does it really make any real difference?
Consider you would back off right before you would have OOMed. Any
parallel request would just hit the OOM for you. You are (almost) never
doing an allocation in isolation.


In my testing, __GFP_NORETRY did satisfy this
just fine, but as you say it seems it's not enough.


Yeah, ptes have most probably been populated already.


Given there are
multiple places like these in the kernel, could we instead add an
option such as __GFP_NOOOM, or just make __GFP_NORETRY supported?


As said above, I do not really think that suppressing the OOM killer
makes any difference because it might be just somebody else doing that
for you. Also the OOM killer is an MM internal implementation "detail"
users shouldn't really care about. I agree that callers should have a way
to say they do not want to try really hard and that is not that simple
for vmalloc, unfortunately. The main problem here is that gfp mask
propagation is not that easy to fix without a lot of code churn as some
of those hardcoded allocation requests are deep in call chains.


I see, that's unfortunate. I understand that there are requests
in parallel and that we might end up with OOM eventually if we're
unlucky, but having some way to tell vmalloc to just not try as
hard as usual would be nice.


I know this sucks and it would be great to support __GFP_NORETRY to
[k]vmalloc and maybe we will get there eventually. But in the meantime
I really think that using kvmalloc wherever possible is much better than
open coded variants with expectations which do not hold sometimes.


I totally agree with you that having kvmalloc() as helper is awesome
and probably long overdue as well. :)


If you disagree I can drop the bpf part of course...


If we could consolidate these spots with kvmalloc() eventually, I'm
all for it. But even if __GFP_NORETRY is not covered down to all
possible paths, it kind of does have an effect already of saying
'don't try too hard', so would it be harmful to still keep that for
now? If it's not, I'd personally prefer to just leave it as is until
there's some form of support by kvmalloc() and friends.

Thanks for your input, Michal!

Cheers,
Daniel
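
The propagation problem described in this exchange can be pictured with a small user-space model. This is a deliberately simplified sketch: the two-level call chain, the flag values and all model_* names are hypothetical, and the real vmalloc path is much deeper. It only shows the shape of the issue, namely that the caller's mask reaches the data page allocations while a deeper allocation (page tables in this model) uses a hardcoded mask.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for this model only. */
typedef unsigned int model_gfp_t;
#define MODEL_GFP_KERNEL  (1u << 0)
#define MODEL_GFP_NORETRY (1u << 1)

/* Masks actually seen at each level of the simplified call chain. */
struct model_trace {
	model_gfp_t data_pages_mask;
	model_gfp_t page_table_mask;
};

/* Data pages: the caller's mask reaches the page allocator. */
static void model_alloc_data_pages(struct model_trace *t, model_gfp_t mask)
{
	t->data_pages_mask = mask;
}

/* Page tables: no mask parameter exists here, so a fixed one is used. */
static void model_alloc_page_table(struct model_trace *t)
{
	t->page_table_mask = MODEL_GFP_KERNEL;
}

static struct model_trace model_vmalloc(model_gfp_t mask)
{
	struct model_trace t = { 0, 0 };

	assert(mask & MODEL_GFP_KERNEL);
	model_alloc_data_pages(&t, mask);
	model_alloc_page_table(&t); /* the caller's modifiers are lost here */
	return t;
}

/* True only if the no-retry hint reached every allocation in the chain. */
static bool model_noretry_fully_honoured(model_gfp_t mask)
{
	struct model_trace t = model_vmalloc(mask);

	return (t.data_pages_mask & MODEL_GFP_NORETRY) &&
	       (t.page_table_mask & MODEL_GFP_NORETRY);
}
```

Fixing this in the model would mean adding a mask parameter to model_alloc_page_table() and to every function between it and the entry point, which is the code churn referred to above.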

Re: [PATCH 0/6 v3] kvmalloc

2017-01-26 Thread Daniel Borkmann


On 01/26/2017 11:32 AM, Michal Hocko wrote:

On Thu 26-01-17 11:08:02, Michal Hocko wrote:

On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:

On 01/26/2017 08:43 AM, Michal Hocko wrote:

On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:

[...]

I assume that kvzalloc() is still the same from [1], right? If so, then
it would unfortunately (partially) reintroduce the issue that was fixed.
If you look above at flags, they're also passed to __vmalloc() to not
trigger OOM in these situations I've experienced.


Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
think it would. It can still trigger the OOM killer because the flags
are not propagated all the way down to all allocation requests (e.g.
page tables). This is the same reason why GFP_NOFS is not supported in
vmalloc.


Ok, good to know, is that somewhere clearly documented (like for the
case with kmalloc())?


I am afraid that we really suck on this front. I will add something.


So I have folded the following to the patch 1. It is in line with
kvmalloc and hopefully at least tell more than the current code.
---
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d89034a393f2..6c1aa2c68887 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
  * Allocate enough pages to cover @size from the page level
  * allocator with @gfp_mask flags.  Map them into contiguous
  * kernel virtual space, using a pagetable protection of @prot.
+ *
+ * Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
+ * and __GFP_NOFAIL are not supported


We could probably also mention that __GFP_ZERO in @gfp_mask is
supported, though.


+ * Any use of gfp flags outside of GFP_KERNEL should be consulted
+ * with mm people.


Just a question: should that read 'GFP_KERNEL | __GFP_HIGHMEM' as
that is what vmalloc() resp. vzalloc() and others pass as flags?


+ *
   */


Sounds good otherwise, thanks Michal!


  static void *__vmalloc_node(unsigned long size, unsigned long align,
gfp_t gfp_mask, pgprot_t prot,

Re: [PATCH 0/6 v3] kvmalloc

2017-01-26 Thread Michal Hocko

On Thu 26-01-17 11:08:02, Michal Hocko wrote:
> On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
> > On 01/26/2017 08:43 AM, Michal Hocko wrote:
> > > On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> [...]
> > > > I assume that kvzalloc() is still the same from [1], right? If so, then
> > > > it would unfortunately (partially) reintroduce the issue that was fixed.
> > > > If you look above at flags, they're also passed to __vmalloc() to not
> > > > trigger OOM in these situations I've experienced.
> > > 
> > > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > > think it would. It can still trigger the OOM killer because the flags
> > > are not propagated all the way down to all allocation requests (e.g.
> > > page tables). This is the same reason why GFP_NOFS is not supported in
> > > vmalloc.
> > 
> > Ok, good to know, is that somewhere clearly documented (like for the
> > case with kmalloc())?
> 
> I am afraid that we really suck on this front. I will add something.

So I have folded the following into patch 1. It is in line with
kvmalloc and hopefully at least tells more than the current code.
---
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d89034a393f2..6c1aa2c68887 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
  * Allocate enough pages to cover @size from the page level
  * allocator with @gfp_mask flags.  Map them into contiguous
  * kernel virtual space, using a pagetable protection of @prot.
+ *
+ * Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
+ * and __GFP_NOFAIL are not supported
+ *
+ * Any use of gfp flags outside of GFP_KERNEL should be consulted
+ * with mm people.
+ *
  */
 static void *__vmalloc_node(unsigned long size, unsigned long align,
gfp_t gfp_mask, pgprot_t prot,
-- 
Michal Hocko
SUSE Labs

Re: [PATCH 0/6 v3] kvmalloc

2017-01-26 Thread Michal Hocko

On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
> On 01/26/2017 08:43 AM, Michal Hocko wrote:
> > On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
[...]
> > > I assume that kvzalloc() is still the same from [1], right? If so, then
> > > it would unfortunately (partially) reintroduce the issue that was fixed.
> > > If you look above at flags, they're also passed to __vmalloc() to not
> > > trigger OOM in these situations I've experienced.
> > 
> > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > think it would. It can still trigger the OOM killer because the flags
> > are not propagated all the way down to all allocation requests (e.g.
> > page tables). This is the same reason why GFP_NOFS is not supported in
> > vmalloc.
> 
> Ok, good to know, is that somewhere clearly documented (like for the
> case with kmalloc())?

I am afraid that we really suck on this front. I will add something.

> If not, could we do that for non-mm folks, or
> at least add a similar WARN_ON_ONCE() as you did for kvmalloc() to make
> it obvious to users that a given flag combination is not supported all
> the way down?

I am not sure that triggering a warning that somebody has used
__GFP_NOWARN is very helpful ;). I also do not think that covering all the
supported flags is really feasible. Most of them will not have bad side
effects. I have added the warning because this API is new and I wanted
to catch new abusers. Old ones would have to die slowly.

> > > This is effectively the
> > > same requirement as in other networking areas f.e. that 5bad87348c70
> > > ("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
> > > In your comment in kvzalloc() you eventually say that some of the above
> > > modifiers are not supported. So there would be two options, i) just leave
> > > out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
> > > it later (along with similar code from 5bad87348c70), or ii) implement
> > > support for these modifiers as well to your original set. I guess it's not
> > > too urgent, so we could also proceed with i) if that is easier for you to
> > > proceed (I don't mind either way).
> > 
> > Could you clarify why the oom killer in vmalloc matters actually?
> 
> For both mentioned commits, (privileged) user space can potentially
> create large allocation requests, where we thus switch to vmalloc()
> flavor eventually and then OOM starts killing processes to try to
> satisfy the allocation request. This is bad, because we want the
> request to just fail instead as it's non-critical and f.e. not kill
> ssh connection et al. Failing is totally fine in this case, whereas
> triggering OOM is not.

I see your intention but does it really make any real difference?
Consider that you back off right before you would have OOMed. Any
parallel request would just hit the OOM for you. You are (almost) never
doing an allocation in isolation.

> In my testing, __GFP_NORETRY did satisfy this
> just fine, but as you say it seems it's not enough.

Yeah, ptes have most probably been populated already.

> Given there are
> multiple places like these in the kernel, could we instead add an
> option such as __GFP_NOOOM, or just make __GFP_NORETRY supported?

As said above I do not really think that suppressing the OOM killer
makes any difference because it might be just somebody else doing that
for you. Also the OOM killer is an MM internal implementation "detail"
users shouldn't really care about. I agree that callers should have a way
to say they do not want to try really hard, and that is unfortunately not
that simple for vmalloc. The main problem here is that gfp mask
propagation is not that easy to fix without a lot of code churn as some
of those hardcoded allocation requests are deep in call chains.
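
To illustrate the propagation problem described above, here is a minimal
user-space sketch in C. It is a toy model and not kernel code: all model_*
names and the flag values are made up for illustration, and the only thing
it encodes is the point made in this thread, namely that the data pages see
the caller's mask while the page table allocation deep in the call chain
uses a hardcoded one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for gfp flags; these are not the kernel's values. */
enum model_gfp {
	MODEL_GFP_KERNEL  = 1u << 0,
	MODEL_GFP_NORETRY = 1u << 1,
};

/* Records what each allocation level was allowed to do. */
struct model_trace {
	bool data_pages_may_oom;
	bool page_tables_may_oom;
};

/* Model assumption: an allocation can reach the OOM killer unless NORETRY is set. */
static bool model_may_oom(unsigned int gfp)
{
	return !(gfp & MODEL_GFP_NORETRY);
}

/* Deep in the call chain: the mask is hardcoded, the caller's mask is not visible. */
static void model_alloc_page_table(struct model_trace *t)
{
	t->page_tables_may_oom = model_may_oom(MODEL_GFP_KERNEL);
}

/* Top level: data pages honour the caller's mask, page tables do not. */
static void model_vmalloc(unsigned int gfp, struct model_trace *t)
{
	assert(t != NULL);
	t->data_pages_may_oom = model_may_oom(gfp);
	model_alloc_page_table(t);
}
```

Calling model_vmalloc() with MODEL_GFP_KERNEL | MODEL_GFP_NORETRY leaves
data_pages_may_oom false but page_tables_may_oom true, which is the partial
effect discussed here. Making the flag fully effective would mean passing
the mask through every level, and that is the code churn mentioned above.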

I know this sucks and it would be great to support __GFP_NORETRY in
[k]vmalloc and maybe we will get there eventually. But in the meantime
I really think that using kvmalloc wherever possible is much better than
open coded variants with expectations which sometimes do not hold.
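
As a rough picture of what a single helper buys over open coded variants,
the following user-space C sketch models the "try the contiguous allocator
first, then fall back" pattern. Everything here is an assumption made for
illustration: the model_* names are hypothetical, calloc() stands in for
both allocators, and the 32 KiB cut-off only mirrors the
PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER check from the open coded bpf variant
quoted in this thread, assuming 4 KiB pages. It makes no claim about how
kvmalloc in this series is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cut-off above which the "contiguous" path is not attempted. */
#define MODEL_CONTIG_LIMIT ((size_t)32 * 1024)

enum model_path { MODEL_PATH_NONE, MODEL_PATH_CONTIG, MODEL_PATH_FALLBACK };

/* Stand-in for kmalloc(): refuses large sizes, or everything when forced to fail. */
static void *model_contig_alloc(size_t size, bool force_fail)
{
	if (force_fail || size > MODEL_CONTIG_LIMIT)
		return NULL;
	return calloc(1, size);
}

/* Stand-in for vmalloc(): plain zeroed heap memory. */
static void *model_fallback_alloc(size_t size)
{
	return calloc(1, size);
}

/* One helper owns the fallback policy, so callers do not have to repeat it. */
static void *model_kvzalloc(size_t size, bool contig_fails, enum model_path *path)
{
	void *p;

	assert(path != NULL);
	*path = MODEL_PATH_NONE;

	p = model_contig_alloc(size, contig_fails);
	if (p) {
		*path = MODEL_PATH_CONTIG;
		return p;
	}

	p = model_fallback_alloc(size);
	if (p)
		*path = MODEL_PATH_FALLBACK;
	return p;
}
```

In this model the caller frees with free() no matter which path was taken,
and the size check and flag choice live in exactly one place instead of
being copied, with slightly different expectations, into every caller.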

If you disagree I can drop the bpf part of course...
-- 
Michal Hocko
SUSE Labs

RE: [PATCH 0/6 v3] kvmalloc

2017-01-26 Thread David Laight

From: Daniel Borkmann
> Sent: 26 January 2017 09:37
...
> >> I assume that kvzalloc() is still the same from [1], right? If so, then
> >> it would unfortunately (partially) reintroduce the issue that was fixed.
> >> If you look above at flags, they're also passed to __vmalloc() to not
> >> trigger OOM in these situations I've experienced.
> >
> > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > think it would. It can still trigger the OOM killer because the flags
> > are not propagated all the way down to all allocation requests (e.g.
> > page tables). This is the same reason why GFP_NOFS is not supported in
> > vmalloc.
> 
> Ok, good to know, is that somewhere clearly documented (like for the
> case with kmalloc())? If not, could we do that for non-mm folks, or
> at least add a similar WARN_ON_ONCE() as you did for kvmalloc() to make
> it obvious to users that a given flag combination is not supported all
> the way down?

ISTM that requests for the relatively small memory blocks needed for page
tables aren't really likely to invoke the OOM killer when it isn't already
being invoked by other actions. So that isn't really a problem.

More of a problem is that requests that you really don't mind failing
can use the last 'reasonably available' memory.
This will cause the next allocation to fail when it would be better for
the earlier one to fail instead.
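
The ordering problem can be sketched with a trivial user-space model in C.
This is only an illustration under made-up assumptions: the pool is a bare
page counter, the model_* names are hypothetical, and nothing here reflects
how the page allocator actually accounts memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pool: a page counter and nothing more. */
struct model_pool {
	size_t free_pages;
};

/* Take nr pages if they are available; report failure otherwise. */
static bool model_alloc(struct model_pool *pool, size_t nr)
{
	assert(pool != NULL);
	if (nr > pool->free_pages)
		return false;
	pool->free_pages -= nr;
	return true;
}

/*
 * An optional request (one we would not mind failing) runs first, then an
 * important one. Returns true when the important request ends up failing.
 */
static bool model_later_request_starved(size_t total, size_t optional,
					size_t important)
{
	struct model_pool pool = { .free_pages = total };

	(void)model_alloc(&pool, optional);
	return !model_alloc(&pool, important);
}
```

With 8 pages in total, an optional request for all 8 succeeds and the
following 1 page request fails; with an optional request for 4 pages the
later one goes through.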

David

Re: [PATCH 0/6 v3] kvmalloc

2017-01-26 Thread Daniel Borkmann


On 01/26/2017 08:43 AM, Michal Hocko wrote:
> On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> > On 01/25/2017 07:14 PM, Alexei Starovoitov wrote:
> > > On Wed, Jan 25, 2017 at 5:21 AM, Michal Hocko  wrote:
> > > > On Wed 25-01-17 14:10:06, Michal Hocko wrote:
> > > > > On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
> > [...]
> > > > > > > Are there any more comments? I would really appreciate to hear from
> > > > > > > networking folks before I resubmit the series.
> > > > > > >
> > > > > > > while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
> > > > > > > which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
> > > > > > > See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
> > > > > > > it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
> > > > > > > So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
> > > > > >
> > > > > > OK, will do. Thanks for the heads up.
> > > > >
> > > > > Just for the record, I will fold the following into the patch 1
> > > > > ---
> > > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > > > index 19b6129eab23..8697f43cf93c 100644
> > > > > --- a/kernel/bpf/syscall.c
> > > > > +++ b/kernel/bpf/syscall.c
> > > > > @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
> > > > >
> > > > >  void *bpf_map_area_alloc(size_t size)
> > > > >  {
> > > > > -	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
> > > > > -	 * trigger under memory pressure as we really just want to
> > > > > -	 * fail instead.
> > > > > -	 */
> > > > > -	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
> > > > > -	void *area;
> > > > > -
> > > > > -	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> > > > > -		area = kmalloc(size, GFP_USER | flags);
> > > > > -		if (area != NULL)
> > > > > -			return area;
> > > > > -	}
> > > > > -
> > > > > -	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
> > > > > -			 PAGE_KERNEL);
> > > > > +	return kvzalloc(size, GFP_USER);
> > > > >  }
> > > > >
> > > > >  void bpf_map_area_free(void *area)
> > > >
> > > > Looks fine by me.
> > > > Daniel, thoughts?
> >
> > I assume that kvzalloc() is still the same from [1], right? If so, then
> > it would unfortunately (partially) reintroduce the issue that was fixed.
> > If you look above at flags, they're also passed to __vmalloc() to not
> > trigger OOM in these situations I've experienced.
>
> Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> think it would. It can still trigger the OOM killer because the flags
> are not propagated all the way down to all allocation requests (e.g.
> page tables). This is the same reason why GFP_NOFS is not supported in
> vmalloc.

Ok, good to know, is that somewhere clearly documented (like for the
case with kmalloc())? If not, could we do that for non-mm folks, or
at least add a similar WARN_ON_ONCE() as you did for kvmalloc() to make
it obvious to users that a given flag combination is not supported all
the way down?

> > This is effectively the
> > same requirement as in other networking areas f.e. that 5bad87348c70
> > ("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
> > In your comment in kvzalloc() you eventually say that some of the above
> > modifiers are not supported. So there would be two options, i) just leave
> > out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
> > it later (along with similar code from 5bad87348c70), or ii) implement
> > support for these modifiers as well to your original set. I guess it's not
> > too urgent, so we could also proceed with i) if that is easier for you to
> > proceed (I don't mind either way).
>
> Could you clarify why the oom killer in vmalloc matters actually?

For both mentioned commits, (privileged) user space can potentially
create large allocation requests, where we thus switch to vmalloc()
flavor eventually and then OOM starts killing processes to try to
satisfy the allocation request. This is bad, because we want the
request to just fail instead as it's non-critical and f.e. not kill
ssh connection et al. Failing is totally fine in this case, whereas
triggering OOM is not. In my testing, __GFP_NORETRY did satisfy this
just fine, but as you say it seems it's not enough. Given there are
multiple places like these in the kernel, could we instead add an
option such as __GFP_NOOOM, or just make __GFP_NORETRY supported?

Thanks,
Daniel

Re: [PATCH 0/6 v3] kvmalloc

2017-01-25 Thread Michal Hocko

On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> On 01/25/2017 07:14 PM, Alexei Starovoitov wrote:
> > On Wed, Jan 25, 2017 at 5:21 AM, Michal Hocko  wrote:
> > > On Wed 25-01-17 14:10:06, Michal Hocko wrote:
> > > > On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
> [...]
> > > > > > Are there any more comments? I would really appreciate to hear from
> > > > > > networking folks before I resubmit the series.
> > > > > 
> > > > > > while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
> > > > > > which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
> > > > > > See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
> > > > > > it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
> > > > > So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
> > > > 
> > > > OK, will do. Thanks for the heads up.
> > > 
> > > Just for the record, I will fold the following into the patch 1
> > > ---
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index 19b6129eab23..8697f43cf93c 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list 
> > > *tl)
> > > 
> > >   void *bpf_map_area_alloc(size_t size)
> > >   {
> > > -   /* We definitely need __GFP_NORETRY, so OOM killer doesn't
> > > -* trigger under memory pressure as we really just want to
> > > -* fail instead.
> > > -*/
> > > -   const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
> > > -   void *area;
> > > -
> > > -   if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> > > -   area = kmalloc(size, GFP_USER | flags);
> > > -   if (area != NULL)
> > > -   return area;
> > > -   }
> > > -
> > > -   return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
> > > -PAGE_KERNEL);
> > > +   return kvzalloc(size, GFP_USER);
> > >   }
> > > 
> > >   void bpf_map_area_free(void *area)
> > 
> > Looks fine by me.
> > Daniel, thoughts?
> 
> I assume that kvzalloc() is still the same from [1], right? If so, then
> it would unfortunately (partially) reintroduce the issue that was fixed.
> If you look above at flags, they're also passed to __vmalloc() to not
> trigger OOM in these situations I've experienced.

Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
think it would. It can still trigger the OOM killer because the flags
are not propagated all the way down to all allocation requests (e.g.
page tables). This is the same reason why GFP_NOFS is not supported in
vmalloc.

> This is effectively the
> same requirement as in other networking areas f.e. that 5bad87348c70
> ("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
> In your comment in kvzalloc() you eventually say that some of the above
> modifiers are not supported. So there would be two options, i) just leave
> out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
> it later (along with similar code from 5bad87348c70), or ii) implement
> support for these modifiers as well to your original set. I guess it's not
> too urgent, so we could also proceed with i) if that is easier for you to
> proceed (I don't mind either way).

Could you clarify why the oom killer in vmalloc matters actually?
-- 
Michal Hocko
SUSE Labs

Re: [PATCH 0/6 v3] kvmalloc

2017-01-25 Thread Daniel Borkmann


On 01/25/2017 07:14 PM, Alexei Starovoitov wrote:
> On Wed, Jan 25, 2017 at 5:21 AM, Michal Hocko  wrote:
> > On Wed 25-01-17 14:10:06, Michal Hocko wrote:
> > > On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
[...]
> > > > > Are there any more comments? I would really appreciate to hear from
> > > > > networking folks before I resubmit the series.
> > > > >
> > > > > while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
> > > > > which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
> > > > > See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
> > > > > it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
> > > > > So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
> > > >
> > > > OK, will do. Thanks for the heads up.
> > >
> > > Just for the record, I will fold the following into the patch 1
> > > ---
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index 19b6129eab23..8697f43cf93c 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
> > >
> > >  void *bpf_map_area_alloc(size_t size)
> > >  {
> > > -	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
> > > -	 * trigger under memory pressure as we really just want to
> > > -	 * fail instead.
> > > -	 */
> > > -	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
> > > -	void *area;
> > > -
> > > -	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> > > -		area = kmalloc(size, GFP_USER | flags);
> > > -		if (area != NULL)
> > > -			return area;
> > > -	}
> > > -
> > > -	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
> > > -			 PAGE_KERNEL);
> > > +	return kvzalloc(size, GFP_USER);
> > >  }
> > >
> > >  void bpf_map_area_free(void *area)
> >
> > Looks fine by me.
> > Daniel, thoughts?

I assume that kvzalloc() is still the same from [1], right? If so, then
it would unfortunately (partially) reintroduce the issue that was fixed.
If you look above at flags, they're also passed to __vmalloc() to not
trigger OOM in these situations I've experienced. This is effectively the
same requirement as in other networking areas f.e. that 5bad87348c70
("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
In your comment in kvzalloc() you eventually say that some of the above
modifiers are not supported. So there would be two options, i) just leave
out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
it later (along with similar code from 5bad87348c70), or ii) implement
support for these modifiers as well to your original set. I guess it's not
too urgent, so we could also proceed with i) if that is easier for you to
proceed (I don't mind either way).

Thanks a lot,
Daniel

  [1] https://lkml.org/lkml/2017/1/12/442

Re: [PATCH 0/6 v3] kvmalloc

2017-01-25 Thread Alexei Starovoitov

On Wed, Jan 25, 2017 at 5:21 AM, Michal Hocko  wrote:
> On Wed 25-01-17 14:10:06, Michal Hocko wrote:
>> On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
>> > On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote:
>> > > On Thu 12-01-17 16:37:11, Michal Hocko wrote:
>> > > > Hi,
>> > > > this has been previously posted as a single patch [1] but later on more
>> > > > built on top. It turned out that there are users who would like to have
>> > > > __GFP_REPEAT semantic. This is currently implemented for costly >64B
>> > > > requests. Doing the same for smaller requests would require to redefine
>> > > > __GFP_REPEAT semantic in the page allocator which is out of scope of
>> > > > this series.
>> > > >
>> > > > There are many open coded kmalloc with vmalloc fallback instances in
>> > > > the tree.  Most of them are not careful enough or simply do not care
>> > > > about the underlying semantic of the kmalloc/page allocator which means
>> > > > that a) some vmalloc fallbacks are basically unreachable because the
>> > > > kmalloc part will keep retrying until it succeeds b) the page allocator
>> > > > can invoke a really disruptive steps like the OOM killer to move forward
>> > > > which doesn't sound appropriate when we consider that the vmalloc
>> > > > fallback is available.
>> > > >
>> > > > As it can be seen implementing kvmalloc requires quite an intimate
>> > > > knowledge if the page allocator and the memory reclaim internals which
>> > > > strongly suggests that a helper should be implemented in the memory
>> > > > subsystem proper.
>> > > >
>> > > > Most callers I could find have been converted to use the helper instead.
>> > > > This is patch 5. There are some more relying on __GFP_REPEAT in the
>> > > > networking stack which I have converted as well but considering we do
>> > > > not have a support for __GFP_REPEAT for requests smaller than 64kB I
>> > > > have marked it RFC.
>> > >
>> > > Are there any more comments? I would really appreciate to hear from
>> > > networking folks before I resubmit the series.
>> >
>> > while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
>> > which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
>> > See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
>> > it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
>> > So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
>>
>> OK, will do. Thanks for the heads up.
>
> Just for the record, I will fold the following into the patch 1
> ---
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 19b6129eab23..8697f43cf93c 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
>
>  void *bpf_map_area_alloc(size_t size)
>  {
> -	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
> -	 * trigger under memory pressure as we really just want to
> -	 * fail instead.
> -	 */
> -	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
> -	void *area;
> -
> -	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> -		area = kmalloc(size, GFP_USER | flags);
> -		if (area != NULL)
> -			return area;
> -	}
> -
> -	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
> -			 PAGE_KERNEL);
> +	return kvzalloc(size, GFP_USER);
>  }
>
>  void bpf_map_area_free(void *area)

Looks fine by me.
Daniel, thoughts?

Re: [PATCH 0/6 v3] kvmalloc

2017-01-25 Thread Michal Hocko

On Wed 25-01-17 14:10:06, Michal Hocko wrote:
> On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
> > On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote:
> > > On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> > > > Hi,
> > > > this has been previously posted as a single patch [1] but later on more
> > > > built on top. It turned out that there are users who would like to have
> > > > __GFP_REPEAT semantic. This is currently implemented for costly >64B
> > > > requests. Doing the same for smaller requests would require to redefine
> > > > __GFP_REPEAT semantic in the page allocator which is out of scope of
> > > > this series.
> > > > 
> > > > There are many open coded kmalloc with vmalloc fallback instances in
> > > > the tree.  Most of them are not careful enough or simply do not care
> > > > about the underlying semantic of the kmalloc/page allocator which means
> > > > that a) some vmalloc fallbacks are basically unreachable because the
> > > > kmalloc part will keep retrying until it succeeds b) the page allocator
> > > > can invoke a really disruptive steps like the OOM killer to move forward
> > > > which doesn't sound appropriate when we consider that the vmalloc
> > > > fallback is available.
> > > > 
> > > > As it can be seen implementing kvmalloc requires quite an intimate
> > > > knowledge if the page allocator and the memory reclaim internals which
> > > > strongly suggests that a helper should be implemented in the memory
> > > > subsystem proper.
> > > > 
> > > > Most callers I could find have been converted to use the helper instead.
> > > > This is patch 5. There are some more relying on __GFP_REPEAT in the
> > > > networking stack which I have converted as well but considering we do
> > > > not have a support for __GFP_REPEAT for requests smaller than 64kB I
> > > > have marked it RFC.
> > > 
> > > Are there any more comments? I would really appreciate to hear from
> > > networking folks before I resubmit the series.
> > 
> > while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
> > which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
> > See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
> > it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
> > So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
> 
> OK, will do. Thanks for the heads up.

Just for the record, I will fold the following into patch 1
---
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 19b6129eab23..8697f43cf93c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
 
 void *bpf_map_area_alloc(size_t size)
 {
-	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
-	 * trigger under memory pressure as we really just want to
-	 * fail instead.
-	 */
-	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
-	void *area;
-
-	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
-		area = kmalloc(size, GFP_USER | flags);
-		if (area != NULL)
-			return area;
-	}
-
-	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
-			 PAGE_KERNEL);
+	return kvzalloc(size, GFP_USER);
 }
 
 void bpf_map_area_free(void *area)
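
The size check in the removed code can be worked through in user space. The
sketch below assumes 4 KiB pages (an assumption; the page size is
architecture dependent) together with the kernel's PAGE_ALLOC_COSTLY_ORDER
value of 3. The MODEL_* macros and model_* functions are hypothetical names
used only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed 4 KiB pages; 3 mirrors the kernel's PAGE_ALLOC_COSTLY_ORDER. */
#define MODEL_PAGE_SIZE    4096UL
#define MODEL_COSTLY_ORDER 3

/* Largest request, in bytes, for which the removed code tried kmalloc() first. */
size_t model_kmalloc_limit(void)
{
	return MODEL_PAGE_SIZE << MODEL_COSTLY_ORDER;
}

/* Mirrors the "size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)" test. */
bool model_tries_kmalloc_first(size_t size)
{
	return size <= model_kmalloc_limit();
}
```

Under these assumptions the limit is 32768 bytes, so a 32 KiB map area would
still go through kmalloc() first, and anything larger would go straight to
__vmalloc().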

-- 
Michal Hocko
SUSE Labs

Re: [PATCH 0/6 v3] kvmalloc

2017-01-25 Thread Michal Hocko

On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
> On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote:
> > On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> > > Hi,
> > > this has been previously posted as a single patch [1] but later on more
> > > built on top. It turned out that there are users who would like to have
> > > __GFP_REPEAT semantic. This is currently implemented for costly >64B
> > > requests. Doing the same for smaller requests would require to redefine
> > > __GFP_REPEAT semantic in the page allocator which is out of scope of
> > > this series.
> > > 
> > > There are many open coded kmalloc with vmalloc fallback instances in
> > > the tree.  Most of them are not careful enough or simply do not care
> > > about the underlying semantic of the kmalloc/page allocator which means
> > > that a) some vmalloc fallbacks are basically unreachable because the
> > > kmalloc part will keep retrying until it succeeds b) the page allocator
> > > can invoke a really disruptive steps like the OOM killer to move forward
> > > which doesn't sound appropriate when we consider that the vmalloc
> > > fallback is available.
> > > 
> > > As it can be seen implementing kvmalloc requires quite an intimate
> > > knowledge if the page allocator and the memory reclaim internals which
> > > strongly suggests that a helper should be implemented in the memory
> > > subsystem proper.
> > > 
> > > Most callers I could find have been converted to use the helper instead.
> > > This is patch 5. There are some more relying on __GFP_REPEAT in the
> > > networking stack which I have converted as well but considering we do
> > > not have a support for __GFP_REPEAT for requests smaller than 64kB I
> > > have marked it RFC.
> > 
> > Are there any more comments? I would really appreciate to hear from
> > networking folks before I resubmit the series.
> 
> while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
> which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
> See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
> it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
> So please rebase and switch bpf_map_area_alloc() to use kvmalloc().

OK, will do. Thanks for the heads up.


-- 
Michal Hocko
SUSE Labs

Re: [PATCH 0/6 v3] kvmalloc

2017-01-25 Thread Michal Hocko

On Tue 24-01-17 08:00:26, Eric Dumazet wrote:
> On Tue, 2017-01-24 at 16:17 +0100, Michal Hocko wrote:
> > On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> 
> > Are there any more comments? I would really appreciate to hear from
> > networking folks before I resubmit the series.
> 
> I do not see any issues right now.
> 
> I am happy to see this thing finally coming, after years of
> resistance ;)

OK, so I will rebase the series, repost it, and ask Andrew for inclusion
once it passes my compile test battery.
 
Thanks!
-- 
Michal Hocko
SUSE Labs

Re: [PATCH 0/6 v3] kvmalloc

2017-01-24 Thread Alexei Starovoitov

On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote:
> On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> > Hi,
> > this has been previously posted as a single patch [1] but later on more
> > built on top. It turned out that there are users who would like to have
> > __GFP_REPEAT semantic. This is currently implemented for costly >64B
> > requests. Doing the same for smaller requests would require to redefine
> > __GFP_REPEAT semantic in the page allocator which is out of scope of
> > this series.
> > 
> > There are many open coded kmalloc with vmalloc fallback instances in
> > the tree.  Most of them are not careful enough or simply do not care
> > about the underlying semantic of the kmalloc/page allocator which means
> > that a) some vmalloc fallbacks are basically unreachable because the
> > kmalloc part will keep retrying until it succeeds b) the page allocator
> > can invoke a really disruptive steps like the OOM killer to move forward
> > which doesn't sound appropriate when we consider that the vmalloc
> > fallback is available.
> > 
> > As it can be seen implementing kvmalloc requires quite an intimate
> > knowledge if the page allocator and the memory reclaim internals which
> > strongly suggests that a helper should be implemented in the memory
> > subsystem proper.
> > 
> > Most callers I could find have been converted to use the helper instead.
> > This is patch 5. There are some more relying on __GFP_REPEAT in the
> > networking stack which I have converted as well but considering we do
> > not have a support for __GFP_REPEAT for requests smaller than 64kB I
> > have marked it RFC.
> 
> Are there any more comments? I would really appreciate to hear from
> networking folks before I resubmit the series.

While this patchset was baking, the bpf side switched to bpf_map_area_alloc(),
which fixes the missing __GFP_NORETRY issue that we had to fix quickly.
See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with
map alloc"); it covers all kmalloc/vmalloc pairs instead of just one place
as in this set. So please rebase and switch bpf_map_area_alloc() to use
kvmalloc().
Thanks

Re: [PATCH 0/6 v3] kvmalloc

2017-01-24 Thread Eric Dumazet

On Tue, 2017-01-24 at 16:17 +0100, Michal Hocko wrote:
> On Thu 12-01-17 16:37:11, Michal Hocko wrote:

> Are there any more comments? I would really appreciate to hear from
> networking folks before I resubmit the series.

I do not see any issues right now.

I am happy to see this thing finally coming, after years of
resistance ;)

Re: [PATCH 0/6 v3] kvmalloc

2017-01-24 Thread Michal Hocko

On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> Hi,
> this has been previously posted as a single patch [1] but later on more
> built on top. It turned out that there are users who would like to have
> __GFP_REPEAT semantic. This is currently implemented for costly >64B
> requests. Doing the same for smaller requests would require to redefine
> __GFP_REPEAT semantic in the page allocator which is out of scope of
> this series.
> 
> There are many open coded kmalloc with vmalloc fallback instances in
> the tree.  Most of them are not careful enough or simply do not care
> about the underlying semantic of the kmalloc/page allocator which means
> that a) some vmalloc fallbacks are basically unreachable because the
> kmalloc part will keep retrying until it succeeds b) the page allocator
> can invoke a really disruptive steps like the OOM killer to move forward
> which doesn't sound appropriate when we consider that the vmalloc
> fallback is available.
> 
> As it can be seen implementing kvmalloc requires quite an intimate
> knowledge if the page allocator and the memory reclaim internals which
> strongly suggests that a helper should be implemented in the memory
> subsystem proper.
> 
> Most callers I could find have been converted to use the helper instead.
> This is patch 5. There are some more relying on __GFP_REPEAT in the
> networking stack which I have converted as well but considering we do
> not have a support for __GFP_REPEAT for requests smaller than 64kB I
> have marked it RFC.

Are there any more comments? I would really appreciate hearing from
networking folks before I resubmit the series.

Thanks!

> [1] http://lkml.kernel.org/r/20170102133700.1734-1-mho...@kernel.org
> 

-- 
Michal Hocko
SUSE Labs

[PATCH 0/6 v3] kvmalloc

2017-01-12 Thread Michal Hocko

Hi,
this has been previously posted as a single patch [1], but more patches
were later built on top. It turned out that there are users who would like
to have __GFP_REPEAT semantics. This is currently implemented only for
costly (>64kB) requests. Doing the same for smaller requests would require
redefining the __GFP_REPEAT semantics in the page allocator, which is out
of scope of this series.

There are many open-coded instances of kmalloc with a vmalloc fallback in
the tree. Most of them are not careful enough or simply do not care
about the underlying semantics of the kmalloc/page allocator, which means
that a) some vmalloc fallbacks are basically unreachable because the
kmalloc part will keep retrying until it succeeds, and b) the page allocator
can invoke really disruptive steps like the OOM killer to move forward,
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.
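
The pattern in question can be modelled in user space as follows. This is a
sketch under stated assumptions, not the helper from this series: malloc()
stands in for both allocators, the contig_limit parameter is a made-up knob
that simulates physically contiguous memory being unavailable above a
certain size, and every model_* name is hypothetical. The point it shows is
that the first attempt has to fail quickly for the fallback to be reachable
at all.

```c
#include <stdlib.h>

/* Which stand-in allocator satisfied the request. */
enum model_source { MODEL_NONE, MODEL_CONTIGUOUS, MODEL_VIRTUAL };

struct model_buf {
	void *ptr;
	enum model_source source;
};

/*
 * Stand-in for the kmalloc attempt: requests above contig_limit bytes fail
 * immediately instead of retrying or invoking anything disruptive.
 */
void *model_kmalloc(size_t size, size_t contig_limit)
{
	return size <= contig_limit ? malloc(size) : NULL;
}

/* Stand-in for the vmalloc fallback. */
void *model_vmalloc(size_t size)
{
	return malloc(size);
}

/* Try the contiguous allocator first, then fall back to the virtual one. */
struct model_buf model_kvmalloc(size_t size, size_t contig_limit)
{
	struct model_buf buf = { model_kmalloc(size, contig_limit), MODEL_CONTIGUOUS };

	if (buf.ptr)
		return buf;

	buf.ptr = model_vmalloc(size);
	buf.source = buf.ptr ? MODEL_VIRTUAL : MODEL_NONE;
	return buf;
}

/* One free routine for both sources, as a single helper API would offer. */
void model_kvfree(struct model_buf buf)
{
	free(buf.ptr);
}
```

If model_kmalloc() looped until it succeeded, as in case a) above,
model_vmalloc() would never be called.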

As can be seen, implementing kvmalloc requires quite intimate
knowledge of the page allocator and the memory reclaim internals, which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers I could find have been converted to use the helper instead.
This is patch 5. There are some more relying on __GFP_REPEAT in the
networking stack which I have converted as well, but considering we do
not have support for __GFP_REPEAT for requests smaller than 64kB, I
have marked it RFC.

[1] http://lkml.kernel.org/r/20170102133700.1734-1-mho...@kernel.org
