Re: [PATCH net-next 00/11] net: sched: cls_u32 Various improvements

2018-10-08 Thread Al Viro
On Sun, Oct 07, 2018 at 10:55:52PM -0700, David Miller wrote:
> From: Al Viro 
> Date: Mon, 8 Oct 2018 06:45:15 +0100
> 
> > Er...  Both are due to a commit missing from the very beginning of the series
> > (well, sitting on top of "net: sched: cls_u32: fix hnode refcounting"):
> 
> All dependencies like this must be explicitly stated.
>
> And in such situations you actually should wait for the dependency to
> get into 'net', eventually get merged into 'net-next', and then you
> can submit the stuff that depends upon it.
> 
> Not the way this was done.

Point (and this commit is simply missing - it's not that it went into
net).  FWIW, this was simply "this is what the breakage is caused by",
not an attempt to amend the submission or anything like that...


Re: [PATCH net-next 00/11] net: sched: cls_u32 Various improvements

2018-10-07 Thread Al Viro
On Sun, Oct 07, 2018 at 09:25:01PM -0700, David Miller wrote:
> From: Jamal Hadi Salim 
> Date: Sun,  7 Oct 2018 12:38:00 -0400
> 
> > Various improvements from Al.
> 
> Please submit changes that actually are compile tested:
> 
>   CC [M]  net/sched/cls_u32.o
> net/sched/cls_u32.c: In function ‘u32_delete’:
> net/sched/cls_u32.c:674:6: error: ‘root_ht’ undeclared (first use in this function); did you mean ‘root_user’?
>   if (root_ht == ht) {
>       ^~~~~~~
>       root_user
> net/sched/cls_u32.c:674:6: note: each undeclared identifier is reported only once for each function it appears in
> net/sched/cls_u32.c: In function ‘u32_set_parms’:
> net/sched/cls_u32.c:746:15: error: ‘struct tc_u_hnode’ has no member named ‘is_root’
>  if (ht_down->is_root) {
>              ^~

Er...  Both are due to a commit missing from the very beginning of the series
(well, sitting on top of "net: sched: cls_u32: fix hnode refcounting"):

Author: Al Viro 
Date:   Mon Sep 3 14:39:02 2018 -0400

net: sched: cls_u32: mark root hnode explicitly

... and produce consistent error on attempt to delete such.
Existing check in u32_delete() is inconsistent - after

tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: u32 divisor 1
tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: u32 divisor 1

both

tc filter delete dev eth0 parent ffff: protocol ip prio 100 handle 801: u32

and

tc filter delete dev eth0 parent ffff: protocol ip prio 100 handle 800: u32

will fail (at least with refcounting fixes), but the former will complain
about an attempt to remove a busy table, while the latter will recognize
it as root and yield "Not allowed to delete root node" instead.

The problem with the existing check is that several tcf_proto instances might
share the same tp->data and handle-to-hnode lookup will be the same for all
of them.  So comparing an hnode to be deleted with tp->root won't catch the
case when one tp is used to try deleting the root of another.  Solution is
trivial - mark the root hnodes explicitly upon allocation and check for that.

Signed-off-by: Al Viro 

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index b2c3406a2cf2..c4782aa808c7 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -84,6 +84,7 @@ struct tc_u_hnode {
 	int			refcnt;
 	unsigned int		divisor;
 	struct idr		handle_idr;
+	bool			is_root;
 	struct rcu_head		rcu;
 	u32			flags;
 	/* The 'ht' field MUST be the last field in structure to allow for
@@ -377,6 +378,7 @@ static int u32_init(struct tcf_proto *tp)
 	root_ht->refcnt++;
 	root_ht->handle = tp_c ? gen_new_htid(tp_c, root_ht) : 0x80000000;
 	root_ht->prio = tp->prio;
+	root_ht->is_root = true;
 	idr_init(&root_ht->handle_idr);
 
 	if (tp_c == NULL) {
@@ -693,7 +695,7 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last,
 		goto out;
 	}
 
-	if (root_ht == ht) {
+	if (ht->is_root) {
 		NL_SET_ERR_MSG_MOD(extack, "Not allowed to delete root node");
 		return -EINVAL;
 	}


Re: [PATCH] socket: fix struct ifreq size in compat ioctl

2018-09-13 Thread Al Viro
On Thu, Sep 13, 2018 at 01:49:09PM +0200, Johannes Berg wrote:
> From: Johannes Berg 
> 
> As reported by Robert O'Callahan, since Viro's commit to kill
> dev_ifsioc() we attempt to copy too much data in compat mode,
> which may lead to EFAULT when the 32-bit version of struct ifreq
> sits at/near the end of a page boundary, and the next page isn't
> mapped.
> 
> Fix this by passing whether or not we're doing a compat call and
> copying the appropriate size in/out, as we did before. This works
> because only the embedded "struct ifmap" has different size, and
> this is only used in SIOCGIFMAP/SIOCSIFMAP which has a different
> handler. All other parts of the union are naturally compatible.

Might be better to pass explicit size instead of this bool compat...
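
To make that concrete, a minimal sketch of the suggestion - note that
get_user_ifreq_sized() is a hypothetical name for illustration, not the
actual helper in the patch:

/* one helper, parameterized by the size of the userspace structure,
 * instead of a "bool compat" flag; only the embedded struct ifmap
 * differs between the two layouts, and that case has its own handler,
 * so a plain sized copy is enough */
static int get_user_ifreq_sized(struct ifreq *ifr, void __user *arg,
				size_t size)
{
	memset(ifr, 0, sizeof(*ifr));
	if (copy_from_user(ifr, arg, size))
		return -EFAULT;
	ifr->ifr_name[IFNAMSIZ - 1] = 0;
	return 0;
}

/* native callers would pass sizeof(struct ifreq), compat callers
 * sizeof(struct compat_ifreq) */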


Re: [PATCH net 00/13] cls_u32 cleanups and fixes.

2018-09-09 Thread Al Viro
On Sun, Sep 09, 2018 at 03:15:38PM +0100, Al Viro wrote:

> Umm...  Interesting - TCA_U32_SEL is not the only thing that
> gets ignored there; TCA_U32_MARK gets the same treatment.
> And then there's a lovely question what to do with n->pf -
> it's an array of n->sel.nkeys counters, and apparently we
> want (at least in common cases) to avoid resetting those.
> 
> *If* we declare that ->nkeys mismatch means failure, it's
> all relatively easy to implement.  Alternatively, we could
> declare that selector change means resetting the stats.
> Preferences?

BTW, shouldn't we issue u32_clear_hw_hnode() every time
we destroy an hnode?  It's done on u32_delete(), it's
done (for root ht) on u32_destroy(), but it's not done
for any other hnodes when you remove the entire (not shared)
filter.  Looks fishy...
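
Concretely, a hypothetical sketch (not a submitted patch) of what making
the hardware teardown unconditional might look like:

/* sketch: have u32_destroy_hnode() itself undo the offload, so every
 * path that kills an hnode - including u32_destroy() tearing down an
 * entire unshared filter - clears the hardware state as well */
static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
			     struct netlink_ext_ack *extack)
{
	u32_clear_hw_hnode(tp, ht, extack);	/* always undo the offload */
	u32_clear_hnode(tp, ht, extack);	/* then free the knodes */
	/* ... unlink ht from tp_c->hlist and free it, as today ... */
	return 0;
}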


Re: [PATCH net 00/13] cls_u32 cleanups and fixes.

2018-09-09 Thread Al Viro
On Sun, Sep 09, 2018 at 08:58:50AM -0400, Jamal Hadi Salim wrote:
> 
> Since you have the momentum here: i noticed something
> unusual while i was trying to craft a test that would
> vet some of your changes. This has nothing to do with
> your changes, same happens on my stock debian laptop
> with kernel:
> 4.17.0-0.bpo.3-amd64 #1 SMP Debian 4.17.17-1~bpo9+1 (2018-08-27)
> 
> Looking at git - possibly introduced around the time u32
> lockless was being introduced and maybe even earlier
> than that.

It's always been that way, actually - before that point the old
knode simply got reused, which excluded any chance of changing
n->sel.

> Unfortunately i dont have time to dig
> further.
> 
> To reproduce what i am referring to, here's a setup:
> 
> $tc filter add dev eth0 parent ffff: protocol ip prio 102 u32 \
> classid 1:2 match ip src 192.168.8.0/8
> $tc filter replace dev eth0 parent ffff: protocol ip prio 102 \
> handle 800:0:800 u32 classid 1:2 match ip src 1.1.0.0/24
> 
> u32_change() code path should have allowed changing of the
> keynode.

Umm...  Interesting - TCA_U32_SEL is not the only thing that
gets ignored there; TCA_U32_MARK gets the same treatment.
And then there's a lovely question what to do with n->pf -
it's an array of n->sel.nkeys counters, and apparently we
want (at least in common cases) to avoid resetting those.

*If* we declare that ->nkeys mismatch means failure, it's
all relatively easy to implement.  Alternatively, we could
declare that selector change means resetting the stats.
Preferences?
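
To make the first option concrete, a rough and untested sketch of what the
check in the u32_change() replace path could look like (surrounding context
elided; the error string is made up):

/* the "->nkeys mismatch means failure" variant: refuse to replace a
 * knode with a selector of a different size, so the n->pf array (one
 * counter per key) keeps making sense */
if (n) {	/* replacing an existing knode */
	struct tc_u32_sel *s = nla_data(tb[TCA_U32_SEL]);

	if (s->nkeys != n->sel.nkeys) {
		NL_SET_ERR_MSG_MOD(extack, "Key count may not change on replace");
		return -EINVAL;
	}
}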


[PATCH net 13/13] net: sched: cls_u32: simplify the hell out u32_delete() emptiness check

2018-09-08 Thread Al Viro
From: Al Viro 

Now that we have the knode count, we can instantly check if
any hnodes are non-empty.  And that kills the check for extra
references to root hnode - those could happen only if there was
a knode to carry such a link.

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 48 +---
 1 file changed, 1 insertion(+), 47 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 86cbe4f5800e..ef786ec24843 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -628,17 +628,6 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
return -ENOENT;
 }
 
-static bool ht_empty(struct tc_u_hnode *ht)
-{
-   unsigned int h;
-
-   for (h = 0; h <= ht->divisor; h++)
-   if (rcu_access_pointer(ht->ht[h]))
-   return false;
-
-   return true;
-}
-
 static void u32_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
 {
struct tc_u_common *tp_c = tp->data;
@@ -676,13 +665,9 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last,
  struct netlink_ext_ack *extack)
 {
struct tc_u_hnode *ht = arg;
-   struct tc_u_hnode *root_ht = rtnl_dereference(tp->root);
struct tc_u_common *tp_c = tp->data;
int ret = 0;
 
-   if (ht == NULL)
-   goto out;
-
if (TC_U32_KEY(ht->handle)) {
u32_remove_hw_knode(tp, (struct tc_u_knode *)ht, extack);
ret = u32_delete_key(tp, (struct tc_u_knode *)ht);
@@ -702,38 +687,7 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last,
}
 
 out:
-   *last = true;
-   if (root_ht) {
-   if (root_ht->refcnt > 2) {
-   *last = false;
-   goto ret;
-   }
-   if (root_ht->refcnt == 2) {
-   if (!ht_empty(root_ht)) {
-   *last = false;
-   goto ret;
-   }
-   }
-   }
-
-   if (tp_c->refcnt > 1) {
-   *last = false;
-   goto ret;
-   }
-
-   if (tp_c->refcnt == 1) {
-   struct tc_u_hnode *ht;
-
-   for (ht = rtnl_dereference(tp_c->hlist);
-        ht;
-        ht = rtnl_dereference(ht->next))
-   if (!ht_empty(ht)) {
-   *last = false;
-   break;
-   }
-   }
-
-ret:
+   *last = tp_c->refcnt == 1 && tp_c->knodes == 0;
return ret;
 }
 
-- 
2.11.0



[PATCH net 12/13] net: sched: cls_u32: keep track of knodes count in tc_u_common

2018-09-08 Thread Al Viro
From: Al Viro 

allows simplifying u32_delete() considerably

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index f6bb3885598d..86cbe4f5800e 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -97,6 +97,7 @@ struct tc_u_common {
int refcnt;
struct idr  handle_idr;
struct hlist_node   hnode;
+   long knodes;
 };
 
 static inline unsigned int u32_hash_fold(__be32 key,
@@ -453,6 +454,7 @@ static void u32_delete_key_freepf_work(struct work_struct *work)
 
 static int u32_delete_key(struct tcf_proto *tp, struct tc_u_knode *key)
 {
+   struct tc_u_common *tp_c = tp->data;
struct tc_u_knode __rcu **kp;
struct tc_u_knode *pkp;
struct tc_u_hnode *ht = rtnl_dereference(key->ht_up);
@@ -463,6 +465,7 @@ static int u32_delete_key(struct tcf_proto *tp, struct tc_u_knode *key)
 kp = &pkp->next, pkp = rtnl_dereference(*kp)) {
if (pkp == key) {
RCU_INIT_POINTER(*kp, key->next);
+   tp_c->knodes--;
 
tcf_unbind_filter(tp, &key->res);
idr_remove(&ht->handle_idr, key->handle);
@@ -577,6 +580,7 @@ static int u32_replace_hw_knode(struct tcf_proto *tp, struct tc_u_knode *n,
 static void u32_clear_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
struct netlink_ext_ack *extack)
 {
+   struct tc_u_common *tp_c = tp->data;
struct tc_u_knode *n;
unsigned int h;
 
@@ -584,6 +588,7 @@ static void u32_clear_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
while ((n = rtnl_dereference(ht->ht[h])) != NULL) {
RCU_INIT_POINTER(ht->ht[h],
 rtnl_dereference(n->next));
+   tp_c->knodes--;
tcf_unbind_filter(tp, &n->res);
u32_remove_hw_knode(tp, n, extack);
idr_remove(&ht->handle_idr, n->handle);
@@ -1140,6 +1145,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
 
RCU_INIT_POINTER(n->next, pins);
rcu_assign_pointer(*ins, n);
+   tp_c->knodes++;
*arg = n;
return 0;
}
-- 
2.11.0



[PATCH net 11/13] net: sched: cls_u32: get rid of hnode ->tp_c and tp_c argument of u32_set_parms()

2018-09-08 Thread Al Viro
From: Al Viro 

the latter is redundant, the former - never read...

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 12757e3ec8d8..f6bb3885598d 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -79,7 +79,6 @@ struct tc_u_hnode {
struct tc_u_hnode __rcu *next;
u32 handle;
u32 prio;
-   struct tc_u_common  *tp_c;
int refcnt;
unsigned int divisor;
struct idr  handle_idr;
@@ -390,7 +389,6 @@ static int u32_init(struct tcf_proto *tp)
tp_c->refcnt++;
RCU_INIT_POINTER(root_ht->next, tp_c->hlist);
rcu_assign_pointer(tp_c->hlist, root_ht);
-   root_ht->tp_c = tp_c;
 
root_ht->refcnt++;
rcu_assign_pointer(tp->root, root_ht);
@@ -761,7 +759,7 @@ static const struct nla_policy u32_policy[TCA_U32_MAX + 1] = {
 };
 
 static int u32_set_parms(struct net *net, struct tcf_proto *tp,
-unsigned long base, struct tc_u_common *tp_c,
+unsigned long base,
 struct tc_u_knode *n, struct nlattr **tb,
 struct nlattr *est, bool ovr,
 struct netlink_ext_ack *extack)
@@ -782,7 +780,7 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
}
 
if (handle) {
-   ht_down = u32_lookup_ht(tp_c, handle);
+   ht_down = u32_lookup_ht(tp->data, handle);
 
if (!ht_down) {
NL_SET_ERR_MSG_MOD(extack, "Link hash table not 
found");
@@ -956,7 +954,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
if (!new)
return -ENOMEM;
 
-   err = u32_set_parms(net, tp, base, tp_c, new, tb,
+   err = u32_set_parms(net, tp, base, new, tb,
tca[TCA_RATE], ovr, extack);
 
if (err) {
@@ -1012,7 +1010,6 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
return err;
}
}
-   ht->tp_c = tp_c;
ht->refcnt = 1;
ht->divisor = divisor;
ht->handle = handle;
@@ -1123,8 +1120,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
}
 #endif
 
-   err = u32_set_parms(net, tp, base, tp_c, n, tb, tca[TCA_RATE], ovr,
-   extack);
+   err = u32_set_parms(net, tp, base, n, tb, tca[TCA_RATE], ovr, extack);
if (err == 0) {
struct tc_u_knode __rcu **ins;
struct tc_u_knode *pins;
-- 
2.11.0



[PATCH net 09/13] net: sched: cls_u32: pass tc_u_common to u32_set_parms() instead of tc_u_hnode

2018-09-08 Thread Al Viro
From: Al Viro 

the only thing we used ht for was ht->tp_c and callers can get that
without going through ->tp_c at all; start with lifting that into
the callers, next commits will massage those, eventually removing
->tp_c altogether.

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index af05113c1212..221ce532b241 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -761,7 +761,7 @@ static const struct nla_policy u32_policy[TCA_U32_MAX + 1] = {
 };
 
 static int u32_set_parms(struct net *net, struct tcf_proto *tp,
-unsigned long base, struct tc_u_hnode *ht,
+unsigned long base, struct tc_u_common *tp_c,
 struct tc_u_knode *n, struct nlattr **tb,
 struct nlattr *est, bool ovr,
 struct netlink_ext_ack *extack)
@@ -782,7 +782,7 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
}
 
if (handle) {
-   ht_down = u32_lookup_ht(ht->tp_c, handle);
+   ht_down = u32_lookup_ht(tp_c, handle);
 
if (!ht_down) {
NL_SET_ERR_MSG_MOD(extack, "Link hash table not 
found");
@@ -957,7 +957,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
return -ENOMEM;
 
err = u32_set_parms(net, tp, base,
-   rtnl_dereference(n->ht_up), new, tb,
+   rtnl_dereference(n->ht_up)->tp_c, new, tb,
tca[TCA_RATE], ovr, extack);
 
if (err) {
@@ -1124,7 +1124,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
}
 #endif
 
-   err = u32_set_parms(net, tp, base, ht, n, tb, tca[TCA_RATE], ovr,
+   err = u32_set_parms(net, tp, base, ht->tp_c, n, tb, tca[TCA_RATE], ovr,
extack);
if (err == 0) {
struct tc_u_knode __rcu **ins;
-- 
2.11.0



[PATCH net 10/13] net: sched: cls_u32: the tp_c argument of u32_set_parms() is always tp->data

2018-09-08 Thread Al Viro
From: Al Viro 

It must be tc_u_common associated with that tp (i.e. tp->data).
Proof:
* both ->ht_up and ->tp_c are assign-once
* ->tp_c of anything inserted into tp_c->hlist is tp_c
* hnodes never get reinserted into the lists or moved
between those, so anything found by u32_lookup_ht(tp->data, ...)
will have ->tp_c equal to tp->data.
* tp->root->tp_c == tp->data.
* ->ht_up of anything inserted into hnode->ht[...] is
equal to hnode.
* knodes never get reinserted into hash chains or moved
between those, so anything returned by u32_lookup_key(ht, ...)
will have ->ht_up equal to ht.
* any knode returned by u32_get(tp, ...) will have ->ht_up->tp_c
point to tp->data

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 221ce532b241..12757e3ec8d8 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -956,8 +956,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
if (!new)
return -ENOMEM;
 
-   err = u32_set_parms(net, tp, base,
-   rtnl_dereference(n->ht_up)->tp_c, new, tb,
+   err = u32_set_parms(net, tp, base, tp_c, new, tb,
tca[TCA_RATE], ovr, extack);
 
if (err) {
@@ -1124,7 +1123,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
}
 #endif
 
-   err = u32_set_parms(net, tp, base, ht->tp_c, n, tb, tca[TCA_RATE], ovr,
+   err = u32_set_parms(net, tp, base, tp_c, n, tb, tca[TCA_RATE], ovr,
extack);
if (err == 0) {
struct tc_u_knode __rcu **ins;
-- 
2.11.0



[PATCH net 08/13] net: sched: cls_u32: clean tc_u_common hashtable

2018-09-08 Thread Al Viro
From: Al Viro 

* calculate key *once*, not for each hash chain element
* let tc_u_hash() return the pointer to chain head rather than index -
callers are cleaner that way.

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 24 +---
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 1bfbfcab7260..af05113c1212 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -343,19 +343,16 @@ static void *tc_u_common_ptr(const struct tcf_proto *tp)
return block->q;
 }
 
-static unsigned int tc_u_hash(const struct tcf_proto *tp)
+static struct hlist_head *tc_u_hash(void *key)
 {
-   return hash_ptr(tc_u_common_ptr(tp), U32_HASH_SHIFT);
+   return tc_u_common_hash + hash_ptr(key, U32_HASH_SHIFT);
 }
 
-static struct tc_u_common *tc_u_common_find(const struct tcf_proto *tp)
+static struct tc_u_common *tc_u_common_find(void *key)
 {
struct tc_u_common *tc;
-   unsigned int h;
-
-   h = tc_u_hash(tp);
-   hlist_for_each_entry(tc, &tc_u_common_hash[h], hnode) {
-   if (tc->ptr == tc_u_common_ptr(tp))
+   hlist_for_each_entry(tc, tc_u_hash(key), hnode) {
+   if (tc->ptr == key)
return tc;
}
return NULL;
@@ -364,10 +361,8 @@ static struct tc_u_common *tc_u_common_find(const struct tcf_proto *tp)
 static int u32_init(struct tcf_proto *tp)
 {
struct tc_u_hnode *root_ht;
-   struct tc_u_common *tp_c;
-   unsigned int h;
-
-   tp_c = tc_u_common_find(tp);
+   void *key = tc_u_common_ptr(tp);
+   struct tc_u_common *tp_c = tc_u_common_find(key);
 
root_ht = kzalloc(sizeof(*root_ht), GFP_KERNEL);
if (root_ht == NULL)
@@ -385,12 +380,11 @@ static int u32_init(struct tcf_proto *tp)
kfree(root_ht);
return -ENOBUFS;
}
-   tp_c->ptr = tc_u_common_ptr(tp);
+   tp_c->ptr = key;
INIT_HLIST_NODE(&tp_c->hnode);
idr_init(&tp_c->handle_idr);
 
-   h = tc_u_hash(tp);
-   hlist_add_head(&tp_c->hnode, &tc_u_common_hash[h]);
+   hlist_add_head(&tp_c->hnode, tc_u_hash(key));
}
 
tp_c->refcnt++;
-- 
2.11.0



[PATCH net 07/13] net: sched: cls_u32: get rid of tc_u_common ->rcu

2018-09-08 Thread Al Viro
From: Al Viro 

unused

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 281ac954511c..1bfbfcab7260 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -98,7 +98,6 @@ struct tc_u_common {
int refcnt;
struct idr  handle_idr;
struct hlist_node   hnode;
-   struct rcu_head rcu;
 };
 
 static inline unsigned int u32_hash_fold(__be32 key,
-- 
2.11.0



[PATCH net 06/13] net: sched: cls_u32: get rid of tc_u_knode ->tp

2018-09-08 Thread Al Viro
From: Al Viro 

not used anymore

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index d11862823911..281ac954511c 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -68,7 +68,6 @@ struct tc_u_knode {
u32 mask;
u32 __percpu*pcpu_success;
 #endif
-   struct tcf_proto*tp;
struct rcu_work rwork;
/* The 'sel' field MUST be the last field in structure to allow for
 * tc_u32_keys allocated at end of structure.
@@ -896,7 +895,6 @@ static struct tc_u_knode *u32_init_knode(struct tcf_proto *tp,
/* Similarly success statistics must be moved as pointers */
new->pcpu_success = n->pcpu_success;
 #endif
-   new->tp = tp;
memcpy(&new->sel, s, sizeof(*s) + s->nkeys*sizeof(struct tc_u32_key));
 
if (tcf_exts_init(&new->exts, TCA_U32_ACT, TCA_U32_POLICE)) {
@@ -1112,7 +1110,6 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
n->handle = handle;
n->fshift = s->hmask ? ffs(ntohl(s->hmask)) - 1 : 0;
n->flags = flags;
-   n->tp = tp;
 
err = tcf_exts_init(&n->exts, TCA_U32_ACT, TCA_U32_POLICE);
if (err < 0)
-- 
2.11.0



[PATCH net 04/13] net: sched: cls_u32: make sure that divisor is a power of 2

2018-09-08 Thread Al Viro
From: Al Viro 

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 72459b09d910..d9923d474b65 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -994,7 +994,11 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
if (tb[TCA_U32_DIVISOR]) {
unsigned int divisor = nla_get_u32(tb[TCA_U32_DIVISOR]);
 
-   if (--divisor > 0x100) {
+   if (!is_power_of_2(divisor)) {
+   NL_SET_ERR_MSG_MOD(extack, "Divisor is not a power of 2");
+   return -EINVAL;
+   }
+   if (divisor-- > 0x100) {
NL_SET_ERR_MSG_MOD(extack, "Exceeded maximum 256 hash buckets");
return -EINVAL;
}
-- 
2.11.0



[PATCH net 05/13] net: sched: cls_u32: get rid of unused argument of u32_destroy_key()

2018-09-08 Thread Al Viro
From: Al Viro 

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index d9923d474b65..d11862823911 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -406,8 +406,7 @@ static int u32_init(struct tcf_proto *tp)
return 0;
 }
 
-static int u32_destroy_key(struct tcf_proto *tp, struct tc_u_knode *n,
-  bool free_pf)
+static int u32_destroy_key(struct tc_u_knode *n, bool free_pf)
 {
struct tc_u_hnode *ht = rtnl_dereference(n->ht_down);
 
@@ -441,7 +440,7 @@ static void u32_delete_key_work(struct work_struct *work)
  struct tc_u_knode,
  rwork);
rtnl_lock();
-   u32_destroy_key(key->tp, key, false);
+   u32_destroy_key(key, false);
rtnl_unlock();
 }
 
@@ -458,7 +457,7 @@ static void u32_delete_key_freepf_work(struct work_struct *work)
  struct tc_u_knode,
  rwork);
rtnl_lock();
-   u32_destroy_key(key->tp, key, true);
+   u32_destroy_key(key, true);
rtnl_unlock();
 }
 
@@ -601,7 +600,7 @@ static void u32_clear_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
if (tcf_exts_get_net(&n->exts))
tcf_queue_work(&n->rwork, u32_delete_key_freepf_work);
else
-   u32_destroy_key(n->tp, n, true);
+   u32_destroy_key(n, true);
}
}
 }
@@ -971,13 +970,13 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
tca[TCA_RATE], ovr, extack);
 
if (err) {
-   u32_destroy_key(tp, new, false);
+   u32_destroy_key(new, false);
return err;
}
 
err = u32_replace_hw_knode(tp, new, flags, extack);
if (err) {
-   u32_destroy_key(tp, new, false);
+   u32_destroy_key(new, false);
return err;
}
 
-- 
2.11.0



[PATCH net 03/13] net: sched: cls_u32: disallow linking to root hnode

2018-09-08 Thread Al Viro
From: Al Viro 

Operation makes no sense.  Nothing will actually break if we do so
(depth limit in u32_classify() will prevent infinite loops), but
according to maintainers it's best prohibited outright.

NOTE: doing so guarantees that u32_destroy() will trigger the call
of u32_destroy_hnode(); we might want to make that unconditional.

Test:
tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 100 u32 \
link 800: offset at 0 mask 0f00 shift 6 plus 0 eat match ip protocol 6 ff
should fail with
Error: cls_u32: Not linking to root node

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index c4782aa808c7..72459b09d910 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -797,6 +797,10 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
NL_SET_ERR_MSG_MOD(extack, "Link hash table not found");
return -EINVAL;
}
+   if (ht_down->is_root) {
+   NL_SET_ERR_MSG_MOD(extack, "Not linking to root node");
+   return -EINVAL;
+   }
ht_down->refcnt++;
}
 
-- 
2.11.0



[PATCH net 02/13] net: sched: cls_u32: mark root hnode explicitly

2018-09-08 Thread Al Viro
From: Al Viro 

... and produce consistent error on attempt to delete such.
Existing check in u32_delete() is inconsistent - after

tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: u32 divisor 1
tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: u32 divisor 1

both

tc filter delete dev eth0 parent ffff: protocol ip prio 100 handle 801: u32

and

tc filter delete dev eth0 parent ffff: protocol ip prio 100 handle 800: u32

will fail (at least with refcounting fixes), but the former will complain
about an attempt to remove a busy table, while the latter will recognize
it as root and yield "Not allowed to delete root node" instead.

The problem with the existing check is that several tcf_proto instances might
share the same tp->data and handle-to-hnode lookup will be the same for all
of them.  So comparing an hnode to be deleted with tp->root won't catch the
case when one tp is used to try deleting the root of another.  Solution is
trivial - mark the root hnodes explicitly upon allocation and check for that.

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index b2c3406a2cf2..c4782aa808c7 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -84,6 +84,7 @@ struct tc_u_hnode {
 	int			refcnt;
 	unsigned int		divisor;
 	struct idr		handle_idr;
+	bool			is_root;
 	struct rcu_head		rcu;
 	u32			flags;
 	/* The 'ht' field MUST be the last field in structure to allow for
@@ -377,6 +378,7 @@ static int u32_init(struct tcf_proto *tp)
 	root_ht->refcnt++;
 	root_ht->handle = tp_c ? gen_new_htid(tp_c, root_ht) : 0x80000000;
 	root_ht->prio = tp->prio;
+	root_ht->is_root = true;
 	idr_init(&root_ht->handle_idr);
 
 	if (tp_c == NULL) {
@@ -693,7 +695,7 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last,
 		goto out;
 	}
 
-	if (root_ht == ht) {
+	if (ht->is_root) {
 		NL_SET_ERR_MSG_MOD(extack, "Not allowed to delete root node");
 		return -EINVAL;
 	}
-- 
2.11.0



[PATCH net 00/13] cls_u32 cleanups and fixes.

2018-09-08 Thread Al Viro
From: Al Viro 

A series of net/sched/cls_u32.c cleanups and fixes:
1) fix hnode refcounting.  Refcounting for tc_u_hnode is broken;
it's not hard to trigger oopsen (including one inside an interrupt handler,
with resulting panic) as well as memory corruption.  Definitely -stable
fodder.
2) mark root hnode explicitly.  Consistent errors on attempts to
delete root hnodes.  Less serious than 1/13.
3) disallow linking to root hnode.  Prohibit creating links to
root hnodes; not critical (nothing actually breaks if we do allow those),
but gets rid of surprising cases.
4) make sure that divisor is a power of 2.  Missing validation -
divisor is documented as power of 2, but that's not actually enforced.
Results are moderately bogus (i.e. the kernel doesn't break), but rather
surprising.

Those are fixes, or at least can be argued to be such.  The rest are pure
cleanups:
5) get rid of unused argument of u32_destroy_key()
6) get rid of tc_u_knode ->tp
7) get rid of tc_u_common ->rcu
Eliminate some unused fields.
8) clean tc_u_common hashtable.
Hash lookups are best done with minimum of calculations per chain
element - comparing the field in each candidate with f(key) where
f() is actually a pure function is not nice, especially when
compiler doesn't see f() as such...  Better calculate f(key) once,
especially since we need its value to choose the hash chain in
the first place.
9) pass tc_u_common to u32_set_parms() instead of tc_u_hnode
10) the tp_c argument of u32_set_parms() is always tp->data
11) get rid of hnode ->tp_c and tp_c argument of u32_set_parms()
Massage that ends with getting rid of a redundant field.
12) keep track of knodes count in tc_u_common
13) simplify the hell out u32_delete() emptiness check
Checking if a filter needs to be killed after u32_delete() can be
done much easier - the test is equivalent to "filter doesn't
have ->data shared with anyone else and it has no knodes left
in it" and keeping track of the number of knodes is trivial.



[PATCH net 01/13] net: sched: cls_u32: fix hnode refcounting

2018-09-08 Thread Al Viro
From: Al Viro 

cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references via
->hlist and via ->tp_root together.  u32_destroy() drops the former and, in
case when there had been links, leaves the sucker on the list.  As the result,
there's nothing to protect it from getting freed once links are dropped.
That also makes the "is it busy" check incapable of catching the root hnode -
it *is* busy (there's a reference from tp), but we don't see it as something
separate.  "Is it our root?" check partially covers that, but the problem
exists for others' roots as well.

AFAICS, the minimal fix preserving the existing behaviour (where it doesn't
include oopsen, that is) would be this:
* count tp->root and tp_c->hlist as separate references.  I.e.
have u32_init() set refcount to 2, not 1.
* in u32_destroy() we always drop the former; in u32_destroy_hnode() -
the latter.

That way we have *all* references contributing to refcount.  List
removal happens in u32_destroy_hnode() (called only when ->refcnt is 1)
and in u32_destroy() in case of tc_u_common going away, along with everything
reachable from it.  IOW, that way we know that u32_destroy_key() won't
free something still on the list (or pointed to by someone's ->root).

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index f218ccf1e2d9..b2c3406a2cf2 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -398,6 +398,7 @@ static int u32_init(struct tcf_proto *tp)
rcu_assign_pointer(tp_c->hlist, root_ht);
root_ht->tp_c = tp_c;
 
+   root_ht->refcnt++;
rcu_assign_pointer(tp->root, root_ht);
tp->data = tp_c;
return 0;
@@ -610,7 +611,7 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
struct tc_u_hnode __rcu **hn;
struct tc_u_hnode *phn;
 
-   WARN_ON(ht->refcnt);
+   WARN_ON(--ht->refcnt);
 
u32_clear_hnode(tp, ht, extack);
 
@@ -649,7 +650,7 @@ static void u32_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
 
WARN_ON(root_ht == NULL);
 
-   if (root_ht && --root_ht->refcnt == 0)
+   if (root_ht && --root_ht->refcnt == 1)
u32_destroy_hnode(tp, root_ht, extack);
 
if (--tp_c->refcnt == 0) {
@@ -698,7 +699,6 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last,
}
 
if (ht->refcnt == 1) {
-   ht->refcnt--;
u32_destroy_hnode(tp, ht, extack);
} else {
NL_SET_ERR_MSG_MOD(extack, "Can not delete in-use filter");
@@ -708,11 +708,11 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last,
 out:
*last = true;
if (root_ht) {
-   if (root_ht->refcnt > 1) {
+   if (root_ht->refcnt > 2) {
*last = false;
goto ret;
}
-   if (root_ht->refcnt == 1) {
+   if (root_ht->refcnt == 2) {
if (!ht_empty(root_ht)) {
*last = false;
goto ret;
-- 
2.11.0



Re: [PATCH 1/7] fix hnode refcounting

2018-09-08 Thread Al Viro
On Fri, Sep 07, 2018 at 08:13:56AM -0400, Jamal Hadi Salim wrote:
> > } else {
> >  bool last;
> > 
> >  err = tfilter_del_notify(net, skb, n, tp, block,
> >  q, parent, fh, false, &last,
> >   extack);
> > How can we ever get there with NULL fh?
> > 
> 
> Try:
> tc filter delete dev $P parent ffff: protocol ip prio 10 u32
> tcm handle is 0, so will hit that code path.

Huh?  It will hit tcf_proto_destroy() (and thus u32_destroy()), but where will
it hit u32_delete()?  Sure, we have fh == NULL there; what happens next is
if (t->tcm_handle == 0) {
tcf_chain_tp_remove(chain, &chain_info, tp);
tfilter_notify(net, skb, n, tp, block, q, parent, fh,
	       RTM_DELTFILTER, false);
tcf_proto_destroy(tp, extack);
and that's it.  IDGI...  Direct experiment shows that on e.g.
tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 match ip protocol 1 0xff
tc filter delete dev eth0 parent ffff: protocol ip prio 10 u32
we get u32_destroy() called, with u32_destroy_hnode() called by it,
but no u32_delete() is called at all, let alone with ht == NULL...


Re: [PATCH 6/7] get rid of tc_u_common ->rcu

2018-09-06 Thread Al Viro
On Thu, Sep 06, 2018 at 09:18:47PM -0700, Cong Wang wrote:
> On Wed, Sep 5, 2018 at 12:04 PM Al Viro  wrote:
> >
> > From: Al Viro 
> >
> > unused
> >
> > Signed-off-by: Al Viro 
> > ---
> >  net/sched/cls_u32.c | 1 -
> >  1 file changed, 1 deletion(-)
> >
> > diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
> > index 8a1a573487bd..be9240ae1417 100644
> > --- a/net/sched/cls_u32.c
> > +++ b/net/sched/cls_u32.c
> > @@ -98,7 +98,6 @@ struct tc_u_common {
> > int refcnt;
> > struct idr  handle_idr;
> > struct hlist_node   hnode;
> > -   struct rcu_head rcu;
> >  };
> 
> Just FYI:
> 
> This was added when RCU was introduced to u32 fast path,
> it looks like on fast path we never touch tc_u_common, we
> only use tp->root, all the rest are slow paths with RTNL lock,
> so it is probably fine to just remove it rather than converting
> that kfree() to kfree_rcu().

*nod*

In any case, if u32_classify() grows that dereference, we can always
re-add ->rcu at the same time.


Re: [PATCH 2/7] mark root hnode explicitly

2018-09-06 Thread Al Viro
On Thu, Sep 06, 2018 at 08:23:36PM -0700, Cong Wang wrote:

> Pretty sure there is a 'tp' in u32_set_parms() parameter list.
> 
> Are you saying it is not what you want? If so, why?
> 
> More importantly, why this information is again missing in your
> changelog? This patch is definitely not trivial, it deserves a detailed
> changelog.
> 
> 
> > our own root, sure.  But there's nothing to stop doing the same via another
> > tcf_proto...
> 
> To my best knowledge, the place where you set ->is_root=true
> is precisely same with where we set tp->root=root_ht, and it doesn't
> change after set. What am I missing here?

The fact that there can be two (or more) different tcf_proto instances sharing
->data, but not ->root.  And since ->data is shared, u32_get() on one tp
will be able to return you ->root of *ANOTHER* one.  So comparison with
tp->root doesn't protect you.  Try this on mainline:

tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: u32 divisor 1
tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: u32 divisor 1
tc filter delete dev eth0 parent ffff: protocol ip prio 100 handle 801: u32

and watch the fun as soon as you get an incoming packet on eth0.  That panic
is fixed by 1/7, but you get "Not allowed to delete root node" for removing
_your_ root, with "Can not delete in-use filter" for other's root (as in the
last line of the reproducer).


Re: [PATCH 2/7] mark root hnode explicitly

2018-09-06 Thread Al Viro
On Thu, Sep 06, 2018 at 07:57:25PM -0700, Cong Wang wrote:

> > -   if (root_ht == ht) {
> > +   if (ht->is_root) {
> 
> 
> What's wrong with comparing pointers with root ht?

The fact that there may be more than one tcf_proto sharing tp->data.

> > NL_SET_ERR_MSG_MOD(extack, "Not allowed to delete root 
> > node");
> > return -EINVAL;
> > }
> > @@ -795,6 +797,10 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
> > NL_SET_ERR_MSG_MOD(extack, "Link hash table not found");
> > return -EINVAL;
> > }
> > +   if (ht_down->is_root) {
> 
> root ht is saved in tp->root, so you can compare ht_down with it too,
> if you want.
> 
> If this check is all what you need, you don't need an extra flag.

Again, *which* tp?  We can trivially check that we are not linking to/deleting
our own root, sure.  But there's nothing to stop doing the same via another
tcf_proto...


Re: [PATCH 1/7] fix hnode refcounting

2018-09-06 Thread Al Viro
On Thu, Sep 06, 2018 at 06:21:09AM -0400, Jamal Hadi Salim wrote:

> For networking patches, subject should be reflective of tree and
> subsystem. Example for this one:
> "[PATCH net 1/7]:net: sched: cls_u32: fix hnode refcounting"
> Also useful to have a cover letter summarizing the patchset
> in 0/7. Otherwise
> 
> Acked-by: Jamal Hadi Salim 

Argh...  Unfortunately, there's this: in u32_delete() we have
if (root_ht) {
if (root_ht->refcnt > 1) {
*last = false;
goto ret;
}
if (root_ht->refcnt == 1) {
if (!ht_empty(root_ht)) {
*last = false;
goto ret;
}
}
}
and that would need to be updated.  However, that logics is bloody odd
to start with.  First of all, root_ht has come from
   struct tc_u_hnode *root_ht = rtnl_dereference(tp->root);
and the only place where it's ever modified is
rcu_assign_pointer(tp->root, root_ht);
in u32_init(), where we'd bloody well checked that root_ht is non-NULL
(see
if (root_ht == NULL)
return -ENOBUFS;
upstream of that place) and where that assignment is inevitable on the
way to returning 0.  No matter what, if tp has passed u32_init() it
will have non-NULL ->root, forever.  And there is no way for tcf_proto
to be seen outside of tcf_proto_create() without ->init() having returned
0 - it gets freed before anyone sees it.

So this 'if (root_ht)' can't be false.  What's more, what the hell is the
whole thing checking?  We are in u32_delete().  It's called (as ->delete())
from tfilter_del_notify(), which is called from tc_del_tfilter().  If we
return 0 with *last true, we follow up calling tcf_proto_destroy().
OK, let's look at the logics in there:
* if there are links to root hnode => false
* if there's no links to root hnode and it has knodes => false
(BTW, if we ever get there with root_ht->refcnt < 1, we are obviously screwed)
* if there is a tcf_proto sharing tp->data => false (i.e. any filters
with different prio - don't bother)
* if tp is the only one with reference to tp->data and there are *any*
knodes => false.

Any extra links can come only from knodes in a non-empty hnode.  And it's not
a common case.  Shouldn't the whole thing be
* shared tp->data => false
* any non-empty hnode => false
instead?  Perhaps even with the knode counter in tp->data, avoiding any loops
in there, as well as the entire ht_empty()...
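
In code terms, with such a counter maintained in tc_u_common the whole dance
would collapse to something like this (which is what patches 12/13 and 13/13
elsewhere in this thread end up doing):

/* tp_c->knodes updated on every knode insertion and removal */
*last = tp_c->refcnt == 1 && tp_c->knodes == 0;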

Now, in the very beginning of u32_delete() we have this:
struct tc_u_hnode *ht = arg;

if (ht == NULL)
goto out;
OK, but the call of ->delete() is
err = tp->ops->delete(tp, fh, last, extack);
and arg == NULL seen in u32_delete() means fh == NULL in tfilter_del_notify().
Which is called in
if (!fh) {
...
} else {
bool last;

err = tfilter_del_notify(net, skb, n, tp, block,
 q, parent, fh, false, &last,
 extack);
How can we ever get there with NULL fh?

The whole thing makes very little sense; looks like it used to live in
u32_destroy() prior to commit 763dbf6328e41 ("net_sched: move the empty tp
check from ->destroy() to ->delete()"), but looking at the rationale in
that commit...  I don't see how it fixes anything - sure, now we remove
tcf_proto from the list before calling ->destroy().  Without any RCU delays
in between.  How could it possibly solve any issues with ->classify()
called in parallel with ->destroy()?  cls_u32 (at least these days)
does try to survive u32_destroy() in parallel with u32_classify();
if any other classifiers do not, they are still broken and that commit
has not done anything for them.

Anyway, adjusting 1/7 for that is trivial, but I would really like to
understand what that code is doing...  Comments?


Re: [PATCH 2/7] mark root hnode explicitly

2018-09-06 Thread Al Viro
On Thu, Sep 06, 2018 at 06:34:00AM -0400, Jamal Hadi Salim wrote:
> On 2018-09-06 6:28 a.m., Jamal Hadi Salim wrote:
> > On 2018-09-05 3:04 p.m., Al Viro wrote:
> > > From: Al Viro 
> > > 
> > > ... and disallow deleting or linking to such
> > > 
> > > Signed-off-by: Al Viro 
> > 
> > Same comment as other one in regards to subject
> > 
> > Since the flag space is coming from htnode which is
> > exposed via uapi it makes sense to keep this one here
> > because it is for private use; but  a comment in
> > include/uapi/linux/pkt_cls.h that this flag or
> > maybe a set of bits is reserved for internal use.
> > Otherwise:
> > 
> > Acked-by: Jamal Hadi Salim 
> 
> 
> Sorry, additional comment:
> It makes sense to reject user space attempt to
> set TCA_CLS_FLAGS_U32_ROOT

Point, and that one is IMO enough to give up on using ->flags for
that.  How about simply

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 3f985f29ef30..d14048e38b5c 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -84,6 +84,7 @@ struct tc_u_hnode {
int refcnt;
unsigned int divisor;
struct idr handle_idr;
+   bool is_root;
struct rcu_head rcu;
u32 flags;
/* The 'ht' field MUST be the last field in structure to allow for
@@ -377,6 +378,7 @@ static int u32_init(struct tcf_proto *tp)
root_ht->refcnt++;
root_ht->handle = tp_c ? gen_new_htid(tp_c, root_ht) : 0x80000000;
root_ht->prio = tp->prio;
+   root_ht->is_root = true;
idr_init(&root_ht->handle_idr);
 
if (tp_c == NULL) {
@@ -693,7 +695,7 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last,
goto out;
}
 
-   if (root_ht == ht) {
+   if (ht->is_root) {
NL_SET_ERR_MSG_MOD(extack, "Not allowed to delete root node");
return -EINVAL;
}
@@ -795,6 +797,10 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
NL_SET_ERR_MSG_MOD(extack, "Link hash table not found");
return -EINVAL;
}
+   if (ht_down->is_root) {
+   NL_SET_ERR_MSG_MOD(extack, "Not linking to root node");
+   return -EINVAL;
+   }
ht_down->refcnt++;
}
 


[PATCH 1/7] fix hnode refcounting

2018-09-05 Thread Al Viro
From: Al Viro 

cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references via
->hlist and via ->tp_root together.  u32_destroy() drops the former and, in
case when there had been links, leaves the sucker on the list.  As the result,
there's nothing to protect it from getting freed once links are dropped.
That also makes the "is it busy" check incapable of catching the root hnode -
it *is* busy (there's a reference from tp), but we don't see it as something
separate.  "Is it our root?" check partially covers that, but the problem
exists for others' roots as well.

AFAICS, the minimal fix preserving the existing behaviour (where it doesn't
include oopsen, that is) would be this:
* count tp->root and tp_c->hlist as separate references.  I.e.
have u32_init() set refcount to 2, not 1.
* in u32_destroy() we always drop the former; in u32_destroy_hnode() -
the latter.

That way we have *all* references contributing to refcount.  List
removal happens in u32_destroy_hnode() (called only when ->refcnt is 1)
and in u32_destroy() in case of tc_u_common going away, along with everything
reachable from it.  IOW, that way we know that u32_destroy_key() won't
free something still on the list (or pointed to by someone's ->root).

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index f218ccf1e2d9..3f985f29ef30 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -398,6 +398,7 @@ static int u32_init(struct tcf_proto *tp)
rcu_assign_pointer(tp_c->hlist, root_ht);
root_ht->tp_c = tp_c;
 
+   root_ht->refcnt++;
rcu_assign_pointer(tp->root, root_ht);
tp->data = tp_c;
return 0;
@@ -610,7 +611,7 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
struct tc_u_hnode __rcu **hn;
struct tc_u_hnode *phn;
 
-   WARN_ON(ht->refcnt);
+   WARN_ON(--ht->refcnt);
 
u32_clear_hnode(tp, ht, extack);
 
@@ -649,7 +650,7 @@ static void u32_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
 
WARN_ON(root_ht == NULL);
 
-   if (root_ht && --root_ht->refcnt == 0)
+   if (root_ht && --root_ht->refcnt == 1)
u32_destroy_hnode(tp, root_ht, extack);
 
if (--tp_c->refcnt == 0) {
@@ -698,7 +699,6 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last,
}
 
if (ht->refcnt == 1) {
-   ht->refcnt--;
u32_destroy_hnode(tp, ht, extack);
} else {
NL_SET_ERR_MSG_MOD(extack, "Can not delete in-use filter");
-- 
2.11.0



[PATCH 4/7] get rid of unused argument of u32_destroy_key()

2018-09-05 Thread Al Viro
From: Al Viro 

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 5816288810cc..3311aacad6c3 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -406,8 +406,7 @@ static int u32_init(struct tcf_proto *tp)
return 0;
 }
 
-static int u32_destroy_key(struct tcf_proto *tp, struct tc_u_knode *n,
-  bool free_pf)
+static int u32_destroy_key(struct tc_u_knode *n, bool free_pf)
 {
struct tc_u_hnode *ht = rtnl_dereference(n->ht_down);
 
@@ -441,7 +440,7 @@ static void u32_delete_key_work(struct work_struct *work)
  struct tc_u_knode,
  rwork);
rtnl_lock();
-   u32_destroy_key(key->tp, key, false);
+   u32_destroy_key(key, false);
rtnl_unlock();
 }
 
@@ -458,7 +457,7 @@ static void u32_delete_key_freepf_work(struct work_struct *work)
  struct tc_u_knode,
  rwork);
rtnl_lock();
-   u32_destroy_key(key->tp, key, true);
+   u32_destroy_key(key, true);
rtnl_unlock();
 }
 
@@ -602,7 +601,7 @@ static void u32_clear_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
if (tcf_exts_get_net(&n->exts))
tcf_queue_work(&n->rwork, u32_delete_key_freepf_work);
else
-   u32_destroy_key(n->tp, n, true);
+   u32_destroy_key(n, true);
}
}
 }
@@ -972,13 +971,13 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
tca[TCA_RATE], ovr, extack);
 
if (err) {
-   u32_destroy_key(tp, new, false);
+   u32_destroy_key(new, false);
return err;
}
 
err = u32_replace_hw_knode(tp, new, flags, extack);
if (err) {
-   u32_destroy_key(tp, new, false);
+   u32_destroy_key(new, false);
return err;
}
 
-- 
2.11.0



[PATCH 3/7] make sure that divisor is a power of 2

2018-09-05 Thread Al Viro
From: Al Viro 

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 9ea5f2be907b..5816288810cc 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -995,7 +995,11 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
if (tb[TCA_U32_DIVISOR]) {
unsigned int divisor = nla_get_u32(tb[TCA_U32_DIVISOR]);
 
-   if (--divisor > 0x100) {
+   if (!is_power_of_2(divisor)) {
+   NL_SET_ERR_MSG_MOD(extack, "Divisor is not a power of 2");
+   return -EINVAL;
+   }
+   if (divisor-- > 0x100) {
NL_SET_ERR_MSG_MOD(extack, "Exceeded maximum 256 hash buckets");
return -EINVAL;
}
-- 
2.11.0



[PATCHES] cls_u32 cleanups and fixes

2018-09-05 Thread Al Viro
Several cls_u32 patches: fixing refcounting, preventing
links to and deletion of root hnodes, validating divisor.  Plus
cleanups - removal of some useless fields and saner handling of
tc_u_common hashtable.  The first 3 in series are fixes (and
-stable fodder), the rest - cleanups.  Branch can be found in
vfs.git#misc.cls_u32; individual patches will go in followups.


[PATCH 5/7] get rid of tc_u_knode ->tp

2018-09-05 Thread Al Viro
From: Al Viro 

not used anymore

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 3311aacad6c3..8a1a573487bd 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -68,7 +68,6 @@ struct tc_u_knode {
u32 mask;
u32 __percpu*pcpu_success;
 #endif
-   struct tcf_proto*tp;
struct rcu_work rwork;
/* The 'sel' field MUST be the last field in structure to allow for
 * tc_u32_keys allocated at end of structure.
@@ -897,7 +896,6 @@ static struct tc_u_knode *u32_init_knode(struct tcf_proto *tp,
/* Similarly success statistics must be moved as pointers */
new->pcpu_success = n->pcpu_success;
 #endif
-   new->tp = tp;
memcpy(&new->sel, s, sizeof(*s) + s->nkeys*sizeof(struct tc_u32_key));
 
if (tcf_exts_init(&new->exts, TCA_U32_ACT, TCA_U32_POLICE)) {
@@ -1113,7 +1111,6 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
n->handle = handle;
n->fshift = s->hmask ? ffs(ntohl(s->hmask)) - 1 : 0;
n->flags = flags;
-   n->tp = tp;
 
err = tcf_exts_init(&n->exts, TCA_U32_ACT, TCA_U32_POLICE);
if (err < 0)
-- 
2.11.0



[PATCH 2/7] mark root hnode explicitly

2018-09-05 Thread Al Viro
From: Al Viro 

... and disallow deleting or linking to such

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 3f985f29ef30..9ea5f2be907b 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -91,6 +91,7 @@ struct tc_u_hnode {
 */
struct tc_u_knode __rcu *ht[1];
 };
+#define TCA_CLS_FLAGS_U32_ROOT (1<<8)
 
 struct tc_u_common {
struct tc_u_hnode __rcu *hlist;
@@ -377,6 +378,7 @@ static int u32_init(struct tcf_proto *tp)
root_ht->refcnt++;
root_ht->handle = tp_c ? gen_new_htid(tp_c, root_ht) : 0x80000000;
root_ht->prio = tp->prio;
+   root_ht->flags = TCA_CLS_FLAGS_U32_ROOT;
idr_init(&root_ht->handle_idr);
 
if (tp_c == NULL) {
@@ -491,7 +493,8 @@ static void u32_clear_hw_hnode(struct tcf_proto *tp, struct tc_u_hnode *h,
struct tcf_block *block = tp->chain->block;
struct tc_cls_u32_offload cls_u32 = {};
 
-   tc_cls_common_offload_init(&cls_u32.common, tp, h->flags, extack);
+   tc_cls_common_offload_init(&cls_u32.common, tp,
+   h->flags & ~TCA_CLS_FLAGS_U32_ROOT, extack);
cls_u32.command = TC_CLSU32_DELETE_HNODE;
cls_u32.hnode.divisor = h->divisor;
cls_u32.hnode.handle = h->handle;
@@ -693,7 +696,7 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last,
goto out;
}
 
-   if (root_ht == ht) {
+   if (ht->flags & TCA_CLS_FLAGS_U32_ROOT) {
NL_SET_ERR_MSG_MOD(extack, "Not allowed to delete root node");
return -EINVAL;
}
@@ -795,6 +798,10 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
NL_SET_ERR_MSG_MOD(extack, "Link hash table not found");
return -EINVAL;
}
+   if (ht_down->flags & TCA_CLS_FLAGS_U32_ROOT) {
+   NL_SET_ERR_MSG_MOD(extack, "Not linking to root node");
+   return -EINVAL;
+   }
ht_down->refcnt++;
}
 
@@ -1214,7 +1221,8 @@ static int u32_reoffload_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
struct tc_cls_u32_offload cls_u32 = {};
int err;
 
-   tc_cls_common_offload_init(&cls_u32.common, tp, ht->flags, extack);
+   tc_cls_common_offload_init(&cls_u32.common, tp,
+   ht->flags & ~TCA_CLS_FLAGS_U32_ROOT, extack);
cls_u32.command = add ? TC_CLSU32_NEW_HNODE : TC_CLSU32_DELETE_HNODE;
cls_u32.hnode.divisor = ht->divisor;
cls_u32.hnode.handle = ht->handle;
-- 
2.11.0



[PATCH 7/7] clean tc_u_common hashtable

2018-09-05 Thread Al Viro
From: Al Viro 

* calculate key *once*, not for each hash chain element
* let tc_u_hash() return the pointer to chain head rather than index -
callers are cleaner that way.

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index be9240ae1417..6d45ec4c218c 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -343,19 +343,16 @@ static void *tc_u_common_ptr(const struct tcf_proto *tp)
return block->q;
 }
 
-static unsigned int tc_u_hash(const struct tcf_proto *tp)
+static struct hlist_head *tc_u_hash(void *key)
 {
-   return hash_ptr(tc_u_common_ptr(tp), U32_HASH_SHIFT);
+   return tc_u_common_hash + hash_ptr(key, U32_HASH_SHIFT);
 }
 
-static struct tc_u_common *tc_u_common_find(const struct tcf_proto *tp)
+static struct tc_u_common *tc_u_common_find(void *key)
 {
struct tc_u_common *tc;
-   unsigned int h;
-
-   h = tc_u_hash(tp);
-   hlist_for_each_entry(tc, &tc_u_common_hash[h], hnode) {
-   if (tc->ptr == tc_u_common_ptr(tp))
+   hlist_for_each_entry(tc, tc_u_hash(key), hnode) {
+   if (tc->ptr == key)
return tc;
}
return NULL;
@@ -364,10 +361,8 @@ static struct tc_u_common *tc_u_common_find(const struct tcf_proto *tp)
 static int u32_init(struct tcf_proto *tp)
 {
struct tc_u_hnode *root_ht;
-   struct tc_u_common *tp_c;
-   unsigned int h;
-
-   tp_c = tc_u_common_find(tp);
+   void *key = tc_u_common_ptr(tp);
+   struct tc_u_common *tp_c = tc_u_common_find(key);
 
root_ht = kzalloc(sizeof(*root_ht), GFP_KERNEL);
if (root_ht == NULL)
@@ -389,8 +384,7 @@ static int u32_init(struct tcf_proto *tp)
INIT_HLIST_NODE(&tp_c->hnode);
idr_init(&tp_c->handle_idr);
 
-   h = tc_u_hash(tp);
-   hlist_add_head(&tp_c->hnode, &tc_u_common_hash[h]);
+   hlist_add_head(&tp_c->hnode, tc_u_hash(key));
}
 
tp_c->refcnt++;
-- 
2.11.0



[PATCH 6/7] get rid of tc_u_common ->rcu

2018-09-05 Thread Al Viro
From: Al Viro 

unused

Signed-off-by: Al Viro 
---
 net/sched/cls_u32.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 8a1a573487bd..be9240ae1417 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -98,7 +98,6 @@ struct tc_u_common {
int refcnt;
struct idr  handle_idr;
struct hlist_node   hnode;
-   struct rcu_head rcu;
 };
 
 static inline unsigned int u32_hash_fold(__be32 key,
-- 
2.11.0



Re: [PATCH] net: sched: Fix memory exposure from short TCA_U32_SEL

2018-08-26 Thread Al Viro
On Sun, Aug 26, 2018 at 11:35:17PM -0400, Julia Lawall wrote:

> * x = \(kmalloc\|kzalloc\|devm_kmalloc\|devm_kzalloc\)(...)

I can name several you've missed right off the top of my head -
vmalloc, kvmalloc, kmem_cache_alloc, kmem_cache_zalloc, variants
with _trace slapped on, and that is not to mention the things like
get_free_page or

void *my_k3wl_alloc(u64 n) // 'cause all artificial limits suck, that's why
{
lots and lots of home-grown stats collection
some tracepoints thrown in just for fun
return kmalloc(n);
}

(and no, I'm not implying that net/sched folks had done anything of that
sort; I have seen that and worse in drivers, though)

> The * at the beginning of the line means to highlight what you are looking
> for, which is done by making a diff in which the highlighted line
> appears to be removed.

Umm...  Does that cover return, BTW?  Or something like
T *barf;
extern void foo(T *p);
foo(kmalloc(sizeof(*barf)));


> The limitation is the ability to figure out the type of x.  If it is a
> local variable, Coccinelle should have no problem.  If it is a structure
> field, it may be necessary to provide command line arguments like
> 
> --all-includes --include-headers-for-types
> 
> --all-includes means to try to find all include files that are mentioned
> in the .c file.  The next stronger option is --recursive includes, which
> means include what all of the mentioned files include as well,
> recursively.  This tends to cause a major performance hit, because a lot
> of code is being parsed.  --include-headers-for-types heals a bit with
> that, as it only considers the header files when computing type
> information, and now when applying the rules.
> 
> With respect to ifdefs around variable declarations and structure field
> declaration, in these cases Coccinelle considers that it cannot make the
> ifdef have an if-like control flow, and so if considers the #ifdef, #else
> and #endif to be comments.  Thus it takes into account only the last type
> provided for a given variable.

[snip]

What about several variants of structure definition?  Because ifdefs around
includes do occur in the wild...
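
(A contrived illustration of the pattern in question - two layouts of the
same structure selected by configuration, with config-dependent sizeof and
field offsets:

#ifdef CONFIG_SOME_OPTION
struct thing { int a; long b; };
#else
struct thing { int a; };
#endif
)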


Re: [PATCH] net: sched: Fix memory exposure from short TCA_U32_SEL

2018-08-26 Thread Al Viro
On Sun, Aug 26, 2018 at 10:00:46PM -0400, Julia Lawall wrote:
> 
> 
> On Sun, 26 Aug 2018, Al Viro wrote:
> 
> > On Sun, Aug 26, 2018 at 03:26:54PM -0700, Joe Perches wrote:
> > > On Sun, 2018-08-26 at 22:24 +0100, Al Viro wrote:
> > > > On Sun, Aug 26, 2018 at 11:57:57AM -0700, Joe Perches wrote:
> > > >
> > > > > > That, BTW, is why I hate the use of sizeof(*p) in kmalloc, etc.
> > > > > > arguments.  typeof is even worse in that respect.
> > > > >
> > > > > True.  Semantic searches via tools like coccinelle could help here
> > > > > but those searches are quite a bit slower than straightforward greps.
> > > >
> > > > Those searches are .config-sensitive as well, which can be much more
> > > > unpleasant than being slow...
> > >
> > > Are they?  Julia?
> >
> > They work pretty much on preprocessor output level; if something is ifdef'ed
> > out on a given config, it won't be seen...
> 
> Coccinelle doesn't care what is ifdef'd out.  It only misses the things it
> can't parse.  Very strange ifdefs could indeed cause that, but it should
> be a minor problem.

OK, but... what does it do when it sees two definitions of a structure
in different branches of #if/#else/#endif?  I think I'm confused about
what it can and cannot do; to restate the original problem:
* we need to find all places where instances of given type
are created.  Assume it never is a member of struct/union/array and
no static or auto duration instances exist - everything is dynamically
allocated somewhere.

Can coccinelle do that and if it can, what are the limitations?


Re: [endianness bug] cxgb4: mk_act_open_req() buggers ->{local,peer}_ip on big-endian hosts

2018-08-17 Thread Al Viro
On Fri, Aug 17, 2018 at 07:59:44PM +0100, Al Viro wrote:
> On Fri, Aug 17, 2018 at 07:58:41PM +0100, Al Viro wrote:
> > On Fri, Aug 17, 2018 at 07:09:49PM +0100, Al Viro wrote:
> > 
> > > Re that code - are you sure it doesn't need le64_to_cpu(*src)?  Because from what
> > > I understand about PCI (which matches just fine to the comments in the same driver),
> > > you probably do need that.  Again, the only real way to find out is to test on
> > > big-endian host...
> > 
> > BTW, would that, by any chance, be an open-coded
> > _iowrite64_copy(dst, src, EQ_UNIT/sizeof(u64))
> 
> __iowrite64_copy, even...

FWIW, it looks like the confusion had been between the endianness of the data structures
(b-e both on host and NIC side) and the fact that PCI is l-e.  *IF* that code wants to
copy data from host data structures to iomem as-is, it needs to use __raw_writeq() and
its ilk or writeq(le64_to_cpu(...)) to compensate.  The latter would, indeed, confuse
sparse - we are accessing b-e data as if it was l-e.

If we want copying that wouldn't affect the endianness, we need memcpy_toio() or similar
beasts.  And AFAICS that code is very close to
/* If we're only writing a single Egress Unit and the BAR2
 * Queue ID is 0, we can use the Write Combining Doorbell
 * Gather Buffer; otherwise we use the simple doorbell.
 */
if (n == 1 && tq->bar2_qid == 0) {
unsigned int index = (tq->pidx ?: tq->size) - 1;
/* Copy the TX Descriptor in a tight loop in order to
 * try to get it to the adapter in a single Write
 * Combined transfer on the PCI-E Bus.  If the Write
 * Combine fails (say because of an interrupt, etc.)
 * the hardware will simply take the last write as a
 * simple doorbell write with a PIDX Increment of 1
 * and will fetch the TX Descriptor from memory via
 * DMA.
 */
__iowrite64_copy(tq->bar2_addr + SGE_UDB_WCDOORBELL,
                 &tq->desc[index], EQ_UNIT/sizeof(u64));
} else {
writel(val | QID_V(tq->bar2_qid),
   tq->bar2_addr + SGE_UDB_KDOORBELL);
}
/* This Write Memory Barrier will force the write to the User
 * Doorbell area to be flushed.  This is needed to prevent
 * writes on different CPUs for the same queue from hitting
 * the adapter out of order.  This is required when some Work
 * Requests take the Write Combine Gather Buffer path (user
 * doorbell area offset [SGE_UDB_WCDOORBELL..+63]) and some
 * take the traditional path where we simply increment the
 * PIDX (User Doorbell area SGE_UDB_KDOORBELL) and have the
 * hardware DMA read the actual Work Request.
 */
wmb();

which wouldn't have looked unusual...  Again, that really needs review from
the folks familiar with the hardware in question, as well as testing - I'm
not trying to push patches like that.  If the current mainline variant
really works on b-e, I'd like to understand how does it manage that, though...


Re: [endianness bug] cxgb4: mk_act_open_req() buggers ->{local,peer}_ip on big-endian hosts

2018-08-17 Thread Al Viro
On Fri, Aug 17, 2018 at 07:58:41PM +0100, Al Viro wrote:
> On Fri, Aug 17, 2018 at 07:09:49PM +0100, Al Viro wrote:
> 
> > Re that code - are you sure it doesn't need le64_to_cpu(*src)?  Because from what
> > I understand about PCI (which matches just fine to the comments in the same driver),
> > you probably do need that.  Again, the only real way to find out is to test on
> > big-endian host...
> 
> BTW, would that, by any chance, be an open-coded
>   _iowrite64_copy(dst, src, EQ_UNIT/sizeof(u64))

__iowrite64_copy, even...


Re: [endianness bug] cxgb4: mk_act_open_req() buggers ->{local,peer}_ip on big-endian hosts

2018-08-17 Thread Al Viro
On Fri, Aug 17, 2018 at 07:09:49PM +0100, Al Viro wrote:

> Re that code - are you sure it doesn't need le64_to_cpu(*src)?  Because from what
> I understand about PCI (which matches just fine to the comments in the same driver),
> you probably do need that.  Again, the only real way to find out is to test on
> big-endian host...

BTW, would that, by any chance, be an open-coded
_iowrite64_copy(dst, src, EQ_UNIT/sizeof(u64))


Re: [endianness bug] cxgb4: mk_act_open_req() buggers ->{local,peer}_ip on big-endian hosts

2018-08-17 Thread Al Viro
On Fri, Aug 17, 2018 at 04:43:20PM +0100, Al Viro wrote:
> On Fri, Aug 17, 2018 at 06:35:41PM +0530, Ganesh Goudar wrote:
> > Thanks, Al. The patch looks good to me but it does not seem
> > to be showing up in patchwork, should I resend the patch on
> > your behalf to net tree ?
> 
> Umm...  I thought net-next had been closed until -rc1, hadn't
> it?
> 
> Anyway, endianness cleanups and fixes of drivers/net/ethernet/chelsio
> can be found in vfs.git #net-endian.chelsio; I was planning to post
> that stuff on netdev after -rc1, but I would certainly appreciate
> a look from somebody familiar with the hardware prior to that, assuming
> you have time for that at the moment...  The stuff in there (it's
> based off net/master):
>   struct cxgb4_next_header .match_val/.match_mask/mask should be net-endian
>   cxgb4: fix TC-PEDIT-related parts of cxgb4_tc_flower on big-endian
>   cxgb4_tc_u32: trivial endianness annotations
>   cxgb4: trivial endianness annotations
>   libcxgb: trivial __percpu annotations
>   libcxgb: trivial endianness annotations
>   cxgb3: trivial endianness annotations
>   cxgb3: don't get cute with !! and shifts in t3_prep_adapter()...
>   [investigate][endianness bug] cxgb3: assigning htonl(11-bit value) to __be16 field is wrong
>   cxgb: trivial endianness annotations

... and updated for some of today's catch (== stuff caught while finding the
reply).  Another very likely bug hidden by force-casts: in t4_fwcache()
you put host-endian 0 or 1 into __be32 field, hiding that by force-cast.
Then feed that to hardware, which, judging by everything else nearby
expects big-endian there.  The only reason it works, AFAICS, is that you
only pass it FW_PARAM_DEV_FWCACHE_FLUSH (== 0); as soon as you get
a caller passing FW_PARAM_DEV_FWCACHE_FLUSHINV, you'll get breakage.

And frankly, comments like
while (count) {
/* the (__force u64) is because the compiler
 * doesn't understand the endian swizzling
 * going on
 */
writeq((__force u64)*src, dst);
src++;
dst++;
count--;
}
are more than slightly terrifying - piss on compiler, what about reviewers?  And
the authors themselves, for that matter...  FWIW, see the dmr's comments on
"you are not expected to understand that" story (bell labs site is buggered,
but wayback machine has it on
https://web.archive.org/web/20140724213028/http://cm.bell-labs.com/cm/cs/who/dmr/odd.html)
Especially the "The real problem is that we didn't understand what was going on either"
part...

Re that code - are you sure it doesn't need le64_to_cpu(*src)?  Because from what
I understand about PCI (which matches just fine to the comments in the same driver),
you probably do need that.  Again, the only real way to find out is to test on
big-endian host...


Re: [endianness bug] cxgb4: mk_act_open_req() buggers ->{local,peer}_ip on big-endian hosts

2018-08-17 Thread Al Viro
On Fri, Aug 17, 2018 at 06:35:41PM +0530, Ganesh Goudar wrote:
> Thanks, Al. The patch looks good to me but it does not seem
> to be showing up in patchwork, should I resend the patch on
> your behalf to net tree ?

Umm...  I thought net-next had been closed until -rc1, hadn't
it?

Anyway, endianness cleanups and fixes of drivers/net/ethernet/chelsio
can be found in vfs.git #net-endian.chelsio; I was planning to post
that stuff on netdev after -rc1, but I would certainly appreciate
a look from somebody familiar with the hardware prior to that, assuming
you have time for that at the moment...  The stuff in there (it's
based off net/master):
  struct cxgb4_next_header .match_val/.match_mask/mask should be net-endian
  cxgb4: fix TC-PEDIT-related parts of cxgb4_tc_flower on big-endian
  cxgb4_tc_u32: trivial endianness annotations
  cxgb4: trivial endianness annotations
  libcxgb: trivial __percpu annotations
  libcxgb: trivial endianness annotations
  cxgb3: trivial endianness annotations
  cxgb3: don't get cute with !! and shifts in t3_prep_adapter()...
  [investigate][endianness bug] cxgb3: assigning htonl(11-bit value) to __be16 field is wrong
  cxgb: trivial endianness annotations

The first two are fixes (the second being the patch you've just replied to),
next-to-last might or might not be (see "[RFC] weirdness in
cxgb3_main.c:init_tp_parity()" posted to netdev a couple of weeks
ago), the rest is pure annotations.  Result is not entirely sparse-clean, but
fairly close to that:

drivers/net/ethernet/chelsio/cxgb3/t3_hw.c:681:67: warning: incorrect type in argument 3 (different base types)
drivers/net/ethernet/chelsio/cxgb3/t3_hw.c:681:67:    expected restricted __le32 [usertype] data
drivers/net/ethernet/chelsio/cxgb3/t3_hw.c:681:67:    got int
drivers/net/ethernet/chelsio/cxgb3/t3_hw.c:898:31: warning: incorrect type in assignment (different base types)
drivers/net/ethernet/chelsio/cxgb3/t3_hw.c:898:31:    expected unsigned int [unsigned] [usertype]
drivers/net/ethernet/chelsio/cxgb3/t3_hw.c:898:31:    got restricted __be32 [usertype]
drivers/net/ethernet/chelsio/cxgb3/sge.c:2435:43: warning: incorrect type in assignment (different base types)
drivers/net/ethernet/chelsio/cxgb3/sge.c:2435:43:    expected restricted __wsum [usertype] csum
drivers/net/ethernet/chelsio/cxgb3/sge.c:2435:43:    got restricted __be32 [assigned] [usertype] rss_hi
drivers/net/ethernet/chelsio/cxgb3/sge.c:2436:47: warning: incorrect type in assignment (different base types)
drivers/net/ethernet/chelsio/cxgb3/sge.c:2436:47:    expected unsigned int [unsigned] [usertype] priority
drivers/net/ethernet/chelsio/cxgb3/sge.c:2436:47:    got restricted __be32 [assigned] [usertype] rss_lo
drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c:419:38: warning: incorrect type in assignment (different base types)
drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c:419:38:    expected restricted __be32 [usertype] mask
drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c:419:38:    got unsigned int
drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c:420:37: warning: incorrect type in assignment (different base types)
drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c:420:37:    expected restricted __be32 [usertype] val
drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c:420:37:    got unsigned int
drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c:449:22: warning: incorrect type in assignment (different base types)
drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c:449:22:    expected restricted __be32 [usertype] mask
drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c:449:22:    got unsigned int

Remaining tc_flower warnings are misannotated tcf_pedit_mask()/tcf_pedit_val()
(both are actually returning __be32)

sge.c ones are
/* Preserve the RSS info in csum & priority */
skb->csum = rss_hi;
skb->priority = rss_lo;
and I've no idea whether it's a problem or not - ->csum is almost
certainly being abused there, and it looks like ->priority also
is...  I don't know - it's certainly used as host-endian in the
same driver (see e.g. queue_set() and is_ctrl_pkt()), but those are
on TX path and this is RX...  I'm not familiar enough with the
code to tell what's going on here - what *is* the consumer of
the data stored there?  cxgb3_offload.c get_hwtid() and get_opcode()?
If so (and if that's all there is), why not simply something like
skb->priority = (ntohl(rss_lo) & 0xffffff00) | G_OPCODE(ntohl(rss_hi));
before passing skb to rx_offload(), with
static inline u32 get_hwtid(struct sk_buff *skb)
{
return skb->priority >> 8;
}
static inline u32 get_opcode(struct sk_buff *skb)
{
return skb->priority & 0xff;
}
and don't bother touching ->csum...  Again, I'm not familiar with hardware
*or* the driver; I've no idea how much is e.g. shared with the infiniband
or scsi 

[RFC][bug?] "net/act_pedit: Introduce 'add' operation" is broken for anything wider than an octet

2018-08-14 Thread Al Viro
The code doing addition in that commit is

+   switch (cmd) {
+   case TCA_PEDIT_KEY_EX_CMD_SET:
+   val = tkey->val;
+   break;
+   case TCA_PEDIT_KEY_EX_CMD_ADD:
+   val = (*ptr + tkey->val) & ~tkey->mask;
+   break;
+   default:
+   pr_info("tc filter pedit bad command (%d)\n",
+   cmd);
+   goto bad;
+   }
+
+   *ptr = ((*ptr & tkey->mask) ^ val);


Any net-endian field wider than an octet will have the carry between
octets handled wrong on little-endian hosts.  Should we at least
verify that ~mask fits into one octet?

As it is, consider e.g. an attempt to subtract 1 from a 16bit field
at offset 2 in a word.  We want {0,0,0,1} (0x01000000 from host POV)
to turn into 0, so the value to add would be 0xff000000.  Except that
{0, 0, 1, 0} would turn into {0, 0, 1, 0xff} that way, not the
expected {0, 0, 0, 0xff}.
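
To make the carry breakage concrete - a minimal userland sketch (my code, not
the driver's; little-endian host assumed, constants as in the example above,
arithmetic as in the snippet quoted at the top):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
	/* 16bit net-endian field at offset 2 in the word; current value 0x0100 */
	uint8_t pkt[4] = {0, 0, 1, 0};
	uint32_t word, val;
	uint32_t add = 0xff000000;	/* "subtract 1", computed on an l-e host */
	uint32_t mask = 0x0000ffff;	/* bits to preserve, host POV on l-e */

	memcpy(&word, pkt, 4);		/* the *ptr load */
	val = (word + add) & ~mask;	/* TCA_PEDIT_KEY_EX_CMD_ADD */
	word = (word & mask) ^ val;
	memcpy(pkt, &word, 4);

	/* want {0, 0, 0, 0xff}, i.e. 255; an l-e host prints 00 00 01 ff */
	printf("%02x %02x %02x %02x\n", pkt[0], pkt[1], pkt[2], pkt[3]);
	return 0;
}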

Granted, there's not a lot of wider-than-octet fields where arithmetics
would've made sense, but we probably ought to refuse allowing such
operations.  Especially since on big-endian hosts they will work
just fine until you try to move that over to a little-endian box...

Alternatively, we could do something like
val = htonl(be32_to_cpup(ptr) + ntohl(tkey->val)) & ~tkey->mask;
but I'm not sure if that's worth doing.  It's not as if there would be
a major overhead, but still...

Comments?


Re: [endianness bug] cxgb4: mk_act_open_req() buggers ->{local,peer}_ip on big-endian hosts

2018-08-14 Thread Al Viro
How can cxgb4/cxgb4_tc_flower.c handling of 16bit
fields possibly work on b-e?  Look:
case TCA_PEDIT_KEY_EX_HDR_TYPE_TCP:
switch (offset) {
case PEDIT_TCP_SPORT_DPORT:
if (~mask & PEDIT_TCP_UDP_SPORT_MASK)
offload_pedit(fs, cpu_to_be32(val) >> 16,
  cpu_to_be32(mask) >> 16,
  TCP_SPORT);

OK, we are feeding two results of >> 16 (i.e. the values in
range 0..65535 from the host POV) to offload_pedit().  Which does

static void offload_pedit(struct ch_filter_specification *fs, u32 val, u32 mask,
  u8 field)
{
u32 set_val = val & ~mask;

OK, it's a value in range 0..65535.

u32 offset = 0;
u8 size = 1;
int i;

for (i = 0; i < ARRAY_SIZE(pedits); i++) {
if (pedits[i].field == field) {
go until we finally find this:
PEDIT_FIELDS(TCP_, SPORT, 2, nat_fport, 0),
i.e.
{TCP_SPORT, 2, offsetof(struct ch_filter_specification, nat_fport)}
offset = pedits[i].offset;
size = pedits[i].size;
... resulting in offset = offsetof(..., nat_fport), size = 2
break;
}
}
memcpy((u8 *)fs + offset, &set_val, size);
... and we copy the first two bytes of set_val to fs->nat_fport, right?

On little-endian, assuming that val & 0xffff was 256 * V0 + V1 and
mask & 0xffff - 256 * M0 + M1, we get cpu_to_be32(val) >> 16 equal to
256 * V1 + V0, and similar for mask, resulting in set_val containing
{V0 & ~M0, V1 & ~M1, 0, 0}, with the first two bytes copied to fs->nat_fport.

Now, think what will happen on big-endian.  The value in set_val has upper
16 bits all zero, no matter what - shift anything 32bit down by 16 and you'll
get that.  And on big-endian that's first two bytes of memory representation,
so this memcpy() is absolutely guaranteed to set fs->nat_fport to zero.
No matter how fancy the hardware is, it can't guess what had the other two
bytes been - CPU has discarded those before the NIC had a chance to see
them.
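
The effect fits in a few lines of userland C (a sketch of mine, value arbitrary):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
	uint32_t set_val = 0x0000abcd;	/* val & ~mask after >> 16: upper 16 bits are 0 */
	uint16_t nat_fport;

	memcpy(&nat_fport, &set_val, 2);	/* "first two bytes", as in offload_pedit() */
	/* l-e: bytes {cd, ab} -> 0xabcd; b-e: bytes {00, 00} -> 0, always */
	printf("%#x\n", (unsigned)nat_fport);
	return 0;
}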

Am I right assuming that the val is supposed to be {S1, S0, D1, D0},
with sport == S1 * 256 + S0, dport == D1 * 256 + D0?  If so, the following
ought to work [== COMPLETELY UNTESTED, in other words] on l-e same as the
current code does and do the right thing on b-e.  Objections?

offload_pedit() is broken for big-endian; it's actually easier to spell the
memcpy (and in case of ports - memcpy-with-byteswap) explicitly, avoiding
both the b-e problems and getting rid of a lot of LoC, including an unpleasant
macro.

Signed-off-by: Al Viro 
---
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
index 3db969eefba9..020ca0121fb4 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
@@ -43,27 +43,6 @@
 
 #define STATS_CHECK_PERIOD (HZ / 2)
 
-static struct ch_tc_pedit_fields pedits[] = {
-   PEDIT_FIELDS(ETH_, DMAC_31_0, 4, dmac, 0),
-   PEDIT_FIELDS(ETH_, DMAC_47_32, 2, dmac, 4),
-   PEDIT_FIELDS(ETH_, SMAC_15_0, 2, smac, 0),
-   PEDIT_FIELDS(ETH_, SMAC_47_16, 4, smac, 2),
-   PEDIT_FIELDS(IP4_, SRC, 4, nat_fip, 0),
-   PEDIT_FIELDS(IP4_, DST, 4, nat_lip, 0),
-   PEDIT_FIELDS(IP6_, SRC_31_0, 4, nat_fip, 0),
-   PEDIT_FIELDS(IP6_, SRC_63_32, 4, nat_fip, 4),
-   PEDIT_FIELDS(IP6_, SRC_95_64, 4, nat_fip, 8),
-   PEDIT_FIELDS(IP6_, SRC_127_96, 4, nat_fip, 12),
-   PEDIT_FIELDS(IP6_, DST_31_0, 4, nat_lip, 0),
-   PEDIT_FIELDS(IP6_, DST_63_32, 4, nat_lip, 4),
-   PEDIT_FIELDS(IP6_, DST_95_64, 4, nat_lip, 8),
-   PEDIT_FIELDS(IP6_, DST_127_96, 4, nat_lip, 12),
-   PEDIT_FIELDS(TCP_, SPORT, 2, nat_fport, 0),
-   PEDIT_FIELDS(TCP_, DPORT, 2, nat_lport, 0),
-   PEDIT_FIELDS(UDP_, SPORT, 2, nat_fport, 0),
-   PEDIT_FIELDS(UDP_, DPORT, 2, nat_lport, 0),
-};
-
 static struct ch_tc_flower_entry *allocate_flower_entry(void)
 {
struct ch_tc_flower_entry *new = kzalloc(sizeof(*new), GFP_KERNEL);
@@ -306,81 +285,63 @@ static int cxgb4_validate_flow_match(struct net_device *dev,
return 0;
 }
 
-static void offload_pedit(struct ch_filter_specification *fs, u32 val, u32 mask,
- u8 field)
-{
-   u32 set_val = val & ~mask;
-   u32 offset = 0;
-   u8 size = 1;
-   int i;
-
-   for (i = 0; i < ARRAY_SIZE(pedits); i++) {
-   if (pedits[i].field == field) {
-   offset = pedits[i].offset;
-   size = pedits[i].size;
-   break;
-   }
-   }
-   memcpy((u8 *)fs + offset, &set_val, size);
-}
-
-static void process_pedit_field(struct ch_filter_specification *fs

Re: [endianness bug?] cxgb4_next_header .match_val/.match_mask should be net-endian

2018-08-06 Thread Al Viro
On Tue, Aug 07, 2018 at 03:51:34AM +0800, kbuild test robot wrote:
> Hi Al,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on net-next/master]
> [also build test WARNING on v4.18-rc8 next-20180806]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Al-Viro/cxgb4_next_header-match_val-match_mask-should-be-net-endian/20180806-183127
> reproduce:
> # apt-get install sparse
> make ARCH=x86_64 allmodconfig
> make C=1 CF=-D__CHECK_ENDIAN__

Yes, there are arseloads of misannotations.  And I do have followups
that *do* annotate that (which is exactly how that bug had been found
in the first place).  But I'd rather separate fixes from trivial
annotations - the former might need to be backported, the latter would
only cause extra PITA for backport.


[RFC] weirdness in cxgb3_main.c:init_tp_parity()

2018-08-05 Thread Al Viro
for (i = 0; i < 2048; i++) {
...
req->l2t_idx = htonl(V_L2T_W_IDX(i));
...

in there is very odd; l2t_idx is a 16bit field, and
#define V_L2T_W_IDX(x) ((x) << S_L2T_W_IDX)
#define S_L2T_W_IDX    0

IOW, we are taking htonl(something in range 0..2047) and
shove it into 16bit field.  Which would, on a little-endian
host, be a fancy way of spelling 
req->l2t_idx = 0;

What's the intended behaviour there?  I'm not familiar with
the hardware in question; this smells like a typoed
req->l2t_idx = htons(...)
but how does the current code manage to work (i.e. does
anything even care about the value stored there)?  It's not
a big-endian-only driver, after all...
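
For the record, a userland sketch of the little-endian behaviour (mine, not the
driver's):

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

int main(void)
{
	uint32_t i;

	for (i = 0; i < 2048; i++) {
		uint16_t l2t_idx = htonl(i);	/* silently truncated to the low 16 bits */
		if (l2t_idx)
			return 1;
	}
	/* on l-e, htonl(0..2047) keeps all non-zero bytes in the upper half */
	printf("l2t_idx would be 0 for every i in 0..2047\n");
	return 0;
}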


[endianness bug] cxgb4: mk_act_open_req() buggers ->{local,peer}_ip on big-endian hosts

2018-08-05 Thread Al Viro
Unlike fs.val.lport and fs.val.fport, cxgb4_process_flow_match()
sets fs.val.{l,f}ip to net-endian values without conversion - they come
straight from flow_dissector_key_ipv4_addrs ->dst and ->src resp.  So
the assignment in mk_act_open_req() ought to be a straight copy.

As far as I know, T4 PCIe cards do exist, so it's not as if that
thing could only be found on little-endian systems...

Signed-off-by: Al Viro 
---
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c
index 00fc5f1afb1d..7dddb9e748b8 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c
@@ -1038,10 +1038,8 @@ static void mk_act_open_req(struct filter_entry *f, struct sk_buff *skb,
OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ACT_OPEN_REQ, qid_filterid));
req->local_port = cpu_to_be16(f->fs.val.lport);
req->peer_port = cpu_to_be16(f->fs.val.fport);
-   req->local_ip = f->fs.val.lip[0] | f->fs.val.lip[1] << 8 |
-   f->fs.val.lip[2] << 16 | f->fs.val.lip[3] << 24;
-   req->peer_ip = f->fs.val.fip[0] | f->fs.val.fip[1] << 8 |
-   f->fs.val.fip[2] << 16 | f->fs.val.fip[3] << 24;
+   memcpy(&req->local_ip, f->fs.val.lip, 4);
+   memcpy(&req->peer_ip, f->fs.val.fip, 4);
req->opt0 = cpu_to_be64(NAGLE_V(f->fs.newvlan == VLAN_REMOVE ||
f->fs.newvlan == VLAN_REWRITE) |
DELACK_V(f->fs.hitcnts) |


Re: [endianness bug?] cxgb4_next_header .match_val/.match_mask should be net-endian

2018-08-05 Thread Al Viro
On Sun, Aug 05, 2018 at 04:28:11PM +0100, Al Viro wrote:
> On little-endian host those do yield the right values - e.g. 0x1100 is
> {0, 17, 0, 0}, etc.  On big-endian, though, these will end up checking
> in IPv4 case the octet at offset 10 (i.e. upper 8 bits of checksum) and for IPv6
> - the octet at offset 5 (i.e.  the lower 8 bits of payload length).
> 
> Unless I'm misreading that code, it needs the following to do the right
> thing both on l-e and b-e.  Comments?

... and it looks like the same story with ->mask - it's compared to ->offmask,
which is __be16.  For little-endian hosts the values make sense (htons(0x0f00),
with offoff 0 and shift 6, i.e. "take the first two octets, treat them as
net-endian, clear everything except IHL bits and shift down by 6, which'd
yield IHL*4"), for big-endian they don't - you'd get TOS * 4 instead...


[endianness bug?] cxgb4_next_header .match_val/.match_mask should be net-endian

2018-08-05 Thread Al Viro
AFAICS, cxgb4_next_header() expects to find match_val/match_mask in 
struct cxgb4_next_header stored as big-endian:

/* Found a possible candidate.  Find a key that
 * matches the corresponding offset, value, and
 * mask to jump to next header.
 */
for (j = 0; j < cls->knode.sel->nkeys; j++) {
off = cls->knode.sel->keys[j].off;
val = cls->knode.sel->keys[j].val;
mask = cls->knode.sel->keys[j].mask;

if (next[i].match_off == off &&
next[i].match_val == val &&
next[i].match_mask == mask) {
found = true;
break;
}
}

Here ->keys[] is struct tc_u32_key and there mask and val are definitely
__be32.  match_val and match_mask are never changed after initialization
and they are set to:

* .match_off = 8, .match_val = 0x600, .match_mask = 0xFF00
meant to check for IPV4.Protocol == TCP, i.e. octet at offset 9 being 6
* .match_off = 8, .match_val = 0x1100, .match_mask = 0xFF00
meant to check for IPV4.Protocol == UDP, i.e. octet at offset 9 being 0x11
* .match_off = 4, .match_val = 0x6, .match_mask = 0xFF
IPV6.NextHeader == TCP, i.e. octet at offset 6 being 6
* .match_off = 4, .match_val = 0x11, .match_mask = 0xFF
IPV6.NextHeader == UDP, i.e. octet at offset 6 being 0x11

On little-endian host those do yield the right values - e.g. 0x1100 is
{0, 17, 0, 0}, etc.  On big-endian, though, these will end up checking
in IPv4 case the octet at offset 10 (i.e. upper 8 bits of checksum) and for IPv6
- the octet at offset 5 (i.e.  the lower 8 bits of payload length).
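
To see where the mask actually lands, a sketch (match_off = 8, so byte k of the
32bit key lines up with packet octet 8 + k):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
	uint32_t mask = 0xFF00;	/* .match_mask as written in the table */
	uint8_t b[4];

	/* memory layout is what lines up against the __be32 key from tc */
	memcpy(b, &mask, 4);
	/* l-e: {00 ff 00 00} -> octet 8 + 1 == 9, the IPv4 Protocol field;
	 * b-e: {00 00 ff 00} -> octet 8 + 2 == 10, high byte of the checksum */
	printf("%02x %02x %02x %02x\n", b[0], b[1], b[2], b[3]);
	return 0;
}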

Unless I'm misreading that code, it needs the following to do the right
thing both on l-e and b-e.  Comments?

Signed-off-by: Al Viro 
---
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h
index a4b99edcc339..ec226b1cebf4 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h
@@ -259,11 +259,11 @@ struct cxgb4_next_header {
  */
 static const struct cxgb4_next_header cxgb4_ipv4_jumps[] = {
{ .offset = 0, .offoff = 0, .shift = 6, .mask = 0xF,
- .match_off = 8, .match_val = 0x600, .match_mask = 0xFF00,
- .jump = cxgb4_tcp_fields },
+ .match_off = 8, .match_val = htonl(6 << 16),
+ .match_mask = htonl(0xff<<16), .jump = cxgb4_tcp_fields },
{ .offset = 0, .offoff = 0, .shift = 6, .mask = 0xF,
- .match_off = 8, .match_val = 0x1100, .match_mask = 0xFF00,
- .jump = cxgb4_udp_fields },
+ .match_off = 8, .match_val = htonl(17 << 16),
+ .match_mask = htonl(0xff<<16), .jump = cxgb4_udp_fields },
{ .jump = NULL }
 };
 
@@ -272,11 +272,11 @@ static const struct cxgb4_next_header cxgb4_ipv4_jumps[] = {
  */
 static const struct cxgb4_next_header cxgb4_ipv6_jumps[] = {
{ .offset = 0x28, .offoff = 0, .shift = 0, .mask = 0,
- .match_off = 4, .match_val = 0x6, .match_mask = 0xFF,
- .jump = cxgb4_tcp_fields },
+ .match_off = 4, .match_val = htonl(6 << 8),
+ .match_mask = htonl(0xff << 8), .jump = cxgb4_tcp_fields },
{ .offset = 0x28, .offoff = 0, .shift = 0, .mask = 0,
- .match_off = 4, .match_val = 0x11, .match_mask = 0xFF,
- .jump = cxgb4_udp_fields },
+ .match_off = 4, .match_val = htonl(17 << 8),
+ .match_mask = htonl(0xff << 8), .jump = cxgb4_udp_fields },
{ .jump = NULL }
 };
 


[PATCH][mwifiex] fix braino in mwifiex_fw_dump_info_event()

2018-08-04 Thread Al Viro
... and on little-endian host it even worked ;-)

Signed-off-by: Al Viro 
---
diff --git a/drivers/net/wireless/marvell/mwifiex/sta_event.c b/drivers/net/wireless/marvell/mwifiex/sta_event.c
index 03a6492662ca..1cd42a6a25c9 100644
--- a/drivers/net/wireless/marvell/mwifiex/sta_event.c
+++ b/drivers/net/wireless/marvell/mwifiex/sta_event.c
@@ -631,7 +631,7 @@ mwifiex_fw_dump_info_event(struct mwifiex_private *priv,
adapter->event_skb->data, event_skb->len);
adapter->devdump_len += event_skb->len;
 
-   if (le16_to_cpu(fw_dump_hdr->type == FW_DUMP_INFO_ENDED)) {
+   if (le16_to_cpu(fw_dump_hdr->type) == FW_DUMP_INFO_ENDED) {
mwifiex_dbg(adapter, MSG,
"receive end of transmission flag event!\n");
goto upload_dump;


[PATCH] mellanox: fix the dport endianness in call of __inet6_lookup_established()

2018-08-04 Thread Al Viro
__inet6_lookup_established() expect th->dport passed in host-endian,
not net-endian.  The reason is microoptimization in __inet6_lookup(),
but if you use the lower-level helpers, you have to play by their
rules...

Signed-off-by: Al Viro 
---
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
index 92d37459850e..be137d4a9169 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
@@ -328,7 +328,7 @@ static int tls_update_resync_sn(struct net_device *netdev,
 
 		sk = __inet6_lookup_established(dev_net(netdev), &tcp_hashinfo,
 						&ipv6h->saddr, th->source,
-						&ipv6h->daddr, th->dest,
+						&ipv6h->daddr, ntohs(th->dest),
 						netdev->ifindex, 0);
 #endif
}


[stmmac][bug?] endianness of Flexible RX Parser code

2018-08-03 Thread Al Viro
The values passed in struct tc_u32_sel ->mask and ->val are
32bit net-endian.  Your tc_fill_entry() does this:

data = sel->keys[0].val;
mask = sel->keys[0].mask;

...
entry->frag_ptr = frag;
entry->val.match_en = (mask << (rem * 8)) &
GENMASK(31, rem * 8);
entry->val.match_data = (data << (rem * 8)) &
GENMASK(31, rem * 8);
entry->val.frame_offset = real_off;
entry->prio = prio;

frag->val.match_en = (mask >> (rem * 8)) &
GENMASK(rem * 8 - 1, 0);
frag->val.match_data = (data >> (rem * 8)) &
GENMASK(rem * 8 - 1, 0);
frag->val.frame_offset = real_off + 1;
frag->prio = prio;
frag->is_frag = true;

and that looks very odd.  rem here is offset modulo 4.  Suppose offset is
equal to 5, val contains {V0, V1, V2, V3} and mask - {M0, M1, M2, M3}.
Then on little-endian host we get
entry->val.match_en:{0, M0, M1, M2}
entry->val.match_data:  {0, V0, V1, V2}
entry->val.frame_offset = 1;
frag->val.match_en: {M3, 0, 0, 0}
frag->val.match_data:   {V3, 0, 0, 0}
frag->val.frame_offset = 2;
and on big-endian
entry->val.match_en:{M1, M2, M3, 0}
entry->val.match_data:  {V1, V2, V3, 0}
entry->val.frame_offset = 1;
frag->val.match_en: {0, 0, 0, M0}
frag->val.match_data:   {0, 0, 0, V0}
frag->val.frame_offset = 2;

Little-endian variant looks like we mask octets 5, 6, 7 and 8 with
M0..M3 resp. and want V0..V3 in those.  On big-endian, though, we
look at the octets 11, 4, 5 and 6 instead.

I don't know the hardware (and it might be pulling any kind of weird
endianness-dependent stunts), but that really smells like a bug.
It looks like that code is trying to do something like

data = ntohl(sel->keys[0].val);
mask = ntohl(sel->keys[0].mask);
shift = rem * 8;

entry->val.match_en = htonl(mask >> shift);
entry->val.match_data = htonl(data >> shift);
entry->val.frame_offset = real_off;
...
frag->val.match_en = htonl(mask << (32 - shift));
frag->val.match_data = htonl(data << (32 - shift));
frag->val.frame_offset = real_off + 1;
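
And a userland check of that arithmetic on the offset == 5 example (my test,
not the driver's code; V0..V3 arbitrary):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

int main(void)
{
	uint8_t v[4] = {0x11, 0x22, 0x33, 0x44};	/* V0..V3 */
	uint8_t e[4], f[4];
	uint32_t be, data, entry, frag;
	uint32_t shift = (5 % 4) * 8;	/* rem == 0 never splits, so shift > 0 here */

	memcpy(&be, v, 4);
	data = ntohl(be);

	entry = htonl(data >> shift);
	frag = htonl(data << (32 - shift));

	memcpy(e, &entry, 4);
	memcpy(f, &frag, 4);
	/* expect entry {00 11 22 33}, frag {44 00 00 00} on any host */
	printf("entry: %02x %02x %02x %02x\n", e[0], e[1], e[2], e[3]);
	printf("frag:  %02x %02x %02x %02x\n", f[0], f[1], f[2], f[3]);
	return 0;
}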

Comments?


Re: [PATCH net-next] net/tls: Do not call msg_data_left() twice

2018-07-24 Thread Al Viro
On Tue, Jul 24, 2018 at 04:41:18PM +0530, Vakul Garg wrote:
> In function tls_sw_sendmsg(), msg_data_left() needs to be called only
> once. The second invocation of msg_data_left() for assigning variable
> try_to_copy can be removed and merged with the first one.

You do realize that msg_data_left(msg) is simply msg->msg_iter.count?
So I'm not at all sure that your change will be an overall win...


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Al Viro
On Thu, Jun 28, 2018 at 03:55:35PM -0700, Linus Torvalds wrote:
> > You are misreading that mess.  What he's trying to do (other than surviving
> > the awful clusterfuck around cancels) is to handle the decision what to
> > report to userland right in the wakeup callback.  *That* is what really
> > drives the "make the second-pass ->poll() or something similar to it
> > non-blocking" (in addition to the fact that it is such in considerable
> > majority of instances).
> 
> That's just crazy BS.
> 
> Just call poll() again when you copy the data to userland (which by
> definition can block, again).
> 
> Stop the idiotic "let's break poll for stupid AIO reasons, because the
> AIO people are morons".

You underestimate the nastiness of that thing (and for the record, I'm
a lot *less* fond of AIO than you are, what with having had to read that
nest of horrors lately).  It does not "copy the data to userland"; what it
does instead is copying into an array of pages it keeps, right from
IO completion callback.  In read/write case.  This
ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
event = ev_page + pos % AIO_EVENTS_PER_PAGE;

event->obj = (u64)(unsigned long)iocb->ki_user_iocb;
event->data = iocb->ki_user_data;
event->res = res;
event->res2 = res2;

kunmap_atomic(ev_page);
flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
is what does the copying.  And that might be done from IRQ context.
Yes, really.

They do have a slightly saner syscall that does copying from the damn
ring buffer, but its use is optional - userland can (and does) direct
read access to mmapped buffer.

Single-consumer ABIs suck and AIO is one such...

It could do schedule_work() and do blocking stuff from that - does so, in
case if it can't grab ->ctx_lock.  Earlier iteration used to try doing
everything straight from wakeup callback, and *that* was racy as hell;
I'd rather have Christoph explain which races he'd been refering to,
but there had been a whole lot of that.  Solution I suggested in the
last round of that was to offload __aio_poll_complete() via schedule_work()
both for cancel and poll wakeup cases.  Doing the common case right
from poll wakeup callback was argued to avoid noticable overhead in
common situation - that's what "aio: try to complete poll iocbs without
context switch" is about.  I'm more than slightly unhappy about the
lack of performance regression testing in non-AIO case...

At that point I would really like to see replies from Christoph - he's
on CET usually, no idea what his effective timezone is...


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Al Viro
On Thu, Jun 28, 2018 at 03:35:03PM -0700, Linus Torvalds wrote:

> Yes, the AIO poll implementation did it under the spinlock.
> 
> But there's no good *reason* for that.  The "aio_poll()" function
> itself is called in perfectly fine blocking context.

aio_poll() is not a problem.  It's wakeup callback that is one.

> As far as I can tell, Christoph could have just done the first pass
> '->poll()' *without* taking a spinlock, and that adds the table entry
> to the table. Then, *under the spinlock*, you associate the table the
> the kioctx. And then *after* the spinlock, you can call "->poll()"
> again (now with a NULL table pointer), to verify that the state is
> still not triggered. That's the whole point of the two-phgase poll
> thing - the first phase adds the entry to the wait queues, and the
> second phase checks for the race of "did it the event happen in the
> meantime".

You are misreading that mess.  What he's trying to do (other than surviving
the awful clusterfuck around cancels) is to handle the decision what to
report to userland right in the wakeup callback.  *That* is what really
drives the "make the second-pass ->poll() or something similar to it
non-blocking" (in addition to the fact that it is such in considerable
majority of instances).


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Al Viro
On Thu, Jun 28, 2018 at 10:30:27PM +0100, Al Viro wrote:

> I'm not saying that blocking on other things is a bug; some of such *are* bogus,
> but a lot aren't really broken.  What I said is that in a lot of cases we really
> have hard "no blocking other than in callback" (and on subsequent passes there's
> no callback at all).  Which is just about perfect for AIO purposes, so *IF* we
> go for "new method just for AIO, those who don't have it can take a hike", we might
> as well indicate that "can take a hike" in some way (be it opt-in or opt-out) and
> use straight unchanged ->poll(), with alternative callback.

PS: one way of doing that would be to steal a flag from pt->_key and have ->poll()
instances do an equivalent of
if (flags & LOOKUP_RCU)
return -ECHILD;
we have in a lot of ->d_revalidate() instances for "need to block" case.  Only
here they would've returned EPOLLNVAL.

Most of the ->poll() instances wouldn't care at all - they do not block unless
the callback does (and in this case it wouldn't have).  Normal poll(2)/select(2)
are completely unaffected.  And AIO would just have that bit set in its
poll_table_struct.

The rules for drivers change only in one respect - if your ->poll() is going to
need to block, check poll_requested_events(pt) & EPOLL_ATOMIC and return EPOLLNVAL
in such case.
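
A sketch of what that would look like in a driver that grabs a mutex on the way
(EPOLL_ATOMIC being a made-up name for the stolen flag - a proposal, not an
existing kernel API; foo_dev et.al. invented):

static __poll_t foo_poll(struct file *file, poll_table *pt)
{
	struct foo_dev *foo = file->private_data;
	__poll_t mask = 0;

	if (poll_requested_events(pt) & EPOLL_ATOMIC)
		return EPOLLNVAL;	/* we'd block below; aio-poll takes a hike */

	mutex_lock(&foo->lock);
	poll_wait(file, &foo->waitq, pt);
	if (foo->have_data)
		mask |= EPOLLIN | EPOLLRDNORM;
	mutex_unlock(&foo->lock);
	return mask;
}

Normal poll(2)/select(2) never set the bit, so such instances keep working
there unchanged.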


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Al Viro
On Thu, Jun 28, 2018 at 02:11:17PM -0700, Linus Torvalds wrote:
> On Thu, Jun 28, 2018 at 1:28 PM Al Viro  wrote:
> >
> >
> > Sure, but...
> >
> > static __poll_t binder_poll(struct file *filp,
> > struct poll_table_struct *wait)
> > {
> > struct binder_proc *proc = filp->private_data;
> > struct binder_thread *thread = NULL;
> > bool wait_for_proc_work;
> >
> > thread = binder_get_thread(proc);
> > if (!thread)
> > return POLLERR;
> 
> That's actually fine.
> 
> In particular, it's ok to *not* add yourself to the wait-queues if you
> already return the value that you will always return.

Sure (and that's one of the problems I mentioned with ->get_poll_head() model).
But in this case I was refering to GFP_KERNEL allocation down there.

> > And that's hardly unique - we have instances playing with timers,
> > allocations, whatnot.  Even straight mutex_lock(), as in
> 
> So?
> 
> Again, locking is permitted. It's not great, but it's not against the rules.

Me: a *LOT* of ->poll() instances only block in __pollwait() called (indirectly)
on the first pass.
 
You: They are *all* supposed to do it.

Me: 

I'm not saying that blocking on other things is a bug; some of such *are* bogus,
but a lot aren't really broken.  What I said is that in a lot of cases we really
have hard "no blocking other than in callback" (and on subsequent passes there's
no callback at all).  Which is just about perfect for AIO purposes, so *IF* we
go for "new method just for AIO, those who don't have it can take a hike", we 
might
as well indicate that "can take a hike" in some way (be it opt-in or opt-out) 
and
use straight unchanged ->poll(), with alternative callback.

Looks like we were talking past each other for the last couple of rounds...

> So none of the things you point to are excuses for changing interfaces
> or adding any flags.

> Anybody who thinks "select cannot block" or "->poll() musn't block" is
> just confused. It has *never* been about that. It waits asynchronously
> for IO, but it may well wait synchronously for locks or memory or just
> "lazy implementation".

Obviously.  I *do* understand how poll() works, really.

> The fact is, those interface changes were just broken shit. They were
> confused. I don't actually believe that AIO even needed them.
> 
> Christoph, do you have a test program for IOCB_CMD_POLL and what it's
> actually supposed to do?
> 
> Because I think that what it can do is simply to do the ->poll() calls
> outside the iocb locks, and then just attach the poll table to the
> kioctx afterwards.

I'd do a bit more - embed the first poll_table_entry into poll iocb itself,
so that the instances that use only one queue wouldn't need any allocations
at all.
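
Roughly this (a sketch only; field names invented, nothing like it in mainline):

struct poll_iocb {
	struct file			*file;
	__poll_t			events;
	struct wait_queue_head		*head;
	/* embedded first entry - the single-queue case allocates nothing */
	struct wait_queue_entry		wait;
	/* overflow for the rare multi-queue ->poll() instances */
	struct poll_table_entry		*extra;
};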


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Al Viro
On Thu, Jun 28, 2018 at 09:28:37PM +0100, Al Viro wrote:

> Sure.  Unfortunately, adding yourself to the poll table is not the only
> way to block.  And a plenty of instances in e.g. drivers/media (where
> the bulk of ->poll() instances lives) are very free with grabbing mutexes
> as they go.

Speaking of obvious bogosities (a lot more so than a blocking allocation
several calls down into helper):

static __poll_t ca8210_test_int_poll(
struct file *filp,
struct poll_table_struct *ptable
)
{
__poll_t return_flags = 0;
struct ca8210_priv *priv = filp->private_data;

poll_wait(filp, &priv->test.readq, ptable);
if (!kfifo_is_empty(&priv->test.up_fifo))
return_flags |= (EPOLLIN | EPOLLRDNORM);
if (wait_event_interruptible(
priv->test.readq,
!kfifo_is_empty(&priv->test.up_fifo))) {
return EPOLLERR;
}
return return_flags;
}

In a sense, poll_wait() had been an unfortunate choice of name - it's
really "arrange to be woken up by", not "wait for something now" and
while the outright confusion like above is rare, general "blocking
is OK in that area" is not rare at all...
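
For contrast, a minimal sketch of what an instance of that shape presumably
meant to do (names invented) - register the waiter, test the state, return, no
sleeping:

static __poll_t foo_poll(struct file *filp, struct poll_table_struct *ptable)
{
	struct foo_priv *priv = filp->private_data;
	__poll_t mask = 0;

	poll_wait(filp, &priv->readq, ptable);	/* only registers the callback */
	if (!kfifo_is_empty(&priv->up_fifo))
		mask |= EPOLLIN | EPOLLRDNORM;
	/* returning 0 means "call me again when woken up", not "wait here" */
	return mask;
}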


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Al Viro
On Thu, Jun 28, 2018 at 12:31:14PM -0700, Linus Torvalds wrote:

> > * a *LOT* of ->poll() instances only block in __pollwait()
> > called (indirectly) on the first pass.
> 
> They are *all* supposed to do it.

Sure, but...

static __poll_t binder_poll(struct file *filp,
struct poll_table_struct *wait)
{
struct binder_proc *proc = filp->private_data;
struct binder_thread *thread = NULL;
bool wait_for_proc_work;

thread = binder_get_thread(proc);
if (!thread)
return POLLERR;
...
static struct binder_thread *binder_get_thread(struct binder_proc *proc)
{
struct binder_thread *thread;
struct binder_thread *new_thread;

binder_inner_proc_lock(proc);
thread = binder_get_thread_ilocked(proc, NULL);
binder_inner_proc_unlock(proc);
if (!thread) {
new_thread = kzalloc(sizeof(*thread), GFP_KERNEL);

And that's hardly unique - we have instances playing with timers,
allocations, whatnot.  Even straight mutex_lock(), as in
static __poll_t i915_perf_poll(struct file *file, poll_table *wait)
{
struct i915_perf_stream *stream = file->private_data;
struct drm_i915_private *dev_priv = stream->dev_priv;
__poll_t ret;

mutex_lock(&dev_priv->perf.lock);
ret = i915_perf_poll_locked(dev_priv, stream, file, wait);
mutex_unlock(&dev_priv->perf.lock);

return ret;
}

or
static __poll_t cec_poll(struct file *filp,
 struct poll_table_struct *poll)
{
struct cec_fh *fh = filp->private_data;
struct cec_adapter *adap = fh->adap;
__poll_t res = 0;

if (!cec_is_registered(adap))
return EPOLLERR | EPOLLHUP;
mutex_lock(&adap->lock);
if (adap->is_configured &&
adap->transmit_queue_sz < CEC_MAX_MSG_TX_QUEUE_SZ)
res |= EPOLLOUT | EPOLLWRNORM;
if (fh->queued_msgs)
res |= EPOLLIN | EPOLLRDNORM;
if (fh->total_queued_events)
res |= EPOLLPRI;
poll_wait(filp, &fh->wait, poll);
mutex_unlock(&adap->lock);
return res;
}

etc.

> 
> The whole idea with "->poll()" is that the model of operation is:
> 
>  -  add_wait_queue() and return state on the first pass
> 
>  - on subsequent passes (or if somebody else already returned a state
> that means we already know we're not going to wait), the poll table is
> NULL, so you *CANNOT* add_wait_queue again, so you just return state.
> 
> Additionally, once _anybody_ has returned a "I already have the
> event", we also clear the poll table queue, so subsequent accesses
> will purely be for returning the poll state.
> 
> So I don't understand why you're asking for annotation. The whole "you
> add yourself to the poll table" is *fundamentally* only done on the
> first pass. You should never do it for later passes.

Sure.  Unfortunately, adding yourself to the poll table is not the only
way to block.  And a plenty of instances in e.g. drivers/media (where
the bulk of ->poll() instances lives) are very free with grabbing mutexes
as they go.


Re: [PATCH 6/6] fs: replace f_ops->get_poll_head with a static ->f_poll_head pointer

2018-06-28 Thread Al Viro
On Thu, Jun 28, 2018 at 09:40:21AM -0700, Linus Torvalds wrote:
> On Thu, Jun 28, 2018 at 7:21 AM Christoph Hellwig  wrote:
> >
> > Note that for this removes the possibility of actually returning an
> > error before waiting in poll.  We could still do this with an ERR_PTR
> > in f_poll_head with a little bit of WRITE_ONCE/READ_ONCE magic, but
> > I'd like to defer that until actually required.
> 
> I'm still going to just revert the whole poll mess for now.
>
> It's still completely broken. This helps things, but it doesn't fix
> the fundamental issue: the new interface is strictly worse than the
> old interface ever was.
> 
> So Christoph, if you don't like the tradoitional ->poll() method, and
> you want something else for aio polling, I seriously will suggest that
> you introduce a new f_op for *that*. Don't mess with the existing
> ->poll() function, don't make select() and poll() use a slower and
> less capable function just because aio wants something else.
> 
> In other words, you need to see AIO as the less important case, not as
> the driver for this.
> 
> I also want to understand what the AIO race was, and what the problem
> with the poll() thing was. You claimed it was racy. I don't see it,
> and it was never ever explained in the whole series. I should never
> have pulled it in the first place if only for that reason, but I tend
> to trust Al when it comes to the VFS layer, so I did. My bad.

... and I should have pushed back harder, rather than getting sidetracked
into fixing the fs/aio.c-side races in this series ;-/

As for what can be salvaged out of the whole mess,
* probably the most serious lesson is that INDIRECT CALLS ARE
COSTLY NOW and shouldn't be used lightly.  That had been slow to sink
in and we'd better all realize how much the things have changed.
That, BTW, has implications going a lot further than poll-related stuff -
e.g. the whole "we'll deal with revoke-like issues in procfs/sysfs/debugfs
by wrapping method calls" needs to be reexamined.  And in poll-related
area, note that we have a lot of indirection levels for socket poll.
* having an optional pointer to wait_queue_head in struct file
is probably a good idea; a lot of ->poll() instances do have the same
form.  Even if sockets do not (and I'm not all that happy about the
solution in the latest series), the instances that do are common and
important enough.
* a *LOT* of ->poll() instances only block in __pollwait()
called (indirectly) on the first pass.  If we annotate those in some
way (flag set in ->open(), presence of a new method, whatever) and
limit aio-poll to just those, we could deal with the aio side without
disrupting select/poll at all; just use (in place of __pollwait)
a different callback that wouldn't try to allocate poll_table_entry
and worked with the stuff embedded into aio-poll iocb.

How much do you intend to revert?  Untangling just the ->poll()-related
parts from the rest of changes in fs/aio.c will be unpleasant; we might
end up reverting the whole tail of the series and redoing the things
that are not poll-related on top of the revert... ;-/


Re: [Patch net] socket: close race condition between sock_close() and sockfs_setattr()

2018-06-07 Thread Al Viro
On Thu, Jun 07, 2018 at 03:15:15PM -0700, Cong Wang wrote:

> > You do realize that the same ->setattr() can be called by way of
> > chown() on /proc/self/fd/<n>, right?  What would you do there -
> > bump refcount on that struct file when traversing that symlink and
> > hold it past the end of pathname resolution, until... what exactly?
> 
> I was thinking about this:
> 
> error = user_path_at(dfd, ...); // hold dfd when needed
> 
> if (!error) {
> error = chown_common(&path, mode);
> path_put(&path);  // release dfd if held
> 
> With this, we can guarantee ->release() is only possibly called
> after chown_common() which is after ->setattr() too.

No, we can't.  You are assuming that there *is* dfd and that it points
to the opened socket we are going to operate upon.  That is not guaranteed.
At all.  If e.g. 42 is a file descriptor of an opened socket, plain chown(2)
on /proc/self/fd/42 will trigger that ->setattr().  No dfd in sight.
We do run across an opened file at some point, all right - when we traverse
the symlink in procfs.  You can't bump ->f_count there.  Even leaving aside
the "where would I stash the reference to that file?" and "how long would I
hold it?", you can't bump ->f_count on other process' files.  That would
bugger the expectations of close(2) callers.


Re: [Patch net] socket: close race condition between sock_close() and sockfs_setattr()

2018-06-07 Thread Al Viro
On Thu, Jun 07, 2018 at 02:45:58PM -0700, Cong Wang wrote:
> On Thu, Jun 7, 2018 at 2:26 PM, Al Viro  wrote:
> > On Thu, Jun 07, 2018 at 01:39:49PM -0700, Cong Wang wrote:
> >> fchownat() doesn't even hold refcnt of fd until it figures out
> >> fd is really needed (otherwise is ignored) and releases it after
> >> it resolves the path. This means sock_close() could race with
> >> sockfs_setattr(), which leads to a NULL pointer dereference
> >> since typically we set sock->sk to NULL in ->release().
> >>
> >> As pointed out by Al, this is unique to sockfs. So we can fix this
> >> in socket layer by acquiring inode_lock in sock_close() and
> >> checking against NULL in sockfs_setattr().
> >
> > That looks like a massive overkill - it's way heavier than it should be.
> 
> I don't see any other quick way to fix this. My initial thought is
> to keep that refcnt until path_put(), apparently you don't like it
> either.

You do realize that the same ->setattr() can be called by way of
chown() on /proc/self/fd/<n>, right?  What would you do there -
bump refcount on that struct file when traversing that symlink and
hold it past the end of pathname resolution, until... what exactly?


Re: [Patch net] socket: close race condition between sock_close() and sockfs_setattr()

2018-06-07 Thread Al Viro
On Thu, Jun 07, 2018 at 01:39:49PM -0700, Cong Wang wrote:
> fchownat() doesn't even hold refcnt of fd until it figures out
> fd is really needed (otherwise is ignored) and releases it after
> it resolves the path. This means sock_close() could race with
> sockfs_setattr(), which leads to a NULL pointer dereference
> since typically we set sock->sk to NULL in ->release().
> 
> As pointed out by Al, this is unique to sockfs. So we can fix this
> in socket layer by acquiring inode_lock in sock_close() and
> checking against NULL in sockfs_setattr().

That looks like a massive overkill - it's way heavier than it should be.
And it's very likely to trigger shitloads of deadlock warnings, some
possibly even true.


Re: aio poll and a new in-kernel poll API V13

2018-05-27 Thread Al Viro
OK, it's in -next now; there are several cleanups I'd put
into vfs.git#work.aio:
  aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way
  aio_read_events_ring(): make a bit more readable
  aio: shift copyin of iocb into io_submit_one()
  aio: fold do_io_submit() into callers
Those are *not* on -next yet and if anybody has objections against
any of those, please yell.  Individual patches in followups...


Re: aio poll and a new in-kernel poll API V13

2018-05-26 Thread Al Viro
On Sat, May 26, 2018 at 01:11:11AM +0100, Al Viro wrote:
> On Wed, May 23, 2018 at 09:19:49PM +0200, Christoph Hellwig wrote:
> > Hi all,
> > 
> > this series adds support for the IOCB_CMD_POLL operation to poll for the
> > readyness of file descriptors using the aio subsystem.  The API is based
> > on patches that existed in RHAS2.1 and RHEL3, which means it already is
> > supported by libaio.  To implement the poll support efficiently new
> > methods to poll are introduced in struct file_operations:  get_poll_head
> > and poll_mask.  The first one returns a wait_queue_head to wait on
> > (lifetime is bound by the file), and the second does a non-blocking
> > check for the POLL* events.  This allows aio poll to work without
> > any additional context switches, unlike epoll.
> > 
> > This series sits on top of the aio-fsync series that also includes
> > support for io_pgetevents.
> 
> OK, I can live with that, except for one problem - the first patch shouldn't
> be sitting on top of arseloads of next window fodder.
> 
> Please, rebase the rest of the series on top of merge of vfs.git#fixes
> (4faa99965e02) with your aio-fsync.4 and tell me what to pull.

UGH

You've based it on vfs.git#hch.aio (== your aio-fsync.4) + baf10564fbb6
(== vfs.git#fixes^), *and* started with cherry-pick of vfs.git#fixes
on top of that, followed by your series.

That makes no sense whatsoever.  Please, take your aio-fsync.4, merge
vfs.git#fixes (== 4faa99965e02, "fix io_destroy()/aio_complete() race",
same change as your 4e79230e5254) into it and rebase the rest of your
branch on top of that (from "uapi: turn __poll_t sparse checkin
on by default" to "random: convert to ->poll_mask").  BTW, you probably
want s/checkin/checks/ in the first one of those...


Re: aio poll and a new in-kernel poll API V13

2018-05-25 Thread Al Viro
On Wed, May 23, 2018 at 09:19:49PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series adds support for the IOCB_CMD_POLL operation to poll for the
> readyness of file descriptors using the aio subsystem.  The API is based
> on patches that existed in RHAS2.1 and RHEL3, which means it already is
> supported by libaio.  To implement the poll support efficiently new
> methods to poll are introduced in struct file_operations:  get_poll_head
> and poll_mask.  The first one returns a wait_queue_head to wait on
> (lifetime is bound by the file), and the second does a non-blocking
> check for the POLL* events.  This allows aio poll to work without
> any additional context switches, unlike epoll.
> 
> This series sits on top of the aio-fsync series that also includes
> support for io_pgetevents.

OK, I can live with that, except for one problem - the first patch shouldn't
be sitting on top of arseloads of next window fodder.

Please, rebase the rest of the series on top of merge of vfs.git#fixes
(4faa99965e02) with your aio-fsync.4 and tell me what to pull.


YAaioRace (was Re: [PATCH 08/31] aio: implement IOCB_CMD_POLL)

2018-05-22 Thread Al Viro
On Wed, May 23, 2018 at 01:49:04AM +0100, Al Viro wrote:

> > Looks like we want to call ->ki_cancel() *BEFORE* removing from the list,
> > as well as doing fput() after aio_complete().  The same ordering, BTW, goes
> > for aio_read() et.al.
> > 
> > Look:
> > CPU1:   io_cancel() grabs ->ctx_lock, finds iocb and removes it from the list.
> > CPU2:   aio_rw_complete() on that iocb.  Since the sucker is not in the list
> > anymore, we do NOT spin on ->ctx_lock and proceed to free iocb
> > CPU1:   pass freed iocb to ->ki_cancel().  BOOM.
> 
> BTW, it seems that the mainline is vulnerable to this one.  I might be
> missing something, but...

It is, but with a different attack vector - io_cancel(2) won't do it (it
does not remove from the list at all), but io_destroy(2) bloody well will.

IMO, we need this in mainline; unless somebody has a problem with it, to
#fixes it goes:

fix io_destroy()/aio_complete() race

If io_destroy() gets to cancelling everything that can be cancelled and
gets to kiocb_cancel() calling the function driver has left in ->ki_cancel,
it becomes vulnerable to a race with IO completion.  At that point req
is already taken off the list and aio_complete() does *NOT* spin until
we (in free_ioctx_users()) release ->ctx_lock.  As the result, it proceeds
to kiocb_free(), freeing req just as it gets passed to ->ki_cancel().

Fix is simple - remove from the list after the call of kiocb_cancel().  All
instances of ->ki_cancel() already have to cope with being called with
iocb still on list - that's what happens in io_cancel(2).

Cc: stable@kernel.org
Fixes: 0460fef2a921 "aio: use cancellation list lazily"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/aio.c b/fs/aio.c
index 8061d9787e54..49f53516eef0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -634,9 +634,8 @@ static void free_ioctx_users(struct percpu_ref *ref)
 	while (!list_empty(&ctx->active_reqs)) {
 		req = list_first_entry(&ctx->active_reqs,
 				       struct aio_kiocb, ki_list);
-
-		list_del_init(&req->ki_list);
 		kiocb_cancel(req);
+		list_del_init(&req->ki_list);
 	}
 
 	spin_unlock_irq(&ctx->ctx_lock);


Re: [PATCH 08/31] aio: implement IOCB_CMD_POLL

2018-05-22 Thread Al Viro
On Wed, May 23, 2018 at 01:45:30AM +0100, Al Viro wrote:

> Oh, bugger...
> 
> wakeup
>   removed from queue
>   schedule __aio_poll_complete()
> 
> cancel
>   grab ctx->lock
>   remove from list
> work
>   aio_complete()
>   check if it's in the list
>   it isn't, move on to free the sucker
> cancel
>   call ->ki_cancel()
>   BOOM
> 
> Looks like we want to call ->ki_cancel() *BEFORE* removing from the list,
> as well as doing fput() after aio_complete().  The same ordering, BTW, goes
> for aio_read() et.al.
> 
> Look:
> CPU1: io_cancel() grabs ->ctx_lock, finds iocb and removes it from the list.
> CPU2: aio_rw_complete() on that iocb.  Since the sucker is not in the list
> anymore, we do NOT spin on ->ctx_lock and proceed to free iocb
> CPU1: pass freed iocb to ->ki_cancel().  BOOM.

BTW, it seems that the mainline is vulnerable to this one.  I might be
missing something, but...


Re: [PATCH 08/31] aio: implement IOCB_CMD_POLL

2018-05-22 Thread Al Viro
On Tue, May 22, 2018 at 11:05:24PM +0100, Al Viro wrote:
> > +{
> > +   struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);
> > +
> > +   fput(req->file);
> > +   aio_complete(iocb, mangle_poll(mask), 0);
> > +}
> 
> Careful.
> 
> > +static int aio_poll_cancel(struct kiocb *iocb)
> > +{
> > +   struct aio_kiocb *aiocb = container_of(iocb, struct aio_kiocb, rw);
> > +   struct poll_iocb *req = &aiocb->poll;
> > +   struct wait_queue_head *head = req->head;
> > +   bool found = false;
> > +
> > +   spin_lock(&head->lock);
> > +   found = __aio_poll_remove(req);
> > +   spin_unlock(&head->lock);
> 
> What's to guarantee that req->head has not been freed by that point?
> Look: wakeup finds ->ctx_lock held, so it leaves the sucker on the
> list, removes it from queue and schedules the call of __aio_poll_complete().
> Which gets executed just as we hit aio_poll_cancel(), starting with fput().
> 
> You really want to do aio_complete() before fput().  That way you know that
> req->wait is alive and well at least until iocb gets removed from the list.

Oh, bugger...

wakeup
removed from queue
schedule __aio_poll_complete()

cancel
grab ctx->lock
remove from list
work
aio_complete()
check if it's in the list
it isn't, move on to free the sucker
cancel
call ->ki_cancel()
BOOM

Looks like we want to call ->ki_cancel() *BEFORE* removing from the list,
as well as doing fput() after aio_complete().  The same ordering, BTW, goes
for aio_read() et.al.

Look:
CPU1:   io_cancel() grabs ->ctx_lock, finds iocb and removes it from the list.
CPU2:   aio_rw_complete() on that iocb.  Since the sucker is not in the list
anymore, we do NOT spin on ->ctx_lock and proceed to free iocb
CPU1:   pass freed iocb to ->ki_cancel().  BOOM.

and if we have fput() done first (in aio_rw_complete()) we are vulnerable to
CPU1:   io_cancel() grabs ->ctx_lock, finds iocb and removes it from the list.
CPU2:   aio_rw_complete() on that iocb. fput() done, opening us to rmmod.
CPU1:   call ->ki_cancel(), which points to freed memory now.  BOOM.
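To spell out the safe shape of the completion side (a sketch of the
ordering only, not the code from the series; aio_complete() is assumed
to spin on ->ctx_lock while the iocb is still on ->active_reqs):

static void aio_rw_complete(struct kiocb *kiocb, long res, long res2)
{
	struct aio_kiocb *iocb = container_of(kiocb, struct aio_kiocb, rw);
	struct file *file = kiocb->ki_filp;	/* grab before iocb is freed */

	/* waits out any ->ki_cancel() in flight, then frees the iocb */
	aio_complete(iocb, res, res2);
	/* drop the file only now, so a concurrent ->ki_cancel() can't
	 * end up calling into an rmmod'ed module */
	fput(file);
}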


Re: aio poll and a new in-kernel poll API V12

2018-05-22 Thread Al Viro
On Tue, May 22, 2018 at 01:30:37PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series adds support for the IOCB_CMD_POLL operation to poll for the
> readiness of file descriptors using the aio subsystem.  The API is based
> on patches that existed in RHAS2.1 and RHEL3, which means it already is
> supported by libaio.  To implement the poll support efficiently new
> methods to poll are introduced in struct file_operations:  get_poll_head
> and poll_mask.  The first one returns a wait_queue_head to wait on
> (lifetime is bound by the file), and the second does a non-blocking
> check for the POLL* events.  This allows aio poll to work without
> any additional context switches, unlike epoll.

I can live with that, modulo bug in #8 (see reply to that one).  There's
some nitpicking, but that can be dealt with in followups.
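For those not following the series, the driver side of the two new
methods would be shaped roughly like this (a sketch; foo_dev and its
fields are made up):

static struct wait_queue_head *foo_get_poll_head(struct file *file,
						 __poll_t events)
{
	struct foo_dev *foo = file->private_data;

	return &foo->waitq;		/* lifetime bound by the file */
}

static __poll_t foo_poll_mask(struct file *file, __poll_t events)
{
	struct foo_dev *foo = file->private_data;
	__poll_t mask = 0;

	/* purely non-blocking check - no waiting, no context switch */
	if (foo->has_data)
		mask |= EPOLLIN | EPOLLRDNORM;
	if (foo->has_space)
		mask |= EPOLLOUT | EPOLLWRNORM;
	return mask & events;
}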


Re: [PATCH 08/31] aio: implement IOCB_CMD_POLL

2018-05-22 Thread Al Viro
On Tue, May 22, 2018 at 01:30:45PM +0200, Christoph Hellwig wrote:

> +static inline void __aio_poll_complete(struct poll_iocb *req, __poll_t mask)
> +{
> + struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);
> +
> + fput(req->file);
> + aio_complete(iocb, mangle_poll(mask), 0);
> +}

Careful.

> +static int aio_poll_cancel(struct kiocb *iocb)
> +{
> + struct aio_kiocb *aiocb = container_of(iocb, struct aio_kiocb, rw);
> +	struct poll_iocb *req = &aiocb->poll;
> + struct wait_queue_head *head = req->head;
> + bool found = false;
> +
> +	spin_lock(&head->lock);
> +	found = __aio_poll_remove(req);
> +	spin_unlock(&head->lock);

What's to guarantee that req->head has not been freed by that point?
Look: wakeup finds ->ctx_lock held, so it leaves the sucker on the
list, removes it from queue and schedules the call of __aio_poll_complete().
Which gets executed just as we hit aio_poll_cancel(), starting with fput().

You really want to do aio_complete() before fput().  That way you know that
req->wait is alive and well at least until iocb gets removed from the list.

> + req->events = demangle_poll(iocb->aio_buf) | POLLERR | POLLHUP;

EPOLLERR | EPOLLHUP, please.  The values are equal to POLLERR and POLLHUP on
all architectures, but let's avoid misannotations.

> +	spin_lock_irq(&ctx->ctx_lock);
> +	list_add_tail(&aiocb->ki_list, &ctx->active_reqs);
> +
> +	spin_lock(&req->head->lock);
> +	mask = req->file->f_op->poll_mask(req->file, req->events);
> +	if (!mask)
> +		__add_wait_queue(req->head, &req->wait);

ITYM
	if (!mask) {
		__add_wait_queue(req->head, &req->wait);
		list_add_tail(&aiocb->ki_list, &ctx->active_reqs);
	}
What's the point of exposing it to aio_poll_cancel() when it has
never been on waitqueue?


Re: [PATCH 10/32] aio: implement IOCB_CMD_POLL

2018-05-20 Thread Al Viro
On Sun, May 20, 2018 at 06:32:25AM +0100, Al Viro wrote:
> > +	spin_lock_irqsave(&ctx->ctx_lock, flags);
> > +	list_add_tail(&aiocb->ki_list, &ctx->delayed_cancel_reqs);
> > +	spin_unlock(&ctx->ctx_lock);
> 
> ... and io_cancel(2) comes, finds it and inhume^Wcompletes it, leaving us to...
> 
> > +	spin_lock(&req->head->lock);
> 
> ... get buggered on attempt to dereference a pointer fetched from freed and
> reused object.

FWIW, how painful would it be to pull the following trick:
* insert into wait queue under ->ctx_lock
* have wakeup do schedule_work() with aio_complete() done from that
* have ->ki_cancel() grab queue lock, remove from queue and use
the same schedule_work()

That way you'd get ->ki_cancel() with the same semantics as originally for
everything - "ask politely to finish ASAP", and called in the same locking
environment for everyone - under ->ctx_lock, that is.  queue lock nests
inside ->ctx_lock; no magical flags, etc.

The cost is schedule_work() for each async poll-related completion as you
have for fsync.  I don't know whether that's too costly or not; it certainly
simplifies the things, but whether it's OK performance-wise...
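Concretely, the shape would be roughly this (a sketch; poll_iocb is
assumed to grow a work_struct and a result mask):

static void aio_poll_complete_work(struct work_struct *work)
{
	struct poll_iocb *req = container_of(work, struct poll_iocb, work);

	aio_complete(container_of(req, struct aio_kiocb, poll),
		     mangle_poll(req->result), 0);
}

static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode,
			 int sync, void *key)
{
	struct poll_iocb *req = container_of(wait, struct poll_iocb, wait);

	/* queue lock is held by whoever is doing the wakeup */
	__remove_wait_queue(req->head, &req->wait);
	req->result = req->file->f_op->poll_mask(req->file, req->events);
	schedule_work(&req->work);
	return 1;
}

static int aio_poll_cancel(struct kiocb *kiocb)
{
	struct poll_iocb *req =
		&container_of(kiocb, struct aio_kiocb, rw)->poll;

	/* called under ->ctx_lock; queue lock nests inside it */
	spin_lock(&req->head->lock);
	if (!list_empty(&req->wait.entry)) {
		__remove_wait_queue(req->head, &req->wait);
		req->result = 0;
		schedule_work(&req->work);	/* same completion path */
	}
	spin_unlock(&req->head->lock);
	return 0;
}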

Comments?


Re: [PATCH 10/32] aio: implement IOCB_CMD_POLL

2018-05-19 Thread Al Viro
On Tue, May 15, 2018 at 09:48:11PM +0200, Christoph Hellwig wrote:
> +static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb)
> +{
> + struct kioctx *ctx = aiocb->ki_ctx;
> +	struct poll_iocb *req = &aiocb->poll;
> + unsigned long flags;
> + __poll_t mask;
> +
> + /* reject any unknown events outside the normal event mask. */
> + if ((u16)iocb->aio_buf != iocb->aio_buf)
> + return -EINVAL;
> + /* reject fields that are not defined for poll */
> + if (iocb->aio_offset || iocb->aio_nbytes || iocb->aio_rw_flags)
> + return -EINVAL;
> +
> + req->events = demangle_poll(iocb->aio_buf) | POLLERR | POLLHUP;
> + req->file = fget(iocb->aio_fildes);
> + if (unlikely(!req->file))
> + return -EBADF;
> + if (!file_has_poll_mask(req->file))
> + goto out_fail;
> +
> + req->head = req->file->f_op->get_poll_head(req->file, req->events);
> + if (!req->head)
> + goto out_fail;
> + if (IS_ERR(req->head)) {
> + mask = EPOLLERR;
> + goto done;
> + }
> +
> +	init_waitqueue_func_entry(&req->wait, aio_poll_wake);
> + aiocb->ki_cancel = aio_poll_cancel;
> +
> +	spin_lock_irqsave(&ctx->ctx_lock, flags);
> +	list_add_tail(&aiocb->ki_list, &ctx->delayed_cancel_reqs);
> +	spin_unlock(&ctx->ctx_lock);

... and io_cancel(2) comes, finds it and inhume^Wcompletes it, leaving us to...

> +	spin_lock(&req->head->lock);

... get buggered on attempt to dereference a pointer fetched from freed and
reused object.


Re: [PATCH 08/32] aio: replace kiocb_set_cancel_fn with a cancel_kiocb file operation

2018-05-19 Thread Al Viro
On Tue, May 15, 2018 at 09:48:09PM +0200, Christoph Hellwig wrote:

>   case -EIOCBQUEUED:
> + if (req->ki_filp->f_op->cancel_kiocb) {
> + struct aio_kiocb *iocb =
> + container_of(req, struct aio_kiocb, rw);
> + struct kioctx *ctx = iocb->ki_ctx;
> + unsigned long flags;
> +
> +		spin_lock_irqsave(&ctx->ctx_lock, flags);
> +		list_add_tail(&iocb->ki_list, &ctx->active_reqs);

Use after free - that list insertion used to be done by drivers and doing
so before any ->ki_complete() calls might've happened used to be their
responsibility.  Now you've taken that to the point after ->read_iter()
(or ->write_iter()) return, so there's no way in hell to guarantee it's
not been completed (and freed) by that point.

Incidentally, none of the callers gives a damn about the difference between
0 and -EIOCBQUEUED now, so aio_rw_ret() might as well have been made void...


Re: simplify procfs code for seq_file instances V3

2018-05-16 Thread Al Viro
On Wed, May 16, 2018 at 11:43:04AM +0200, Christoph Hellwig wrote:
> We currently have hundreds of proc files that implement plain, read-only
> seq_file based interfaces.  This series consolidates them using new
> procfs helpers that take the seq_operations or simple show callback
> directly.
> 
> A git tree is available at:
> 
> git://git.infradead.org/users/hch/misc.git proc_create.3

Pulled, but the last bit is a bleeding atrocity in need of followup
cleanup.


Re: [PATCH] coredump: rename umh_pipe_setup() to coredump_pipe_setup()

2018-05-10 Thread Al Viro
On Thu, May 10, 2018 at 11:32:47PM +, Luis R. Rodriguez wrote:

> I think net-next makes sense if Al Viro is OK with that. This way it could go
> in regardless of the state of your series, but it also lines up with your 
> work.

Fine by me...


Re: simplify procfs code for seq_file instances V2

2018-05-06 Thread Al Viro
On Sun, May 06, 2018 at 08:19:49PM +0300, Alexey Dobriyan wrote:
> +++ b/fs/proc/internal.h
> @@ -48,8 +48,8 @@ struct proc_dir_entry {
>   const struct seq_operations *seq_ops;
>   int (*single_show)(struct seq_file *, void *);
>   };
> - unsigned int state_size;
>   void *data;
> + unsigned int state_size;
>   unsigned int low_ino;
>   nlink_t nlink;
>   kuid_t uid;

Makes sense

> @@ -62,9 +62,9 @@ struct proc_dir_entry {
>   umode_t mode;
>   u8 namelen;
>  #ifdef CONFIG_64BIT
> -#define SIZEOF_PDE_INLINE_NAME   (192-139)
> +#define SIZEOF_PDE_INLINE_NAME   (192-155)
>  #else
> -#define SIZEOF_PDE_INLINE_NAME   (128-87)
> +#define SIZEOF_PDE_INLINE_NAME   (128-95)

>  #endif
>   char inline_name[SIZEOF_PDE_INLINE_NAME];
>  } __randomize_layout;

*UGH*

Both to the original state and that kind of "adjustments".
Incidentally, with __bugger_layout in there these expressions
are simply wrong.

If nothing else, I would suggest turning the last one into
char inline_name[];
in hope that layout won't get... randomized that much and
used

#ifdef CONFIG_64BIT
#define PDE_SIZE 192
#else
#define PDE_SIZE 128
#endif

union __proc_dir_entry {
char pad[PDE_SIZE];
struct proc_dir_entry real;
};

#define SIZEOF_PDE_INLINE_NAME (PDE_SIZE - offsetof(struct proc_dir_entry, inline_name))

for constants, adjusted sizeof and sizeof_field when creating
proc_dir_entry_cache and turned proc_root into

union __proc_dir_entry __proc_root = { .real = {
.low_ino= PROC_ROOT_INO,
.namelen= 5,
.mode   = S_IFDIR | S_IRUGO | S_IXUGO,
.nlink  = 2,
.refcnt = REFCOUNT_INIT(1),
	.proc_iops	= &proc_root_inode_operations,
	.proc_fops	= &proc_root_operations,
.parent = &__proc_root.real,
.subdir = RB_ROOT,
.name   = __proc_root.real.inline_name,
.inline_name= "/proc",
}};

#define proc_root __proc_root.real
(or actually used __proc_root.real in all of a 6 places where it remains).

> diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
> index baf1994289ce..7d94fa005b0d 100644
> --- a/fs/proc/proc_net.c
> +++ b/fs/proc/proc_net.c
> @@ -40,7 +40,7 @@ static struct net *get_proc_net(const struct inode *inode)
>  
>  static int seq_open_net(struct inode *inode, struct file *file)
>  {
> - size_t state_size = PDE(inode)->state_size;
> + unsigned int state_size = PDE(inode)->state_size;
>   struct seq_net_private *p;
>   struct net *net;


You and your "size_t is evil" crusade...


Re: [bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow)

2018-04-19 Thread Al Viro
On Thu, Apr 19, 2018 at 04:34:48PM +0100, Al Viro wrote:

> IOW, we only get there if our vfsmount was an MNT_INTERNAL one.
> So we have mnt->mnt_umount of some MNT_INTERNAL mount found in
> ->mnt_pins of some other mount.  Which, AFAICS, means that
> it used to be mounted on that other mount.  How the hell can
> that happen?
> 
> It looks like you somehow get a long chain of MNT_INTERNAL mounts
> stacked on top of each other, which ought to be prevented by
> mnt_flags &= ~MNT_INTERNAL_FLAGS;
> in do_add_mount().  Nuts...

Argh...  Nuts is right - clone_mnt() preserves the sodding
MNT_INTERNAL, with obvious results.

netns is related to the problem, by exposing MNT_INTERNAL mounts
(in /proc/*/ns/*) for mount --bind to copy and attach to the
tree.  AFAICS, the minimal reproducer is

touch /tmp/a
unshare -m sh -c 'for i in `seq 1`; do mount --bind /proc/1/ns/net /tmp/a; done'

(and it can be anything in /proc/*/ns/*, really)

I think the fix should be along the lines of the following:

Don't leak MNT_INTERNAL away from internal mounts

We want it only for the stuff created by SB_KERNMOUNT mounts, *not* for
their copies.

Cc: sta...@kernel.org
Signed-off-by: Al Viro <v...@zeniv.linux.org.uk>
---
diff --git a/fs/namespace.c b/fs/namespace.c
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1089,7 +1089,8 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
goto out_free;
}
 
-   mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED);
+   mnt->mnt.mnt_flags = old->mnt.mnt_flags;
+   mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL);
/* Don't allow unprivileged users to change mount flags */
if (flag & CL_UNPRIVILEGED) {
mnt->mnt.mnt_flags |= MNT_LOCK_ATIME;


Re: [bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow)

2018-04-19 Thread Al Viro
On Thu, Apr 19, 2018 at 03:50:25PM +0300, Kirill Tkhai wrote:
> Hi, Al,
> 
> commit 87b95ce0964c016ede92763be9c164e49f1019e9 is the first after which the 
> below test crashes the kernel:
> 
>     Author: Al Viro <v...@zeniv.linux.org.uk>
> Date:   Sat Jan 10 19:01:08 2015 -0500
> 
> switch the IO-triggering parts of umount to fs_pin
> 
> Signed-off-by: Al Viro <v...@zeniv.linux.org.uk>
> 
> $modprobe dummy
> 
> $while true
>  do
>  mkdir /var/run/netns
>  touch /var/run/netns/init_net
>  mount --bind /proc/1/ns/net /var/run/netns/init_net
> 
>  ip netns add foo
>  ip netns exec foo ip link add dummy0 type dummy
>  ip netns delete foo
> done

I can reproduce that, all right, and with a stack chain that
looks like this:
[77132.414912]  pin_kill+0x81/0x150
[77132.424362]  ? finish_wait+0x80/0x80
[77132.433917]  mnt_pin_kill+0x1e/0x30
[77132.443829]  cleanup_mnt+0x6b/0x70
[77132.453477]  pin_kill+0x81/0x150
[77132.463064]  ? finish_wait+0x80/0x80
[77132.472553]  group_pin_kill+0x1a/0x30
[77132.481973]  namespace_unlock+0x6f/0x80
[77132.491801]  put_mnt_ns+0x1d/0x30
[77132.501258]  free_nsproxy+0x17/0x90
[77132.510604]  do_exit+0x2dc/0xb40
[77132.520146]  ? handle_mm_fault+0xaa/0x1e0
[77132.529725]  do_group_exit+0x3a/0xa0
[77132.539506]  SyS_exit_group+0x10/0x10
with the top 4 entries repeated a lot.  Those cleanup_mnt()
could be called from __cleanup_mnt(), delayed_mntput() or
mntput_no_expire().

__cleanup_mnt() is only fed to task_work_add(); no way in hell
would you get the call stack similar to that; it would be
called by task_work_run() from exit_task_work() from
do_exit().  Not in the evidence.

delayed_mntput() is only fed to schedule_delayed_work();
again, not a chance of having the call chain like that.

The one in mntput_no_expire() is a tail-call, with
mntput_no_expire() called from umount(2) and mntput()
(tail-calls both of them).  The former is never called
from exit(2), so that call chain reads

pin_kill -> mntput or something tail-calling mntput -> mntput_no_expire ->
cleanup_mnt -> mnt_pin_kill -> pin_kill

Now, the thing called by pin_kill must be something passed to
init_fs_pin(), i.e. acct_pin_kill() or drop_mountpoint().
acct_pin_kill() ends with
pin_remove(pin);
acct_put(acct);
}
with
static void acct_put(struct bsd_acct_struct *p)
{
	if (atomic_long_dec_and_test(&p->count))
		kfree_rcu(p, rcu);
}

IOW, no tail-call of mntput() in there.  OTOH,
static void drop_mountpoint(struct fs_pin *p)
{
	struct mount *m = container_of(p, struct mount, mnt_umount);
	dput(m->mnt_ex_mountpoint);
	pin_remove(p);
	mntput(&m->mnt);
}
*does* have the tail-call, so this call chain must be
pin_kill -> drop_mountpoint -> mntput -> mntput_no_expire ->
cleanup_mnt -> mnt_pin_kill -> pin_kill

So far, so good, but if you look into mntput_no_expire() you see
	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
		struct task_struct *task = current;
		if (likely(!(task->flags & PF_KTHREAD))) {
			init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
			if (!task_work_add(task, &mnt->mnt_rcu, true))
				return;
		}
		if (llist_add(&mnt->mnt_llist, &delayed_mntput_list))
			schedule_delayed_work(&delayed_mntput_work, 1);
		return;
	}
	cleanup_mnt(mnt);

IOW, we only get there if our vfsmount was an MNT_INTERNAL one.
So we have mnt->mnt_umount of some MNT_INTERNAL mount found in
->mnt_pins of some other mount.  Which, AFAICS, means that
it used to be mounted on that other mount.  How the hell can
that happen?

It looks like you somehow get a long chain of MNT_INTERNAL mounts
stacked on top of each other, which ought to be prevented by
mnt_flags &= ~MNT_INTERNAL_FLAGS;
in do_add_mount().  Nuts...


Re: [PATCH 05/32] fs: introduce new ->get_poll_head and ->poll_mask methods

2018-04-05 Thread Al Viro
On Fri, Mar 30, 2018 at 05:07:42PM +0200, Christoph Hellwig wrote:

> +  get_poll_head: Returns the struct wait_queue_head that poll, select,
> +  epoll or aio poll should wait on in case this instance only has single
> +  waitqueue.  Can return NULL to indicate polling is not supported,
> +  or a POLL* value using the POLL_TO_PTR helper in case a grave error
> +  occured and ->poll_mask shall not be called.

> + if (IS_ERR(head))
> + return PTR_TO_POLL(head);

> + * ->get_poll_head can return a __poll_t in the PTR_ERR, use these macros
> + * to return the value and recover it.  It takes care of the negation as
> + * well as off the annotations.
> + */
> +#define POLL_TO_PTR(mask)	(ERR_PTR(-(__force int)(mask)))

Uh-oh...
static inline bool __must_check IS_ERR(__force const void *ptr)
{
return IS_ERR_VALUE((unsigned long)ptr);
}
#define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)
#define MAX_ERRNO   4095

IOW, your trick relies upon arguments of PTR_TO_POLL being no greater
than 4095.  Now, consider
#define EPOLLRDHUP  (__force __poll_t)0x2000
which is to say, 8192...

So anything that tries e.g. POLL_TO_PTR(EPOLLRDHUP | EPOLLERR) will be in
for a quite unpleasant surprise...
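Spelled out, with the values above:

	POLL_TO_PTR(EPOLLRDHUP | EPOLLERR)
		== ERR_PTR(-(0x2000 | 0x0008))
		== (void *)-8200

	IS_ERR((void *)-8200)	/* false - only -4095..-1 register */

so the caller sails straight past the IS_ERR() check and dereferences
the thing as a wait_queue_head.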


Re: [PATCH 08/32] aio: replace kiocb_set_cancel_fn with a cancel_kiocb file operation

2018-04-05 Thread Al Viro
On Fri, Mar 30, 2018 at 05:07:45PM +0200, Christoph Hellwig wrote:
> The current kiocb_set_cancel_fn implementation assumes the kiocb is
> embedded into an aio_kiocb, which is fundamentally unsafe as it might
> have been submitted by non-aio callers.  Instead add a cancel_kiocb
> file operation that replaced the ki_cancel function pointer set by
> kiocb_set_cancel_fn, and only adds iocbs to the active list when
> the read/write_iter methods return -EIOCBQUEUED and the file has
> a cancel_kiocb method.

> @@ -1440,6 +1423,16 @@ static inline ssize_t aio_rw_ret(struct kiocb *req, ssize_t ret)
>  {
>   switch (ret) {
>   case -EIOCBQUEUED:
> + if (req->ki_filp->f_op->cancel_kiocb) {

... and by that point req might've been already freed by IO completion on
another CPU.
> + struct aio_kiocb *iocb =
> + container_of(req, struct aio_kiocb, rw);
> + struct kioctx *ctx = iocb->ki_ctx;
> + unsigned long flags;
> +
> +		spin_lock_irqsave(&ctx->ctx_lock, flags);
> +		list_add_tail(&iocb->ki_list, &ctx->active_reqs);



Re: [PATCH net 1/2] net: bcmgenet: Fix sparse warnings in bcmgenet_put_tx_csum()

2018-04-03 Thread Al Viro
On Tue, Apr 03, 2018 at 12:33:05PM -0400, David Miller wrote:

> Yes Al, however the pattern chosen here is probably cheaper on little endian:
> 
>   __beXX val = x;
>   switch (val) {
>   case htons(ETH_P_FOO):
>...
>   }
> 
> This way only the compiler byte swaps the constants at compile time,
> no code is actually generated to do it.

That's not obvious, actually - depends upon how sparse the switch ends
up being.  You can easily lose more than a single byteswap insn on a
worse cascade of comparisons.


Re: [PATCH net 1/2] net: bcmgenet: Fix sparse warnings in bcmgenet_put_tx_csum()

2018-04-03 Thread Al Viro
On Mon, Apr 02, 2018 at 03:58:55PM -0700, Florian Fainelli wrote:
> skb->protocol is a __be16 which we would be calling htons() against,
> while this is not wrong per-se as it correctly results in swapping the
> value on LE hosts, this still upsets sparse. Adopt a similar pattern to
> what other drivers do and just assign ip_ver to skb->protocol, and then
> use htons() against the different constants such that the compiler can
> resolve the values at build time.

Fuck that and fuck other drivers sharing that "pattern".  It's not hard
to remember:
host-endian to net-endian short is htons
host-endian to net-endian long is htonl
net-endian to host-endian short is ntohs
net-endian to host-endian long is ntohl

That's *it*.  No convoluted changes needed, just use the right primitive.
You are turning trivial one-liners ("conversions between host- and net-endian
are the same in both directions, so the current code works even though we
are using the wrong primitive, confusing the readers.  Use the right one")
which are absolutely self-contained and safe into much more massive changes
that are harder to follow and which are *not* self-contained at all.

Don't do it.
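
For the record, both correct forms, either of which keeps sparse happy
(a fragment, reusing the driver's ip_ver for illustration):

	/* compare in net-endian; the constant is swapped at build time */
	if (skb->protocol == htons(ETH_P_IP))
		/* ... */;

	/* or convert the variable once, net-endian to host-endian */
	u16 ip_ver = ntohs(skb->protocol);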


Re: [PATCH net 2/2] net: systemport: Fix sparse warnings in bcm_sysport_insert_tsb()

2018-04-03 Thread Al Viro
On Mon, Apr 02, 2018 at 03:58:56PM -0700, Florian Fainelli wrote:
> skb->protocol is a __be16 which we would be calling htons() against,
> while this is not wrong per-se as it correctly results in swapping the
> value on LE hosts, this still upsets sparse. Adopt a similar pattern to
> what other drivers do and just assign ip_ver to skb->protocol, and then
> use htons() against the different constants such that the compiler can
> resolve the values at build time.

This is completely bogus.  What sparse is complaining about is htons used
to convert from big-endian to host-endian.  Which is what ntohs is for.
IOW, instead of all that crap just

-   ip_ver = htons(skb->protocol);
+   ip_ver = ntohs(skb->protocol);

and be done with that.  Same in other drivers with the same problem.



Re: WARNING: refcount bug in should_fail

2018-04-02 Thread Al Viro
On Mon, Apr 02, 2018 at 10:59:34PM +0100, Al Viro wrote:

> FWIW, I'm going through the ->kill_sb() instances, fixing that sort
> of bugs (most of them preexisting, but I should've checked instead
> of assuming that everything's fine).  Will push out later tonight.

OK, see vfs.git#for-linus.  Caught: 4 old bugs (allocation failure
in fill_super oopses ->kill_sb() in hypfs, jffs2 and orangefs resp.
and double-dput in late failure exit in rpc_fill_super())
and 5 regressions from register_shrinker() failure recovery.  


Re: WARNING: refcount bug in should_fail

2018-04-02 Thread Al Viro
On Mon, Apr 02, 2018 at 10:52:12PM +0100, Al Viro wrote:
> On Mon, Apr 02, 2018 at 03:30:56PM -0500, Eric W. Biederman wrote:
> > Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> writes:
> 
> > > I don't think this is a dup of existing bug.
> > > We need to fix either 9ee332d99e4d5a97 or d91ee87d8d85a080.
> > 
> > Even if expanding mount_ns to more filesystems was magically fixed,
> > proc would still have this issue with the pid namespace rather than
> > the net namespace.
> > 
> > This is a mess.  I will take a look and see if I can see a fix.
> 
> It's trivially fixable, and there's no need to modify mount_ns() at
> all.
> 
> All we need is for rpc_kill_sb() to recognize whether we are already
> through the point in rpc_fill_super() where the refcount is bumped.
> That's it.
> 
> The most trivial way to do that is to move
> 	net = get_net(sb->s_fs_info);
> past
> 	if (!root)
> 		return -ENOMEM;
> in the latter and have
> out:
> 	if (!sb->s_root)
> 		net = NULL;
> 	kill_litter_super(sb);
> 	if (net)
> 		put_net(net);
> in the end of the former.  And similar changes in other affected
> instances.

FWIW, I'm going through the ->kill_sb() instances, fixing that sort
of bugs (most of them preexisting, but I should've checked instead
of assuming that everything's fine).  Will push out later tonight.


Re: WARNING: refcount bug in should_fail

2018-04-02 Thread Al Viro
On Mon, Apr 02, 2018 at 03:30:56PM -0500, Eric W. Biederman wrote:
> Tetsuo Handa  writes:

> > I don't think this is a dup of existing bug.
> > We need to fix either 9ee332d99e4d5a97 or d91ee87d8d85a080.
> 
> Even if expanding mount_ns to more filesystems was magically fixed,
> proc would still have this issue with the pid namespace rather than
> the net namespace.
> 
> This is a mess.  I will take a look and see if I can see a fix.

It's trivially fixable, and there's no need to modify mount_ns() at
all.

All we need is for rpc_kill_sb() to recognize whether we are already
through the point in rpc_fill_super() where the refcount is bumped.
That's it.

The most trivial way to do that is to move
	net = get_net(sb->s_fs_info);
past
	if (!root)
		return -ENOMEM;
in the latter and have
out:
	if (!sb->s_root)
		net = NULL;
	kill_litter_super(sb);
	if (net)
		put_net(net);
in the end of the former.  And similar changes in other affected
instances.
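
I.e., something along these lines (a sketch of the shape, not the
committed fix):

static int rpc_fill_super(struct super_block *sb, void *data, int silent)
{
	struct net *net;
	/* ... */
	if (!root)
		return -ENOMEM;		/* fail before taking the ref */
	net = get_net(sb->s_fs_info);	/* moved past the failure exits */
	/* ... */
}

static void rpc_kill_sb(struct super_block *sb)
{
	struct net *net = sb->s_fs_info;
	/* ... */
out:
	if (!sb->s_root)		/* fill_super never took the ref */
		net = NULL;
	kill_litter_super(sb);
	if (net)
		put_net(net);
}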


Re: [PATCH 07/30] aio: add delayed cancel support

2018-03-29 Thread Al Viro
On Thu, Mar 29, 2018 at 10:33:05PM +0200, Christoph Hellwig wrote:
> The upcoming aio poll support would like to be able to complete the
> iocb inline from the cancellation context, but that would cause a
> double lock of ctx_lock with the current locking scheme.  Move the
> cancelation outside the context lock to avoid this reversal, which
> suits the existing usb gadgets users just fine as well (in fact
> both unconditionally disable irqs and thus seem broken without
> this change).
> 
> To make this safe aio_complete needs to check if this call should
> complete the iocb.  If it didn't the callers must not release any
> other resources.

Uh-oh...  What happens to existing users of kiocb_set_cancel_fn() now?
AFAICS, those guys will *not* get aio_kiocb freed at all in case of
io_cancel(2).  Look: we mark them with AIO_IOCB_CANCELLED and
call whatever ->ki_cancel() the driver has set.  Later the damn
thing calls ->ki_complete() (i.e. aio_complete_rw()), which calls
aio_complete(iocb, res, res2, 0) and gets false.  Nothing's freed,
struct file is leaked.

Frankly, the more I look at that, the less I like what you've done
with ->ki_cancel() overloading.  In regular case it's just accelerating
the call of ->ki_complete(), which will do freeing.  Here you have
->ki_cancel() free the damn thing, with the resulting need to play
silly buggers with locking, freeing logics in aio_complete(), etc.


Re: [PATCH 07/30] aio: add delayed cancel support

2018-03-29 Thread Al Viro
On Thu, Mar 29, 2018 at 10:53:05AM +0200, Christoph Hellwig wrote:
> On Wed, Mar 28, 2018 at 05:35:26PM +0100, Al Viro wrote:
> > >   ret = vfs_fsync(req->file, req->datasync);
> > > - fput(req->file);
> > > - aio_complete(container_of(req, struct aio_kiocb, fsync), ret, 0);
> > > + if (aio_complete(iocb, ret, 0, 0))
> > > + fput(file);
> > 
> > IDGI.
> > 1) can aio_complete() ever return false here?
> 
> It won't.  But sometimes checking the return value and sometimes not
> seems like a bad pattern.
> 
> > 2) do we ever have aio_kiocb that would not have an associated
> > struct file * that needs to be dropped on successful aio_complete()?  
> > AFAICS,
> > rw, fsync and poll variants all have one, and I'm not sure what kind of
> > async IO *could* be done without an opened file.
> 
> All have a file assoiated at least right now.  As mentioned last time
> finding a struct to pass that file would be rather annoying, so we'd either
> have to pass it explicitly, or do something nasty like duplicating the
> pointer in the aio_kiocb in addition to struct kiocb.  Which might not
> be that bad after all, as it would only bloat the aio_kiocb and not
> struct kiocb used on stack all over.

OK.  Let's leave that alone for now.  Re deferred cancels - AFAICS, we *must*
remove the sucker from ctx->active_reqs before dropping ->ctx_lock.

As it is, you are creating a io_cancel()/io_cancel() race leading to double
fput().  It's not that hard to fix; I can do that myself while applying your
series (as described in previous posting - kiocb_cancel_locked() returning
NULL or ERR_PTR() in non-deferred case and pointer to aio_kiocb removed from
->active_reqs in deferred one) or you could fix it in some other way and
update your branch.

As it is, the race is user-exploitable and not that hard to trigger - AIO_POLL,
then have two threads try and cancel it at the same time.


Re: [PATCH 07/30] aio: add delayed cancel support

2018-03-28 Thread Al Viro
On Wed, Mar 28, 2018 at 05:35:26PM +0100, Al Viro wrote:
> On Wed, Mar 28, 2018 at 09:29:03AM +0200, Christoph Hellwig wrote:
> >  static void aio_fsync_work(struct work_struct *work)
> >  {
> > struct fsync_iocb *req = container_of(work, struct fsync_iocb, work);
> > +   struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, fsync);
> > +   struct file *file = req->file;
> > int ret;
> >  
> > ret = vfs_fsync(req->file, req->datasync);
> > -   fput(req->file);
> > -   aio_complete(container_of(req, struct aio_kiocb, fsync), ret, 0);
> > +   if (aio_complete(iocb, ret, 0, 0))
> > +   fput(file);
> 
> IDGI.
>   1) can aio_complete() ever return false here?
>   2) do we ever have aio_kiocb that would not have an associated
> struct file * that needs to be dropped on successful aio_complete()?  AFAICS,
> rw, fsync and poll variants all have one, and I'm not sure what kind of
> async IO *could* be done without an opened file.

OK, hell with that.  I've tried to play with turning kiocb into a struct with
anon union in it, with poll and fsync parts folded into that sucker and ki_filp
lifted into common part.  Possible, but it's hairy as hell and can be done
afterwards.
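
(For reference, the shape attempted was roughly the following - layout
hypothetical, nothing like it merged at that point:)

struct aio_kiocb {
	struct file		*ki_filp;	/* common part, lifted up */
	union {					/* anon union of variants */
		struct kiocb		rw;
		struct fsync_iocb	fsync;
		struct poll_iocb	poll;
	};
	struct kioctx		*ki_ctx;
	struct list_head	ki_list;
	/* ... */
};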

However, doing that digging has turned up something really nasty.  Look:
in io_cancel(2) you have 
	spin_lock_irq(&ctx->ctx_lock);
	kiocb = lookup_kiocb(ctx, iocb, key);
	if (kiocb) {
		if (kiocb->flags & AIO_IOCB_DELAYED_CANCEL) {
			kiocb->flags |= AIO_IOCB_CANCELLED;
		} else {
			ret = kiocb_cancel(kiocb);
			kiocb = NULL;
		}
	}
	spin_unlock_irq(&ctx->ctx_lock);

Now, suppose two threads call io_cancel() on the same aio_poll in progress.
Both hit that code and *both* find the same kiocb.  Sure, the first one
will shortly do
if (kiocb)
ret = kiocb_cancel(kiocb);
which will remove it from the list.  Too late, though - you've already dropped
->ctx_lock, letting the second one find it.  Result: two aio_poll_cancel() in
parallel, with resulting double-free and double-fput().

You really need to remove it from the ->active_reqs before dropping the lock.
free_ioctx_users() does it correctly, io_cancel(2) fucks it up.

I'd add something like
struct aio_kiocb *kiocb_cancel_locked(struct aio_kiocb *kiocb)
{
	if (!kiocb)
		return ERR_PTR(-EINVAL);
	if (kiocb->flags & AIO_IOCB_DELAYED_CANCEL) {
		list_del(&kiocb->ki_list);
		kiocb->flags |= AIO_IOCB_CANCELLED;
		return kiocb;
	} else {
		return ERR_PTR(kiocb_cancel(kiocb));
	}
}

with 
	spin_lock_irq(&ctx->ctx_lock);
	while (!list_empty(&ctx->active_reqs)) {
		req = list_first_entry(&ctx->active_reqs,
				       struct aio_kiocb, ki_list);
		req = kiocb_cancel_locked(req);
		if (!IS_ERR_OR_NULL(req))
			list_add_tail(&req->ki_list, &list);
	}
	spin_unlock_irq(&ctx->ctx_lock);

in free_ioctx_users() and
	spin_lock_irq(&ctx->ctx_lock);
	kiocb = kiocb_cancel_locked(lookup_kiocb(ctx, iocb, key));
	spin_unlock_irq(&ctx->ctx_lock);
	ret = IS_ERR_OR_NULL(kiocb) ? PTR_ERR(kiocb) : kiocb_cancel(kiocb);
in io_cancel(2)...



Re: [PATCH 07/30] aio: add delayed cancel support

2018-03-28 Thread Al Viro
On Wed, Mar 28, 2018 at 09:29:03AM +0200, Christoph Hellwig wrote:
>  static void aio_fsync_work(struct work_struct *work)
>  {
>   struct fsync_iocb *req = container_of(work, struct fsync_iocb, work);
> + struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, fsync);
> + struct file *file = req->file;
>   int ret;
>  
>   ret = vfs_fsync(req->file, req->datasync);
> - fput(req->file);
> - aio_complete(container_of(req, struct aio_kiocb, fsync), ret, 0);
> + if (aio_complete(iocb, ret, 0, 0))
> + fput(file);

IDGI.
1) can aio_complete() ever return false here?
2) do we ever have aio_kiocb that would not have an associated
struct file * that needs to be dropped on successful aio_complete()?  AFAICS,
rw, fsync and poll variants all have one, and I'm not sure what kind of
async IO *could* be done without an opened file.


Re: [PATCH 06/28] aio: implement IOCB_CMD_POLL

2018-03-22 Thread Al Viro
On Thu, Mar 22, 2018 at 06:24:10PM +0100, Christoph Hellwig wrote:

> -static void aio_complete(struct aio_kiocb *iocb, long res, long res2)
> +static bool aio_complete(struct aio_kiocb *iocb, long res, long res2,
> + unsigned complete_flags)

Looks like all callers are following that with "if returned true,
fput(something)".  Does it really make any sense to keep that struct
file * in different fields?

Wait a sec...  What ordering do we want for
* call(s) of ->ki_complete
* call (if any) of ->ki_cancel
* dropping reference to struct file
and what are the expected call chains for all of those?


Re: [PATCH 06/28] aio: implement IOCB_CMD_POLL

2018-03-22 Thread Al Viro
On Wed, Mar 21, 2018 at 08:40:10AM +0100, Christoph Hellwig wrote:
> Simple one-shot poll through the io_submit() interface.  To poll for
> a file descriptor the application should submit an iocb of type
> IOCB_CMD_POLL.  It will poll the fd for the events specified in
> the first 32 bits of the aio_buf field of the iocb.
> 
> Unlike poll or epoll without EPOLLONESHOT this interface always works
> in one shot mode, that is once the iocb is completed, it will have to be
> resubmitted.

AFAICS, your wakeup can race with io_cancel(), leading to double fput().
You are checking the "somebody had committed itself to cancelling that
thing" bit outside of ->ctx_lock on the wakeup side, and I don't see
anything to prevent both getting to __aio_poll_complete() on the same
iocb, with obvious results.

I might be missing something subtle in there, but then it would be nice to
have it covered in commit message...


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-20 Thread Al Viro
On Tue, Mar 20, 2018 at 04:26:52PM -0700, Linus Torvalds wrote:
> On Tue, Mar 20, 2018 at 4:23 PM, Linus Torvalds
>  wrote:
> >
> > Hmm. So thanks to the diseased mind of Martin Uecker, there's a better
> > test for "__is_constant()":
> >
> >   /* Glory to Martin Uecker  */
> >   #define __is_constant(a) \
> > (sizeof(int) == sizeof(*(1 ? ((void*)((a) * 0l)) : (int*)1)))
> >
> > that is actually *specified* by the C standard to work, and doesn't
> > even depend on any gcc extensions.
> 
> Well, it does depend on 'sizeof(*(void *)X)' being 1 and the compiler
> not complaining about it, and that sizeof(int) is not 1.
> 
> But since we depend on those things in the kernel anyway, that's fine.

It also depends upon "ICE for null pointer constant purposes" having the
same extensions as "ICE for enum purposes", etc., which is not obvious.

Back in 2007 or so we had a long thread regarding null pointer constants
in sparse; I probably still have notes from back then, but that'll take
some serious digging to find ;-/

What's more, gcc definitely has odd extensions.  Example I remember from
back then:
extern unsigned n;
struct {
int x : 1 + n - n;
} y;

is accepted.  Used to be quietly accepted with -Wpedantic -std=c99, even,
but that got fixed - with -Wpedantic it does, at least, warn.  What is
and what is not recognized is fairly random - 1 + n - n + 1 + n - n
is recognized as "constant", 1 + n + n + 1 - n - n is not.  Of course,
neither is an ICE.


Re: sctp: use proc_remove_subtree()

2018-03-16 Thread Al Viro
On Fri, Mar 16, 2018 at 12:44:15PM -0400, David Miller wrote:
> From: Al Viro <v...@zeniv.linux.org.uk>
> Date: Wed, 14 Mar 2018 20:51:19 +
> 
> > On Wed, Mar 14, 2018 at 08:46:21PM +, Al Viro wrote:
> >> use proc_remove_subtree() for subtree removal, both on setup failure
> >> halfway through and on teardown.  No need to make simple things
> >> complex...
> > 
> > ... and sctp_dbg_objcnt_exit() can be removed as well - 
> > remove_proc_subtree()
> > will get that sucker automatically.  Either a trivial addition, or equally
> > trivial followup...
> 
> Please add it to this patch and resubmit.

Done...


[PATCH v2] sctp: use proc_remove_subtree()

2018-03-16 Thread Al Viro
use proc_remove_subtree() for subtree removal, both on setup failure
halfway through and on teardown.  No need to make simple things
complex...

Signed-off-by: Al Viro <v...@zeniv.linux.org.uk>
---
 include/net/sctp/sctp.h | 11 +--
 net/sctp/objcnt.c       |  8 
 net/sctp/proc.c         | 90 +-
 net/sctp/protocol.c     | 59 ++-
 4 files changed, 28 insertions(+), 140 deletions(-)

diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
index f7ae6b0a21d0..72c5b8fc3232 100644
--- a/include/net/sctp/sctp.h
+++ b/include/net/sctp/sctp.h
@@ -180,14 +180,7 @@ struct sctp_transport *sctp_epaddr_lookup_transport(
 /*
  * sctp/proc.c
  */
-int sctp_snmp_proc_init(struct net *net);
-void sctp_snmp_proc_exit(struct net *net);
-int sctp_eps_proc_init(struct net *net);
-void sctp_eps_proc_exit(struct net *net);
-int sctp_assocs_proc_init(struct net *net);
-void sctp_assocs_proc_exit(struct net *net);
-int sctp_remaddr_proc_init(struct net *net);
-void sctp_remaddr_proc_exit(struct net *net);
+int __net_init sctp_proc_init(struct net *net);
 
 /*
  * sctp/offload.c
@@ -318,7 +311,6 @@ atomic_t sctp_dbg_objcnt_## name = ATOMIC_INIT(0)
 {.label= #name, .counter= &sctp_dbg_objcnt_## name}
 
 void sctp_dbg_objcnt_init(struct net *);
-void sctp_dbg_objcnt_exit(struct net *);
 
 #else
 
@@ -326,7 +318,6 @@ void sctp_dbg_objcnt_exit(struct net *);
 #define SCTP_DBG_OBJCNT_DEC(name)
 
 static inline void sctp_dbg_objcnt_init(struct net *net) { return; }
-static inline void sctp_dbg_objcnt_exit(struct net *net) { return; }
 
 #endif /* CONFIG_SCTP_DBG_OBJCOUNT */
 
diff --git a/net/sctp/objcnt.c b/net/sctp/objcnt.c
index aeea6da81441..fd2684ad94c8 100644
--- a/net/sctp/objcnt.c
+++ b/net/sctp/objcnt.c
@@ -130,11 +130,3 @@ void sctp_dbg_objcnt_init(struct net *net)
if (!ent)
pr_warn("sctp_dbg_objcnt: Unable to create /proc entry.\n");
 }
-
-/* Cleanup the objcount entry in the proc filesystem.  */
-void sctp_dbg_objcnt_exit(struct net *net)
-{
-   remove_proc_entry("sctp_dbg_objcnt", net->sctp.proc_net_sctp);
-}
-
-
diff --git a/net/sctp/proc.c b/net/sctp/proc.c
index 537545ebcb0e..17d0155d9de3 100644
--- a/net/sctp/proc.c
+++ b/net/sctp/proc.c
@@ -101,25 +101,6 @@ static const struct file_operations sctp_snmp_seq_fops = {
.release = single_release_net,
 };
 
-/* Set up the proc fs entry for 'snmp' object. */
-int __net_init sctp_snmp_proc_init(struct net *net)
-{
-   struct proc_dir_entry *p;
-
-   p = proc_create("snmp", S_IRUGO, net->sctp.proc_net_sctp,
-			&sctp_snmp_seq_fops);
-   if (!p)
-   return -ENOMEM;
-
-   return 0;
-}
-
-/* Cleanup the proc fs entry for 'snmp' object. */
-void sctp_snmp_proc_exit(struct net *net)
-{
-   remove_proc_entry("snmp", net->sctp.proc_net_sctp);
-}
-
 /* Dump local addresses of an association/endpoint. */
 static void sctp_seq_dump_local_addrs(struct seq_file *seq, struct sctp_ep_common *epb)
 {
@@ -259,25 +240,6 @@ static const struct file_operations sctp_eps_seq_fops = {
.release = seq_release_net,
 };
 
-/* Set up the proc fs entry for 'eps' object. */
-int __net_init sctp_eps_proc_init(struct net *net)
-{
-   struct proc_dir_entry *p;
-
-   p = proc_create("eps", S_IRUGO, net->sctp.proc_net_sctp,
-			&sctp_eps_seq_fops);
-   if (!p)
-   return -ENOMEM;
-
-   return 0;
-}
-
-/* Cleanup the proc fs entry for 'eps' object. */
-void sctp_eps_proc_exit(struct net *net)
-{
-   remove_proc_entry("eps", net->sctp.proc_net_sctp);
-}
-
 struct sctp_ht_iter {
struct seq_net_private p;
struct rhashtable_iter hti;
@@ -390,25 +352,6 @@ static const struct file_operations sctp_assocs_seq_fops = {
.release = seq_release_net,
 };
 
-/* Set up the proc fs entry for 'assocs' object. */
-int __net_init sctp_assocs_proc_init(struct net *net)
-{
-   struct proc_dir_entry *p;
-
-   p = proc_create("assocs", S_IRUGO, net->sctp.proc_net_sctp,
-			&sctp_assocs_seq_fops);
-   if (!p)
-   return -ENOMEM;
-
-   return 0;
-}
-
-/* Cleanup the proc fs entry for 'assocs' object. */
-void sctp_assocs_proc_exit(struct net *net)
-{
-   remove_proc_entry("assocs", net->sctp.proc_net_sctp);
-}
-
 static int sctp_remaddr_seq_show(struct seq_file *seq, void *v)
 {
struct sctp_association *assoc;
@@ -488,12 +431,6 @@ static const struct seq_operations sctp_remaddr_ops = {
.show  = sctp_remaddr_seq_show,
 };
 
-/* Cleanup the proc fs entry for 'remaddr' object. */
-void sctp_remaddr_proc_exit(struct net *net)
-{
-   remove_proc_entry("remaddr", net->sctp.proc_net_sctp);
-}
-
 static int sctp_remaddr_seq_open(struct

Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Al Viro
On Fri, Mar 16, 2018 at 01:15:27PM -0700, Linus Torvalds wrote:
> On Fri, Mar 16, 2018 at 1:12 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
> >
> > That's C99, straight from N1256.pdf (C99-TC3)...
> 
> I checked C90, since the error is
> 
>ISO C90 forbids variable length array
> 
> and I didn't see anything there.

Well, yes - VLAs are new in C99; C90 never had those...


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Al Viro
On Fri, Mar 16, 2018 at 12:27:23PM -0700, Linus Torvalds wrote:

> But it sure isn't "variable" either as far as the standard is
> concerned, because the standard doesn't even have that concept (it
> uses "variable" for argument numbers and for variables).

Huh?  6.7.5.2p4:

If the size is not present, the array type is an incomplete type.
If the size is * instead of being an expression, the array type is
a variable length array type of unspecified size, which can only be
used in declarations with function prototype scope [footnote]; such
arrays are nonetheless complete types.  If the size is an integer
constant expression and the element type has a known constant size,
the array type is not a variable length array type; otherwise, the
array type is a variable length array type.

footnote: Thus, * can be used only in function declarations that are
not definitions (see 6.7.5.3).

That's C99, straight from N1256.pdf (C99-TC3)...
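
Which, applied to the case at hand:

	int a[100];	/* size is an ICE - not a VLA */
	int b[(1,100)];	/* comma operator => not an ICE, so b is
			   formally a VLA, even though the value
			   is perfectly computable */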


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Al Viro
On Fri, Mar 16, 2018 at 05:55:02PM +, Al Viro wrote:
> On Fri, Mar 16, 2018 at 10:29:16AM -0700, Linus Torvalds wrote:
> >t.c: In function ‘test’:
> >t.c:6:6: error: argument to variable-length array is too large
> > [-Werror=vla-larger-than=]
> >  int array[(1,100)];
> > 
> > Gcc people are crazy.
> 
> That's not them, that's C standard regarding ICE.  1,100 is *not* a
> constant expression as far as the standard is concerned, and that
> type is actually a VLA with the size that can be optimized into
> a compiler-calculated value.
> 
> Would you argue that in

s/argue/agree/, sorry

> void foo(char c)
> {
>   int a[(c<<1) + 10 - c + 2 - c];
> 
> a is not a VLA?

FWIW, 6.6 starts with
 constant-expression:
conditional-expression
for syntax, with 6.6p3 being "Constant expression shall not contain
assignment, increment, decrement, function call or comma operators,
except when they are contained in a subexpression that is not evaluated",
with "The operand of sizeof operator is usually not evaluated (6.5.3.4)"
as a footnote.

6.6p10 allows implementation to accept other forms of constant expressions,
but arguing that such-and-such construct surely must be recognized as one,
when there are perfectly portable ways to achieve the same...

Realistically, code like that can come only from macros, and one can wrap
the damn thing into 0 * sizeof(..., 0) + just fine there.  Which will
satisfy the conditions for sizeof argument not being evaluated...


Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Al Viro
On Fri, Mar 16, 2018 at 10:29:16AM -0700, Linus Torvalds wrote:
>t.c: In function ‘test’:
>t.c:6:6: error: argument to variable-length array is too large
> [-Werror=vla-larger-than=]
>  int array[(1,100)];
> 
> Gcc people are crazy.

That's not them, that's C standard regarding ICE.  1,100 is *not* a
constant expression as far as the standard is concerned, and that
type is actually a VLA with the size that can be optimized into
a compiler-calculated value.

Would you argue that in
void foo(char c)
{
int a[(c<<1) + 10 - c + 2 - c];

a is not a VLA?  Sure, compiler probably would be able to reduce
that expression to 12, but demanding that to be recognized means
that compiler must do a bunch of optimizations in the middle of
typechecking.

expr, constant_expression is not a constant_expression.  And in
this particular case the standard is not insane - the only reason
for using that is typechecking and _that_ can be achieved without
violating 6.6p6:
sizeof(expr,0) * 0 + ICE
*is* an integer constant expression, and it gives you exact same
typechecking.  So if somebody wants to play odd games, they can
do that just fine, without complicating the logics for compilers...
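
E.g. a macro that wants to typecheck expr while keeping the array size
an ICE can do (CHECKED_SIZE is a made-up name, expr stands for whatever
is being typechecked):

#define CHECKED_SIZE(expr, n)	(sizeof((expr), 0) * 0 + (n))

int array[CHECKED_SIZE(some_expr, 100)];	/* still an ICE: the comma
						   operator occurs only inside
						   an unevaluated sizeof */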


Re: sctp: use proc_remove_subtree()

2018-03-14 Thread Al Viro
On Wed, Mar 14, 2018 at 08:46:21PM +, Al Viro wrote:
> use proc_remove_subtree() for subtree removal, both on setup failure
> halfway through and on teardown.  No need to make simple things
> complex...

... and sctp_dbg_objcnt_exit() can be removed as well - remove_proc_subtree()
will get that sucker automatically.  Either a trivial addition, or equally
trivial followup...


sctp: use proc_remove_subtree()

2018-03-14 Thread Al Viro
use proc_remove_subtree() for subtree removal, both on setup failure
halfway through and on teardown.  No need to make simple things
complex...

Signed-off-by: Al Viro <v...@zeniv.linux.org.uk>
---
 include/net/sctp/sctp.h |  9 +
 net/sctp/proc.c         | 90 +-
 net/sctp/protocol.c     | 57 ++---
 3 files changed, 28 insertions(+), 128 deletions(-)

diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
index f7ae6b0a21d0..3f90134008e9 100644
--- a/include/net/sctp/sctp.h
+++ b/include/net/sctp/sctp.h
@@ -180,14 +180,7 @@ struct sctp_transport *sctp_epaddr_lookup_transport(
 /*
  * sctp/proc.c
  */
-int sctp_snmp_proc_init(struct net *net);
-void sctp_snmp_proc_exit(struct net *net);
-int sctp_eps_proc_init(struct net *net);
-void sctp_eps_proc_exit(struct net *net);
-int sctp_assocs_proc_init(struct net *net);
-void sctp_assocs_proc_exit(struct net *net);
-int sctp_remaddr_proc_init(struct net *net);
-void sctp_remaddr_proc_exit(struct net *net);
+int __net_init sctp_proc_init(struct net *net);
 
 /*
  * sctp/offload.c
diff --git a/net/sctp/proc.c b/net/sctp/proc.c
index 537545ebcb0e..17d0155d9de3 100644
--- a/net/sctp/proc.c
+++ b/net/sctp/proc.c
@@ -101,25 +101,6 @@ static const struct file_operations sctp_snmp_seq_fops = {
.release = single_release_net,
 };
 
-/* Set up the proc fs entry for 'snmp' object. */
-int __net_init sctp_snmp_proc_init(struct net *net)
-{
-   struct proc_dir_entry *p;
-
-   p = proc_create("snmp", S_IRUGO, net->sctp.proc_net_sctp,
-			&sctp_snmp_seq_fops);
-   if (!p)
-   return -ENOMEM;
-
-   return 0;
-}
-
-/* Cleanup the proc fs entry for 'snmp' object. */
-void sctp_snmp_proc_exit(struct net *net)
-{
-   remove_proc_entry("snmp", net->sctp.proc_net_sctp);
-}
-
 /* Dump local addresses of an association/endpoint. */
 static void sctp_seq_dump_local_addrs(struct seq_file *seq, struct sctp_ep_common *epb)
 {
@@ -259,25 +240,6 @@ static const struct file_operations sctp_eps_seq_fops = {
.release = seq_release_net,
 };
 
-/* Set up the proc fs entry for 'eps' object. */
-int __net_init sctp_eps_proc_init(struct net *net)
-{
-   struct proc_dir_entry *p;
-
-   p = proc_create("eps", S_IRUGO, net->sctp.proc_net_sctp,
-			&sctp_eps_seq_fops);
-   if (!p)
-   return -ENOMEM;
-
-   return 0;
-}
-
-/* Cleanup the proc fs entry for 'eps' object. */
-void sctp_eps_proc_exit(struct net *net)
-{
-   remove_proc_entry("eps", net->sctp.proc_net_sctp);
-}
-
 struct sctp_ht_iter {
struct seq_net_private p;
struct rhashtable_iter hti;
@@ -390,25 +352,6 @@ static const struct file_operations sctp_assocs_seq_fops = {
.release = seq_release_net,
 };
 
-/* Set up the proc fs entry for 'assocs' object. */
-int __net_init sctp_assocs_proc_init(struct net *net)
-{
-   struct proc_dir_entry *p;
-
-   p = proc_create("assocs", S_IRUGO, net->sctp.proc_net_sctp,
-			&sctp_assocs_seq_fops);
-   if (!p)
-   return -ENOMEM;
-
-   return 0;
-}
-
-/* Cleanup the proc fs entry for 'assocs' object. */
-void sctp_assocs_proc_exit(struct net *net)
-{
-   remove_proc_entry("assocs", net->sctp.proc_net_sctp);
-}
-
 static int sctp_remaddr_seq_show(struct seq_file *seq, void *v)
 {
struct sctp_association *assoc;
@@ -488,12 +431,6 @@ static const struct seq_operations sctp_remaddr_ops = {
.show  = sctp_remaddr_seq_show,
 };
 
-/* Cleanup the proc fs entry for 'remaddr' object. */
-void sctp_remaddr_proc_exit(struct net *net)
-{
-   remove_proc_entry("remaddr", net->sctp.proc_net_sctp);
-}
-
 static int sctp_remaddr_seq_open(struct inode *inode, struct file *file)
 {
 	return seq_open_net(inode, file, &sctp_remaddr_ops,
@@ -507,13 +444,28 @@ static const struct file_operations sctp_remaddr_seq_fops = {
.release = seq_release_net,
 };
 
-int __net_init sctp_remaddr_proc_init(struct net *net)
+/* Set up the proc fs entry for the SCTP protocol. */
+int __net_init sctp_proc_init(struct net *net)
 {
-   struct proc_dir_entry *p;
-
-   p = proc_create("remaddr", S_IRUGO, net->sctp.proc_net_sctp,
-			&sctp_remaddr_seq_fops);
-   if (!p)
+   net->sctp.proc_net_sctp = proc_net_mkdir(net, "sctp", net->proc_net);
+   if (!net->sctp.proc_net_sctp)
return -ENOMEM;
+   if (!proc_create("snmp", S_IRUGO, net->sctp.proc_net_sctp,
+			 &sctp_snmp_seq_fops))
+   goto cleanup;
+   if (!proc_create("eps", S_IRUGO, net->sctp.proc_net_sctp,
+			 &sctp_eps_seq_fops))
+   goto cleanup;
+   if (!proc_cr
