Re: [Devel] Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware

2007-10-01 Thread Denis V. Lunev
Patrick McHardy wrote:
 Eric W. Biederman wrote:
 Patrick McHardy [EMAIL PROTECTED] writes:


 Maybe I can save you some time: we used to do down_trylock()
 for the rtnl mutex, so senders would simply return if someone
 else was already processing the queue *or* the rtnl was locked
 for some other reason. In the first case the process already
 processing the queue would also process the new messages, but
 if the rtnl was locked for some other reason (for example
 during module registration) the message would sit in the
 queue until the next rtnetlink sendmsg call, which is why
 rtnl_unlock does queue processing. Commit 6756ae4b changed
 the down_trylock to mutex_lock, so senders will now simply wait
 until the mutex is released and then call netlink_run_queue
 themselves. This means it's not needed anymore.

 Sounds reasonable.

 I started looking through the code paths and I currently cannot
 see anything that would leave a message on a kernel rtnl socket.

 However I did a quick test adding a WARN_ON if there were any messages
 found in the queue during rtnl_unlock and I found this code path
 getting invoked from linkwatch_event.  So there is clearly something I
 don't understand, and it sounds just a bit at odds with your
 description.
 
 
 That sounds like a bug. Did you place the WARN_ON before or after
 the mutex_unlock()?

The presence of a message in the queue during rtnl_unlock is quite
possible, as the normal user-to-kernel message processing path for rtnl
is the following:

netlink_sendmsg
  netlink_unicast
    netlink_sendskb
      skb_queue_tail
      netlink_data_ready
        rtnetlink_rcv
          mutex_lock(&rtnl_mutex);
          netlink_run_queue(sk, qlen, rtnetlink_rcv_msg);
          mutex_unlock(&rtnl_mutex);

So the presence of a packet in the rtnl queue at rtnl_unlock looks to
me like a normal race with rtnetlink_rcv.

Regards,
Den
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Devel] Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware

2007-10-01 Thread Eric W. Biederman
Denis V. Lunev [EMAIL PROTECTED] writes:

 The presence of a message in the queue during rtnl_unlock is quite
 possible, as the normal user-to-kernel message processing path for rtnl
 is the following:

 netlink_sendmsg
   netlink_unicast
     netlink_sendskb
       skb_queue_tail
       netlink_data_ready
         rtnetlink_rcv
           mutex_lock(&rtnl_mutex);
           netlink_run_queue(sk, qlen, rtnetlink_rcv_msg);
           mutex_unlock(&rtnl_mutex);

 So the presence of a packet in the rtnl queue at rtnl_unlock looks to
 me like a normal race with rtnetlink_rcv.

Yes.  That is what I saw in practice as well.
Thanks for confirming this.

It happened to be reproducible because I had a dhcp client asking
for a list of links in parallel with the actual link coming up
during boot.

Looking at netlink_unicast and netlink_broadcast I am generally
convinced that we can remove the call of sk_data_ready in
rtnl_unlock.  I think those are the only two possible paths
through there, and I don't see how we could miss processing a
packet on the way through.

What would be nice is if we could figure out how to eliminate
this race.  As that would allow netlink packets to be processed
synchronously and we could actually use current for security
checks, and for getting the context of the calling process.

Right now we are 99% of the way there, but because of the above
race the code must all be written as if netlink packets were coming
in completely asynchronously, which is unfortunate and a pain.

Eric


Re: [Devel] Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware

2007-09-30 Thread Denis V. Lunev
Hmm, so it looks like we do not need this queue processing at all...

Regards,
Den

Eric W. Biederman wrote:
 Patrick McHardy [EMAIL PROTECTED] writes:
 
 Maybe I can save you some time: we used to do down_trylock()
 for the rtnl mutex, so senders would simply return if someone
 else was already processing the queue *or* the rtnl was locked
 for some other reason. In the first case the process already
 processing the queue would also process the new messages, but
 if the rtnl was locked for some other reason (for example
 during module registration) the message would sit in the
 queue until the next rtnetlink sendmsg call, which is why
 rtnl_unlock does queue processing. Commit 6756ae4b changed
 the down_trylock to mutex_lock, so senders will now simply wait
 until the mutex is released and then call netlink_run_queue
 themselves. This means it's not needed anymore.
 
 Sounds reasonable.
 
 I started looking through the code paths and I currently cannot
 see anything that would leave a message on a kernel rtnl socket.
 
 However I did a quick test adding a WARN_ON if there were any messages
 found in the queue during rtnl_unlock and I found this code path
 getting invoked from linkwatch_event.  So there is clearly something I
 don't understand, and it sounds just a bit at odds with your
 description.
 
 If we can remove the extra queue processing that would be great,
 as it looks like a nice way to simplify the locking and the special
 cases in the code.
 
 Eric



Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware

2007-09-30 Thread Patrick McHardy
Eric W. Biederman wrote:
 Patrick McHardy [EMAIL PROTECTED] writes:
 
 
Maybe I can save you some time: we used to do down_trylock()
for the rtnl mutex, so senders would simply return if someone
else was already processing the queue *or* the rtnl was locked
for some other reason. In the first case the process already
processing the queue would also process the new messages, but
if the rtnl was locked for some other reason (for example
during module registration) the message would sit in the
queue until the next rtnetlink sendmsg call, which is why
rtnl_unlock does queue processing. Commit 6756ae4b changed
the down_trylock to mutex_lock, so senders will now simply wait
until the mutex is released and then call netlink_run_queue
themselves. This means it's not needed anymore.
 
 
 Sounds reasonable.
 
 I started looking through the code paths and I currently cannot
 see anything that would leave a message on a kernel rtnl socket.
 
 However I did a quick test adding a WARN_ON if there were any messages
 found in the queue during rtnl_unlock and I found this code path
 getting invoked from linkwatch_event.  So there is clearly something I
 don't understand, and it sounds just a bit at odds with your
 description.


That sounds like a bug. Did you place the WARN_ON before or after
the mutex_unlock()?


Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware

2007-09-29 Thread Patrick McHardy
Eric W. Biederman wrote:
  void rtnl_unlock(void)
  {
 -	mutex_unlock(&rtnl_mutex);
 -	if (rtnl && rtnl->sk_receive_queue.qlen)
 +	struct net *net;
 +
 +	/*
 +	 * Loop through all of the rtnl sockets until none of them (in
 +	 * a live network namespace) have queued packets.
 +	 *
 +	 * We have to be careful with the locking here as
 +	 * sk_data_ready aka rtnetlink_rcv takes the rtnl_mutex.
 +	 *
 +	 * To ensure the network namespace does not exit while
 +	 * we are processing packets on its rtnl socket we
 +	 * grab a reference to the network namespace, ignoring
 +	 * it if the network namespace has already exited.
 +	 */
 +retry:
 +	for_each_net(net) {
 +		struct sock *rtnl = net->rtnl;
 +
 +		if (!rtnl || !rtnl->sk_receive_queue.qlen)
 +			continue;
 +
 +		if (!maybe_get_net(net))
 +			continue;
 +
 +		mutex_unlock(&rtnl_mutex);
 		rtnl->sk_data_ready(rtnl, 0);
 +		mutex_lock(&rtnl_mutex);
 +		put_net(net);
 +		goto retry;
 +	}
 +	mutex_unlock(&rtnl_mutex);
 +
 	netdev_run_todo();
  }


I'm wondering why this receive queue processing on unlock is still
necessary today; we don't do trylock in rtnetlink_rcv anymore, so
all senders will simply wait until the lock is released and then
process the queue.


Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware

2007-09-29 Thread Eric W. Biederman
Patrick McHardy [EMAIL PROTECTED] writes:

 I'm wondering why this receive queue processing on unlock is still
 necessary today; we don't do trylock in rtnetlink_rcv anymore, so
 all senders will simply wait until the lock is released and then
 process the queue.

Good question, I should probably look.  I was lazy and didn't go back
and audit why we were doing this.  I just coded a routine that I was
certain would work.  It does appear that we are processing the queue
with sk_data_ready when we add a message, so this may be completely
unnecessary.  I will go back and look.  If we can remove this bit
things should be simpler.

Eric



Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware

2007-09-29 Thread Patrick McHardy
Eric W. Biederman wrote:
 Patrick McHardy [EMAIL PROTECTED] writes:
 
 
I'm wondering why this receive queue processing on unlock is still
necessary today; we don't do trylock in rtnetlink_rcv anymore, so
all senders will simply wait until the lock is released and then
process the queue.
 
 
 Good question, I should probably look.  I was lazy and didn't go back
 and audit why we were doing this.  I just coded a routine that I was
 certain would work.  It does appear that we are processing the queue
 with sk_data_ready when we add a message, so this may be completely
 unnecessary.  I will go back and look.  If we can remove this bit
 things should be simpler.


Maybe I can save you some time: we used to do down_trylock()
for the rtnl mutex, so senders would simply return if someone
else was already processing the queue *or* the rtnl was locked
for some other reason. In the first case the process already
processing the queue would also process the new messages, but
if the rtnl was locked for some other reason (for example
during module registration) the message would sit in the
queue until the next rtnetlink sendmsg call, which is why
rtnl_unlock does queue processing. Commit 6756ae4b changed
the down_trylock to mutex_lock, so senders will now simply wait
until the mutex is released and then call netlink_run_queue
themselves. This means it's not needed anymore.



Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware

2007-09-29 Thread Eric W. Biederman
Patrick McHardy [EMAIL PROTECTED] writes:

 Maybe I can save you some time: we used to do down_trylock()
 for the rtnl mutex, so senders would simply return if someone
 else was already processing the queue *or* the rtnl was locked
 for some other reason. In the first case the process already
 processing the queue would also process the new messages, but
 if the rtnl was locked for some other reason (for example
 during module registration) the message would sit in the
 queue until the next rtnetlink sendmsg call, which is why
 rtnl_unlock does queue processing. Commit 6756ae4b changed
 the down_trylock to mutex_lock, so senders will now simply wait
 until the mutex is released and then call netlink_run_queue
 themselves. This means it's not needed anymore.

Sounds reasonable.

I started looking through the code paths and I currently cannot
see anything that would leave a message on a kernel rtnl socket.

However I did a quick test adding a WARN_ON if there were any messages
found in the queue during rtnl_unlock and I found this code path
getting invoked from linkwatch_event.  So there is clearly something I
don't understand, and it sounds just a bit at odds with your
description.

If we can remove the extra queue processing that would be great,
as it looks like a nice way to simplify the locking and the special
cases in the code.

Eric


[PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware

2007-09-28 Thread Eric W. Biederman

After this patch none of the netlink callbacks support anything
except the initial network namespace, but the rtnetlink infrastructure
now handles multiple network namespaces.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/linux/rtnetlink.h   |8 ++--
 include/net/net_namespace.h |3 +
 net/bridge/br_netlink.c |4 +-
 net/core/fib_rules.c|4 +-
 net/core/neighbour.c|4 +-
 net/core/rtnetlink.c|   96 +--
 net/decnet/dn_dev.c |4 +-
 net/decnet/dn_route.c   |2 +-
 net/decnet/dn_table.c   |4 +-
 net/ipv4/devinet.c  |4 +-
 net/ipv4/fib_semantics.c|4 +-
 net/ipv4/ipmr.c |4 +-
 net/ipv4/route.c|2 +-
 net/ipv6/addrconf.c |   14 +++---
 net/ipv6/route.c|6 +-
 net/sched/act_api.c |8 ++--
 net/sched/cls_api.c |2 +-
 net/sched/sch_api.c |4 +-
 net/wireless/wext.c |5 ++-
 19 files changed, 129 insertions(+), 53 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 9c21e45..518247e 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -582,11 +582,11 @@ extern int __rtattr_parse_nested_compat(struct rtattr *tb[], int maxattr,
 ({	data = RTA_PAYLOAD(rta) >= len ? RTA_DATA(rta) : NULL; \
 	__rtattr_parse_nested_compat(tb, max, rta, len); })
 
-extern int rtnetlink_send(struct sk_buff *skb, u32 pid, u32 group, int echo);
-extern int rtnl_unicast(struct sk_buff *skb, u32 pid);
-extern int rtnl_notify(struct sk_buff *skb, u32 pid, u32 group,
+extern int rtnetlink_send(struct sk_buff *skb, struct net *net, u32 pid, u32 group, int echo);
+extern int rtnl_unicast(struct sk_buff *skb, struct net *net, u32 pid);
+extern int rtnl_notify(struct sk_buff *skb, struct net *net, u32 pid, u32 group,
 		       struct nlmsghdr *nlh, gfp_t flags);
-extern void rtnl_set_sk_err(u32 group, int error);
+extern void rtnl_set_sk_err(struct net *net, u32 group, int error);
 extern int rtnetlink_put_metrics(struct sk_buff *skb, u32 *metrics);
 extern int rtnl_put_cacheinfo(struct sk_buff *skb, struct dst_entry *dst,
 			      u32 id, u32 ts, u32 tsage, long expires,
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 934c840..f75607a 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -10,6 +10,7 @@
 
 struct proc_dir_entry;
 struct net_device;
+struct sock;
 struct net {
 	atomic_t		count;		/* To decided when the network
 						 *  namespace should be freed.
 						 */
@@ -29,6 +30,8 @@ struct net {
 	struct list_head	dev_base_head;
 	struct hlist_head	*dev_name_head;
 	struct hlist_head	*dev_index_head;
+
+	struct sock		*rtnl;		/* rtnetlink socket */
 };
 
 #ifdef CONFIG_NET
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index a4ffa2b..f5d6933 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -97,10 +97,10 @@ void br_ifinfo_notify(int event, struct net_bridge_port *port)
 		kfree_skb(skb);
 		goto errout;
 	}
-	err = rtnl_notify(skb, 0, RTNLGRP_LINK, NULL, GFP_ATOMIC);
+	err = rtnl_notify(skb, &init_net, 0, RTNLGRP_LINK, NULL, GFP_ATOMIC);
 errout:
 	if (err < 0)
-		rtnl_set_sk_err(RTNLGRP_LINK, err);
+		rtnl_set_sk_err(&init_net, RTNLGRP_LINK, err);
 }
 
 /*
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 357bfa0..03c803c 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -577,10 +577,10 @@ static void notify_rule_change(int event, struct fib_rule *rule,
 		kfree_skb(skb);
 		goto errout;
 	}
-	err = rtnl_notify(skb, pid, ops->nlgroup, nlh, GFP_KERNEL);
+	err = rtnl_notify(skb, &init_net, pid, ops->nlgroup, nlh, GFP_KERNEL);
 errout:
 	if (err < 0)
-		rtnl_set_sk_err(ops->nlgroup, err);
+		rtnl_set_sk_err(&init_net, ops->nlgroup, err);
 }
 
 static void attach_rules(struct list_head *rules, struct net_device *dev)
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 27001db..c452584 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2466,10 +2466,10 @@ static void __neigh_notify(struct neighbour *n, int type, int flags)
 		kfree_skb(skb);
 		goto errout;
 	}
-	err = rtnl_notify(skb, 0, RTNLGRP_NEIGH, NULL, GFP_ATOMIC);
+	err = rtnl_notify(skb, &init_net, 0, RTNLGRP_NEIGH, NULL, GFP_ATOMIC);
 errout:
 	if (err < 0)
-		rtnl_set_sk_err(RTNLGRP_NEIGH, err);
+		rtnl_set_sk_err(&init_net, RTNLGRP_NEIGH, err);
 }
 
 #ifdef CONFIG_ARPD
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 56fc4f9..fc49104 100644
---