Re: [Devel] Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware
Patrick McHardy wrote: Eric W. Biederman wrote: Patrick McHardy [EMAIL PROTECTED] writes: Maybe I can save you some time: we used to do down_trylock() for the rtnl mutex, so senders would simply return if someone else was already processing the queue *or* the rtnl was locked for some other reason. In the first case the process already processing the queue would also process the new messages, but if it the rtnl was locked for some other reason (for example during module registration) the message would sit in the queue until the next rtnetlink sendmsg call, which is why rtnl_unlock does queue processing. Commit 6756ae4b changed the down_trylock to mutex_lock, so senders will now simply wait until the mutex is released and then call netlink_run_queue themselves. This means its not needed anymore. Sounds reasonable. I started looking through the code paths and I currently cannot see anything that would leave a message on a kernel rtnl socket. However I did a quick test adding a WARN_ON if there were any messages found in the queue during rtnl_unlock and I found this code path getting invoked from linkwatch_event. So there is clearly something I don't understand, and it sounds at odds just a bit from your description. That sounds like a bug. Did you place the WARN_ON before or after the mutex_unlock()? The presence of the message in the queue during rtnl_unlock is quite possible as normal user-kernel message processing path for rtnl is the following: netlink_sendmsg netlink_unicast netlink_sendskb skb_queue_tail netlink_data_ready rtnetlink_rcv mutex_lock(rtnl_mutex); netlink_run_queue(sk, qlen, rtnetlink_rcv_msg); mutex_unlock(rtnl_mutex); so, the presence of the packet in the rtnl queue on rtnl_unlock is normal race with a rtnetlink_rcv for me. Regards, Den - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Devel] Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware
Denis V. Lunev [EMAIL PROTECTED] writes: The presence of the message in the queue during rtnl_unlock is quite possible as normal user-kernel message processing path for rtnl is the following: netlink_sendmsg netlink_unicast netlink_sendskb skb_queue_tail netlink_data_ready rtnetlink_rcv mutex_lock(rtnl_mutex); netlink_run_queue(sk, qlen, rtnetlink_rcv_msg); mutex_unlock(rtnl_mutex); so, the presence of the packet in the rtnl queue on rtnl_unlock is normal race with a rtnetlink_rcv for me. Yes. That is what I saw in practice as well. Thanks for confirming this. It happened to reproducible because I had a dhcp client asking for a list of links in parallel with the actual link coming up during boot. Looking at netlink_unicast and netlink_broadcast I am generally convinced that we can remove the call of sk_data_ready in rtnl_unlock. I think those are the only two possible paths through there and I don't see how we could miss a processing a packet on the way through there. What would be nice is if we could figure out how to eliminate this race. As that would allow netlink packets to be processed synchronously and we could actually use current for security checks, and for getting the context of the calling process. Right now we are 99% of the way there but because of the above race the code must all be written as if netlink packets were coming in completely asynchronously. Which is unfortunate and a pain. Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Devel] Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware
Hmm, so it looks like we do not need this queue processing at all... Regards, Den Eric W. Biederman wrote: Patrick McHardy [EMAIL PROTECTED] writes: Maybe I can save you some time: we used to do down_trylock() for the rtnl mutex, so senders would simply return if someone else was already processing the queue *or* the rtnl was locked for some other reason. In the first case the process already processing the queue would also process the new messages, but if it the rtnl was locked for some other reason (for example during module registration) the message would sit in the queue until the next rtnetlink sendmsg call, which is why rtnl_unlock does queue processing. Commit 6756ae4b changed the down_trylock to mutex_lock, so senders will now simply wait until the mutex is released and then call netlink_run_queue themselves. This means its not needed anymore. Sounds reasonable. I started looking through the code paths and I currently cannot see anything that would leave a message on a kernel rtnl socket. However I did a quick test adding a WARN_ON if there were any messages found in the queue during rtnl_unlock and I found this code path getting invoked from linkwatch_event. So there is clearly something I don't understand, and it sounds at odds just a bit from your description. If we can remove the extra queue processing that would be great, as it looks like a nice way to simplify the locking and the special cases in the code. Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware
Eric W. Biederman wrote: Patrick McHardy [EMAIL PROTECTED] writes: Maybe I can save you some time: we used to do down_trylock() for the rtnl mutex, so senders would simply return if someone else was already processing the queue *or* the rtnl was locked for some other reason. In the first case the process already processing the queue would also process the new messages, but if it the rtnl was locked for some other reason (for example during module registration) the message would sit in the queue until the next rtnetlink sendmsg call, which is why rtnl_unlock does queue processing. Commit 6756ae4b changed the down_trylock to mutex_lock, so senders will now simply wait until the mutex is released and then call netlink_run_queue themselves. This means its not needed anymore. Sounds reasonable. I started looking through the code paths and I currently cannot see anything that would leave a message on a kernel rtnl socket. However I did a quick test adding a WARN_ON if there were any messages found in the queue during rtnl_unlock and I found this code path getting invoked from linkwatch_event. So there is clearly something I don't understand, and it sounds at odds just a bit from your description. That sounds like a bug. Did you place the WARN_ON before or after the mutex_unlock()? - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware
Eric W. Biederman wrote: void rtnl_unlock(void) { - mutex_unlock(rtnl_mutex); - if (rtnl rtnl-sk_receive_queue.qlen) + struct net *net; + + /* + * Loop through all of the rtnl sockets until none of them (in + * a live network namespace) have queue packets. + * + * We have to be careful with the locking here as + * sk_data_ready aka rtnetlink_rcv takes the rtnl_mutex. + * + * To ensure the network namespace does not exit while + * we are processing packets on it's rtnl socket we + * grab a reference to the network namespace, ignoring + * it if the network namespace has already exited. + */ +retry: + for_each_net(net) { + struct sock *rtnl = net-rtnl; + + if (!rtnl || !rtnl-sk_receive_queue.qlen) + continue; + + if (!maybe_get_net(net)) + continue; + + mutex_unlock(rtnl_mutex); rtnl-sk_data_ready(rtnl, 0); + mutex_lock(rtnl_mutex); + put_net(net); + goto retry; + } + mutex_unlock(rtnl_mutex); + netdev_run_todo(); } I'm wondering why this receive queue processing on unlock is still necessary today, we don't do trylock in rtnetlink_rcv anymore, so all senders will simply wait until the lock is released and then process the queue. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware
Patrick McHardy [EMAIL PROTECTED] writes: I'm wondering why this receive queue processing on unlock is still necessary today, we don't do trylock in rtnetlink_rcv anymore, so all senders will simply wait until the lock is released and then process the queue. Good question, I should probably look. I was lazy and didn't go back and audit why we were doing this. I just coded a routine that I was certain would work. It does appear that we are processing the queue with sk_data_read when we add a message, so this may be completely unnecessary. I will go back and look. If we can remove this bit things should be simpler. Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware
Eric W. Biederman wrote: Patrick McHardy [EMAIL PROTECTED] writes: I'm wondering why this receive queue processing on unlock is still necessary today, we don't do trylock in rtnetlink_rcv anymore, so all senders will simply wait until the lock is released and then process the queue. Good question, I should probably look. I was lazy and didn't go back and audit why we were doing this. I just coded a routine that I was certain would work. It does appear that we are processing the queue with sk_data_read when we add a message, so this may be completely unnecessary. I will go back and look. If we can remove this bit things should be simpler. Maybe I can save you some time: we used to do down_trylock() for the rtnl mutex, so senders would simply return if someone else was already processing the queue *or* the rtnl was locked for some other reason. In the first case the process already processing the queue would also process the new messages, but if it the rtnl was locked for some other reason (for example during module registration) the message would sit in the queue until the next rtnetlink sendmsg call, which is why rtnl_unlock does queue processing. Commit 6756ae4b changed the down_trylock to mutex_lock, so senders will now simply wait until the mutex is released and then call netlink_run_queue themselves. This means its not needed anymore. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware
Patrick McHardy [EMAIL PROTECTED] writes: Maybe I can save you some time: we used to do down_trylock() for the rtnl mutex, so senders would simply return if someone else was already processing the queue *or* the rtnl was locked for some other reason. In the first case the process already processing the queue would also process the new messages, but if it the rtnl was locked for some other reason (for example during module registration) the message would sit in the queue until the next rtnetlink sendmsg call, which is why rtnl_unlock does queue processing. Commit 6756ae4b changed the down_trylock to mutex_lock, so senders will now simply wait until the mutex is released and then call netlink_run_queue themselves. This means its not needed anymore. Sounds reasonable. I started looking through the code paths and I currently cannot see anything that would leave a message on a kernel rtnl socket. However I did a quick test adding a WARN_ON if there were any messages found in the queue during rtnl_unlock and I found this code path getting invoked from linkwatch_event. So there is clearly something I don't understand, and it sounds at odds just a bit from your description. If we can remove the extra queue processing that would be great, as it looks like a nice way to simplify the locking and the special cases in the code. Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware
After this patch none of the netlink callback support anything except the initial network namespace but the rtnetlink infrastructure now handles multiple network namespaces. Signed-off-by: Eric W. Biederman [EMAIL PROTECTED] --- include/linux/rtnetlink.h |8 ++-- include/net/net_namespace.h |3 + net/bridge/br_netlink.c |4 +- net/core/fib_rules.c|4 +- net/core/neighbour.c|4 +- net/core/rtnetlink.c| 96 +-- net/decnet/dn_dev.c |4 +- net/decnet/dn_route.c |2 +- net/decnet/dn_table.c |4 +- net/ipv4/devinet.c |4 +- net/ipv4/fib_semantics.c|4 +- net/ipv4/ipmr.c |4 +- net/ipv4/route.c|2 +- net/ipv6/addrconf.c | 14 +++--- net/ipv6/route.c|6 +- net/sched/act_api.c |8 ++-- net/sched/cls_api.c |2 +- net/sched/sch_api.c |4 +- net/wireless/wext.c |5 ++- 19 files changed, 129 insertions(+), 53 deletions(-) diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h index 9c21e45..518247e 100644 --- a/include/linux/rtnetlink.h +++ b/include/linux/rtnetlink.h @@ -582,11 +582,11 @@ extern int __rtattr_parse_nested_compat(struct rtattr *tb[], int maxattr, ({ data = RTA_PAYLOAD(rta) = len ? RTA_DATA(rta) : NULL; \ __rtattr_parse_nested_compat(tb, max, rta, len); }) -extern int rtnetlink_send(struct sk_buff *skb, u32 pid, u32 group, int echo); -extern int rtnl_unicast(struct sk_buff *skb, u32 pid); -extern int rtnl_notify(struct sk_buff *skb, u32 pid, u32 group, +extern int rtnetlink_send(struct sk_buff *skb, struct net *net, u32 pid, u32 group, int echo); +extern int rtnl_unicast(struct sk_buff *skb, struct net *net, u32 pid); +extern int rtnl_notify(struct sk_buff *skb, struct net *net, u32 pid, u32 group, struct nlmsghdr *nlh, gfp_t flags); -extern void rtnl_set_sk_err(u32 group, int error); +extern void rtnl_set_sk_err(struct net *net, u32 group, int error); extern int rtnetlink_put_metrics(struct sk_buff *skb, u32 *metrics); extern int rtnl_put_cacheinfo(struct sk_buff *skb, struct dst_entry *dst, u32 id, u32 ts, u32 tsage, long expires, diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h index 934c840..f75607a 100644 --- a/include/net/net_namespace.h +++ b/include/net/net_namespace.h @@ -10,6 +10,7 @@ struct proc_dir_entry; struct net_device; +struct sock; struct net { atomic_tcount; /* To decided when the network * namespace should be freed. @@ -29,6 +30,8 @@ struct net { struct list_headdev_base_head; struct hlist_head *dev_name_head; struct hlist_head *dev_index_head; + + struct sock *rtnl; /* rtnetlink socket */ }; #ifdef CONFIG_NET diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c index a4ffa2b..f5d6933 100644 --- a/net/bridge/br_netlink.c +++ b/net/bridge/br_netlink.c @@ -97,10 +97,10 @@ void br_ifinfo_notify(int event, struct net_bridge_port *port) kfree_skb(skb); goto errout; } - err = rtnl_notify(skb, 0, RTNLGRP_LINK, NULL, GFP_ATOMIC); + err = rtnl_notify(skb, init_net,0, RTNLGRP_LINK, NULL, GFP_ATOMIC); errout: if (err 0) - rtnl_set_sk_err(RTNLGRP_LINK, err); + rtnl_set_sk_err(init_net, RTNLGRP_LINK, err); } /* diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c index 357bfa0..03c803c 100644 --- a/net/core/fib_rules.c +++ b/net/core/fib_rules.c @@ -577,10 +577,10 @@ static void notify_rule_change(int event, struct fib_rule *rule, kfree_skb(skb); goto errout; } - err = rtnl_notify(skb, pid, ops-nlgroup, nlh, GFP_KERNEL); + err = rtnl_notify(skb, init_net, pid, ops-nlgroup, nlh, GFP_KERNEL); errout: if (err 0) - rtnl_set_sk_err(ops-nlgroup, err); + rtnl_set_sk_err(init_net, ops-nlgroup, err); } static void attach_rules(struct list_head *rules, struct net_device *dev) diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 27001db..c452584 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -2466,10 +2466,10 @@ static void __neigh_notify(struct neighbour *n, int type, int flags) kfree_skb(skb); goto errout; } - err = rtnl_notify(skb, 0, RTNLGRP_NEIGH, NULL, GFP_ATOMIC); + err = rtnl_notify(skb, init_net, 0, RTNLGRP_NEIGH, NULL, GFP_ATOMIC); errout: if (err 0) - rtnl_set_sk_err(RTNLGRP_NEIGH, err); + rtnl_set_sk_err(init_net, RTNLGRP_NEIGH, err); } #ifdef CONFIG_ARPD diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 56fc4f9..fc49104 100644 ---