Hi Hoang,

Just some improvements to (my own) log message text below. Then you can go ahead and add an "Acked-by" from me.
///jon

> -----Original Message-----
> From: Hoang Le <hoang.h...@dektech.com.au>
> Sent: 21-Oct-19 00:16
> To: tipc-discussion@lists.sourceforge.net; Jon Maloy <jon.ma...@ericsson.com>;
> ma...@donjonn.com; ying....@windriver.com; l...@redhat.com
> Subject: [net-next v2] tipc: improve throughput between nodes in netns
>
> Currently, TIPC transports intra-node user data messages directly socket to
> socket, hence shortcutting all the lower layers of the communication stack.
> This gives TIPC very good intra-node performance, both regarding throughput
> and latency.
>
> We now introduce a similar mechanism for TIPC data traffic across network
> name spaces located in the same kernel. On the send path, the call chain is,
> as always, accompanied by the sending node's network name space pointer.
> However, once we have reliably established that the receiving node is
> represented by a name space on the same host, we just replace the name
> space pointer with that of the receiving node/name space, and follow the
> regular socket receive path through the receiving node. This technique gives
> us a throughput similar to the node-internal throughput, several times larger
> than if we let the traffic go through the full network stack. As a comparison,
> max throughput for 64k messages is four times larger than TCP throughput for
> the same type of traffic in a similar environment.
>
> To meet any security concerns, the following should be noted.
>
> - All nodes joining a cluster are supposed to have been certified and
>   authenticated by mechanisms outside TIPC. This is no different for
>   nodes/name spaces on the same host; they have to auto-discover each other
>   using the attached interfaces, and establish links which are supervised via the
>   regular link monitoring mechanism. Hence, a kernel-local node has no other
>   way to join a cluster than any other node, and has to obey the policies set in
>   the IP or device layers of the stack.
>
> - Only when a sender has established with 100% certainty that the peer node
>   is located in a kernel-local name space does it choose to let user data
>   messages, and only those, take the crossover path to the receiving
>   node/name space.
>
> - If the receiving node/name space is removed, its name space pointer is
>   invalidated at all peer nodes, and their neighbor link monitoring will
>   eventually note that this node is gone.
>
> - To ensure the "100% certainty" criterion, and prevent any possible spoofing,
>   received discovery messages must contain a proof that

s/they know a common secret./the sender knows a common secret./g

>   We use the hash_mix of the sending node/name space for this
>   purpose, since it can be accessed directly by all other name spaces in the
>   kernel. Upon reception of a discovery message, the receiver checks this proof
>   against all the local name spaces' hash_mix values. If it finds a match, this,
>   together with a matching node id and cluster id, is deemed sufficient proof
>   that the peer node in question is in a local name space, and a wormhole can
>   be opened.
>
> - We should also consider that TIPC is intended to be a cluster-local IPC
>   mechanism (just like e.g. UNIX sockets) rather than a network protocol, and
>   hence

s/should be given more freedom to shortcut the lower protocol than other protocols/we think it can be justified to allow it to shortcut the lower protocol layers./g

> Regarding traceability, we should notice that since commit 6c9081a3915d
> ("tipc: add loopback device tracking") it is possible to follow the node-internal
> packet flow by just activating tcpdump on the loopback interface. This will be
> true even for this mechanism; by activating tcpdump on the involved nodes'
> loopback interfaces, their inter-name space messaging can easily be tracked.
>
> Suggested-by: Jon Maloy <jon.ma...@ericsson.com>
> Signed-off-by: Hoang Le <hoang.h...@dektech.com.au>
> ---
>  net/tipc/discover.c   |  10 ++++-
>  net/tipc/msg.h        |  10 +++++
>  net/tipc/name_distr.c |   2 +-
>  net/tipc/node.c       | 100 ++++++++++++++++++++++++++++++++++++++++--
>  net/tipc/node.h       |   4 +-
>  net/tipc/socket.c     |   6 +--
>  6 files changed, 121 insertions(+), 11 deletions(-)
>
> diff --git a/net/tipc/discover.c b/net/tipc/discover.c
> index c138d68e8a69..338d402fcf39 100644
> --- a/net/tipc/discover.c
> +++ b/net/tipc/discover.c
> @@ -38,6 +38,8 @@
>  #include "node.h"
>  #include "discover.h"
>
> +#include <net/netns/hash.h>
> +
>  /* min delay during bearer start up */
>  #define TIPC_DISC_INIT	msecs_to_jiffies(125)
>  /* max delay if bearer has no links */
> @@ -83,6 +85,7 @@ static void tipc_disc_init_msg(struct net *net, struct sk_buff *skb,
>  	struct tipc_net *tn = tipc_net(net);
>  	u32 dest_domain = b->domain;
>  	struct tipc_msg *hdr;
> +	u32 hash;
>
>  	hdr = buf_msg(skb);
>  	tipc_msg_init(tn->trial_addr, hdr, LINK_CONFIG, mtyp,
> @@ -94,6 +97,10 @@ static void tipc_disc_init_msg(struct net *net, struct sk_buff *skb,
>  	msg_set_dest_domain(hdr, dest_domain);
>  	msg_set_bc_netid(hdr, tn->net_id);
>  	b->media->addr2msg(msg_media_addr(hdr), &b->addr);
> +	hash = tn->random;
> +	hash ^= net_hash_mix(&init_net);
> +	hash ^= net_hash_mix(net);
> +	msg_set_peer_net_hash(hdr, hash);
>  	msg_set_node_id(hdr, tipc_own_id(net));
>  }
>
> @@ -242,7 +249,8 @@ void tipc_disc_rcv(struct net *net, struct sk_buff *skb,
>  	if (!tipc_in_scope(legacy, b->domain, src))
>  		return;
>  	tipc_node_check_dest(net, src, peer_id, b, caps, signature,
> -			     &maddr, &respond, &dupl_addr);
> +			     msg_peer_net_hash(hdr), &maddr, &respond,
> +			     &dupl_addr);
>  	if (dupl_addr)
>  		disc_dupl_alert(b, src, &maddr);
>  	if (!respond)
> diff --git a/net/tipc/msg.h b/net/tipc/msg.h
> index 0daa6f04ca81..a8d0f28094f2 100644
> --- a/net/tipc/msg.h
> +++ b/net/tipc/msg.h
> @@ -973,6 +973,16 @@ static inline void msg_set_grp_remitted(struct tipc_msg *m, u16 n)
>  	msg_set_bits(m, 9, 16, 0xffff, n);
>  }
>
> +static inline void msg_set_peer_net_hash(struct tipc_msg *m, u32 n)
> +{
> +	msg_set_word(m, 9, n);
> +}
> +
> +static inline u32 msg_peer_net_hash(struct tipc_msg *m)
> +{
> +	return msg_word(m, 9);
> +}
> +
>  /* Word 10
>   */
>  static inline u16 msg_grp_evt(struct tipc_msg *m)
> diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
> index 836e629e8f4a..5feaf3b67380 100644
> --- a/net/tipc/name_distr.c
> +++ b/net/tipc/name_distr.c
> @@ -146,7 +146,7 @@ static void named_distribute(struct net *net, struct sk_buff_head *list,
>  	struct publication *publ;
>  	struct sk_buff *skb = NULL;
>  	struct distr_item *item = NULL;
> -	u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0) - INT_H_SIZE) /
> +	u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0, false) - INT_H_SIZE) /
>  			ITEM_SIZE) * ITEM_SIZE;
>  	u32 msg_rem = msg_dsz;
>
> diff --git a/net/tipc/node.c b/net/tipc/node.c
> index c8f6177dd5a2..780b726041dd 100644
> --- a/net/tipc/node.c
> +++ b/net/tipc/node.c
> @@ -45,6 +45,8 @@
>  #include "netlink.h"
>  #include "trace.h"
>
> +#include <net/netns/hash.h>
> +
>  #define INVALID_NODE_SIG	0x10000
>  #define NODE_CLEANUP_AFTER	300000
>
> @@ -126,6 +128,7 @@ struct tipc_node {
>  	struct timer_list timer;
>  	struct rcu_head rcu;
>  	unsigned long delete_at;
> +	struct net *pnet;
>  };
>
>  /* Node FSM states and events:
> @@ -184,7 +187,7 @@ static struct tipc_link *node_active_link(struct tipc_node *n, int sel)
>  	return n->links[bearer_id].link;
>  }
>
> -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel)
> +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected)
>  {
>  	struct tipc_node *n;
>  	int bearer_id;
> @@ -194,6 +197,14 @@ int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel)
>  	if (unlikely(!n))
>  		return mtu;
>
> +	/* Allow MAX_MSG_SIZE when building connection oriented message
> +	 * if they are in the same core network
> +	 */
> +	if (n->pnet && connected) {
> +		tipc_node_put(n);
> +		return mtu;
> +	}
> +
>  	bearer_id = n->active_links[sel & 1];
>  	if (likely(bearer_id != INVALID_BEARER_ID))
>  		mtu = n->links[bearer_id].mtu;
> @@ -361,12 +372,16 @@ static void tipc_node_write_unlock(struct tipc_node *n)
>  }
>
>  static struct tipc_node *tipc_node_create(struct net *net, u32 addr,
> -					  u8 *peer_id, u16 capabilities)
> +					  u8 *peer_id, u16 capabilities,
> +					  u32 signature, u32 hash_mixes)
>  {
>  	struct tipc_net *tn = net_generic(net, tipc_net_id);
>  	struct tipc_node *n, *temp_node;
> +	struct tipc_net *tn_peer;
>  	struct tipc_link *l;
> +	struct net *tmp;
>  	int bearer_id;
> +	u32 hash_chk;
>  	int i;
>
>  	spin_lock_bh(&tn->node_list_lock);
> @@ -400,6 +415,25 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr,
>  	memcpy(&n->peer_id, peer_id, 16);
>  	n->net = net;
>  	n->capabilities = capabilities;
> +	n->pnet = NULL;
> +	for_each_net_rcu(tmp) {
> +		tn_peer = net_generic(tmp, tipc_net_id);
> +		if (!tn_peer)
> +			continue;
> +		/* Integrity checking whether node exists in namespace or not */
> +		if (tn_peer->net_id != tn->net_id)
> +			continue;
> +		if (memcmp(peer_id, tn_peer->node_id, NODE_ID_LEN))
> +			continue;
> +
> +		hash_chk = tn_peer->random;
> +		hash_chk ^= net_hash_mix(&init_net);
> +		hash_chk ^= net_hash_mix(tmp);
> +		if (hash_chk ^ hash_mixes)
> +			continue;
> +		n->pnet = tmp;
> +		break;
> +	}
>  	kref_init(&n->kref);
>  	rwlock_init(&n->lock);
>  	INIT_HLIST_NODE(&n->hash);
> @@ -979,7 +1013,7 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr)
>
>  void tipc_node_check_dest(struct net *net, u32 addr,
>  			  u8 *peer_id, struct tipc_bearer *b,
> -			  u16 capabilities, u32 signature,
> +			  u16 capabilities, u32 signature, u32 hash_mixes,
>  			  struct tipc_media_addr *maddr,
>  			  bool *respond, bool *dupl_addr)
>  {
> @@ -998,7 +1032,8 @@ void tipc_node_check_dest(struct net *net, u32 addr,
>  	*dupl_addr = false;
>  	*respond = false;
>
> -	n = tipc_node_create(net, addr, peer_id, capabilities);
> +	n = tipc_node_create(net, addr, peer_id, capabilities, signature,
> +			     hash_mixes);
>  	if (!n)
>  		return;
>
> @@ -1424,6 +1459,52 @@ static int __tipc_nl_add_node(struct tipc_nl_msg *msg, struct tipc_node *node)
>  	return -EMSGSIZE;
>  }
>
> +static void tipc_lxc_xmit(struct net *pnet, struct sk_buff_head *list)
> +{
> +	struct tipc_msg *hdr = buf_msg(skb_peek(list));
> +	struct sk_buff_head inputq;
> +
> +	switch (msg_user(hdr)) {
> +	case TIPC_LOW_IMPORTANCE:
> +	case TIPC_MEDIUM_IMPORTANCE:
> +	case TIPC_HIGH_IMPORTANCE:
> +	case TIPC_CRITICAL_IMPORTANCE:
> +		if (msg_connected(hdr) || msg_named(hdr)) {
> +			spin_lock_init(&list->lock);
> +			tipc_sk_rcv(pnet, list);
> +			return;
> +		}
> +		if (msg_mcast(hdr)) {
> +			skb_queue_head_init(&inputq);
> +			tipc_sk_mcast_rcv(pnet, list, &inputq);
> +			__skb_queue_purge(list);
> +			skb_queue_purge(&inputq);
> +			return;
> +		}
> +		return;
> +	case MSG_FRAGMENTER:
> +		if (tipc_msg_assemble(list)) {
> +			skb_queue_head_init(&inputq);
> +			tipc_sk_mcast_rcv(pnet, list, &inputq);
> +			__skb_queue_purge(list);
> +			skb_queue_purge(&inputq);
> +		}
> +		return;
> +	case GROUP_PROTOCOL:
> +	case CONN_MANAGER:
> +		spin_lock_init(&list->lock);
> +		tipc_sk_rcv(pnet, list);
> +		return;
> +	case LINK_PROTOCOL:
> +	case NAME_DISTRIBUTOR:
> +	case TUNNEL_PROTOCOL:
> +	case BCAST_PROTOCOL:
> +		return;
> +	default:
> +		return;
> +	};
> +}
> +
>  /**
>   * tipc_node_xmit() is the general link level function for message sending
>   * @net: the applicable net namespace
> @@ -1439,6 +1520,7 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list,
>  	struct tipc_link_entry *le = NULL;
>  	struct tipc_node *n;
>  	struct sk_buff_head xmitq;
> +	bool node_up = false;
>  	int bearer_id;
>  	int rc;
>
> @@ -1455,6 +1537,16 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list,
>  		return -EHOSTUNREACH;
>  	}
>
> +	node_up = node_is_up(n);
> +	if (node_up && n->pnet && check_net(n->pnet)) {
> +		/* xmit inner linux container */
> +		tipc_lxc_xmit(n->pnet, list);
> +		if (likely(skb_queue_empty(list))) {
> +			tipc_node_put(n);
> +			return 0;
> +		}
> +	}
> +
>  	tipc_node_read_lock(n);
>  	bearer_id = n->active_links[selector & 1];
>  	if (unlikely(bearer_id == INVALID_BEARER_ID)) {
> diff --git a/net/tipc/node.h b/net/tipc/node.h
> index 291d0ecd4101..2557d40fd417 100644
> --- a/net/tipc/node.h
> +++ b/net/tipc/node.h
> @@ -75,7 +75,7 @@ u32 tipc_node_get_addr(struct tipc_node *node);
>  u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr);
>  void tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128,
>  			  struct tipc_bearer *bearer,
> -			  u16 capabilities, u32 signature,
> +			  u16 capabilities, u32 signature, u32 hash_mixes,
>  			  struct tipc_media_addr *maddr,
>  			  bool *respond, bool *dupl_addr);
>  void tipc_node_delete_links(struct net *net, int bearer_id);
> @@ -92,7 +92,7 @@ void tipc_node_unsubscribe(struct net *net, struct list_head *subscr, u32 addr);
>  void tipc_node_broadcast(struct net *net, struct sk_buff *skb);
>  int tipc_node_add_conn(struct net *net, u32 dnode, u32 port, u32 peer_port);
>  void tipc_node_remove_conn(struct net *net, u32 dnode, u32 port);
> -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel);
> +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected);
>  bool tipc_node_is_up(struct net *net, u32 addr);
>  u16 tipc_node_get_capabilities(struct net *net, u32 addr);
>  int tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb);
> diff --git a/net/tipc/socket.c b/net/tipc/socket.c
> index 3b9f8cc328f5..fb24df03da6c 100644
> --- a/net/tipc/socket.c
> +++ b/net/tipc/socket.c
> @@ -854,7 +854,7 @@ static int tipc_send_group_msg(struct net *net, struct tipc_sock *tsk,
>
>  	/* Build message as chain of buffers */
>  	__skb_queue_head_init(&pkts);
> -	mtu = tipc_node_get_mtu(net, dnode, tsk->portid);
> +	mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false);
>  	rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts);
>  	if (unlikely(rc != dlen))
>  		return rc;
> @@ -1388,7 +1388,7 @@ static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen)
>  		return rc;
>
>  	__skb_queue_head_init(&pkts);
> -	mtu = tipc_node_get_mtu(net, dnode, tsk->portid);
> +	mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false);
>  	rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts);
>  	if (unlikely(rc != dlen))
>  		return rc;
> @@ -1526,7 +1526,7 @@ static void tipc_sk_finish_conn(struct tipc_sock *tsk, u32 peer_port,
>  	sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV);
>  	tipc_set_sk_state(sk, TIPC_ESTABLISHED);
>  	tipc_node_add_conn(net, peer_node, tsk->portid, peer_port);
> -	tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid);
> +	tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid, true);
>  	tsk->peer_caps = tipc_node_get_capabilities(net, peer_node);
>  	__skb_queue_purge(&sk->sk_write_queue);
>  	if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL)
> --
> 2.20.1

_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion