Hi Hoang,
Just some improvements to (my own) log message text below.  Then you can go 
ahead and add "acked-by" from me.

///jon


> -----Original Message-----
> From: Hoang Le <hoang.h...@dektech.com.au>
> Sent: 21-Oct-19 00:16
> To: tipc-discussion@lists.sourceforge.net; Jon Maloy
> <jon.ma...@ericsson.com>; ma...@donjonn.com; ying....@windriver.com;
> l...@redhat.com
> Subject: [net-next v2] tipc: improve throughput between nodes in netns
> 
> Currently, TIPC transports intra-node user data messages directly from socket
> to socket, hence shortcutting all the lower layers of the communication stack.
> This gives TIPC very good intra-node performance, both regarding throughput
> and latency.
> 
> We now introduce a similar mechanism for TIPC data traffic across network
> name spaces located in the same kernel. On the send path, the call chain is as
> always accompanied by the sending node's network name space pointer.
> However, once we have reliably established that the receiving node is
> represented by a name space on the same host, we just replace the name
> space pointer with the receiving node/name space's ditto, and follow the
> regular socket receive path through the receiving node. This technique gives
> us a throughput similar to the node internal throughput, several times larger
> than if we let the traffic go through the full network stack. As a comparison,
> max throughput for 64k messages is four times larger than TCP throughput for
> the same type of traffic in a similar environment.
> 
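As a side note for readers of this cover text: the crossover decision on the
send path boils down to something like the condensed sketch below. This is
only an illustration, not the patch code; node_is_up(), check_net(),
tipc_lxc_xmit() and n->pnet behave as in the patch further down, while
node_xmit_via_bearers() is just a made-up stand-in for the existing
link-level transmit path.

    /* Rough sketch of the send-path decision made in tipc_node_xmit() */
    static int node_xmit_sketch(struct net *net, struct tipc_node *n,
                                struct sk_buff_head *list)
    {
            /* n->pnet is the peer's own name space pointer, learned and
             * verified at discovery time
             */
            if (node_is_up(n) && n->pnet && check_net(n->pnet)) {
                    /* deliver directly into the peer name space's socket
                     * receive path, shortcutting bearers and links
                     */
                    tipc_lxc_xmit(n->pnet, list);
                    if (skb_queue_empty(list))
                            return 0;
            }
            /* made-up stand-in for the existing link-level path */
            return node_xmit_via_bearers(net, n, list);
    }
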
> To meet any security concerns, the following should be noted.
> 
> - All nodes joining a cluster are supposed to have been certified and
> authenticated by mechanisms outside TIPC. This is no different for
> nodes/name spaces on the same host; they have to auto discover each other
> using the attached interfaces, and establish links which are supervised via 
> the
> regular link monitoring mechanism. Hence, a kernel local node has no other
> way to join a cluster than any other node, and has to obey the policies set in
> the IP or device layers of the stack.
> 
> - Only when a sender has established with 100% certainty that the peer node
> is located in a kernel local name space does it choose to let user data 
> messages,
> and only those, take the crossover path to the receiving node/name space.
> 
> - If the receiving node/name space is removed, its name space pointer is
> invalidated at all peer nodes, and their neighbor link monitoring will 
> eventually
> note that this node is gone.
> 
> - To ensure the "100% certainty" criterion, and prevent any possible spoofing,
> received discovery messages must contain a proof that 

s/they know a common secret./the sender knows a common secret./g

> We use the hash_mix of the sending node/name space for this
> purpose, since it can be accessed directly by all other name spaces in the
> kernel. Upon reception of a discovery message, the receiver checks this proof
> against all the local name spaces' hash_mix values. If it finds a match, this,
> along with a matching node id and cluster id, is deemed sufficient proof that
> the peer node in question is in a local name space, and a wormhole can be
> opened.
> 
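To make the proof check concrete, it amounts to roughly the following
(condensed from the tipc_node_create() change further down; the helper name
find_local_peer_net() is made up for illustration, and peer_hash_mix is the
value carried in the received discovery message):

    /* Return the local name space that can prove ownership of the received
     * hash_mix, or NULL if the peer is not on this host.
     */
    static struct net *find_local_peer_net(struct tipc_net *tn, u8 *peer_id,
                                           u32 peer_hash_mix)
    {
            struct tipc_net *tn_peer;
            struct net *tmp;

            for_each_net_rcu(tmp) {
                    tn_peer = net_generic(tmp, tipc_net_id);
                    if (!tn_peer || tn_peer->net_id != tn->net_id)
                            continue;
                    if (memcmp(peer_id, tn_peer->node_id, NODE_ID_LEN))
                            continue;
                    /* the "common secret": random ^ hash_mix(init_net) ^
                     * hash_mix(own name space), as set in tipc_disc_init_msg()
                     */
                    if ((tn_peer->random ^ net_hash_mix(&init_net) ^
                         net_hash_mix(tmp)) != peer_hash_mix)
                            continue;
                    return tmp;
            }
            return NULL;
    }
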
> - We should also consider that TIPC is intended to be a cluster local IPC
> mechanism (just like e.g. UNIX sockets) rather than a network protocol, and
> hence 

s/should be given more freedom to shortcut the lower protocol than other
protocols/we think it can be justified to allow it to shortcut the lower
protocol layers./g
> 
> Regarding traceability, we should notice that since commit 6c9081a3915d
> ("tipc: add loopback device tracking") it is possible to follow the node 
> internal
> packet flow by just activating tcpdump on the loopback interface. This will be
> true even for this mechanism; by activating tcpdump on the involved nodes'
> loopback interfaces, their inter-name space messaging can easily be tracked.
> 
> Suggested-by: Jon Maloy <jon.ma...@ericsson.com>
> Signed-off-by: Hoang Le <hoang.h...@dektech.com.au>
> ---
>  net/tipc/discover.c   |  10 ++++-
>  net/tipc/msg.h        |  10 +++++
>  net/tipc/name_distr.c |   2 +-
>  net/tipc/node.c       | 100 ++++++++++++++++++++++++++++++++++++++++--
>  net/tipc/node.h       |   4 +-
>  net/tipc/socket.c     |   6 +--
>  6 files changed, 121 insertions(+), 11 deletions(-)
> 
> diff --git a/net/tipc/discover.c b/net/tipc/discover.c
> index c138d68e8a69..338d402fcf39 100644
> --- a/net/tipc/discover.c
> +++ b/net/tipc/discover.c
> @@ -38,6 +38,8 @@
>  #include "node.h"
>  #include "discover.h"
> 
> +#include <net/netns/hash.h>
> +
>  /* min delay during bearer start up */
>  #define TIPC_DISC_INIT       msecs_to_jiffies(125)
>  /* max delay if bearer has no links */
> @@ -83,6 +85,7 @@ static void tipc_disc_init_msg(struct net *net, struct sk_buff *skb,
>       struct tipc_net *tn = tipc_net(net);
>       u32 dest_domain = b->domain;
>       struct tipc_msg *hdr;
> +     u32 hash;
> 
>       hdr = buf_msg(skb);
>       tipc_msg_init(tn->trial_addr, hdr, LINK_CONFIG, mtyp,
> @@ -94,6 +97,10 @@ static void tipc_disc_init_msg(struct net *net, struct sk_buff *skb,
>       msg_set_dest_domain(hdr, dest_domain);
>       msg_set_bc_netid(hdr, tn->net_id);
>       b->media->addr2msg(msg_media_addr(hdr), &b->addr);
> +     hash = tn->random;
> +     hash ^= net_hash_mix(&init_net);
> +     hash ^= net_hash_mix(net);
> +     msg_set_peer_net_hash(hdr, hash);
>       msg_set_node_id(hdr, tipc_own_id(net));
>  }
> 
> @@ -242,7 +249,8 @@ void tipc_disc_rcv(struct net *net, struct sk_buff *skb,
>       if (!tipc_in_scope(legacy, b->domain, src))
>               return;
>       tipc_node_check_dest(net, src, peer_id, b, caps, signature,
> -                          &maddr, &respond, &dupl_addr);
> +                          msg_peer_net_hash(hdr), &maddr, &respond,
> +                          &dupl_addr);
>       if (dupl_addr)
>               disc_dupl_alert(b, src, &maddr);
>       if (!respond)
> diff --git a/net/tipc/msg.h b/net/tipc/msg.h index
> 0daa6f04ca81..a8d0f28094f2 100644
> --- a/net/tipc/msg.h
> +++ b/net/tipc/msg.h
> @@ -973,6 +973,16 @@ static inline void msg_set_grp_remitted(struct tipc_msg *m, u16 n)
>       msg_set_bits(m, 9, 16, 0xffff, n);
>  }
> 
> +static inline void msg_set_peer_net_hash(struct tipc_msg *m, u32 n)
> +{
> +     msg_set_word(m, 9, n);
> +}
> +
> +static inline u32 msg_peer_net_hash(struct tipc_msg *m)
> +{
> +     return msg_word(m, 9);
> +}
> +
>  /* Word 10
>   */
>  static inline u16 msg_grp_evt(struct tipc_msg *m)
> diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
> index 836e629e8f4a..5feaf3b67380 100644
> --- a/net/tipc/name_distr.c
> +++ b/net/tipc/name_distr.c
> @@ -146,7 +146,7 @@ static void named_distribute(struct net *net, struct sk_buff_head *list,
>       struct publication *publ;
>       struct sk_buff *skb = NULL;
>       struct distr_item *item = NULL;
> -     u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0) - INT_H_SIZE) /
> +     u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0, false) - INT_H_SIZE) /
>                       ITEM_SIZE) * ITEM_SIZE;
>       u32 msg_rem = msg_dsz;
> 
> diff --git a/net/tipc/node.c b/net/tipc/node.c
> index c8f6177dd5a2..780b726041dd 100644
> --- a/net/tipc/node.c
> +++ b/net/tipc/node.c
> @@ -45,6 +45,8 @@
>  #include "netlink.h"
>  #include "trace.h"
> 
> +#include <net/netns/hash.h>
> +
>  #define INVALID_NODE_SIG     0x10000
>  #define NODE_CLEANUP_AFTER   300000
> 
> @@ -126,6 +128,7 @@ struct tipc_node {
>       struct timer_list timer;
>       struct rcu_head rcu;
>       unsigned long delete_at;
> +     struct net *pnet;
>  };
> 
>  /* Node FSM states and events:
> @@ -184,7 +187,7 @@ static struct tipc_link *node_active_link(struct tipc_node *n, int sel)
>       return n->links[bearer_id].link;
>  }
> 
> -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel)
> +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected)
>  {
>       struct tipc_node *n;
>       int bearer_id;
> @@ -194,6 +197,14 @@ int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel)
>       if (unlikely(!n))
>               return mtu;
> 
> +     /* Allow MAX_MSG_SIZE when building connection oriented message
> +      * if they are in the same core network
> +      */
> +     if (n->pnet && connected) {
> +             tipc_node_put(n);
> +             return mtu;
> +     }
> +
>       bearer_id = n->active_links[sel & 1];
>       if (likely(bearer_id != INVALID_BEARER_ID))
>               mtu = n->links[bearer_id].mtu;
> @@ -361,12 +372,16 @@ static void tipc_node_write_unlock(struct tipc_node *n)
>  }
> 
>  static struct tipc_node *tipc_node_create(struct net *net, u32 addr,
> -                                       u8 *peer_id, u16 capabilities)
> +                                       u8 *peer_id, u16 capabilities,
> +                                       u32 signature, u32 hash_mixes)
>  {
>       struct tipc_net *tn = net_generic(net, tipc_net_id);
>       struct tipc_node *n, *temp_node;
> +     struct tipc_net *tn_peer;
>       struct tipc_link *l;
> +     struct net *tmp;
>       int bearer_id;
> +     u32 hash_chk;
>       int i;
> 
>       spin_lock_bh(&tn->node_list_lock);
> @@ -400,6 +415,25 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr,
>       memcpy(&n->peer_id, peer_id, 16);
>       n->net = net;
>       n->capabilities = capabilities;
> +     n->pnet = NULL;
> +     for_each_net_rcu(tmp) {
> +             tn_peer = net_generic(tmp, tipc_net_id);
> +             if (!tn_peer)
> +                     continue;
> +             /* Integrity checking whether node exists in namespace or not */
> +             if (tn_peer->net_id != tn->net_id)
> +                     continue;
> +             if (memcmp(peer_id, tn_peer->node_id, NODE_ID_LEN))
> +                     continue;
> +
> +             hash_chk = tn_peer->random;
> +             hash_chk ^= net_hash_mix(&init_net);
> +             hash_chk ^= net_hash_mix(tmp);
> +             if (hash_chk ^ hash_mixes)
> +                     continue;
> +             n->pnet = tmp;
> +             break;
> +     }
>       kref_init(&n->kref);
>       rwlock_init(&n->lock);
>       INIT_HLIST_NODE(&n->hash);
> @@ -979,7 +1013,7 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr)
> 
>  void tipc_node_check_dest(struct net *net, u32 addr,
>                         u8 *peer_id, struct tipc_bearer *b,
> -                       u16 capabilities, u32 signature,
> +                       u16 capabilities, u32 signature, u32 hash_mixes,
>                         struct tipc_media_addr *maddr,
>                         bool *respond, bool *dupl_addr)
>  {
> @@ -998,7 +1032,8 @@ void tipc_node_check_dest(struct net *net, u32 addr,
>       *dupl_addr = false;
>       *respond = false;
> 
> -     n = tipc_node_create(net, addr, peer_id, capabilities);
> +     n = tipc_node_create(net, addr, peer_id, capabilities, signature,
> +                          hash_mixes);
>       if (!n)
>               return;
> 
> @@ -1424,6 +1459,52 @@ static int __tipc_nl_add_node(struct tipc_nl_msg *msg, struct tipc_node *node)
>       return -EMSGSIZE;
>  }
> 
> +static void tipc_lxc_xmit(struct net *pnet, struct sk_buff_head *list)
> +{
> +     struct tipc_msg *hdr = buf_msg(skb_peek(list));
> +     struct sk_buff_head inputq;
> +
> +     switch (msg_user(hdr)) {
> +     case TIPC_LOW_IMPORTANCE:
> +     case TIPC_MEDIUM_IMPORTANCE:
> +     case TIPC_HIGH_IMPORTANCE:
> +     case TIPC_CRITICAL_IMPORTANCE:
> +             if (msg_connected(hdr) || msg_named(hdr)) {
> +                     spin_lock_init(&list->lock);
> +                     tipc_sk_rcv(pnet, list);
> +                     return;
> +             }
> +             if (msg_mcast(hdr)) {
> +                     skb_queue_head_init(&inputq);
> +                     tipc_sk_mcast_rcv(pnet, list, &inputq);
> +                     __skb_queue_purge(list);
> +                     skb_queue_purge(&inputq);
> +                     return;
> +             }
> +             return;
> +     case MSG_FRAGMENTER:
> +             if (tipc_msg_assemble(list)) {
> +                     skb_queue_head_init(&inputq);
> +                     tipc_sk_mcast_rcv(pnet, list, &inputq);
> +                     __skb_queue_purge(list);
> +                     skb_queue_purge(&inputq);
> +             }
> +             return;
> +     case GROUP_PROTOCOL:
> +     case CONN_MANAGER:
> +             spin_lock_init(&list->lock);
> +             tipc_sk_rcv(pnet, list);
> +             return;
> +     case LINK_PROTOCOL:
> +     case NAME_DISTRIBUTOR:
> +     case TUNNEL_PROTOCOL:
> +     case BCAST_PROTOCOL:
> +             return;
> +     default:
> +             return;
> +     };
> +}
> +
>  /**
>   * tipc_node_xmit() is the general link level function for message sending
>   * @net: the applicable net namespace
> @@ -1439,6 +1520,7 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list,
>       struct tipc_link_entry *le = NULL;
>       struct tipc_node *n;
>       struct sk_buff_head xmitq;
> +     bool node_up = false;
>       int bearer_id;
>       int rc;
> 
> @@ -1455,6 +1537,16 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list,
>               return -EHOSTUNREACH;
>       }
> 
> +     node_up = node_is_up(n);
> +     if (node_up && n->pnet && check_net(n->pnet)) {
> +             /* xmit inner linux container */
> +             tipc_lxc_xmit(n->pnet, list);
> +             if (likely(skb_queue_empty(list))) {
> +                     tipc_node_put(n);
> +                     return 0;
> +             }
> +     }
> +
>       tipc_node_read_lock(n);
>       bearer_id = n->active_links[selector & 1];
>       if (unlikely(bearer_id == INVALID_BEARER_ID)) {
> diff --git a/net/tipc/node.h b/net/tipc/node.h
> index 291d0ecd4101..2557d40fd417 100644
> --- a/net/tipc/node.h
> +++ b/net/tipc/node.h
> @@ -75,7 +75,7 @@ u32 tipc_node_get_addr(struct tipc_node *node);
>  u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr);
>  void tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128,
>                         struct tipc_bearer *bearer,
> -                       u16 capabilities, u32 signature,
> +                       u16 capabilities, u32 signature, u32 hash_mixes,
>                         struct tipc_media_addr *maddr,
>                         bool *respond, bool *dupl_addr);
>  void tipc_node_delete_links(struct net *net, int bearer_id);
> @@ -92,7 +92,7 @@ void tipc_node_unsubscribe(struct net *net, struct list_head *subscr, u32 addr);
>  void tipc_node_broadcast(struct net *net, struct sk_buff *skb);
>  int tipc_node_add_conn(struct net *net, u32 dnode, u32 port, u32 peer_port);
>  void tipc_node_remove_conn(struct net *net, u32 dnode, u32 port);
> -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel);
> +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected);
>  bool tipc_node_is_up(struct net *net, u32 addr);
>  u16 tipc_node_get_capabilities(struct net *net, u32 addr);
>  int tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb);
> diff --git a/net/tipc/socket.c b/net/tipc/socket.c
> index 3b9f8cc328f5..fb24df03da6c 100644
> --- a/net/tipc/socket.c
> +++ b/net/tipc/socket.c
> @@ -854,7 +854,7 @@ static int tipc_send_group_msg(struct net *net, struct tipc_sock *tsk,
> 
>       /* Build message as chain of buffers */
>       __skb_queue_head_init(&pkts);
> -     mtu = tipc_node_get_mtu(net, dnode, tsk->portid);
> +     mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false);
>       rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts);
>       if (unlikely(rc != dlen))
>               return rc;
> @@ -1388,7 +1388,7 @@ static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen)
>               return rc;
> 
>       __skb_queue_head_init(&pkts);
> -     mtu = tipc_node_get_mtu(net, dnode, tsk->portid);
> +     mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false);
>       rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts);
>       if (unlikely(rc != dlen))
>               return rc;
> @@ -1526,7 +1526,7 @@ static void tipc_sk_finish_conn(struct tipc_sock *tsk, u32 peer_port,
>       sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV);
>       tipc_set_sk_state(sk, TIPC_ESTABLISHED);
>       tipc_node_add_conn(net, peer_node, tsk->portid, peer_port);
> -     tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid);
> +     tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid, true);
>       tsk->peer_caps = tipc_node_get_capabilities(net, peer_node);
>       __skb_queue_purge(&sk->sk_write_queue);
>       if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL)
> --
> 2.20.1



