cc to tipc-discussion and others.
///jon
-------- Forwarded Message --------
Subject: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
Date: Tue, 12 May 2020 14:38:29 -0400
From: Jon Maloy <jma...@redhat.com>
To: Hoang Huu Le <hoang.h...@dektech.com.au>, ma...@donjonn.com <ma...@donjonn.com>
Hi Hoang,
I had to re-read the discussion we had back in November to refresh my
memory about this.
To me, v2 looks like a better base for further development than the most
recent version, so I am sorry to have wasted your time here. I am still
convinced that the mcast/rcast synch protocol we developed can solve
almost all our problems if we use it correctly.
What about the following:
Sending side:
- Even the initial bulk is sent via tipc_mcast_xmit(), but with only one
destination node and as "mandatory rcast". The broadcast link may
still choose to send it as bcast in case rcast has been manually disabled
by the user, but the surplus messages will be dropped on the
non-destination nodes. (In the case of legacy nodes this is certain; in
the case of newer nodes we must make sure that it happens by setting
the 'destnode' field in the header and always checking for this in
tipc_named_rcv().) I think this is a weird and rare case where we
can accept the extra overhead of broadcast. Under all normal
conditions rcast will be used here.
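The send-side rule for the bulk can be sketched like this (a simplified
user-space model, not the kernel code: 'struct named_hdr' and
build_bulk_hdr() are hypothetical stand-ins for the real message header
and the header-initialization done around tipc_mcast_xmit()):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the send-side rule above for the initial bulk: address it
 * to exactly one node and stamp that node into the header's destnode
 * field, so a surplus broadcast copy can be recognized and dropped by
 * non-destination receivers. Field and type names are illustrative. */
struct named_hdr {
	uint32_t destnode;     /* 0 = all nodes, else the single destination */
	bool mandatory_rcast;  /* ask the broadcast link to use rcast */
};

struct named_hdr build_bulk_hdr(uint32_t new_node)
{
	struct named_hdr h = {
		.destnode = new_node,    /* bulk goes to this node only */
		.mandatory_rcast = true, /* normal case: rcast is used */
	};
	return h;
}
```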
- Regular PUBLISH/WITHDRAW messages are sent as "mandatory bcast",
or, if some nodes don't support TIPC_MCAST_RBCTL, as "mandatory rcast".
The broadcast link may still choose to send them as rcast, but this is
nothing we need to worry about as long as the synchronization mechanism
is active. Under all normal conditions bcast will be selected when
possible, since all nodes are destinations.
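In sketch form, the selection rule for the two message kinds could look
like this (a hypothetical helper, not a kernel function; the
TIPC_MCAST_RBCTL value mirrors the capability bit in node.h but is
otherwise illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define TIPC_MCAST_RBCTL (1 << 7)  /* capability bit, value illustrative */

enum method { METHOD_RCAST, METHOD_BCAST };

/* Pick the transmit method for a NAME_DISTR message, per the rules above:
 * - bulk to a single new node: mandatory rcast
 * - publish/withdraw: mandatory bcast, unless some node lacks
 *   TIPC_MCAST_RBCTL, in which case fall back to mandatory rcast
 */
enum method pick_method(bool is_bulk, uint32_t cluster_capabilities)
{
	if (is_bulk)
		return METHOD_RCAST;
	if (!(cluster_capabilities & TIPC_MCAST_RBCTL))
		return METHOD_RCAST;
	return METHOD_BCAST;
}
```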
Receiving side:
- We need a deferred queue, but last time I missed that we don't need
one per node; it is sufficient to have a global one, as you were
suggesting.
So to me the code is ok here.
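A minimal model of that single global deferred queue (plain arrays and
ints stand in for sk_buff_head and sk_buffs; defer()/flush_on_syn() are
illustrative names, not the kernel's):

```c
#include <stddef.h>

/* Minimal model of the global deferred queue suggested above:
 * NAME_DISTR messages that arrive before the matching SYN are parked,
 * then flushed, in arrival order, into the delivery queue once the
 * SYN shows up. Everything here is a user-space sketch. */
#define QLEN 16

struct defer_queue {
	int msgs[QLEN];
	size_t n;
};

void defer(struct defer_queue *q, int msg)
{
	if (q->n < QLEN)
		q->msgs[q->n++] = msg;
}

/* On SYN arrival: move all deferred messages, in order, to 'deliver'. */
size_t flush_on_syn(struct defer_queue *q, int *deliver, size_t cap)
{
	size_t i, out = 0;

	for (i = 0; i < q->n && out < cap; i++)
		deliver[out++] = q->msgs[i];
	q->n = 0;
	return out;
}
```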
- As mentioned earlier, tipc_named_rcv() only accepts messages where
destnode is "self" (bulk) or zero (publish/withdraw).
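That acceptance rule is simple enough to state as a predicate (a sketch
of the check tipc_named_rcv() would do, with a hypothetical function
name):

```c
#include <stdbool.h>
#include <stdint.h>

/* Acceptance rule described above: keep a NAME_DISTR message only if
 * it was addressed to this node (bulk, destnode == self) or to
 * everyone (publish/withdraw, destnode == 0). Surplus bcast copies
 * landing on non-destination nodes fail this test and are dropped. */
bool named_accept(uint32_t destnode, uint32_t self)
{
	return destnode == self || destnode == 0;
}
```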
The point of all this is that all messages will now be subject to the
synchronization mechanism, so the bulk messages are
guaranteed to be delivered in order and ahead of individual
publications/withdrawals, and even topology changes will almost
never lead to a change of sending method and a need for synchronization.
We must make one change to the broadcast protocol though: even
"mandatory" messages must be made subject to the synchronization
mechanism, something that doesn't seem to be the case now.
(Or maybe we don't need to set "mandatory" at all? This needs to
be checked.)
The advantage of this method is simplicity:
- We don't need any separate synchronization mechanism for
NAME_DISTR messages at all; we just use what is already there.
The disadvantage is that there will be a lot of SYN messages sent
if we are not careful:
- Every time a new node comes up, the other nodes will send it
a bcast SYN before they send the bulk, because they believe they have
just switched from bcast (normal mode) to rcast (bulk mode).
This one is in reality unnecessary, since we can be sure that
the new node has not been sent any previous bcast that needs
to be synched with.
- At the first publication/withdrawal sent after the bulk, the neighbor
nodes will send an rcast SYN to all other nodes, because they just
"switched" back from rcast to bcast.
In this case we really do need to send an rcast SYN to the new node
that came up, since this is the only one where there is a risk of
a race. This message in reality serves as an "end-of-bulk" message,
but is handled like any other SYN by the receiving tipc_mcast_filter_msg().
So, to avoid this, the broadcast protocol needs to be able to recognize
NAME_DISTR messages and treat them slightly differently from the others,
or it needs to be told what to do by the name_distr.c code via the API.
Maybe a couple of new fields in struct tipc_mc_method?
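One possible shape for such an extension (the 'rcast' and 'mandatory'
flags mirror the existing struct tipc_mc_method; the other two fields
and the struct name are pure speculation about what name_distr.c might
want to tell the broadcast layer):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical extension of struct tipc_mc_method: the existing flags
 * plus two speculative fields that would let name_distr.c control SYN
 * handling for NAME_DISTR traffic via the API. */
struct tipc_mc_method_ext {
	bool rcast;         /* replicast used as transmit method */
	bool mandatory;     /* method fixed by caller, don't re-evaluate */
	bool suppress_syn;  /* hypothetical: skip the SYN on method switch */
	uint32_t syn_dest;  /* hypothetical: send SYN only to this node,
			     * 0 = all destinations */
};
```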
What do you think?
Regards
///jon
On 5/12/20 6:22 AM, Hoang Huu Le wrote:
Just forward the patch I mentioned.
-----Original Message-----
From: Hoang Le <hoang.h...@dektech.com.au>
Sent: Tuesday, November 19, 2019 5:01 PM
To: jon.ma...@ericsson.com; ma...@donjonn.com; tipc-...@dektech.com.au
Subject: [PATCH 2/2] tipc: update a binding service via broadcast
Currently, updating the binding table (adding a service binding to the
name table/withdrawing a service binding) is done over replicast.
However, if we scale clusters up to > 100 nodes/containers, this
method is less efficient, because it loops through the nodes in a
cluster one by one.
It is worth using broadcast to update a binding service. The binding
table is then updated in all nodes in one shot.
The mechanism is backward compatible, as the sync message is silently
dropped.
v2: resolve synchronization problem when switching from unicast to
broadcast
Signed-off-by: Hoang Le <hoang.h...@dektech.com.au>
---
net/tipc/bcast.c | 3 ++-
net/tipc/link.c | 6 ++++++
net/tipc/name_table.c | 33 ++++++++++++++++++++++++++++++---
net/tipc/name_table.h | 4 ++++
net/tipc/node.c | 32 ++++++++++++++++++++++++++++++++
net/tipc/node.h | 2 ++
6 files changed, 76 insertions(+), 4 deletions(-)
diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index e06f05d55534..44ed481fec47 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -324,7 +324,8 @@ static int tipc_mcast_send_sync(struct net *net, struct sk_buff *skb,
hdr = buf_msg(skb);
if (msg_user(hdr) == MSG_FRAGMENTER)
hdr = msg_inner_hdr(hdr);
- if (msg_type(hdr) != TIPC_MCAST_MSG)
+ if (msg_user(hdr) != NAME_DISTRIBUTOR &&
+ msg_type(hdr) != TIPC_MCAST_MSG)
return 0;
/* Allocate dummy message */
diff --git a/net/tipc/link.c b/net/tipc/link.c
index fb72031228c9..a2e9a64d5a0f 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -1190,6 +1190,8 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
struct sk_buff_head *inputq)
{
struct sk_buff_head *mc_inputq = l->bc_rcvlink->inputq;
+ struct name_table *nt = tipc_name_table(l->net);
+ struct sk_buff_head *defnq = &nt->defer_namedq;
struct tipc_msg *hdr = buf_msg(skb);
switch (msg_user(hdr)) {
@@ -1211,6 +1213,10 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
case NAME_DISTRIBUTOR:
l->bc_rcvlink->state = LINK_ESTABLISHED;
skb_queue_tail(l->namedq, skb);
+
+ spin_lock_bh(&defnq->lock);
+ tipc_mcast_filter_msg(l->net, defnq, l->namedq);
+ spin_unlock_bh(&defnq->lock);
return true;
case MSG_BUNDLER:
case TUNNEL_PROTOCOL:
diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index 66a65c2cdb23..593dcd11357f 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -615,9 +615,11 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
struct tipc_net *tn = tipc_net(net);
struct publication *p = NULL;
struct sk_buff *skb = NULL;
+ bool rcast;
spin_lock_bh(&tn->nametbl_lock);
+ rcast = nt->rcast;
if (nt->local_publ_count >= TIPC_MAX_PUBL) {
pr_warn("Bind failed, max limit %u reached\n", TIPC_MAX_PUBL);
goto exit;
@@ -632,8 +634,18 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
exit:
spin_unlock_bh(&tn->nametbl_lock);
- if (skb)
- tipc_node_broadcast(net, skb);
+ if (skb) {
+ /* Use broadcast if all nodes support broadcast NAME_DISTR */
+ if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
+ tipc_node_broadcast_named_publish(net, skb, &rcast);
+ spin_lock_bh(&tn->nametbl_lock);
+ nt->rcast = rcast;
+ spin_unlock_bh(&tn->nametbl_lock);
+ } else {
+ /* Otherwise, be backwards compatible */
+ tipc_node_broadcast(net, skb);
+ }
+ }
return p;
}
@@ -648,8 +660,10 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
u32 self = tipc_own_addr(net);
struct sk_buff *skb = NULL;
struct publication *p;
+ bool rcast;
spin_lock_bh(&tn->nametbl_lock);
+ rcast = nt->rcast;
p = tipc_nametbl_remove_publ(net, type, lower, upper, self, key);
if (p) {
@@ -664,7 +678,16 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
spin_unlock_bh(&tn->nametbl_lock);
if (skb) {
- tipc_node_broadcast(net, skb);
+ /* Use broadcast if all nodes support broadcast NAME_DISTR */
+ if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
+ tipc_node_broadcast_named_publish(net, skb, &rcast);
+ spin_lock_bh(&tn->nametbl_lock);
+ nt->rcast = rcast;
+ spin_unlock_bh(&tn->nametbl_lock);
+ } else {
+ /* Otherwise, be backwards compatible */
+ tipc_node_broadcast(net, skb);
+ }
return 1;
}
return 0;
@@ -746,6 +769,9 @@ int tipc_nametbl_init(struct net *net)
INIT_LIST_HEAD(&nt->cluster_scope);
rwlock_init(&nt->cluster_scope_lock);
tn->nametbl = nt;
+ /* 'bulk' updated messages via unicast */
+ nt->rcast = true;
+ skb_queue_head_init(&nt->defer_namedq);
spin_lock_init(&tn->nametbl_lock);
return 0;
}
@@ -784,6 +810,7 @@ void tipc_nametbl_stop(struct net *net)
* publications, then release the name table
*/
spin_lock_bh(&tn->nametbl_lock);
+ skb_queue_purge(&nt->defer_namedq);
for (i = 0; i < TIPC_NAMETBL_SIZE; i++) {
if (hlist_empty(&nt->services[i]))
continue;
diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
index f79066334cc8..b8cdf2a29d48 100644
--- a/net/tipc/name_table.h
+++ b/net/tipc/name_table.h
@@ -95,6 +95,8 @@ struct publication {
* - used by name_distr to send bulk updates to new nodes
* - used by name_distr during re-init of name table
* @local_publ_count: number of publications issued by this node
+ * @defer_namedq: temporary queue for 'synching' updates
+ * @rcast: previous method used to publish/withdraw a service
*/
struct name_table {
struct hlist_head services[TIPC_NAMETBL_SIZE];
@@ -102,6 +104,8 @@ struct name_table {
struct list_head cluster_scope;
rwlock_t cluster_scope_lock;
u32 local_publ_count;
+ struct sk_buff_head defer_namedq;
+ bool rcast;
};
int tipc_nl_name_table_dump(struct sk_buff *skb, struct netlink_callback *cb);
diff --git a/net/tipc/node.c b/net/tipc/node.c
index aaf595613e6e..b058647fa78b 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -2981,3 +2981,35 @@ void tipc_node_pre_cleanup_net(struct net *exit_net)
}
rcu_read_unlock();
}
+
+int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
+ bool *rcast)
+{
+ struct tipc_mc_method method = {.rcast = *rcast};
+ struct sk_buff_head xmitq;
+ struct tipc_nlist dests;
+ struct tipc_node *n;
+ u16 cong_link_cnt;
+ int rc = 0;
+
+ __skb_queue_head_init(&xmitq);
+ __skb_queue_tail(&xmitq, skb);
+
+ tipc_nlist_init(&dests, tipc_own_addr(net));
+ rcu_read_lock();
+ list_for_each_entry_rcu(n, tipc_nodes(net), list) {
+ if (in_own_node(net, n->addr))
+ continue;
+ if (!node_is_up(n))
+ continue;
+ tipc_nlist_add(&dests, n->addr);
+ }
+ rcu_read_unlock();
+
+ rc = tipc_mcast_xmit(net, &xmitq, &method, &dests, &cong_link_cnt);
+ *rcast = method.rcast;
+
+ tipc_nlist_purge(&dests);
+ __skb_queue_purge(&xmitq);
+ return rc;
+}
diff --git a/net/tipc/node.h b/net/tipc/node.h
index a6803b449a2c..d7d19f9932b1 100644
--- a/net/tipc/node.h
+++ b/net/tipc/node.h
@@ -124,4 +124,6 @@ int tipc_nl_node_set_key(struct sk_buff *skb,
struct genl_info *info);
int tipc_nl_node_flush_key(struct sk_buff *skb, struct genl_info *info);
#endif
void tipc_node_pre_cleanup_net(struct net *exit_net);
+int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
+ bool *rcast);
#endif
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion