Added more recipients.
-------- Forwarded Message --------
Subject: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
Date: Fri, 29 May 2020 12:08:02 -0400
From: Jon Maloy <jma...@redhat.com>
To: Hoang Huu Le <hoang.h...@dektech.com.au>, ma...@donjonn.com
<ma...@donjonn.com>
Hi Hoang,
See below.
On 5/27/20 6:49 AM, Hoang Huu Le wrote:
Hi Jon,
I have a draft version based on your idea (see the attached file).
But from my point of view, this version introduces too much code on
the sending side.
I don't think it is a good idea to keep rcast_list and bcast_list in
the name table.
I think we should find a new approach or just drop the feature.
Yes, you are right.
I came up with a new idea, to just add a sequence number to the
broadcast/replicast messages and re-order them at reception. This even
handles the case if the first broadcast message arrives before the first
bulk message, something we have not anticipated before.
I couldn't resist the temptation to code it, as you can see in the
patch I just sent out.
It is totally untested; I just added the code as I thought it should be
and made sure it compiled.
There is still a little too much new code for my taste, but this might
be a way forward.
Please give your feedback on this.
I also noticed a couple of things while working with this:
1) There is still an 'expires' field in tipc_mcast_method, and it
even seems to be considered when switching bcast/rcast. The whole
point of adding the mcast synchronization mechanism was to get rid of
this delay. Have you tested that synchronization really works without
the 'expires'?
2) There are some remnants of old code for the name table dist_queue.
This functionality was made redundant by me at least two years ago, so
this should be cleaned up.
3) We might have a potential race condition when new nodes come up, so
that publications are distributed twice.
a) A publication is added to the name table, and the name table
lock is released.
b) A new node comes up, and the new publication is delivered in the
bulk message.
c) The broadcast of the publication goes ahead and sends it out to
all nodes, even the one that just came up.
d) We end up with a double publication on one of the nodes.
e) One of those will linger in the name table after the publication
is withdrawn.
I have never seen this happen, and my analysis might be wrong, but
to me this looks like a possible scenario.
Note that my patch doesn't fix this, but we could possibly arrange
it by adding a 'distributed' flag in the publication item on the
sending side, so that the bulk distribution will ignore it.
Regards
///jon
Regards,
Hoang
-----Original Message-----
From: Jon Maloy <jma...@redhat.com>
Sent: Thursday, May 21, 2020 9:44 PM
To: Hoang Huu Le <hoang.h...@dektech.com.au>; ma...@donjonn.com
Subject: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
Hi,
I have one more comment below. Looking forward to your feedback.
///jon
On 5/20/20 9:03 PM, Hoang Huu Le wrote:
Yeah, thanks Jon.
I will investigate more on your idea when I finish the issue with lab
infrastructure.
Hoang
-----Original Message-----
From: Jon Maloy <jma...@redhat.com>
Sent: Thursday, May 21, 2020 7:13 AM
To: Hoang Huu Le <hoang.h...@dektech.com.au>; ma...@donjonn.com
Subject: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
Hi Hoang,
Below I try to summarize my newest proposal in relation to v2 of your
patch.
1) The bulk can be sent just as is done now, with the addition that we
add the NOT_LAST bit to the header.
2) We need a new capability bit identifying nodes which support
broadcast NAME_DISTR. I don't see that we can just reuse
TIPC_MCAST_RBCTL, because the recipients need to have code for
handling concurrent bulk/broadcast receptions. This bit is not added
to the cluster capability word.
3) We need to keep two structs of type tipc_nlist in the name table.
One contains tipc_dest structures for all nodes NOT supporting
TIPC_MCAST_NAMEDISTR (rcast_list), and the other those which do
(bcast_list).
For more comments see below.
On 5/12/20 6:22 AM, Hoang Huu Le wrote:
Just forward the patch I mentioned.
-----Original Message-----
From: Hoang Le <hoang.h...@dektech.com.au>
Sent: Tuesday, November 19, 2019 5:01 PM
To: jon.ma...@ericsson.com; ma...@donjonn.com; tipc-...@dektech.com.au
Subject: [PATCH 2/2] tipc: update a binding service via broadcast
Currently, updating the binding table (adding a service binding to the
name table or withdrawing one) is done over replicast. However, if we
scale clusters up to > 100 nodes/containers, this method is
inefficient because it loops through the nodes in a cluster one by
one.
It is worthwhile to use broadcast to update a binding service; then
the binding table is updated on all nodes in one shot.
The mechanism is backward compatible, as the sync message is silently
dropped on legacy nodes.
v2: resolve synchronization problem when switching from unicast to
broadcast
Signed-off-by: Hoang Le <hoang.h...@dektech.com.au>
---
net/tipc/bcast.c | 3 ++-
net/tipc/link.c | 6 ++++++
net/tipc/name_table.c | 33 ++++++++++++++++++++++++++++++---
net/tipc/name_table.h | 4 ++++
net/tipc/node.c | 32 ++++++++++++++++++++++++++++++++
net/tipc/node.h | 2 ++
6 files changed, 76 insertions(+), 4 deletions(-)
diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index e06f05d55534..44ed481fec47 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -324,7 +324,8 @@ static int tipc_mcast_send_sync(struct net *net, struct sk_buff *skb,
hdr = buf_msg(skb);
if (msg_user(hdr) == MSG_FRAGMENTER)
hdr = msg_inner_hdr(hdr);
- if (msg_type(hdr) != TIPC_MCAST_MSG)
+ if (msg_user(hdr) != NAME_DISTRIBUTOR &&
+ msg_type(hdr) != TIPC_MCAST_MSG)
return 0;
/* Allocate dummy message */
diff --git a/net/tipc/link.c b/net/tipc/link.c
index fb72031228c9..a2e9a64d5a0f 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -1190,6 +1190,8 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
struct sk_buff_head *inputq)
{
struct sk_buff_head *mc_inputq = l->bc_rcvlink->inputq;
+ struct name_table *nt = tipc_name_table(l->net);
+ struct sk_buff_head *defnq = &nt->defer_namedq;
struct tipc_msg *hdr = buf_msg(skb);
switch (msg_user(hdr)) {
@@ -1211,6 +1213,10 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
case NAME_DISTRIBUTOR:
l->bc_rcvlink->state = LINK_ESTABLISHED;
skb_queue_tail(l->namedq, skb);
+
+ spin_lock_bh(&defnq->lock);
+ tipc_mcast_filter_msg(l->net, defnq, l->namedq);
Should not be needed here.
You can instead do this in tipc_named_rcv(), using l->namedq as the
deferred queue and creating a temporary namedq on the stack for the
messages ready to be delivered.
We sort the messages in three steps:
1) If there are any chains of bulk messages, we sort those per source
node into the temporary namedq and deliver them first, when a chain
is complete.
2) If we find that a chain is incomplete we push it back to the head
of n->namedq and return without further action.
3) When there are no bulk messages left in n->namedq we call
tipc_mcast_filter_msgs() to sort the remaining messages into the
temporary namedq, as far as possible, and deliver those which are
ready to be delivered.
+ spin_unlock_bh(&defnq->lock);
return true;
case MSG_BUNDLER:
case TUNNEL_PROTOCOL:
diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index 66a65c2cdb23..593dcd11357f 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -615,9 +615,11 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
struct tipc_net *tn = tipc_net(net);
struct publication *p = NULL;
struct sk_buff *skb = NULL;
+ bool rcast;
spin_lock_bh(&tn->nametbl_lock);
+ rcast = nt->rcast;
if (nt->local_publ_count >= TIPC_MAX_PUBL) {
pr_warn("Bind failed, max limit %u reached\n", TIPC_MAX_PUBL);
goto exit;
@@ -632,8 +634,18 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
exit:
spin_unlock_bh(&tn->nametbl_lock);
- if (skb)
- tipc_node_broadcast(net, skb);
Here we make two calls to tipc_mcast_xmit(): one with
method->rcast/mandatory and rcast_list, if not empty, and one with
method->bcast/mandatory and bcast_list, if not empty.
Actually, the latter should not be mandatory. We can easily imagine a
situation where we start out with only legacy nodes, and then upgrade to
bcast nodes, one by one.
In the beginning we want tipc_mcast_xmit() to select rcast even for the
bcast nodes, and then, as their proportion of the cluster grows, it
should switch to bcast. This means that the method struct for bcast
must be kept between the calls, i.e., as another member of struct
name_table.
///jon
skb must of course be cloned if necessary.
+ if (skb) {
+ /* Use broadcast if all nodes support broadcast NAME_DISTR */
+ if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
+ tipc_node_broadcast_named_publish(net, skb, &rcast);
+ spin_lock_bh(&tn->nametbl_lock);
+ nt->rcast = rcast;
+ spin_unlock_bh(&tn->nametbl_lock);
+ } else {
+ /* Otherwise, be backwards compatible */
+ tipc_node_broadcast(net, skb);
+ }
Not needed according to above.
return p;
}
@@ -648,8 +660,10 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
u32 self = tipc_own_addr(net);
struct sk_buff *skb = NULL;
struct publication *p;
+ bool rcast;
spin_lock_bh(&tn->nametbl_lock);
+ rcast = nt->rcast;
p = tipc_nametbl_remove_publ(net, type, lower, upper, self, key);
if (p) {
@@ -664,7 +678,16 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
spin_unlock_bh(&tn->nametbl_lock);
if (skb) {
- tipc_node_broadcast(net, skb);
+ /* Use broadcast if all nodes support broadcast NAME_DISTR */
+ if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
+ tipc_node_broadcast_named_publish(net, skb, &rcast);
+ spin_lock_bh(&tn->nametbl_lock);
+ nt->rcast = rcast;
+ spin_unlock_bh(&tn->nametbl_lock);
+ } else {
+ /* Otherwise, be backwards compatible */
+ tipc_node_broadcast(net, skb);
+ }
One or two calls to tipc_mcast_xmit(), just as above.
return 1;
}
return 0;
@@ -746,6 +769,9 @@ int tipc_nametbl_init(struct net *net)
INIT_LIST_HEAD(&nt->cluster_scope);
rwlock_init(&nt->cluster_scope_lock);
tn->nametbl = nt;
+ /* 'bulk' updated messages via unicast */
+ nt->rcast = true;
+ skb_queue_head_init(&nt->defer_namedq);
Not needed. Now, node->namedq *is* the deferred queue.
spin_lock_init(&tn->nametbl_lock);
return 0;
}
@@ -784,6 +810,7 @@ void tipc_nametbl_stop(struct net *net)
* publications, then release the name table
*/
spin_lock_bh(&tn->nametbl_lock);
+ skb_queue_purge(&nt->defer_namedq);
for (i = 0; i < TIPC_NAMETBL_SIZE; i++) {
if (hlist_empty(&nt->services[i]))
continue;
diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
index f79066334cc8..b8cdf2a29d48 100644
--- a/net/tipc/name_table.h
+++ b/net/tipc/name_table.h
@@ -95,6 +95,8 @@ struct publication {
* - used by name_distr to send bulk updates to new nodes
* - used by name_distr during re-init of name table
* @local_publ_count: number of publications issued by this node
+ * @defer_namedq: temporary queue for 'synching' updates
+ * @rcast: previous method used to publish/withdraw a service
*/
struct name_table {
struct hlist_head services[TIPC_NAMETBL_SIZE];
@@ -102,6 +104,8 @@ struct name_table {
struct list_head cluster_scope;
rwlock_t cluster_scope_lock;
u32 local_publ_count;
+ struct sk_buff_head defer_namedq;
+ bool rcast;
};
int tipc_nl_name_table_dump(struct sk_buff *skb, struct netlink_callback *cb);
diff --git a/net/tipc/node.c b/net/tipc/node.c
index aaf595613e6e..b058647fa78b 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -2981,3 +2981,35 @@ void tipc_node_pre_cleanup_net(struct net *exit_net)
}
rcu_read_unlock();
}
+
+int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
+ bool *rcast)
+{
+ struct tipc_mc_method method = {.rcast = *rcast};
+ struct sk_buff_head xmitq;
+ struct tipc_nlist dests;
+ struct tipc_node *n;
+ u16 cong_link_cnt;
+ int rc = 0;
+
+ __skb_queue_head_init(&xmitq);
+ __skb_queue_tail(&xmitq, skb);
+
+ tipc_nlist_init(&dests, tipc_own_addr(net));
+ rcu_read_lock();
+ list_for_each_entry_rcu(n, tipc_nodes(net), list) {
+ if (in_own_node(net, n->addr))
+ continue;
+ if (!node_is_up(n))
+ continue;
+ tipc_nlist_add(&dests, n->addr);
+ }
+ rcu_read_unlock();
+
+ rc = tipc_mcast_xmit(net, &xmitq, &method, &dests, &cong_link_cnt);
+ *rcast = method.rcast;
+
+ tipc_nlist_purge(&dests);
+ __skb_queue_purge(&xmitq);
+ return rc;
+}
Not needed any more.
diff --git a/net/tipc/node.h b/net/tipc/node.h
index a6803b449a2c..d7d19f9932b1 100644
--- a/net/tipc/node.h
+++ b/net/tipc/node.h
@@ -124,4 +124,6 @@ int tipc_nl_node_set_key(struct sk_buff *skb, struct genl_info *info);
int tipc_nl_node_flush_key(struct sk_buff *skb, struct genl_info *info);
#endif
void tipc_node_pre_cleanup_net(struct net *exit_net);
+int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
+ bool *rcast);
#endif
There may of course be flaws or room for improvement here, but to me
this should solve our problem without too much new complexity and
code.
///jon
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion