On 6/4/20 5:14 AM, Hoang Huu Le wrote:

Hi Jon,

Please see my inline comment

Regards,

Hoang

*From:* Jon Maloy <jma...@redhat.com>
*Sent:* Friday, May 29, 2020 11:11 PM
*To:* tipc-discussion@lists.sourceforge.net; tipc-dek <tipc-...@dektech.com.au>; Tuong Tong Lien <tuong.t.l...@dektech.com.au>; Xin Long <lucien....@gmail.com>; Tung Quang Nguyen <tung.q.ngu...@dektech.com.au>; Ying Xue <ying....@windriver.com>
*Subject:* Fwd: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast

Added more recipients.



-------- Forwarded Message --------

*Subject:* Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
*Date:* Fri, 29 May 2020 12:08:02 -0400
*From:* Jon Maloy <jma...@redhat.com>
*To:* Hoang Huu Le <hoang.h...@dektech.com.au>, ma...@donjonn.com



Hi Hoang,
See below.

On 5/27/20 6:49 AM, Hoang Huu Le wrote:

    Hi Jon,

    I have a DRAFT version based on your idea (attached file).
    But from my point of view, this version introduces too much code on
    the sending side.
    I don't think it is a good idea to keep rcast_list and bcast_list
    in the name table.
    I think we should find a new approach or just drop the feature.

Yes, you are right.
I came up with a new idea: just add a sequence number to the broadcast/replicast messages and re-order them at reception. This even handles the case where the first broadcast message arrives before the first bulk message, something we had not anticipated before. I couldn't resist the temptation to code it, as you can see in the patch I just sent out. It is totally untested; I just added the code as I thought it should be and made sure it compiled. There is still a little too much new code for my taste, but this might be a way forward.
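
To make the reordering idea concrete, here is a minimal sketch, assuming a helper like msg_named_seqno() for reading the new sequence number field (the names are illustrative, not necessarily what the patch actually uses):

/* Deliver deferred NAME_DISTR messages strictly in sequence order.
 * Out-of-order arrivals stay on the deferred queue until the gap is
 * filled. Queue locking is assumed to be handled by the caller.
 */
static void named_deliver_in_order(struct sk_buff_head *deferq,
				   struct sk_buff_head *deliverq,
				   u16 *expect)
{
	struct sk_buff *skb, *tmp;
	bool progress = true;

	while (progress) {
		progress = false;
		skb_queue_walk_safe(deferq, skb, tmp) {
			if (msg_named_seqno(buf_msg(skb)) != *expect)
				continue;
			__skb_unlink(skb, deferq);
			__skb_queue_tail(deliverq, skb);
			(*expect)++;
			progress = true;
		}
	}
}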
Please give your feedback on this.


I also noticed a few things while working with this:

1) There is still an 'expires' field in struct tipc_mc_method, and it even seems to be considered when switching between bcast/rcast. The whole point of adding the mcast synchronization mechanism was to get rid of this delay. Have you tested that synchronization really works without the 'expires'?
[Hoang] Yes, I did.

I issued the commands below to force rcast/bcast regardless of 'expires':

$ tipc link set bro REPLICAST

or

$ tipc link set bro BROADCAST

Yes, but what happens when the protocol selects the method by itself, based on the number of destinations, and this number has just passed a threshold?
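
For the auto-selection case it might be easier to tune the threshold and let the protocol switch by itself. If I remember the tool syntax correctly (please double-check against your tipc(8) version), something like:

$ tipc link set bro AUTOSELECT ratio 25

and then grow or shrink the set of destination nodes across the threshold while bindings are being published/withdrawn.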


2) There are some remnants of old code for the name table dist_queue. I made this functionality redundant at least two years ago, so it should be cleaned up.
[Hoang] I will check and clean up that stuff in a separate commit.


3) We might have a potential race condition when new nodes come up, so that publications are distributed twice:
    a) A publication is added to the name table, and the name table lock is released.
    b) A new node comes up, and the new publication is delivered in the bulk message.
    c) The broadcast of the publication goes ahead and sends it out to all nodes, even the one that just came up.
    d) We end up with a double publication on one of the nodes.
    e) One of those will linger in the name table after the publication is withdrawn.
I have never seen this happen, and my analysis might be wrong, but to me this looks like a possible scenario. Note that my patch doesn't fix this, but we could possibly arrange it by adding a 'distributed' flag in the publication item on the sending side, so that the bulk distribution will ignore it; see the sketch below.
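
A rough sketch of that flag, with illustrative names (none of this is existing TIPC code):

/* Proposed new field in struct publication, set under tn->nametbl_lock
 * once the broadcast of the publication has been queued:
 *
 *	bool distributed;
 */

/* Bulk distribution to a new node, called under tn->nametbl_lock */
static void named_bulk_filter(struct list_head *publ_list,
			      struct sk_buff_head *bulkq)
{
	struct publication *p;

	list_for_each_entry(p, publ_list, binding_node) {
		/* A publication whose broadcast is still pending will
		 * reach the new node via that broadcast anyway, so
		 * including it in the bulk would duplicate it.
		 */
		if (!p->distributed)
			continue;
		/* ... append p to the bulk message(s) in bulkq ... */
	}
}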

[Hoang] I tried to simulate your scenario on 8 nodes, publishing >100 services at the same time while bringing an interface down/up. Sometimes I got error logs like the ones below:

[  537.322414] tipc: Failed to remove binding 1000,2 from 1001001

[…]

[  537.358957] tipc: Failed to remove binding 1000,11 from 1001001

I'm not sure whether the above counts as a bug or not. If yes, we can fix this in another commit too.

This is not what I expected, but it might be another manifestation of the same problem. We are probably observing a replicast withdraw arriving before the corresponding bulk publication. If you see a binding <1000,2> in the table after this printout, that would be a confirmation.

For my scenario:
Do you see duplicate publication instances before you do the withdraw?
Do you see a lingering publication after a withdraw?
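
A quick way to check for the lingering case is to dump the name table on the affected node after all withdraws have completed, e.g.:

$ tipc nametable show | grep -w 1000

Any remaining <1000,...> entry at that point would indicate a lingering publication.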

Luckily, all of this will be fixed with the new broadcast distribution protocol.

Regards
///jon






    Regards,
    Hoang
    -----Original Message-----
    From: Jon Maloy <jma...@redhat.com>
    Sent: Thursday, May 21, 2020 9:44 PM
    To: Hoang Huu Le <hoang.h...@dektech.com.au>; ma...@donjonn.com
    Subject: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast

    Hi,
    I have one more comment below. Looking forward to your feedback.
    ///jon


    On 5/20/20 9:03 PM, Hoang Huu Le wrote:

        Yeah, thanks Jon.
        I will investigate your idea further once I have finished the
        issue with the lab infrastructure.

        Hoang
        -----Original Message-----
        From: Jon Maloy <jma...@redhat.com>
        Sent: Thursday, May 21, 2020 7:13 AM
        To: Hoang Huu Le <hoang.h...@dektech.com.au>; ma...@donjonn.com
        Subject: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast

        Hi Hoang,
        Below I try to summarize my newest proposal in relation to v2 of your patch.

        1) The bulk can be sent just as is done now, with the addition that we add the NOT_LAST bit to the header.
        2) We need a new capability bit identifying nodes which support broadcast NAME_DISTR. I don't see that we can just reuse TIPC_MCAST_RBCTL, because the recipients need to have code for handling concurrent bulk/broadcast receptions. This bit is not added to the cluster capability word.
        3) We need to keep two structs of type tipc_nlist in the name table. One contains tipc_dest structures for all nodes NOT supporting TIPC_MCAST_NAMEDISTR (rcast_list), and the other those which do (bcast_list); see the sketch after this list.
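
        A minimal sketch of point 3, as proposed additions to struct
        name_table (field names illustrative; the lists would be
        maintained from node up/down events, under the name table lock):

        	struct tipc_nlist rcast_list;	/* nodes NOT supporting bcast NAME_DISTR */
        	struct tipc_nlist bcast_list;	/* nodes supporting bcast NAME_DISTR */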


        For more comments see below.



        On 5/12/20 6:22 AM, Hoang Huu Le wrote:

            Just forward the patch I mentioned.

            -----Original Message-----
            From: Hoang Le <hoang.h...@dektech.com.au>
            Sent: Tuesday, November 19, 2019 5:01 PM
            To: jon.ma...@ericsson.com; ma...@donjonn.com; tipc-...@dektech.com.au
            Subject: [PATCH 2/2] tipc: update a binding service via broadcast

            Currently, updating the binding table (adding a service
            binding to the name table / withdrawing a service binding)
            is done via replicast. However, if we scale clusters up to
            > 100 nodes/containers, this method is less efficient,
            because it loops through the nodes in the cluster one by
            one.

            It is worthwhile to use broadcast to update a binding
            service. The binding table is then updated on all nodes in
            one shot.

            The mechanism is backward compatible, as the sync message
            is silently dropped.

            v2: resolve synchronization problem when switching from
            unicast to broadcast

            Signed-off-by: Hoang Le <hoang.h...@dektech.com.au>
            ---
             net/tipc/bcast.c      |  3 ++-
             net/tipc/link.c       |  6 ++++++
             net/tipc/name_table.c | 33 ++++++++++++++++++++++++++++++---
             net/tipc/name_table.h |  4 ++++
             net/tipc/node.c       | 32 ++++++++++++++++++++++++++++++++
             net/tipc/node.h       |  2 ++
             6 files changed, 76 insertions(+), 4 deletions(-)

            diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
            index e06f05d55534..44ed481fec47 100644
            --- a/net/tipc/bcast.c
            +++ b/net/tipc/bcast.c
            @@ -324,7 +324,8 @@ static int tipc_mcast_send_sync(struct net *net, struct sk_buff *skb,
             	hdr = buf_msg(skb);
             	if (msg_user(hdr) == MSG_FRAGMENTER)
             		hdr = msg_inner_hdr(hdr);
            -	if (msg_type(hdr) != TIPC_MCAST_MSG)
            +	if (msg_user(hdr) != NAME_DISTRIBUTOR &&
            +	    msg_type(hdr) != TIPC_MCAST_MSG)
             		return 0;
             
             	/* Allocate dummy message */
            diff --git a/net/tipc/link.c b/net/tipc/link.c
            index fb72031228c9..a2e9a64d5a0f 100644
            --- a/net/tipc/link.c
            +++ b/net/tipc/link.c
            @@ -1190,6 +1190,8 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
             			    struct sk_buff_head *inputq)
             {
             	struct sk_buff_head *mc_inputq = l->bc_rcvlink->inputq;
            +	struct name_table *nt = tipc_name_table(l->net);
            +	struct sk_buff_head *defnq = &nt->defer_namedq;
             	struct tipc_msg *hdr = buf_msg(skb);
             
             	switch (msg_user(hdr)) {
            @@ -1211,6 +1213,10 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
             	case NAME_DISTRIBUTOR:
             		l->bc_rcvlink->state = LINK_ESTABLISHED;
             		skb_queue_tail(l->namedq, skb);
            +
            +		spin_lock_bh(&defnq->lock);
            +		tipc_mcast_filter_msg(l->net, defnq, l->namedq);

        Should not be needed here.
        You can instead do this in tipc_named_rcv(), using l->namedq as the deferred queue and creating a temporary namedq queue on the stack for the messages ready to be delivered.
        We sort the messages in three steps:
        1) If there are any chains of bulk messages, we sort those per source node into the temporary namedq and deliver them first, when a chain is complete.
        2) If we find that a chain is incomplete, we push it back to the head of n->namedq and return without further action.
        3) When there are no bulk messages left in n->namedq, we call tipc_mcast_filter_msg() to sort the remaining messages into the temporary namedq, as far as possible, and deliver those which are ready to be delivered.
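
        Roughly like the sketch below; msg_is_bulk()/msg_is_last_bulk()
        and the function itself are illustrative only, and queue locking
        is left to the caller:

        static void tipc_named_sort_rcv(struct tipc_node *n,
        				struct sk_buff_head *tmpq)
        {
        	struct sk_buff *skb;

        	/* Steps 1) and 2): move bulk messages to the temporary
        	 * queue; if the chain turns out to be incomplete, push
        	 * everything back to the head of n->namedq and wait.
        	 */
        	while ((skb = skb_peek(&n->namedq)) &&
        	       msg_is_bulk(buf_msg(skb))) {
        		__skb_unlink(skb, &n->namedq);
        		__skb_queue_tail(tmpq, skb);
        		if (msg_is_last_bulk(buf_msg(skb)))
        			break;
        		if (skb_queue_empty(&n->namedq)) {
        			skb_queue_splice_init(tmpq, &n->namedq);
        			return;	/* incomplete chain, try again later */
        		}
        	}

        	/* Step 3): hand the remaining bcast/rcast messages to the
        	 * mcast filter, with n->namedq as the deferred queue;
        	 * in-order messages end up in tmpq, ready for delivery.
        	 */
        	skb_queue_splice_tail_init(&n->namedq, tmpq);
        	tipc_mcast_filter_msg(n->net, &n->namedq, tmpq);
        }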



            +		spin_unlock_bh(&defnq->lock);
             		return true;
             	case MSG_BUNDLER:
             	case TUNNEL_PROTOCOL:
            diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
            index 66a65c2cdb23..593dcd11357f 100644
            --- a/net/tipc/name_table.c
            +++ b/net/tipc/name_table.c
            @@ -615,9 +615,11 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
             	struct tipc_net *tn = tipc_net(net);
             	struct publication *p = NULL;
             	struct sk_buff *skb = NULL;
            +	bool rcast;
             
             	spin_lock_bh(&tn->nametbl_lock);
            +	rcast = nt->rcast;
             	if (nt->local_publ_count >= TIPC_MAX_PUBL) {
             		pr_warn("Bind failed, max limit %u reached\n", TIPC_MAX_PUBL);
             		goto exit;
            @@ -632,8 +634,18 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
             exit:
             	spin_unlock_bh(&tn->nametbl_lock);
             
            -	if (skb)
            -		tipc_node_broadcast(net, skb);

        Here we make two calls to tipc_mcast_xmit(): one with method->rcast/mandatory and rcast_list, if not empty, and one with method->bcast/mandatory and bcast_list, if not empty.
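
        A loose sketch of this, assuming rcast_list/bcast_list and a
        persistent nt->bcast_method are the new name table members
        proposed above (and, per my follow-up below, without 'mandatory'
        on the bcast call):

        	struct tipc_mc_method rmethod = {.rcast = true, .mandatory = true};
        	struct sk_buff_head pkts;
        	u16 cong_link_cnt = 0;

        	__skb_queue_head_init(&pkts);
        	if (nt->rcast_list.remote) {
        		__skb_queue_tail(&pkts, skb_clone(skb, GFP_ATOMIC));
        		tipc_mcast_xmit(net, &pkts, &rmethod,
        				&nt->rcast_list, &cong_link_cnt);
        	}
        	if (nt->bcast_list.remote) {
        		__skb_queue_tail(&pkts, skb);
        		/* nt->bcast_method persists between calls, so the
        		 * bcast/rcast auto-selection state is kept
        		 */
        		tipc_mcast_xmit(net, &pkts, &nt->bcast_method,
        				&nt->bcast_list, &cong_link_cnt);
        	} else {
        		kfree_skb(skb);
        	}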

    Actually, the latter should not be mandatory. We can easily imagine a situation where we start out with only legacy nodes, and then upgrade to bcast nodes, one by one.

    In the beginning we want tipc_mcast_xmit() to select rcast even for the bcast nodes, and then, as their proportion of the cluster grows, it should switch to bcast. This does mean that the method struct for bcast must be kept between the calls, i.e., another member of struct name_table.

    ///jon



        skb must of course be cloned if necessary.



            +	if (skb) {
            +		/* Use broadcast if all nodes support broadcast NAME_DISTR */
            +		if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
            +			tipc_node_broadcast_named_publish(net, skb, &rcast);
            +			spin_lock_bh(&tn->nametbl_lock);
            +			nt->rcast = rcast;
            +			spin_unlock_bh(&tn->nametbl_lock);
            +		} else {
            +			/* Otherwise, be backwards compatible */
            +			tipc_node_broadcast(net, skb);
            +		}
            +	}

        Not needed according to above.

             
             	return p;
             }
            @@ -648,8 +660,10 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
             	u32 self = tipc_own_addr(net);
             	struct sk_buff *skb = NULL;
             	struct publication *p;
            +	bool rcast;
             
             	spin_lock_bh(&tn->nametbl_lock);
            +	rcast = nt->rcast;
             	p = tipc_nametbl_remove_publ(net, type, lower, upper, self, key);
             	if (p) {
            @@ -664,7 +678,16 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
             	spin_unlock_bh(&tn->nametbl_lock);
             
             	if (skb) {
            -		tipc_node_broadcast(net, skb);
            +		/* Use broadcast if all nodes support broadcast NAME_DISTR */
            +		if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
            +			tipc_node_broadcast_named_publish(net, skb, &rcast);
            +			spin_lock_bh(&tn->nametbl_lock);
            +			nt->rcast = rcast;
            +			spin_unlock_bh(&tn->nametbl_lock);
            +		} else {
            +			/* Otherwise, be backwards compatible */
            +			tipc_node_broadcast(net, skb);
            +		}

        One or two calls to tipc_mcast_xmit(), just as above.

             		return 1;
             	}
             	return 0;
            @@ -746,6 +769,9 @@ int tipc_nametbl_init(struct net *net)
             	INIT_LIST_HEAD(&nt->cluster_scope);
             	rwlock_init(&nt->cluster_scope_lock);
             	tn->nametbl = nt;
            +	/* 'bulk' update messages via unicast */
            +	nt->rcast = true;
            +	skb_queue_head_init(&nt->defer_namedq);

        Not needed. Now, node->namedq *is* the deferred queue.

             	spin_lock_init(&tn->nametbl_lock);
             	return 0;
             }
            @@ -784,6 +810,7 @@ void tipc_nametbl_stop(struct net *net)
             	 * publications, then release the name table
             	 */
             	spin_lock_bh(&tn->nametbl_lock);
            +	skb_queue_purge(&nt->defer_namedq);
             	for (i = 0; i < TIPC_NAMETBL_SIZE; i++) {
             		if (hlist_empty(&nt->services[i]))
             			continue;
            diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
            index f79066334cc8..b8cdf2a29d48 100644
            --- a/net/tipc/name_table.h
            +++ b/net/tipc/name_table.h
            @@ -95,6 +95,8 @@ struct publication {
              * - used by name_distr to send bulk updates to new nodes
              * - used by name_distr during re-init of name table
              * @local_publ_count: number of publications issued by this node
            + * @defer_namedq: temporary queue for 'synching' updates
            + * @rcast: previous method used to publish/withdraw a service
              */
             struct name_table {
             	struct hlist_head services[TIPC_NAMETBL_SIZE];
            @@ -102,6 +104,8 @@ struct name_table {
             	struct list_head cluster_scope;
             	rwlock_t cluster_scope_lock;
             	u32 local_publ_count;
            +	struct sk_buff_head defer_namedq;
            +	bool rcast;
             };
             
             int tipc_nl_name_table_dump(struct sk_buff *skb, struct netlink_callback *cb);
            diff --git a/net/tipc/node.c b/net/tipc/node.c
            index aaf595613e6e..b058647fa78b 100644
            --- a/net/tipc/node.c
            +++ b/net/tipc/node.c
            @@ -2981,3 +2981,35 @@ void tipc_node_pre_cleanup_net(struct net *exit_net)
             	}
             	rcu_read_unlock();
             }
            +
            +int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
            +				      bool *rcast)
            +{
            +	struct tipc_mc_method method = {.rcast = *rcast};
            +	struct sk_buff_head xmitq;
            +	struct tipc_nlist dests;
            +	struct tipc_node *n;
            +	u16 cong_link_cnt;
            +	int rc = 0;
            +
            +	__skb_queue_head_init(&xmitq);
            +	__skb_queue_tail(&xmitq, skb);
            +
            +	tipc_nlist_init(&dests, tipc_own_addr(net));
            +	rcu_read_lock();
            +	list_for_each_entry_rcu(n, tipc_nodes(net), list) {
            +		if (in_own_node(net, n->addr))
            +			continue;
            +		if (!node_is_up(n))
            +			continue;
            +		tipc_nlist_add(&dests, n->addr);
            +	}
            +	rcu_read_unlock();
            +
            +	rc = tipc_mcast_xmit(net, &xmitq, &method, &dests, &cong_link_cnt);
            +	*rcast = method.rcast;
            +
            +	tipc_nlist_purge(&dests);
            +	__skb_queue_purge(&xmitq);
            +	return rc;
            +}

        Not needed any more.

            diff --git a/net/tipc/node.h b/net/tipc/node.h
            index a6803b449a2c..d7d19f9932b1 100644
            --- a/net/tipc/node.h
            +++ b/net/tipc/node.h
            @@ -124,4 +124,6 @@ int tipc_nl_node_set_key(struct sk_buff *skb, struct genl_info *info);
             int tipc_nl_node_flush_key(struct sk_buff *skb, struct genl_info *info);
             #endif
             void tipc_node_pre_cleanup_net(struct net *exit_net);
            +int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
            +				      bool *rcast);
             #endif

        There may of course be flaws or potential for improvements to this, but to me this should solve our problem without too much new complexity and code.

        ///jon



_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
