Hi Tuong,

Thank you for your explanation below. It's pretty clear for me.

Also, you can add my ack-by tag in the series if you want.

Thanks,
Ying

-----Original Message-----
From: Tuong Lien Tong [mailto:tuong.t.l...@dektech.com.au] 
Sent: Monday, March 25, 2019 3:33 PM
To: Xue, Ying; tipc-discussion@lists.sourceforge.net; jon.ma...@ericsson.com; 
ma...@donjonn.com
Subject: RE: [net-next 1/3] tipc: improve TIPC throughput by Gap ACK blocks

Hi Ying!

Correct, the idea was inspired from SACK in SCTP protocol but with simplicity.
Also, I see your suggestions about the "duplicated packets"... which is 
connected to the SCTP "delayed SACK" algorithm (i.e. in the case of no payload 
message loss). In TIPC, as I understand so far, we already have such a delay on 
acknowledgements by the link "rcv_unacked" (- Jon may correct me?) (but we 
don’t implement a timer for the SACK delay timeout i.e. the 200ms in the SCTP 
RFC). However, that duplicated TSN concept is based on DATA chunks _and_ "with 
no new DATA chunk(s)" which usually happens in case of SACK loss and the sender 
has tried to retransmit the same DATA chunks when its RTO timer expired..., 
obviously an immediate SACK is needed in this situation. In TIPC, we might not 
face with this situation because we do not have a retransmission timer at 
sender side, duplicates can occur almost due to the overactive NACK sending and 
should be covered by the 2nd patch of the series.
For me, in the case of packet loss, an immediate retransmission is important, 
otherwise it can reduce the performance. However, because we never know if the 
packet is really lost or just delayed, we have to apply the "1ms restriction" 
to reduce duplicates (- as you can also see in the 2nd patch). Fast 
retransmission was also tried, Jon and I had some discussions before... but the 
fact is, in TIPC, the sender is passive (due to no retransmission timer) and we 
could be in trouble if trying to wait for the 2nd or 3rd indications... 
Instead, the NACK sending criteria has been changed by the 2nd patch to both 
reduce duplicates but try to keep the performance...
Actually, in SCTP, the situation is a bit difference as they play with "chunks" 
and "multi-streaming" than individual packets like ours at the link layer, and 
many chunks can be optionally bundled in a single packet because of the 
slow-start or Nagle's algorithm...
Anyway, if you have any ideas to improve TIPC performance more, I will try to 
see what happens.
Thanks a lot!

BR/Tuong 

-----Original Message-----
From: Ying Xue <ying....@windriver.com> 
Sent: Friday, March 22, 2019 7:52 PM
To: Tuong Lien <tuong.t.l...@dektech.com.au>; 
tipc-discussion@lists.sourceforge.net; jon.ma...@ericsson.com; ma...@donjonn.com
Subject: Re: [net-next 1/3] tipc: improve TIPC throughput by Gap ACK blocks

Hi Tuong,

Great job! It's a very nice enhancement, and we should do the
improvement early.

On 3/20/19 11:28 AM, Tuong Lien wrote:
> During unicast link transmission, it's observed very often that because
> of one or a few lost/dis-ordered packets, the sending side will fastly
> reach the send window limit and must wait for the packets to be arrived
> at the receiving side or in the worst case, a retransmission must be
> done first. The sending side cannot release a lot of subsequent packets
> in its transmq even though all of them might have already been received
> by the receiving side.
> That is, one or two packets dis-ordered/lost and dozens of packets have
> to wait, this obviously reduces the overall throughput!
> 
> This commit introduces an algorithm to overcome this by using "Gap ACK
> blocks". Basically, a Gap ACK block will consist of <ack, gap> numbers
> that describes the link deferdq where packets have been got by the
> receiving side but with gaps, for example:
> 
>       link deferdq: [1 2 3 4      10 11      13 14 15       20]
> --> Gap ACK blocks:       <4, 5>,   <11, 1>,      <15, 4>, <20, 0>

This idea is the exactly same as SACK of SCTP:
https://tools.ietf.org/html/rfc4960#section-3.3.4


        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |   Type = 3    |Chunk  Flags   |      Chunk Length             |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                      Cumulative TSN Ack                       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |          Advertised Receiver Window Credit (a_rwnd)           |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       | Number of Gap Ack Blocks = N  |  Number of Duplicate TSNs = X |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |  Gap Ack Block #1 Start       |   Gap Ack Block #1 End        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       /                                                               /
       \                              ...                              \
       /                                                               /
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |   Gap Ack Block #N Start      |  Gap Ack Block #N End         |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                       Duplicate TSN 1                         |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       /                                                               /
       \                              ...                              \
       /                                                               /
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                       Duplicate TSN X                         |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   For example, assume that the receiver has the following DATA chunks
   newly arrived at the time when it decides to send a Selective ACK,

                           ----------
                           | TSN=17 |
                           ----------
                           |        | <- still missing
                           ----------
                           | TSN=15 |
                           ----------
                           | TSN=14 |
                           ----------
                           |        | <- still missing
                           ----------
                           | TSN=12 |
                           ----------
                           | TSN=11 |
                           ----------
                           | TSN=10 |
                           ----------

   then the parameter part of the SACK MUST be constructed as follows
   (assuming the new a_rwnd is set to 4660 by the sender):

                     +--------------------------------+
                     |   Cumulative TSN Ack = 12      |
                     +--------------------------------+
                     |        a_rwnd = 4660           |
                     +----------------+---------------+
                     | num of block=2 | num of dup=0  |
                     +----------------+---------------+
                     |block #1 strt=2 |block #1 end=3 |
                     +----------------+---------------+
                     |block #2 strt=5 |block #2 end=5 |
                     +----------------+---------------+


It seems more easier to understand than our solution. Of course, I don't
intend to disagree to your proposal. In addition, we also can consider
to mark duplicated packet as SCTP is doing.

For duplicated packets, SCTP adapt the strategy:
   When a packet arrives with duplicate DATA chunk(s) and with no new
   DATA chunk(s), the endpoint MUST _immediately_ send a SACK with no
   delay.  If a packet arrives with duplicate DATA chunk(s) bundled with
   new DATA chunks, the endpoint MAY immediately send a SACK.

In the current solution, when a sender detects a gap in
tipc_link_advance_transmq(), it will _immediately_ retransmit missing
messages.

However, probably we can borrow some idea from SCTP Fast Retransmit
algorithm, which might help us further improve throughput performance:

https://tools.ietf.org/html/rfc4960#section-7.2.4

   In the absence of data loss, an endpoint performs delayed
   acknowledgement.  However, whenever an endpoint notices a hole in the
   arriving TSN sequence, it SHOULD start sending a SACK back every time
   a packet arrives carrying data until the hole is filled.

   Whenever an endpoint receives a SACK that indicates that some TSNs
   are missing, it SHOULD wait for _two further miss indications_ (via
   subsequent SACKs for a total of three missing reports) on the same
   TSNs before taking action with regard to Fast Retransmit.

In SCTP, whenever sender endpoint detects a gap, it doesn't immediately
retransmit the missing packet. Instead, it will wait for two further
miss indications.

Of course, SCTP flow control management algorithm is not the exactly
same with TIPC. Especially, each SCTP endpoint has a retransmission
timer, slow-start etc.

Maybe its algorithms are all useful for TIPC scenario, but I think we
can try to use the two ideas I mentioned above based on the current
solution to check whether they can further improve TIPC transmission
performance.

Thanks,
Ying

> 
> The Gap ACK blocks will be sent to the sending side along with the
> traditional ACK or NACK message. Immediately when receiving the message
> the sending side will now not only release from its transmq the packets
> ack-ed by the ACK but also by the Gap ACK blocks! So, more packets can
> be enqueued and transmitted.
> In addition, the sending side can now do "multi-retransmissions"
> according to the Gaps reported in the Gap ACK blocks.
> 
> The new algorithm as verified helps greatly improve the TIPC throughput
> especially under packet loss condition.
> 
> So far, a maximum of 32 blocks is quite enough without any "Too few Gap
> ACK blocks" reports with a 5.0% packet loss rate, however this number
> can be increased in the furture if needed.
> 
> Also, the patch is backward compatible.
> 
> Acked-by: Jon Maloy <jon.ma...@ericsson.com>
> Signed-off-by: Tuong Lien <tuong.t.l...@dektech.com.au>
> ---
>  net/tipc/link.c | 134 
> +++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  net/tipc/msg.h  |  30 +++++++++++++
>  net/tipc/node.h |   6 ++-
>  3 files changed, 158 insertions(+), 12 deletions(-)
> 
> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index 52d23b3ffaf5..5aee1ed23ba9 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -246,6 +246,10 @@ static int tipc_link_build_nack_msg(struct tipc_link *l,
>  static void tipc_link_build_bc_init_msg(struct tipc_link *l,
>                                       struct sk_buff_head *xmitq);
>  static bool tipc_link_release_pkts(struct tipc_link *l, u16 to);
> +static u16 tipc_build_gap_ack_blks(struct tipc_link *l, void *data);
> +static void tipc_link_advance_transmq(struct tipc_link *l, u16 acked, u16 
> gap,
> +                                   struct tipc_gap_ack_blks *ga,
> +                                   struct sk_buff_head *xmitq);
>  
>  /*
>   *  Simple non-static link routines (i.e. referenced outside this file)
> @@ -1226,6 +1230,102 @@ static bool tipc_link_release_pkts(struct tipc_link 
> *l, u16 acked)
>       return released;
>  }
>  
> +/* tipc_build_gap_ack_blks - build Gap ACK blocks
> + * @l: tipc link that data have come with gaps in sequence if any
> + * @data: data buffer to store the Gap ACK blocks after built
> + *
> + * returns the actual allocated memory size
> + */
> +static u16 tipc_build_gap_ack_blks(struct tipc_link *l, void *data)
> +{
> +     struct sk_buff *skb = skb_peek(&l->deferdq);
> +     struct tipc_gap_ack_blks *ga = data;
> +     u16 len, expect, seqno = 0;
> +     u8 n = 0;
> +
> +     if (!skb)
> +             goto exit;
> +
> +     expect = buf_seqno(skb);
> +     skb_queue_walk(&l->deferdq, skb) {
> +             seqno = buf_seqno(skb);
> +             if (unlikely(more(seqno, expect))) {
> +                     ga->gacks[n].ack = htons(expect - 1);
> +                     ga->gacks[n].gap = htons(seqno - expect);
> +                     if (++n >= MAX_GAP_ACK_BLKS) {
> +                             pr_info_ratelimited("Too few Gap ACK 
> blocks!\n");
> +                             goto exit;
> +                     }
> +             } else if (unlikely(less(seqno, expect))) {
> +                     pr_warn("Unexpected skb in deferdq!\n");
> +                     continue;
> +             }
> +             expect = seqno + 1;
> +     }
> +
> +     /* last block */
> +     ga->gacks[n].ack = htons(seqno);
> +     ga->gacks[n].gap = 0;
> +     n++;
> +
> +exit:
> +     len = tipc_gap_ack_blks_sz(n);
> +     ga->len = htons(len);
> +     ga->gack_cnt = n;
> +     return len;
> +}
> +
> +/* tipc_link_advance_transmq - advance TIPC link transmq queue by releasing
> + *                          acked packets, also doing retransmissions if
> + *                          gaps found
> + * @l: tipc link with transmq queue to be advanced
> + * @acked: seqno of last packet acked by peer without any gaps before
> + * @gap: # of gap packets
> + * @ga: buffer pointer to Gap ACK blocks from peer
> + * @xmitq: queue for accumulating the retransmitted packets if any
> + */
> +static void tipc_link_advance_transmq(struct tipc_link *l, u16 acked, u16 
> gap,
> +                                   struct tipc_gap_ack_blks *ga,
> +                                   struct sk_buff_head *xmitq)
> +{
> +     struct sk_buff *skb, *_skb, *tmp;
> +     struct tipc_msg *hdr;
> +     u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1;
> +     u16 ack = l->rcv_nxt - 1;
> +     u16 seqno;
> +     u16 n = 0;
> +
> +     skb_queue_walk_safe(&l->transmq, skb, tmp) {
> +             seqno = buf_seqno(skb);
> +
> +next_gap_ack:
> +             if (less_eq(seqno, acked)) {
> +                     /* release skb */
> +                     __skb_unlink(skb, &l->transmq);
> +                     kfree_skb(skb);
> +             } else if (less_eq(seqno, acked + gap)) {
> +                     /* retransmit skb */
> +                     _skb = __pskb_copy(skb, MIN_H_SIZE, GFP_ATOMIC);
> +                     if (!_skb)
> +                             continue;
> +                     hdr = buf_msg(_skb);
> +                     msg_set_ack(hdr, ack);
> +                     msg_set_bcast_ack(hdr, bc_ack);
> +                     _skb->priority = TC_PRIO_CONTROL;
> +                     __skb_queue_tail(xmitq, _skb);
> +                     l->stats.retransmitted++;
> +             } else {
> +                     /* retry with Gap ACK blocks if any */
> +                     if (!ga || n >= ga->gack_cnt)
> +                             break;
> +                     acked = ntohs(ga->gacks[n].ack);
> +                     gap = ntohs(ga->gacks[n].gap);
> +                     n++;
> +                     goto next_gap_ack;
> +             }
> +     }
> +}
> +
>  /* tipc_link_build_state_msg: prepare link state message for transmission
>   *
>   * Note that sending of broadcast ack is coordinated among nodes, to reduce
> @@ -1378,6 +1478,7 @@ static void tipc_link_build_proto_msg(struct tipc_link 
> *l, int mtyp, bool probe,
>       struct tipc_mon_state *mstate = &l->mon_state;
>       int dlen = 0;
>       void *data;
> +     u16 glen = 0;
>  
>       /* Don't send protocol message during reset or link failover */
>       if (tipc_link_is_blocked(l))
> @@ -1390,8 +1491,8 @@ static void tipc_link_build_proto_msg(struct tipc_link 
> *l, int mtyp, bool probe,
>               rcvgap = buf_seqno(skb_peek(dfq)) - l->rcv_nxt;
>  
>       skb = tipc_msg_create(LINK_PROTOCOL, mtyp, INT_H_SIZE,
> -                           tipc_max_domain_size, l->addr,
> -                           tipc_own_addr(l->net), 0, 0, 0);
> +                           tipc_max_domain_size + MAX_GAP_ACK_BLKS_SZ,
> +                           l->addr, tipc_own_addr(l->net), 0, 0, 0);
>       if (!skb)
>               return;
>  
> @@ -1418,9 +1519,11 @@ static void tipc_link_build_proto_msg(struct tipc_link 
> *l, int mtyp, bool probe,
>               msg_set_bc_gap(hdr, link_bc_rcv_gap(bcl));
>               msg_set_probe(hdr, probe);
>               msg_set_is_keepalive(hdr, probe || probe_reply);
> -             tipc_mon_prep(l->net, data, &dlen, mstate, l->bearer_id);
> -             msg_set_size(hdr, INT_H_SIZE + dlen);
> -             skb_trim(skb, INT_H_SIZE + dlen);
> +             if (l->peer_caps & TIPC_GAP_ACK_BLOCK)
> +                     glen = tipc_build_gap_ack_blks(l, data);
> +             tipc_mon_prep(l->net, data + glen, &dlen, mstate, l->bearer_id);
> +             msg_set_size(hdr, INT_H_SIZE + glen + dlen);
> +             skb_trim(skb, INT_H_SIZE + glen + dlen);
>               l->stats.sent_states++;
>               l->rcv_unacked = 0;
>       } else {
> @@ -1590,6 +1693,7 @@ static int tipc_link_proto_rcv(struct tipc_link *l, 
> struct sk_buff *skb,
>                              struct sk_buff_head *xmitq)
>  {
>       struct tipc_msg *hdr = buf_msg(skb);
> +     struct tipc_gap_ack_blks *ga = NULL;
>       u16 rcvgap = 0;
>       u16 ack = msg_ack(hdr);
>       u16 gap = msg_seq_gap(hdr);
> @@ -1600,6 +1704,7 @@ static int tipc_link_proto_rcv(struct tipc_link *l, 
> struct sk_buff *skb,
>       u16 dlen = msg_data_sz(hdr);
>       int mtyp = msg_type(hdr);
>       bool reply = msg_probe(hdr);
> +     u16 glen = 0;
>       void *data;
>       char *if_name;
>       int rc = 0;
> @@ -1697,7 +1802,17 @@ static int tipc_link_proto_rcv(struct tipc_link *l, 
> struct sk_buff *skb,
>                               rc = TIPC_LINK_UP_EVT;
>                       break;
>               }
> -             tipc_mon_rcv(l->net, data, dlen, l->addr,
> +
> +             /* Receive Gap ACK blocks from peer if any */
> +             if (l->peer_caps & TIPC_GAP_ACK_BLOCK) {
> +                     ga = (struct tipc_gap_ack_blks *)data;
> +                     glen = ntohs(ga->len);
> +                     /* sanity check: if failed, ignore Gap ACK blocks */
> +                     if (glen != tipc_gap_ack_blks_sz(ga->gack_cnt))
> +                             ga = NULL;
> +             }
> +
> +             tipc_mon_rcv(l->net, data + glen, dlen - glen, l->addr,
>                            &l->mon_state, l->bearer_id);
>  
>               /* Send NACK if peer has sent pkts we haven't received yet */
> @@ -1706,13 +1821,12 @@ static int tipc_link_proto_rcv(struct tipc_link *l, 
> struct sk_buff *skb,
>               if (rcvgap || reply)
>                       tipc_link_build_proto_msg(l, STATE_MSG, 0, reply,
>                                                 rcvgap, 0, 0, xmitq);
> -             tipc_link_release_pkts(l, ack);
> +
> +             tipc_link_advance_transmq(l, ack, gap, ga, xmitq);
>  
>               /* If NACK, retransmit will now start at right position */
> -             if (gap) {
> -                     rc = tipc_link_retrans(l, l, ack + 1, ack + gap, xmitq);
> +             if (gap)
>                       l->stats.recv_nacks++;
> -             }
>  
>               tipc_link_advance_backlog(l, xmitq);
>               if (unlikely(!skb_queue_empty(&l->wakeupq)))
> diff --git a/net/tipc/msg.h b/net/tipc/msg.h
> index 528ba9241acc..dcb73421c19a 100644
> --- a/net/tipc/msg.h
> +++ b/net/tipc/msg.h
> @@ -117,6 +117,36 @@ struct tipc_msg {
>       __be32 hdr[15];
>  };
>  
> +/* struct tipc_gap_ack - TIPC Gap ACK block
> + * @ack: seqno of the last consecutive packet in link deferdq
> + * @gap: number of gap packets since the last ack
> + *
> + * E.g:
> + *       link deferdq: 1 2 3 4      10 11      13 14 15       20
> + * --> Gap ACK blocks:      <4, 5>,   <11, 1>,      <15, 4>, <20, 0>
> + */
> +struct tipc_gap_ack {
> +     __be16 ack;
> +     __be16 gap;
> +};
> +
> +/* struct tipc_gap_ack_blks
> + * @len: actual length of the record
> + * @gack_cnt: number of Gap ACK blocks in the record
> + * @gacks: array of Gap ACK blocks
> + */
> +struct tipc_gap_ack_blks {
> +     __be16 len;
> +     u8 gack_cnt;
> +     struct tipc_gap_ack gacks[];
> +};
> +
> +#define tipc_gap_ack_blks_sz(n) (sizeof(struct tipc_gap_ack_blks) + \
> +                              sizeof(struct tipc_gap_ack) * (n))
> +
> +#define MAX_GAP_ACK_BLKS     32
> +#define MAX_GAP_ACK_BLKS_SZ  tipc_gap_ack_blks_sz(MAX_GAP_ACK_BLKS)
> +
>  static inline struct tipc_msg *buf_msg(struct sk_buff *skb)
>  {
>       return (struct tipc_msg *)skb->data;
> diff --git a/net/tipc/node.h b/net/tipc/node.h
> index 2404225c5d58..c0bf49ea3de4 100644
> --- a/net/tipc/node.h
> +++ b/net/tipc/node.h
> @@ -52,7 +52,8 @@ enum {
>       TIPC_BCAST_RCAST      = (1 << 4),
>       TIPC_NODE_ID128       = (1 << 5),
>       TIPC_LINK_PROTO_SEQNO = (1 << 6),
> -     TIPC_MCAST_RBCTL      = (1 << 7)
> +     TIPC_MCAST_RBCTL      = (1 << 7),
> +     TIPC_GAP_ACK_BLOCK    = (1 << 8)
>  };
>  
>  #define TIPC_NODE_CAPABILITIES (TIPC_SYN_BIT           |  \
> @@ -62,7 +63,8 @@ enum {
>                               TIPC_BLOCK_FLOWCTL     |   \
>                               TIPC_NODE_ID128        |   \
>                               TIPC_LINK_PROTO_SEQNO  |   \
> -                             TIPC_MCAST_RBCTL)
> +                             TIPC_MCAST_RBCTL       |   \
> +                             TIPC_GAP_ACK_BLOCK)
>  #define INVALID_BEARER_ID -1
>  
>  void tipc_node_stop(struct net *net);
> 


_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion

Reply via email to