Hi Jon,

Agree, the patch doesn't show an improvement in throughput. However, I believe this is because our test tool, i.e. the benchmark, just runs tests with the same message size in each case. In practice, when applications send messages of different sizes at the same time, the new bundling strategy will take effect.
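For illustration, something along these lines would exercise that case (just a rough sketch, not our real test tool; the service type 4711 and the loop count are arbitrary). Two SOCK_RDM sockets interleave 64-byte CRITICAL messages with 4096-byte LOW messages towards the same destination, which is exactly where per-level bundling should help:

/* Rough sketch only -- not our real benchmark. Service type 4711 and the
 * message counts are made up; the point is just to interleave small
 * CRITICAL messages with large LOW messages on the same link.
 */
#include <linux/tipc.h>
#include <sys/socket.h>

static int make_sock(int importance)
{
	int fd = socket(AF_TIPC, SOCK_RDM, 0);

	if (fd < 0 || setsockopt(fd, SOL_TIPC, TIPC_IMPORTANCE,
				 &importance, sizeof(importance)) < 0)
		return -1;
	return fd;
}

int main(void)
{
	struct sockaddr_tipc dst = {
		.family = AF_TIPC,
		.addrtype = TIPC_ADDR_NAME,
		.addr.name.name = { .type = 4711, .instance = 0 },
	};
	char small[64] = {0}, large[4096] = {0};
	int crit = make_sock(TIPC_CRITICAL_IMPORTANCE);
	int low = make_sock(TIPC_LOW_IMPORTANCE);
	int i;

	if (crit < 0 || low < 0)
		return 1;
	for (i = 0; i < 100000; i++) {
		/* different sizes and levels hit the link at the same time */
		sendto(crit, small, sizeof(small), 0,
		       (struct sockaddr *)&dst, sizeof(dst));
		sendto(low, large, sizeof(large), 0,
		       (struct sockaddr *)&dst, sizeof(dst));
	}
	return 0;
}

With the old code, every large LOW message broke the open bundle, so the next small CRITICAL message had to start a new, nearly empty one; with the patch the small messages keep filling the same CRITICAL-level bundle.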
By the way, now that we limit the exact number of packets in the backlog at each level, this obviously reduces the throughput of small messages (as bundles), but I wonder why we need to set the limits so small? When I tried a larger value, I observed a significant improvement in throughput, for large messages as well. Our current approach at the link layer doesn't seem very good, since the limit is fixed without considering the actual number of users or connections... (A small toy model of the new per-level bookkeeping is sketched below the quoted patch, for reference.)

BR/Tuong

-----Original Message-----
From: Jon Maloy <jon.ma...@ericsson.com>
Sent: Saturday, August 31, 2019 3:57 AM
To: Tuong Tong Lien <tuong.t.l...@dektech.com.au>; tipc-discussion@lists.sourceforge.net; ma...@donjonn.com; ying....@windriver.com
Subject: RE: [PATCH RFC 1/2] tipc: fix unlimited bundling of small messages

Hi Tuong,

Both patches are good with me. Unfortunately I could not register any
measurable performance improvement, but I still think this is the right
thing to do.

Acked-by: Jon

> -----Original Message-----
> From: Tuong Lien <tuong.t.l...@dektech.com.au>
> Sent: 29-Aug-19 07:36
> To: tipc-discussion@lists.sourceforge.net; Jon Maloy <jon.ma...@ericsson.com>; ma...@donjonn.com; ying....@windriver.com
> Subject: [PATCH RFC 1/2] tipc: fix unlimited bundling of small messages
>
> We have identified a problem with the "oversubscription" policy in the
> link transmission code.
>
> When small messages are transmitted, and the sending link has reached
> the transmit window limit, those messages are bundled and put into the
> link backlog queue. However, bundles of data messages are counted at
> the 'CRITICAL' level, so that the counter for that level, instead of
> the counter for the real, bundled message's level, is the one being
> increased. Subsequent, to-be-bundled data messages at non-CRITICAL
> levels continue to be tested against the unchanged counter for their
> own level, while contributing to an unrestrained increase at the
> CRITICAL backlog level.
>
> This leaves a gap in the congestion control algorithm for small
> messages, which can result in starvation of other users or of a "real"
> CRITICAL user. Eventually it can even lead to buffer exhaustion and
> link reset.
>
> We fix this by keeping a 'target_bskb' buffer pointer at each level;
> when bundling, we bundle messages only with others of the same
> importance level. This way, we know exactly how many slots a certain
> level has occupied in the queue, so we can manage level congestion
> accurately.
>
> Bundling messages at the same level brings further benefits. Consider
> this:
> - One socket sends 64-byte messages at the 'CRITICAL' level;
> - Another sends 4096-byte messages at the 'LOW' level;
>
> When a 64-byte message comes and is bundled for the first time, we pay
> the bundling overhead (an extra 40-byte header, data copy, etc.) for
> later use, but the next message may be a 4096-byte one that cannot be
> bundled with the previous one. This means the last bundle carries only
> one payload message, which is totally inefficient, for the receiver as
> well! Later on, another 64-byte message comes, we make a new bundle and
> the same story repeats...
>
> With the new bundling algorithm this will not happen; the 64-byte
> messages will be bundled together even when 4096-byte message(s) come
> in between. However, if the 4096-byte messages are sent at the same
> level, i.e. 'CRITICAL', the bundling algorithm will again cause the
> same overhead.
>
> Also, the same happens even with only one socket sending small
> messages at a rate close to the link transmit rate, so that, as soon
> as one message is bundled, it is transmitted shortly afterwards. Then
> another message comes, a new bundle is created, and so on...
>
> We will solve this issue radically with the 2nd patch of this series.
>
> Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
> Reported-by: Hoang Le <hoang.h...@dektech.com.au>
> Acked-by: Jon Maloy <jon.ma...@ericsson.com>
> Signed-off-by: Tuong Lien <tuong.t.l...@dektech.com.au>
> ---
>  net/tipc/link.c | 29 ++++++++++++++++++-----------
>  net/tipc/msg.c  |  5 +----
>  2 files changed, 19 insertions(+), 15 deletions(-)
>
> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index 6cc75ffd9e2c..999eab592de8 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -160,6 +160,7 @@ struct tipc_link {
>  	struct {
>  		u16 len;
>  		u16 limit;
> +		struct sk_buff *target_bskb;
>  	} backlog[5];
>  	u16 snd_nxt;
>  	u16 window;
> @@ -880,6 +881,7 @@ static void link_prepare_wakeup(struct tipc_link *l)
>  void tipc_link_reset(struct tipc_link *l)
>  {
>  	struct sk_buff_head list;
> +	u32 imp;
>
>  	__skb_queue_head_init(&list);
>
> @@ -901,11 +903,10 @@ void tipc_link_reset(struct tipc_link *l)
>  	__skb_queue_purge(&l->deferdq);
>  	__skb_queue_purge(&l->backlogq);
>  	__skb_queue_purge(&l->failover_deferdq);
> -	l->backlog[TIPC_LOW_IMPORTANCE].len = 0;
> -	l->backlog[TIPC_MEDIUM_IMPORTANCE].len = 0;
> -	l->backlog[TIPC_HIGH_IMPORTANCE].len = 0;
> -	l->backlog[TIPC_CRITICAL_IMPORTANCE].len = 0;
> -	l->backlog[TIPC_SYSTEM_IMPORTANCE].len = 0;
> +	for (imp = 0; imp <= TIPC_SYSTEM_IMPORTANCE; imp++) {
> +		l->backlog[imp].len = 0;
> +		l->backlog[imp].target_bskb = NULL;
> +	}
>  	kfree_skb(l->reasm_buf);
>  	kfree_skb(l->reasm_tnlmsg);
>  	kfree_skb(l->failover_reasm_skb);
> @@ -947,7 +948,7 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list,
>  	u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1;
>  	struct sk_buff_head *transmq = &l->transmq;
>  	struct sk_buff_head *backlogq = &l->backlogq;
> -	struct sk_buff *skb, *_skb, *bskb;
> +	struct sk_buff *skb, *_skb, **tskb;
>  	int pkt_cnt = skb_queue_len(list);
>  	int rc = 0;
>
> @@ -999,19 +1000,21 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list,
>  			seqno++;
>  			continue;
>  		}
> -		if (tipc_msg_bundle(skb_peek_tail(backlogq), hdr, mtu)) {
> +		tskb = &l->backlog[imp].target_bskb;
> +		if (tipc_msg_bundle(*tskb, hdr, mtu)) {
>  			kfree_skb(__skb_dequeue(list));
>  			l->stats.sent_bundled++;
>  			continue;
>  		}
> -		if (tipc_msg_make_bundle(&bskb, hdr, mtu, l->addr)) {
> +		if (tipc_msg_make_bundle(tskb, hdr, mtu, l->addr)) {
>  			kfree_skb(__skb_dequeue(list));
> -			__skb_queue_tail(backlogq, bskb);
> -			l->backlog[msg_importance(buf_msg(bskb))].len++;
> +			__skb_queue_tail(backlogq, *tskb);
> +			l->backlog[imp].len++;
>  			l->stats.sent_bundled++;
>  			l->stats.sent_bundles++;
>  			continue;
>  		}
> +		l->backlog[imp].target_bskb = NULL;
>  		l->backlog[imp].len += skb_queue_len(list);
>  		skb_queue_splice_tail_init(list, backlogq);
>  	}
> @@ -1027,6 +1030,7 @@ static void tipc_link_advance_backlog(struct tipc_link *l,
>  	u16 seqno = l->snd_nxt;
>  	u16 ack = l->rcv_nxt - 1;
>  	u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1;
> +	u32 imp;
>
>  	while (skb_queue_len(&l->transmq) < l->window) {
>  		skb = skb_peek(&l->backlogq);
> @@ -1037,7 +1041,10 @@ static void tipc_link_advance_backlog(struct tipc_link *l,
>  			break;
>  		__skb_dequeue(&l->backlogq);
>  		hdr = buf_msg(skb);
> -		l->backlog[msg_importance(hdr)].len--;
> +		imp = msg_importance(hdr);
> +		l->backlog[imp].len--;
> +		if (unlikely(skb == l->backlog[imp].target_bskb))
> +			l->backlog[imp].target_bskb = NULL;
>  		__skb_queue_tail(&l->transmq, skb);
>  		/* next retransmit attempt */
>  		if (link_is_bc_sndlink(l))
> diff --git a/net/tipc/msg.c b/net/tipc/msg.c
> index e6d49cdc61b4..922d262e153f 100644
> --- a/net/tipc/msg.c
> +++ b/net/tipc/msg.c
> @@ -543,10 +543,7 @@ bool tipc_msg_make_bundle(struct sk_buff **skb, struct tipc_msg *msg,
>  	bmsg = buf_msg(_skb);
>  	tipc_msg_init(msg_prevnode(msg), bmsg, MSG_BUNDLER, 0,
>  		      INT_H_SIZE, dnode);
> -	if (msg_isdata(msg))
> -		msg_set_importance(bmsg, TIPC_CRITICAL_IMPORTANCE);
> -	else
> -		msg_set_importance(bmsg, TIPC_SYSTEM_IMPORTANCE);
> +	msg_set_importance(bmsg, msg_importance(msg));
>  	msg_set_seqno(bmsg, msg_seqno(msg));
>  	msg_set_ack(bmsg, msg_ack(msg));
>  	msg_set_bcast_ack(bmsg, msg_bcast_ack(msg));
> --
> 2.13.7
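
To make the per-level accounting above more concrete, here is a stripped-down userspace model of the bookkeeping the patch introduces. Everything in it (the names, the 1400-byte bundle budget, the limits) is made up for illustration; only the control flow mirrors tipc_link_xmit() and tipc_link_advance_backlog():

/* Toy userspace model of the per-level backlog bookkeeping from the
 * patch above. All names, sizes and limits are made up; only the
 * control flow mirrors the kernel code.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define LEVELS     5        /* TIPC_LOW .. TIPC_SYSTEM importance     */
#define MAX_BUNDLE 1400     /* arbitrary MTU-ish bundle budget        */

struct bundle { int level; int used; int room; };

static struct {
	int len;                     /* packets queued at this level  */
	int limit;                   /* congestion limit, this level  */
	struct bundle *target_bskb;  /* open bundle, same level only  */
} backlog[LEVELS];

/* Try to append a message to the open bundle of its own level. */
static bool bundle_msg(int level, int msg_len)
{
	struct bundle *b = backlog[level].target_bskb;

	if (!b || b->room < msg_len)
		return false;
	b->used += msg_len;
	b->room -= msg_len;
	return true;                 /* bundled: level len unchanged  */
}

/* Queue one message, opening a new same-level bundle when possible. */
static bool queue_msg(int level, int msg_len)
{
	struct bundle *b;

	if (backlog[level].len >= backlog[level].limit)
		return false;        /* sender sees its own congestion */
	if (bundle_msg(level, msg_len))
		return true;
	if (msg_len <= MAX_BUNDLE) {           /* open a fresh bundle */
		b = calloc(1, sizeof(*b));
		b->level = level;
		b->used = msg_len;
		b->room = MAX_BUNDLE - msg_len;
		backlog[level].target_bskb = b;
	} else {                               /* too big to bundle   */
		backlog[level].target_bskb = NULL;
	}
	backlog[level].len++;        /* counted at its own level       */
	return true;
}

/* A buffer leaving the backlog must stop being a bundling target. */
static void advance(struct bundle *b)
{
	backlog[b->level].len--;
	if (backlog[b->level].target_bskb == b)
		backlog[b->level].target_bskb = NULL;
	free(b);
}

int main(void)
{
	int i;

	for (i = 0; i < LEVELS; i++)
		backlog[i].limit = 100;        /* arbitrary           */

	/* 64-byte CRITICAL messages interleaved with 4096-byte LOW
	 * messages: the small ones keep refilling one CRITICAL-level
	 * bundle instead of opening a near-empty bundle each time.   */
	for (i = 0; i < 6; i++) {
		queue_msg(3, 64);              /* CRITICAL            */
		queue_msg(0, 4096);            /* LOW                 */
	}
	printf("CRITICAL len=%d, LOW len=%d\n",
	       backlog[3].len, backlog[0].len);        /* 1 and 6     */

	advance(backlog[3].target_bskb);       /* bundle is sent      */
	printf("CRITICAL target cleared: %d\n",
	       backlog[3].target_bskb == NULL);        /* 1           */
	return 0;
}

Running it shows the six small CRITICAL messages occupying a single backlog slot at their own level, while the six large LOW messages occupy six slots at theirs, which is exactly what the per-level counters need in order to apply the limits accurately.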