Hi Jon,

See below.
Best regards,
Tung Nguyen

-----Original Message-----
From: Jon Maloy <jon.ma...@ericsson.com>
Sent: Friday, July 19, 2019 10:05 AM
To: Tung Quang Nguyen <tung.q.ngu...@dektech.com.au>; 'Jon Maloy' <ma...@donjonn.com>; tipc-discussion@lists.sourceforge.net; ying....@windriver.com
Cc: Mohan Krishna Ghanta Krishnamurthy <mohan.krishna.ghanta.krishnamur...@ericsson.com>; parthasarathy.bhuvara...@gmail.com; Hoang Huu Le <hoang.h...@dektech.com.au>; Canh Duc Luu <canh.d....@dektech.com.au>; Tuong Tong Lien <tuong.t.l...@dektech.com.au>; Gordan Mihaljevic <gordan.mihalje...@dektech.com.au>
Subject: RE: [net 1/1] tipc: reduce risk of wakeup queue starvation

Hi Tung,

I did of course do some measurements before sending out the patch yesterday, but I saw no significant difference in performance between the methods. The results were within the range of normal stochastic variation, and making many runs only confirmed that. This was also what I was expecting.

I have now made a couple of changes to my patch:

1) I took into account that backlog[imp].len is often larger than backlog[imp].limit (because of the oversubscription) when calculating the available number of backlog queue slots, resulting in a negative avail[imp] value. This would have the effect that we sometimes keep submitting wakeup jobs when there are no slots left in the backlog queue. Those jobs will be launched, run, immediately find that we are at congestion again, and be suspended a second time. That is a gross waste of CPU resources. (This is a weakness of your algorithm, too.)

[Tung]: Exactly, but the lists of skbs are only added to the link backlog queues after those jobs have been launched, run, and found that we are at congestion again. That is the reason I chose to schedule all wakeup messages as soon as the condition is met. All sleeping sockets will be woken up and send their messages to the link backlog queues before being put into the sleeping state again. With that, we can reduce the risk of starvation. Your patch v2 was exactly my first patch, except for adding the wakeup messages to the queue tail instead of prepending them. But I thought more about the above-mentioned scenario and changed to the patch I sent to you, to avoid the risk completely.

2) After adding some statistics counters, I realized that there are practically never any wakeup messages in the input queue when we run link_prepare_wakeup(), so we can just as well skip the slightly expensive step of counting and rescheduling those jobs. This may mean that we occasionally issue too many wakeup jobs, but it seems to be worth the cost.

3) I added the wakeup jobs to the tail of the input queue, instead of to the head, so they will be run after the regular input queue messages. I felt slightly uncomfortable with prepending the wakeup messages, as it might lead to surprising new behaviors. This is in practice what you are doing, too, since you check the wakeup queue and call tipc_sk_wakeup_rcv() after we have checked the input queue and called tipc_sk_rcv().

I also re-ran my measurements with the new version, many times, and the pattern was always the same.

[Tung]: I have verified and can confirm it. It is better than the first version, which showed very bad results in all my rounds of execution. Let's wait and see whether this patch fixes the issue.
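[Tung]: To make sure we mean the same thing, here is how I picture the v2 logic after your three changes. This is only my sketch, paraphrased from your description and the v1 diff quoted below (in the context of net/tipc/link.c), not your actual v2 code:

static void link_prepare_wakeup(struct tipc_link *l)
{
        struct sk_buff_head *wakeupq = &l->wakeupq;
        struct sk_buff_head *inputq = l->inputq;
        struct sk_buff *skb, *tmp;
        struct sk_buff_head tmpq;
        int avail[5] = {0,};
        int imp;

        __skb_queue_head_init(&tmpq);

        /* Change 1: clamp at zero, so oversubscription (len > limit)
         * can never yield a negative slot count
         */
        for (imp = 0; imp <= TIPC_SYSTEM_IMPORTANCE; imp++) {
                avail[imp] = l->backlog[imp].limit - l->backlog[imp].len;
                if (avail[imp] < 0)
                        avail[imp] = 0;
        }

        /* Change 2: no rescan of inputq for old wakeup messages */

        /* Move at most as many wakeup jobs as there are free backlog
         * slots at each importance level
         */
        spin_lock_bh(&wakeupq->lock);
        skb_queue_walk_safe(wakeupq, skb, tmp) {
                imp = TIPC_SKB_CB(skb)->chain_imp;
                if (!avail[imp])
                        continue;
                avail[imp]--;
                __skb_unlink(skb, wakeupq);
                __skb_queue_tail(&tmpq, skb);
        }
        spin_unlock_bh(&wakeupq->lock);

        /* Change 3: append at the tail, after regular data messages */
        spin_lock_bh(&inputq->lock);
        skb_queue_splice_tail(&tmpq, inputq);
        spin_unlock_bh(&inputq->lock);
}

If that matches your v2, it also grabs only two locks instead of three, which should address my earlier complexity concern.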
Below is a sample.

No patch:
-------------
node1 ~ # bmc -t -c100
****** TIPC Benchmark Client Started ******
Transferring 64000 messages in TIPC Throughput Benchmark
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
| Msg Size [octets] | # Conns | # Msgs/Conn | Elapsed [ms] | Total [Msg/s] | Total [Mb/s] | Per Conn [Mb/s] |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|                64 |     100 |       64000 |         5635 |       1135714 |          581 |               5 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|               256 |     100 |       32000 |         4221 |        758006 |         1552 |              15 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|              1024 |     100 |       16000 |         9023 |        177316 |         1452 |              14 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|              4096 |     100 |        8000 |        10238 |         78139 |         2560 |              25 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|             16384 |     100 |        4000 |        15651 |         25557 |         3349 |              33 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|             65536 |     100 |        2000 |        29197 |          6849 |         3590 |              35 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
Completed Throughput Benchmark
****** TIPC Benchmark Client Finished ******
node1 ~ #

Tung's patch:
-----------------
node1 ~ # bmc -t -c100
****** TIPC Benchmark Client Started ******
Transferring 64000 messages in TIPC Throughput Benchmark
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
| Msg Size [octets] | # Conns | # Msgs/Conn | Elapsed [ms] | Total [Msg/s] | Total [Mb/s] | Per Conn [Mb/s] |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|                64 |     100 |       64000 |         5906 |       1083527 |          554 |               5 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|               256 |     100 |       32000 |         3510 |        911531 |         1866 |              18 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|              1024 |     100 |       16000 |         9594 |        166755 |         1366 |              13 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|              4096 |     100 |        8000 |         9738 |         82144 |         2691 |              26 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|             16384 |     100 |        4000 |        15760 |         25379 |         3326 |              33 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|             65536 |     100 |        2000 |        30615 |          6532 |         3424 |              34 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
Completed Throughput Benchmark
****** TIPC Benchmark Client Finished ******
node1 ~ #

Jon's patch, v2:
-----------------
node1 ~ # bmc -t -c100
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
| Msg Size [octets] | # Conns | # Msgs/Conn | Elapsed [ms] | Total [Msg/s] | Total [Mb/s] | Per Conn [Mb/s] |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|                64 |     100 |       64000 |         5465 |       1171064 |          599 |               5 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|               256 |     100 |       32000 |         3270 |        978490 |         2003 |              20 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|              1024 |     100 |       16000 |         6987 |        228964 |         1875 |              18 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|              4096 |     100 |        8000 |         9562 |         83657 |         2741 |              27 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|             16384 |     100 |        4000 |        15603 |         25635 |         3360 |              33 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
|             65536 |     100 |        2000 |        28385 |          7045 |         3693 |              36 |
+-------------------+---------+-------------+--------------+---------------+--------------+-----------------+
Completed Throughput Benchmark
****** TIPC Benchmark Client Finished ******
node1 ~ #

BR
///jon

> -----Original Message-----
> From: tung quang nguyen <tung.q.ngu...@dektech.com.au>
> Sent: 18-Jul-19 06:03
> To: Jon Maloy <jon.ma...@ericsson.com>; 'Jon Maloy' <ma...@donjonn.com>; tipc-discussion@lists.sourceforge.net; ying....@windriver.com
> Cc: Mohan Krishna Ghanta Krishnamurthy <mohan.krishna.ghanta.krishnamur...@ericsson.com>; parthasarathy.bhuvara...@gmail.com; Hoang Huu Le <hoang.h...@dektech.com.au>; Canh Duc Luu <canh.d....@dektech.com.au>; Tuong Tong Lien <tuong.t.l...@dektech.com.au>; Gordan Mihaljevic <gordan.mihalje...@dektech.com.au>
> Subject: RE: [net 1/1] tipc: reduce risk of wakeup queue starvation
>
> Hi Jon,
>
> Your patch is too complex. There is a lot of overhead from grabbing and
> releasing locks (3 times) and from using 2 loops.
> As a result, benchmark performance is reduced significantly, while my
> patch shows the same performance as the original code.
>
> This is the comparison between my patch and yours after running the
> benchmark with 100 connections (message size in bytes: x% better):
> - 64: 23.5%
> - 256: 30.2%
> - 1024: 19.5%
> - 4096: 14.9%
> - 16384: 6.7%
> - 65536: 2.4%
>
> So, I do not think your patch will solve the issue.
>
> Thanks.
>
> Best regards,
> Tung Nguyen
>
> -----Original Message-----
> From: Jon Maloy <jon.ma...@ericsson.com>
> Sent: Thursday, July 18, 2019 4:22 AM
> To: Jon Maloy <ma...@donjonn.com>
> Cc: Mohan Krishna Ghanta Krishnamurthy <mohan.krishna.ghanta.krishnamur...@ericsson.com>; parthasarathy.bhuvara...@gmail.com; Tung Quang Nguyen <tung.q.ngu...@dektech.com.au>; Hoang Huu Le <hoang.h...@dektech.com.au>; Canh Duc Luu <canh.d....@dektech.com.au>; Tuong Tong Lien <tuong.t.l...@dektech.com.au>; Gordan Mihaljevic <gordan.mihalje...@dektech.com.au>; ying....@windriver.com; tipc-discussion@lists.sourceforge.net
> Subject: RE: [net 1/1] tipc: reduce risk of wakeup queue starvation
>
> Hi Tung,
>
> After thinking more about this, I realized that the problem is a little
> more complex than I first imagined.
> We must also take into account that there may still be old wakeup
> messages in the input queue when we are trying to add new ones. Those
> may be scattered around in the input queue because new data messages
> have arrived before they were scheduled.
> So, we must make sure that those are still placed at the head of the
> input queue, before any new wakeup messages, which should come before
> any data messages.
> Those messages should also be counted when we calculate the available
> space in the backlog queue, so that there are never more pending wakeup
> jobs than there are available slots in that queue.
> So, I ended up writing my own patch for this; I hope you don't mind.
> I tested this as far as I could, but I assume you have a more relevant
> test program for this than I do.
>
> BR
> ///jon
>
>
> > -----Original Message-----
> > From: Jon Maloy
> > Sent: 17-Jul-19 16:56
> > To: Jon Maloy <jon.ma...@ericsson.com>; Jon Maloy <ma...@donjonn.com>
> > Cc: Mohan Krishna Ghanta Krishnamurthy <mohan.krishna.ghanta.krishnamur...@ericsson.com>; parthasarathy.bhuvara...@gmail.com; Tung Quang Nguyen <tung.q.ngu...@dektech.com.au>; Hoang Huu Le <hoang.h...@dektech.com.au>; Canh Duc Luu <canh.d....@dektech.com.au>; Tuong Tong Lien <tuong.t.l...@dektech.com.au>; Gordan Mihaljevic <gordan.mihalje...@dektech.com.au>; ying....@windriver.com; tipc-discussion@lists.sourceforge.net
> > Subject: [net 1/1] tipc: reduce risk of wakeup queue starvation
> >
> > In commit 365ad353c256 ("tipc: reduce risk of user starvation during
> > link congestion") we allowed senders to add exactly one list of extra
> > buffers to the link backlog queues during link congestion (aka
> > "oversubscription"). However, the criterion for when to stop adding
> > wakeup messages to the input queue when the overload abates is
> > inaccurate, and may cause starvation problems during very high load.
> >
> > Currently, we stop adding wakeup messages after 10 total failed
> > attempts where we find that there is no space left in the backlog
> > queue for a certain importance level. The counter for this is
> > accumulated across all levels, which may lead the algorithm to leave
> > the loop prematurely, although there may still be plenty of space
> > available at some levels.
> > The result is sometimes that messages near the wakeup queue tail are
> > not added to the input queue as they should be.
> >
> > We now introduce a more exact algorithm, where we don't stop adding
> > messages until all backlog queue levels have saturated or there are
> > no more wakeup messages to dequeue.
> >
> > Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
> > Reported-by: Tung Nguyen <tung.q.ngu...@dektech.com.au>
> > Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
> > ---
> >  net/tipc/link.c | 43 +++++++++++++++++++++++++++++++++++--------
> >  1 file changed, 35 insertions(+), 8 deletions(-)
> >
> > diff --git a/net/tipc/link.c b/net/tipc/link.c
> > index 66d3a07..1f41523 100644
> > --- a/net/tipc/link.c
> > +++ b/net/tipc/link.c
> > @@ -853,18 +853,45 @@ static int link_schedule_user(struct tipc_link *l, struct tipc_msg *hdr)
> >   */
> >  static void link_prepare_wakeup(struct tipc_link *l)
> >  {
> > +        struct sk_buff_head *wakeupq = &l->wakeupq;
> > +        struct sk_buff_head *inputq = l->inputq;
> >          struct sk_buff *skb, *tmp;
> > -        int imp, i = 0;
> > +        struct sk_buff_head tmpq;
> > +        int avail[5] = {0,};
> > +        int imp = 0;
> > +
> > +        __skb_queue_head_init(&tmpq);
> > +
> > +        for (; imp <= TIPC_SYSTEM_IMPORTANCE; imp++)
> > +                avail[imp] = l->backlog[imp].limit - l->backlog[imp].len;
> >
> > -        skb_queue_walk_safe(&l->wakeupq, skb, tmp) {
> > +        /* Already pending wakeup messages in inputq must come first */
> > +        spin_lock_bh(&inputq->lock);
> > +        skb_queue_walk_safe(inputq, skb, tmp) {
> > +                if (msg_user(buf_msg(skb)) != SOCK_WAKEUP)
> > +                        continue;
> > +                __skb_unlink(skb, inputq);
> > +                __skb_queue_tail(&tmpq, skb);
> >                  imp = TIPC_SKB_CB(skb)->chain_imp;
> > -                if (l->backlog[imp].len < l->backlog[imp].limit) {
> > -                        skb_unlink(skb, &l->wakeupq);
> > -                        skb_queue_tail(l->inputq, skb);
> > -                } else if (i++ > 10) {
> > -                        break;
> > -                }
> > +                if (avail[imp])
> > +                        avail[imp]--;
> >          }
> > +        spin_unlock_bh(&inputq->lock);
> > +
> > +        spin_lock_bh(&wakeupq->lock);
> > +        skb_queue_walk_safe(wakeupq, skb, tmp) {
> > +                imp = TIPC_SKB_CB(skb)->chain_imp;
> > +                if (!avail[imp])
> > +                        continue;
> > +                avail[imp]--;
> > +                __skb_unlink(skb, wakeupq);
> > +                __skb_queue_tail(&tmpq, skb);
> > +        }
> > +        spin_unlock_bh(&wakeupq->lock);
> > +
> > +        spin_lock_bh(&inputq->lock);
> > +        skb_queue_splice(&tmpq, inputq);
> > +        spin_unlock_bh(&inputq->lock);
> >  }
> >
> >  void tipc_link_reset(struct tipc_link *l)
> > --
> > 2.1.4
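[Tung]: For anyone skimming the thread, the starvation described in the commit message is easiest to see in the removed lines above; here they are again, copied from the diff, with my comments added:

        skb_queue_walk_safe(&l->wakeupq, skb, tmp) {
                imp = TIPC_SKB_CB(skb)->chain_imp;
                if (l->backlog[imp].len < l->backlog[imp].limit) {
                        skb_unlink(skb, &l->wakeupq);
                        skb_queue_tail(l->inputq, skb);
                } else if (i++ > 10) {
                        /* 'i' accumulates failures across ALL importance
                         * levels, so ten hits on one saturated level abort
                         * the whole walk and strand wakeup messages near
                         * the queue tail, even when other levels still
                         * have free backlog slots.
                         */
                        break;
                }
        }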