Hi Peter,
I think the most sensible thing you could do here is to raise the priority of
the sensitive sender. This unfortunately requires a change to the application
code (a new setsockopt call), but it might be worth a try. Which importance
level is your sending socket using now?
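Something like the sketch below, assuming 'sd' is the descriptor of your
existing sending socket (untested, but the option itself is standard TIPC):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/tipc.h>

    #ifndef SOL_TIPC
    #define SOL_TIPC 271            /* value from linux/socket.h */
    #endif

    /* Raise the importance of a TIPC socket so that its messages are
     * dropped last when the receiving socket buffer fills up. */
    static int raise_importance(int sd)
    {
            int imp = TIPC_HIGH_IMPORTANCE; /* or TIPC_CRITICAL_IMPORTANCE */

            if (setsockopt(sd, SOL_TIPC, TIPC_IMPORTANCE,
                           &imp, sizeof(imp)) < 0) {
                    perror("setsockopt(TIPC_IMPORTANCE)");
                    return -1;
            }
            return 0;
    }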
///jon

> -----Original Message-----
> From: Peter Koss <[email protected]>
> Sent: September 20, 2018 12:46 PM
> To: Jon Maloy <[email protected]>; [email protected]
> Subject: RE: What affects congestion beyond window size, and what might
> have reduced congestion thresholds in TIPC 2.0.x?
>
> Hi Jon,
>
> To describe the problem: this is a performance matching issue for us,
> between (TIPC 2.0.5 on Wind River Linux 6) and (TIPC 1.7.7 on Wind River
> Linux 3). Our performance is much lower under the former, something like
> half that of the latter.
>
> We see the sender getting EAGAIN using a window size of 150 (and 50 and
> 300); that is probably the core issue for us, and it happens at a much
> less busy processing rate than in the older environment. No link reset
> was noted at 150, but the EAGAIN occurs at an unexpectedly low processing
> rate. We use non-blocking socket flags for this data, on both old and new
> versions. Our program is sensitive to small delays, hence the
> non-blocking choice.
>
> I think we are most interested in any tuning available specific to 2.0.5,
> or in anything changed in TIPC that might have slowed these things down.
>
> Regards
>
> Peter
>
>
> -----Original Message-----
> From: Jon Maloy <[email protected]>
> Sent: Thursday, September 20, 2018 11:11 AM
> To: Peter Koss <[email protected]>; [email protected]
> Subject: RE: What affects congestion beyond window size, and what might
> have reduced congestion thresholds in TIPC 2.0.x?
>
> Hi Peter,
> See below.
>
> > -----Original Message-----
> > From: Peter Koss <[email protected]>
> > Sent: September 20, 2018 11:31 AM
> > To: Jon Maloy <[email protected]>; [email protected]
> > Subject: RE: What affects congestion beyond window size, and what
> > might have reduced congestion thresholds in TIPC 2.0.x?
> >
> > Hi Jon,
> >
> > Again, thanks for thinking about this. Kernel version:
> > Linux a33ems1 3.10.62-ltsi-WR6.0.0.36_standard #1 SMP PREEMPT Mon
> > Aug 20 17:25:51 CDT 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> Ok. A pretty old kernel then.
>
> > The retransmission error (i.e., the kind of catastrophic interruption)
> > was only noted in experiments pushing the window size to 300-400; it
> > was not noted at lower window sizes.
> >
> > Our thoughts are around how to see evidence of retransmission going on
> > prior to that, or ways to see evidence of that in TIPC 2.0.5.
>
> You can use tipc-config to list the link statistics. There you will see
> the number of retransmissions, just as you can see the number of
> congestion events.
>
> > We compare to TIPC 1.7.7 under Wind River Linux 3 as a data point, but
> > care mostly about addressing it under 2.0.5 and Wind River Linux 6.
> >
> > This is not running in a VM. Not sure about your question on Remote
> > Procedure Call
>
> Sorry, I did of course mean RPS (Receive Packet Steering). But TIPC does
> not support that in such an old kernel, so that cannot be the problem.
>
> > being activated; if there is a command to run or a code construct to
> > check, I could get that.
>
> I still don't understand what it is you consider a problem. If you use a
> window size of 150, you don't have the link reset, you say.
> So, what is it that is not working?
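>
> One more thought: EAGAIN on a non-blocking TIPC socket just means
> "congested right now"; the send can be retried once the socket becomes
> writable again. If your sender currently spins on EAGAIN, waiting for
> POLLOUT is cheaper. A minimal sketch, plain POSIX and nothing TIPC
> specific ('sd', 'buf' and 'len' are assumed from your application):
>
>     #include <errno.h>
>     #include <poll.h>
>     #include <sys/types.h>
>     #include <sys/socket.h>
>
>     /* Send on a non-blocking socket; instead of spinning on EAGAIN,
>      * sleep in poll() until the congestion has abated. */
>     static ssize_t send_when_writable(int sd, const void *buf, size_t len)
>     {
>             ssize_t rc;
>
>             while ((rc = send(sd, buf, len, MSG_DONTWAIT)) < 0 &&
>                    (errno == EAGAIN || errno == EWOULDBLOCK)) {
>                     struct pollfd pfd = { .fd = sd, .events = POLLOUT };
>
>                     if (poll(&pfd, 1, -1) < 0)  /* wait until writable */
>                             return -1;
>             }
>             return rc;
>     }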
>
> BR
> ///jon
>
> > Regards
> >
> > PK
> >
> > -----Original Message-----
> > From: Jon Maloy <[email protected]>
> > Sent: Thursday, September 20, 2018 7:28 AM
> > To: Peter Koss <[email protected]>; [email protected]
> > Subject: RE: What affects congestion beyond window size, and what
> > might have reduced congestion thresholds in TIPC 2.0.x?
> >
> > Hi Peter,
> > See my comments below.
> >
> > > -----Original Message-----
> > > From: Peter Koss <[email protected]>
> > > Sent: September 19, 2018 6:11 PM
> > > To: Jon Maloy <[email protected]>; [email protected]
> > > Subject: RE: What affects congestion beyond window size, and what
> > > might have reduced congestion thresholds in TIPC 2.0.x?
> > >
> > > Thanks for responding.
> > >
> > > There was code in TIPC 1.7.x that gave some node receive queue
> > > information, but that is now obsolete in 2.0.x. We are using socket
> > > receive calls to get data instead, which seems to suggest one of two
> > > problems: either the receive-side queue is filling up and exceeding
> > > limits, or the ack back to the sender is having trouble. We do see
> > > the sender getting errno=EAGAIN.
> > > Overall, the performance levels we see are much lower with TIPC 2.0.x
> > > under Wind River Linux 6 than with TIPC 1.7.x under Wind River Linux 3.
> >
> > Which Linux kernel version are you running?
> >
> > > (The first case below is the obsolete one: the call occurs, but we
> > > get 0.)
> > >
> > >    case TIPC_NODE_RECVQ_DEPTH:
> > >            value = (u32)atomic_read(&tipc_queue_size);
> > >            break;
> > >    case TIPC_SOCK_RECVQ_DEPTH:
> > >            value = skb_queue_len(&sk->sk_receive_queue);
> > >            break;
> > >
> > > Questions we have currently:
> > > - What is the socket receive queue limit (default)?
> >
> > That depends on the Linux version you are using. Prior to 4.6 it was
> > 64 MB; in later versions it is 2 MB, but with a much better flow
> > control algorithm.
> >
> > > - Is it wise to try a window size > 150?
> >
> > I have never done it myself except for experimental purposes, but I
> > see no problem with it.
> > Do you have any particular reason to do so? Does it give significantly
> > better throughput than at 150?
> >
> > > - Is there a good way to control or influence the flow control
> > > sender/receiver coordination,
> >
> > You can increase the window size to potentially improve link level
> > throughput, and you can increase the sending socket's importance
> > priority to reduce the risk of receive socket buffer overflow.
> >
> > > or a best way to adjust the receive buffer limit?
> >
> > If you want to change this, follow the instructions under 5.2 at the
> > following link:
> > http://tipc.sourceforge.net/programming.html#incr_rcvbuf
> > But I see no sign that buffer overflow is your problem.
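> >
> > If you want to watch for it anyway, the TIPC_SOCK_RECVQ_DEPTH option
> > you quoted above is readable from user space on the receiving socket.
> > A minimal sketch, assuming 'sd' is your receiving socket and that your
> > 2.0.5 module exposes the option:
> >
> >     #include <stdio.h>
> >     #include <sys/socket.h>
> >     #include <linux/tipc.h>
> >
> >     #ifndef SOL_TIPC
> >     #define SOL_TIPC 271        /* value from linux/socket.h */
> >     #endif
> >
> >     /* Print the number of buffers currently queued on the socket's
> >      * receive queue (skb_queue_len() on the kernel side). */
> >     static void print_recvq_depth(int sd)
> >     {
> >             unsigned int depth = 0;
> >             socklen_t len = sizeof(depth);
> >
> >             if (getsockopt(sd, SOL_TIPC, TIPC_SOCK_RECVQ_DEPTH,
> >                            &depth, &len) == 0)
> >                     printf("receive queue depth: %u\n", depth);
> >     }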
> > > For context, the first sign of errors shows up as congestion, where
> > > the max value will increase to slightly above whatever window size
> > > we set (50, 150, 300, 400):
> > >
> > > pl0_1:~$ /usr/sbin/tipc-config -ls | grep "Send queue max"
> > > Congestion link:0       Send queue max:2   avg:1
> > > Congestion link:93121   Send queue max:162 avg:3
> > > Congestion link:206724  Send queue max:164 avg:3
> > > Congestion link:67839   Send queue max:167 avg:3
> > > Congestion link:214788  Send queue max:166 avg:3
> > > Congestion link:205240  Send queue max:165 avg:3
> > > Congestion link:240955  Send queue max:166 avg:3
> > > Congestion link:0       Send queue max:0   avg:0
> > > Congestion link:0       Send queue max:1   avg:0
> > > Congestion link:0       Send queue max:0   avg:0
> >
> > This is all normal and unproblematic. We allow an oversubscription of
> > one message (at most 46 1500-byte packets) on each link to keep the
> > algorithm simple. So you will often find the max value higher than
> > the nominal upper limit.
> >
> > > The next error occurs only when the window size is high, 300-400; it
> > > is not seen at 50 or 150, so we think it may be extraneous to our
> > > issue. It also makes us wonder whether going above 150 is wise,
> > > hence the question above.
> > >
> > > Sep 17 05:42:00 pl0_4 kernel: tipc: Retransmission failure on link
> > >   <1.1.5:bond1-1.1.2:bond1>
> > > Sep 17 05:42:00 pl0_4 kernel: tipc: Resetting link
> >
> > This is your real problem. For some reason a packet has been
> > retransmitted >100 times on a link without going through. Then the
> > link is reset, and all associated connections as well.
> > We have seen this happen for various reasons over the years, and have
> > fixed them all.
> > Is RPC possibly activated on your receiving node?
> > Are you running a VM with a virtio interface? That one tends to be
> > overwhelmed sometimes and just stops sending for 30 seconds, leading
> > to broken links.
> >
> > But again, it all depends on which kernel and environment you are
> > running. Please update me on this.
> >
> > BR
> > ///jon
> >
> > > Sep 17 05:42:00 pl0_4 kernel: Link 1001002<eth:bond1>::WW
> > > Sep 17 05:42:00 pl0_4 kernel: tipc: Lost link
> > >   <1.1.5:bond1-1.1.2:bond1> on network plane A
> > > Sep 17 05:42:00 pl0_4 kernel: tipc: Lost contact with <1.1.2>
> > > Sep 17 05:42:00 pl0_10 kernel: tipc: Resetting link
> > >   <1.1.2:bond1-1.1.5:bond1>, requested by peer
> > > Sep 17 05:42:00 pl0_10 kernel: tipc: Lost link
> > >   <1.1.2:bond1-1.1.5:bond1> on network plane A
> > >
> > > Thanks in advance; advice is appreciated.
> > >
> > > PK
> > >
> > > -----Original Message-----
> > > From: Jon Maloy <[email protected]>
> > > Sent: Tuesday, September 18, 2018 12:15 PM
> > > To: Peter Koss <[email protected]>; [email protected]
> > > Subject: RE: What affects congestion beyond window size, and what
> > > might have reduced congestion thresholds in TIPC 2.0.x?
> > >
> > > Hi Peter,
> > > The only parameter of those mentioned below that would have any
> > > effect on congestion is TIPC_MAX_LINK_WIN, which should reduce
> > > occurrences of link level congestion.
> > > However, you don't describe which symptoms you see caused by this
> > > congestion.
> > > - Is it only a higher 'congested' counter when you look at the link
> > >   statistics? If so, you don't have a problem at all; this is a
> > >   totally normal and frequent occurrence. (Maybe we should have
> > >   given this field a different name to avert confusion.)
> > > - If this causes severely reduced throughput you may have a problem,
> > >   but I don't find that very likely.
> > > - If you are losing messages at the socket level (dropped because of
> > >   receive buffer overflow) you *do* have a problem, but this can most
> > >   often be remedied by extending the socket receive buffer limit, as
> > >   sketched below.
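> > >
> > > Extending the receive buffer limit is just a setsockopt call. A
> > > rough sketch, assuming 'sd' is the receiving socket and that
> > > net.core.rmem_max permits the new value (otherwise use
> > > SO_RCVBUFFORCE with CAP_NET_ADMIN):
> > >
> > >     #include <stdio.h>
> > >     #include <sys/socket.h>
> > >
> > >     /* Ask for a larger socket receive buffer, so that bursts are
> > >      * queued rather than dropped at the receiving socket. */
> > >     static int grow_rcvbuf(int sd, int bytes)
> > >     {
> > >             if (setsockopt(sd, SOL_SOCKET, SO_RCVBUF,
> > >                            &bytes, sizeof(bytes)) < 0) {
> > >                     perror("setsockopt(SO_RCVBUF)");
> > >                     return -1;
> > >             }
> > >             return 0;
> > >     }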
> > >
> > > BR
> > > ///Jon Maloy
> > >
> > > -----Original Message-----
> > > From: Peter Koss <[email protected]>
> > > Sent: September 18, 2018 12:33 PM
> > > To: [email protected]
> > > Subject: [tipc-discussion] What affects congestion beyond window
> > > size, and what might have reduced congestion thresholds in TIPC 2.0.x?
> > >
> > > In TIPC 1.7.6, we battled with congestion quite a bit. We ultimately
> > > settled on adjusting these parameters in TIPC, which we also used in
> > > TIPC 1.7.7. This was running on Wind River Linux 3, where TIPC was a
> > > module independent of the kernel.
> > >
> > > SOL_TIPC changed from 271 to 50 (probably not affecting congestion)
> > > TIPC_MAX_LINK_WIN changed from 50 to 150
> > > TIPC_NODE_RECVQ_DEPTH set to 131
> > >
> > > Using Wind River Linux 6, we get TIPC 2.0.5 as part of the kernel,
> > > and we see congestion occurring at much lower overall load levels
> > > (less traffic overall) compared to TIPC 1.7.7 & WR3. We have made
> > > the same changes as above via a loadable module for TIPC 2.0.5, and
> > > also noted that TIPC_NODE_RECVQ_DEPTH is now obsolete. Upon
> > > observing congestion, we have changed the default window size, and
> > > the max window size, up to 300 and even 400. This helps congestion
> > > a little bit, but not sufficiently.
> > >
> > > Does anyone know:
> > > - What has changed in TIPC 2.0.x that affects this?
> > > - Are there other parameters to change, to assist with this?
> > > - Is there a replacement set of parameters that affect what
> > >   TIPC_NODE_RECVQ_DEPTH influences?

_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
