Better suggestion:

    if (gap > 255)
            gap = 255;

Otherwise 257 would give a gap of 1, 258 a gap of 2, etc. Clearly not
optimal.
///jon
Jon Maloy wrote:
> Hi Peter,
> See my previous mail.
> It should be pretty easy for you to verify my hypothesis. Just add a
>
>     if ((gap % 256) == 0)
>             gap = 255;
>
> into the function msg_set_seq_gap() in tipc_link.h. This should
> be a functional work-around for the problem.
>
> ///jon
>
>
> Jon Maloy wrote:
>
>> See below.
>> I think we are close now.
>> If any of you want the dump, you will have to ask Peter directly,
>> since he was reluctant to send it out to tipc-discussion.
>>
>> Regards
>> ///jon
>>
>> Horvath, Elmer wrote:
>>
>>
>>> Hi,
>>>
>>> This is very interesting. From the description by Jon (I did not see the
>>> Wireshark trace posted), the gap value is being calculated as 0 when it
>>> should be calculated as a non-zero value.
>>>
>>> We actually encountered a similar, but different, issue internally and
>>> believed it to be a compiler problem (we were not compiling with GCC). The
>>> target was an E500 (8560 based PPC system) and was compiled with software
>>> floating point (though no floating point code is in TIPC that I know of).
>>>
>>> In our case, the calculation was effectively subtracting 1 from 1 and
>>> getting a 1. The node would then send a NAK falsely asking for
>>> retransmissions on packets it did in fact receive.
>>>
>>> The incorrect calculation for us was in tipc_link.c in the routine
>>> link_recv_proto_msg() calculating the value of the variable 'rec_gap'. The
>>> code is:
>>>   if (less_eq(mod(l_ptr->next_in_no), msg_next_sent(msg))) {
>>>           rec_gap = mod(msg_next_sent(msg) -
>>>                         mod(l_ptr->next_in_no));
>>>   }
>>>
>>>
>>>
>> I think this is calculated correctly in our case, but the rec_gap
>> passed into tipc_link_send_proto_msg() gets overwritten by that
>> routine. This is normally correct, since the gap should be adjusted
>> according to what is present in the deferred queue, in order to avoid
>> retransmitting more packets than necessary.
>>
>> The code I was referring to is the following, where 'gap' initially is
>> set to the 'rec_gap' calculated above.
>>
>>   if (l_ptr->oldest_deferred_in) {
>>           u32 rec = msg_seqno(buf_msg(l_ptr->oldest_deferred_in));
>>           gap = mod(rec - mod(l_ptr->next_in_no));
>>   }
>>
>>   msg_set_seq_gap(msg, gap);
>>   .....
>>
>> When the protocol gets stuck, 'rec_gap' should be found to be
>> (54992 - 53968) = 1024.
>> Since the result is non-zero, tipc_link_send_proto_msg() is called.
>> Inside that routine three things can happen:
>> 1) l_ptr->oldest_deferred_in is NULL. This means that 'gap' will
>>    retain its value of 1024. This leads us into case 3) below.
>> 2) The calculation of 'gap' overwrites the original value. If this
>>    value is always zero, the protocol will bail out. Can this happen?
>> 3) msg_set_seq_gap() always writes a zero into the message.
>>    Actually, this is fully possible. The field for 'gap' is only 8
>>    bits long, so any gap size which is a multiple of 256 will give a
>>    zero. Looking at the dump, this looks very possible: the first
>>    packet loss is not 95 packets, as I stated in my first mail, but
>>    54483 - 53967 = 516 packets. This is counting only from what we
>>    see in Wireshark, which we have reason to suspect doesn't show all
>>    packets. So the real value might quite well be 512. And if this
>>    happens, we are stuck forever, because the head of the deferred
>>    queue will never move.
>>
>> My question to Peter is: How often does this happen? Every time?
>> Often? If it happens often, can it be that the Ethernet driver has
>> the habit of throwing away blocks of packets which are exactly a
>> multiple of 256 or 512? (These computer programmers...)
>>
>> Anyway, we have clearly found a potential problem which must be
>> resolved. With window sizes > 255, scenario 3) is bound to happen now
>> and then. Whether this is Peter's problem remains to be seen.
>>
>>
>>
>>
>>> This resulted in rec_gap being non-zero even though both operands were the
>>> same value. When rec_gap is non-zero, then tipc_link_send_proto_msg() is
>>> called with a non-zero gap value a bit further down in the same routine.
>>>
>>> Adding instrumentation sometimes fixed the problem; doing the exact same
>>> calculation again immediately following this code would yield the correct
>>> gap value. Very bizarre.
>>>
>>> We attributed this to a compiler issue.
>>>
>>> I don't know if this is the same issue, but it surely sounds similar enough
>>> to be noted. And this may be another place to check since it calls
>>> tipc_link_send_proto_msg() after receiving a state message.
>>>
>>> Elmer
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jon Maloy
>>> Sent: Tuesday, March 04, 2008 7:51 PM
>>> To: Xpl++; [EMAIL PROTECTED]; [email protected]
>>> Subject: Re: [tipc-discussion] RE : Re: Link related question/issue
>>>
>>> Hi Peter,
>>>
>>> I see two interesting patterns:
>>>
>>> a: After the packet loss has started at packet
>>> 14191, state messages from 1.1.12 always come
>>> in pairs, with the same timestamp.
>>>
>>> b: Also, after the problems have started, all
>>> state messages 1.1.6->1.1.12, even when they are
>>> not probes, are immediately followed by a state
>>> message in the opposite direction.
>>> This is a strong indication that the receiver
>>> (1.1.12) actually detects the gap from the state
>>> message contents, and sends out a new state
>>> message (a NACK), but for some reason the gap
>>> value never makes it into that message.
>>> Hence, tipc_link_send_proto_msg(),
>>> where the gap is calculated and added
>>> (line 2135 in tipc_link.c, tipc-1.7.5), seems
>>> to be a good place to start
>>> looking.
>>> I strongly suspect that the gap calculated at
>>> lines 2128-2129 always yields 0, or that
>>> no packets ever make it into the deferred queue
>>> (via tipc_link_defer_pkt()).
>>> That would be consistent with what we see.
>>>
>>> Regards
>>> ///jon
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Xpl++
>>> Sent: March 4, 2008 2:17 PM
>>> To: [EMAIL PROTECTED]; '[email protected]'
>>> Subject: Re: [tipc-discussion] RE : Re: Link related question/issue
>>>
>>> Hi,
>>>
>>> So .. what about that TODO comment in tipc_link.c regarding the stronger
>>> seq# checking and stuff? :) Since I managed to stabilize my cluster I must
>>> proceed with a software upgrade (deadlines :( ...) and will be able to
>>> start looking into the link code sometime tomorrow evening. In the
>>> meantime, any ideas as to where/what to look at would be highly
>>> appreciated ;)
>>>
>>> Regards,
>>> Peter.
>>>
>>> Jon Paul Maloy wrote:
>>>
>>>
>>>
>>>> Hi,
>>>> Your analysis makes sense, but it still doesn't explain why TIPC
>>>> cannot handle this quite commonplace situation.
>>>> Yesterday, I forgot one essential detail: Even State messages contain
>>>> info to help the receiver detect a gap. The "next_sent" sequence
>>>> number tells the receiver if it is out of synch with the sender, and
>>>> gives it a chance to send a NACK (a State with gap != 0). Since
>>>> State-packets clearly are received, otherwise the link would go down,
>>>> there must be some bug in tipc that causes the gap to be calculated
>>>> wrong, or not at all. Neither does it look like the receiver is
>>>> sending a State _immediately_ after a gap has occurred, which it
>>>> should.
>>>> So, I think we are looking for some serious bug within tipc that
>>>> completely cripples the retransmission protocol. We should try to
>>>> backtrack and find out in which version it has been introduced.
>>>>
>>>> ///jon
>>>>
>>>>
>>>> --- Xpl++ <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Some more info about my systems:
>>>>> - all nodes that tend to drop packets are quite loaded, though very
>>>>> rarely one can see cpu #0 being 100% busy
>>>>> - there are also a few multithreaded tasks that are bound to cpu #0
>>>>> and running in SCHED_RR. All of them use tipc. None of them uses
>>>>> the maximum scheduler priority, and they use very little cpu time
>>>>> and do not tend to make any peaks
>>>>> - there is one task that runs in SCHED_RR at maximum priority 99/RT
>>>>> (it really does a very very important job), which uses around 1ms
>>>>> of cpu every 4 seconds, and it is explicitly bound to cpu #0
>>>>> - all other tasks (mostly apache & php/perl) are free to run on any
>>>>> cpu
>>>>> - all of these nodes also have considerable io load
>>>>> - the kernel has irq balancing and pretty much all irqs are
>>>>> balanced, except for the nic irqs. They are always serviced by
>>>>> cpu #0
>>>>> - to create the packet drop issue I have to mildly stress the node,
>>>>> which would normally mean a moment when apache would try to start
>>>>> some extra children, which would also cause the number of
>>>>> simultaneously running php scripts to rise, while at the same time
>>>>> the incoming network traffic is also rising. The stress is preceded
>>>>> by a few seconds of high input packet rate, which may be causing
>>>>> even more stress on the scheduler and cpu starvation
>>>>> - wireshark is dropping packets (surprisingly many, as it seems),
>>>>> tipc is confused .. and all of it is related to moments of general
>>>>> cpu starvation, and an even worse one at cpu #0
>>>>>
>>>>> Then it all started adding up ..
>>>>> I moved all non-SCHED_OTHER tasks to other cpus, as well as a few
>>>>> other services. The result: 30% of the nodes showed between 5 and
>>>>> 200 packets dropped for the whole stress routine, which did not
>>>>> affect TIPC operation; nametables were in sync, and all
>>>>> communications seemed to work properly.
>>>>> Though this solves my problems, it is still very unclear what may
>>>>> have been happening in the kernel and in the tipc stack that is
>>>>> causing this bizarre behavior.
>>>>> SMP systems alone are tricky, and when adding load and
>>>>> pseudo-realtime tasks the situation seems to become really
>>>>> complicated. One really cool thing to note is that Opteron based
>>>>> nodes handle high load and cpu starvation much better than Xeon
>>>>> ones .. which only confirms an old observation of mine, that for
>>>>> some reason (that must be the design/architecture?) Opterons appear
>>>>> _much_ more interactive/responsive than Xeons under heavy load ..
>>>>> Another note, this one on TIPC - the link window for 100mbit nets
>>>>> should be at least 256 if one wants to do any serious communication
>>>>> between a dozen or more nodes. Also, for a gbit net, link windows
>>>>> above 1024 seem to really confuse the stack when faced with a high
>>>>> output packet rate.
>>>>>
>>>>> Regards,
>>>>> Peter Litov.
>>>>>
>>>>>
>>>>> Martin Peylo wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'll try to help with the Wireshark side of this problem.
>>>>>>
>>>>>> On 3/4/08, Jon Maloy <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> Strangely enough, node 1.1.12 continues to ack packets which we
>>>>>>> don't see in wireshark (is it possible that wireshark can miss
>>>>>>> packets?). It goes on acking packets up to the one with sequence
>>>>>>> number 53967 (one of the "invisible" packets), but from there on
>>>>>>> it stops.
>>>>>>
>>>>>> I've never encountered Wireshark missing packets so far. While it
>>>>>> sounds as if it wouldn't be a problem with the TIPC dissector,
>>>>>> could you please send me a trace file so I can definitely exclude
>>>>>> this cause of defect? I've tried to get it from the link quoted in
>>>>>> the mail from Jon, but it seems it was already removed.
>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>> As a sum of this, I start to suspect your Ethernet driver. It
>>>>>>> seems like it sometimes delivers packets to TIPC which it does
>>>>>>> not deliver to Wireshark, and vice versa. This seems to happen
>>>>>>> after a period of high traffic, and only with messages beyond a
>>>>>>> certain size, since the State messages always go through.
>>>>>>> Can you see any pattern in the direction the links go stale, with
>>>>>>> reference to which driver you are using? (E.g., is there always
>>>>>>> an e1000 driver involved on the receiving end in the stale
>>>>>>> direction?) Does this happen when you only run one type of
>>>>>>> driver?
>>>>>>
>>>>>> I've not yet gone that deep into packet capture, so I can't say
>>>>>> much about that. Peter, could you send a mail to one of the
>>>>>> Wireshark mailing lists describing the problem? Have you tried
>>>>>> capturing other kinds of high traffic with less resource-hungry
>>>>>> capture frontends?
>>>>>>
>>>>>> Best regards,
>>>>>> Martin
>>>>>>
>>>> -------------------------------------------------------------------------
>>>> This SF.net email is sponsored by: Microsoft Defy all challenges.
>>>> Microsoft(R) Visual Studio 2008.
>>>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
>>>> _______________________________________________
>>>> tipc-discussion mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>