Better suggestion:

    if (gap > 255)
            gap = 255;

Otherwise 257 would give a gap of 1, 258 a gap of 2, etc. Clearly not
optimal.
///jon
Jon Maloy wrote:
> Hi Peter,
> See my previous mail.
> It should be pretty easy for you to verify my hypothesis. Just add a
>
>     if ((gap % 256) == 0)
>             gap = 255;
>
> into the function msg_set_seq_gap() in tipc_link.h. This should
> be a functional work-around for the problem.
>
> ///jon
>
>
> Jon Maloy wrote:
>
>> See below.
>> I think we are close now.
>> If any of you want the dump, you will have to ask Peter directly,
>> since he was reluctant to send it out to tipc-discussion.
>>
>> Regards
>> ///jon
>>
>> Horvath, Elmer wrote:
>>
>>
>>> Hi,
>>>
>>> This is very interesting. From the description by Jon (I did not see the
>>> Wireshark trace posted), the gap value is being calculated as 0 when it
>>> should be calculated as a non-zero value.
>>>
>>> We actually encountered a similar, but different, issue internally and
>>> believed it to be a compiler problem (we were not compiling with GCC). The
>>> target was an E500 (8560 based PPC system) and was compiled with software
>>> floating point (though no floating point code is in TIPC that I know of).
>>>
>>> In our case, the calculation was effectively subtracting 1 from 1 and
>>> getting a 1. The node would then send a NAK falsely asking for
>>> retransmissions on packets it did in fact receive.
>>>
>>> The incorrect calculation for us was in tipc_link.c in the routine
>>> link_recv_proto_msg() calculating the value of the variable 'rec_gap'. The
>>> code is:
>>>   if (less_eq(mod(l_ptr->next_in_no), msg_next_sent(msg))) {
>>>           rec_gap = mod(msg_next_sent(msg) -
>>>                         mod(l_ptr->next_in_no));
>>>   }
>>>
>>>
>>>
>> I think this is calculated correctly in our case, but the rec_gap
>> passed into tipc_link_send_proto_msg() gets overwritten by that
>> routine. This is normally correct, since the gap should be adjusted
>> according to what is present in the deferred queue, in order to avoid
>> retransmitting more packets than necessary.
>>
>> The code I was referring to is the following, where 'gap' initially is
>> set to the 'rec_gap' calculated above.
>>
>>   if (l_ptr->oldest_deferred_in) {
>>           u32 rec = msg_seqno(buf_msg(l_ptr->oldest_deferred_in));
>>           gap = mod(rec - mod(l_ptr->next_in_no));
>>   }
>>
>>   msg_set_seq_gap(msg, gap);
>>   .....
>>
>> When the protocol gets stuck, 'rec_gap' should be found to be
>> (54992 - 53968) = 1024.
>> Since the result is non-zero, tipc_link_send_proto_msg() is called.
>> Inside that routine three things can happen:
>> 1) l_ptr->oldest_deferred_in is NULL. This means that 'gap' will
>>    retain its value of 1024. This leads us into case 3) below.
>> 2) The calculation of 'gap' overwrites the original value. If this
>>    value is always zero, the protocol will bail out. Can this happen?
>> 3) msg_set_seq_gap() always writes a zero into the message.
>>    Actually, this is fully possible. The field for 'gap' is only 8
>>    bits long, so any gap size which is a multiple of 256 will give a
>>    zero. Looking at the dump, this looks very possible: the first
>>    packet loss is not 95 packets, as I stated in my first mail, but
>>    54483 - 53967 = 516 packets. This is counting only from what we
>>    see in Wireshark, which we have reason to suspect doesn't show all
>>    packets. So the real value might quite well be 512. And if this
>>    happens, we are stuck forever, because the head of the deferred
>>    queue will never move.
>>
>> My question to Peter is: How often does this happen? Every time?
>> Often? If it happens often, can it be that the Ethernet driver has
>> the habit of throwing away blocks of packets which are exactly a
>> multiple of 256 or 512? (These computer programmers...)
>>
>> Anyway, we have clearly found a potential problem which must be
>> resolved. With window sizes > 255, scenario 3) is bound to happen now
>> and then. Whether this is Peter's problem remains to be seen.
>>
>>
>>
>>
>>> This resulted in rec_gap being non-zero even though both operands were the
>>> same value. When rec_gap is non-zero, then tipc_link_send_proto_msg() is
>>> called with a non-zero gap value a bit further down in the same routine.
>>>
>>> Adding instrumentation sometimes fixed the problem; doing the exact same
>>> calculation again immediately following this code would yield the correct
>>> gap value. Very bizarre.
>>>
>>> We attributed this to a compiler issue.
>>>
>>> I don't know if this is the same issue, but it surely sounds similar enough
>>> to be noted. And this may be another place to check since it calls
>>> tipc_link_send_proto_msg() after receiving a state message.
>>>
>>> Elmer
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jon Maloy
>>> Sent: Tuesday, March 04, 2008 7:51 PM
>>> To: Xpl++; [EMAIL PROTECTED]; [email protected]
>>> Subject: Re: [tipc-discussion] RE : Re: Link related question/issue
>>>
>>> Hi Peter,
>>>
>>> I see two interesting patterns:
>>>
>>> a: After the packet loss has started at packet
>>> 14191, state messages from 1.1.12 always come
>>> in pairs, with the same timestamp.
>>>
>>> b: Also, after the problems have started, all
>>> state messages 1.1.6->1.1.12, even when they are
>>> not probes, are immediately followed by a state
>>> message in the opposite direction.
>>> This is a strong indication that the receiver
>>> (1.1.12) actually detects the gap from the state
>>> message contents, and sends out a new state
>>> message (a NACK), but for some reason the gap
>>> value never makes it into that message.
>>> Hence, tipc_link_send_proto_msg(),
>>> where the gap is calculated and added
>>> (line 2135 in tipc_link.c, tipc-1.7.5), seems
>>> to be a good place to start
>>> looking.
>>> I strongly suspect that the gap calculated at
>>> lines 2128-2129 always yields 0, or that
>>> no packets ever make it into the deferred queue
>>> (via tipc_link_defer_pkt()).
>>> That would be consistent with what we see.
>>>
>>> Regards
>>> ///jon
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Xpl++
>>> Sent: March 4, 2008 2:17 PM
>>> To: [EMAIL PROTECTED]; '[email protected]'
>>> Subject: Re: [tipc-discussion] RE : Re: Link related question/issue
>>>
>>> Hi,
>>>
>>> So .. what about that TODO comment in tipc_link.c regarding the stronger
>>> seq# checking and stuff? :) Since I managed to stabilize my cluster I must
>>> proceed with a software upgrade (deadlines :( ...) and will be able to
>>> start looking into the link code sometime tomorrow evening. In the
>>> meantime, any ideas as to where/what to look at would be highly
>>> appreciated ;)
>>>
>>> Regards,
>>> Peter.
>>>
>>> Jon Paul Maloy wrote:
>>>
>>>
>>>
>>>> Hi,
>>>> Your analysis makes sense, but it still doesn't explain why TIPC
>>>> cannot handle this quite commonplace situation.
>>>> Yesterday, I forgot one essential detail: Even State messages contain
>>>> info to help the receiver detect a gap. The "next_sent" sequence
>>>> number tells the receiver if it is out of synch with the sender, and
>>>> gives it a chance to send a NACK (a State with gap != 0). Since
>>>> State-packets clearly are received, otherwise the link would go down,
>>>> there must be some bug in tipc that causes the gap to be calculated
>>>> wrong, or not at all. Neither does it look like the receiver is
>>>> sending a State _immediately_ after a gap has occurred, which it
>>>> should.
>>>> So, I think we are looking for some serious bug within tipc that
>>>> completely cripples the retransmission protocol. We should try to
>>>> backtrack and find out in which version it has been introduced.
>>>>
>>>> ///jon
>>>>
>>>>
>>>> --- Xpl++ <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Some more info about my systems:
>>>>> - all nodes that tend to drop packets are quite loaded, though very
>>>>> rarely one can see cpu #0 being 100% busy
>>>>> - there are also a few multithreaded tasks that are bound to cpu #0
>>>>> and running in SCHED_RR. All of them use tipc. None of them uses
>>>>> the maximum scheduler priority, and they use very little cpu time
>>>>> and do not tend to make any peaks
>>>>> - there is one task that runs in SCHED_RR at maximum priority 99/RT
>>>>> (it really does a very very important job), which uses around 1ms
>>>>> of cpu every 4 seconds, and it is explicitly bound to cpu #0
>>>>> - all other tasks (mostly apache & php/perl) are free to run on any
>>>>> cpu
>>>>> - all of these nodes also have considerable io load
>>>>> - the kernel has irq balancing and pretty much all irqs are
>>>>> balanced, except for the nic irqs. They are always serviced by
>>>>> cpu #0
>>>>> - to create the packet drop issue I have to mildly stress the node,
>>>>> which would normally mean a moment when apache would try to start
>>>>> some extra children, which would also cause the number of
>>>>> simultaneously running php scripts to rise, while at the same time
>>>>> the incoming network traffic is also rising. The stress is preceded
>>>>> by a few seconds of high input packet rate, which may be causing
>>>>> even more stress on the scheduler and cpu starvation
>>>>> - wireshark is dropping packets (surprisingly many, as it seems),
>>>>> tipc is confused .. and all of it is related to moments of general
>>>>> cpu starvation, and an even worse one at cpu #0
>>>>>
>>>>> Then it all started adding up ..
>>>>> I moved all non-SCHED_OTHER tasks to other cpus, as well as a few
>>>>> other services. The result: 30% of the nodes showed between 5 and
>>>>> 200 packets dropped for the whole stress routine, which did not
>>>>> affect TIPC operation; nametables were in sync, and all
>>>>> communications seemed to work properly.
>>>>> Though this solves my problems, it is still very unclear what may
>>>>> have been happening in the kernel and in the tipc stack that is
>>>>> causing this bizarre behavior.
>>>>> SMP systems alone are tricky, and when adding load and
>>>>> pseudo-realtime tasks the situation seems to become really
>>>>> complicated. One really cool thing to note is that Opteron based
>>>>> nodes handle high load and cpu starvation much better than Xeon
>>>>> ones .. which only confirms an old observation of mine, that for
>>>>> some reason (that must be the design/architecture?) Opterons appear
>>>>> _much_ more interactive/responsive than Xeons under heavy load ..
>>>>> Another note, this one on TIPC - the link window for 100mbit nets
>>>>> should be at least 256 if one wants to do any serious communication
>>>>> between a dozen or more nodes. Also, for a gbit net, link windows
>>>>> above 1024 seem to really confuse the stack when faced with a high
>>>>> output packet rate.
>>>>>
>>>>> Regards,
>>>>> Peter Litov.
>>>>>
>>>>>
>>>>> Martin Peylo wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'll try to help with the Wireshark side of this problem.
>>>>>>
>>>>>> On 3/4/08, Jon Maloy <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> Strangely enough, node 1.1.12 continues to ack packets which we
>>>>>>> don't see in wireshark (is it possible that wireshark can miss
>>>>>>> packets?). It goes on acking packets up to the one with sequence
>>>>>>> number 53967 (one of the "invisible" packets), but from there on
>>>>>>> it stops.
>>>>>>
>>>>>> I've never encountered Wireshark missing packets so far. While it
>>>>>> sounds as if it wouldn't be a problem with the TIPC dissector,
>>>>>> could you please send me a trace file so I can definitely exclude
>>>>>> this cause of defect? I've tried to get it from the link quoted in
>>>>>> the mail from Jon, but it seems it was already removed.
>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>> As a sum of this, I start to suspect your Ethernet driver. It
>>>>>>> seems like it sometimes delivers packets to TIPC which it does
>>>>>>> not deliver to Wireshark, and vice versa. This seems to happen
>>>>>>> after a period of high traffic, and only with messages beyond a
>>>>>>> certain size, since the State messages always go through.
>>>>>>> Can you see any pattern in the direction the links go stale, with
>>>>>>> reference to which driver you are using? (E.g., is there always
>>>>>>> an e1000 driver involved on the receiving end in the stale
>>>>>>> direction?) Does this happen when you only run one type of
>>>>>>> driver?
>>>>>>
>>>>>> I've not yet gone that deep into packet capture, so I can't say
>>>>>> much about that. Peter, could you send a mail to one of the
>>>>>> Wireshark mailing lists describing the problem? Have you tried
>>>>>> capturing other kinds of high traffic with less resource-hungry
>>>>>> capture frontends?
>>>>>>
>>>>>> Best regards,
>>>>>> Martin
>>>>>>
>>>> -------------------------------------------------------------------------
>>>> This SF.net email is sponsored by: Microsoft Defy all challenges.
>>>> Microsoft(R) Visual Studio 2008.
>>>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
>>>> _______________________________________________
>>>> tipc-discussion mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>