Hi,
OK, so it seems we have pinpointed the problem.
Luckily, the solution is extremely simple, and
backwards compatible.
The 5 bits next to the MSB of the gap field happen to be
unused, and are exactly what we need for a future-proof
gap-field size. (If we can lose a 512-packet block
on a 1 Gb link, how many can we lose on a 100 Gb
link?)
I suggest the following solution, i.e. extending the
bitmask in msg_set_bits()/msg_bits() from 8 to 13 bits:
static inline u32 msg_seq_gap(struct tipc_msg *m)
{
	return msg_bits(m, 1, 16, 0x1fff);
}

static inline void msg_set_seq_gap(struct tipc_msg *m, u32 n)
{
	msg_set_bits(m, 1, 16, 0x1fff, n);
}
I leave it to Allan to integrate this into the 1.7.5 code, since delivering
it to David M would only screw up the patches Allan has in his pipe.
And anyway, I have not actually tested it.
Regards
///jon
Xpl++ wrote:
> Hi Jon,
>
> See below.
>
> Jon Maloy wrote:
>> I think this is calculated correctly in our case, but the rec_gap
>> passed into tipc_send_proto_msg() gets overwritten by
>> that routine. This is normally correct, since the gap should be
>> adjusted according to what is present
>> in the deferred-queue, in order to avoid retransmitting more packets
>> than necessary.
>>
>> The code I was referring to is the following, where 'gap' initially
>> is set to the 'rec_gap' calculated above.
>>
>> if (l_ptr->oldest_deferred_in) {
>> u32 rec = msg_seqno(buf_msg(l_ptr->oldest_deferred_in));
>> gap = mod(rec - mod(l_ptr->next_in_no));
>> }
>>
>> msg_set_seq_gap(msg, gap);
>> .....
>>
>> When the protocol gets stuck, 'rec_gap' should be found to be (54992
>> - 53968) = 1024
>> Since the result is non-zero, tipc_link_send_proto_msg() is called.
>> Inside that routine three things can happen:
>> 1) l_ptr->oldest_deferred_in is NULL. This means that 'gap' will
>> retain its value of 1024.
>> This leads us into case 3) below.
> How can l_ptr->oldest_deferred_in be NULL? I don't think this is our case.
>> 2) The calculation of 'gap' overwrites the original value. If this
>> value is always zero,
>> the protocol will bail out. Can this happen?
>> 3) msg_set_seq_gap() always writes a zero into the message.
>> Actually, this is fully possible. The field for 'gap' is only 8 bits
>> long, so any gap size which is a multiple of 256 will give a zero.
>> Looking at the dump, this looks very possible: the first packet loss
>> is not 95 packets, as I stated in my first mail, but
>> 54483 - 53967 = 525 packets. This is counting only from what we see
>> in Wireshark, which we have reason to suspect doesn't show all
>> packets. So the real value might quite well be 512. And if this
>> happens, we are stuck forever, because the head of the
>> deferred-queue will never move.
> This seems to be the case we are seeing in the dump. 525 is just too
> close to 512 :)
>> My question to Peter is: How often does this happen? Every time? Often?
>> If it happens often, can it be that the Ethernet driver has a habit
>> of throwing away blocks of packets which are exactly a multiple of
>> 256 or 512? (These computer programmers...)
> Considering the amount of traffic I have on the nodes and the packet
> drop rate relative to the frequency of occurrence of this problem, we
> can safely call it a rare condition (... like a 1 in 256 chance when a
> packet drop occurs ;) ). It is not predictable under normal operation,
> while very easy to cause using a stress test, though it usually took
> me between 2 and 10 attempts before I could make the link stall.
> I reduced the link window to 224 when I realized the gap field is
> 8 bits wide, and I haven't seen any problems since then.
> However, it's worth noting that some time ago, when the cluster was
> working over a 100 Mbit net, a window of < 256 was causing more
> trouble than a window of, say, > 512 when facing a high traffic/packet
> rate, and that is actually the reason I ended up using windows of up
> to 4096.
> Being a good programmer, I was trying powers of 2 for link window
> values up until I got to 4096, when my troubles almost disappeared
> during the 100 Mbit era.
> When we switched to Gbit LAN, the picture changed quite dramatically.
> Regarding the magic number 256 - it seems that most of my e1000 NICs
> have a 256-entry TX/RX descriptor table, so it may have something to
> do with the issue .. but who knows :)
>
>> Anyway, we have clearly found a potential problem which must be
>> resolved. With window sizes > 255, scenario 3) is bound to happen
>> now and then. Whether this is Peter's problem remains to be seen.
> So far it seems that the protocol may be having issues with link
> windows >= 256.
> I'd guess it will be a good idea to add the gap check you proposed
> anyway.
>
> Regards,
> Peter.
>
_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion