Hi Elmer,
See my comments/questions below.
///jon
Horvath, Elmer wrote:
> "TIPC session ID mismatch and one-sided link tear-down"
>
> It may be prudent to put a check on the session ID in the state message
> reception to detect stale session IDs. Perhaps just drop offending
> messages and the link will eventually go down. I would hesitate to
> recommend dropping a link if the session ID is wrong in case an old (or
> malformed) packet happened to be received.
>
Yes, we should drop such messages, if only to comply with the protocol. It is
not a complete solution to the problem, though, because ongoing traffic may
still keep the link up for a while, either until the traffic becomes so low
that the link supervision mechanism kicks in and resets the link, or until
the other endpoint resets itself because of too many failed retransmission
attempts. (The sequence numbers will not match.)
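Roughly what I have in mind for the check, as a sketch only. This is Python
used as pseudocode, not the actual TIPC code, and the names LinkEndpoint,
peer_session and accept_state_msg are made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class LinkEndpoint:
    """Hypothetical receive-side view of one link endpoint."""
    peer_session: int   # session ID learned when the link was last (re)established
    dropped: int = 0    # number of STATE messages discarded as stale


def accept_state_msg(link: LinkEndpoint, msg_session: int) -> bool:
    """Return True if a STATE message should be processed, False if dropped.

    A mismatching session ID only causes the message to be dropped; the link
    itself is left alone, so a single old or malformed packet cannot tear it
    down.
    """
    if msg_session != link.peer_session:
        link.dropped += 1
        return False
    return True


link = LinkEndpoint(peer_session=7)
print(accept_state_msg(link, 7))   # matching session: processed
print(accept_state_msg(link, 6))   # stale session: dropped
print(link.dropped)
```

Counting the drops is optional, but it would at least make the condition
visible in the link statistics.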
But what really puzzles me is how this can occur in the first place.
If you have a look at the state diagram for link activation below, and the
corresponding text (pasted in from the tipc-protocol draft), you will see
that a link endpoint can never come back to WORKING_WORKING after a reset
without having confirmation that the other endpoint has also been reset,
i.e. having received either a "RESET" or an "ACTIVATE" message from the
other end.
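To make that claim concrete, here is a small sketch that encodes the
transitions as I read them from the figure below and checks the property
mechanically. It is Python used as pseudocode, not TIPC code: the event
names and the helper functions are my own, and the Blocked state is left
out for brevity.

```python
# States and transitions as read from the link FSM figure; this table is an
# interpretation of the diagram, not code taken from TIPC.
WW, WU = "Working-Working", "Working-Unknown"
RU, RR = "Reset-Unknown", "Reset-Reset"

TRANSITIONS = {
    (WW, "CHECKPOINT_EQ_LAST_REC"): WU,
    (WW, "RESET_MSG"): RR,
    (WU, "TRAFFIC"): WW,
    (WU, "ACTIVATE_MSG"): WW,
    (WU, "RESET_MSG"): RR,
    (WU, "NO_TRAFFIC_NO_PROBE_REPLY"): RU,
    (RU, "RESET_MSG"): RR,
    (RU, "ACTIVATE_MSG"): WW,
    (RR, "TRAFFIC"): WW,
    (RR, "ACTIVATE_MSG"): WW,
}

ALL_EVENTS = {event for (_, event) in TRANSITIONS}
# Everything except the two messages that confirm the peer has reset.
NON_CONFIRMING = ALL_EVENTS - {"RESET_MSG", "ACTIVATE_MSG"}


def step(state, event):
    """Return the next state; events not listed for a state are ignored."""
    return TRANSITIONS.get((state, event), state)


def reachable(start, events):
    """Return every state reachable from start using only the given events."""
    seen, todo = {start}, [start]
    while todo:
        state = todo.pop()
        for event in events:
            nxt = step(state, event)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


# Without a RESET_MSG or ACTIVATE_MSG the endpoint never leaves Reset-Unknown.
print(WW in reachable(RU, NON_CONFIRMING))
# With them, Working-Working becomes reachable again.
print(WW in reachable(RU, ALL_EVENTS))
```

If the real code can be shown to take a transition that is not in this
table, that would point to the first explanation below (a bug).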
This is the state machine we have been using for many years, and I have
had reason to come back to it on many occasions, only to confirm each time
that it really is watertight. I have not previously heard of any problem
like the one you describe.
I can only see two possible explanations for this: either the code doesn't
match the state machine (a bug somewhere), or a stray RESET or ACTIVATE is
received from the other end (or somewhere else?) even though the link is
up and running. I have seen the latter happen once, when I was using a switch
that sometimes delivered packets completely out of order, with several
seconds of delay. (This was the original reason for introducing the
session id.)
But this can only happen if the link has been recently started or reset,
as far as I can see.
Is the latter a possibility? How long has the link been up when this occurs?
Are you using a switch with any known issues? Have you seen this on Linux
too, or only on VxWorks?
<<<Pasted in: >>>>
2.6.2. Link Activation
Link activation and supervision is completely handled by the generic
part of the protocol, in contrast to the partially media-dependent
neighbour detection protocol.
The following FSM describes how a link is activated and supervised.
------------------------------------------------------------------------
 ---------------                               ---------------
|               |<--(CHECKPOINT == LAST_REC)--|               |
|               |                             |               |
|Working-Unknown|----TRAFFIC/ACTIVATE_MSG---->|Working-Working|
|               |                             |               |
|               |-------+      +-ACTIVATE_MSG>|               |
 ---------------         \    /                ------------A--
    |                     \  /                   |         |
    | NO TRAFFIC/          \/                RESET_MSG TRAFFIC/
    | NO PROBE             /\                    |   ACTIVATE_MSG
    | REPLY               /  \                   |         |
 ---V-----------         /    \                --V------------
|               |-------+      +--RESET_MSG-->|               |
|               |                             |               |
| Reset-Unknown |                             |  Reset-Reset  |
|               |----------RESET_MSG--------->|               |
|               |                             |               |
 -------------A-                               ---------------
  |           |
  | BLOCK/    | UNBLOCK/
  | CHANGEOVER| CHANGEOVER END
  | ORIG_MSG  |
 -V-------------
|               |
|               |
|    Blocked    |
|               |
|               |
 ---------------
* Figure 20: Link finite state machine *
------------------------------------------------------------------------
A link endpoint's state is defined by its own state, combined
with what is known about the other endpoint's state. The following
states exist:
Reset-Unknown
Own link endpoint reset, i.e. queues are emptied and sequence
numbers are set back to their initial values. The state of the peer
endpoint is unknown. LINK_PROTOCOL/RESET_MSG messages are sent
periodically at CONTINUITY_INTERVAL to inform the peer about this
endpoint's state, and to force it to reset its own endpoint, if this
has not already been done. If the peer endpoint is rebooting, or has
reset for some other reason, it will sooner or later also reach the
state Reset-Unknown, and start sending its own RESET_MSG messages
periodically. At least one of the endpoints, and often both, will
eventually receive a RESET_MSG and transfer to state Reset-Reset. If
the peer is still active, i.e. in one of the states Working-Working
or Working-Unknown, and has not yet detected the disturbance causing
this endpoint to reset, it will sooner or later receive a RESET_MSG,
and transfer directly to state Reset-Reset. If a LINK_PROTOCOL/
ACTIVATE_MSG message is received in this state, the link endpoint
knows that the peer is already in state Reset-Reset, and can itself
move directly on to state Working-Working. Any other messages are
ignored in this state. CONTINUITY_INTERVAL is calculated as the
smaller of LINK_TOLERANCE/4 and 0.5 sec.
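As a quick worked example of the CONTINUITY_INTERVAL rule above (a sketch
only; the function name and the sample tolerance values are mine and do not
come from the draft):

```python
def continuity_interval(link_tolerance_s: float) -> float:
    """Return CONTINUITY_INTERVAL in seconds for a given LINK_TOLERANCE.

    Implements the rule quoted above: the smaller of LINK_TOLERANCE/4
    and 0.5 seconds.
    """
    return min(link_tolerance_s / 4, 0.5)


# Illustrative tolerance values only, not defaults taken from the draft.
print(continuity_interval(1.5))   # 1.5 / 4 = 0.375, below the 0.5 s cap
print(continuity_interval(4.0))   # 4.0 / 4 = 1.0, so the 0.5 s cap applies
```

So for any tolerance of 2 seconds or more, RESET_MSG is sent every 0.5
seconds in this state.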
> Perhaps Jon or anyone else can comment on this and provide any
> recommendations on dealing with this, or why it may have been left out
> of the code before now. Perhaps it was just not observed as a problem
> and left out.
>
>
>
> [END]
>
> -------------------------------------------------------------------------
> SF.Net email is sponsored by: The Future of Linux Business White Paper
> from Novell. From the desktop to the data center, Linux is going
> mainstream. Let it simplify your IT future.
> http://altfarm.mediaplex.com/ad/ck/8857-50307-18918-4
> _______________________________________________
> tipc-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>