Galen M. Shipman wrote:
Gleb Natapov wrote:
On Mon, Oct 30, 2006 at 11:45:53AM -0700, Troy Telford wrote:
On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov <gl...@voltaire.com>
wrote:
If you use the OB1 PML (the default one), it will never recover from a
link down error, no matter how many other transports you have. The
reason is that OB1 never tracks what happens to buffers submitted to a
BTL. So if the BTL can't, for any reason, transmit a packet passed to it
by OB1, the job will get stuck, because OB1 doesn't have enough
information to try to resend the packet via another transport. For this
kind of resource tracking there is the DR PML. In the case of the IB
BTL, a link down event generates an error for each packet submitted for
sending to the device. The IB BTL simply discards all these packets and
relies on the PML to resend them, so even after a link up event a job
will not recover if the OB1 PML is used with the IB BTL. This may be
different with other transports.
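The difference described above can be sketched with a toy model. This is
only an illustration under stated assumptions, not Open MPI code: the
class and function names (FlakyTransport, send_without_tracking,
send_with_tracking) are hypothetical and do not correspond to the OB1,
DR, or BTL interfaces.

```python
class FlakyTransport:
    """Hypothetical stand-in for a transport that drops packets when its link is down."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.delivered = []

    def send(self, packet):
        # When the link is down the packet is simply discarded,
        # and the caller is only told that the send failed.
        if not self.up:
            return False
        self.delivered.append(packet)
        return True


def send_without_tracking(packet, transport):
    """Hand the packet off and forget it: a failure is never noticed."""
    transport.send(packet)


def send_with_tracking(packet, transports):
    """Keep the packet until some transport accepts it, trying each in turn."""
    for transport in transports:
        if transport.send(packet):
            return transport.name
    raise RuntimeError("no usable transport for packet %r" % (packet,))


if __name__ == "__main__":
    ib = FlakyTransport("ib", up=False)   # link is down
    tcp = FlakyTransport("tcp", up=True)

    send_without_tracking("p0", ib)       # silently lost
    print(ib.delivered)

    print(send_with_tracking("p1", [ib, tcp]))
```

In this toy, the untracked send loses the packet without anyone noticing,
while the tracked send still holds the packet and can fall back to the
second transport.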
This makes sense; one thing I'm wondering now is whether the OB1 PML is
able (and/or has enough information) to figure out that it can't
continue at all, and will abort the job.
In the case of the openib BTL I don't see how a job can recover from a
link down event, so I think aborting the job is the right thing to do.
For other transports, if the transport can continue after a link up
event as if nothing happened, it is worth waiting for link up. This
capability could be added to the openib BTL too; it's just that nobody
cares enough.
Ethernet doesn't fail in this case because the TCP stack handles this
gracefully. The same behavior can be observed when disconnecting an
Ethernet cable while an ssh session is open: plug it back in and you are
probably good to go after a bit of time (due to exponential backoff on
retransmission). For GM/MX over Myrinet the timeout on connection down
is quite high, and the software stack handles this gracefully. For IB
the link state transitions from LinkActive to LinkActDefer until
LinkDownTimeout expires and the link transitions to the LinkDown state.
From the spec: LinkDownTimeout occurs when the port state machine has
continuously been in the LinkActDefer state for 10 ms +3%/-51%. I have
no idea what that formula means; perhaps my PDF of the spec is messed
up.
Okay, so these are percentages, not modulus; the formula makes some
sense now. So the timeout is between 4.9 and 10.3 ms. You had better
plug the cable in/out very quickly ;-)
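The arithmetic behind that range can be checked with a short sketch. It
assumes the reading given above, namely a nominal 10 ms with a tolerance
of +3% and -51%; the function and constant names are made up for
illustration.

```python
NOMINAL_MS = 10.0   # nominal LinkDownTimeout as quoted from the spec
UPPER_TOL = 0.03    # +3%
LOWER_TOL = 0.51    # -51%


def link_down_timeout_bounds(nominal_ms=NOMINAL_MS,
                             upper=UPPER_TOL,
                             lower=LOWER_TOL):
    """Return (min_ms, max_ms) for a nominal value with asymmetric tolerance."""
    return nominal_ms * (1.0 - lower), nominal_ms * (1.0 + upper)


if __name__ == "__main__":
    low, high = link_down_timeout_bounds()
    print("%.1f ms to %.1f ms" % (low, high))
```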
So the transition to the LinkDown state is dictated by the IB spec. It
would seem that we would want to defer the transition based on a
user-configurable parameter. This is the link layer, so it would
probably be necessary to do this when loading the IB driver. Am I
interpreting this correctly?
- Galen
--
Gleb.
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users