Re: [OMPI users] Fault Tolerance & Behavior

George Bosilca Thu, 26 Oct 2006 17:12:00 -0400

The Open MPI behavior is the same regardless of the network used for the job, at least as far as our internal message passing layer is concerned. But for this to happen, the network has to warn us that something is wrong (such as a timeout). In the case of TCP (and Myrinet), the timeout is so high that Open MPI was never informed that something went wrong (we print out some warnings when this happens). It was happily waiting for a message to complete ... Once the network cable was reconnected, the network device itself recovered and resumed the communication, leading to a correct send operation (and this without involving Open MPI at all). There is nothing (that has a reasonable cost) we can do about this.

For IB, it looks like the network timeout is smaller. Open MPI knew that something was wrong (the output proves it), and tried to continue using the other available devices. If none are available, then Open MPI is supposed to abort the job. For your particular run, did you have Ethernet between the nodes? If yes, I'm quite sure the MPI run wasn't stopped ... it continued using the TCP device (if not disabled by hand at mpirun time).

That's not what is supposed to happen right now. If there are other devices (such as TCP), the MPI job will print out some warnings and will continue over the remaining networks (some will continue to use the other networks; only the peer where the network went down is affected). If the network timeout is too high, Open MPI will never notice that something went wrong. At least not the default message layer (PML).

If you want the job to abort when your main network goes down, disable the use of the other available networks. More specifically, disable TCP. A simple way to do it is to add the following argument to your mpirun command:


--mca btl ^tcp (or --mca btl openib,sm,self).
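
As a rough sketch, the full command lines could look like the following. The process count (4) and the executable name (./my_app) are placeholders chosen for illustration; they are not from the original exchange.

```shell
# Hypothetical job: 4 processes running ./my_app (placeholders, not from the thread).

# Exclude only the TCP BTL; every other available BTL stays eligible.
# The caret is quoted so that shells which treat ^ specially leave it alone.
mpirun --mca btl '^tcp' -np 4 ./my_app

# Or list the allowed BTLs explicitly: InfiniBand (openib), shared memory, and self.
mpirun --mca btl openib,sm,self -np 4 ./my_app
```

With either form there is no TCP path left to fall back on between nodes, so a failure of the main network should lead to the abort described above rather than a silent switch to Ethernet.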

  Thanks,
    george.

PS: There are several internal message passing modules available for Open MPI. The default one favors performance over reliability. If reliability is what you need, you should use the DR PML. For this, you can specify --mca pml dr at mpirun time. This (DR) PML has data reliability and timeouts (Open MPI internal timeouts that are configurable), allowing it to recover faster from a network failure.
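
A minimal sketch of selecting the DR PML, assuming an Open MPI build that includes it and that ompi_info is on the PATH. The process count and ./my_app are placeholders, not taken from the thread.

```shell
# List the PML components this installation provides; "dr" should appear
# among them if the DR PML was built.
ompi_info | grep "MCA pml"

# Hypothetical job: 4 processes running ./my_app, using the DR PML
# instead of the default one.
mpirun --mca pml dr -np 4 ./my_app
```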



On Oct 26, 2006, at 3:52 PM, Troy Telford wrote:

I've recently had the chance to see how Open MPI (as well as other MPIs)
behave in the case of network failure.

I've looked at what happens when a node has its network connection
disconnected in the middle of a job, with Ethernet, Myrinet (GM), and
InfiniBand (OpenIB).

With Ethernet and Myrinet, the job more or less pauses until the cable is
re-connected.  (I imagine timeouts still apply, but I wasn't patient
enough to wait for them)

With InfiniBand, the job pauses and Open MPI throws a few error messages.
After the cable is plugged back in (and the SM catches up), the job
remains where it was when it was paused. I'd guess that part of this is
that the timeout is much shorter with IB than with Myri or Ethernet, and
that when I unplug the IB cable, it times out fairly quickly (and then
Open MPI throws its error messages).

At any rate, the thought occurs (and it may just be my ignorance of MPI):
After a network connection times out (as was apparently the case with IB),
is the job salvageable? If the jobs are not salvageable, why didn't Open
MPI abort the job (and clean up the running processes on the nodes)?
--
Troy Telford