Okay, this should fix it - https://github.com/open-mpi/ompi/pull/3771 

> On Jun 27, 2017, at 6:31 AM, r...@open-mpi.org wrote:
> 
> Actually, the error message is coming from mpirun to indicate that it lost 
> connection to one (or more) of its daemons. This happens because slurm only 
> knows about the remote daemons - mpirun was started outside of “srun”, and so 
> slurm doesn’t know it exists. Thus, when slurm kills the job, it only kills 
> the daemons on the compute nodes, not mpirun. As a result, we always see that 
> error message.
> 
> The capability should exist as an option - it used to, but probably has 
> fallen into disrepair. I’ll see if I can bring it back.
> 
>> On Jun 27, 2017, at 3:35 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>> 
>> I would also be interested in having slurm keep the remaining processes
>> around; we have been struggling with this on many of the NERSC machines.
>> That being said, the error message comes from orted, and it suggests the
>> daemons are giving up because they lost connection to a peer. I was not
>> aware that this capability exists in the master version of ORTE, but if it
>> does, it makes our life easier.
>> 
>>   George.
>> 
>> 
>> On Tue, Jun 27, 2017 at 6:14 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>> Let me poke at it a bit tomorrow - we should be able to avoid the abort. 
>> It’s a bug if we can’t.
>> 
>> > On Jun 26, 2017, at 7:39 PM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>> >
>> > Hi Ralph,
>> >
>> > Thanks for the quick response.
>> >
>> > Just tried again, not under slurm, but with the same result... (though I
>> > just did kill -9 on the orted on the remote node this time)
>> >
>> > Any ideas?  Do you think my multiple-mpirun idea is worth trying?
>> >
>> > Cheers,
>> > Tim
>> >
>> >
>> > ```
>> > [user@bud96 mpi_resilience]$
>> > /d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh
>> > --host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery
>> > --debug-daemons $(pwd)/test
>> > ( some output from job here )
>> > ( I then do kill -9 `pgrep orted`  on pnod0331 )
>> > bash: line 1: 161312 Killed
>> > /d/home/user/2017/openmpi-master-20170608/bin/orted -mca
>> > orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608"
>> > -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
>> > "bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri
>> > "581828608.0;tcp://172.16.251.96 
>> > <http://172.16.251.96/>,172.31.1.254:58250 <http://172.31.1.254:58250/>" 
>> > -mca plm "rsh"
>> > -mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
>> > --------------------------------------------------------------------------
>> > ORTE has lost communication with a remote daemon.
>> >
>> >  HNP daemon   : [[8878,0],0] on node bud96
>> >  Remote daemon: [[8878,0],1] on node pnod0331
>> >
>> > This is usually due to either a failure of the TCP network
>> > connection to the node, or possibly an internal failure of
>> > the daemon itself. We cannot recover from this failure, and
>> > therefore will terminate the job.
>> > --------------------------------------------------------------------------
>> > [bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
>> > [bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - 
>> > exiting
>> > ```
>> >
>> > On 27 June 2017 at 12:19, r...@open-mpi.org <r...@open-mpi.org> wrote:
>> >> Ah - you should have told us you are running under slurm. That does 
>> >> indeed make a difference. When we launch the daemons, we do so with "srun 
>> >> --kill-on-bad-exit” - this means that slurm automatically kills the job 
>> >> if any daemon terminates. We take that measure to avoid leaving zombies 
>> >> behind in the event of a failure.
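>> >>
>> >> Roughly speaking, that launch looks like the sketch below (illustrative
>> >> only; the real orted argument list is longer and version-dependent):
>> >>
>> >> ```
>> >> # Sketch of how the slurm launcher starts the remote daemons. Because of
>> >> # --kill-on-bad-exit, slurm tears down the whole step when any orted dies.
>> >> # mpirun itself is not part of this step, so it survives and reports the
>> >> # lost connection instead.
>> >> srun --kill-on-bad-exit --nodes=<num_nodes> --ntasks=<num_nodes> \
>> >>      orted -mca orte_hnp_uri "<mpirun contact info>" <other orted options>
>> >> ```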
>> >>
>> >> Try adding “-mca plm rsh” to your mpirun cmd line. This will use the rsh 
>> >> launcher instead of the slurm one, which gives you more control.
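>> >>
>> >> For example (a minimal sketch; the host list and binary name are just
>> >> placeholders):
>> >>
>> >> ```
>> >> # Force the rsh/ssh launcher instead of the slurm one, so the daemons are
>> >> # not started through "srun --kill-on-bad-exit".
>> >> mpirun -mca plm rsh --host nodeA,nodeB -np 2 --npernode 1 \
>> >>        --enable-recovery ./your_app
>> >> ```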
>> >>
>> >>> On Jun 26, 2017, at 6:59 PM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>> >>>
>> >>> Hi Ralph, George,
>> >>>
>> >>> Thanks very much for getting back to me.  Alas, neither of these
>> >>> options seems to accomplish the goal.  Both in Open MPI v2.1.1 and on a
>> >>> recent master (7002535), with slurm's "--no-kill" and Open MPI's
>> >>> "--enable-recovery", once the node reboots one gets the following
>> >>> error:
>> >>>
>> >>> ```
>> >>> --------------------------------------------------------------------------
>> >>> ORTE has lost communication with a remote daemon.
>> >>>
>> >>> HNP daemon   : [[58323,0],0] on node pnod0330
>> >>> Remote daemon: [[58323,0],1] on node pnod0331
>> >>>
>> >>> This is usually due to either a failure of the TCP network
>> >>> connection to the node, or possibly an internal failure of
>> >>> the daemon itself. We cannot recover from this failure, and
>> >>> therefore will terminate the job.
>> >>> --------------------------------------------------------------------------
>> >>> [pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
>> >>> [pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
>> >>> ```
>> >>>
>> >>> I haven't yet tried the hard-reboot case with ULFM (these nodes take
>> >>> forever to come back up), but earlier experiments SIGKILLing the orted
>> >>> on a compute node led to a message very similar to the one above, so at
>> >>> this point I'm not optimistic...
>> >>>
>> >>> I think my next step is to try several separate mpiruns and use
>> >>> MPI_Comm_{connect,accept} to plumb everything together before the
>> >>> application starts.  I notice this is the subject of some recent work
>> >>> on ompi master.  Even though the mpiruns will all be associated with
>> >>> the same ompi-server, do you think this could be sufficient to isolate
>> >>> the failures?
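>> >>>
>> >>> For concreteness, the setup I have in mind looks roughly like this (a
>> >>> sketch only; option spellings may differ between Open MPI versions, and
>> >>> "master"/"worker" are placeholder binaries):
>> >>>
>> >>> ```
>> >>> # Start a standalone ompi-server as the common rendezvous point and
>> >>> # record its contact URI in a file on a shared filesystem.
>> >>> ompi-server --report-uri /shared/path/ompi-server.uri &
>> >>>
>> >>> # Launch the master and the workers as separate mpirun jobs, all
>> >>> # pointing at the same ompi-server.
>> >>> mpirun --ompi-server file:/shared/path/ompi-server.uri -np 1  ./master &
>> >>> mpirun --ompi-server file:/shared/path/ompi-server.uri -np 16 ./worker &
>> >>>
>> >>> # The jobs would then wire themselves together via MPI_Publish_name /
>> >>> # MPI_Lookup_name and MPI_Comm_accept / MPI_Comm_connect before doing
>> >>> # any real work, the hope being that a failure in one mpirun does not
>> >>> # take down the others.
>> >>> ```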
>> >>>
>> >>> Cheers,
>> >>> Tim
>> >>>
>> >>>
>> >>>
>> >>> On 10 June 2017 at 00:56, r...@open-mpi.org <r...@open-mpi.org> wrote:
>> >>>> It has been a while since I tested it, but I believe the
>> >>>> --enable-recovery option might do what you want.
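>> >>>>
>> >>>> Something along these lines (illustrative only):
>> >>>>
>> >>>> ```
>> >>>> # --enable-recovery asks the runtime to keep going when a daemon or
>> >>>> # process is lost; it corresponds to the orte_enable_recovery MCA
>> >>>> # parameter seen on the orted command line in the debug output.
>> >>>> mpirun --enable-recovery <usual options> ./your_app
>> >>>> ```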
>> >>>>
>> >>>>> On Jun 8, 2017, at 6:17 AM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>> >>>>>
>> >>>>> Hi!
>> >>>>>
>> >>>>> So I know from searching the archive that this is a repeated topic of
>> >>>>> discussion here, and apologies for that, but since it's been a year or
>> >>>>> so I thought I'd double-check whether anything has changed before
>> >>>>> really starting to tear my hair out too much.
>> >>>>>
>> >>>>> Is there a combination of MCA parameters or similar that will prevent
>> >>>>> ORTE from aborting a job when it detects a node failure?  This is
>> >>>>> using the tcp btl, under slurm.
>> >>>>>
>> >>>>> The application, not written by us and too complicated to re-engineer
>> >>>>> at short notice, has a strictly master-slave communication pattern.
>> >>>>> The master never blocks on communication from individual slaves, and
>> >>>>> apparently can itself detect slaves that have silently disappeared and
>> >>>>> reissue the work to those remaining.  So from an application
>> >>>>> standpoint I believe we should be able to handle this.  However, in
>> >>>>> all my testing so far the job is aborted as soon as the runtime system
>> >>>>> figures out what is going on.
>> >>>>>
>> >>>>> If not, do any users know of another MPI implementation that might
>> >>>>> work for this use case?  As far as I can tell, FT-MPI has been pretty
>> >>>>> quiet the last couple of years?
>> >>>>>
>> >>>>> Thanks in advance,
>> >>>>>
>> >>>>> Tim