Let me poke at it a bit tomorrow - we should be able to avoid the abort. It’s a 
bug if we can’t.

> On Jun 26, 2017, at 7:39 PM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
> 
> Hi Ralph,
> 
> Thanks for the quick response.
> 
> Just tried again, this time not under slurm, but got the same result...
> (though this time I just did kill -9 on the orted on the remote node)
> 
> Any ideas?  Do you think my multiple-mpirun idea is worth trying?
> 
> Cheers,
> Tim
> 
> 
> ```
> [user@bud96 mpi_resilience]$
> /d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh
> --host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery
> --debug-daemons $(pwd)/test
> ( some output from job here )
> ( I then do kill -9 `pgrep orted`  on pnod0331 )
> bash: line 1: 161312 Killed
> /d/home/user/2017/openmpi-master-20170608/bin/orted -mca
> orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608"
> -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
> "bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri
> "581828608.0;tcp://172.16.251.96,172.31.1.254:58250" -mca plm "rsh"
> -mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
> --------------------------------------------------------------------------
> ORTE has lost communication with a remote daemon.
> 
>  HNP daemon   : [[8878,0],0] on node bud96
>  Remote daemon: [[8878,0],1] on node pnod0331
> 
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --------------------------------------------------------------------------
> [bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
> [bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - exiting
> ```
> 
> On 27 June 2017 at 12:19, r...@open-mpi.org <r...@open-mpi.org> wrote:
>> Ah - you should have told us you are running under slurm. That does
>> indeed make a difference. When we launch the daemons, we do so with
>> "srun --kill-on-bad-exit" - this means that slurm automatically kills
>> the job if any daemon terminates. We take that measure to avoid leaving
>> zombies behind in the event of a failure.
>> 
>> Try adding "-mca plm rsh" to your mpirun command line. This will use
>> the rsh launcher instead of the slurm launcher, which gives you more
>> control.
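>> 
>> For example (host names and the executable are placeholders):
>> 
>> ```
>> mpirun --mca plm rsh --host nodeA,nodeB -np 2 ./my_app
>> ```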
>> 
>>> On Jun 26, 2017, at 6:59 PM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>>> 
>>> Hi Ralph, George,
>>> 
>>> Thanks very much for getting back to me.  Alas, neither of these
>>> options seems to accomplish the goal.  In both Open MPI v2.1.1 and a
>>> recent master (7002535), with slurm's "--no-kill" and Open MPI's
>>> "--enable-recovery", once the node reboots one gets the following
>>> error:
>>> 
>>> ```
>>> --------------------------------------------------------------------------
>>> ORTE has lost communication with a remote daemon.
>>> 
>>> HNP daemon   : [[58323,0],0] on node pnod0330
>>> Remote daemon: [[58323,0],1] on node pnod0331
>>> 
>>> This is usually due to either a failure of the TCP network
>>> connection to the node, or possibly an internal failure of
>>> the daemon itself. We cannot recover from this failure, and
>>> therefore will terminate the job.
>>> --------------------------------------------------------------------------
>>> [pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
>>> [pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
>>> ```
>>> 
>>> I haven't yet tried the hard-reboot case with ULFM (these nodes take
>>> forever to come back up), but earlier experiments SIGKILLing the orted
>>> on a compute node led to a message very similar to the one above, so
>>> at this point I'm not optimistic...
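>>> 
>>> (For reference, ULFM-style recovery would look roughly like the sketch
>>> below; this assumes a build that exposes the MPIX_ fault-tolerance
>>> prototypes in <mpi-ext.h>, and the helper is my own illustration.)
>>> 
>>> ```
>>> #include <mpi.h>
>>> #include <mpi-ext.h>  /* MPIX_Comm_revoke/shrink in ULFM builds */
>>> 
>>> /* Receive from a slave; on a process failure, shrink the communicator
>>>  * and carry on with the survivors. */
>>> int recv_or_shrink(int *buf, int src, MPI_Comm *comm) {
>>>     MPI_Comm_set_errhandler(*comm, MPI_ERRORS_RETURN); /* don't abort */
>>>     int rc = MPI_Recv(buf, 1, MPI_INT, src, 0, *comm, MPI_STATUS_IGNORE);
>>>     if (rc == MPIX_ERR_PROC_FAILED || rc == MPIX_ERR_REVOKED) {
>>>         MPI_Comm shrunk;
>>>         MPIX_Comm_revoke(*comm);           /* unblock the other ranks */
>>>         MPIX_Comm_shrink(*comm, &shrunk);  /* drop the failed ranks */
>>>         *comm = shrunk;
>>>     }
>>>     return rc;
>>> }
>>> ```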
>>> 
>>> I think my next step is to try several separate mpiruns and use
>>> MPI_Comm_{connect,accept} to plumb everything together before the
>>> application starts.  I notice this is the subject of some recent work
>>> on ompi master.  Even though the mpiruns will all be associated with
>>> the same ompi-server, do you think this could be sufficient to isolate
>>> the failures?
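>>> 
>>> A minimal sketch of that plan, assuming the standard MPI dynamic-process
>>> calls, all mpiruns pointed at the same ompi-server, and placeholder
>>> names throughout ("my-job", the server/client split):
>>> 
>>> ```
>>> #include <mpi.h>
>>> #include <string.h>
>>> 
>>> /* One mpirun plays server, the rest connect; the resulting
>>>  * inter-communicators tie the separate jobs together. */
>>> int main(int argc, char **argv) {
>>>     MPI_Init(&argc, &argv);
>>>     char port[MPI_MAX_PORT_NAME];
>>>     MPI_Comm inter;
>>> 
>>>     if (argc > 1 && strcmp(argv[1], "server") == 0) {
>>>         MPI_Open_port(MPI_INFO_NULL, port);
>>>         MPI_Publish_name("my-job", MPI_INFO_NULL, port);
>>>         MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
>>>     } else {
>>>         MPI_Lookup_name("my-job", MPI_INFO_NULL, port);
>>>         MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
>>>     }
>>> 
>>>     /* ... exchange work over the inter-communicator; the hope is that
>>>      * a daemon failure in one mpirun leaves the others running ... */
>>>     MPI_Comm_disconnect(&inter);
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>> ```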
>>> 
>>> Cheers,
>>> Tim
>>> 
>>> 
>>> 
>>> On 10 June 2017 at 00:56, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>>> It has been a while since I tested it, but I believe the
>>>> --enable-recovery option might do what you want.
>>>> 
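>>>> For example (host names and the executable path are placeholders):
>>>> 
>>>> ```
>>>> mpirun --enable-recovery -np 2 --npernode 1 --host nodeA,nodeB ./test
>>>> ```
>>>> 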
>>>>> On Jun 8, 2017, at 6:17 AM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>>>>> 
>>>>> Hi!
>>>>> 
>>>>> So I know from searching the archive that this is a recurring topic
>>>>> of discussion here, and apologies for that, but since it's been a
>>>>> year or so I thought I'd double-check whether anything has changed
>>>>> before really starting to tear my hair out.
>>>>> 
>>>>> Is there a combination of MCA parameters or similar that will prevent
>>>>> ORTE from aborting a job when it detects a node failure?  This is
>>>>> using the tcp btl, under slurm.
>>>>> 
>>>>> The application, not written by us and too complicated to re-engineer
>>>>> at short notice, has a strictly master-slave communication pattern.
>>>>> The master never blocks on communication from individual slaves, and
>>>>> apparently can itself detect slaves that have silently disappeared and
>>>>> reissue the work to those remaining.  So from an application
>>>>> standpoint I believe we should be able to handle this.  However, in
>>>>> all my testing so far the job is aborted as soon as the runtime system
>>>>> figures out what is going on.
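>>>>> 
>>>>> The pattern is roughly the following (my own illustrative sketch in
>>>>> C, not the actual application's code):
>>>>> 
>>>>> ```
>>>>> #include <mpi.h>
>>>>> 
>>>>> /* The master never blocks on any single slave: it keeps one
>>>>>  * non-blocking receive outstanding per slave and polls, so a
>>>>>  * silently dead slave simply stops being assigned work. */
>>>>> void master_loop(int nslaves, int ntasks) {
>>>>>     MPI_Request req[nslaves];
>>>>>     int result[nslaves];
>>>>> 
>>>>>     for (int s = 0; s < nslaves; s++)
>>>>>         MPI_Irecv(&result[s], 1, MPI_INT, s + 1, 0,
>>>>>                   MPI_COMM_WORLD, &req[s]);
>>>>> 
>>>>>     while (ntasks > 0) {
>>>>>         int idx, flag;
>>>>>         MPI_Testany(nslaves, req, &idx, &flag, MPI_STATUS_IGNORE);
>>>>>         if (flag && idx != MPI_UNDEFINED) {
>>>>>             ntasks--;   /* slave idx+1 answered: record the result, */
>>>>>                         /* send it more work, repost the Irecv      */
>>>>>         } else {
>>>>>             /* nobody answered: check per-slave timers and reissue
>>>>>              * work held by slaves that have been silent too long */
>>>>>         }
>>>>>     }
>>>>> }
>>>>> ```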
>>>>> 
>>>>> If not, do any users know of another MPI implementation that might
>>>>> work for this use case?  As far as I can tell, FT-MPI has been pretty
>>>>> quiet for the last couple of years.
>>>>> 
>>>>> Thanks in advance,
>>>>> 
>>>>> Tim