Let me poke at it a bit tomorrow - we should be able to avoid the abort. It’s a bug if we can’t.
> On Jun 26, 2017, at 7:39 PM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>
> Hi Ralph,
>
> Thanks for the quick response.
>
> Just tried again not under slurm, but the same result... (though I
> just did kill -9 orted on the remote node this time)
>
> Any ideas? Do you think my multiple-mpirun idea is worth trying?
>
> Cheers,
> Tim
>
> ```
> [user@bud96 mpi_resilience]$ /d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh --host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery --debug-daemons $(pwd)/test
> ( some output from job here )
> ( I then do kill -9 `pgrep orted` on pnod0331 )
> bash: line 1: 161312 Killed  /d/home/user/2017/openmpi-master-20170608/bin/orted -mca orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex "bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri "581828608.0;tcp://172.16.251.96,172.31.1.254:58250" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
> --------------------------------------------------------------------------
> ORTE has lost communication with a remote daemon.
>
> HNP daemon   : [[8878,0],0] on node bud96
> Remote daemon: [[8878,0],1] on node pnod0331
>
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --------------------------------------------------------------------------
> [bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
> [bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - exiting
> ```
>
> On 27 June 2017 at 12:19, r...@open-mpi.org <r...@open-mpi.org> wrote:
>> Ah - you should have told us you are running under slurm. That does indeed
>> make a difference.
>> When we launch the daemons, we do so with "srun --kill-on-bad-exit" -
>> this means that slurm automatically kills the job if any daemon
>> terminates. We take that measure to avoid leaving zombies behind in
>> the event of a failure.
>>
>> Try adding "-mca plm rsh" to your mpirun cmd line. This will use the
>> rsh launcher instead of the slurm one, which gives you more control.
>>
>>> On Jun 26, 2017, at 6:59 PM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>>>
>>> Hi Ralph, George,
>>>
>>> Thanks very much for getting back to me. Alas, neither of these
>>> options seems to accomplish the goal. Both in Open MPI v2.1.1 and on
>>> a recent master (7002535), with slurm's "--no-kill" and Open MPI's
>>> "--enable-recovery", once the node reboots one gets the following
>>> error:
>>>
>>> ```
>>> --------------------------------------------------------------------------
>>> ORTE has lost communication with a remote daemon.
>>>
>>> HNP daemon   : [[58323,0],0] on node pnod0330
>>> Remote daemon: [[58323,0],1] on node pnod0331
>>>
>>> This is usually due to either a failure of the TCP network
>>> connection to the node, or possibly an internal failure of
>>> the daemon itself. We cannot recover from this failure, and
>>> therefore will terminate the job.
>>> --------------------------------------------------------------------------
>>> [pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
>>> [pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
>>> ```
>>>
>>> I haven't yet tried the hard-reboot case with ULFM (these nodes take
>>> forever to come back up), but earlier experiments SIGKILLing the orted
>>> on a compute node led to a very similar message as above, so at this
>>> point I'm not optimistic...
>>>
>>> I think my next step is to try with several separate mpiruns and use
>>> MPI_Comm_{connect,accept} to plumb everything together before the
>>> application starts. I notice this is the subject of some recent work
>>> on ompi master.
>>> Even though the mpiruns will all be associated with the same
>>> ompi-server, do you think this could be sufficient to isolate the
>>> failures?
>>>
>>> Cheers,
>>> Tim
>>>
>>> On 10 June 2017 at 00:56, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>>> It has been a while since I tested it, but I believe the
>>>> --enable-recovery option might do what you want.
>>>>
>>>>> On Jun 8, 2017, at 6:17 AM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> So I know from searching the archive that this is a repeated topic of
>>>>> discussion here, and apologies for that, but since it's been a year or
>>>>> so I thought I'd double-check whether anything has changed before
>>>>> really starting to tear my hair out too much.
>>>>>
>>>>> Is there a combination of MCA parameters or similar that will prevent
>>>>> ORTE from aborting a job when it detects a node failure? This is
>>>>> using the tcp btl, under slurm.
>>>>>
>>>>> The application, not written by us and too complicated to re-engineer
>>>>> at short notice, has a strictly master-slave communication pattern.
>>>>> The master never blocks on communication from individual slaves, and
>>>>> apparently can itself detect slaves that have silently disappeared and
>>>>> reissue the work to those remaining. So from an application
>>>>> standpoint I believe we should be able to handle this. However, in
>>>>> all my testing so far the job is aborted as soon as the runtime system
>>>>> figures out what is going on.
>>>>>
>>>>> If not, do any users know of another MPI implementation that might
>>>>> work for this use case? As far as I can tell, FT-MPI has been pretty
>>>>> quiet the last couple of years?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Tim

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
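[Editor's note] The multiple-mpirun idea discussed above could be wired up roughly as below. This is only a sketch of the approach under discussion, not a tested recipe: the hostnames match the thread, but the URI file path and the `master`/`worker` program names are hypothetical, and whether separate runtimes actually isolate the failure is exactly the open question in the thread.

```shell
# Sketch: one runtime per node, joined through a shared ompi-server.

# 1. Start a standalone rendezvous server; it writes its contact URI to a file.
ompi-server --report-uri /d/home/user/ompi-server.uri

# 2. Give the master its own mpirun (rsh launcher, recovery enabled).
mpirun --mca plm rsh --host bud96 -np 1 --enable-recovery \
       --ompi-server file:/d/home/user/ompi-server.uri ./master &

# 3. Give each worker node a separate mpirun. If pnod0331 dies, only this
#    runtime's orted is lost; the master's mpirun is a different job.
mpirun --mca plm rsh --host pnod0331 -np 1 --enable-recovery \
       --ompi-server file:/d/home/user/ompi-server.uri ./worker &
```

Inside the applications, the master would open and publish a port (MPI_Open_port, MPI_Publish_name) and sit in MPI_Comm_accept, while each worker looks the name up (MPI_Lookup_name) and calls MPI_Comm_connect; the shared ompi-server is what makes the published name visible across the otherwise independent mpiruns.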