Re: [OMPI users] Node failure handling

2017-06-26 Thread r...@open-mpi.org
Let me poke at it a bit tomorrow - we should be able to avoid the abort. It’s a bug if we can’t.

> On Jun 26, 2017, at 7:39 PM, Tim Burgess wrote:
>
> Hi Ralph,
>
> Thanks for the quick response.
>
> Just tried again not under slurm, but the same result... (though I

Re: [OMPI users] Node failure handling

2017-06-26 Thread Tim Burgess
Hi Ralph,

Thanks for the quick response.

Just tried again not under slurm, but the same result... (though I just did kill -9 orted on the remote node this time)

Any ideas? Do you think my multiple-mpirun idea is worth trying?

Cheers,
Tim ``` [user@bud96 mpi_resilience]$

Re: [OMPI users] Node failure handling

2017-06-26 Thread Tim Burgess
Hi Ralph, George,

Thanks very much for getting back to me. Alas, neither of these options seems to accomplish the goal. In both OpenMPI v2.1.1 and a recent master (7002535), with slurm's "--no-kill" and openmpi's "--enable-recovery", once the node reboots one gets the following error: ```