Sorry for the delay in replying to this -- mails sometimes pile up in
my INBOX and I don't get to reply to them all in a timely fashion.
Yes, you can expect this to be much better in the v1.3 series. If you
have a few cycles, you might want to test a nightly trunk tarball
snapshot in some of your problematic cases and see if it's better.
We've had a little instability in trunk tarballs over the last week,
so you might want to wait until next week to give it a shot.
http://www.open-mpi.org/nightly/trunk/
On Jun 9, 2008, at 10:50 AM, Bill Johnstone wrote:
Hello OMPI devs,
I'm currently running OMPI v1.2.4. It didn't seem that any bugs
which affect me or my users were fixed in 1.2.5 and 1.2.6, so I
haven't upgraded yet.
When I was initially getting started with OpenMPI, I had some
problems which I was able to solve, but one still remains. As I
mentioned in
http://www.open-mpi.org/community/lists/users/2007/07/3716.php
when there is a non-graceful exit on any of the MPI jobs, mpirun
hangs. As an example, I have a code that I run which, when it has a
trivial runtime error (e.g., some small mistake in the input file),
dies, yielding messages to the screen like:
[node1.x86-64:28556] MPI_ABORT invoked on rank 0 in communicator
MPI_COMM_WORLD with errorcode 16
but mpirun never exits, and Ctrl+C won't kill it. I have to resort
to kill -9.
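For what it's worth, a stripped-down stand-in along these lines (just
a sketch, not our real code; the error code 16 only mirrors the
message above) should exercise the same MPI_ABORT path if a small
test case is useful:

/* Minimal sketch of a program that hits the MPI_ABORT path described
 * above; a stand-in that aborts with the same error code so the
 * mpirun cleanup behavior can be checked. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Simulate the "trivial runtime error" case: rank 0 gives up. */
        fprintf(stderr, "rank 0: bad input, aborting\n");
        MPI_Abort(MPI_COMM_WORLD, 16);
    }

    /* The other ranks block in a barrier; the runtime is supposed to
     * kill them (and mpirun should exit) once rank 0 aborts. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with, e.g., mpirun -np 2, the thing to
watch is whether mpirun itself exits after the abort.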
Now that I'm running under SLURM, this is worse because there is no
nice way to manually clear individual jobs off the controller. So
even if I manually kill mpirun on the failed job, slurmctld still
thinks it's running.
Ralph Castain replied to the previously-linked message:
http://www.open-mpi.org/community/lists/users/2007/07/3718.php
indicating that he thought he knew why this was happening and that
it either was already fixed, or would likely be fixed, in the trunk.
At this point, I just want to know: can I look forward to this being
fixed in the upcoming v 1.3 series?
I don't mean that to sound ungrateful: *many thanks* to the OMPI
devs for what you've already given the community at large. I'm just
a bit frustrated because we seem to run a lot of codes on our
cluster that abort at one time or another.
Thank you.
--
Jeff Squyres
Cisco Systems