Hi Ralph,
On 02/03/2014 04:20 PM, Ralph Castain wrote:
On Feb 3, 2014, at 1:13 PM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca>
wrote:
On 02/03/2014 03:59 PM, Ralph Castain wrote:
Very strange - even if you kill the job with SIGTERM, or have processes that
segfault, OMPI should clean itself up and remove those session directories.
Granted, the 1.6 series isn't as good about doing so as the 1.7 series, but it
at least to-date has done pretty well.
Ok, one more piece of information here that may matter: all sequential tests are launched
*without* mpiexec... I don't know if the "cleanup" phase is done by mpiexec or
the binaries...
Ah, yes that would be a source of the problem! We can't guarantee cleanup if you just
kill the procs or they segfault *unless* mpiexec is used to launch the job. What are you
using to launch? Most resource managers provide an "epilog" capability for
precisely this purpose as all MPIs would display the same issue.
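For illustration, here is a minimal Python sketch of the kind of cleanup an epilog step could perform. It is not taken from Open MPI or from any resource manager. The function name `remove_stale_sessions` is hypothetical, and the `openmpi-sessions-<user>` prefix is assumed from the `/tmp/openmpi-sessions-${USER}*` pattern mentioned in this thread.

```python
import os
import shutil


def remove_stale_sessions(tmp_dir, user):
    """Remove every entry in tmp_dir whose name starts with
    "openmpi-sessions-<user>" (prefix assumed from this thread).

    Returns the sorted list of entry names that were removed.
    Caveat: a plain prefix match would also hit a user whose name
    merely starts with <user>, just like the shell glob would.
    """
    prefix = "openmpi-sessions-" + user
    removed = []
    for name in sorted(os.listdir(tmp_dir)):
        if not name.startswith(prefix):
            continue
        path = os.path.join(tmp_dir, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)  # leftover session directory
        else:
            os.remove(path)      # stray file or symlink occupying the name
        removed.append(name)
    return removed
```

It would be called as `remove_stale_sessions("/tmp", os.environ["USER"])`, and only when no MPI job of that user is still running on the node, since a live job's session directory would be removed as well.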
For the sequential jobs, we just launch the tests on the "command
line"... no resource manager is ever used. For the jobs which require
more than 1 process, we have "mpiexec -n ..." added to the command line...
which should delete files that shouldn't exist... ;-)
But, IMHO, I still think OpenMPI should "choose" another directory name if it
can't create it because a stray file with that name already exists!
We could do that - but now we get into the bottomless pit of trying every
possible combination of directory names, and ensuring that every process comes
up with the same answer! Remember, the session dir is where the shared memory
regions rendezvous, so every process on a node would have to find the same place.
ok. Just for my knowledge: that means if I launch 2 processes on a
single node and they have to communicate, they will do it by the files
in /tmp?
How can all users be aware that they have to clean up such files?
Given how long 1.6.x has been out there, and that this is about the only time
I've heard of a problem, I'm not sure this is a general enough issue to merit
the concern.
Ok. I just verified on 8 other computers/architectures that are
running the same tests: only 1 of them has files at the
directory level of /tmp/openmpi-sessions-${USER}*
Since we have been doing this kind of testing for many years, I also agree it is
not a widespread issue... But it just occurred 2 times in the last 3
days!!! :-/
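The check described above (looking for plain files where a session directory is expected) could be scripted roughly as follows. This is only a sketch: `find_blocking_files` is a hypothetical helper name, and the `openmpi-sessions-<user>` prefix is again assumed from the pattern quoted above.

```python
import os


def find_blocking_files(tmp_dir, user):
    """Return the sorted names in tmp_dir that start with
    "openmpi-sessions-<user>" but are not directories.

    os.path.isdir follows symlinks, so a symlink to a directory is
    not reported, while a regular file or a dangling symlink is.
    """
    prefix = "openmpi-sessions-" + user
    blocking = []
    for name in sorted(os.listdir(tmp_dir)):
        path = os.path.join(tmp_dir, name)
        if name.startswith(prefix) and not os.path.isdir(path):
            blocking.append(name)
    return blocking
```

An empty result from `find_blocking_files("/tmp", os.environ["USER"])` would mean that no plain file is occupying a session directory name for that user on that machine.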
Maybe a good compromise would be to have the error message say that a
file already exists with the same name as the chosen directory?
I can make that change - good suggestion.
ok, thanks!
Or add a new entry to the FAQ to help users find the workaround you proposed...
;-)
we can try to do that too
If I may suggest a way to test the behavior of 1.7.x... what about this: have
a test case that creates a bunch of files (from 0 to 65536) in
/tmp/openmpi-sessions-${USER}... before launching an executable without
mpirun... >:)
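A rough Python sketch of the setup half of such a test is given below. The exact names Open MPI derives for its session directories are not spelled out in this thread, so the `_<i>` suffix used here is purely hypothetical and would have to be adapted to the real naming scheme. `create_blocking_files` is likewise an invented name.

```python
import os


def create_blocking_files(tmp_dir, user, count):
    """Create `count` empty regular files in tmp_dir, named
    "openmpi-sessions-<user>_<i>" for i in 0..count-1.

    The "_<i>" suffix is an assumption made for illustration only.
    Returns the list of paths that were created.
    """
    created = []
    for i in range(count):
        path = os.path.join(tmp_dir, "openmpi-sessions-%s_%d" % (user, i))
        # Mode "x" fails with FileExistsError instead of silently
        # reusing an entry that is already there.
        with open(path, "x"):
            pass
        created.append(path)
    return created
```

The test itself would then launch an MPI executable directly, without mpirun, and check whether it still starts or at least reports which path it could not create.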
Anyway, thanks a lot!
Eric