We're bringing up SoGE 8.1.6 and I've run into a problem with the use of a
'starter_method' that's affecting OpenMPI jobs.
Following previous discussions on the list[1], we're using the
'environment modules' package, and using a starter_method to initialize
the user's environment as if it was a login shell and to export the
"module" function.
Our starter_method script is "/lab/bin/starter" and contains:
---------------------------------------------
[line 1] #!/bin/bash -l
[line 2] # initialize modules, then run whatever was given
[line 3] . /usr/share/Modules/init/bash
[line 4]
[line 5] # check if "module" is declared as a function
[line 6] declare -f -F module 1> /dev/null 2>&1
[line 7] if [ $? = 0 ] ; then
[line 8] # there is a module function, export it
[line 9] export -f module
[line 10] fi
[line 11]
[line 12] printf "Debugging. About to:\n\texec \"${@}\"\n" 1>&2
[line 12] exec "${@}"
---------------------------------------------
(The debugging statement is not normally active.)
This works fine for serial jobs.
However, OpenMPI (1.3.3) jobs fail to start. It appears as if the
starter_method is somehow corrupting the environment passed to mpirun.
For example, I submitted a job when I was in the directory
/lab/home/bergman/sge_job_output.
The starter_method script reports:
Debugging: About to:
exec OPAL_PREFIX=/lab/bin/openmpi/sge; export OPAL_PREFIX;\
PATH=/lab/bin/openmpi/bin:$PATH ;\
export PATH ;\
LD_LIBRARY_PATH=/lab/bin/openmpi/lib:$LD_LIBRARY_PATH ;\
export LD_LIBRARY_PATH ; /lab/bin/openmpi/bin/orted
which looks fine. However, that is followed by the error:
/lab/bin/starter: line 12:
/lab/home/bergman/sge_job_output/OPAL_PREFIX=/lab/bin/openmpi/sge;\
export OPAL_PREFIX;\
PATH=/lab/bin/openmpi/bin:$PATH ;\
export PATH ;\
LD_LIBRARY_PATH=/lab/bin/openmpi/lib:$LD_LIBRARY_PATH ;\
export LD_LIBRARY_PATH ; /lab/bin/openmpi/bin/orted
[Lines broken for readability.]
The odd thing is that the current working directory (where the SGE
job was submitted) is prepended to the definition of the OPAL_PREFIX
variable. This happens consistently, regardless of where the SGE job is
launched (~, /tmp, etc.).
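One guess, which I have not confirmed: the whole command string may be reaching
the starter as a single argument, so that exec treats it as one file name
(and, since it contains slashes but does not start with one, as a path
relative to the current directory). The sketch below tries to reproduce that
outside of SGE. The command string, the GREETING variable, and the
run_direct/run_via_shell helpers are made up for illustration; they are not
part of SGE or OpenMPI.

```shell
#!/bin/bash
# Hypothetical stand-in for what the starter receives: ONE argument that
# contains assignments, ';' separators, and finally a command.
cmd='GREETING=/tmp/demo; export GREETING; echo "value: $GREETING"'

# Mimics the starter's exec "${@}". It runs in a subshell so that a failed
# exec does not terminate this script; the error message is discarded.
run_direct() {
    ( exec "$@" ) 2>/dev/null
}

# Possible workaround (an assumption, not a tested fix for SGE): if exactly
# one argument arrives, hand it to a shell to parse instead of exec'ing it
# as a file name. A lone path containing spaces would still need care.
run_via_shell() {
    if [ "$#" -eq 1 ]; then
        ( exec /bin/sh -c "$1" )
    else
        ( exec "$@" )
    fi
}

run_direct "$cmd"
direct_status=$?
shell_output=$(run_via_shell "$cmd")

echo "direct exec status: $direct_status"
echo "via sh -c: $shell_output"
```

If the direct exec fails here while the sh -c variant prints the value, that
would at least be consistent with the error from the starter.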
Any suggestions?
Thanks,
Mark
[1] http://gridengine.org/pipermail/users/2014-January/007121.html