We're bringing up SoGE 8.1.6 and I've run into a problem with the use of a
'starter_method' that's affecting OpenMPI jobs.

Following previous discussions on the list[1], we're using the
'environment modules' package, with a starter_method that initializes
the user's environment as if it were a login shell and exports the
"module" function.

Our starter_method script is "/lab/bin/starter" and contains:

---------------------------------------------
[line  1]       #!/bin/bash -l
[line  2]       # initialize modules, then run whatever was given
[line  3]       . /usr/share/Modules/init/bash
[line  4]       
[line  5]       #  check if "module" is declared as a function
[line  6]       declare -f -F module 1> /dev/null 2>&1
[line  7]       if [ $? = 0 ] ; then
[line  8]               # there is a module function, export it
[line  9]               export -f module
[line 10]       fi
[line 11]       
[line 12]       printf "Debugging. About to:\n\texec \"${@}\"\n" 1>&2
[line 12]       exec "${@}"
---------------------------------------------

(The debugging statement is not normally active.)
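
Because "${@}" is embedded in the printf format string, the debugging
line cannot show where one argument ends and the next begins. A small
helper along these lines (show_args is a hypothetical name, not part of
the starter above) would print the argument count and each argument
separately, which may help tell a single long argument from several
short ones:

```shell
#!/bin/bash
# show_args: hypothetical debugging helper, not part of /lab/bin/starter.
# Prints how many arguments were received, then each one on its own line,
# shell-quoted with %q so embedded spaces and semicolons stay visible.
show_args() {
    printf 'argc=%d\n' "$#"
    local i=1 arg
    for arg in "$@"; do
        printf 'argv[%d]=%q\n' "$i" "$arg"
        i=$((i + 1))
    done
}

# Example: one argument holding a whole command line vs. separate words.
show_args 'FOO=bar; export FOO; /bin/true'
show_args /bin/echo hello
```

In the starter it could be called as 'show_args "$@" 1>&2' just before
the exec line.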

This works fine for serial jobs.

However, OpenMPI (1.3.3) jobs fail to start. It appears as if the
starter_method is somehow corrupting the environment passed to mpirun.

For example, I submitted a job when I was in the directory
/lab/home/bergman/sge_job_output.

The starter_method script reports:

        Debugging: About to:
                exec OPAL_PREFIX=/lab/bin/openmpi/sge; export OPAL_PREFIX;\
                         PATH=/lab/bin/openmpi/bin:$PATH ;\
                         export PATH ;\
                         LD_LIBRARY_PATH=/lab/bin/openmpi/lib:$LD_LIBRARY_PATH ;\
                         export LD_LIBRARY_PATH ; /lab/bin/openmpi/bin/orted

which looks fine. However, that is followed by the error:

        /lab/bin/starter: line 12: 
/lab/home/bergman/sge_job_output/OPAL_PREFIX=/lab/bin/openmpi/sge;\
                         export OPAL_PREFIX;\
                         PATH=/lab/bin/openmpi/bin:$PATH ;\
                         export PATH ;\
                         LD_LIBRARY_PATH=/lab/bin/openmpi/lib:$LD_LIBRARY_PATH ;\
                         export LD_LIBRARY_PATH ; /lab/bin/openmpi/bin/orted

[Lines broken for readability.]


The odd thing is that the current working directory (where the SGE
job was submitted) is prepended to the OPAL_PREFIX assignment at the
start of the command. This happens consistently, regardless of where
the SGE job is launched (~, /tmp, etc.).
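
One possible reading of the two outputs above, sketched under an
assumption that has not been confirmed on this cluster: the whole
"OPAL_PREFIX=...; ...; orted" text reaches the starter as a single
argument. In that case exec "${@}" would treat the entire string as the
name of a program rather than as shell commands, and because the string
contains a slash, bash would look it up as a pathname relative to the
current directory instead of searching PATH. The snippet uses a made-up
DEMO_PREFIX command string, not the real orted command line, and
compares exec with handing the same string to "bash -c":

```shell
#!/bin/bash
# Hypothetical stand-in for a command line that arrives as ONE argument.
cmd='DEMO_PREFIX=/tmp/demo; export DEMO_PREFIX; echo "prefix=$DEMO_PREFIX"'

# 1) What exec "$@" does with it: the whole string is taken as a program
#    name. Run in a subshell so this script survives the failed exec.
( exec "$cmd" ) 2>/dev/null
exec_status=$?
echo "exec status: $exec_status"

# 2) Passing the same string to a shell lets it be parsed as commands.
shell_output=$(bash -c "$cmd")
shell_status=$?
echo "bash -c status: $shell_status, output: $shell_output"
```

If this reading is right, the first status is non-zero and the second
is 0. Whether running the string through a shell is an appropriate
change for a starter_method is a separate question.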

Any suggestions?

Thanks,

Mark



[1] http://gridengine.org/pipermail/users/2014-January/007121.html