On Fri, Jan 27, 2012 at 2:49 PM, Lane Schwartz <[email protected]> wrote:
> I have encountered a problem where sometimes (but not always) my jobs
> ignore the -cwd or -wd flags and run in my home directory instead of
> the specified working directory. I can run the same job multiple times
> launching from the same directory, and sometimes the job correctly
> runs from the current directory, and sometimes it runs from my home
> directory.

I ran over 100 test jobs, and all of them ran in the directory specified
by -cwd or -wd. How easy is it to reproduce the issue? Is the home
directory on NFS or some other kind of network or cluster storage?

If Grid Engine cannot change to the directory specified by -cwd/-wd,
it will simply put the job into the "Eqw" state.

Since this is an intermittent issue, we will need to gather some data:

1) Run a few test jobs that do the following:

- check whether the current working directory is the correct one by
comparing the output of `pwd` against a hard-coded value of the
supposedly correct directory (so you will need to decide on the
location before you submit the jobs, since the correct value is
hard-coded into the job script).

- if the value is not correct, email you to report the issue, and
then sleep (I mean... the job, not you!)

- and if the value is correct, then just exit with 0 and don't sleep
(no point in wasting the job slots).
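
The test job in step 1 could look roughly like the sketch below. It is
only an illustration: EXPECTED_DIR and NOTIFY_ADDR are placeholder
values that you have to fill in before submitting, the helper name
in_expected_dir is made up for this example, and it assumes a working
`mail` command on the execution hosts. The sleep keeps a misplaced job
running so that it can still be inspected afterwards.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd

# Assumed placeholder values: replace both before submitting.
EXPECTED_DIR="/path/to/submit/dir"
NOTIFY_ADDR="[email protected]"

# Succeeds only if the current directory equals the one passed in.
in_expected_dir() {
    [ "$(pwd)" = "$1" ]
}

# JOB_ID is set by Grid Engine inside a running job. Outside of a job
# (for example when sourcing this file to try the helper) nothing runs.
if [ -n "${JOB_ID:-}" ]; then
    if in_expected_dir "$EXPECTED_DIR"; then
        exit 0    # correct directory: release the slot right away
    fi
    # Wrong directory: report it, then keep the job alive for inspection.
    echo "job $JOB_ID on $(hostname): pwd=$(pwd), expected $EXPECTED_DIR" |
        mail -s "SGE job $JOB_ID ran in the wrong directory" "$NOTIFY_ADDR"
    sleep 86400
fi
```

Submit it a few dozen times from EXPECTED_DIR with plain `qsub`; only
the jobs that land in the wrong directory will stay in the queue.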


2) Assuming you get jobs that do not run in the "correct" directory, run:

- qstat -j <job id>

the "sge_o_workdir" line should show which directory SGE thinks the
job is supposed to run in.

- go into the directory
  $SGE_ROOT/$SGE_CELL/spool/<execution host>/active_jobs/<job id.1>

Tar up the contents of that directory and send it to me, together with
the qstat -j output.
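
Step 2 can be wrapped in a small helper, sketched below. The function
name collect_job_debug and the output file names are made up for this
example. It assumes the spool layout given above (with ".1" as the task
id), that SGE_ROOT and SGE_CELL are set, and that it is run while the
job is still sleeping, on a machine that can read the execution host's
spool directory.

```shell
#!/bin/bash

# Usage: collect_job_debug <job id> <execution host>
# Writes qstat-j-<job id>.txt and active_jobs-<job id>.tar.gz into the
# current directory.
collect_job_debug() {
    local job_id=$1 exec_host=$2
    local spool_dir="$SGE_ROOT/$SGE_CELL/spool/$exec_host/active_jobs/$job_id.1"

    # Save what SGE recorded for the job, including sge_o_workdir.
    qstat -j "$job_id" > "qstat-j-$job_id.txt" || return 1

    # Archive the contents of the job's active_jobs directory.
    tar -czf "active_jobs-$job_id.tar.gz" -C "$spool_dir" .
}
```

For example, `collect_job_debug 12345 node01` (both values are
placeholders) leaves the two files to send in the current directory.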

Rayson




>
> Interestingly, though, it always outputs the stderr and stdout log
> files into the correct folder (specified by -o and -e) which is in the
> current directory.
>
> To help debug the problem, I made a sample script that simply calls
> pwd. The script is below:
>
> #!/bin/bash
>
> # Tell SGE to use bash instead of the SGE default shell
> #$ -S /bin/bash
>
> # Tell SGE to keep all current environment variables
> #$ -V
>
> # Tell SGE to run job from current working directory
> #$ -cwd
>
> # Tell SGE which queue to use
> #$ -q all.q
>
> # Tell SGE the name of this job
> #$ -N fr-en.mert
>
> # Tell SGE where to log this job
> #$ -o log/mert/fr-en/
> #$ -e log/mert/fr-en/
>
> # Tell SGE how much memory this job needs
> #$ -l mem_free=8G
>
> echo "pwd=`pwd`"
> echo "PWD=$PWD"
>
>
>
> I'm running OGS/GE2011.11. I also have another setup with SGE/ge6.2u6
> - on that older setup this problem does not occur.
>
> Has anyone ever seen this type of problem? If I change the script to
> use -wd /path/to/current/dir instead of -cwd I get exactly the same
> inconsistent behavior. Likewise, it doesn't appear to matter whether I
> pass the flag at the command line or within the script, as above.
>
> Are there any grid engine or scheduler log files that I could examine
> which might be helpful in tracking down this behavior?
>
> Thanks,
> Lane Schwartz
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users