On Fri, Jan 27, 2012 at 3:42 PM, Rayson Ho <[email protected]> wrote:
> On Fri, Jan 27, 2012 at 2:49 PM, Lane Schwartz <[email protected]> wrote:
>> I have encountered a problem where sometimes (but not always) my jobs
>> ignore the -cwd or -wd flags and run in my home directory instead of
>> the specified working directory. I can run the same job multiple times
>> launching from the same directory, and sometimes the job correctly
>> runs from the current directory, and sometimes it runs from my home
>> directory.
>
> I ran over 100 test jobs and all of them ran in the directory specified
> by -cwd or -wd. How easy is it to reproduce the issue? Is the home
> directory on NFS or some other kind of network or cluster storage?

The home directory is mounted via NFS. The correct directory (where
the jobs are launched from) is also on NFS.

> If Grid Engine cannot change the directory to the one specified by
> -cwd/-wd, then it will simply turn the job into the "Eqw" state.

When jobs run in the wrong directory, their state remains "r".


> 2) Assuming you have jobs that do not run in the "correct" directory, run:
>
> - qstat -j <job id>
>
> The "sge_o_workdir" field should show which directory SGE thinks
> the job is supposed to run in.

I ran a bunch of jobs. The job is a dummy script that simply runs
`pwd` and echoes the value of $PWD, then checks the value of $PWD
against the hardcoded directory where the job should be run. If $PWD
fails to match the expected directory, the job echoes "Failure" then
sleeps.
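
For reference, the test job can be sketched roughly as follows. This is
a reconstruction rather than the exact script: the function name
check_cwd, the EXPECTED_DIR and SLEEP_SECS variables, the sleep duration,
and the JOB_ID guard (so the check only fires when running under Grid
Engine) are assumptions added for illustration.

```shell
#!/bin/sh
# Hypothetical reconstruction of the dummy test job described above.
# EXPECTED_DIR defaults to the launch directory mentioned in this thread.
EXPECTED_DIR="${EXPECTED_DIR:-/scratch4/lane/2011-12-15_europarl}"

# Print the working directory two ways and compare $PWD with the
# expected directory given as $1. Returns 1 and prints "Failure" on mismatch.
check_cwd() {
    echo "pwd:  $(pwd)"
    echo "PWD:  $PWD"
    if [ "$PWD" != "$1" ]; then
        echo "Failure"
        return 1
    fi
    return 0
}

# Grid Engine sets JOB_ID in the job's environment; only run the check
# (and the sleep that keeps a failed job visible in qstat) inside a job.
if [ -n "$JOB_ID" ]; then
    check_cwd "$EXPECTED_DIR" || sleep "${SLEEP_SECS:-600}"
fi
```

Sleeping on failure keeps the job in the "r" state long enough to
inspect it with qstat and to look at its spool directory.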

For all of the jobs that printed "Failure", the log file shows that
running 'pwd' returned my home directory instead of the correct
directory. Likewise, $PWD reported my home directory.

For those jobs that printed "Failure", when I run qstat -j <job id>
the value of sge_o_workdir lists the directory where the job was
launched (that is, the directory where the job should have been run).
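
In case it helps anyone repeat the check, here is a small sketch for
pulling that field out of the qstat output. The function name
workdir_from_qstat is made up for illustration, and it assumes the field
appears as a "sge_o_workdir:" label followed by a path without spaces.

```shell
# Read `qstat -j <job id>` output on stdin and print the sge_o_workdir value.
# Assumes the path contains no whitespace (only the second field is printed).
workdir_from_qstat() {
    awk '$1 == "sge_o_workdir:" { print $2 }'
}

# Intended usage (requires a Grid Engine installation):
#   qstat -j <job id> | workdir_from_qstat
```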

> - go into the $SGE_ROOT/$SGE_CELL/spool/<execution
> host>/active_jobs/<job id.1> directory

I ssh'd to the execution host for one of the jobs that reported
"Failure" and went to the directory you specified above.

The "environment" file lists the following:
PWD=/scratch4/lane/2011-12-15_europarl

That is where the job should be running, but when the job ran it
printed out /home/lane as the value of $PWD.

The "config" file lists the following:
cwd=/scratch4/lane/2011-12-15_europarl

Again, this is the directory where the job should have run.
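
The same comparison could be scripted for every failed job. The sketch
below assumes only what is shown above, namely that the "config" and
"environment" files hold KEY=VALUE lines; the helper names get_value and
check_spool_dir are hypothetical.

```shell
# Print the value of KEY ($1) from a file ($2) of KEY=VALUE lines.
# Only the first matching line is used.
get_value() {
    sed -n "s/^$1=//p" "$2" | head -n 1
}

# Compare the directories recorded in a job's spool files with the
# expected working directory.
#   $1: the active_jobs/<job id.1> directory on the execution host
#   $2: the directory the job should have run in
# Returns 0 only if both the config cwd and the environment PWD match.
check_spool_dir() {
    cfg_cwd="$(get_value cwd "$1/config")"
    env_pwd="$(get_value PWD "$1/environment")"
    echo "config cwd:      $cfg_cwd"
    echo "environment PWD: $env_pwd"
    [ "$cfg_cwd" = "$2" ] && [ "$env_pwd" = "$2" ]
}
```

If both values match the launch directory while the job itself reports
/home/lane, the wrong directory is not coming from what Grid Engine
recorded for the job.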

Any ideas?

Thanks,
Lane