On Tue, Jan 31, 2012 at 12:10 PM, Lane Schwartz <[email protected]> wrote:
> On Fri, Jan 27, 2012 at 3:42 PM, Rayson Ho <[email protected]> wrote:
>> On Fri, Jan 27, 2012 at 2:49 PM, Lane Schwartz <[email protected]> wrote:
>>> I have encountered a problem where sometimes (but not always) my jobs
>>> ignore the -cwd or -wd flags and run in my home directory instead of
>>> the specified working directory. I can run the same job multiple times
>>> launching from the same directory, and sometimes the job correctly
>>> runs from the current directory, and sometimes it runs from my home
>>> directory.
>>
>> I ran over 100 test jobs and all of them ran in directory specified in
>> -cwd or -wd. How easy is it to reproduce the issue?? Is the home
>> directory on NFS or some kind of network or cluster storage??
>
> The home directory is mounted via NFS. The correct directory (where
> the jobs are launched from) is also on NFS.
>
>> If Grid Engine cannot change the directory to the one specified by
>> -cwd/-wd, then it will simply turn the job into the "Eqw" state.
>
> When jobs run in the wrong directory, their job state remains in "r" state.
>
>
>> 2) So assume you have jobs do not run in the "correct" directory, run:
>>
>> - qstat -j <job id>
>>
>> the "sge_o_workdir" should show you what SGE thinks which directory
>> the job is supposed to run in.
>
> I ran a bunch of jobs. The job is a dummy script that simply runs
> `pwd` and echoes the value of $PWD, then checks the value of $PWD
> against the hardcoded directory where the job should be run. If $PWD
> fails to match the expected directory, the job echoes "Failure" then
> sleeps.
>
> For all of the jobs that printed "Failure", the log file shows that
> running 'pwd' returned my home directory instead of the correct
> directory. Likewise, $PWD reported my home directory.
>
> For those jobs that printed "Failure", when I run qstat -j <job id>
> the value of sge_o_workdir lists the directory where the job was
> launched (that is, the directory where the job should have been run).
>
>> - go into the $SGE_ROOT/$SGE_CELL/spool/<execution
>> host>/active_jobs/<job id.1> directory
>
> I ssh'd to the execution host for one of the jobs that reported
> "Failure" and went to the directory you specified above.
>
> The "environment" file lists the following:
> PWD=/scratch4/lane/2011-12-15_europarl
>
> That is where the job should be running, but when the job ran it
> printed out /home/lane as the value of $PWD.
>
> The "config" file lists the following:
> cwd=/scratch4/lane/2011-12-15_europarl.
>
> Again, this is the directory where the job should have run.
>
> Any ideas?
>
> Thanks,
> Lane

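For anyone who wants to try reproducing this, here is a minimal sketch of the kind of check the dummy job described above performs. It is not the exact script from my runs: the function name check_workdir and the "OK" message are made up for illustration, and the expected directory is passed as an argument rather than hardcoded.

```shell
#!/bin/bash
# Hypothetical helper (name and messages are assumptions, not the original script).
# Prints the working directory two ways and compares $PWD with the directory
# the job was expected to run in.
check_workdir() {
    local expected="$1"

    # Report what the shell builtin and the environment variable say.
    echo "pwd reports: $(pwd)"
    echo "PWD is:      ${PWD}"

    # The job counts as failed if $PWD is not the expected directory.
    if [ "${PWD}" != "${expected}" ]; then
        echo "Failure"
        return 1
    fi

    echo "OK"
    return 0
}
```

At the end of a job script submitted with -cwd, this could be called as `check_workdir /scratch4/lane/2011-12-15_europarl || sleep 600`, so that a job landing in the wrong directory prints "Failure" and keeps running long enough to inspect with qstat -j. The 600-second sleep is an arbitrary placeholder.
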
To help debug what's going on, I added a couple lines to my .bashrc file.

At the beginning of my ~/.bashrc:

export MYVAR=$PWD
export MYSTART=`pwd`
export MYBINSTART=`/bin/pwd`

And at the end of my ~/.bashrc:

export MYVAREND=$PWD
export MYBINEND=`/bin/pwd`

I then modified my test script to print out the values of all of these
new variables.

In all cases for these new variables, the correct directory (where the
job really should be run) is printed. When "Failure" is reported,
printing $PWD or running `pwd` from my script prints the wrong
directory (my home dir, where the job should NOT run).

Thanks,
Lane
