Daniel, I'm not sure what is going on. Can you please send me the Torque logs, and all of your Moab logs again as well? I might be able to help you more after looking at them. Thanks.
On Tue, 2008-09-16 at 15:49 -0400, Daniel Gruner wrote:
> Hi Hugh,
>
> I am still having some weird problems with moab/torque on my test xcpu
> cluster. I mentioned some of these in a previous e-mail, but the
> query went unanswered, and since you wrote the script(s) perhaps you
> could help me debug this? Here is the issue:
>
> I have 2 compute nodes, each with 2 cpus. I submit several jobs to
> the queue using qsub:
>
> [EMAIL PROTECTED] xcpu]$ xstat
> n0000    tcp!10.10.0.10!6667    /Linux/x86_64    up    0
> n0001    tcp!10.10.0.11!6667    /Linux/x86_64    up    0
>
> [EMAIL PROTECTED] xcpu]$ showq
>
> active jobs------------------------
> JOBID                       USERNAME    STATE    PROCS    REMAINING    STARTTIME
>
> 25.dgk3.chem.utoronto.ca    danny       Running  1        00:58:30     Mon Sep 15 10:11:07
> 26.dgk3.chem.utoronto.ca    danny       Running  1        00:58:30     Mon Sep 15 10:11:07
> 27.dgk3.chem.utoronto.ca    danny       Running  1        00:58:30     Mon Sep 15 10:11:07
> 28.dgk3.chem.utoronto.ca    danny       Running  1        00:58:30     Mon Sep 15 10:11:07
>
> 4 active jobs    4 of 4 processors in use by local jobs (100.00%)
>                  2 of 2 nodes active (100.00%)
>
> eligible jobs----------------------
> JOBID                       USERNAME    STATE    PROCS    WCLIMIT    QUEUETIME
>
> 0 eligible jobs
>
> blocked jobs-----------------------
> JOBID                       USERNAME    STATE    PROCS    WCLIMIT    QUEUETIME
>
> 0 blocked jobs
>
> Total jobs: 4
>
> The job script is:
> #!/bin/bash
> #PBS -l nodes=1
> #XCPU -p
>
> date
>
> The weird thing is that all the jobs end up being executed on node
> n0000, as per the output:
>
> [EMAIL PROTECTED] xcpu]$ cat script.cmd.o25
> n0000: Mon Sep 15 10:11:42 UTC 2008
> [EMAIL PROTECTED] xcpu]$ cat script.cmd.o26
> n0000: Mon Sep 15 10:11:26 UTC 2008
> [EMAIL PROTECTED] xcpu]$ cat script.cmd.o27
> n0000: Mon Sep 15 10:13:03 UTC 2008
> [EMAIL PROTECTED] xcpu]$ cat script.cmd.o28
> n0000: Mon Sep 15 10:12:02 UTC 2008
>
> This is despite the fact that when I use the moab "checkjob" command,
> some of these jobs were supposedly assigned to n0001 and some to n0000:
>
> [EMAIL PROTECTED] xcpu]$ checkjob 26.dgk3.chem.utoronto.ca
> job 26.dgk3.chem.utoronto.ca
>
> AName: script.cmd
> State: Completed
> Complete Time: Mon Sep 15 10:12:40
> Completion Code: 0
> Creds: user:danny  group:danny  class:batch
> WallTime: 00:01:33 of 1:00:00
> SubmitTime: Mon Sep 15 10:10:53
>   (Time Queued  Total: 00:02:49  Eligible: 00:00:00)
>
> Total Requested Tasks: 1
>
> Req[0]  TaskCount: 1  Partition: dgk3
> Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: ---  Arch: ---  Features: ---
>
> Allocated Nodes:
> [n0001:1]
>
> [EMAIL PROTECTED] xcpu]$ checkjob 28.dgk3.chem.utoronto.ca
> job 28.dgk3.chem.utoronto.ca
>
> AName: script.cmd
> State: Running
> Creds: user:danny  group:danny  class:batch
> WallTime: 00:01:33 of 1:00:00
> SubmitTime: Mon Sep 15 10:10:56
>   (Time Queued  Total: 00:00:11  Eligible: 00:00:11)
>
> StartTime: Mon Sep 15 10:11:07
> Total Requested Tasks: 1
>
> Req[0]  TaskCount: 1  Partition: dgk3
> Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: ---  Arch: ---  Features: ---
>
> Allocated Nodes:
> [n0000:1]
>
> Can you shed any light on this?
> Thanks a lot,
> Daniel

-- 
Hugh Greenberg
Los Alamos National Laboratory, CCS-1
Email: [EMAIL PROTECTED]
Phone: (505) 665-6471
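
[Editor's note, not part of the original thread: for anyone debugging a similar "wrong node" mismatch, the sketch below extends Daniel's job script so each output file records both where the script actually ran and what Torque claims it allocated. It assumes a standard Torque setup where pbs_mom exports PBS_JOBID and PBS_NODEFILE; under an xcpu-based launcher those variables may not be populated, in which case only the hostname line is informative.]

#!/bin/bash
#PBS -l nodes=1
#XCPU -p

# Where this script is actually executing
echo "executing host:    $(hostname)"

# What the batch system believes it allocated (standard Torque/PBS
# variables; they may be unset under a non-standard mom)
echo "job id:            ${PBS_JOBID:-unset}"
if [ -n "$PBS_NODEFILE" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "allocated node(s): $(cat "$PBS_NODEFILE")"
else
    echo "allocated node(s): PBS_NODEFILE not available"
fi

date

Comparing the "executing host" line in each script.cmd.o* file against the "Allocated Nodes" section of checkjob would show directly whether the discrepancy lies in Moab/Torque's bookkeeping or in how the jobs are dispatched to the xcpu nodes.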
