Hi Hugh,
I am still seeing some odd behaviour with moab/torque on my test xcpu
cluster. I mentioned some of this in a previous e-mail, but the query
went unanswered. Since you wrote the script(s), perhaps you could help
me debug it? Here is the issue:
I have 2 compute nodes, each with 2 cpus. I submit several jobs to
the queue using qsub:
[EMAIL PROTECTED] xcpu]$ xstat
n0000 tcp!10.10.0.10!6667 /Linux/x86_64 up 0
n0001 tcp!10.10.0.11!6667 /Linux/x86_64 up 0
[EMAIL PROTECTED] xcpu]$ showq
active jobs------------------------
JOBID                     USERNAME  STATE    PROCS  REMAINING  STARTTIME
25.dgk3.chem.utoronto.ca  danny     Running  1      00:58:30   Mon Sep 15 10:11:07
26.dgk3.chem.utoronto.ca  danny     Running  1      00:58:30   Mon Sep 15 10:11:07
27.dgk3.chem.utoronto.ca  danny     Running  1      00:58:30   Mon Sep 15 10:11:07
28.dgk3.chem.utoronto.ca  danny     Running  1      00:58:30   Mon Sep 15 10:11:07
4 active jobs 4 of 4 processors in use by local jobs (100.00%)
2 of 2 nodes active (100.00%)
eligible jobs----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
0 eligible jobs
blocked jobs-----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
0 blocked jobs
Total jobs: 4
The job script is:
#!/bin/bash
#PBS -l nodes=1
#XCPU -p
date
The weird thing is that all the jobs end up being executed on node
n0000, as per the output:
[EMAIL PROTECTED] xcpu]$ cat script.cmd.o25
n0000: Mon Sep 15 10:11:42 UTC 2008
[EMAIL PROTECTED] xcpu]$ cat script.cmd.o26
n0000: Mon Sep 15 10:11:26 UTC 2008
[EMAIL PROTECTED] xcpu]$ cat script.cmd.o27
n0000: Mon Sep 15 10:13:03 UTC 2008
[EMAIL PROTECTED] xcpu]$ cat script.cmd.o28
n0000: Mon Sep 15 10:12:02 UTC 2008
This is despite the moab "checkjob" command reporting that some of
these jobs were assigned to n0001 and some to n0000:
[EMAIL PROTECTED] xcpu]$ checkjob 26.dgk3.chem.utoronto.ca
job 26.dgk3.chem.utoronto.ca
AName: script.cmd
State: Completed
Complete Time: Mon Sep 15 10:12:40
Completion Code: 0
Creds: user:danny group:danny class:batch
WallTime: 00:01:33 of 1:00:00
SubmitTime: Mon Sep 15 10:10:53
(Time Queued Total: 00:02:49 Eligible: 00:00:00)
Total Requested Tasks: 1
Req[0] TaskCount: 1 Partition: dgk3
Memory >= 0 Disk >= 0 Swap >= 0
Opsys: --- Arch: --- Features: ---
Allocated Nodes:
[n0001:1]
[EMAIL PROTECTED] xcpu]$ checkjob 28.dgk3.chem.utoronto.ca
job 28.dgk3.chem.utoronto.ca
AName: script.cmd
State: Running
Creds: user:danny group:danny class:batch
WallTime: 00:01:33 of 1:00:00
SubmitTime: Mon Sep 15 10:10:56
(Time Queued Total: 00:00:11 Eligible: 00:00:11)
StartTime: Mon Sep 15 10:11:07
Total Requested Tasks: 1
Req[0] TaskCount: 1 Partition: dgk3
Memory >= 0 Disk >= 0 Swap >= 0
Opsys: --- Arch: --- Features: ---
Allocated Nodes:
[n0000:1]
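In case it helps with reproducing this, the comparison I am doing by
hand above could be scripted roughly as follows. This is only a sketch:
the two helper names (allocated_node, ran_on) are made up for
illustration and are not part of moab, torque or xcpu. It assumes that
checkjob keeps printing an "Allocated Nodes:" line followed by a
"[node:tasks]" line, and that each line of the job output file starts
with "node:", as in the output pasted above.

```shell
#!/bin/bash
# Hypothetical helpers for cross-checking node assignment; the names
# are illustrative only.

# Read checkjob output on stdin and print the node named on the line
# that follows "Allocated Nodes:", e.g. "[n0001:1]" -> "n0001".
allocated_node() {
    awk '/Allocated Nodes:/ { getline; print $1; exit }' \
        | tr -d '[]' | cut -d: -f1
}

# Print the node prefix of the first line of a job output file,
# e.g. "n0000: Mon Sep 15 10:11:26 UTC 2008" -> "n0000".
ran_on() {
    head -n 1 "$1" | cut -d: -f1
}

# Example use, assuming the job ids and output file names shown above:
# for id in 25 26 27 28; do
#     a=$(checkjob "$id.dgk3.chem.utoronto.ca" | allocated_node)
#     r=$(ran_on "script.cmd.o$id")
#     echo "job $id: allocated=$a ran_on=$r"
# done
```

For job 26 above, this would report n0001 as allocated but n0000 as the
node the job actually ran on.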
Can you shed any light on this?
Thanks a lot,
Daniel