Hi Hugh,

I am still having some weird problems with moab/torque on my test xcpu
cluster.  I mentioned some of these in a previous e-mail, but that query
went unanswered; since you wrote the script(s), perhaps you could help me
debug this?  Here is the issue:

I have 2 compute nodes, each with 2 cpus.  I submit several jobs to
the queue using qsub:

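For reference, the four jobs are submitted with plain qsub, along the lines
of

    for i in 1 2 3 4; do qsub script.cmd; done

where script.cmd is the job script shown further down.
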
[EMAIL PROTECTED] xcpu]$ xstat
n0000   tcp!10.10.0.10!6667     /Linux/x86_64   up      0
n0001   tcp!10.10.0.11!6667     /Linux/x86_64   up      0

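(If it helps, I can also send the output of "pbsnodes -a", which is where
Torque itself reports each node's state and processor count; given the 4
processors showq sees below, it should show np=2 on both n0000 and n0001.)
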
[EMAIL PROTECTED] xcpu]$ showq

active jobs------------------------
JOBID                     USERNAME      STATE PROCS   REMAINING            STARTTIME

25.dgk3.chem.utoronto.ca     danny    Running     1    00:58:30  Mon Sep 15 10:11:07
26.dgk3.chem.utoronto.ca     danny    Running     1    00:58:30  Mon Sep 15 10:11:07
27.dgk3.chem.utoronto.ca     danny    Running     1    00:58:30  Mon Sep 15 10:11:07
28.dgk3.chem.utoronto.ca     danny    Running     1    00:58:30  Mon Sep 15 10:11:07

4 active jobs               4 of 4 processors in use by local jobs (100.00%)
                            2 of 2 nodes active      (100.00%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME


0 eligible jobs

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME


0 blocked jobs

Total jobs:  4

The job script is:
#!/bin/bash
#PBS -l nodes=1
#XCPU -p

date


The weird thing is that all the jobs end up being executed on node
n0000, as per the output:

[EMAIL PROTECTED] xcpu]$ cat script.cmd.o25
n0000: Mon Sep 15 10:11:42 UTC 2008
[EMAIL PROTECTED] xcpu]$ cat script.cmd.o26
n0000: Mon Sep 15 10:11:26 UTC 2008
[EMAIL PROTECTED] xcpu]$ cat script.cmd.o27
n0000: Mon Sep 15 10:13:03 UTC 2008
[EMAIL PROTECTED] xcpu]$ cat script.cmd.o28
n0000: Mon Sep 15 10:12:02 UTC 2008

This is despite the fact that, according to the moab "checkjob" command,
some of these jobs were supposedly assigned to n0001 and some to n0000:

[EMAIL PROTECTED] xcpu]$ checkjob 26.dgk3.chem.utoronto.ca
job 26.dgk3.chem.utoronto.ca

AName: script.cmd
State: Completed
Complete Time:  Mon Sep 15 10:12:40
  Completion Code: 0
Creds:  user:danny  group:danny  class:batch
WallTime:   00:01:33 of 1:00:00
SubmitTime: Mon Sep 15 10:10:53
  (Time Queued  Total: 00:02:49  Eligible: 00:00:00)

Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: dgk3
Memory >= 0  Disk >= 0  Swap >= 0
Opsys:   ---  Arch: ---  Features: ---

Allocated Nodes:
[n0001:1]

[EMAIL PROTECTED] xcpu]$ checkjob 28.dgk3.chem.utoronto.ca
job 28.dgk3.chem.utoronto.ca

AName: script.cmd
State: Running
Creds:  user:danny  group:danny  class:batch
WallTime:   00:01:33 of 1:00:00
SubmitTime: Mon Sep 15 10:10:56
  (Time Queued  Total: 00:00:11  Eligible: 00:00:11)

StartTime: Mon Sep 15 10:11:07
Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: dgk3
Memory >= 0  Disk >= 0  Swap >= 0
Opsys:   ---  Arch: ---  Features: ---

Allocated Nodes:
[n0000:1]

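One more data point I can collect is Torque's own record of where it placed
each job, i.e. the exec_host attribute, along the lines of

    for j in 25 26 27 28; do qstat -f $j | grep -E 'Job Id|exec_host'; done

assuming the completed jobs are still visible to qstat (otherwise I can dig
the same information out of the server logs with tracejob).  If Torque's
exec_host matches Moab's "Allocated Nodes" while the output still comes
from n0000, that would point at the xcpu side rather than the scheduler.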

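To dig further, I am also going to rerun with a variant of the job script
that prints where it actually lands and what Torque handed it (this assumes
the usual PBS environment variables survive the xcpu hand-off, which I have
not verified):

#!/bin/bash
#PBS -l nodes=1
#XCPU -p

# which host the script really runs on
hostname
# the node(s) Torque allocated, if the nodefile is visible from here
[ -n "$PBS_NODEFILE" ] && cat "$PBS_NODEFILE"
date
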
Can you shed any light on this?
Thanks a lot,
Daniel
