[xcpu] Re: moab/torque problems

Hugh Greenberg Tue, 16 Sep 2008 13:30:42 -0700

Daniel,

I'm not sure what is going on.  Can you please send me the Torque logs?
Also, can you please send me all your Moab logs again?  I might be able
to help you more after looking at them.  Thanks.


On Tue, 2008-09-16 at 15:49 -0400, Daniel Gruner wrote:
> Hi Hugh,
> 
> I am still having some weird problems with moab/torque on my test xcpu
> cluster.  I mentioned some of these in a previous e-mail, but the
> query went unanswered, and since you wrote the script(s) perhaps you
> could help me debug this?  Here is the issue:
> 
> I have 2 compute nodes, each with 2 cpus.  I submit several jobs to
> the queue using qsub:
> 
> [EMAIL PROTECTED] xcpu]$ xstat
> n0000   tcp!10.10.0.10!6667     /Linux/x86_64   up      0
> n0001   tcp!10.10.0.11!6667     /Linux/x86_64   up      0
> 
> [EMAIL PROTECTED] xcpu]$ showq
> 
> active jobs------------------------
> JOBID                     USERNAME      STATE PROCS   REMAINING            STARTTIME
> 
> 25.dgk3.chem.utoronto.ca     danny    Running     1    00:58:30  Mon Sep 15 10:11:07
> 26.dgk3.chem.utoronto.ca     danny    Running     1    00:58:30  Mon Sep 15 10:11:07
> 27.dgk3.chem.utoronto.ca     danny    Running     1    00:58:30  Mon Sep 15 10:11:07
> 28.dgk3.chem.utoronto.ca     danny    Running     1    00:58:30  Mon Sep 15 10:11:07
> 
> 4 active jobs               4 of 4 processors in use by local jobs (100.00%)
>                             2 of 2 nodes active      (100.00%)
> 
> eligible jobs----------------------
> JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
> 
> 
> 0 eligible jobs
> 
> blocked jobs-----------------------
> JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
> 
> 
> 0 blocked jobs
> 
> Total jobs:  4
> 
> The job script is:
> #!/bin/bash
> #PBS -l nodes=1
> #XCPU -p
> 
> date
> 
> 
> The weird thing is that all the jobs end up being executed on node
> n0000, as per the output:
> 
> [EMAIL PROTECTED] xcpu]$ cat script.cmd.o25
> n0000: Mon Sep 15 10:11:42 UTC 2008
> [EMAIL PROTECTED] xcpu]$ cat script.cmd.o26
> n0000: Mon Sep 15 10:11:26 UTC 2008
> [EMAIL PROTECTED] xcpu]$ cat script.cmd.o27
> n0000: Mon Sep 15 10:13:03 UTC 2008
> [EMAIL PROTECTED] xcpu]$ cat script.cmd.o28
> n0000: Mon Sep 15 10:12:02 UTC 2008
> 
> This is despite the fact that when I use the moab "checkjob" command,
> some of these jobs were supposedly assigned to n0001 and some to n0000:
> 
> [EMAIL PROTECTED] xcpu]$ checkjob 26.dgk3.chem.utoronto.ca
> job 26.dgk3.chem.utoronto.ca
> 
> AName: script.cmd
> State: Completed
> Complete Time:  Mon Sep 15 10:12:40
>   Completion Code: 0
> Creds:  user:danny  group:danny  class:batch
> WallTime:   00:01:33 of 1:00:00
> SubmitTime: Mon Sep 15 10:10:53
>   (Time Queued  Total: 00:02:49  Eligible: 00:00:00)
> 
> Total Requested Tasks: 1
> 
> Req[0]  TaskCount: 1  Partition: dgk3
> Memory >= 0  Disk >= 0  Swap >= 0
> Opsys:   ---  Arch: ---  Features: ---
> 
> Allocated Nodes:
> [n0001:1]
> 
> [EMAIL PROTECTED] xcpu]$ checkjob 28.dgk3.chem.utoronto.ca
> job 28.dgk3.chem.utoronto.ca
> 
> AName: script.cmd
> State: Running
> Creds:  user:danny  group:danny  class:batch
> WallTime:   00:01:33 of 1:00:00
> SubmitTime: Mon Sep 15 10:10:56
>   (Time Queued  Total: 00:00:11  Eligible: 00:00:11)
> 
> StartTime: Mon Sep 15 10:11:07
> Total Requested Tasks: 1
> 
> Req[0]  TaskCount: 1  Partition: dgk3
> Memory >= 0  Disk >= 0  Swap >= 0
> Opsys:   ---  Arch: ---  Features: ---
> 
> Allocated Nodes:
> [n0000:1]
> 
> 
> Can you shed any light on this?
> Thanks a lot,
> Daniel
-- 
Hugh Greenberg
Los Alamos National Laboratory, CCS-1
Email: [EMAIL PROTECTED]
Phone: (505) 665-6471
