Forget it.  For whatever reason, after I decided to "qdel" those two jobs,
I submitted a few more, and they started running.  That is, some are running,
some are queued, and all that is fine.

What is still very puzzling is why they take forever to run...  For example:

[EMAIL PROTECTED] ~]$ showq

active jobs------------------------
JOBID                     USERNAME      STATE PROCS   REMAINING            STARTTIME

11.dgk3.chem.utoronto.ca     danny    Running     2    00:58:13  Thu Sep 11 11:32:00
12.dgk3.chem.utoronto.ca     danny    Running     2    00:58:44  Thu Sep 11 11:32:31

2 active jobs               4 of 4 processors in use by local jobs (100.00%)
                            2 of 2 nodes active      (100.00%)

eligible jobs----------------------
JOBID                     USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

13.dgk3.chem.utoronto.ca     danny       Idle     2     1:00:00  Thu Sep 11 11:32:30
14.dgk3.chem.utoronto.ca     danny       Idle     2     1:00:00  Thu Sep 11 11:32:31

2 eligible jobs

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME


0 blocked jobs

Total jobs:  4
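
A quick way to dig into why the jobs take so long (just a sketch, assuming the
usual Moab client commands are available alongside showq, which this thread
doesn't confirm) is to ask Moab for its per-job view:

checkjob -v 13.dgk3.chem.utoronto.ca   # detailed state, allocated nodes, and any
                                       # hold/deferral reason Moab has recorded
mdiag -j                               # one-line summary of every job Moab knows about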

All these jobs do is call the "date" command:
[EMAIL PROTECTED] ~]$ cat script.cmd
#!/bin/bash
#PBS -l nodes=2
#XCPU -p

date
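
For what it's worth, a variant of the test script (my own sketch, not anything
from the README) that also prints the hostname would make it obvious from the
output alone whether the two processes really land on different nodes:

#!/bin/bash
#PBS -l nodes=2
#XCPU -p

# hostname should differ between the two output lines if the
# processes really run on different nodes
hostname
date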


The error output from all these jobs is:
[EMAIL PROTECTED] ~]$ cat script.cmd.e14
stty: standard input: Inappropriate ioctl for device
TERM environment variable not set.
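
Those two messages are usually harmless: they typically come from a shell
startup file on the nodes (something like ~/.bashrc or /etc/profile; I'm only
guessing at the exact file) calling stty or another terminal utility in a
non-interactive session with no tty attached.  Guarding the call silences them,
e.g.:

# only touch the terminal when stdin is actually a tty
if [ -t 0 ]; then
    stty erase '^?'
fi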

The stdout from them is:
[EMAIL PROTECTED] ~]$ cat script.cmd.o14
n0000: Thu Sep 11 11:34:21 UTC 2008
n0000: Thu Sep 11 11:34:21 UTC 2008
[EMAIL PROTECTED] ~]$ cat script.cmd.o13
n0000: Thu Sep 11 11:34:17 UTC 2008
n0000: Thu Sep 11 11:34:17 UTC 2008
[EMAIL PROTECTED] ~]$ cat script.cmd.o12
n0000: Thu Sep 11 11:32:48 UTC 2008
n0000: Thu Sep 11 11:32:48 UTC 2008
[EMAIL PROTECTED] ~]$ cat script.cmd.o11
n0000: Thu Sep 11 11:33:26 UTC 2008
n0000: Thu Sep 11 11:33:26 UTC 2008

The "funny" thing is that all these jobs are run on the same node, n0000,
even though two nodes are available.  Something is still not right.  The
queuing system does allow the two jobs to run (after a fashion; they seem to
wait forever, until some timeout or other, before finishing), but both are
submitted to the same node.

When I submit a bunch of jobs requiring 1 node each, 4 run and 2 are queued,
which is correct, but they still take a long time.  Also, they all run on node
n0000 instead of being spread across the different nodes.
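
Before the next round of log digging, it may be worth comparing what each layer
believes about the nodes (a sketch only; I'm assuming the tools referenced
earlier in this thread are in place and that the native RM script can be run by
hand the way Moab would call it):

xstat                                  # xcpu's view: should show n0000 and n0001 up
/opt/moab/tools/node.query.xcpu.pl     # what the native RM script reports to Moab
mdiag -n                               # Moab's view of each node and its state

If node.query.xcpu.pl only ever reports n0000, the placement problem would sit
in the xcpu tools layer rather than in Moab itself.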

???
Daniel



On 9/11/08, Daniel Gruner <[EMAIL PROTECTED]> wrote:
> Oh boy! Thanks for finding the typo...  Happens when you cut and paste...
>
>  Ok, so we move on:  after restarting moab, the showq screen correctly
>  shows 2 nodes available.  However, when I qsub a couple of jobs, they
>  remain queued:
>
>  [EMAIL PROTECTED] ~]$ qstat
>  Job id                    Name             User            Time Use S Queue
>  ------------------------- ---------------- --------------- -------- - -----
>  9.dgk3                    script.cmd       danny                  0 Q batch
>  10.dgk3                   script.cmd       danny                  0 Q batch
>
> [EMAIL PROTECTED] ~]$ showq
>
>  active jobs------------------------
>  JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME
>
>
>  0 active jobs               0 of 4 processors in use by local jobs (0.00%)
>                             0 of 2 nodes active      (0.00%)
>
>  eligible jobs----------------------
>  JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
>
>
>  0 eligible jobs
>
>  blocked jobs-----------------------
>  JOBID                     USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
>
>
>  10.dgk3.chem.utoronto.ca     danny  BatchHold     2     1:00:00  Thu Sep 11 11:12:20
>  9.dgk3.chem.utoronto.ca      danny  BatchHold     2     1:00:00  Thu Sep 11 11:12:18
>
>  2 blocked jobs
>
>  Total jobs:  2
>
>  and in fact, they remain blocked by moab.  I attach here the logs (the
>  latest and relevant part of moab.log, plus the other two logs).
>
>  Thanks,
>
> Daniel
>
>
>
>  On 9/11/08, Hugh Greenberg <[EMAIL PROTECTED]> wrote:
>  >
>  >  Daniel,
>  >
>  >  From the log file, it seems like moab is trying to contact torque
>  >  directly and it is not using the xcpu scripts at all.  Also, there was a
>  >  warning message in the log that says:
>  >
>  >  WARNING:  cannot process attribute 'TYPE=NONE' specified for RM dgk3
>  >
>  >  It seems as though Moab cannot figure out which resource manager to use.
>  >  I noticed an error in your moab.cfg file.  The line:
>  >
>  >
>  >  RMCFG[dgk3]      TYPE=TYPE=NATIVE FLAGS=FULLCP
>  >
>  >
>  > should be:
>  >
>  >  RMCFG[dgk3]            TYPE=NATIVE FLAGS=FULLCP
>  >
>  >  Try that and let me know if it works or not.  If it doesn't work, please
>  >  send the logs.
>  >
>  >
>  >  On Thu, 2008-09-11 at 10:37 -0400, Daniel Gruner wrote:
>  >  > Hi Hugh,
>  >  >
>  >  > There is only one file, moab.log, which I attach.
>  >  >
>  >  > Daniel
>  >  >
>  >  > On 9/11/08, Hugh Greenberg <[EMAIL PROTECTED]> wrote:
>  >  > >
>  >  > >  Daniel,
>  >  > >
>  >  > >  Can you send me Moab's log files?  For me, the Moab log directory
>  >  > >  is /opt/moab/log/.  One of each type of log file would help me figure
>  >  > >  out what is happening.  Thanks.
>  >  > >
>  >  > >
>  >  > >  On Wed, 2008-09-10 at 23:09 -0400, Daniel Gruner wrote:
>  >  > >  > Hi Hugh,
>  >  > >  >
>  >  > >  > On Wed, Sep 10, 2008 at 5:54 PM, Hugh Greenberg <[EMAIL PROTECTED]> wrote:
>  >  > >  > >
>  >  > >  > > Daniel,
>  >  > >  > >
>  >  > >  > > Just to be sure, are you running statfs?  The moab scripts get the node
>  >  > >  > > information from statfs, and moab is only showing one node.
>  >  > >  >
>  >  > >  > Yes I am.  Here is what xstat says:
>  >  > >  > [EMAIL PROTECTED] ~]$ xstat
>  >  > >  > n0000   tcp!10.10.0.10!6667     /Linux/x86_64   up      0
>  >  > >  > n0001   tcp!10.10.0.11!6667     /Linux/x86_64   up      0
>  >  > >  >
>  >  > >  > >
>  >  > >  > > Once that is fixed, you may see that torque error anyway.  Did you do
>  >  > >  > > the following as specified in the Readme?:
>  >  > >  > >
>  >  > >  > > 3.  There only needs to be one pbs_mom running on the head node(s).
>  >  > >  > > Since there is only one pbs_mom running, Torque needs to be aware of the
>  >  > >  > > number of nodes in your cluster, otherwise job submission will fail if
>  >  > >  > > the user requests more than one node.
>  >  > >  > > To make Torque aware of the number of nodes in your cluster, execute
>  >  > >  > > qmgr and enter something like the following on the qmgr command prompt:
>  >  > >  > >
>  >  > >  > > Qmgr: s s resources_available.nodect = 91
>  >  > >  > > Qmgr: s q batch resources_available.nodect=91
>  >  > >  >
>  >  > >  > I did read the instructions, and this is the configuration, as per qmgr:
>  >  > >  >
>  >  > >  > [EMAIL PROTECTED] ~]# qmgr
>  >  > >  > Max open servers: 4
>  >  > >  > Qmgr: print server
>  >  > >  > #
>  >  > >  > # Create queues and set their attributes.
>  >  > >  > #
>  >  > >  > #
>  >  > >  > # Create and define queue batch
>  >  > >  > #
>  >  > >  > create queue batch
>  >  > >  > set queue batch queue_type = Execution
>  >  > >  > set queue batch resources_default.nodes = 1
>  >  > >  > set queue batch resources_default.walltime = 01:00:00
>  >  > >  > set queue batch resources_available.nodect = 2
>  >  > >  > set queue batch enabled = True
>  >  > >  > set queue batch started = True
>  >  > >  > #
>  >  > >  > # Set server attributes.
>  >  > >  > #
>  >  > >  > set server scheduling = True
>  >  > >  > set server acl_hosts = dgk3.chem.utoronto.ca
>  >  > >  > set server managers = [EMAIL PROTECTED]
>  >  > >  > set server operators = [EMAIL PROTECTED]
>  >  > >  > set server default_queue = batch
>  >  > >  > set server log_events = 511
>  >  > >  > set server mail_from = adm
>  >  > >  > set server resources_available.nodect = 2
>  >  > >  > set server scheduler_iteration = 600
>  >  > >  > set server node_check_rate = 150
>  >  > >  > set server tcp_timeout = 6
>  >  > >  > set server mom_job_sync = True
>  >  > >  > set server keep_completed = 300
>  >  > >  > set server next_job_number = 9
>  >  > >  >
>  >  > >  > As you see, I tried to follow your README instructions pretty faithfully...
>  >  > >  > Daniel
>  >  > >  >
>  >  > >  >
>  >  > >  > >
>  >  > >  > > On Wed, 2008-09-10 at 17:38 -0400, Daniel Gruner wrote:
>  >  > >  > >> Hi
>  >  > >  > >>
>  >  > >  > >> I got an evaluation version of Moab (5.2.4), and torque (2.3.3), and
>  >  > >  > >> after following the instructions in the
>  >  > >  > >> sxcpu/moab_torque/README.Moab_Torque file, and running all of
>  >  > >  > >> pbs_server, pbs_mom, and moab, it appears that moab only recognizes
>  >  > >  > >> one node in my cluster.  This test cluster has a master and 2 slaves,
>  >  > >  > >> each with 2 processors.
>  >  > >  > >>
>  >  > >  > >> Here are my configuration files:
>  >  > >  > >>
>  >  > >  > >> [EMAIL PROTECTED] torque]# cat server_priv/nodes
>  >  > >  > >> dgk3.chem.utoronto.ca np=4
>  >  > >  > >>
>  >  > >  > >> [EMAIL PROTECTED] torque]# cat mom_priv/config
>  >  > >  > >> $preexec /opt/moab/tools/xcpu-torque-wrapper.sh
>  >  > >  > >>
>  >  > >  > >> [EMAIL PROTECTED] moab]# cat moab.cfg
>  >  > >  > >> ################################################################################
>  >  > >  > >> #
>  >  > >  > >> #  Moab Configuration File for moab-5.2.4
>  >  > >  > >> #
>  >  > >  > >> #  Documentation can be found at
>  >  > >  > >> #  http://www.clusterresources.com/products/mwm/docs/moabadmin.shtml
>  >  > >  > >> #
>  >  > >  > >> #  For a complete list of all parameters (including those below) please see:
>  >  > >  > >> #  http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml
>  >  > >  > >> #
>  >  > >  > >> #  For more information on the initial configuration, please refer to:
>  >  > >  > >> #  http://www.clusterresources.com/products/mwm/docs/2.2initialconfig.shtm
>  >  > >  > >> #
>  >  > >  > >> #  Use 'mdiag -C' to check config file parameters for validity
>  >  > >  > >> #
>  >  > >  > >> ################################################################################
>  >  > >  > >>
>  >  > >  > >> SCHEDCFG[Moab]        SERVER=dgk3:42559
>  >  > >  > >> ADMINCFG[1]           USERS=root
>  >  > >  > >> TOOLSDIR              /opt/moab/tools
>  >  > >  > >> LOGLEVEL              3
>  >  > >  > >>
>  >  > >  > >> ################################################################################
>  >  > >  > >> #
>  >  > >  > >> #  Resource Manager configuration
>  >  > >  > >> #
>  >  > >  > >> #  For more information on configuring a Resource Manager, see:
>  >  > >  > >> #  http://www.clusterresources.com/products/mwm/docs/13.2rmconfiguration.shtml
>  >  > >  > >> #
>  >  > >  > >> ################################################################################
>  >  > >  > >>
>  >  > >  > >> RMCFG[dgk3]      TYPE=TYPE=NATIVE FLAGS=FULLCP
>  >  > >  > >> RMCFG[dgk3]      CLUSTERQUERYURL=exec:///$TOOLSDIR/node.query.xcpu.pl
>  >  > >  > >> RMCFG[dgk3]      WORKLOADQUERYURL=exec:///$TOOLSDIR/job.query.xcpu.pl
>  >  > >  > >> RMCFG[dgk3]      JOBSTARTURL=exec:///$TOOLSDIR/job.start.xcpu.pl
>  >  > >  > >> RMCFG[dgk3]      JOBCANCELURL=exec:///$TOOLSDIR/job.cancel.xcpu.pl
>  >  > >  > >>
>  >  > >  > >> [EMAIL PROTECTED] moab]# cat tools/config.xcpu.pl
>  >  > >  > >> ################################################################################
>  >  > >  > >> # Configuration file for xcpu tools
>  >  > >  > >> #
>  >  > >  > >> # This was written by ClusterResources.  Modifications were made for XCPU by
>  >  > >  > >> # Hugh Greenberg.
>  >  > >  > >> ################################################################################
>  >  > >  > >>
>  >  > >  > >> use FindBin qw($Bin);    # The $Bin directory is the directory this file is in
>  >  > >  > >>
>  >  > >  > >> # Important:  Moab::Tools must be included in the calling script
>  >  > >  > >> # before this config file so that homeDir is properly set.
>  >  > >  > >> our ($homeDir);
>  >  > >  > >>
>  >  > >  > >> # Set the PATH to include directories for bproc and torque binaries
>  >  > >  > >> $ENV{PATH} = "$ENV{PATH}:/opt/torque/bin:/usr/bin:/usr/local/bin";
>  >  > >  > >>
>  >  > >  > >> # Set paths as necessary -- these can be short names if PATH is included above
>  >  > >  > >> $xstat    = 'xstat';
>  >  > >  > >> $xrx      = 'xrx';
>  >  > >  > >> $xk       = 'xk';
>  >  > >  > >> $qrun     = 'qrun';
>  >  > >  > >> $qstat    = 'qstat';
>  >  > >  > >> $pbsnodes = 'pbsnodes';
>  >  > >  > >>
>  >  > >  > >> # Set configured node resources
>  >  > >  > >> $processorsPerNode = 2;        # Number of processors
>  >  > >  > >> $memoryPerNode     = 2048;     # Memory in megabytes
>  >  > >  > >> $swapPerNode       = 2048;     # Swap in megabytes
>  >  > >  > >>
>  >  > >  > >> # Specify level of log detail
>  >  > >  > >> $logLevel = 1;
>  >  > >  > >>
>  >  > >  > >> # The default number of processors to run on
>  >  > >  > >> $nodes = 1;
>  >  > >  > >>
>  >  > >  > >>
>  >  > >  > >> Here is the output from "showq":
>  >  > >  > >>
>  >  > >  > >> [EMAIL PROTECTED] ~]$ showq
>  >  > >  > >>
>  >  > >  > >> active jobs------------------------
>  >  > >  > >> JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME
>  >  > >  > >>
>  >  > >  > >>
>  >  > >  > >> 0 active jobs               0 of 4 processors in use by local jobs (0.00%)
>  >  > >  > >>                             0 of 1 nodes active      (0.00%)
>  >  > >  > >>
>  >  > >  > >> eligible jobs----------------------
>  >  > >  > >> JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
>  >  > >  > >>
>  >  > >  > >>
>  >  > >  > >> 0 eligible jobs
>  >  > >  > >>
>  >  > >  > >> blocked jobs-----------------------
>  >  > >  > >> JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
>  >  > >  > >>
>  >  > >  > >>
>  >  > >  > >> 0 blocked jobs
>  >  > >  > >>
>  >  > >  > >> Total jobs:  0
>  >  > >  > >>
>  >  > >  > >>
>  >  > >  > >> When I run the following script:
>  >  > >  > >>
>  >  > >  > >> #!/bin/bash
>  >  > >  > >> #PBS -l nodes=2
>  >  > >  > >> #XCPU -p
>  >  > >  > >>
>  >  > >  > >> date
>  >  > >  > >>
>  >  > >  > >> it eventually finishes, but it runs both processes on the same node
>  >  > >  > >> n0000.  If I specify more than 2 nodes (processes, really), the job
>  >  > >  > >> aborts saying it doesn't have enough resources.  The issue seems to be
>  >  > >  > >> that moab understands that it has only one active node - it appears to
>  >  > >  > >> simply probe the master, since it is the node specified in the
>  >  > >  > >> server_priv/nodes file, and there is a single mom running.
>  >  > >  > >>
>  >  > >  > >> Any ideas?
>  >  > >  > >>
>  >  > >  > >> Thanks,
>  >  > >  > >> Daniel
>  >  > >  > > --
>  >  > >  > > Hugh Greenberg
>  >  > >  > > Los Alamos National Laboratory, CCS-1
>  >  > >  > > Email: [EMAIL PROTECTED]
>  >  > >  > > Phone: (505) 665-6471
>  >  > >  > >
>  >  > >  > >
>  >  > >
>  >  > > --
>  >  > >
>  >  > > Hugh Greenberg
>  >  > >  Los Alamos National Laboratory, CCS-1
>  >  > >  Email: [EMAIL PROTECTED]
>  >  > >  Phone: (505) 665-6471
>  >  > >
>  >  > >
>  >
>  > --
>  >
>  > Hugh Greenberg
>  >  Los Alamos National Laboratory, CCS-1
>  >  Email: [EMAIL PROTECTED]
>  >  Phone: (505) 665-6471
>  >
>  >
>
>
