It looks to me like Torque is not starting the job. Can you send me Torque's logs? Thanks.
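In case it helps, here is roughly what I mean - a minimal sketch for gathering them, assuming the default spool layout (TORQUE_HOME may well be somewhere else on your system, so adjust the path, and the date-stamped file names are just today's):

#!/bin/bash
# Collect the Torque-side evidence for one stuck job (sketch; adjust paths).
# TORQUE_HOME is an assumption -- it is often /var/spool/torque, but check
# where your install actually keeps its spool directory.
TORQUE_HOME=/var/spool/torque
JOBID=9.dgk3

qstat -f "$JOBID"        # full job attributes as Torque sees them
tracejob "$JOBID"        # pulls together the server/MOM log entries for this job
tail -n 200 "$TORQUE_HOME"/server_logs/"$(date +%Y%m%d)"   # today's pbs_server log
tail -n 200 "$TORQUE_HOME"/mom_logs/"$(date +%Y%m%d)"      # today's pbs_mom log

The server and MOM logs around the time of the qsub are the most useful part.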
On Thu, 2008-09-11 at 11:24 -0400, Daniel Gruner wrote:
> Oh boy!  Thanks for finding the typo...  Happens when you cut and paste...
>
> Ok, so we move on: after restarting moab, the showq screen correctly
> shows 2 nodes available.  However, when I qsub a couple of jobs, they
> remain queued:
>
> [EMAIL PROTECTED] ~]$ qstat
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 9.dgk3                    script.cmd       danny                  0 Q batch
> 10.dgk3                   script.cmd       danny                  0 Q batch
>
> [EMAIL PROTECTED] ~]$ showq
>
> active jobs------------------------
> JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME
>
> 0 active jobs             0 of 4 processors in use by local jobs (0.00%)
>                           0 of 2 nodes active      (0.00%)
>
> eligible jobs----------------------
> JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
>
> 0 eligible jobs
>
> blocked jobs-----------------------
> JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
>
> 10.dgk3.chem.utoronto.ca  danny   BatchHold     2     1:00:00  Thu Sep 11 11:12:20
> 9.dgk3.chem.utoronto.ca   danny   BatchHold     2     1:00:00  Thu Sep 11 11:12:18
>
> 2 blocked jobs
>
> Total jobs:  2
>
> and in fact, they remain blocked by moab.  I attach here the logs (the
> latest and relevant part of moab.log, plus the other two logs).
>
> Thanks,
> Daniel
>
>
> On 9/11/08, Hugh Greenberg <[EMAIL PROTECTED]> wrote:
> >
> > Daniel,
> >
> > From the log file, it seems like moab is trying to contact torque
> > directly and it is not using the xcpu scripts at all.  Also, there was a
> > warning message in the log that says:
> >
> > WARNING:  cannot process attribute 'TYPE=NONE' specified for RM dgk3
> >
> > It seems as though Moab cannot figure out which resource manager to use.
> > I noticed an error in your moab.cfg file.  The line:
> >
> > RMCFG[dgk3] TYPE=TYPE=NATIVE FLAGS=FULLCP
> >
> > should be:
> >
> > RMCFG[dgk3] TYPE=NATIVE FLAGS=FULLCP
> >
> > Try that and let me know if it works or not.  If it doesn't work, please
> > send the logs.
> >
> >
> > On Thu, 2008-09-11 at 10:37 -0400, Daniel Gruner wrote:
> > > Hi Hugh,
> > >
> > > There is only one file, moab.log, which I attach.
> > >
> > > Daniel
> > >
> > > On 9/11/08, Hugh Greenberg <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Daniel,
> > > >
> > > > Can you send me Moab's log files?  For me, the Moab log directory
> > > > is /opt/moab/log/.  One of each type of log file would help me figure
> > > > out what is happening.  Thanks.
> > > >
> > > >
> > > > On Wed, 2008-09-10 at 23:09 -0400, Daniel Gruner wrote:
> > > > > Hi Hugh,
> > > > >
> > > > > On Wed, Sep 10, 2008 at 5:54 PM, Hugh Greenberg <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > Daniel,
> > > > > >
> > > > > > Just to be sure, are you running statfs?  The moab scripts get the node
> > > > > > information from statfs, and moab is only showing one node.
> > > > >
> > > > > Yes I am.  Here is what xstat says:
> > > > > [EMAIL PROTECTED] ~]$ xstat
> > > > > n0000    tcp!10.10.0.10!6667    /Linux/x86_64    up    0
> > > > > n0001    tcp!10.10.0.11!6667    /Linux/x86_64    up    0
> > > > >
> > > > >
> > > > > > Once that is fixed, you may see that torque error anyway.  Did you do
> > > > > > the following as specified in the Readme?:
> > > > > >
> > > > > > 3. There only needs to be one pbs_mom running on the head node(s).
> > > > > > Since there is only one pbs_mom running, Torque needs to be aware of the
> > > > > > number of nodes in your cluster, otherwise job submission will fail if
> > > > > > the user requests more than one node.
> > > > > > To make Torque aware of the number of nodes in your cluster, execute
> > > > > > qmgr and enter something like the following on the qmgr command prompt:
> > > > > >
> > > > > > Qmgr: s s resources_available.nodect = 91
> > > > > > Qmgr: s q batch resources_available.nodect=91
> > > > >
> > > > > I did read the instructions, and this is the configuration, as per qmgr:
> > > > >
> > > > > [EMAIL PROTECTED] ~]# qmgr
> > > > > Max open servers: 4
> > > > > Qmgr: print server
> > > > > #
> > > > > # Create queues and set their attributes.
> > > > > #
> > > > > #
> > > > > # Create and define queue batch
> > > > > #
> > > > > create queue batch
> > > > > set queue batch queue_type = Execution
> > > > > set queue batch resources_default.nodes = 1
> > > > > set queue batch resources_default.walltime = 01:00:00
> > > > > set queue batch resources_available.nodect = 2
> > > > > set queue batch enabled = True
> > > > > set queue batch started = True
> > > > > #
> > > > > # Set server attributes.
> > > > > #
> > > > > set server scheduling = True
> > > > > set server acl_hosts = dgk3.chem.utoronto.ca
> > > > > set server managers = [EMAIL PROTECTED]
> > > > > set server operators = [EMAIL PROTECTED]
> > > > > set server default_queue = batch
> > > > > set server log_events = 511
> > > > > set server mail_from = adm
> > > > > set server resources_available.nodect = 2
> > > > > set server scheduler_iteration = 600
> > > > > set server node_check_rate = 150
> > > > > set server tcp_timeout = 6
> > > > > set server mom_job_sync = True
> > > > > set server keep_completed = 300
> > > > > set server next_job_number = 9
> > > > >
> > > > > As you see, I tried to follow your README instructions pretty faithfully...
> > > > > Daniel
> > > > >
> > > > >
> > > > >
> > > > > > On Wed, 2008-09-10 at 17:38 -0400, Daniel Gruner wrote:
> > > > > >> Hi
> > > > > >>
> > > > > >> I got an evaluation version of Moab (5.2.4), and torque (2.3.3), and
> > > > > >> after following the instructions in the
> > > > > >> sxcpu/moab_torque/README.Moab_Torque file, and running all of
> > > > > >> pbs_server, pbs_mom, and moab, it appears that moab only recognizes
> > > > > >> one node in my cluster.  This test cluster has a master and 2 slaves,
> > > > > >> each with 2 processors.
> > > > > >>
> > > > > >> Here are my configuration files:
> > > > > >>
> > > > > >> [EMAIL PROTECTED] torque]# cat server_priv/nodes
> > > > > >> dgk3.chem.utoronto.ca np=4
> > > > > >>
> > > > > >> [EMAIL PROTECTED] torque]# cat mom_priv/config
> > > > > >> $preexec    /opt/moab/tools/xcpu-torque-wrapper.sh
> > > > > >>
> > > > > >> [EMAIL PROTECTED] moab]# cat moab.cfg
> > > > > >> ################################################################################
> > > > > >> #
> > > > > >> # Moab Configuration File for moab-5.2.4
> > > > > >> #
> > > > > >> # Documentation can be found at
> > > > > >> # http://www.clusterresources.com/products/mwm/docs/moabadmin.shtml
> > > > > >> #
> > > > > >> # For a complete list of all parameters (including those below) please see:
> > > > > >> # http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml
> > > > > >> #
> > > > > >> # For more information on the initial configuration, please refer to:
> > > > > >> # http://www.clusterresources.com/products/mwm/docs/2.2initialconfig.shtm
> > > > > >> #
> > > > > >> # Use 'mdiag -C' to check config file parameters for validity
> > > > > >> #
> > > > > >> ################################################################################
> > > > > >>
> > > > > >> SCHEDCFG[Moab]    SERVER=dgk3:42559
> > > > > >> ADMINCFG[1]       USERS=root
> > > > > >> TOOLSDIR          /opt/moab/tools
> > > > > >> LOGLEVEL          3
> > > > > >>
> > > > > >> ################################################################################
> > > > > >> #
> > > > > >> # Resource Manager configuration
> > > > > >> #
> > > > > >> # For more information on configuring a Resource Manager, see:
> > > > > >> # http://www.clusterresources.com/products/mwm/docs/13.2rmconfiguration.shtml
> > > > > >> #
> > > > > >> ################################################################################
> > > > > >>
> > > > > >> RMCFG[dgk3]    TYPE=TYPE=NATIVE FLAGS=FULLCP
> > > > > >> RMCFG[dgk3]    CLUSTERQUERYURL=exec:///$TOOLSDIR/node.query.xcpu.pl
> > > > > >> RMCFG[dgk3]    WORKLOADQUERYURL=exec:///$TOOLSDIR/job.query.xcpu.pl
> > > > > >> RMCFG[dgk3]    JOBSTARTURL=exec:///$TOOLSDIR/job.start.xcpu.pl
> > > > > >> RMCFG[dgk3]    JOBCANCELURL=exec:///$TOOLSDIR/job.cancel.xcpu.pl
> > > > > >>
> > > > > >> [EMAIL PROTECTED] moab]# cat tools/config.xcpu.pl
> > > > > >> #################################################################################
> > > > > >> # Configuration file for xcpu tools
> > > > > >> #
> > > > > >> # This was written by ClusterResources.  Modifications were made for XCPU by
> > > > > >> # Hugh Greenberg.
> > > > > >> ################################################################################
> > > > > >>
> > > > > >> use FindBin qw($Bin);   # The $Bin directory is the directory this file is in
> > > > > >>
> > > > > >> # Important: Moab::Tools must be included in the calling script
> > > > > >> # before this config file so that homeDir is properly set.
> > > > > >> our ($homeDir);
> > > > > >>
> > > > > >> # Set the PATH to include directories for bproc and torque binaries
> > > > > >> $ENV{PATH} = "$ENV{PATH}:/opt/torque/bin:/usr/bin:/usr/local/bin";
> > > > > >>
> > > > > >> # Set paths as necessary -- these can be short names if PATH is included above
> > > > > >> $xstat = 'xstat';
> > > > > >> $xrx = 'xrx';
> > > > > >> $xk = 'xk';
> > > > > >> $qrun = 'qrun';
> > > > > >> $qstat = 'qstat';
> > > > > >> $pbsnodes = 'pbsnodes';
> > > > > >>
> > > > > >> # Set configured node resources
> > > > > >> $processorsPerNode = 2;   # Number of processors
> > > > > >> $memoryPerNode = 2048;    # Memory in megabytes
> > > > > >> $swapPerNode = 2048;      # Swap in megabytes
> > > > > >>
> > > > > >> # Specify level of log detail
> > > > > >> $logLevel = 1;
> > > > > >>
> > > > > >> # The default number of processors to run on
> > > > > >> $nodes = 1;
> > > > > >>
> > > > > >>
> > > > > >> Here is the output from "showq":
> > > > > >>
> > > > > >> [EMAIL PROTECTED] ~]$ showq
> > > > > >>
> > > > > >> active jobs------------------------
> > > > > >> JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME
> > > > > >>
> > > > > >> 0 active jobs             0 of 4 processors in use by local jobs (0.00%)
> > > > > >>                           0 of 1 nodes active      (0.00%)
> > > > > >>
> > > > > >> eligible jobs----------------------
> > > > > >> JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
> > > > > >>
> > > > > >> 0 eligible jobs
> > > > > >>
> > > > > >> blocked jobs-----------------------
> > > > > >> JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
> > > > > >>
> > > > > >> 0 blocked jobs
> > > > > >>
> > > > > >> Total jobs:  0
> > > > > >>
> > > > > >>
> > > > > >> When I run the following script:
> > > > > >>
> > > > > >> #!/bin/bash
> > > > > >> #PBS -l nodes=2
> > > > > >> #XCPU -p
> > > > > >>
> > > > > >> date
> > > > > >>
> > > > > >> it eventually finishes, but it runs both processes on the same node
> > > > > >> n0000.  If I specify more than 2 nodes (processes, really), the job
> > > > > >> aborts saying it doesn't have enough resources.  The issue seems to be
> > > > > >> that moab understands that it has only one active node - it appears to
> > > > > >> simply probe the master, since it is the node specified in the
> > > > > >> server_priv/nodes file, and there is a single mom running.
> > > > > >>
> > > > > >> Any ideas?
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Daniel
> > > > > > --
> > > > > > Hugh Greenberg
> > > > > > Los Alamos National Laboratory, CCS-1
> > > > > > Email: [EMAIL PROTECTED]
> > > > > > Phone: (505) 665-6471
> > > > >
> > > >
> > > > --
> > > > Hugh Greenberg
> > > > Los Alamos National Laboratory, CCS-1
> > > > Email: [EMAIL PROTECTED]
> > > > Phone: (505) 665-6471
> > >
> >
> > --
> > Hugh Greenberg
> > Los Alamos National Laboratory, CCS-1
> > Email: [EMAIL PROTECTED]
> > Phone: (505) 665-6471
>
--
Hugh Greenberg
Los Alamos National Laboratory, CCS-1
Email: [EMAIL PROTECTED]
Phone: (505) 665-6471
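
P.S. Once Torque's logs are in hand, it is also worth asking Moab why it put those jobs on BatchHold. A rough sketch of the commands I would use (assuming a standard Moab install with the client tools on your PATH; the job id comes from your showq output above):

#!/bin/bash
# Inspect and retry a held job from the Moab side (sketch; job id is an example).
JOBID=9.dgk3.chem.utoronto.ca

checkjob -v "$JOBID"       # shows the hold reason and any resource-manager failure message
mdiag -R                   # confirms the dgk3 RM loaded as TYPE=NATIVE with the xcpu URLs
releasehold -a "$JOBID"    # clears the holds so Moab attempts the start again

checkjob usually quotes the failure message it got back from the RM, which should point at whichever side is refusing the start.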
