Daniel,

Can you send me Moab's log files?  On my system the Moab log directory
is /opt/moab/log/.  One of each type of log file would help me figure
out what is happening.
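
Something like this should grab everything in one go (just a sketch --
adjust the path if your Moab home directory is laid out differently
from mine, and name the tarball whatever you like):

  cd /opt/moab
  tar czf moab-logs.tar.gz log/

Thanks.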

On Wed, 2008-09-10 at 23:09 -0400, Daniel Gruner wrote:
> Hi Hugh,
> 
> On Wed, Sep 10, 2008 at 5:54 PM, Hugh Greenberg <[EMAIL PROTECTED]> wrote:
> >
> > Daniel,
> >
> > Just to be sure, are you running statfs?  The moab scripts get the node
> > information from statfs, and moab is only showing one node.
> 
> Yes I am.  Here is what xstat says:
> [EMAIL PROTECTED] ~]$ xstat
> n0000   tcp!10.10.0.10!6667     /Linux/x86_64   up      0
> n0001   tcp!10.10.0.11!6667     /Linux/x86_64   up      0
> 
> >
> > Once that is fixed, you may see that torque error anyway.  Did you do
> > the following as specified in the Readme?:
> >
> > 3.  There only needs to be one pbs_mom running on the head node(s).
> > Since there is only one pbs_mom running, Torque needs to be aware of the
> > number of nodes in your cluster, otherwise job submission will fail if
> > the user requests more than one node.
> > To make Torque aware of the number of nodes in your cluster, execute
> > qmgr and enter something like the following on the qmgr command prompt:
> >
> > Qmgr: s s resources_available.nodect = 91
> > Qmgr: s q batch resources_available.nodect=91
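> >
> > If it is easier, the same settings can be applied non-interactively
> > (just a sketch -- substitute your own node count):
> >
> > qmgr -c "set server resources_available.nodect = 91"
> > qmgr -c "set queue batch resources_available.nodect = 91"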
> 
> I did read the instructions, and this is the configuration, as per qmgr:
> 
> [EMAIL PROTECTED] ~]# qmgr
> Max open servers: 4
> Qmgr: print server
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 01:00:00
> set queue batch resources_available.nodect = 2
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = dgk3.chem.utoronto.ca
> set server managers = [EMAIL PROTECTED]
> set server operators = [EMAIL PROTECTED]
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server resources_available.nodect = 2
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server mom_job_sync = True
> set server keep_completed = 300
> set server next_job_number = 9
> 
> As you see, I tried to follow your README instructions pretty faithfully...
> Daniel
> 
> 
> >
> > On Wed, 2008-09-10 at 17:38 -0400, Daniel Gruner wrote:
> >> Hi
> >>
> >> I got an evaluation version of Moab (5.2.4), and torque (2.3.3), and
> >> after following the instructions in the
> >> sxcpu/moab_torque/README.Moab_Torque file, and running all of
> >> pbs_server, pbs_mom, and moab, it appears that moab only recognizes
> >> one node in my cluster.  This test cluster has a master and 2 slaves,
> >> each with 2 processors.
> >>
> >> Here are my configuration files:
> >>
> >> [EMAIL PROTECTED] torque]# cat server_priv/nodes
> >> dgk3.chem.utoronto.ca np=4
> >>
> >> [EMAIL PROTECTED] torque]# cat mom_priv/config
> >> $preexec /opt/moab/tools/xcpu-torque-wrapper.sh
> >>
> >> [EMAIL PROTECTED] moab]# cat moab.cfg
> >> ################################################################################
> >> #
> >> #  Moab Configuration File for moab-5.2.4
> >> #
> >> #  Documentation can be found at
> >> #  http://www.clusterresources.com/products/mwm/docs/moabadmin.shtml
> >> #
> >> #  For a complete list of all parameters (including those below) please see:
> >> #  http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml
> >> #
> >> #  For more information on the initial configuration, please refer to:
> >> #  http://www.clusterresources.com/products/mwm/docs/2.2initialconfig.shtm
> >> #
> >> #  Use 'mdiag -C' to check config file parameters for validity
> >> #
> >> ################################################################################
> >>
> >> SCHEDCFG[Moab]        SERVER=dgk3:42559
> >> ADMINCFG[1]           USERS=root
> >> TOOLSDIR              /opt/moab/tools
> >> LOGLEVEL              3
> >>
> >> ################################################################################
> >> #
> >> #  Resource Manager configuration
> >> #
> >> #  For more information on configuring a Resource Manager, see:
> >> #  http://www.clusterresources.com/products/mwm/docs/13.2rmconfiguration.shtml
> >> #
> >> ################################################################################
> >>
> >> RMCFG[dgk3]      TYPE=TYPE=NATIVE FLAGS=FULLCP
> >> RMCFG[dgk3]      CLUSTERQUERYURL=exec:///$TOOLSDIR/node.query.xcpu.pl
> >> RMCFG[dgk3]      WORKLOADQUERYURL=exec:///$TOOLSDIR/job.query.xcpu.pl
> >> RMCFG[dgk3]      JOBSTARTURL=exec:///$TOOLSDIR/job.start.xcpu.pl
> >> RMCFG[dgk3]      JOBCANCELURL=exec:///$TOOLSDIR/job.cancel.xcpu.pl
> >>
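> >> The config header above suggests running 'mdiag -C' to validate the
> >> parameters; I can run something like this (assuming mdiag is in the
> >> PATH) and send along the output:
> >>
> >> mdiag -C     # check moab.cfg parameters for validity
> >> mdiag -R     # show the state of the configured resource manager
> >>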
> >> [EMAIL PROTECTED] moab]# cat tools/config.xcpu.pl
> >> #################################################################################
> >> # Configuration file for xcpu tools
> >> #
> >> # This was written by ClusterResources.  Modifications were made for XCPU by
> >> # Hugh Greenberg.
> >> ################################################################################
> >>
> >> use FindBin qw($Bin);    # The $Bin directory is the directory this file is in
> >>
> >> # Important:  Moab::Tools must be included in the calling script
> >> # before this config file so that homeDir is properly set.
> >> our ($homeDir);
> >>
> >> # Set the PATH to include directories for bproc and torque binaries
> >> $ENV{PATH} = "$ENV{PATH}:/opt/torque/bin:/usr/bin:/usr/local/bin";
> >>
> >> # Set paths as necessary -- these can be short names if PATH is included above
> >> $xstat    = 'xstat';
> >> $xrx      = 'xrx';
> >> $xk       = 'xk';
> >> $qrun     = 'qrun';
> >> $qstat    = 'qstat';
> >> $pbsnodes = 'pbsnodes';
> >>
> >> # Set configured node resources
> >> $processorsPerNode = 2;        # Number of processors
> >> $memoryPerNode     = 2048;     # Memory in megabytes
> >> $swapPerNode       = 2048;     # Swap in megabytes
> >>
> >> # Specify level of log detail
> >> $logLevel = 1;
> >>
> >> # The default number of processors to run on
> >> $nodes = 1;
> >>
> >>
> >> Here is the output from "showq":
> >>
> >> [EMAIL PROTECTED] ~]$ showq
> >>
> >> active jobs------------------------
> >> JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME
> >>
> >>
> >> 0 active jobs               0 of 4 processors in use by local jobs (0.00%)
> >>                             0 of 1 nodes active      (0.00%)
> >>
> >> eligible jobs----------------------
> >> JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
> >>
> >>
> >> 0 eligible jobs
> >>
> >> blocked jobs-----------------------
> >> JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
> >>
> >>
> >> 0 blocked jobs
> >>
> >> Total jobs:  0
> >>
> >>
> >> When I run the following script:
> >>
> >> #!/bin/bash
> >> #PBS -l nodes=2
> >> #XCPU -p
> >>
> >> date
> >>
> >> it eventually finishes, but it runs both processes on the same node,
> >> n0000.  If I specify more than 2 nodes (processes, really), the job
> >> aborts saying it doesn't have enough resources.  The issue seems to
> >> be that moab thinks it has only one active node - it appears to
> >> simply probe the master, since that is the node specified in the
> >> server_priv/nodes file, and there is a single mom running.
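> >>
> >> For what it is worth, these are the checks I can run to compare what
> >> torque and moab each think they have (sketch only):
> >>
> >> pbsnodes -a    # nodes as seen by pbs_server
> >> mdiag -n       # nodes as seen by moab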
> >>
> >> Any ideas?
> >>
> >> Thanks,
> >> Daniel
> > --
> > Hugh Greenberg
> > Los Alamos National Laboratory, CCS-1
> > Email: [EMAIL PROTECTED]
> > Phone: (505) 665-6471
> >
> >
-- 
Hugh Greenberg
Los Alamos National Laboratory, CCS-1
Email: [EMAIL PROTECTED]
Phone: (505) 665-6471
