Daniel,

Can you send me Moab's log files? For me, the Moab log directory is /opt/moab/log/. One of each type of log file would help me figure out what is happening.

Thanks.
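Something like this should bundle the whole log directory up in one shot (just a rough sketch -- adjust the path if your install keeps its logs somewhere else):

    cd /opt/moab
    tar czf ~/moab-logs.tar.gz log/

If the directory is large, one recent file of each type is plenty. I have also sketched a couple of quick checks after the quoted thread below that might help narrow down why moab only sees one node.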
On Wed, 2008-09-10 at 23:09 -0400, Daniel Gruner wrote:
> Hi Hugh,
>
> On Wed, Sep 10, 2008 at 5:54 PM, Hugh Greenberg <[EMAIL PROTECTED]> wrote:
> >
> > Daniel,
> >
> > Just to be sure, are you running statfs? The moab scripts get the node
> > information from statfs and moab is only showing one node.
>
> Yes I am. Here is what xstat says:
> [EMAIL PROTECTED] ~]$ xstat
> n0000    tcp!10.10.0.10!6667    /Linux/x86_64    up    0
> n0001    tcp!10.10.0.11!6667    /Linux/x86_64    up    0
>
> >
> > Once that is fixed, you may see that torque error anyway. Did you do
> > the following as specified in the Readme?:
> >
> > 3. There only needs to be one pbs_mom running on the head node(s).
> > Since there is only one pbs_mom running, Torque needs to be aware of the
> > number of nodes in your cluster, otherwise job submission will fail if
> > the user requests more than one node.
> > To make Torque aware of the number of nodes in your cluster, execute
> > qmgr and enter something like the following on the qmgr command prompt:
> >
> > Qmgr: s s resources_available.nodect = 91
> > Qmgr: s q batch resources_available.nodect=91
>
> I did read the instructions, and this is the configuration, as per qmgr:
>
> [EMAIL PROTECTED] ~]# qmgr
> Max open servers: 4
> Qmgr: print server
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 01:00:00
> set queue batch resources_available.nodect = 2
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = dgk3.chem.utoronto.ca
> set server managers = [EMAIL PROTECTED]
> set server operators = [EMAIL PROTECTED]
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server resources_available.nodect = 2
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server mom_job_sync = True
> set server keep_completed = 300
> set server next_job_number = 9
>
> As you see, I tried to follow your README instructions pretty faithfully...
> Daniel
>
> >
> > On Wed, 2008-09-10 at 17:38 -0400, Daniel Gruner wrote:
> >> Hi
> >>
> >> I got an evaluation version of Moab (5.2.4), and torque (2.3.3), and
> >> after following the instructions in the
> >> sxcpu/moab_torque/README.Moab_Torque file, and running all of
> >> pbs_server, pbs_mom, and moab, it appears that moab only recognizes
> >> one node in my cluster. This test cluster has a master and 2 slaves,
> >> each with 2 processors.
> >>
> >> Here are my configuration files:
> >>
> >> [EMAIL PROTECTED] torque]# cat server_priv/nodes
> >> dgk3.chem.utoronto.ca np=4
> >>
> >> [EMAIL PROTECTED] torque]# cat mom_priv/config
> >> $preexec /opt/moab/tools/xcpu-torque-wrapper.sh
> >>
> >> [EMAIL PROTECTED] moab]# cat moab.cfg
> >> ################################################################################
> >> #
> >> # Moab Configuration File for moab-5.2.4
> >> #
> >> # Documentation can be found at
> >> # http://www.clusterresources.com/products/mwm/docs/moabadmin.shtml
> >> #
> >> # For a complete list of all parameters (including those below) please see:
> >> # http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml
> >> #
> >> # For more information on the initial configuration, please refer to:
> >> # http://www.clusterresources.com/products/mwm/docs/2.2initialconfig.shtm
> >> #
> >> # Use 'mdiag -C' to check config file parameters for validity
> >> #
> >> ################################################################################
> >>
> >> SCHEDCFG[Moab] SERVER=dgk3:42559
> >> ADMINCFG[1] USERS=root
> >> TOOLSDIR /opt/moab/tools
> >> LOGLEVEL 3
> >>
> >> ################################################################################
> >> #
> >> # Resource Manager configuration
> >> #
> >> # For more information on configuring a Resource Manager, see:
> >> # http://www.clusterresources.com/products/mwm/docs/13.2rmconfiguration.shtml
> >> #
> >> ################################################################################
> >>
> >> RMCFG[dgk3] TYPE=TYPE=NATIVE FLAGS=FULLCP
> >> RMCFG[dgk3] CLUSTERQUERYURL=exec:///$TOOLSDIR/node.query.xcpu.pl
> >> RMCFG[dgk3] WORKLOADQUERYURL=exec:///$TOOLSDIR/job.query.xcpu.pl
> >> RMCFG[dgk3] JOBSTARTURL=exec:///$TOOLSDIR/job.start.xcpu.pl
> >> RMCFG[dgk3] JOBCANCELURL=exec:///$TOOLSDIR/job.cancel.xcpu.pl
> >>
> >> [EMAIL PROTECTED] moab]# cat tools/config.xcpu.pl
> >> ################################################################################
> >> # Configuration file for xcpu tools
> >> #
> >> # This was written by ClusterResources. Modifications were made for XCPU by
> >> # Hugh Greenberg.
> >> ################################################################################
> >>
> >> use FindBin qw($Bin);  # The $Bin directory is the directory this file is in
> >>
> >> # Important: Moab::Tools must be included in the calling script
> >> # before this config file so that homeDir is properly set.
> >> our ($homeDir);
> >>
> >> # Set the PATH to include directories for bproc and torque binaries
> >> $ENV{PATH} = "$ENV{PATH}:/opt/torque/bin:/usr/bin:/usr/local/bin";
> >>
> >> # Set paths as necessary -- these can be short names if PATH is included above
> >> $xstat = 'xstat';
> >> $xrx = 'xrx';
> >> $xk = 'xk';
> >> $qrun = 'qrun';
> >> $qstat = 'qstat';
> >> $pbsnodes = 'pbsnodes';
> >>
> >> # Set configured node resources
> >> $processorsPerNode = 2;   # Number of processors
> >> $memoryPerNode = 2048;    # Memory in megabytes
> >> $swapPerNode = 2048;      # Swap in megabytes
> >>
> >> # Specify level of log detail
> >> $logLevel = 1;
> >>
> >> # The default number of processors to run on
> >> $nodes = 1;
> >>
> >>
> >> Here is the output from "showq":
> >>
> >> [EMAIL PROTECTED] ~]$ showq
> >>
> >> active jobs------------------------
> >> JOBID          USERNAME    STATE  PROCS   REMAINING            STARTTIME
> >>
> >>
> >> 0 active jobs          0 of 4 processors in use by local jobs (0.00%)
> >>                        0 of 1 nodes active      (0.00%)
> >>
> >> eligible jobs----------------------
> >> JOBID          USERNAME    STATE  PROCS     WCLIMIT            QUEUETIME
> >>
> >>
> >> 0 eligible jobs
> >>
> >> blocked jobs-----------------------
> >> JOBID          USERNAME    STATE  PROCS     WCLIMIT            QUEUETIME
> >>
> >>
> >> 0 blocked jobs
> >>
> >> Total jobs: 0
> >>
> >>
> >> When I run the following script:
> >>
> >> #!/bin/bash
> >> #PBS -l nodes=2
> >> #XCPU -p
> >>
> >> date
> >>
> >> it eventually finishes, but it runs both processes on the same node
> >> n0000. If I specify more than 2 nodes (processes, really), the job
> >> aborts saying it doesn't have enough resources. The issue seems to be
> >> that moab understands that it has only one active node - it appears to
> >> simply probe the master, since it is the node specified in the
> >> server_priv/nodes file, and there is a single mom running.
> >>
> >> Any ideas?
> >>
> >> Thanks,
> >> Daniel
> > --
> > Hugh Greenberg
> > Los Alamos National Laboratory, CCS-1
> > Email: [EMAIL PROTECTED]
> > Phone: (505) 665-6471
> >
> >

--
Hugh Greenberg
Los Alamos National Laboratory, CCS-1
Email: [EMAIL PROTECTED]
Phone: (505) 665-6471
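A couple of quick checks that might help while the logs are on their way (rough sketches only -- the exact way the xcpu query script is invoked is an assumption on my part, so adjust as needed):

    # Ask Torque directly what it thinks it has (one mom with np=4 and
    # nodect=2, per the config quoted above).
    pbsnodes -a
    qmgr -c 'print server' | grep nodect

    # Run Moab's cluster query script by hand, the same script named in
    # CLUSTERQUERYURL, and see whether it reports one node or two.
    # (Assumption: it can be run standalone with no arguments.)
    /opt/moab/tools/node.query.xcpu.pl

    # Confirm xstat itself still shows both n0000 and n0001 up.
    xstat

If the query script only reports one node, that would point at the xcpu/tools side rather than at moab.cfg.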
