Hi Hugh,

On Wed, Sep 10, 2008 at 5:54 PM, Hugh Greenberg <[EMAIL PROTECTED]> wrote:
>
> Daniel,
>
> Just to be sure, are you running statfs?  The moab script gets the node
> information from statfs, and moab is only showing one node.
Yes I am.  Here is what xstat says:

[EMAIL PROTECTED] ~]$ xstat
n0000    tcp!10.10.0.10!6667    /Linux/x86_64    up    0
n0001    tcp!10.10.0.11!6667    /Linux/x86_64    up    0

> Once that is fixed, you may see that torque error anyway.  Did you do
> the following as specified in the Readme?:
>
> 3. There only needs to be one pbs_mom running on the head node(s).
> Since there is only one pbs_mom running, Torque needs to be aware of the
> number of nodes in your cluster, otherwise job submission will fail if
> the user requests more than one node.
> To make Torque aware of the number of nodes in your cluster, execute
> qmgr and enter something like the following on the qmgr command prompt:
>
> Qmgr: s s resources_available.nodect = 91
> Qmgr: s q batch resources_available.nodect = 91

I did read the instructions, and this is the configuration, as per qmgr:

[EMAIL PROTECTED] ~]# qmgr
Max open servers: 4
Qmgr: print server
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch resources_available.nodect = 2
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = dgk3.chem.utoronto.ca
set server managers = [EMAIL PROTECTED]
set server operators = [EMAIL PROTECTED]
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server resources_available.nodect = 2
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 9

As you see, I tried to follow your README instructions pretty faithfully...

Daniel

> On Wed, 2008-09-10 at 17:38 -0400, Daniel Gruner wrote:
>> Hi
>>
>> I got an evaluation version of Moab (5.2.4), and torque (2.3.3), and
>> after following the instructions in the
>> sxcpu/moab_torque/README.Moab_Torque file, and running all of
>> pbs_server, pbs_mom, and moab, it appears that moab only recognizes
>> one node in my cluster.  This test cluster has a master and 2 slaves,
>> each with 2 processors.
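(An aside on the nodect settings quoted above: qmgr also accepts commands
non-interactively via its -c flag, so the two README settings can be
scripted instead of typed at the Qmgr: prompt.  A minimal sketch, using the
2-node value from this cluster; adjust the count to whatever xstat reports:

  # long forms of the README's "s s" / "s q batch" abbreviations
  qmgr -c "set server resources_available.nodect = 2"
  qmgr -c "set queue batch resources_available.nodect = 2"

  # confirm that both settings took effect
  qmgr -c "print server" | grep nodect

Nothing here is specific to XCPU; it is plain Torque qmgr syntax.)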
>>
>> Here are my configuration files:
>>
>> [EMAIL PROTECTED] torque]# cat server_priv/nodes
>> dgk3.chem.utoronto.ca np=4
>>
>> [EMAIL PROTECTED] torque]# cat mom_priv/config
>> $preexec /opt/moab/tools/xcpu-torque-wrapper.sh
>>
>> [EMAIL PROTECTED] moab]# cat moab.cfg
>> ################################################################################
>> #
>> # Moab Configuration File for moab-5.2.4
>> #
>> # Documentation can be found at
>> # http://www.clusterresources.com/products/mwm/docs/moabadmin.shtml
>> #
>> # For a complete list of all parameters (including those below) please see:
>> # http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml
>> #
>> # For more information on the initial configuration, please refer to:
>> # http://www.clusterresources.com/products/mwm/docs/2.2initialconfig.shtm
>> #
>> # Use 'mdiag -C' to check config file parameters for validity
>> #
>> ################################################################################
>>
>> SCHEDCFG[Moab]   SERVER=dgk3:42559
>> ADMINCFG[1]      USERS=root
>> TOOLSDIR         /opt/moab/tools
>> LOGLEVEL         3
>>
>> ################################################################################
>> #
>> # Resource Manager configuration
>> #
>> # For more information on configuring a Resource Manager, see:
>> # http://www.clusterresources.com/products/mwm/docs/13.2rmconfiguration.shtml
>> #
>> ################################################################################
>>
>> RMCFG[dgk3] TYPE=NATIVE FLAGS=FULLCP
>> RMCFG[dgk3] CLUSTERQUERYURL=exec:///$TOOLSDIR/node.query.xcpu.pl
>> RMCFG[dgk3] WORKLOADQUERYURL=exec:///$TOOLSDIR/job.query.xcpu.pl
>> RMCFG[dgk3] JOBSTARTURL=exec:///$TOOLSDIR/job.start.xcpu.pl
>> RMCFG[dgk3] JOBCANCELURL=exec:///$TOOLSDIR/job.cancel.xcpu.pl
>>
>> [EMAIL PROTECTED] moab]# cat tools/config.xcpu.pl
>> ################################################################################
>> # Configuration file for xcpu tools
>> #
>> # This was written by ClusterResources.  Modifications were made for XCPU by
>> # Hugh Greenberg.
>> ################################################################################
>>
>> use FindBin qw($Bin);  # The $Bin directory is the directory this file is in
>>
>> # Important: Moab::Tools must be included in the calling script
>> # before this config file so that homeDir is properly set.
>> our ($homeDir);
>>
>> # Set the PATH to include directories for bproc and torque binaries
>> $ENV{PATH} = "$ENV{PATH}:/opt/torque/bin:/usr/bin:/usr/local/bin";
>>
>> # Set paths as necessary -- these can be short names if PATH is included above
>> $xstat = 'xstat';
>> $xrx = 'xrx';
>> $xk = 'xk';
>> $qrun = 'qrun';
>> $qstat = 'qstat';
>> $pbsnodes = 'pbsnodes';
>>
>> # Set configured node resources
>> $processorsPerNode = 2;    # Number of processors
>> $memoryPerNode = 2048;     # Memory in megabytes
>> $swapPerNode = 2048;       # Swap in megabytes
>>
>> # Specify level of log detail
>> $logLevel = 1;
>>
>> # The default number of processors to run on
>> $nodes = 1;
>>
>> Here is the output from "showq":
>>
>> [EMAIL PROTECTED] ~]$ showq
>>
>> active jobs------------------------
>> JOBID    USERNAME    STATE    PROCS    REMAINING    STARTTIME
>>
>> 0 active jobs            0 of 4 processors in use by local jobs (0.00%)
>>                          0 of 1 nodes active (0.00%)
>>
>> eligible jobs----------------------
>> JOBID    USERNAME    STATE    PROCS    WCLIMIT    QUEUETIME
>>
>> 0 eligible jobs
>>
>> blocked jobs-----------------------
>> JOBID    USERNAME    STATE    PROCS    WCLIMIT    QUEUETIME
>>
>> 0 blocked jobs
>>
>> Total jobs: 0
>>
>> When I run the following script:
>>
>> #!/bin/bash
>> #PBS -l nodes=2
>> #XCPU -p
>>
>> date
>>
>> it eventually finishes, but it runs both processes on the same node,
>> n0000.  If I specify more than 2 nodes (processes, really), the job
>> aborts, saying it doesn't have enough resources.  The issue seems to be
>> that moab believes it has only one active node: it appears to simply
>> probe the master, since that is the node specified in the
>> server_priv/nodes file, and there is only a single mom running.
>>
>> Any ideas?
>>
>> Thanks,
>> Daniel
>
> --
> Hugh Greenberg
> Los Alamos National Laboratory, CCS-1
> Email: [EMAIL PROTECTED]
> Phone: (505) 665-6471
>
>
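For anyone hitting the same one-node symptom: with a NATIVE resource
manager, Moab learns about nodes only from what the CLUSTERQUERYURL
script prints, so the quickest check is to run that script by hand and
compare its output against Moab's own node table.  A sketch, using the
paths from the moab.cfg quoted above (the exact attributes emitted by
node.query.xcpu.pl may differ; STATE, CPROC and AMEMORY are typical of
Moab's native-RM interface):

  # what the tools script is reporting to Moab; expect one line per node,
  # something like "n0000 STATE=Idle CPROC=2 AMEMORY=2048"
  /opt/moab/tools/node.query.xcpu.pl

  # what Moab has actually recorded
  mdiag -n

If the script prints only one node, the problem is upstream of Moab, in
the statfs/xstat layer; if it prints both slaves but mdiag -n still shows
one node, the RMCFG wiring is the place to look.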
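Along the same lines, a small variation on the test script above shows
where each process really lands: replacing date with hostname leaves the
directives untouched and prints one node name per spawned process.

  #!/bin/bash
  #PBS -l nodes=2
  #XCPU -p

  # one line of output per process; two different node names
  # means the job really did span both nodes
  hostname

On a healthy 2-node run this should print n0000 and n0001, not n0000
twice.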
