Daniel,

From the log file, it seems like Moab is trying to contact Torque directly and is not using the xcpu scripts at all. Also, there was a warning message in the log that says:

WARNING: cannot process attribute 'TYPE=NONE' specified for RM dgk3

It seems as though Moab cannot figure out which resource manager to use. I noticed an error in your moab.cfg file. The line:

RMCFG[dgk3] TYPE=TYPE=NATIVE FLAGS=FULLCP

should be:

RMCFG[dgk3] TYPE=NATIVE FLAGS=FULLCP
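
Once you've made that change, you can have Moab sanity-check the file itself. As the header of your moab.cfg notes, 'mdiag -C' checks config file parameters for validity, so something like the following should confirm that the RMCFG line now parses (this assumes the Moab binaries are in your PATH; if I remember right, 'mdiag -R' will also show the state of the dgk3 resource manager after you restart moab):

# Check moab.cfg for parameters Moab cannot parse; the old
# TYPE=TYPE=NATIVE line should have been flagged here
mdiag -C

# After restarting moab, confirm the dgk3 RM is recognized as NATIVE
mdiag -R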

Try that and let me know if it works or not. If it doesn't work, please send the logs.
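
One more check you can do without Moab in the loop is to run the cluster query script from your CLUSTERQUERYURL by hand (the path below is just $TOOLSDIR from your moab.cfg expanded). I'd expect it to print an entry for each node that statfs reports, so both n0000 and n0001 should show up:

# Run the native cluster query script directly; it should list both
# n0000 and n0001, matching the xstat output
/opt/moab/tools/node.query.xcpu.pl

# Once Moab is using the native RM, both nodes should appear here too
mdiag -n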

On Thu, 2008-09-11 at 10:37 -0400, Daniel Gruner wrote:
> Hi Hugh,
>
> There is only one file, moab.log, which I attach.
>
> Daniel
>
> On 9/11/08, Hugh Greenberg <[EMAIL PROTECTED]> wrote:
> >
> > Daniel,
> >
> > Can you send me Moab's log files? For me, the Moab log directory
> > is /opt/moab/log/. One of each type of log file would help me figure
> > out what is happening. Thanks.
> >
> > On Wed, 2008-09-10 at 23:09 -0400, Daniel Gruner wrote:
> > > Hi Hugh,
> > >
> > > On Wed, Sep 10, 2008 at 5:54 PM, Hugh Greenberg <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Daniel,
> > > >
> > > > Just to be sure, are you running statfs? The moab scripts get the
> > > > node information from statfs, and moab is only showing one node.
> > >
> > > Yes I am. Here is what xstat says:
> > >
> > > [EMAIL PROTECTED] ~]$ xstat
> > > n0000 tcp!10.10.0.10!6667 /Linux/x86_64 up 0
> > > n0001 tcp!10.10.0.11!6667 /Linux/x86_64 up 0
> > >
> > > > Once that is fixed, you may see that torque error anyway. Did you do
> > > > the following, as specified in the README?:
> > > >
> > > > 3. There only needs to be one pbs_mom running on the head node(s).
> > > > Since there is only one pbs_mom running, Torque needs to be aware of
> > > > the number of nodes in your cluster; otherwise job submission will
> > > > fail if the user requests more than one node.
> > > > To make Torque aware of the number of nodes in your cluster, execute
> > > > qmgr and enter something like the following at the qmgr command prompt:
> > > >
> > > > Qmgr: s s resources_available.nodect = 91
> > > > Qmgr: s q batch resources_available.nodect = 91
> > >
> > > I did read the instructions, and this is the configuration, as per qmgr:
> > >
> > > [EMAIL PROTECTED] ~]# qmgr
> > > Max open servers: 4
> > > Qmgr: print server
> > > #
> > > # Create queues and set their attributes.
> > > #
> > > #
> > > # Create and define queue batch
> > > #
> > > create queue batch
> > > set queue batch queue_type = Execution
> > > set queue batch resources_default.nodes = 1
> > > set queue batch resources_default.walltime = 01:00:00
> > > set queue batch resources_available.nodect = 2
> > > set queue batch enabled = True
> > > set queue batch started = True
> > > #
> > > # Set server attributes.
> > > #
> > > set server scheduling = True
> > > set server acl_hosts = dgk3.chem.utoronto.ca
> > > set server managers = [EMAIL PROTECTED]
> > > set server operators = [EMAIL PROTECTED]
> > > set server default_queue = batch
> > > set server log_events = 511
> > > set server mail_from = adm
> > > set server resources_available.nodect = 2
> > > set server scheduler_iteration = 600
> > > set server node_check_rate = 150
> > > set server tcp_timeout = 6
> > > set server mom_job_sync = True
> > > set server keep_completed = 300
> > > set server next_job_number = 9
> > >
> > > As you see, I tried to follow your README instructions pretty faithfully...
> > >
> > > Daniel
> > >
> > > > On Wed, 2008-09-10 at 17:38 -0400, Daniel Gruner wrote:
> > > > > Hi,
> > > > >
> > > > > I got an evaluation version of Moab (5.2.4) and Torque (2.3.3), and
> > > > > after following the instructions in the
> > > > > sxcpu/moab_torque/README.Moab_Torque file and running all of
> > > > > pbs_server, pbs_mom, and moab, it appears that moab only recognizes
> > > > > one node in my cluster. This test cluster has a master and 2 slaves,
> > > > > each with 2 processors.
> > > > >
> > > > > Here are my configuration files:
> > > > >
> > > > > [EMAIL PROTECTED] torque]# cat server_priv/nodes
> > > > > dgk3.chem.utoronto.ca np=4
> > > > >
> > > > > [EMAIL PROTECTED] torque]# cat mom_priv/config
> > > > > $preexec /opt/moab/tools/xcpu-torque-wrapper.sh
> > > > >
> > > > > [EMAIL PROTECTED] moab]# cat moab.cfg
> > > > > ################################################################################
> > > > > #
> > > > > # Moab Configuration File for moab-5.2.4
> > > > > #
> > > > > # Documentation can be found at
> > > > > # http://www.clusterresources.com/products/mwm/docs/moabadmin.shtml
> > > > > #
> > > > > # For a complete list of all parameters (including those below) please see:
> > > > > # http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml
> > > > > #
> > > > > # For more information on the initial configuration, please refer to:
> > > > > # http://www.clusterresources.com/products/mwm/docs/2.2initialconfig.shtm
> > > > > #
> > > > > # Use 'mdiag -C' to check config file parameters for validity
> > > > > #
> > > > > ################################################################################
> > > > >
> > > > > SCHEDCFG[Moab] SERVER=dgk3:42559
> > > > > ADMINCFG[1]    USERS=root
> > > > > TOOLSDIR       /opt/moab/tools
> > > > > LOGLEVEL       3
> > > > >
> > > > > ################################################################################
> > > > > #
> > > > > # Resource Manager configuration
> > > > > #
> > > > > # For more information on configuring a Resource Manager, see:
> > > > > # http://www.clusterresources.com/products/mwm/docs/13.2rmconfiguration.shtml
> > > > > #
> > > > > ################################################################################
> > > > >
> > > > > RMCFG[dgk3] TYPE=TYPE=NATIVE FLAGS=FULLCP
> > > > > RMCFG[dgk3] CLUSTERQUERYURL=exec:///$TOOLSDIR/node.query.xcpu.pl
> > > > > RMCFG[dgk3] WORKLOADQUERYURL=exec:///$TOOLSDIR/job.query.xcpu.pl
> > > > > RMCFG[dgk3] JOBSTARTURL=exec:///$TOOLSDIR/job.start.xcpu.pl
> > > > > RMCFG[dgk3] JOBCANCELURL=exec:///$TOOLSDIR/job.cancel.xcpu.pl
> > > > >
> > > > > [EMAIL PROTECTED] moab]# cat tools/config.xcpu.pl
> > > > > ################################################################################
> > > > > # Configuration file for xcpu tools
> > > > > #
> > > > > # This was written by ClusterResources. Modifications were made for
> > > > > # XCPU by Hugh Greenberg.
> > > > > ################################################################################
> > > > >
> > > > > use FindBin qw($Bin);  # The $Bin directory is the directory this file is in
> > > > >
> > > > > # Important: Moab::Tools must be included in the calling script
> > > > > # before this config file so that homeDir is properly set.
> > > > > our ($homeDir);
> > > > >
> > > > > # Set the PATH to include directories for bproc and torque binaries
> > > > > $ENV{PATH} = "$ENV{PATH}:/opt/torque/bin:/usr/bin:/usr/local/bin";
> > > > >
> > > > > # Set paths as necessary -- these can be short names if PATH is included above
> > > > > $xstat = 'xstat';
> > > > > $xrx = 'xrx';
> > > > > $xk = 'xk';
> > > > > $qrun = 'qrun';
> > > > > $qstat = 'qstat';
> > > > > $pbsnodes = 'pbsnodes';
> > > > >
> > > > > # Set configured node resources
> > > > > $processorsPerNode = 2;  # Number of processors
> > > > > $memoryPerNode = 2048;   # Memory in megabytes
> > > > > $swapPerNode = 2048;     # Swap in megabytes
> > > > >
> > > > > # Specify level of log detail
> > > > > $logLevel = 1;
> > > > >
> > > > > # The default number of processors to run on
> > > > > $nodes = 1;
> > > > >
> > > > > Here is the output from "showq":
> > > > >
> > > > > [EMAIL PROTECTED] ~]$ showq
> > > > >
> > > > > active jobs------------------------
> > > > > JOBID  USERNAME  STATE  PROCS  REMAINING  STARTTIME
> > > > >
> > > > > 0 active jobs    0 of 4 processors in use by local jobs (0.00%)
> > > > >                  0 of 1 nodes active (0.00%)
> > > > >
> > > > > eligible jobs----------------------
> > > > > JOBID  USERNAME  STATE  PROCS  WCLIMIT  QUEUETIME
> > > > >
> > > > > 0 eligible jobs
> > > > >
> > > > > blocked jobs-----------------------
> > > > > JOBID  USERNAME  STATE  PROCS  WCLIMIT  QUEUETIME
> > > > >
> > > > > 0 blocked jobs
> > > > >
> > > > > Total jobs: 0
> > > > >
> > > > > When I run the following script:
> > > > >
> > > > > #!/bin/bash
> > > > > #PBS -l nodes=2
> > > > > #XCPU -p
> > > > >
> > > > > date
> > > > >
> > > > > it eventually finishes, but it runs both processes on the same node,
> > > > > n0000. If I specify more than 2 nodes (processes, really), the job
> > > > > aborts, saying it doesn't have enough resources. The issue seems to
> > > > > be that moab understands that it has only one active node - it
> > > > > appears to simply probe the master, since it is the node specified
> > > > > in the server_priv/nodes file, and there is a single mom running.
> > > > >
> > > > > Any ideas?
> > > > >
> > > > > Thanks,
> > > > > Daniel

--
Hugh Greenberg
Los Alamos National Laboratory, CCS-1
Email: [EMAIL PROTECTED]
Phone: (505) 665-6471
