It looks to me like Torque is not starting the job. Can you send me Torque's logs? Thanks.
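In case it helps, here is roughly what I mean - a minimal sketch for gathering them, assuming the default spool layout (TORQUE_HOME may well be somewhere else on your system, so adjust the path, and the date-stamped file names are just today's):

#!/bin/bash
# Collect the Torque-side evidence for one stuck job (sketch; adjust paths).
# TORQUE_HOME is an assumption -- it is often /var/spool/torque, but check
# where your install actually keeps its spool directory.
TORQUE_HOME=/var/spool/torque
JOBID=9.dgk3

qstat -f "$JOBID"        # full job attributes as Torque sees them
tracejob "$JOBID"        # pulls together the server/MOM log entries for this job
tail -n 200 "$TORQUE_HOME"/server_logs/"$(date +%Y%m%d)"   # today's pbs_server log
tail -n 200 "$TORQUE_HOME"/mom_logs/"$(date +%Y%m%d)"      # today's pbs_mom log

The server and MOM logs around the time of the qsub are the most useful part.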
On Thu, 2008-09-11 at 11:24 -0400, Daniel Gruner wrote:
> Oh boy!  Thanks for finding the typo...  Happens when you cut and paste...
>
> Ok, so we move on: after restarting moab, the showq screen correctly
> shows 2 nodes available.  However, when I qsub a couple of jobs, they
> remain queued:
>
> [EMAIL PROTECTED] ~]$ qstat
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 9.dgk3                    script.cmd       danny                  0 Q batch
> 10.dgk3                   script.cmd       danny                  0 Q batch
>
> [EMAIL PROTECTED] ~]$ showq
>
> active jobs------------------------
> JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME
>
> 0 active jobs             0 of 4 processors in use by local jobs (0.00%)
>                           0 of 2 nodes active      (0.00%)
>
> eligible jobs----------------------
> JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
>
> 0 eligible jobs
>
> blocked jobs-----------------------
> JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
>
> 10.dgk3.chem.utoronto.ca  danny   BatchHold     2     1:00:00  Thu Sep 11 11:12:20
> 9.dgk3.chem.utoronto.ca   danny   BatchHold     2     1:00:00  Thu Sep 11 11:12:18
>
> 2 blocked jobs
>
> Total jobs:  2
>
> and in fact, they remain blocked by moab.  I attach here the logs (the
> latest and relevant part of moab.log, plus the other two logs).
>
> Thanks,
> Daniel
>
>
> On 9/11/08, Hugh Greenberg <[EMAIL PROTECTED]> wrote:
> >
> > Daniel,
> >
> > From the log file, it seems like moab is trying to contact torque
> > directly and it is not using the xcpu scripts at all.  Also, there was a
> > warning message in the log that says:
> >
> > WARNING:  cannot process attribute 'TYPE=NONE' specified for RM dgk3
> >
> > It seems as though Moab cannot figure out which resource manager to use.
> > I noticed an error in your moab.cfg file.  The line:
> >
> > RMCFG[dgk3] TYPE=TYPE=NATIVE FLAGS=FULLCP
> >
> > should be:
> >
> > RMCFG[dgk3] TYPE=NATIVE FLAGS=FULLCP
> >
> > Try that and let me know if it works or not.  If it doesn't work, please
> > send the logs.
> >
> >
> > On Thu, 2008-09-11 at 10:37 -0400, Daniel Gruner wrote:
> > > Hi Hugh,
> > >
> > > There is only one file, moab.log, which I attach.
> > >
> > > Daniel
> > >
> > > On 9/11/08, Hugh Greenberg <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Daniel,
> > > >
> > > > Can you send me Moab's log files?  For me, the Moab log directory
> > > > is /opt/moab/log/.  One of each type of log file would help me figure
> > > > out what is happening.  Thanks.
> > > >
> > > >
> > > > On Wed, 2008-09-10 at 23:09 -0400, Daniel Gruner wrote:
> > > > > Hi Hugh,
> > > > >
> > > > > On Wed, Sep 10, 2008 at 5:54 PM, Hugh Greenberg <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > Daniel,
> > > > > >
> > > > > > Just to be sure, are you running statfs?  The moab scripts get the node
> > > > > > information from statfs, and moab is only showing one node.
> > > > >
> > > > > Yes I am.  Here is what xstat says:
> > > > > [EMAIL PROTECTED] ~]$ xstat
> > > > > n0000    tcp!10.10.0.10!6667    /Linux/x86_64    up    0
> > > > > n0001    tcp!10.10.0.11!6667    /Linux/x86_64    up    0
> > > > >
> > > > >
> > > > > > Once that is fixed, you may see that torque error anyway.  Did you do
> > > > > > the following as specified in the Readme?:
> > > > > >
> > > > > > 3. There only needs to be one pbs_mom running on the head node(s).
> > > > > > Since there is only one pbs_mom running, Torque needs to be aware of the
> > > > > > number of nodes in your cluster, otherwise job submission will fail if
> > > > > > the user requests more than one node.
> > > > > > To make Torque aware of the number of nodes in your cluster, execute
> > > > > > qmgr and enter something like the following on the qmgr command prompt:
> > > > > >
> > > > > > Qmgr: s s resources_available.nodect = 91
> > > > > > Qmgr: s q batch resources_available.nodect=91
> > > > >
> > > > > I did read the instructions, and this is the configuration, as per qmgr:
> > > > >
> > > > > [EMAIL PROTECTED] ~]# qmgr
> > > > > Max open servers: 4
> > > > > Qmgr: print server
> > > > > #
> > > > > # Create queues and set their attributes.
> > > > > #
> > > > > #
> > > > > # Create and define queue batch
> > > > > #
> > > > > create queue batch
> > > > > set queue batch queue_type = Execution
> > > > > set queue batch resources_default.nodes = 1
> > > > > set queue batch resources_default.walltime = 01:00:00
> > > > > set queue batch resources_available.nodect = 2
> > > > > set queue batch enabled = True
> > > > > set queue batch started = True
> > > > > #
> > > > > # Set server attributes.
> > > > > #
> > > > > set server scheduling = True
> > > > > set server acl_hosts = dgk3.chem.utoronto.ca
> > > > > set server managers = [EMAIL PROTECTED]
> > > > > set server operators = [EMAIL PROTECTED]
> > > > > set server default_queue = batch
> > > > > set server log_events = 511
> > > > > set server mail_from = adm
> > > > > set server resources_available.nodect = 2
> > > > > set server scheduler_iteration = 600
> > > > > set server node_check_rate = 150
> > > > > set server tcp_timeout = 6
> > > > > set server mom_job_sync = True
> > > > > set server keep_completed = 300
> > > > > set server next_job_number = 9
> > > > >
> > > > > As you see, I tried to follow your README instructions pretty faithfully...
> > > > > Daniel
> > > > >
> > > > >
> > > > >
> > > > > > On Wed, 2008-09-10 at 17:38 -0400, Daniel Gruner wrote:
> > > > > >> Hi
> > > > > >>
> > > > > >> I got an evaluation version of Moab (5.2.4), and torque (2.3.3), and
> > > > > >> after following the instructions in the
> > > > > >> sxcpu/moab_torque/README.Moab_Torque file, and running all of
> > > > > >> pbs_server, pbs_mom, and moab, it appears that moab only recognizes
> > > > > >> one node in my cluster.  This test cluster has a master and 2 slaves,
> > > > > >> each with 2 processors.
> > > > > >>
> > > > > >> Here are my configuration files:
> > > > > >>
> > > > > >> [EMAIL PROTECTED] torque]# cat server_priv/nodes
> > > > > >> dgk3.chem.utoronto.ca np=4
> > > > > >>
> > > > > >> [EMAIL PROTECTED] torque]# cat mom_priv/config
> > > > > >> $preexec    /opt/moab/tools/xcpu-torque-wrapper.sh
> > > > > >>
> > > > > >> [EMAIL PROTECTED] moab]# cat moab.cfg
> > > > > >> ################################################################################
> > > > > >> #
> > > > > >> # Moab Configuration File for moab-5.2.4
> > > > > >> #
> > > > > >> # Documentation can be found at
> > > > > >> # http://www.clusterresources.com/products/mwm/docs/moabadmin.shtml
> > > > > >> #
> > > > > >> # For a complete list of all parameters (including those below) please see:
> > > > > >> # http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml
> > > > > >> #
> > > > > >> # For more information on the initial configuration, please refer to:
> > > > > >> # http://www.clusterresources.com/products/mwm/docs/2.2initialconfig.shtm
> > > > > >> #
> > > > > >> # Use 'mdiag -C' to check config file parameters for validity
> > > > > >> #
> > > > > >> ################################################################################
> > > > > >>
> > > > > >> SCHEDCFG[Moab]    SERVER=dgk3:42559
> > > > > >> ADMINCFG[1]       USERS=root
> > > > > >> TOOLSDIR          /opt/moab/tools
> > > > > >> LOGLEVEL          3
> > > > > >>
> > > > > >> ################################################################################
> > > > > >> #
> > > > > >> # Resource Manager configuration
> > > > > >> #
> > > > > >> # For more information on configuring a Resource Manager, see:
> > > > > >> # http://www.clusterresources.com/products/mwm/docs/13.2rmconfiguration.shtml
> > > > > >> #
> > > > > >> ################################################################################
> > > > > >>
> > > > > >> RMCFG[dgk3]    TYPE=TYPE=NATIVE FLAGS=FULLCP
> > > > > >> RMCFG[dgk3]    CLUSTERQUERYURL=exec:///$TOOLSDIR/node.query.xcpu.pl
> > > > > >> RMCFG[dgk3]    WORKLOADQUERYURL=exec:///$TOOLSDIR/job.query.xcpu.pl
> > > > > >> RMCFG[dgk3]    JOBSTARTURL=exec:///$TOOLSDIR/job.start.xcpu.pl
> > > > > >> RMCFG[dgk3]    JOBCANCELURL=exec:///$TOOLSDIR/job.cancel.xcpu.pl
> > > > > >>
> > > > > >> [EMAIL PROTECTED] moab]# cat tools/config.xcpu.pl
> > > > > >> #################################################################################
> > > > > >> # Configuration file for xcpu tools
> > > > > >> #
> > > > > >> # This was written by ClusterResources.  Modifications were made for XCPU by
> > > > > >> # Hugh Greenberg.
> > > > > >> ################################################################################
> > > > > >>
> > > > > >> use FindBin qw($Bin);   # The $Bin directory is the directory this file is in
> > > > > >>
> > > > > >> # Important: Moab::Tools must be included in the calling script
> > > > > >> # before this config file so that homeDir is properly set.
> > > > > >> our ($homeDir);
> > > > > >>
> > > > > >> # Set the PATH to include directories for bproc and torque binaries
> > > > > >> $ENV{PATH} = "$ENV{PATH}:/opt/torque/bin:/usr/bin:/usr/local/bin";
> > > > > >>
> > > > > >> # Set paths as necessary -- these can be short names if PATH is included above
> > > > > >> $xstat = 'xstat';
> > > > > >> $xrx = 'xrx';
> > > > > >> $xk = 'xk';
> > > > > >> $qrun = 'qrun';
> > > > > >> $qstat = 'qstat';
> > > > > >> $pbsnodes = 'pbsnodes';
> > > > > >>
> > > > > >> # Set configured node resources
> > > > > >> $processorsPerNode = 2;   # Number of processors
> > > > > >> $memoryPerNode = 2048;    # Memory in megabytes
> > > > > >> $swapPerNode = 2048;      # Swap in megabytes
> > > > > >>
> > > > > >> # Specify level of log detail
> > > > > >> $logLevel = 1;
> > > > > >>
> > > > > >> # The default number of processors to run on
> > > > > >> $nodes = 1;
> > > > > >>
> > > > > >>
> > > > > >> Here is the output from "showq":
> > > > > >>
> > > > > >> [EMAIL PROTECTED] ~]$ showq
> > > > > >>
> > > > > >> active jobs------------------------
> > > > > >> JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME
> > > > > >>
> > > > > >> 0 active jobs             0 of 4 processors in use by local jobs (0.00%)
> > > > > >>                           0 of 1 nodes active      (0.00%)
> > > > > >>
> > > > > >> eligible jobs----------------------
> > > > > >> JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
> > > > > >>
> > > > > >> 0 eligible jobs
> > > > > >>
> > > > > >> blocked jobs-----------------------
> > > > > >> JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
> > > > > >>
> > > > > >> 0 blocked jobs
> > > > > >>
> > > > > >> Total jobs:  0
> > > > > >>
> > > > > >>
> > > > > >> When I run the following script:
> > > > > >>
> > > > > >> #!/bin/bash
> > > > > >> #PBS -l nodes=2
> > > > > >> #XCPU -p
> > > > > >>
> > > > > >> date
> > > > > >>
> > > > > >> it eventually finishes, but it runs both processes on the same node
> > > > > >> n0000.  If I specify more than 2 nodes (processes, really), the job
> > > > > >> aborts saying it doesn't have enough resources.  The issue seems to be
> > > > > >> that moab understands that it has only one active node - it appears to
> > > > > >> simply probe the master, since it is the node specified in the
> > > > > >> server_priv/nodes file, and there is a single mom running.
> > > > > >>
> > > > > >> Any ideas?
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Daniel
> > > > > > --
> > > > > > Hugh Greenberg
> > > > > > Los Alamos National Laboratory, CCS-1
> > > > > > Email: [EMAIL PROTECTED]
> > > > > > Phone: (505) 665-6471
> > > > >
> > > >
> > > > --
> > > > Hugh Greenberg
> > > > Los Alamos National Laboratory, CCS-1
> > > > Email: [EMAIL PROTECTED]
> > > > Phone: (505) 665-6471
> > >
> >
> > --
> > Hugh Greenberg
> > Los Alamos National Laboratory, CCS-1
> > Email: [EMAIL PROTECTED]
> > Phone: (505) 665-6471
>
--
Hugh Greenberg
Los Alamos National Laboratory, CCS-1
Email: [EMAIL PROTECTED]
Phone: (505) 665-6471
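
P.S. Once Torque's logs are in hand, it is also worth asking Moab why it put those jobs on BatchHold. A rough sketch of the commands I would use (assuming a standard Moab install with the client tools on your PATH; the job id comes from your showq output above):

#!/bin/bash
# Inspect and retry a held job from the Moab side (sketch; job id is an example).
JOBID=9.dgk3.chem.utoronto.ca

checkjob -v "$JOBID"       # shows the hold reason and any resource-manager failure message
mdiag -R                   # confirms the dgk3 RM loaded as TYPE=NATIVE with the xcpu URLs
releasehold -a "$JOBID"    # clears the holds so Moab attempts the start again

checkjob usually quotes the failure message it got back from the RM, which should point at whichever side is refusing the start.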
