Hi Hugh,

On Wed, Sep 10, 2008 at 5:54 PM, Hugh Greenberg <[EMAIL PROTECTED]> wrote:
>
> Daniel,
>
> Just to be sure, are you running statfs?  The moab script gets the node
> information from statfs, and moab is only showing one node.
Yes I am.  Here is what xstat says:

[EMAIL PROTECTED] ~]$ xstat
n0000    tcp!10.10.0.10!6667    /Linux/x86_64    up    0
n0001    tcp!10.10.0.11!6667    /Linux/x86_64    up    0

> Once that is fixed, you may see that torque error anyway.  Did you do
> the following as specified in the Readme?:
>
> 3. There only needs to be one pbs_mom running on the head node(s).
> Since there is only one pbs_mom running, Torque needs to be aware of the
> number of nodes in your cluster, otherwise job submission will fail if
> the user requests more than one node.
> To make Torque aware of the number of nodes in your cluster, execute
> qmgr and enter something like the following on the qmgr command prompt:
>
> Qmgr: s s resources_available.nodect = 91
> Qmgr: s q batch resources_available.nodect = 91

I did read the instructions, and this is the configuration, as per qmgr:

[EMAIL PROTECTED] ~]# qmgr
Max open servers: 4
Qmgr: print server
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch resources_available.nodect = 2
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = dgk3.chem.utoronto.ca
set server managers = [EMAIL PROTECTED]
set server operators = [EMAIL PROTECTED]
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server resources_available.nodect = 2
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 9

As you see, I tried to follow your README instructions pretty faithfully...

Daniel

> On Wed, 2008-09-10 at 17:38 -0400, Daniel Gruner wrote:
>> Hi
>>
>> I got an evaluation version of Moab (5.2.4), and torque (2.3.3), and
>> after following the instructions in the
>> sxcpu/moab_torque/README.Moab_Torque file, and running all of
>> pbs_server, pbs_mom, and moab, it appears that moab only recognizes
>> one node in my cluster.  This test cluster has a master and 2 slaves,
>> each with 2 processors.
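(An aside on the nodect settings quoted above: qmgr also accepts commands
non-interactively via its -c flag, so the two README settings can be
scripted instead of typed at the Qmgr: prompt.  A minimal sketch, using the
2-node value from this cluster; adjust the count to whatever xstat reports:

  # long forms of the README's "s s" / "s q batch" abbreviations
  qmgr -c "set server resources_available.nodect = 2"
  qmgr -c "set queue batch resources_available.nodect = 2"

  # confirm that both settings took effect
  qmgr -c "print server" | grep nodect

Nothing here is specific to XCPU; it is plain Torque qmgr syntax.)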
>>
>> Here are my configuration files:
>>
>> [EMAIL PROTECTED] torque]# cat server_priv/nodes
>> dgk3.chem.utoronto.ca np=4
>>
>> [EMAIL PROTECTED] torque]# cat mom_priv/config
>> $preexec /opt/moab/tools/xcpu-torque-wrapper.sh
>>
>> [EMAIL PROTECTED] moab]# cat moab.cfg
>> ################################################################################
>> #
>> # Moab Configuration File for moab-5.2.4
>> #
>> # Documentation can be found at
>> # http://www.clusterresources.com/products/mwm/docs/moabadmin.shtml
>> #
>> # For a complete list of all parameters (including those below) please see:
>> # http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml
>> #
>> # For more information on the initial configuration, please refer to:
>> # http://www.clusterresources.com/products/mwm/docs/2.2initialconfig.shtm
>> #
>> # Use 'mdiag -C' to check config file parameters for validity
>> #
>> ################################################################################
>>
>> SCHEDCFG[Moab]   SERVER=dgk3:42559
>> ADMINCFG[1]      USERS=root
>> TOOLSDIR         /opt/moab/tools
>> LOGLEVEL         3
>>
>> ################################################################################
>> #
>> # Resource Manager configuration
>> #
>> # For more information on configuring a Resource Manager, see:
>> # http://www.clusterresources.com/products/mwm/docs/13.2rmconfiguration.shtml
>> #
>> ################################################################################
>>
>> RMCFG[dgk3] TYPE=NATIVE FLAGS=FULLCP
>> RMCFG[dgk3] CLUSTERQUERYURL=exec:///$TOOLSDIR/node.query.xcpu.pl
>> RMCFG[dgk3] WORKLOADQUERYURL=exec:///$TOOLSDIR/job.query.xcpu.pl
>> RMCFG[dgk3] JOBSTARTURL=exec:///$TOOLSDIR/job.start.xcpu.pl
>> RMCFG[dgk3] JOBCANCELURL=exec:///$TOOLSDIR/job.cancel.xcpu.pl
>>
>> [EMAIL PROTECTED] moab]# cat tools/config.xcpu.pl
>> ################################################################################
>> # Configuration file for xcpu tools
>> #
>> # This was written by ClusterResources.  Modifications were made for XCPU by
>> # Hugh Greenberg.
>> ################################################################################
>>
>> use FindBin qw($Bin);  # The $Bin directory is the directory this file is in
>>
>> # Important: Moab::Tools must be included in the calling script
>> # before this config file so that homeDir is properly set.
>> our ($homeDir);
>>
>> # Set the PATH to include directories for bproc and torque binaries
>> $ENV{PATH} = "$ENV{PATH}:/opt/torque/bin:/usr/bin:/usr/local/bin";
>>
>> # Set paths as necessary -- these can be short names if PATH is included above
>> $xstat = 'xstat';
>> $xrx = 'xrx';
>> $xk = 'xk';
>> $qrun = 'qrun';
>> $qstat = 'qstat';
>> $pbsnodes = 'pbsnodes';
>>
>> # Set configured node resources
>> $processorsPerNode = 2;    # Number of processors
>> $memoryPerNode = 2048;     # Memory in megabytes
>> $swapPerNode = 2048;       # Swap in megabytes
>>
>> # Specify level of log detail
>> $logLevel = 1;
>>
>> # The default number of processors to run on
>> $nodes = 1;
>>
>> Here is the output from "showq":
>>
>> [EMAIL PROTECTED] ~]$ showq
>>
>> active jobs------------------------
>> JOBID    USERNAME    STATE    PROCS    REMAINING    STARTTIME
>>
>> 0 active jobs            0 of 4 processors in use by local jobs (0.00%)
>>                          0 of 1 nodes active (0.00%)
>>
>> eligible jobs----------------------
>> JOBID    USERNAME    STATE    PROCS    WCLIMIT    QUEUETIME
>>
>> 0 eligible jobs
>>
>> blocked jobs-----------------------
>> JOBID    USERNAME    STATE    PROCS    WCLIMIT    QUEUETIME
>>
>> 0 blocked jobs
>>
>> Total jobs: 0
>>
>> When I run the following script:
>>
>> #!/bin/bash
>> #PBS -l nodes=2
>> #XCPU -p
>>
>> date
>>
>> it eventually finishes, but it runs both processes on the same node,
>> n0000.  If I specify more than 2 nodes (processes, really), the job
>> aborts, saying it doesn't have enough resources.  The issue seems to be
>> that moab believes it has only one active node: it appears to simply
>> probe the master, since that is the node specified in the
>> server_priv/nodes file, and there is only a single mom running.
>>
>> Any ideas?
>>
>> Thanks,
>> Daniel
>
> --
> Hugh Greenberg
> Los Alamos National Laboratory, CCS-1
> Email: [EMAIL PROTECTED]
> Phone: (505) 665-6471
>
>
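For anyone hitting the same one-node symptom: with a NATIVE resource
manager, Moab learns about nodes only from what the CLUSTERQUERYURL
script prints, so the quickest check is to run that script by hand and
compare its output against Moab's own node table.  A sketch, using the
paths from the moab.cfg quoted above (the exact attributes emitted by
node.query.xcpu.pl may differ; STATE, CPROC and AMEMORY are typical of
Moab's native-RM interface):

  # what the tools script is reporting to Moab; expect one line per node,
  # something like "n0000 STATE=Idle CPROC=2 AMEMORY=2048"
  /opt/moab/tools/node.query.xcpu.pl

  # what Moab has actually recorded
  mdiag -n

If the script prints only one node, the problem is upstream of Moab, in
the statfs/xstat layer; if it prints both slaves but mdiag -n still shows
one node, the RMCFG wiring is the place to look.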
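Along the same lines, a small variation on the test script above shows
where each process really lands: replacing date with hostname leaves the
directives untouched and prints one node name per spawned process.

  #!/bin/bash
  #PBS -l nodes=2
  #XCPU -p

  # one line of output per process; two different node names
  # means the job really did span both nodes
  hostname

On a healthy 2-node run this should print n0000 and n0001, not n0000
twice.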
