Hello,

   Writing between lines...

   El 04/10/2017 a las 18:52, Jeffrey Frey escribió:

     I didn't realize prior to this that the "--distribution" flag to "sbatch" 
only affects how an "srun" within the batch script will make CPU allocations.  
Prior to that happening, SLURM must allocate CPUs to the batch job, and _that_ 
distribution is dictated by how you have the "select/cons_res" plugin 
configured:
       

       SelectType=select/cons_res
       SelectTypeParameters=CR_Core     

     The default behavior is to spread the allocation across the available 
nodes -- thus, 4/4/3/3/3.  If you'd rather "pack" allocations onto the nodes, 
enable the CR_PACK_NODES option:
       

       SelectType=select/cons_res
       SelectTypeParameters=CR_Core,CR_Pack_Nodes     


   OK, I have added "CR_Pack_Nodes"...

            

     This will produce the 4/4/4/4/1 allocation pattern.  AFAIK there's no way 
to alter which CPU allocation pattern gets used on a per-job basis.   

   Nope, the result is not 4/4/4/4/1... I am submitting with "sbatch" and
   running with "srun", not "mpirun", after compiling OpenMPI with PMI2
   support (for "srun --mpi=pmi2")

     
     
     Once the job has been assigned nodes and CPUs on those nodes, the 
"--distribution" option you provide informs "srun" how to distribute the tasks 
it starts.  If you are not using "srun" to start the MPI program, Open MPI 
itself knows nothing beyond seeing
     
             SLURM_NODELIST=n[009-013]
             SLURM_TASKS_PER_NODE=4(x2),3(x3)
     
     in the environment which produces the host list
     
             n009:4
             n010:4
             n011:3
             n012:3
             n013:3
     
     for which the --map-by and --rank-by options to "mpirun" will affect the 
distribution.   


   How could I test, with a small program, how the cores are being filled
   step by step? Because my output file shows my "n" tasks line by line,
   but each line starts with "Process 0 on", so all my "n" tasks seem to
   be task number 0...
   My "hostname" program is:

      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char *argv[]) {
          int numprocs, rank, namelen;
          char processor_name[MPI_MAX_PROCESSOR_NAME];

          MPI_Init(&argc, &argv);
          MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Get_processor_name(processor_name, &namelen);

          printf("Process %d on %s out of %d\n", rank, processor_name,
                 numprocs);

          MPI_Finalize();
          return 0;
      }


   This small program, once compiled and run with sbatch+mpirun, shows the
   following for 12 tasks running on 12 nodes:

     Process 0 on clus01.hpc.local out of 12
     Process 3 on clus04.hpc.local out of 12
     Process 7 on clus08.hpc.local out of 12
     Process 8 on clus09.hpc.local out of 12
     Process 4 on clus05.hpc.local out of 12
     Process 2 on clus03.hpc.local out of 12
     Process 9 on clus10.hpc.local out of 12
     Process 10 on clus11.hpc.local out of 12
     Process 5 on clus06.hpc.local out of 12
     Process 11 on clus12.hpc.local out of 12
     Process 6 on clus07.hpc.local out of 12
     Process 1 on clus02.hpc.local out of 12

   However, if I run with sbatch+srun, the output is:

     Process 0 on clus01.hpc.local out of 1
     Process 0 on clus05.hpc.local out of 1
     Process 0 on clus03.hpc.local out of 1
     Process 0 on clus04.hpc.local out of 1
     Process 0 on clus02.hpc.local out of 1
     Process 0 on clus06.hpc.local out of 1
     Process 0 on clus12.hpc.local out of 1
     Process 0 on clus08.hpc.local out of 1
     Process 0 on clus11.hpc.local out of 1
     Process 0 on clus07.hpc.local out of 1
     Process 0 on clus10.hpc.local out of 1
     Process 0 on clus09.hpc.local out of 1

   Help, please...
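
   [Note: "out of 1" usually means each srun-launched task initialized MPI as
   an independent singleton, because no PMI method connected the tasks into
   one MPI_COMM_WORLD. With Open MPI built against SLURM's PMI2 library, the
   usual fix is to pass "--mpi=pmi2" to srun. A sketch of such a batch
   script — the script body, task count, and executable name here are
   illustrative, not taken from the thread:]

```bash
#!/bin/bash
#SBATCH --ntasks=12          # 12 MPI tasks
#SBATCH --output=hostname.out

# Launch through srun with the PMI2 plugin so all tasks join a single
# MPI_COMM_WORLD; without a PMI method each task reports "rank 0 out of 1".
srun --mpi=pmi2 ./hostname
```

   ["srun --mpi=list" shows which MPI plugins the local SLURM supports.]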


        On Oct 3, 2017, at 8:26 PM, Christopher Samuel [1] <[email protected]> wrote:
       
       
       On 02/10/17 20:51, Sysadmin CAOS wrote:


          I'm executing my MPI program with "mpirun"... Maybe this could be
          the problem? Do I need to execute it with "srun"?

        I suspect so, try it and see...
       
       -- 
       Christopher Samuel        Senior Systems Administrator
       Melbourne Bioinformatics - The University of Melbourne
        Email: [2] [email protected]        Phone: +61 (0)3 903 55545

     
     ::::::::::::::::::::::::::::::::::::::::::::::::::::::
     Jeffrey T. Frey, Ph.D.
     Systems Programmer V / HPC Management
     Network & Systems Services / College of Engineering
     University of Delaware, Newark DE  19716
     Office: (302) 831-6034  Mobile: (302) 419-4976
     ::::::::::::::::::::::::::::::::::::::::::::::::::::::
     
     
   

   


    [1] mailto:[email protected]
    [2] mailto:[email protected]

