[slurm-dev] Re: Bug report from Vladivostok.

Daniel Letai Fri, 06 May 2016 18:21:56 -0700
 Looking at your patch, and without reviewing the code, I have one
 question: is it possible for core 'c' to be in neither core_map nor
 part_core_map? I'm only asking because that case doesn't seem to be
 covered by your patch. (A special case would be when there is no
 part_core_map for c to be in.)
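 To make the question concrete, here is a minimal sketch of how the loop
 quoted further down in this thread treats each combination of the two
 maps. It models the bitmap tests as plain booleans; the names
 core_effect, classify_core and print_core_cases are made up for this
 illustration and are not Slurm identifiers. A NULL part_core_map is
 modelled as in_part_core_map == false.

```c
#include <stdbool.h>
#include <stdio.h>

/* Which per-socket counters the quoted loop bumps for a single core.
 * Hypothetical helper type, not part of Slurm. */
struct core_effect {
    bool free_cores;      /* free_cores[i]++     */
    bool used_cores;      /* used_cores[i]++     */
    bool used_cpu_array;  /* used_cpu_array[i]++ */
};

/* Mirrors the two independent tests in the quoted loop. */
struct core_effect classify_core(bool in_core_map, bool in_part_core_map)
{
    struct core_effect e = { false, false, false };

    if (in_core_map)
        e.free_cores = true;
    else
        e.used_cores = true;   /* counted as used, whatever the reason */

    if (in_part_core_map)
        e.used_cpu_array = true;

    return e;
}

/* Prints one row per combination of the two map tests. */
void print_core_cases(void)
{
    int a, b;

    for (a = 0; a <= 1; a++) {
        for (b = 0; b <= 1; b++) {
            struct core_effect e = classify_core(a != 0, b != 0);
            printf("core_map=%d part_core_map=%d -> free=%d used=%d part_used=%d\n",
                   a, b, e.free_cores, e.used_cores, e.used_cpu_array);
        }
    }
}
```

 In this model a core that is in neither map still increments
 used_cores[i]; whether it then counts toward used_cpu_count depends on
 whether any other core of the same socket is in part_core_map.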
 
 On 05/04/2016 02:09 PM, Marco Ehlert
   wrote:
   Hi Vova,
   some weeks ago I proposed a bug fix for this problem.
https://groups.google.com/forum/#!searchin/slurm-devel/marco$20ehlert/slurm-devel/CRsW-eiUfms/MI2aAL4UGwAJ
   But this solves only half of the problem if you are using distinct
   partitioning of cores like this:
   NodeName=n[01-02] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 Gres=gpu:4
   
   PartitionName=gpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=4
   
   PartitionName=cpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=16
   and bind concrete cpu cores to GPU cards in gres.conf.
   I could not tell whether this is the configuration you are using,
   but it is in my case. Slurm does not correctly manage the core maps
   for two distinct partitions on the same node. At some point Slurm
   gets confused; I saw definitely wrong core_map/part_core_map
   values.
   That's why I decided not to use "CPUs" arguments in gres.conf but
   to set the core map in a prolog script depending on the partition
   chosen by the job.
   
   Job scripts are reading this core map and are then setting the
   taskset by themselves.
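
   As an illustration only (this is not Marco's actual script), the
   step from a core list to something taskset can consume might look
   like the sketch below. It converts a comma-separated list such as
   "0,1,10,11" into a 64-bit affinity mask, which taskset accepts in
   hexadecimal form. The function name core_list_to_mask is
   hypothetical, ranges like "0-3" are not handled, and core ids above
   63 are rejected.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a comma-separated list of core ids
 * (for example "0,1,10,11") into a bitmask with one bit per core.
 * Returns 0 on success, -1 on malformed input or a core id outside
 * 0..63. Ranges such as "0-3" are deliberately not supported. */
int core_list_to_mask(const char *list, uint64_t *mask)
{
    const char *p = list;
    uint64_t m = 0;

    assert(list != NULL && mask != NULL);

    while (*p != '\0') {
        char *end;
        long core = strtol(p, &end, 10);

        /* No digits consumed, or id does not fit in the mask. */
        if (end == p || core < 0 || core >= 64)
            return -1;

        m |= (uint64_t)1 << core;

        if (*end == ',')
            p = end + 1;      /* move past the separator */
        else if (*end == '\0')
            p = end;          /* end of list */
        else
            return -1;        /* unexpected character */
    }

    *mask = m;
    return 0;
}
```

   For the list "0,1,10,11" this yields the mask 0xC03 (bits 0, 1, 10
   and 11 set).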
   
   Maybe it is not the best workaround, but it works to some
   satisfaction at least.
   
   The result is a mismatch between Slurm's picture of used/unused
   cores and the core map actually in use. But it works anyway,
   because the cores are now counted correctly.
   Best,
   
   Marco
   On Wed, 4 May 2016, Vladimir Goy wrote:
   Dear Developers, 
     when I use the MaxCPUsPerNode option in a partition definition
     with 2 GPUs per node (defined in gres) and
     SelectType=select/cons_res, Slurm allocates only one GPU per node
     to jobs. The other GPU is free, but Slurm cannot allocate it to
     jobs.
     In man slurm.conf:
     "MaxCPUsPerNode
                   Maximum number of CPUs on any node available to all jobs
                   from this partition. This can be especially useful to
                   schedule GPUs. For example a node can be associated with
                   two Slurm partitions (e.g. "cpu" and "gpu") and the
                   partition/queue "cpu" could be limited to only a subset of
                   the node's CPUs, insuring that one or more CPUs would be
                   available to jobs in the "gpu" partition/queue."
     Can you help me? Why does the MaxCPUsPerNode option not work
     correctly with SelectType=select/cons_res?
     Best Regards, Vova.
     2016-04-20 14:20 GMT+10:00 Vladimir Goy
     <[email protected]>:
     
           Dear developers,
     I have found something looking like a bug in Slurm code (file
     src/plugins/select/cons_res/job_test.c, _allocate_sc(...)).
     
     For example, I have 2 nodes:

     GresTypes=gpu
     
     NodeName=n[01-02] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 Gres=gpu:kepler:2 RealMemory=64000 TmpDisk=16384 State=UNKNOWN

     PartitionName=gpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=4 Default=YES MaxTime=INFINITE State=UP
     gres.conf:
     
     Name=gpu Type=kepler File=/dev/nvidia0 CPUs=0,1
     
     Name=gpu Type=kepler File=/dev/nvidia1 CPUs=10,11
     I use
     
     SelectType=select/cons_res
     
     SelectTypeParameters=CR_CPU
     => I have 4 GPUs on the cluster.
     The problem is this: I would like to submit 4 single-process jobs,
     each of which needs 1 GPU. Slurm runs 2 of them and the others
     stay pending, which is a failure.
     Please look carefully at the following code from
     src/plugins/select/cons_res/job_test.c, function
     _allocate_sc(...):
     /* Step 1: create and compute core-count-per-socket
      * arrays and total core counts */
     free_cores = xmalloc(sockets * sizeof(uint16_t));
     used_cores = xmalloc(sockets * sizeof(uint16_t));
     used_cpu_array = xmalloc(sockets * sizeof(uint32_t));

     for (c = core_begin; c < core_end; c++) {         //Cycle 1.
         i = (uint16_t) (c - core_begin) / cores_per_socket;
         if (bit_test(core_map, c)) {
             free_cores[i]++;
             free_core_count++;
         } else {
             used_cores[i]++;    //<-------Here can be error!!! (1 line)
         }
         if (part_core_map && bit_test(part_core_map, c))
             used_cpu_array[i]++;
     }

     for (i = 0; i < sockets; i++) {       //Cycle 2.
         /* if a socket is already in use and entire_sockets_only is
          * enabled, it cannot be used by this job */
         if (entire_sockets_only && used_cores[i]) {
             free_core_count -= free_cores[i];
             used_cores[i] += free_cores[i];
             free_cores[i] = 0;
         }
         free_cpu_count += free_cores[i] * threads_per_core;
         if (used_cpu_array[i])
             used_cpu_count += used_cores[i] * threads_per_core;    //<----Here can be error.  (2 line)
     }
     xfree(used_cores);
     xfree(used_cpu_array);

     /* Ignore resources that would push a job allocation over the
      * partition CPU limit (if any) */
     if ((job_ptr->part_ptr->max_cpus_per_node != INFINITE) &&
         (free_cpu_count + used_cpu_count >
          job_ptr->part_ptr->max_cpus_per_node)) {
         int excess = free_cpu_count + used_cpu_count -
                      job_ptr->part_ptr->max_cpus_per_node;
         for (c = core_begin; c < core_end; c++) {
             i = (uint16_t) (c - core_begin) / cores_per_socket;
             if (free_cores[i] > 0) {
                 free_core_count--;
                 free_cores[i]--;
                 excess -= threads_per_core;
                 if (excess <= 0)
                     break;
             }
         }
     }
     I have marked two lines that I think contain errors. When Gres is
     used, line 1 is wrong, because some of these cores may be in use,
     while others may simply be forbidden by gres. Gres allows only
     cores 0, 1, 10 and 11; all other cores are forbidden. In this
     case, after cycle 2 the variable used_cpu_count is wrong, and at
     the following if statement the node cannot be allocated to the
     job, because (used_cpu_count = 10) + (free_cpu_count = 2) = 12 > 4.
     That is wrong!
     Could this be a bug?
     Best regards, Vova.
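
     The arithmetic above can be reproduced with a small standalone
     model of cycles 1 and 2. This is only a sketch: it replaces
     Slurm's bitmaps with bool arrays, ignores entire_sockets_only, and
     assumes one particular state that yields the reported numbers,
     namely that an earlier job of the partition holds core 0 and that
     gres restricts the new job to cores 10 and 11. The names
     cpu_counts, count_cpus and vova_scenario are invented for this
     illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SOCKETS 8   /* arbitrary bound for this sketch */

struct cpu_counts {
    uint32_t free_cpu_count;
    uint32_t used_cpu_count;
};

/* Standalone model of cycles 1 and 2 from the quoted code.
 * core_map[c]      : core c is usable by the job being tested
 * part_core_map[c] : core c is held by a job of this partition
 *                    (pointer may be NULL, as in the original) */
struct cpu_counts count_cpus(const bool *core_map, const bool *part_core_map,
                             uint32_t n_cores, uint16_t cores_per_socket,
                             uint16_t threads_per_core)
{
    uint16_t free_cores[MAX_SOCKETS] = {0};
    uint16_t used_cores[MAX_SOCKETS] = {0};
    uint32_t used_cpu_array[MAX_SOCKETS] = {0};
    struct cpu_counts out = {0, 0};
    uint32_t sockets, c, i;

    assert(cores_per_socket > 0);
    sockets = (n_cores + cores_per_socket - 1) / cores_per_socket;
    assert(sockets <= MAX_SOCKETS);

    /* Cycle 1: every core missing from core_map counts as used. */
    for (c = 0; c < n_cores; c++) {
        i = c / cores_per_socket;
        if (core_map[c])
            free_cores[i]++;
        else
            used_cores[i]++;
        if (part_core_map && part_core_map[c])
            used_cpu_array[i]++;
    }

    /* Cycle 2: a socket with any partition usage contributes all of
     * its "used" cores, including those merely forbidden by gres. */
    for (i = 0; i < sockets; i++) {
        out.free_cpu_count += free_cores[i] * threads_per_core;
        if (used_cpu_array[i])
            out.used_cpu_count += used_cores[i] * threads_per_core;
    }
    return out;
}

/* Assumed state: 20 cores, 2 sockets of 10, 1 thread per core. */
struct cpu_counts vova_scenario(void)
{
    bool core_map[20] = {false};
    bool part_core_map[20] = {false};

    core_map[10] = core_map[11] = true;  /* cores allowed by gres for GPU 1 */
    part_core_map[0] = true;             /* core held by the earlier job */

    return count_cpus(core_map, part_core_map, 20, 10, 1);
}
```

     Under these assumptions vova_scenario() returns free_cpu_count = 2
     and used_cpu_count = 10, so the sum of 12 exceeds MaxCPUsPerNode=4
     even though only one CPU of the partition is actually in use.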