Dear Marco,

Thank you very much for your response. I was only recently added to the mailing list and did not see your last letter. It is good that this problem is known. In the end, I configured our cluster as follows, without using MaxCPUsPerNode:

GresTypes=gpu
NodeName=n[01-10] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 Gres=gpu:K40:2,gpu:cpu:16 RealMemory=64000 TmpDisk=16384 State=UNKNOWN
PartitionName=long Nodes=n[01-10] Shared=NO Default=YES DefMemPerCPU=1024 MaxMemPerCPU=3072 DefaultTime=24:0:0 MaxTime=INFINITE State=UP

and gres.conf:

Name=gpu Type=K40 File=/dev/nvidia0 CPUs=0
Name=gpu Type=K40 File=/dev/nvidia1 CPUs=10
Name=gpu Type=cpu CPUs=2-9,12-19 Count=16

With this configuration everything works well!

Best Regards,
Vova

2016-05-04 21:09 GMT+10:00 Marco Ehlert <[email protected]>:
>
> Hi Vova,
>
> some weeks ago I proposed a bug fix for this problem.
>
> https://groups.google.com/forum/#!searchin/slurm-devel/marco$20ehlert/slurm-devel/CRsW-eiUfms/MI2aAL4UGwAJ
>
> But this solves only half of the problem if you are using distinct
> partitioning of cores like this:
>
> NodeName=n[01-02] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 Gres=gpu:4
> PartitionName=gpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=4
> PartitionName=cpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=16
>
> and bind concrete cpu cores to GPU cards in gres.conf.
>
> I didn't understand whether this is the configuration you are using, but for
> me it is the case. Slurm does not manage the core maps for the two distinct
> partitions on the same node correctly. At some point Slurm gets confused; I
> saw definitely wrong core_map/part_coremap variables.
>
> That's why I decided not to use "CPUs" arguments in gres.conf but to set the
> core map in a prolog script depending on the partition chosen by the job.
> Job scripts read this core map and then set the taskset by themselves.
> Maybe it is not the best workaround, but it is working to some
> satisfaction at least.
> It ends up in a difference between the picture of used/non-used cores
> that Slurm has and the actual core map in use. But it works anyway, because
> the cores are correctly counted now.
>
> Best,
> Marco
>
> On Wed, 4 May 2016, Vladimir Goy wrote:
>
>> Dear Developers,
>> when I use the MaxCPUsPerNode option in a partition definition with 2 GPUs per
>> node (defined in gres) and SelectType=select/cons_res, Slurm allocates only
>> one GPU per node to jobs. The other GPU is free, but Slurm cannot allocate it
>> to jobs.
>>
>> In man slurm.conf:
>>
>> "MaxCPUsPerNode
>>     Maximum number of CPUs on any node available to all jobs from this
>>     partition. This can be especially useful to schedule GPUs. For example a
>>     node can be associated with two Slurm partitions (e.g. "cpu" and "gpu")
>>     and the partition/queue "cpu" could be limited to only a subset of the
>>     node's CPUs, insuring that one or more CPUs would be available to jobs
>>     in the "gpu" partition/queue."
>>
>> Can you help me? Why does the MaxCPUsPerNode option not work correctly with
>> SelectType=select/cons_res?
>>
>> Best Regards, Vova.
>>
>> 2016-04-20 14:20 GMT+10:00 Vladimir Goy <[email protected]>:
>> Dear developers,
>>
>> I have found something that looks like a bug in the Slurm code (file
>> src/plugins/select/cons_res/job_test.c, _allocate_sc(...)).
>> For example, I have 2 nodes:
>>
>> GresTypes=gpu
>> NodeName=n[01-02] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 Gres=gpu:kepler:2 RealMemory=64000 TmpDisk=16384 State=UNKNOWN
>> PartitionName=gpu Nodes=n[01-02] Shared=NO MaxCPUsPerNode=4 Default=YES MaxTime=INFINITE State=UP
>>
>> gres.conf:
>> Name=gpu Type=kepler File=/dev/nvidia0 CPUs=0,1
>> Name=gpu Type=kepler File=/dev/nvidia1 CPUs=10,11
>>
>> I use
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU
>>
>> => I have 4 GPUs on the cluster.
>>
>> The problem is the following: I would like to submit 4 one-process jobs, each
>> of which needs 1 GPU. Slurm runs 2 of them and the others stay pending. This
>> is the failure.
>>
>> Please look more carefully at the following code from file
>> src/plugins/select/cons_res/job_test.c, function _allocate_sc(...):
>>
>>     /* Step 1: create and compute core-count-per-socket
>>      * arrays and total core counts */
>>     free_cores = xmalloc(sockets * sizeof(uint16_t));
>>     used_cores = xmalloc(sockets * sizeof(uint16_t));
>>     used_cpu_array = xmalloc(sockets * sizeof(uint32_t));
>>
>>     for (c = core_begin; c < core_end; c++) { //Cycle 1.
>>         i = (uint16_t) (c - core_begin) / cores_per_socket;
>>         if (bit_test(core_map, c)) {
>>             free_cores[i]++;
>>             free_core_count++;
>>         } else {
>>             used_cores[i]++; //<-------Here can be error!!! (1 line)
>>         }
>>         if (part_core_map && bit_test(part_core_map, c))
>>             used_cpu_array[i]++;
>>     }
>>
>>     for (i = 0; i < sockets; i++) { //Cycle 2.
>>         /* if a socket is already in use and entire_sockets_only is
>>          * enabled, it cannot be used by this job */
>>         if (entire_sockets_only && used_cores[i]) {
>>             free_core_count -= free_cores[i];
>>             used_cores[i] += free_cores[i];
>>             free_cores[i] = 0;
>>         }
>>         free_cpu_count += free_cores[i] * threads_per_core;
>>         if (used_cpu_array[i])
>>             used_cpu_count += used_cores[i] * threads_per_core; //<----Here can be error. (2 line)
>>     }
>>     xfree(used_cores);
>>     xfree(used_cpu_array);
>>
>>     /* Ignore resources that would push a job allocation over the
>>      * partition CPU limit (if any) */
>>     if ((job_ptr->part_ptr->max_cpus_per_node != INFINITE) &&
>>         (free_cpu_count + used_cpu_count >
>>          job_ptr->part_ptr->max_cpus_per_node)) {
>>         int excess = free_cpu_count + used_cpu_count -
>>                      job_ptr->part_ptr->max_cpus_per_node;
>>         for (c = core_begin; c < core_end; c++) {
>>             i = (uint16_t) (c - core_begin) / cores_per_socket;
>>             if (free_cores[i] > 0) {
>>                 free_core_count--;
>>                 free_cores[i]--;
>>                 excess -= threads_per_core;
>>                 if (excess <= 0)
>>                     break;
>>             }
>>         }
>>     }
>>
>> I have marked two lines that I think contain errors.
>> When I use Gres, line 1 is wrong, because some of these cores may be in use
>> or may be forbidden by gres. Gres allows only cores 0,1,10,11 to be used; the
>> other cores are forbidden. In this case, after cycle 2 the variable
>> used_cpu_count is wrong, and in the following if statement I cannot allocate
>> this node for the job, because
>> ((used_cpu_count is equal to 10) + (free_cpu_count is equal to 2) = 12) > 4
>> => it is wrong!!!
>>
>> Could this be a bug?
>>
>> Best regards, Vova.
