I think I found the problem. In certain cases, _allocate_sockets in
job_test.c does not return the correct number of cpus. This can prevent
allocation of all the available cpus on a socket, such as when
ALLOCATE_FULL_SOCKET = 1 is used. The patch below fixes the problem in
2.4.0-pre3.
Index: src/plugins/select/cons_res/job_test.c
===================================================================
RCS file: /cvsroot/slurm/slurm/src/plugins/select/cons_res/job_test.c,v
retrieving revision 1.1.1.38.4.1
diff -u -r1.1.1.38.4.1 job_test.c
--- src/plugins/select/cons_res/job_test.c 7 Feb 2012 18:26:23 -0000 1.1.1.38.4.1
+++ src/plugins/select/cons_res/job_test.c 15 Feb 2012 22:31:09 -0000
@@ -294,7 +294,8 @@
     } else {
         j = avail_cpus / cpus_per_task;
         num_tasks = MIN(num_tasks, j);
-        avail_cpus = num_tasks * cpus_per_task;
+        if (job_ptr->details->ntasks_per_node)
+            avail_cpus = num_tasks * cpus_per_task;
     }
     if ((job_ptr->details->ntasks_per_node &&
          (num_tasks < job_ptr->details->ntasks_per_node)) ||
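
To make the effect of the change concrete, here is a minimal standalone
sketch of just the arithmetic in that branch. It is not the real
_allocate_sockets code: the two function names are hypothetical, and the
surrounding logic in job_test.c is left out. It only assumes what the hunk
above shows, namely that avail_cpus was always rounded down to a multiple
of cpus_per_task, and that the patch keeps that rounding only when
ntasks_per_node is set.

```c
#define MIN(a, b) (((a) < (b)) ? (a) : (b))

/* Hypothetical model of the branch BEFORE the patch:
 * avail_cpus is always cut down to num_tasks * cpus_per_task. */
static int avail_cpus_before_patch(int avail_cpus, int num_tasks,
                                   int cpus_per_task)
{
    int j = avail_cpus / cpus_per_task;  /* tasks that fit on the cpus */
    num_tasks = MIN(num_tasks, j);
    return num_tasks * cpus_per_task;
}

/* Hypothetical model of the branch AFTER the patch:
 * the cut is applied only when ntasks_per_node is non-zero. */
static int avail_cpus_after_patch(int avail_cpus, int num_tasks,
                                  int cpus_per_task, int ntasks_per_node)
{
    int j = avail_cpus / cpus_per_task;
    num_tasks = MIN(num_tasks, j);
    if (ntasks_per_node)
        avail_cpus = num_tasks * cpus_per_task;
    return avail_cpus;
}
```

For example, with 8 available cpus, 4 requested tasks and cpus_per_task = 3,
the old arithmetic gives 6 cpus, so 2 cpus of the socket can never be
allocated. With the patch and ntasks_per_node unset, all 8 stay available.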
> I would think that if alloc_sockets is set and sockets_used is true,
> it should make sure that those bits potentially (and previously) cleared
> are now set back to 1. I have not yet verified this speculation by
> fully debugging that piece. I could be wrong, of course.
I think it's only clearing the bits if the socket is not used. If the
socket is used, the bits for all available cores in the socket remain set.
I agree the code is hard to follow.
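
As a rough illustration of that reading, here is a small self-contained
model of the "release remaining cores of the unused sockets" loop. It is
only a sketch under assumptions: the core bitmap is a plain 64-bit mask
instead of Slurm's bitstring type, clear_range is a stand-in for
bit_nclear (clearing an inclusive range of bits), and
release_unused_sockets is a hypothetical name.

```c
#include <stdint.h>

/* Stand-in for bit_nclear(): clear bits first..last (inclusive). */
static uint64_t clear_range(uint64_t map, int first, int last)
{
    for (int i = first; i <= last; i++)
        map &= ~(UINT64_C(1) << i);
    return map;
}

/* Hypothetical model of the loop: c is the bit offset of the node's
 * first core, ncores is the number of cores per socket. Bits are
 * cleared only for sockets whose sockets_used[] entry is zero. */
static uint64_t release_unused_sockets(uint64_t core_map, int c,
                                       int nsockets, int ncores,
                                       const int *sockets_used)
{
    for (int s = 0; s < nsockets; s++) {
        if (sockets_used[s])
            continue;  /* used socket: its bits are left untouched */
        core_map = clear_range(core_map,
                               c + s * ncores,
                               c + (s + 1) * ncores - 1);
    }
    return core_map;
}
```

With 2 sockets of 4 cores, all 8 bits set and only socket 0 used, the
result is 0x0F: socket 1 is cleared and socket 0 keeps whatever bits it
had on entry. The loop never sets bits, so it cannot restore bits that
were cleared earlier.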
Regards,
Martin
[slurm-dev] Re: select/cons_res ALLOCATE_FULL_SOCKET
From: Michel Bourget <[email protected]>
To: "slurm-dev" <[email protected]>
Date: 02/14/2012 06:45 PM
Please respond to "slurm-dev" <[email protected]>
On 02/13/2012 12:33 PM, [email protected] wrote:
>
> Michel,
> I'm testing on 2.4.0-pre3, not 2.3.3. I ran some additional tests
> with values for -c greater than 1 and like you I'm getting unexpected
> results for certain cases. I'm working on some other changes to this
> module (src/plugins/select/cons_res/dist_tasks.c), so I'll look into
> this issue also.
( Mailing list note: since the "change", there are all kinds of funny
icons and formatting in the output, not necessarily ASCII. I use
Thunderbird. Is that normal? )
Thanks for looking at this !!!
Btw, when re-re-reading that loop, it looks goofy, at least to me, when
alloc_sockets=1:
    /* release remaining cores of the unused sockets */
    for (s = 0; s < nsockets_nb; s++) {
->      if (sockets_used[s])
            continue;
        bit_nclear(job_res->core_bitmap,
                   c+(s*ncores_nb),
                   c+((s+1)*ncores_nb)-1);
    }
I would think that if alloc_sockets is set and sockets_used is true,
it should make sure that those bits potentially (and previously) cleared
are now set back to 1. I have not yet verified this speculation by
fully debugging that piece. I could be wrong, of course.
> Martin
--
-----------------------------------------------------------
Michel Bourget - SGI - Linux Software Engineering
"Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------