I think I found the problem.  In certain cases, _allocate_sockets in 
job_test.c does not return the correct number of CPUs.  This can prevent 
allocation of all the available CPUs on a socket, such as when 
ALLOCATE_FULL_SOCKET = 1 is used.  The patch below fixes the problem in 
2.4.0-pre3.

Index: src/plugins/select/cons_res/job_test.c
===================================================================
RCS file: /cvsroot/slurm/slurm/src/plugins/select/cons_res/job_test.c,v
retrieving revision 1.1.1.38.4.1
diff -u -r1.1.1.38.4.1 job_test.c
--- src/plugins/select/cons_res/job_test.c      7 Feb 2012 18:26:23 -0000      1.1.1.38.4.1
+++ src/plugins/select/cons_res/job_test.c      15 Feb 2012 22:31:09 -0000
@@ -294,7 +294,8 @@
        } else {
                j = avail_cpus / cpus_per_task;
                num_tasks = MIN(num_tasks, j);
-               avail_cpus = num_tasks * cpus_per_task;
+               if (job_ptr->details->ntasks_per_node)
+                       avail_cpus = num_tasks * cpus_per_task;
        }
        if ((job_ptr->details->ntasks_per_node &&
             (num_tasks < job_ptr->details->ntasks_per_node)) ||
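
To illustrate the effect of the change, here is a minimal standalone sketch of the arithmetic in the patched branch. It is not SLURM code: the function names cpus_before_patch and cpus_after_patch are hypothetical, plain ints stand in for the fields of job_ptr->details, and only the else branch visible in the hunk is modeled (the condition that selects that branch is not shown in the diff, so it is left out).

```c
#ifndef MIN
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
#endif

/* Hypothetical model of the else branch BEFORE the patch:
 * avail_cpus is always trimmed to num_tasks * cpus_per_task. */
int cpus_before_patch(int avail_cpus, int cpus_per_task, int num_tasks)
{
	int j = avail_cpus / cpus_per_task;

	num_tasks = MIN(num_tasks, j);
	avail_cpus = num_tasks * cpus_per_task;
	return avail_cpus;
}

/* Hypothetical model of the else branch AFTER the patch:
 * avail_cpus is trimmed only when ntasks_per_node is non-zero. */
int cpus_after_patch(int avail_cpus, int cpus_per_task, int num_tasks,
		     int ntasks_per_node)
{
	int j = avail_cpus / cpus_per_task;

	num_tasks = MIN(num_tasks, j);
	if (ntasks_per_node)
		avail_cpus = num_tasks * cpus_per_task;
	return avail_cpus;
}
```

With 8 available CPUs, 1 CPU per task and 4 tasks, the pre-patch version reports 4 CPUs, so the rest of the socket cannot be handed to the job. The patched version reports all 8 when ntasks_per_node is unset, and still trims the count when ntasks_per_node is set.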

> I would think that if alloc_sockets is set and sockets_used is true,
> it should make sure those bits potentially (and previously) cleared are
> now set back to 1. I haven't yet verified my own speculation with a full
> temporary debug of that piece. I could be wrong, of course.

I think it's only clearing the bits if the socket is not used. If the 
socket is used, the bits for all available cores in the socket remain set. 
 I agree the code is hard to follow. 
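
As a rough check of that reading, here is a small standalone sketch of the release loop. It is not the SLURM implementation: clear_range is a hypothetical stand-in for bit_nclear, the core bitmap is a single 64-bit word (so it assumes at most 64 cores), and release_unused_sockets is an illustrative name.

```c
#include <stdint.h>

/* Hypothetical stand-in for bit_nclear(): clear bits first..last inclusive. */
void clear_range(uint64_t *bitmap, int first, int last)
{
	for (int b = first; b <= last; b++)
		*bitmap &= ~(UINT64_C(1) << b);
}

/* Release the cores of every socket that is not marked as used.
 * c is the offset of the node's first core in the bitmap. */
void release_unused_sockets(uint64_t *core_bitmap, const int *sockets_used,
			    int c, int nsockets, int ncores)
{
	for (int s = 0; s < nsockets; s++) {
		if (sockets_used[s])
			continue;	/* used socket: leave its bits alone */
		clear_range(core_bitmap,
			    c + (s * ncores),
			    c + ((s + 1) * ncores) - 1);
	}
}
```

In this model a used socket keeps whatever bits it already had: the loop neither clears them nor sets any back to 1. Whether bits in a used socket were cleared earlier in the real function is a separate question that this sketch does not answer.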

Regards,
Martin


[slurm-dev] Re: select/cons_res ALLOCATE_FULL_SOCKET

From: Michel Bourget
To: slurm-dev
Date: 02/14/2012 06:45 PM

On 02/13/2012 12:33 PM, [email protected] wrote:
>
> Michel,
> I'm testing on 2.4.0-pre3, not 2.3.3.  I ran some additional tests 
> with values for -c greater than 1 and like you I'm getting unexpected 
> results for certain cases. I'm working on some other changes to this 
> module (src/plugins/select/cons_res/dist_tasks.c), so I'll look into 
> this issue also.

(Mailing list note: since the "change", there are all kinds of funny 
icons and formatting in the output, not necessarily ASCII. I use 
Thunderbird. Is that normal?)

Thanks for looking at this !!!

Btw, when re-re-reading that loop, it looks goofy, at least to me, when 
alloc_sockets=1:

    /* release remaining cores of the unused sockets */
    for (s = 0; s < nsockets_nb; s++) {
    ->  if ( sockets_used[s] )
             continue;
         bit_nclear(job_res->core_bitmap,
             c+(s*ncores_nb),
             c+((s+1)*ncores_nb)-1);
    }

I would think that if alloc_sockets is set and sockets_used is true,
it should make sure those bits potentially (and previously) cleared are
now set back to 1. I haven't yet verified my own speculation with a full
temporary debug of that piece. I could be wrong, of course.

> Martin
-- 
-----------------------------------------------------------
Michel Bourget - SGI - Linux Software Engineering
"Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------

