The block management code was quite different in 2.3; the bg_job_run 
code was explicitly calling a function to add users to the block up 
front, rather than reusing the bridge_block_sync_users function as it 
does now.

It's not a problem keeping the slurm user there as far as I know, it 
just complicates the logic a bit, and I don't think there's any 
advantage to having the slurm user on that list.

 From what MMCS on /L implies the slurm user still manages the block:

mmcs$ list_blocks
OK
RMP03Jl120831053 I slurm   (0)  connected
RMP03Jl120831060 I slurm   (0)  connected
RMP03Jl120831073 I slurm   (0)  connected
RMP03Jl120831105 I slurm   (0)  connected

even when the only user on the block (as set through SLURM over the 
bridge api) is the job owner.

FWIW, the patch has been running ~16 hours on our 1-rack without issue. 
Is there anyone with /P that can vet it on theirs?

- Tim

On 07/04/2012 12:09 PM, Danny Auble wrote:
>
> Tim, this is a good find.  Was there a problem keeping the slurm user
> though?  I would feel more comfortable leaving that check in there.  Was
> this not an issue with 2.3?
>
> Danny
>
> On 07/03/12 14:33, Tim Wickberg wrote:
>> Attached patch fixes a bug in the select/bluegene plugin for BG/L+P in
>> the 2.4 series.
>>
>> Without this applied, the userid assigned to a block will never be
>> updated in MMCS, preventing the user from launching a job with a
>> message like:
>>
>>> <Jul 03 11:50:56.264854>  BE_MPI (ERROR): Current user is not the
>>> owner of the partition,
>>> <Jul 03 11:50:56.264925>  BE_MPI (ERROR):   and is not in the
>>> partition's user list - Aborting
>>> <Jul 03 11:50:56.406071>  FE_MPI (ERROR): Back-end failed while
>>> preparing partition with return code 31.
>>> <Jul 03 11:50:56.477110>  FE_MPI (ERROR): Failure list:
>>> <Jul 03 11:50:56.477145>  FE_MPI (ERROR):   - 1. A user does not have
>>> permission to run the job on specified partition (failure #31)
>>
>> The patch simplifies the logic a bit: it removes all users that aren't
>> the correct assigned user including the slurm user account. (This
>> doesn't seem to affect operation on our 1-rack BG/L here at least,
>> although I can't guarantee that for BG/P.) And, correcting the bug
>> itself: it makes sure to add the assigned user to the block.
>>
>> Before, if user_count=0 or 1 (which was likely slurm_user, hitting a
>> continue out of the one pass of the loop), the loop would be skipped
>> over and the correct user would never be added in to the block.
>>
>> - Tim

Reply via email to