Tim, I am going to add this with only a couple of mods.

I moved the returnc = REMOVE_USER_FOUND; to be inside the if (!found && 
user_name) clause.  This probably doesn't mean anything though since 
REMOVE_USER_FOUND isn't checked anywhere anymore ;), just did it just to 
get the same effect as before.

I also added a note about the removal of the check for the slurm user 
and commented out the code instead of removing it all together just in 
case on a P it causes issues.  I doubt it will if it didn't on L.  The 
subsystem is very similar.

Thanks for finding this.  It will be in the next 2.4 release.  Final 
patch is at 
https://github.com/SchedMD/slurm/commit/865bec2aed477182d01123db3853eda7d7f0c2f8

Danny

On 07/04/12 11:04, Tim Wickberg wrote:
> The block management code was quite different in 2.3; the bg_job_run
> code was explicitly calling a function to add users to the block up
> front, rather than reusing the bridge_block_sync_users function as it
> does now.
>
> It's not a problem keeping the slurm user there as far as I know, it
> just complicates the logic a bit, and I don't think there's any
> advantage to having the slurm user on that list.
>
>   From what MMCS on /L implies the slurm user still manages the block:
>
> mmcs$ list_blocks
> OK
> RMP03Jl120831053 I slurm   (0)        connected
> RMP03Jl120831060 I slurm   (0)        connected
> RMP03Jl120831073 I slurm   (0)        connected
> RMP03Jl120831105 I slurm   (0)        connected
>
> even when the only user on the block (as set through SLURM over the
> bridge api) is the job owner.
>
> FWIW, the patch has been running ~16 hours on our 1-rack without issue.
> Is there anyone with /P that can vet it on theirs?
>
> - Tim
>
> On 07/04/2012 12:09 PM, Danny Auble wrote:
>> Tim, this is a good find.  Was there a problem keeping the slurm user
>> though?  I would feel more comfortable leaving that check in there.  Was
>> this not an issue with 2.3?
>>
>> Danny
>>
>> On 07/03/12 14:33, Tim Wickberg wrote:
>>> Attached patch fixes a bug in the select/bluegene plugin for BG/L+P in
>>> the 2.4 series.
>>>
>>> Without this applied, the userid assigned to a block will never be
>>> updated in MMCS, preventing the user from launching a job with a
>>> message like:
>>>
>>>> <Jul 03 11:50:56.264854>  BE_MPI (ERROR): Current user is not the
>>>> owner of the partition,
>>>> <Jul 03 11:50:56.264925>  BE_MPI (ERROR):   and is not in the
>>>> partition's user list - Aborting
>>>> <Jul 03 11:50:56.406071>  FE_MPI (ERROR): Back-end failed while
>>>> preparing partition with return code 31.
>>>> <Jul 03 11:50:56.477110>  FE_MPI (ERROR): Failure list:
>>>> <Jul 03 11:50:56.477145>  FE_MPI (ERROR):   - 1. A user does not have
>>>> permission to run the job on specified partition (failure #31)
>>> The patch simplifies the logic a bit: it removes all users that aren't
>>> the correct assigned user including the slurm user account. (This
>>> doesn't seem to affect operation on our 1-rack BG/L here at least,
>>> although I can't guarantee that for BG/P.) And, correcting the bug
>>> itself: it makes sure to add the assigned user to the block.
>>>
>>> Before, if user_count=0 or 1 (which was likely slurm_user, hitting a
>>> continue out of the one pass of the loop), the loop would be skipped
>>> over and the correct user would never be added in to the block.
>>>
>>> - Tim

Reply via email to