Tim, I am going to add this with only a couple of mods. I moved the returnc = REMOVE_USER_FOUND; to be inside the if (!found && user_name) clause. This probably doesn't mean anything though since REMOVE_USER_FOUND isn't checked anywhere anymore ;), just did it just to get the same effect as before.
I also added a note about the removal of the check for the slurm user and commented out the code instead of removing it all together just in case on a P it causes issues. I doubt it will if it didn't on L. The subsystem is very similar. Thanks for finding this. It will be in the next 2.4 release. Final patch is at https://github.com/SchedMD/slurm/commit/865bec2aed477182d01123db3853eda7d7f0c2f8 Danny On 07/04/12 11:04, Tim Wickberg wrote: > The block management code was quite different in 2.3; the bg_job_run > code was explicitly calling a function to add users to the block up > front, rather than reusing the bridge_block_sync_users function as it > does now. > > It's not a problem keeping the slurm user there as far as I know, it > just complicates the logic a bit, and I don't think there's any > advantage to having the slurm user on that list. > > From what MMCS on /L implies the slurm user still manages the block: > > mmcs$ list_blocks > OK > RMP03Jl120831053 I slurm (0) connected > RMP03Jl120831060 I slurm (0) connected > RMP03Jl120831073 I slurm (0) connected > RMP03Jl120831105 I slurm (0) connected > > even when the only user on the block (as set through SLURM over the > bridge api) is the job owner. > > FWIW, the patch has been running ~16 hours on our 1-rack without issue. > Is there anyone with /P that can vet it on theirs? > > - Tim > > On 07/04/2012 12:09 PM, Danny Auble wrote: >> Tim, this is a good find. Was there a problem keeping the slurm user >> though? I would feel more comfortable leaving that check in there. Was >> this not an issue with 2.3? >> >> Danny >> >> On 07/03/12 14:33, Tim Wickberg wrote: >>> Attached patch fixes a bug in the select/bluegene plugin for BG/L+P in >>> the 2.4 series. >>> >>> Without this applied, the userid assigned to a block will never be >>> updated in MMCS, preventing the user from launching a job with a >>> message like: >>> >>>> <Jul 03 11:50:56.264854> BE_MPI (ERROR): Current user is not the >>>> owner of the partition, >>>> <Jul 03 11:50:56.264925> BE_MPI (ERROR): and is not in the >>>> partition's user list - Aborting >>>> <Jul 03 11:50:56.406071> FE_MPI (ERROR): Back-end failed while >>>> preparing partition with return code 31. >>>> <Jul 03 11:50:56.477110> FE_MPI (ERROR): Failure list: >>>> <Jul 03 11:50:56.477145> FE_MPI (ERROR): - 1. A user does not have >>>> permission to run the job on specified partition (failure #31) >>> The patch simplifies the logic a bit: it removes all users that aren't >>> the correct assigned user including the slurm user account. (This >>> doesn't seem to affect operation on our 1-rack BG/L here at least, >>> although I can't guarantee that for BG/P.) And, correcting the bug >>> itself: it makes sure to add the assigned user to the block. >>> >>> Before, if user_count=0 or 1 (which was likely slurm_user, hitting a >>> continue out of the one pass of the loop), the loop would be skipped >>> over and the correct user would never be added in to the block. >>> >>> - Tim
