The block management code was quite different in 2.3; the bg_job_run code was explicitly calling a function to add users to the block up front, rather than reusing the bridge_block_sync_users function as it does now.
It's not a problem keeping the slurm user there as far as I know, it just complicates the logic a bit, and I don't think there's any advantage to having the slurm user on that list. From what MMCS on /L implies the slurm user still manages the block: mmcs$ list_blocks OK RMP03Jl120831053 I slurm (0) connected RMP03Jl120831060 I slurm (0) connected RMP03Jl120831073 I slurm (0) connected RMP03Jl120831105 I slurm (0) connected even when the only user on the block (as set through SLURM over the bridge api) is the job owner. FWIW, the patch has been running ~16 hours on our 1-rack without issue. Is there anyone with /P that can vet it on theirs? - Tim On 07/04/2012 12:09 PM, Danny Auble wrote: > > Tim, this is a good find. Was there a problem keeping the slurm user > though? I would feel more comfortable leaving that check in there. Was > this not an issue with 2.3? > > Danny > > On 07/03/12 14:33, Tim Wickberg wrote: >> Attached patch fixes a bug in the select/bluegene plugin for BG/L+P in >> the 2.4 series. >> >> Without this applied, the userid assigned to a block will never be >> updated in MMCS, preventing the user from launching a job with a >> message like: >> >>> <Jul 03 11:50:56.264854> BE_MPI (ERROR): Current user is not the >>> owner of the partition, >>> <Jul 03 11:50:56.264925> BE_MPI (ERROR): and is not in the >>> partition's user list - Aborting >>> <Jul 03 11:50:56.406071> FE_MPI (ERROR): Back-end failed while >>> preparing partition with return code 31. >>> <Jul 03 11:50:56.477110> FE_MPI (ERROR): Failure list: >>> <Jul 03 11:50:56.477145> FE_MPI (ERROR): - 1. A user does not have >>> permission to run the job on specified partition (failure #31) >> >> The patch simplifies the logic a bit: it removes all users that aren't >> the correct assigned user including the slurm user account. (This >> doesn't seem to affect operation on our 1-rack BG/L here at least, >> although I can't guarantee that for BG/P.) And, correcting the bug >> itself: it makes sure to add the assigned user to the block. >> >> Before, if user_count=0 or 1 (which was likely slurm_user, hitting a >> continue out of the one pass of the loop), the loop would be skipped >> over and the correct user would never be added in to the block. >> >> - Tim
