> Here is a patch to fix the issue.

Thanks!  That solved the problem:

login-0-0 689(1)$ bjob -j 1516
JOBID NAME    USER ACCOUNT PARTITI QOS    ST PRIORITY(PRIOR)   TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
1516  1436.36 bhm  staff   lowpri  lowpri PD 0.0000000086( 37) 0:00 5:00      1   1   400     0       (Priority)
login-0-0 690(1)$ scontrol hold 1516
login-0-0 691(1)$ bjob -j 1516
JOBID NAME    USER ACCOUNT PARTITI QOS    ST PRIORITY(PRIOR)   TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
1516  1436.36 bhm  staff   lowpri  lowpri PD 0.0000000000( 0)  0:00 5:00      1   1   400     0       (JobHeldUser)
login-0-0 692(1)$ scontrol release 1516
login-0-0 693(1)$ bjob -j 1516
JOBID NAME    USER ACCOUNT PARTITI QOS    ST PRIORITY(PRIOR)   TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
1516  1436.36 bhm  staff   lowpri  lowpri PD 0.0000000090( 39) 0:00 5:00      1   1   400     0       (Resources)

> Let me know if you see any more problems, but this should fix all
> issues in this area. This patch will be in 2.2.4.

Since you asked: :-)

The fix drew my attention to another small problem: When the _admin_
user releases a job, the Reason is not cleared, and the priority is not
set back to its normal value (it is set to 1).  It does become runnable,
though.

For instance:

login-0-0 696(1)$ bjob -j 1504
JOBID NAME    USER ACCOUNT PARTITI QOS    ST PRIORITY(PRIOR)   TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
1504  1436.30 bhm  staff   lowpri  lowpri PD 0.0000000093( 40) 0:00 5:00      1   1   400     0       (Priority)
teflon 520(2)# scontrol hold 1504
login-0-0 697(1)$ bjob -j 1504
JOBID NAME    USER ACCOUNT PARTITI QOS    ST PRIORITY(PRIOR)   TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
1504  1436.30 bhm  staff   lowpri  lowpri PD 0.0000000000( 0)  0:00 5:00      1   1   400     0       (JobHeldAdmin)
teflon 521(2)# scontrol release 1504
login-0-0 698(1)$ bjob -j 1504
JOBID NAME    USER ACCOUNT PARTITI QOS    ST PRIORITY(PRIOR)   TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
1504  1436.30 bhm  staff   lowpri  lowpri PD 0.0000000002( 1)  0:00 5:00      1   1   400     0       (JobHeldAdmin)
login-0-0 699(1)$

The job is indeed runnable: A little later:

login-0-0 700(1)$ bjob -j 1504
JOBID NAME    USER ACCOUNT PARTITI QOS    ST PRIORITY(PRIOR)   TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
1504  1436.30 bhm  staff   lowpri  lowpri R  0.0000000002( 1)  0:19 4:41      1   1   400     0       compute-0-9

The same happens with scontrol uhold, or when the admin user releases a
job held by the job owner.

I've verified that this also happens with an unpatched 2.2.3, so the
patch did not introduce the problem.

-- 
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
