> Here is a patch to fix the issue.

Thanks!  That solved the problem:

login-0-0 689(1)$ bjob -j 1516
JOBID NAME       USER   ACCOUNT PARTITI QOS    ST     PRIORITY(PRIOR)  TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
 1516 1436.36    bhm    staff   lowpri  lowpri PD 0.0000000086(   37)  0:00      5:00   1   1     400       0 (Priority)
login-0-0 690(1)$ scontrol hold 1516
login-0-0 691(1)$ bjob -j 1516
JOBID NAME       USER   ACCOUNT PARTITI QOS    ST     PRIORITY(PRIOR)  TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
 1516 1436.36    bhm    staff   lowpri  lowpri PD 0.0000000000(    0)  0:00      5:00   1   1     400       0 (JobHeldUser)
login-0-0 692(1)$ scontrol release 1516
login-0-0 693(1)$ bjob -j 1516
JOBID NAME       USER   ACCOUNT PARTITI QOS    ST     PRIORITY(PRIOR)  TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
 1516 1436.36    bhm    staff   lowpri  lowpri PD 0.0000000090(   39)  0:00      5:00   1   1     400       0 (Resources)

> Let me know if you see any more problems, but this should fix all
> issues in this area.  This patch will be in 2.2.4.

Since you asked: :-) The fix drew my attention to another small problem:
When the _admin_ user releases a job, the Reason is not cleared, and the
priority is not set back to its normal value (it is set to 1).  It does
become runnable, though.

For instance:

login-0-0 696(1)$ bjob -j 1504
JOBID NAME       USER   ACCOUNT PARTITI QOS    ST     PRIORITY(PRIOR)  TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
 1504 1436.30    bhm    staff   lowpri  lowpri PD 0.0000000093(   40)  0:00      5:00   1   1     400       0 (Priority)

teflon 520(2)# scontrol hold 1504

login-0-0 697(1)$ bjob -j 1504
JOBID NAME       USER   ACCOUNT PARTITI QOS    ST     PRIORITY(PRIOR)  TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
 1504 1436.30    bhm    staff   lowpri  lowpri PD 0.0000000000(    0)  0:00      5:00   1   1     400       0 (JobHeldAdmin)

teflon 521(2)# scontrol release 1504

login-0-0 698(1)$ bjob -j 1504
JOBID NAME       USER   ACCOUNT PARTITI QOS    ST     PRIORITY(PRIOR)  TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
 1504 1436.30    bhm    staff   lowpri  lowpri PD 0.0000000002(    1)  0:00      5:00   1   1     400       0 (JobHeldAdmin)
login-0-0 699(1)$ 

The job is indeed runnable.  A little later, it has started:

login-0-0 700(1)$ bjob -j 1504
JOBID NAME       USER   ACCOUNT PARTITI QOS    ST     PRIORITY(PRIOR)  TIME TIME_LEFT CPU NOD MIN_MEM MIN_TMP NODELIST(REASON)
 1504 1436.30    bhm    staff   lowpri  lowpri  R 0.0000000002(    1)  0:19      4:41   1   1     400       0 compute-0-9


The same happens with scontrol uhold, or when the admin user releases a
job held by the job owner.

I've verified that this also happens with an unpatched 2.2.3, so the
patch did not introduce the problem.
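
The hold/release sequence above can probably be reproduced with stock
commands as well.  This is only a minimal sketch, under two assumptions:
plain squeue (with its %i, %t, %p and %r format fields) stands in for the
bjob command used above, and the job id is passed as the first argument.
The function names show_job and hold_release are made up for illustration.

```shell
#!/bin/sh
# Sketch: hold and release one pending job, printing its state,
# priority and reason before and after each step.

show_job() {
    # -h suppresses the header line.
    # %i = job id, %t = compact state, %p = priority, %r = reason
    squeue -h -j "$1" -o "%i %t %p %r"
}

hold_release() {
    jobid=$1
    echo "before:  $(show_job "$jobid")"
    scontrol hold "$jobid" || return 1
    echo "held:    $(show_job "$jobid")"
    scontrol release "$jobid" || return 1
    echo "release: $(show_job "$jobid")"
}

# Run only when a job id is given, e.g.: sh repro.sh 1504
if [ $# -ge 1 ]; then
    hold_release "$1"
fi
```

Run as the job owner, the "release" line should show the normal priority
and reason again.  Run as the admin user, it should show the stale
JobHeldAdmin reason and the low priority described above.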


-- 
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
