Hello,

We upgraded to 15.08.1 yesterday. I see these messages in /var/log/slurm/slurmctld.log frequently now:
[2015-10-01T08:09:35.461] error: _handle_qos_tres_run_secs: job 2556644: QOS override TRES cpu grp_used_tres_run_secs underflow, tried to remove 19200 seconds when only 0 remained.
[2015-10-01T08:09:35.461] error: _handle_qos_tres_run_secs: job 2556644: QOS override TRES mem grp_used_tres_run_secs underflow, tried to remove 76800000 seconds when only 0 remained.
[2015-10-01T08:09:35.461] error: _handle_qos_tres_run_secs: job 2690887: QOS sahl TRES cpu grp_used_tres_run_secs underflow, tried to remove 1200 seconds when only 0 remained.
[2015-10-01T08:09:35.461] error: _handle_qos_tres_run_secs: job 2691478: QOS trilling TRES cpu grp_used_tres_run_secs underflow, tried to remove 1200 seconds when only 0 remained.
[2015-10-01T08:09:35.461] error: _handle_qos_tres_run_secs: job 2691478: QOS trilling TRES mem grp_used_tres_run_secs underflow, tried to remove 48000000 seconds when only 0 remained.
[2015-10-01T08:09:35.461] error: _handle_qos_tres_run_secs: job 2691478: QOS trilling TRES node grp_used_tres_run_secs underflow, tried to remove 300 seconds when only 0 remained.

To me it sounds harmless, like there is some race condition in the tracking of the TRES cpu/mem seconds in use. I thought I'd mention it anyway, as no one likes errors in their logs! :)

We had been using GrpCPURunMins to limit resource use by account; that seems to be handled by GrpTRESRunMin now (example in the P.S. below).

Let me know if you need more info!

Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
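
P.S. In case it helps to reproduce, here is roughly how we apply the limit to an account. The account name and value below are just placeholders, not our real settings, and I'm going from memory on the exact syntax:

    # old style (pre-15.08)
    sacctmgr modify account physics set GrpCPURunMins=100000

    # what we understand to be the TRES-style equivalent in 15.08
    sacctmgr modify account physics set GrpTRESRunMin=cpu=100000

The underflow errors show up for jobs running under QOSes belonging to accounts limited this way.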
