On Tue, 07 Feb 2012 10:22:49 -0800 Danny Auble <d...@schedmd.com> wrote:
> Yuri, turn on the Priority DebugFlag in the slurm.conf and see what is
> happening. Perhaps that would shed some light on the subject. You can
> do it from sview or alter the slurm.conf file and scontrol reconfig
> without having to restart the slurmctld.

Ok, I had to submit ~1000 jobs to make it happen again:

$ sprio -j 465060
Unable to find jobs matching user/id(s) specified

$ squeue -j 465060
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
 465060     batch   sbatch   ydelia  PD       0:00      1 (Priority)

$ sacct -j 465060
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
465060           sbatch      batch    default          0    PENDING      0:0

The slurmctld.log contains the following:

[2012-02-07T19:39:33] Fairshare priority of job 465060 for user ydelia in acct default is 2**(-0.999268/0.050000) = 0.000001
[2012-02-07T19:39:33] Weighted Age priority is 0.000000 * 1000 = 0.00
[2012-02-07T19:39:33] Weighted Fairshare priority is 0.000001 * 10000 = 0.01
[2012-02-07T19:39:33] Weighted JobSize priority is 0.000000 * 0 = 0.00
[2012-02-07T19:39:33] Weighted Partition priority is 1.000000 * 1000 = 1000.00
[2012-02-07T19:39:33] Weighted QOS priority is 0.000000 * 0 = 0.00
[2012-02-07T19:39:33] Job 465060 priority: 0.00 + 0.01 + 0.00 + 1000.00 + 0.00 - 1000 = 2.00
[2012-02-07T19:39:33] _slurm_rpc_submit_batch_job JobId=465060 usec=84514
[2012-02-07T19:39:33] Normalized usage for account default off root 5747776.753815 / 5747776.753815 = 1.000000
[2012-02-07T19:39:33] Effective usage for account default off root 1.000000 1.000000
[2012-02-07T19:39:33] Decay factor over 300 seconds goes from 0.999998854166667 -> 0.999656308878391
[2012-02-07T19:39:34] job 460729 ran for 300 seconds on 1 cpus
[2012-02-07T19:39:34] grp_used_cpu_run_secs is 0, will subtract 0
[2012-02-07T19:39:34] grp_used_cpu_run_secs is 0, will subtract 0
....

(followed by what looks like a priority decay run).
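As a sanity check on the "Decay factor over 300 seconds" line in the log above, here is a minimal Python sketch that reproduces both numbers. It assumes the per-second factor is computed as 1 - 0.693 / half-life. That formula is inferred from the logged values, not confirmed against the Slurm source, and the function name `decay_over` is made up for illustration.

```python
# Assumption: per-second decay factor = 1 - 0.693 / half_life_seconds.
# This is inferred from the logged numbers, not taken from the Slurm source.
HALF_LIFE_SECONDS = 7 * 24 * 3600   # PriorityDecayHalfLife = 7-00:00:00
CALC_PERIOD_SECONDS = 5 * 60        # PriorityCalcPeriod = 00:05:00


def decay_over(seconds, half_life=HALF_LIFE_SECONDS):
    """Return (per-second factor, factor compounded over `seconds`)."""
    per_second = 1.0 - 0.693 / half_life
    return per_second, per_second ** seconds


per_second, over_period = decay_over(CALC_PERIOD_SECONDS)
# Print in the same "from -> to" shape as the slurmctld log line.
print(f"{per_second:.15f} -> {over_period:.15f}")
```

If the assumption holds, the 300 seconds in the log is simply PriorityCalcPeriod, which would fit the observation that a decay run started right after job 465060 was submitted.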
It seems that 465060 is the first submitted job (in a row of submissions) where priority has not been calculated. It's immediately followed by a decay run.

The jobs before/after this job just contain the following:

[2012-02-07T19:39:34] Fairshare priority of job 465061 for user ydelia in acct default is 2**(-0.999269/0.050000) = 0.000001
[2012-02-07T19:39:34] Weighted Age priority is 0.000000 * 1000 = 0.00
[2012-02-07T19:39:34] Weighted Fairshare priority is 0.000001 * 10000 = 0.01
[2012-02-07T19:39:34] Weighted JobSize priority is 0.000000 * 0 = 0.00
[2012-02-07T19:39:34] Weighted Partition priority is 1.000000 * 1000 = 1000.00
[2012-02-07T19:39:34] Weighted QOS priority is 0.000000 * 0 = 0.00
[2012-02-07T19:39:34] Job 465061 priority: 0.00 + 0.01 + 0.00 + 1000.00 + 0.00 - 1000 = 2.00

(repeated over and over)

My relevant config (if necessary):

DebugFlags               = Priority
PriorityDecayHalfLife    = 7-00:00:00
PriorityCalcPeriod       = 00:05:00
PriorityFavorSmall       = 0
PriorityMaxAge           = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType             = priority/multifactor
PriorityWeightAge        = 1000
PriorityWeightFairShare  = 10000
PriorityWeightJobSize    = 0
PriorityWeightPartition  = 1000
PriorityWeightQOS        = 0
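For reference, the per-factor "Weighted ..." lines in the log can be reproduced from the PriorityWeight* values in the config above. The following is only a sketch in Python: the variable names `usage` and `shares` are my reading of the two numbers in the "2**(-x/y)" log line, and the factor values are copied from the log rather than computed. It stops at the per-factor terms and does not try to reproduce the trailing "- 1000 = 2.00" part of the final log line.

```python
# Numbers copied from the slurmctld.log lines for job 465060.
# "usage" and "shares" are assumed names for the two values in 2**(-x/y).
usage, shares = 0.999268, 0.05
fairshare = 2 ** (-usage / shares)

# PriorityWeight* values from the config.
weights = {"Age": 1000, "Fairshare": 10000, "JobSize": 0,
           "Partition": 1000, "QOS": 0}

# Factor values as logged (Age, JobSize and QOS were 0, Partition was 1).
factors = {"Age": 0.0, "Fairshare": fairshare, "JobSize": 0.0,
           "Partition": 1.0, "QOS": 0.0}

# Each weighted term is simply factor * weight.
weighted = {name: factors[name] * weights[name] for name in weights}

print(f"Fairshare factor: {fairshare:.6f}")
for name, value in weighted.items():
    print(f"Weighted {name} priority is "
          f"{factors[name]:.6f} * {weights[name]} = {value:.2f}")
```

With these weights the Partition term dominates everything else, so all of my jobs end up with essentially the same priority regardless of fairshare.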