Hi Lennart,

Don't worry about the Accounting question; I didn't know what you have now described:


> This has worked for a long time and usually still does.
> But sometimes it goes seriously wrong, with a new job starting
> at an age value of 20160 instead.


Since it usually works, the only thing I can think of right now is this part
of the multifactor plugin's code (for 2.4.1):

if (weight_age) {
    uint32_t diff;
    if (flags & PRIORITY_FLAGS_ACCRUE_ALWAYS)
        diff = start_time - job_ptr->details->submit_time;
    else
        diff = start_time - job_ptr->details->begin_time;
    if (job_ptr->details->begin_time) {
        if (diff < max_age) {
            job_ptr->prio_factors->priority_age =
                (double)diff / (double)max_age;
        } else
            job_ptr->prio_factors->priority_age = 1.0;
    } else if (flags & PRIORITY_FLAGS_ACCRUE_ALWAYS) {
        if (diff < max_age) {
            job_ptr->prio_factors->priority_age =
                (double)diff / (double)max_age;
        } else
            job_ptr->prio_factors->priority_age = 1.0;
    }
}

(You can find it here:
https://github.com/chaos/slurm/blob/master/src/plugins/priority/multifactor/priority_multifactor.c#L458)
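
To make the normal path concrete, here is a minimal standalone sketch of the
arithmetic. It is not Slurm code: the helper names age_factor and
weighted_age are mine, and I am assuming max_age is PriorityMaxAge expressed
in seconds (14 days = 1209600). With PriorityWeightAge=20160 that works out
to roughly one point per minute of waiting, which is what you intended:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a Slurm function): age factor for a job that
 * has been waiting waited_secs seconds, capped at 1.0 once the wait
 * reaches max_age_secs. */
double age_factor(uint32_t waited_secs, uint32_t max_age_secs)
{
    assert(max_age_secs > 0);   /* avoid dividing by zero */
    if (waited_secs < max_age_secs)
        return (double)waited_secs / (double)max_age_secs;
    return 1.0;
}

/* Weighted age priority in the form the debug log prints it:
 * factor * PriorityWeightAge. */
double weighted_age(uint32_t waited_secs, uint32_t max_age_secs,
                    uint32_t weight_age)
{
    return age_factor(waited_secs, max_age_secs) * (double)weight_age;
}
```

For example, weighted_age(5710, 14 * 24 * 3600, 20160) gives about 95.17,
in line with the "0.004721 * 20160 = 95.17" entry in your log for job
2178648. The 5710 seconds is my own back-calculation from that factor, not a
value taken from the log.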

You may be hitting one of those
"job_ptr->prio_factors->priority_age = 1.0;" branches for some reason.
Maybe, as you suggest, there is a problem with one of those times
("begin_time", perhaps). I would check the times reported in the job
details for an affected job.
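
One way to land on 1.0 right after submission, and this is only a guess on
my part: diff is an unsigned 32-bit value, so if begin_time were ever later
than start_time, the subtraction would wrap around to a huge number and the
"diff < max_age" test would fail. A minimal sketch of that effect, assuming
time_t is an integer type and ignoring PRIORITY_FLAGS_ACCRUE_ALWAYS (the
function name is hypothetical, and returning 0.0 when begin_time is unset is
my simplification; the real code leaves the value untouched in that case):

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical standalone version of the begin_time branch quoted above.
 * Not a Slurm function. */
double age_factor_from_times(time_t start_time, time_t begin_time,
                             uint32_t max_age)
{
    uint32_t diff;

    assert(max_age > 0);        /* avoid dividing by zero */

    /* Same unsigned subtraction as the quoted code: when begin_time is
     * later than start_time, the result wraps modulo 2^32. */
    diff = (uint32_t)(start_time - begin_time);

    if (!begin_time)
        return 0.0;             /* simplification, see the note above */

    if (diff < max_age)
        return (double)diff / (double)max_age;
    return 1.0;                 /* also reached after a wraparound */
}
```

With begin_time 60 seconds after start_time, diff becomes 4294967236 and
the function returns 1.0, i.e. the full 20160 points. Comparing SubmitTime
and EligibleTime in the output of "scontrol show job <jobid>" for an
affected job should show whether something like this is going on.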

Regards,

Miguel



On Wed, Sep 5, 2012 at 12:07 PM, Lennart Karlsson <[email protected]> wrote:

>
> On 09/04/2012 04:22 PM, Miguel Méndez wrote:
> > Hi Lennart,
> >
> > I have some questions for you so I can help you:
> >
> > Have you tried to set DebugFlags=Priority in slurm.conf to get some more
> > info about priorities on slurmctld.log?
> >
> > Are your priorities being recalculated every "PriorityCalcPeriod" (in
> > slurm.conf as well, default is 5 min)? If not, do you have Accounting
> > enabled?
>
> Hi Miguel,
>
> And thanks for trying to help me!
>
> Yes, I have configured
>
>    PriorityCalcPeriod=5
>
> in the slurm.conf file.
>
> I do not understand your question about whether I have Accounting enabled.
> I have no such configuration variable in my slurm.conf file. I run
>
>
> I have now tried your suggestion to set DebugFlags=Priority,
> so now I can rewrite my question in a new way.
>
> In slurm.conf, I have configured
> PriorityMaxAge=14-0
> PriorityWeightAge=20160
>
> The plan behind this configuration is to start with an age
> value of zero and get approximately one priority point added
> for each minute that the job has been waiting, up to a
> maximum of 20160.
>
> This has worked for a long time and usually still does.
> But sometimes it goes seriously wrong, with a new job starting
> at an age value of 20160 instead.
>
> This can be seen with the sprio command and also with Priority
> debugging on:
>
> [2012-09-05T10:43:37] Weighted Age priority is 1.000000 * 20160 = 20160.00
> [2012-09-05T10:43:37] Weighted Fairshare priority is 10.000000 * 10000 = 100000.00
> [2012-09-05T10:43:37] Weighted JobSize priority is 0.001616 * 104 = 0.17
> [2012-09-05T10:43:37] Weighted Partition priority is 0.000000 * 0 = 0.00
> [2012-09-05T10:43:37] Weighted QOS priority is 0.000000 * 400000 = 0.00
> [2012-09-05T10:43:37] Job 2182878 priority: 20160.00 + 100000.00 + 0.17 + 0.00 + 0.00 - 0 = 120160.17
>
> The job was submitted 2012-09-05T10:42:22, so it should have a weighted
> age priority of zero or one, but it got for some unknown reason the
> maximum value instead.
>
> Here is a job that behaves normally, as expected:
> [2012-09-05T10:44:17] Weighted Age priority is 0.000000 * 20160 = 0.00
> [2012-09-05T10:44:17] Weighted Fairshare priority is 6.000000 * 10000 = 60000.00
> [2012-09-05T10:44:17] Weighted JobSize priority is 0.002874 * 104 = 0.30
> [2012-09-05T10:44:17] Weighted Partition priority is 0.000000 * 0 = 0.00
> [2012-09-05T10:44:17] Weighted QOS priority is 0.000000 * 400000 = 0.00
> [2012-09-05T10:44:17] Job 2182879 priority: 0.00 + 60000.00 + 0.30 + 0.00 + 0.00 - 0 = 60000.30
>
> This job was submitted 2012-09-05T10:44:17, so the weighted age
> priority is zero, as expected.
>
> Here is an example for a job that has waited for some time:
> [2012-09-05T00:07:31] Weighted Age priority is 0.004721 * 20160 = 95.17
> [2012-09-05T00:07:31] Weighted Fairshare priority is 10.000000 * 10000 = 100000.00
> [2012-09-05T00:07:31] Weighted JobSize priority is 0.002874 * 104 = 0.30
> [2012-09-05T00:07:31] Weighted Partition priority is 0.000000 * 0 = 0.00
> [2012-09-05T00:07:31] Weighted QOS priority is 0.300000 * 400000 = 120000.00
> [2012-09-05T00:07:31] Job 2178648 priority: 95.17 + 100000.00 + 0.30 + 0.00 + 120000.00 - 0 = 220095.47
>
> Submit time was 2012-09-04T22:32:08, so the Weighted Age
> priority works as intended in this case.
>
> This is version 2.4.1 of SLURM. (If someone thinks that the Fairshare
> priorities are strange, do not worry. They are intended to be this
> way, but that is another story.)
>
> Full slurm.conf configuration is at the bottom of this e-mail,
> with line numbers added.
>
> Cheers,
> -- Lennart Karlsson
>      UPPMAX, Uppsala University, Sweden
>      http://www.uppmax.uu.se
>
> ==============================================
>       1  ControlMachine=kalkyl2
>       2  AuthType=auth/munge
>       3  CacheGroups=0
>       4  CryptoType=crypto/munge
>       5  EnforcePartLimits=YES
>       6  Epilog=/etc/slurm/slurm.epilog
>       7  JobCredentialPrivateKey=/etc/slurm/slurm.key
>       8  JobCredentialPublicCertificate=/etc/slurm/slurm.cert
>       9  JobRequeue=0
>      10  MaxJobCount=1000000
>      11  MpiDefault=none
>      12  Proctracktype=proctrack/cgroup
>      13  Prolog=/etc/slurm/slurm.prolog
>      14  PropagateResourceLimits=RSS
>      15  ReturnToService=0
>      16  SallocDefaultCommand="/usr/bin/srun -n1 -N1 --pty --preserve-env --mpi=none -Q $SHELL"
>      17  SchedulerParameters=default_queue_depth=5000,bf_window=10080,max_job_bf=5000,bf_interval=120
>      18  SlurmctldPidFile=/var/run/slurmctld.pid
>      19  SlurmctldPort=6817
>      20  SlurmdPidFile=/var/run/slurmd.pid
>      21  SlurmdPort=6818
>      22  SlurmdSpoolDir=/var/spool/slurmd
>      23  SlurmUser=slurm
>      24  StateSaveLocation=/usr/local/slurm-state
>      25  SwitchType=switch/none
>      26  TaskPlugin=task/cgroup
>      27  TaskProlog=/etc/slurm/slurm.taskprolog
>      28  TopologyPlugin=topology/tree
>      29  TmpFs=/scratch
>      30  TrackWCKey=yes
>      31  TreeWidth=20
>      32  UsePAM=1
>      33  HealthCheckInterval=1800
>      34  HealthCheckProgram=/etc/slurm/slurm.healthcheck
>      35  InactiveLimit=0
>      36  KillWait=600
>      37  MessageTimeout=60
>      38  ResvOverRun=UNLIMITED
>      39  MinJobAge=43200
>      40  SlurmctldTimeout=300
>      41  SlurmdTimeout=1200
>      42  Waittime=0
>      43  FastSchedule=1
>      44  MaxMemPerCPU=3072
>      45  SchedulerType=sched/backfill
>      46  SchedulerPort=7321
>      47  SelectType=select/cons_res
>      48  SelectTypeParameters=CR_Core_Memory
>      49  PriorityType=priority/multifactor
>      50  PriorityDecayHalfLife=0
>      51  PriorityCalcPeriod=5
>      52  PriorityUsageResetPeriod=MONTHLY
>      53  PriorityFavorSmall=NO
>      54  PriorityMaxAge=14-0
>      55  PriorityWeightAge=20160
>      56  PriorityWeightFairshare=10000
>      57  PriorityWeightJobSize=104
>      58  PriorityWeightPartition=0
>      59  PriorityWeightQOS=400000
>      60  AccountingStorageEnforce=associations,limits,qos
>      61  AccountingStorageHost=kalkyl2
>      62  AccountingStoragePort=7031
>      63  AccountingStorageType=accounting_storage/slurmdbd
>      64  ClusterName=kalkyl
>      65  DebugFlags=NO_CONF_HASH,Priority
>      66  JobCompLoc=/etc/slurm/slurm_jobcomp_logger
>      67  JobCompType=jobcomp/script
>      68  JobAcctGatherFrequency=30
>      69  JobAcctGatherType=jobacct_gather/linux
>      70  SlurmctldDebug=3
>      71  SlurmctldLogFile=/var/log/slurm/slurmctld.log
>      72  SlurmdDebug=3
>      73  SlurmdLogFile=/var/log/slurm/slurmd.log
>      74  NodeName=DEFAULT Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN TmpDisk=100000
>      75
>      76  NodeName=q[1-16]    RealMemory=72000 Feature=fat,mem72GB,ibsw1 Weight=3
>      77  NodeName=q[17-32]   RealMemory=48000 Feature=fat,mem48GB,ibsw1 Weight=2
>      78  NodeName=q[33-64]   RealMemory=24000 Feature=thin,mem24GB,ibsw2  Weight=1
>      79  NodeName=q[65-96]   RealMemory=24000 Feature=thin,mem24GB,ibsw3  Weight=1
>      80  NodeName=q[97-108]  RealMemory=24000 Feature=thin,mem24GB,ibsw4  Weight=1
>      81  NodeName=q[109-140] RealMemory=24000 Feature=thin,mem24GB,ibsw5  Weight=1
>      82  NodeName=q[141-172] RealMemory=24000 Feature=thin,mem24GB,ibsw6  Weight=1
>      83  NodeName=q[173-204] RealMemory=24000 Feature=thin,mem24GB,ibsw7  Weight=1
>      84  NodeName=q[205-216] RealMemory=24000 Feature=thin,mem24GB,ibsw8  Weight=1
>      85
>      86  NodeName=q[217-232] RealMemory=24000 Feature=thin,mem24GB,ibsw4  Weight=1
>      87
>      88  NodeName=q[233-252] RealMemory=24000 Feature=thin,mem24GB,ibsw8  Weight=1
>      89  NodeName=q[253-284] RealMemory=24000 Feature=thin,mem24GB,ibsw9  Weight=1
>      90  NodeName=q[285-316] RealMemory=24000 Feature=thin,mem24GB,ibsw10 Weight=1
>      91  NodeName=q[317-348] RealMemory=24000 Feature=thin,mem24GB,ibsw11 Weight=1
>      92
>      93  PartitionName=all Nodes=q[1-348] Shared=EXCLUSIVE DefaultTime=00:00:01 MaxTime=14400 State=DOWN
>      94  PartitionName=core Nodes=q[45-348] Default=YES Shared=NO MaxTime=14400 MaxNodes=1 State=UP
>      95  PartitionName=node Nodes=q[1-32,45-348] Shared=EXCLUSIVE DefaultTime=00:00:01 MaxTime=14400 State=UP
>      96  PartitionName=devel Nodes=q[33-44] Shared=EXCLUSIVE DefaultTime=00:00:01 MaxTime=60 MaxNodes=4 State=UP
>
