Hello List,

I think I have found the culprit. Tickets are distributed according to STN_stt (short term target), which in turn is basically the inverse of the recorded (and decayed) usage. Since recorded usage may be zero, the denominator is actually max (SGE_MIN_USAGE * ltt), where, for a simple share tree with just a "default" user and nothing else, ltt (long_term_entitlement) is just 1/#users.

Unless I am mistaken, usage values (as used for the share tree) are never normalized/scaled, so we get actual CPU-seconds here, which can easily be of the order of 10**8. However, SGE_MIN_USAGE is #defined to 1.0, so while calling max() does prevent a real division by zero, it still allows very large values of STN_stt.
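To make the size of the effect concrete, here is a minimal sketch in Python. It is not the Grid Engine source: it assumes a simplified model in which each user's target is ltt / max(usage, SGE_MIN_USAGE), normalized so that all targets sum to 1. The function name short_term_targets, the user names, and the usage figures are made up for illustration; only SGE_MIN_USAGE = 1.0 and ltt = 1/#users come from the description above.

```python
# Illustrative model only; NOT the Grid Engine implementation.
# Assumption: target ~ ltt / max(usage, SGE_MIN_USAGE), normalized to sum to 1.

SGE_MIN_USAGE = 1.0  # floor on usage, as #defined according to the post


def short_term_targets(usage_by_user):
    """Return a normalized, inverse-usage target per user (hypothetical helper)."""
    n_users = len(usage_by_user)
    ltt = 1.0 / n_users  # simple share tree with only a "default" user

    # Usage is floored at SGE_MIN_USAGE so that zero usage does not divide by zero.
    raw = {
        user: ltt / max(usage, SGE_MIN_USAGE)
        for user, usage in usage_by_user.items()
    }
    total = sum(raw.values())
    return {user: value / total for user, value in raw.items()}


# Usage in raw CPU-seconds (made-up figures): A has none, B and C differ by 3x.
usage = {"A": 0.0, "B": 1.0e8, "C": 3.0e8}
shares = short_term_targets(usage)

for user, share in shares.items():
    print(f"{user}: {share:.3e}")
```

Under these assumptions, A's target is larger than B's by a factor of 10**8, so A ends up with practically the entire pool. B and C still differ by a factor of three relative to each other, but both shares are so small that the difference is unlikely to survive once tickets are handed out.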
That said, I still do not understand why the limiting effect of compensation_factor (i.e. stt := min (stt, compensation_factor * ltt) ) does not prevent this.

A.

On Dec 21, 2011, at 10:34, Esztermann, Ansgar wrote:

> The problem at hand is this: sometimes, a user with no recorded usage will
> submit jobs with very demanding resource requests, which consequently sit
> around in qw for quite some time. They will also keep all available share
> tickets to themselves, leaving none for other users' jobs. This is OK for
> user A (no recorded usage means top priority), but users B and C might have
> very different usage, and should therefore be assigned very different
> priorities. However, there are no tickets left, so everyone but user A is
> treated as an equal, upsetting fairness until those few special nodes that
> can run A's jobs become free. Once A accumulates usage, share tickets are
> gradually re-distributed to other users, and things work fine; until user X
> becomes the new A, that is.

--
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105
