> Each job starting on a machine will contribute 1 to the adjustment, which
> will decay over time to 0, in your case in 7:30 minutes. The 38.23 is the
> sum of all these adjustments of all jobs starting in the last 7:30, while
> each job will have its own individual contribution to this sum. If no job
> started in the last 7:30 on a machine it should read 0.50 * 0.000000. This
> value is then divided by 56 before being added to 0.965536.
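
A rough sketch of the arithmetic described in the quoted explanation, not Grid Engine's actual code. Each job contributes 1 when it starts and 0 once the 7:30 decay time has passed; the sum of contributions is scaled by the 0.50 adjustment, divided by 56, and added to the reported load. The function names are made up for illustration, and the linear shape of the decay is an assumption.

```python
DECAY_SECONDS = 7 * 60 + 30  # decay time of 7:30 from the thread


def job_contribution(age_seconds, decay_seconds=DECAY_SECONDS):
    """Contribution of one job that started age_seconds ago.

    Starts at 1 and reaches 0 after decay_seconds.
    Assumption: the decay in between is linear.
    """
    if age_seconds < 0:
        raise ValueError("age_seconds must be non-negative")
    return max(0.0, 1.0 - age_seconds / decay_seconds)


def adjusted_load(reported_load, contribution_sum, adjustment=0.5, divisor=56):
    """Reported load plus the scaled sum of job contributions.

    divisor is the 56 from the thread (presumably the host's core count,
    which is an assumption here).
    """
    return reported_load + adjustment * contribution_sum / divisor


if __name__ == "__main__":
    # Numbers taken from the thread: 0.965536 + 0.50 * 38.23 / 56
    print(adjusted_load(0.965536, 38.23))
    # No job started in the last 7:30: the adjustment term vanishes
    print(adjusted_load(0.965536, 0.0))
```

With the values from the thread the adjusted figure comes out slightly above 1.3, while a host with no recently started jobs keeps its reported 0.965536.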

I actually figured out the 38.23 while I was writing this email: I noticed the
decay time and started to read about it. Still, what made me send the question
is that I don't see where the load_formula comes into play here. The "minus
num_proc" part seems to be completely ignored, so I'm probably missing
something. What is it?

> > load_formula is load_avg-num_proc and load_adjustments are 0.5:
> What was the reason to implement it this way? Having a fully loaded machine
> and subtracting num_proc would read zero - which doesn't reflect the actual
> use of the machine.

No one remembers. I talked with the people who configured it and they have
absolutely no idea :) "probably copy pasted from somewhere online" <-- real
quote.

> - A job_load_adjustments does handle the fact that a job isn't using the
> granted resources instantly, what is not happening in your case.

I would also assume it's good for "starting engines": since load_avg is the
5-minute load, submitting a huge array after some idle time will make all jobs
see almost zero load on the machine. I wouldn't mind bombing the machine,
because we only have 1 slot per core, so I'm not really worried about killing
the CPU. But I can see the logic in it even for jobs that are always intensive.

> - alarm_threshold in the queue definition takes care in case you want to
> oversubscribe a machine by intention as your parallel job doesn't scale
> well

We basically have 2 kinds of queue: a workhorse queue, "all.q", which has 1
slot per core, and an interactive queue, which also has 1 slot per core but
gets a better priority. We set the load_thresholds to 1.3 to allow 30%
oversubscription, so that interactive jobs can always run. We never put our
nodes in alarm mode. We use Zabbix to monitor each machine's health, and we
automatically take a machine out of the cluster (by disabling all of its
queues) when there is a "mess": disk failures, out of space, mounting issues,
stuff like that.
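
To make the numbers in this exchange concrete, here is a small sketch comparing the configured load_formula (load_avg - num_proc) with a per-core load checked against the 1.3 threshold. The function names are invented for illustration, and treating the threshold as a comparison against load per core is an assumption based on the "30% oversubscription" wording, not something confirmed in this thread.

```python
def load_minus_procs(load_avg, num_proc):
    """The configured load_formula: load_avg - num_proc."""
    return load_avg - num_proc


def load_per_core(load_avg, num_proc):
    """Load normalized by the number of cores."""
    return load_avg / num_proc


def over_threshold(load_avg, num_proc, threshold=1.3):
    """True if the per-core load exceeds the threshold (assumed semantics)."""
    return load_per_core(load_avg, num_proc) > threshold


if __name__ == "__main__":
    # A fully loaded 56-core host: the formula reads zero,
    # while the per-core load reads 1.0 and stays under 1.3.
    print(load_minus_procs(56.0, 56))
    print(load_per_core(56.0, 56))
    print(over_threshold(56.0, 56))
    # An oversubscribed host (load 80 on 56 cores) is above 1.3 per core.
    print(over_threshold(80.0, 56))
```
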
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users