On 24/06/2015 16:31, Alex Gaudio wrote:
> Does anyone have other ideas?
HTCondor deals with this by having a "defrag" daemon, which periodically
stops hosts from accepting small jobs (it "drains" them), so that it can
coalesce small slots into larger ones.
http://research.cs.wisc.edu/htcondor/manual/latest/3_5Policy_Configuration.html#sec:SMP-defrag
You can configure policies based on how many drained machines are
already available, and how many can be draining at once.
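To make that kind of policy concrete, here is a rough sketch in Python.
It is not HTCondor's code: the function and its parameter names are made
up for illustration, and the rule that machines already draining count
towards the target is my assumption.

```python
def machines_to_drain(whole_machines, draining, max_whole, max_draining):
    """Return how many more machines to start draining this cycle.

    Hypothetical helper, not an HTCondor API.
    whole_machines: machines that are already fully drained
    draining:       machines currently in the process of draining
    max_whole:      stop draining once this many whole machines exist
    max_draining:   never have more than this many draining at once
    """
    if whole_machines >= max_whole:
        return 0  # enough large slots already available
    # Assumption: machines already draining will become whole machines.
    still_needed = max_whole - whole_machines - draining
    free_draining_slots = max_draining - draining
    return max(0, min(still_needed, free_draining_slots))
```

For example, with no whole machines, nothing draining, a target of 4
whole machines and at most 2 draining at once, this starts draining 2.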
Maybe there would be a benefit if Mesos could work out the largest job
any framework has waiting to run, so that it knows whether draining is
required and how far to drain down. This might take the form of a
message to the framework: "suppose I offered you all the resources on
the cluster, what is the largest single job you would want to run, and
which machine(s) could it run on?" Or something like that.
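A sketch of what the master might do with the answers, in Python. None
of this is an existing Mesos API: the Resources and Host types and the
plan_draining function are hypothetical, and I assume each framework
reports a single largest job described only by CPUs and memory.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Resources:
    cpus: float
    mem_mb: int

    def fits_in(self, other: "Resources") -> bool:
        """True if these resources fit within `other`."""
        return self.cpus <= other.cpus and self.mem_mb <= other.mem_mb


@dataclass
class Host:
    name: str
    total: Resources  # capacity if the host were fully drained
    free: Resources   # what is unallocated right now


def plan_draining(largest_jobs: Dict[str, Resources],
                  hosts: List[Host]) -> Dict[str, List[str]]:
    """Map each framework that needs draining to the hosts worth draining.

    A framework is omitted if its largest job already fits in some
    host's free resources. An empty list means no host could run the
    job even when fully drained.
    """
    plan = {}
    for framework, job in largest_jobs.items():
        if any(job.fits_in(h.free) for h in hosts):
            continue  # can be offered now, no draining required
        plan[framework] = [h.name for h in hosts if job.fits_in(h.total)]
    return plan
```

The per-framework answer also tells the master how far to drain: it only
needs to free up enough on a candidate host to cover that one job, not
the whole machine.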
Regards,
Brian.