Brian, these are very good and nicely written questions; let me try to answer them.
Mesos' built-in allocator does its bookkeeping based on the declared resource consumption, not the actual (you call it "measured") consumption. This means the example in the Apache docs and the white paper is correct.

However, there is ongoing work on adding oversubscription to Mesos. I'm not an expert in this area, and maybe Niklas Nielsen will chime in and correct me later, but IMO the plan is to measure the actual resource consumption on each Mesos agent node and notify the Mesos master (including the allocator) about extra free but revocable resources. Here <https://docs.google.com/document/d/1pUnElxHy1uWfHY_FOvvRC73QaOGgdXE0OXN-gbxdXA0/edit#heading=h.yvd9qbi4swb4> is the design doc for this feature; some code has already landed in the Mesos master branch. Check, for example, include/mesos/slave/resource_estimator.hpp.

As far as I know, we do not have execution priorities for tasks. We plan to tackle this problem from a different direction: introduce quota (i.e. cluster-wide resource reservations) for production frameworks, which guarantees that a certain amount of resources can be used by the framework at any time, together with oversubscription of quota resources that are currently unused by the framework.

Another effort that aims to increase cluster utilization is optimistic offers, which means offering the same resources to multiple frameworks at the same time. Please be advised that both quota and optimistic offers are in the early design phase right now and will definitely not land in Mesos 0.23.

And yes, to increase CPU utilization, you may also lie to the Mesos master about how many CPUs your agent nodes have : ).

Hope this sheds some light on the topic.

On Wed, Jun 17, 2015 at 11:47 AM, Brian Candler <b.cand...@pobox.com> wrote:

> On 17/06/2015 10:33, Brian Candler wrote:
>
>> It's made more complicated by the fact that the jobs use mmap() on large
>> shared databases, so running multiple instances of the same task doesn't
>> use N times as much memory as one task.
>
> Aside: combined with cgroups this gets hairy.
>
> As I understand it, mmap() memory is charged to the first process which
> touches it, and not to subsequent users of the same page. When a process
> terminates, the charge gets passed to the parent cgroup.
>
> Also: 3.2-vintage kernels have issues: even if only using cgroups for
> accounting (no hard limits), it seems the OOM killer kicks in if there are
> too many dirty pages waiting to be written to disk.
>
> https://www-auth.cs.wisc.edu/lists/htcondor-users/2015-February/msg00087.shtml
> https://www-auth.cs.wisc.edu/lists/htcondor-users/2015-February/msg00135.shtml
>
> Just thought that might be of interest.