Two Sigma's Cook framework (http://github.com/twosigma/cook) has this functionality, which it achieves by preempting Spark executors when the cluster isn't correctly balanced. If you'd like some help configuring it, I'd be happy to help.

On Mon, Feb 1, 2016 at 6:58 PM Hans van den Bogert <[email protected]> wrote:

> Hi Benjamin,
>
> Thanks for having taken the time to answer my, in hindsight, rather
> vague email. After carefully examining more logs etc., I finally know
> what was going on.
>
> (The numbers mentioned are specific to my case.) As you may know, Spark
> in fine-grained mode holds on to memory (25% per framework), and there
> is about 30% churn of CPU resources every allocation interval, at the
> default 1 second.
> Then Mesos does its allocation round, looping per slave. At some point
> my framework 2, F2, will get one of these slaves' resources; this
> happens when F1 is at approx. 80%. At that point Mesos will give F2 a
> stochastic 30% of that slave's CPU resources and a deterministic 75% of
> its memory resources. This increases F2's dominant resource share
> (memory) by 0.75 * 0.10 (the memory fraction of the slave times the
> slave's 1/10 share of the cluster) = 7.5%, to a total of 25% + 7.5% =
> 32.5%.
> From that point on, in subsequent resource offers within the same
> allocation round, as far as DRF is concerned F2 has a dominant resource
> share of 32.5%. Honoring the 4:1 ratio, or even my "compensated" 3.2:1,
> F2 will never be offered resources again for that allocation interval.
>
> Now why does my compensated ratio work in the case of a 50ms allocation
> interval? Well, the expected churn of CPU resources in a 50ms interval
> is close to, or smaller than, 1. So Mesos' allocation round almost
> always consists of 1 slave, and its expected 1 CPU gets correctly given
> to the right role, F1/F2, because at the beginning of every allocation
> interval F2's dominant resource, memory, is back at 25%.
>
> So it seems DRF is not favorable to frameworks which do not think in
> terms of <CPU, MEM> tuples for tasks, like Spark in fine-grained mode.
>
> To verify my theory I "lied" about the memory in the cluster by a large
> factor, >10x, and suddenly fairness was honored. Although that was great
> for proving my theory, the Mesos scheduling has now become useless if I
> ever want to make this a heterogeneous cluster, i.e. add more types of
> frameworks.
>
> Will revocable resources remedy this in one way or another? Are there
> mechanisms in Mesos which can help me out with my use case without
> resorting to lying about the memory per slave?
>
> Regards,
>
> Hans
>
> PS Sorry for the tough read; it's hard to explain these things properly
> over email.
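>
> PPS For concreteness, a rough sketch of the bookkeeping above in plain
> Python (the numbers are specific to my 10-slave setup, and the variable
> names are mine, not Mesos internals):
>
>     num_slaves = 10
>     f2_share = {"cpus": 0.0, "mem": 0.25}  # Spark already holds ~25% of cluster memory
>
>     # F2 is offered one slave: ~30% of its CPUs and 75% of its memory.
>     # One slave is 1/10 of the cluster, so cluster-wide this adds:
>     f2_share["cpus"] += 0.30 * (1.0 / num_slaves)  # +3%
>     f2_share["mem"] += 0.75 * (1.0 / num_slaves)   # +7.5%
>
>     # DRF ranks frameworks by their dominant share (max over resources):
>     print(max(f2_share.values()))  # 0.325 -> 32.5%, well above F2's
>                                    # weighted fair share, so no further
>                                    # offers reach F2 this round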
>
> On Jan 29, 2016, at 8:41 PM, Benjamin Mahler <[email protected]> wrote:
>
> Hi Hans,
>
> The biggest thing to note here is that we (in retrospect) made the
> mistake a long time ago of offering resources as non-revocable by
> default. We'd like to change this default so that frameworks only
> receive non-revocable resources when they have quota or reservations in
> place. While I don't have enough information to comment on your exact
> scenario, it's worth mentioning that we do not yet have the ability to
> revoke resources, so if there are long-running executors we can get into
> situations where fairness is not respected. For example, if framework 1
> arrives before framework 2 and takes all of the resources, framework 2
> will be starved, since Mesos cannot take action to revoke and provide
> fairness. We're currently looking at making things revocable by default
> so that Mesos can dynamically maintain fairness via revocation. In that
> world, you should see weighted fairness maintained, and you would use
> quota and/or reservations to provide guarantees to frameworks.
>
> Hope that helps you diagnose further and get some context on the
> (current) caveats!
>
> Ben
>
> On Tue, Jan 26, 2016 at 5:43 AM, Hans van den Bogert
> <[email protected]> wrote:
>
>> Hi,
>>
>> While investigating fairness possibilities with Mesos for Spark
>> workloads, I'm trying to achieve, for example, a 4:1 weight ratio
>> between two frameworks. Imagine a system with two Spark frameworks (in
>> fine-grained mode, if you're familiar with Spark) where I want one of
>> the two frameworks to get four times more resources than the other when
>> both are contending for resources.
>>
>> In Mesos I set two roles, "F1" and "F2", with weights of 4 and 1
>> respectively (the master invocation is sketched below).
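>>
>> (Roughly how I start the master; the ZooKeeper URL and work dir are
>> placeholders for my real ones:
>>
>>     mesos-master --zk=zk://<quorum>/mesos --work_dir=/var/lib/mesos \
>>         --roles="F1,F2" --weights="F1=4,F2=1"
>> )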
>>
>> However, during the times when both frameworks are in need of
>> resources, the latter gets close to zero offers. Having read and more
>> carefully investigated DRF, I understood that memory is the dominant
>> resource in the case of framework 2 (F2), which Spark sets statically,
>> i.e., it doesn't release memory once acquired, and in my case that is
>> ~25% per slave. So the allocator thinks that F2 has received enough
>> resources, since its dominant resource is already above its weighted
>> fair share. Thus all CPU offers go to framework 1 (F1).
>> To remedy this first hurdle I recalculated, although somewhat
>> contrived, the ratio to 3.2 (= 80% / 25%).
>> After using the 3.2:1 ratio things are a bit better, but framework 2
>> (F2) still only gets half of the resources it should during high
>> resource demand from both frameworks.
>>
>> At this point I was quite lost and tried changing several parameters;
>> one of them was the allocation interval (master option
>> --allocation_interval, set as shown below), which I lowered to a
>> relatively low 50ms from the default 1000ms. Suddenly my ratio was
>> being honored perfectly and I was getting roughly a 4:1 CPU ratio
>> between the two Spark frameworks. (Verifying that my 3.2:1 ratio, to
>> circumvent Spark's static memory allocation, was working.)
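>>
>> (Concretely, that was the same master invocation as above with
>> --allocation_interval=50ms appended.)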
>>
>> Perhaps it's because I'm using only 10 physical nodes; however, I made
>> unit tests in the Mesos source to mimic my case, and there I could
>> verify that the offers are made fairly according to the weights.
>>
>> Why is the fairness, expressed as being close to the defined role
>> weights, only honored when the allocation interval is relatively low?
>> I hope someone can explain this phenomenon.
>>
>> Thanks,
>>
>> Hans