Re: Removing external containerizer from code base
On 08/10/2015 19:04, Kapil Arya wrote: I don't know all the details, but I guess, depending upon the exact interface, it might be possible to have a C++ wrapper around a non-C++ containerizer module. In a very naive approach, one simply fork/execs the "external" non-C++ containerizer script/program. ... and sends instructions to it. Which is, as far as I can see, what the external containerizer does.
Batch/queue frameworks?
Are there any open-source job queue/batch systems which run under Mesos? I am thinking of things like HTCondor, Torque etc. The requirement is to be able to: - define an overall job as a set of sub-tasks (could be many thousands) - put sub-tasks into a queue; execute tasks from the queue - dependencies: don't add a sub-task into the queue until its precursors have completed successfully - restart: after an error, be able to restart the job but skipping those sub-tasks which completed successfully - preferably handle short-lived tasks efficiently (of order of 10 seconds duration) Clearly it's possible to write a framework to do this, but I don't want to re-invent the wheel if it has been done already. Thanks, Brian. P.S. I found Chronos, but it doesn't seem a good match. As far as I can see, it's intended for applications where you pre-define a bunch of tasks (via GUI? via REST?) and then trigger them periodically.
Re: Batch/queue frameworks?
On 07/10/2015 09:44, Nikolaos Ballas neXus wrote: Maybe you need to read a bit :) I have read plenty, including those you list, and I didn't find anything which met my requirements. Again I apologise if I was not clear in my question. Spark has a very specific data model (RDDs) and applications which write to its API. I want to run arbitrary compute jobs - think "shell scripts" or "docker containers" which run pre-existing applications which I can't change. And I want to fill a queue or pipeline with those jobs. Hadoop also is for specific workloads, written to run under Hadoop and preferably using HDFS. The nearest Hadoop gets to general-purpose computing, as far as I can see, is its YARN scheduler. YARN can in turn run under Mesos. Therefore a job queue which can run on YARN might be acceptable, although I'd rather not have an additional layer in the stack. (There was an old project for running Torque under YARN, but this has been abandoned) Regards, Brian.
Re: Batch/queue frameworks?
On 07/10/2015 09:01, Nikolaos Ballas neXus wrote: Check for Marathon I don't see how Marathon does what I want. Maybe I wasn't clear enough in explaining my requirements. What I need is basically a supercomputer cluster where I can take a large computation job, break it into lots of sub-tasks, and run as many of those sub-tasks in parallel as possible given the CPU resources available, until all the sub-tasks are done. The core of any sort of system like that is a "job queue" where all the sub-tasks are entered. The executor picks out another task whenever there is some free resource available, and when it finishes, it is removed from the queue. I don't see how Marathon has such a job queue. As far as I can tell, Marathon is for starting long-lived applications; you define what things you want running, it starts them, and restarts them if they die for any reason. Or have I misunderstood what Marathon is capable of? If so, can you point me at the relevant documentation? The advantage of running such a supercomputer cluster under Mesos would be that I could run *other* applications (including those started by Marathon or Chronos) on the same hardware. Thanks, Brian.
Re: Batch/queue frameworks?
On 07/10/2015 11:08, Pablo Cingolani wrote: It looks like you are looking for something like BDS http://pcingola.github.io/BigDataScript/ It has the additional advantage that you can port your scripts seamlessly between Mesos and other cluster systems (SGE, PBS, Torque, etc.). Yes, that looks very interesting, thank you! It seems to perform the same role as HTCondor Dagman, but with pluggable backends and a much more expressive language. At http://pcingola.github.io/BigDataScript/bigDataScript_manual.html under "Resource consumption and task options", I don't see any option for declaring the memory used by a task. Is that a wishlist feature? In fact, mesos allows arbitrary resources, so it would be good to be able to specify resource requirements of any particular resource. I note that BDS allows a task to specify it runs on one particular cluster node. In my application it would also be helpful to be able to specify a particular class of node. (When submitting a job to HTCondor this could be expanded to a requirements expression) Regards, Brian.
Broken images in documentation on web
It seems several documentation pages now have broken images, for example: http://mesos.apache.org/documentation/latest/external-containerizer/ (under " Container Lifecycle Sequence Diagrams") http://mesos.apache.org/documentation/latest/oversubscription/ (under "How does it work?") However, some pages are OK, e.g. http://mesos.apache.org/documentation/latest/mesos-architecture/ Could someone investigate please? Thanks, Brian Candler.
Shedding some light on oversubscription
I see in the upcoming 0.23.0 there is support for oversubscription of resources: the objective is to fill your cluster with background jobs while still ensuring critical jobs have the resources they need. This is great. In order to implement this properly, it seems to me that the executor is going to have to run the background and normal jobs differently (or run them with different containerizer settings). For example, I expect that the background jobs will need to be run at a lower CPU priority, so they only use spare CPU time which is not required by the higher-priority jobs. This presumably will be done by different cgroup settings, depending on whether the offer contains revocable resources or not. What's not clear to me is where this will be implemented, and whether this needs further changes in either mesos or the frameworks running under it. Does either the executor or the containerizer need to run differently depending on whether the resources are revocable, and is it up to the framework to tell it so? Documentation in this area seems very hard to find. For example, one of the frameworks (Aurora) has a detailed page on CPU isolation: http://aurora.apache.org/documentation/latest/resource-isolation/ But it's not made clear whether Aurora provides an external containerizer of its own which implements this, or whether this page is documenting standard Mesos functionality. I can see that Mesos itself provides a containerizer: http://mesos.apache.org/documentation/latest/mesos-containerizer/ but that page doesn't even mention the two most important things you might want to constrain, namely CPU usage and RAM usage! I have also checked in git, and docs/mesos-containerizer.md doesn't mention whether it could handle revocable and standard resources differently. Mesos also provides a docker containerizer: http://mesos.apache.org/documentation/latest/docker-containerizer/ Apparently that shells out to the docker CLI. Again, it's not clear whether it can tell the difference between revocable and standard offers, nor indeed whether docker is able to run containers at different priority levels. So in summary: is 0.23.0 laying the foundations for all this, but with more work required down the line? And does that work include changes to frameworks as well as the core of mesos? Regards, Brian.
Re: Setting minimum offer size
On 24/06/2015 15:35, David Greenberg wrote: I'm not aware of any minimum offer size option in mesos. What I've seen success with is holding onto small offers and waiting until I accumulate enough to launch the large task. This way, the need for large offers doesn't affect the cluster, but the framework that needs it gets it. This means that the execution of a task doesn't need to be linked to a single resource offer, but they can be aggregated? Also, it seems to me this would break down if there were two frameworks both which required 20GB of RAM. Let's say each of them grabs 16 x 1GB offers. At this point all 32GB of one node has been committed, and there is deadlock. Conversely, ISTM that a 20GB minimum offer size would lead to big wastage in the case where frameworks only made 1GB allocations. After 13GB of RAM had been used (19GB free), no further offers would be made. Regards, Brian.
Setting minimum offer size
The following problem is mentioned in the Mesos technical paper at http://mesos.berkeley.edu/mesos_tech_report.pdf when a cluster is filled by tasks with small resource requirements, a framework f with large resource requirements may starve, because whenever a small task finishes, f cannot accept the resources freed up by it, but other frameworks can. To accommodate frameworks with large per-task resource requirements, allocation modules can support a minimum offer size on each slave, and abstain from offering resources on that slave until this minimum amount is free. However minimum offer size is not mentioned subsequently. Can someone explain how this works in practice? For example, suppose you have machines with 32GB of RAM, one framework which has a large queue of jobs which use 1GB each. Then along comes another framework with some jobs which require 20GB. The problem is that when a 1GB job terminates, 1GB slots will be offered to both frameworks; the second one will be forced to decline them (not big enough), but the first framework will continue to run jobs. Is the solution here to set the minimum offer size to 20GB? And this means that the 1GB jobs will slowly drain away until 20GB is free, at which point an offer is made to both frameworks? If so, how is it actually configured? (I googled mesos minimum offer size but only found variations on that paper, not the actual config settings) Thanks, Brian.
Re: Setting minimum offer size
On 24/06/2015 16:31, Alex Gaudio wrote: Does anyone have other ideas? HTCondor deals with this by having a defrag demon, which periodically stops hosts accepting small jobs, so that it can coalesce small slots into larger ones. http://research.cs.wisc.edu/htcondor/manual/latest/3_5Policy_Configuration.html#sec:SMP-defrag You can configure policies based on how many drained machines are already available, and how many can be draining at once. Maybe there would be a benefit if Mesos could work out what is the largest job any framework has waiting to run, so it knows whether draining is required and how far to drain down. This might take the form of a message to the framework: suppose I offered you all the resources on the cluster, what is the largest single job you would want to run, and which machine(s) could it run on? Or something like that. Regards, Brian.
Re: Resource modelling questions
On 19/06/2015 01:59, Benjamin Mahler wrote: 100ms is the default period for quota: https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt Ah, that's very interesting: thank you. Now if I understand this correctly, assuming Mesos runs all its tasks in cgroups with CPU bandwidth control, it means that a given task can never use more than the CPU share it has requested - even if there is spare capacity. This is good for deterministic behaviour - if it works well when the cluster is idle, it will still work well when the cluster is busy (*) - but will almost always result in under-utilisation. I think that means there is a compelling case for having some sort of backfill capability for batch jobs to use spare capacity. Regards, Brian. (*) Actually, according to http://aurora.apache.org/documentation/latest/resource-isolation/ Mesos considers logical cores, also known as hyperthreading or SMT cores, as the unit of CPU. So if you have a 4 core processor with hyperthreading, it sounds like it will schedule up to 8 cores of workload. That will lead to a degree of oversubscription of each CPU core, so actually it doesn't *guarantee* you will always see the same performance under high load, but it does help alleviate the under-utilisation issue.
Re: Thoughts and opinions in physically building a cluster
On 19/06/2015 18:38, Oliver Nicholas wrote: Unless you have some true HA requirements, it seems intuitively wasteful to have 3 masters and 2 slaves (unless the cost of 5 nodes is inconsequential to you and you hate the environment). Any particular reason not to have three nodes which are acting both as master and slaves?
Re: Resource modelling questions
On 18/06/2015 06:41, zhou weitao wrote: A partial solution might be if each task could request a fraction of a CPU. But CPU fraction is Time-Division. I don't think it's possible to request a fraction. Because it's timeslicing, there should be no problem at all requesting fractions. For example if you request 0.5 CPUs, then your application can run on one CPU for up to 50% of the time; or it could run on 2 CPUs for 25% of the time. There's a nice diagram here: http://aurora.apache.org/documentation/latest/resource-isolation/ Having said that, I'm not sure where the 100ms timeslices in that diagram come from. I'm pretty sure the Linux kernel CFS scheduler does not work in that way. https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt CFS uses nanosecond granularity accounting and does not rely on any jiffies or other HZ detail. Thus the CFS scheduler has no notion of timeslices in the way the previous scheduler had Maybe that diagram is out-of-date?
mesosphere.io broken?
Looking for Mesos .deb packages, on Google I find links to http://mesosphere.io/downloads/ http://elastic.mesosphere.io/ but these are giving 503 Service Unavailable errors. Is there a problem, or have these sites gone / migrated away?
Resource modelling questions
I have some questions about resource accounting in Mesos. In the [mesos architecture](http://mesos.apache.org/documentation/latest/mesos-architecture/) and the [2010 white paper](https://www.cs.berkeley.edu/~alig/papers/mesos.pdf), the following simple example is given: * A node declares it has 4 CPUs and 4GB RAM * A resource offer is made to a framework * The framework scheduler accepts (2 CPU, 1GB) for one task and (1 CPU, 2GB) for another tasks * Mesos then is able to offer the remaining 1 CPU and 1GB of RAM to another framework. My question is, is this the total resource accounting which Mesos does, or is it able to take account of the *measured* resource utilisation on a particular node at the current point in time? Here are some examples of what I'm thinking of. (1) A node has 12 CPUs. If tasks each declare a usage of 1 CPU, will Mesos only run 12 of them? Let's say some of those tasks only use 50% of a CPU on average (because part of the time they are waiting on network or disk I/O); some tasks use 0% CPU most of the time (because they are interactive and are waiting for incoming queries) Clearly, if the remaining tasks are multi-threaded they will be able to take up the spare CPU capacity. But if those 12 tasks were all single-threaded processes, it seems to me that the CPU would be underutilised. A partial solution might be if each task could request a fraction of a CPU. (2) A task requests 2GB of RAM, but when running it actually uses only 1GB of RAM. This means there is spare RAM, at least at the current point in time, bearing in mind that the task could take up its allocation whenever it choses. Conversely: a task requests 2GB of RAM usage, but actually uses 3GB (assuming there is no hard cgroups enforcement to prevent it doing so). Third case: a sysadmin manually runs some process on the node which uses some RAM, but the process was not started via Mesos. Would any of these scenarios result in the Mesos allocator making larger or smaller resource offers, or would it always calculate RAM available = (initial system RAM) - (total of accepted resource offers) ? (3) Can the executor set execution priorities for different tasks, so that if the CPU is overcommitted, real time tasks can take precedence over batch tasks? Where I'm coming from is trying to arrange for a single cluster which runs some real time jobs, and all the remaining available resources are soaked up running background batch jobs. The real-time jobs may be bursty interactive workloads - when not being queried they are idle and the resources, but when the real-time jobs are being queried they should not be slowed down significantly by the batch jobs. The background jobs are mixed CPU and I/O, so to achieve full CPU utilisation you need to run more of them than the number of CPU cores you have. At the moment I'm using HTCondor but this only really supports batch workloads where each task has a defined lifetime; I can't run the long-running, real-time query tasks under the same framework. To get full CPU utilisation I have to get Condor to lie about the number of CPU cores are on each node, pretending there are more than there really are. Therefore I'm wondering whether Mesos would suit this better, maybe using one framework for the persistent real-time workflows and one for the batch workloads, but I really want to understand better how the resources would be divided up between the frameworks and how to achieve both real-time operation and good utilisation of remaining resources. Also, with HTCondor I have to take a lot of care to calculate (guess) up-front the amount of RAM each task will take, and that is difficult, as it depends on the algorithms used and the size of the incoming data sets. Set the estimate too low, and processes can be killed by the OOM killer due to RAM exhaustion. Set the estimate too high and the machine is underutilised. It's made more complicated by the fact that the jobs use mmap() on large shared databases, so running multiple instances of the same task doesn't use N times as much memory as one task. I'm guessing that Mesos isn't going to free me from having to pre-calculate resource requirements for tasks, but if it were able to respond to *actual* RAM usage on the system, allow for some overcommitment, and terminate lower-priority tasks if RAM is exhausted, that would be extremely helpful. Thanks, Brian Candler.