Re: Removing external containerizer from code base

2015-10-09 Thread Brian Candler

On 08/10/2015 19:04, Kapil Arya wrote:
I don't know all the details, but I guess, depending upon the exact 
interface, it might be possible to have a C++ wrapper around a non-C++ 
containerizer module. In a very naive approach, one simply fork/execs 
the "external" non-C++ containerizer script/program.
... and sends instructions to it. Which is, as far as I can see, what 
the external containerizer does.




Batch/queue frameworks?

2015-10-07 Thread Brian Candler
Are there any open-source job queue/batch systems which run under Mesos? 
I am thinking of things like HTCondor, Torque etc.


The requirement is to be able to:
- define an overall job as a set of sub-tasks (could be many thousands)
- put sub-tasks into a queue; execute tasks from the queue
- dependencies: don't add a sub-task into the queue until its precursors 
have completed successfully
- restart: after an error, be able to restart the job but skipping those 
sub-tasks which completed successfully
- preferably handle short-lived tasks efficiently (of order of 10 
seconds duration)


Clearly it's possible to write a framework to do this, but I don't want 
to re-invent the wheel if it has been done already.


Thanks,

Brian.

P.S. I found Chronos, but it doesn't seem a good match. As far as I can 
see, it's intended for applications where you pre-define a bunch of 
tasks (via GUI? via REST?) and then trigger them periodically.


Re: Batch/queue frameworks?

2015-10-07 Thread Brian Candler

On 07/10/2015 09:44, Nikolaos Ballas neXus wrote:

Maybe you need to read a bit  :)
I have read plenty, including those you list, and I didn't find anything 
which met my requirements. Again I apologise if I was not clear in my 
question.


Spark has a very specific data model (RDDs) and applications which write 
to its API. I want to run arbitrary compute jobs - think "shell scripts" 
or "docker containers" which run pre-existing applications which I can't 
change.  And I want to fill a queue or pipeline with those jobs.


Hadoop also is for specific workloads, written to run under Hadoop and 
preferably using HDFS.


The nearest Hadoop gets to general-purpose computing, as far as I can 
see, is its YARN scheduler. YARN can in turn run under Mesos. Therefore 
a job queue which can run on YARN might be acceptable, although I'd 
rather not have an additional layer in the stack. (There was an old 
project for running Torque under YARN, but this has been abandoned)


Regards,

Brian.



Re: Batch/queue frameworks?

2015-10-07 Thread Brian Candler

On 07/10/2015 09:01, Nikolaos Ballas neXus wrote:

Check for Marathon


I don't see how Marathon does what I want. Maybe I wasn't clear enough 
in explaining my requirements.


What I need is basically a supercomputer cluster where I can take a 
large computation job, break it into lots of sub-tasks, and run as many 
of those sub-tasks in parallel as possible given the CPU resources 
available, until all the sub-tasks are done.


The core of any sort of system like that is a "job queue" where all the 
sub-tasks are entered. The executor picks out another task whenever 
there is some free resource available, and when it finishes, it is 
removed from the queue.


I don't see how Marathon has such a job queue. As far as I can tell, 
Marathon is for starting long-lived applications; you define what things 
you want running, it starts them, and restarts them if they die for any 
reason.


Or have I misunderstood what Marathon is capable of? If so, can you 
point me at the relevant documentation?


The advantage of running such a supercomputer cluster under Mesos would 
be that I could run *other* applications (including those started by 
Marathon or Chronos) on the same hardware.


Thanks,

Brian.



Re: Batch/queue frameworks?

2015-10-07 Thread Brian Candler

On 07/10/2015 11:08, Pablo Cingolani wrote:

It looks like you are looking for something like BDS

http://pcingola.github.io/BigDataScript/

It has the additional advantage that you can port your scripts seamlessly
between Mesos and other cluster systems (SGE, PBS, Torque, etc.).

Yes, that looks very interesting, thank you!  It seems to perform the 
same role as HTCondor Dagman, but with pluggable backends and a much 
more expressive language.


At http://pcingola.github.io/BigDataScript/bigDataScript_manual.html
under "Resource consumption and task options", I don't see any option 
for declaring the memory used by a task. Is that a wishlist feature? In 
fact, mesos allows arbitrary resources, so it would be good to be able 
to specify resource requirements of

any particular resource.

I note that BDS allows a task to specify it runs on one particular 
cluster node. In my application it would also be helpful to be able to 
specify a particular class of node. (When submitting a job to HTCondor 
this could be expanded to a requirements expression)


Regards,

Brian.



Broken images in documentation on web

2015-10-06 Thread Brian Candler

It seems several documentation pages now have broken images, for example:

http://mesos.apache.org/documentation/latest/external-containerizer/
(under " Container Lifecycle Sequence Diagrams")

http://mesos.apache.org/documentation/latest/oversubscription/
(under "How does it work?")

However, some pages are OK, e.g.
http://mesos.apache.org/documentation/latest/mesos-architecture/

Could someone investigate please?

Thanks,

Brian Candler.



Shedding some light on oversubscription

2015-07-20 Thread Brian Candler
I see in the upcoming 0.23.0 there is support for oversubscription of 
resources: the objective is to fill your cluster with background jobs 
while still ensuring critical jobs have the resources they need. This is 
great.


In order to implement this properly, it seems to me that the executor is 
going to have to run the background and normal jobs differently (or run 
them with different containerizer settings). For example, I expect that 
the background jobs will need to be run at a lower CPU priority, so they 
only use spare CPU time which is not required by the higher-priority 
jobs. This presumably will be done by different cgroup settings, 
depending on whether the offer contains revocable resources or not.


What's not clear to me is where this will be implemented, and whether 
this needs further changes in either mesos or the frameworks running 
under it.  Does either the executor or the containerizer need to run 
differently depending on whether the resources are revocable, and is it 
up to the framework to tell it so?


Documentation in this area seems very hard to find.

For example, one of the frameworks (Aurora) has a detailed page on CPU 
isolation:

http://aurora.apache.org/documentation/latest/resource-isolation/

But it's not made clear whether Aurora provides an external 
containerizer of its own which implements this, or whether this page is 
documenting standard Mesos functionality.


I can see that Mesos itself provides a containerizer:
http://mesos.apache.org/documentation/latest/mesos-containerizer/
but that page doesn't even mention the two most important things you 
might want to constrain, namely CPU usage and RAM usage! I have also 
checked in git, and docs/mesos-containerizer.md doesn't mention whether 
it could handle revocable and standard resources differently.


Mesos also provides a docker containerizer:
http://mesos.apache.org/documentation/latest/docker-containerizer/
Apparently that shells out to the docker CLI. Again, it's not clear 
whether it can tell the difference between revocable and standard 
offers, nor indeed whether docker is able to run containers at different 
priority levels.


So in summary: is 0.23.0 laying the foundations for all this, but with 
more work required down the line? And does that work include changes to 
frameworks as well as the core of mesos?


Regards,

Brian.



Re: Setting minimum offer size

2015-06-24 Thread Brian Candler

On 24/06/2015 15:35, David Greenberg wrote:
I'm not aware of any minimum offer size option in mesos. What I've 
seen success with is holding onto small offers and waiting until I 
accumulate enough to launch the large task. This way, the need for 
large offers doesn't affect the cluster, but the framework that needs 
it gets it. 
This means that the execution of a task doesn't need to be linked to a 
single resource offer, but they can be aggregated?


Also, it seems to me this would break down if there were two frameworks 
both which required 20GB of RAM. Let's say each of them grabs 16 x 1GB 
offers. At this point all 32GB of one node has been committed, and there 
is deadlock.


Conversely, ISTM that a 20GB minimum offer size would lead to big 
wastage in the case where frameworks only made 1GB allocations. After 
13GB of RAM had been used (19GB free), no further offers would be made.


Regards,

Brian.



Setting minimum offer size

2015-06-24 Thread Brian Candler
The following problem is mentioned in the Mesos technical paper at 
http://mesos.berkeley.edu/mesos_tech_report.pdf


 when a cluster is filled by tasks with small resource requirements, a 
framework f with large resource requirements may starve, because 
whenever a small task finishes, f cannot accept the resources freed up 
by it, but other frameworks can. To accommodate frameworks with large 
per-task resource requirements, allocation modules can support a minimum 
offer size on each slave, and abstain from offering resources on that 
slave until this minimum amount is free.


However minimum offer size is not mentioned subsequently. Can someone 
explain how this works in practice?


For example, suppose you have machines with 32GB of RAM, one framework 
which has a large queue of jobs which use 1GB each. Then along comes 
another framework with some jobs which require 20GB.


The problem is that when a 1GB job terminates, 1GB slots will be offered 
to both frameworks; the second one will be forced to decline them (not 
big enough), but the first framework will continue to run jobs.


Is the solution here to set the minimum offer size to 20GB? And this 
means that the 1GB jobs will slowly drain away until 20GB is free, at 
which point an offer is made to both frameworks?


If so, how is it actually configured? (I googled mesos minimum offer 
size but only found variations on that paper, not the actual config 
settings)


Thanks,

Brian.



Re: Setting minimum offer size

2015-06-24 Thread Brian Candler

On 24/06/2015 16:31, Alex Gaudio wrote:

Does anyone have other ideas?
HTCondor deals with this by having a defrag demon, which periodically 
stops hosts accepting small jobs, so that it can coalesce small slots 
into larger ones.


http://research.cs.wisc.edu/htcondor/manual/latest/3_5Policy_Configuration.html#sec:SMP-defrag

You can configure policies based on how many drained machines are 
already available, and how many can be draining at once.


Maybe there would be a benefit if Mesos could work out what is the 
largest job any framework has waiting to run, so it knows whether 
draining is required and how far to drain down.  This might take the 
form of a message to the framework: suppose I offered you all the 
resources on the cluster, what is the largest single job you would want 
to run, and which machine(s) could it run on?  Or something like that.


Regards,

Brian.



Re: Resource modelling questions

2015-06-19 Thread Brian Candler

On 19/06/2015 01:59, Benjamin Mahler wrote:

100ms is the default period for quota:

https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt


Ah, that's very interesting: thank you.

Now if I understand this correctly, assuming Mesos runs all its tasks in 
cgroups with CPU bandwidth control, it means that a given task can never 
use more than the CPU share it has requested - even if there is spare 
capacity.


This is good for deterministic behaviour - if it works well when the 
cluster is idle, it will still work well when the cluster is busy (*) - 
but will almost always result in under-utilisation.


I think that means there is a compelling case for having some sort of 
backfill capability for batch jobs to use spare capacity.


Regards,

Brian.

(*) Actually, according to 
http://aurora.apache.org/documentation/latest/resource-isolation/
Mesos considers logical cores, also known as hyperthreading or SMT 
cores, as the unit of CPU.


So if you have a 4 core processor with hyperthreading, it sounds like it 
will schedule up to 8 cores of workload. That will lead to a degree of 
oversubscription of each CPU core, so actually it doesn't *guarantee* 
you will always see the same performance under high load, but it does 
help alleviate the under-utilisation issue.




Re: Thoughts and opinions in physically building a cluster

2015-06-19 Thread Brian Candler

On 19/06/2015 18:38, Oliver Nicholas wrote:
Unless you have some true HA requirements, it seems intuitively 
wasteful to have 3 masters and 2 slaves (unless the cost of 5 nodes is 
inconsequential to you and you hate the environment).
Any particular reason not to have three nodes which are acting both as 
master and slaves?


Re: Resource modelling questions

2015-06-18 Thread Brian Candler

On 18/06/2015 06:41, zhou weitao wrote:


A partial solution might be if each task could request a fraction
of a CPU.

But CPU fraction is Time-Division. I don't think it's possible to 
request a fraction.
Because it's timeslicing, there should be no problem at all requesting 
fractions. For example if you request 0.5 CPUs, then your application 
can run on one CPU for up to 50% of the time; or it could run on 2 CPUs 
for 25% of the time.


There's a nice diagram here:
http://aurora.apache.org/documentation/latest/resource-isolation/

Having said that, I'm not sure where the 100ms timeslices in that 
diagram come from. I'm pretty sure the Linux kernel CFS scheduler does 
not work in that way.


https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt
CFS uses nanosecond granularity accounting and does not rely on any 
jiffies or
other HZ detail.  Thus the CFS scheduler has no notion of timeslices 
in the

way the previous scheduler had

Maybe that diagram is out-of-date?



mesosphere.io broken?

2015-06-17 Thread Brian Candler

Looking for Mesos .deb packages, on Google I find links to
http://mesosphere.io/downloads/
http://elastic.mesosphere.io/
but these are giving 503 Service Unavailable errors.

Is there a problem, or have these sites gone / migrated away?



Resource modelling questions

2015-06-17 Thread Brian Candler

I have some questions about resource accounting in Mesos.

In the [mesos 
architecture](http://mesos.apache.org/documentation/latest/mesos-architecture/) 
and the [2010 white 
paper](https://www.cs.berkeley.edu/~alig/papers/mesos.pdf), the 
following simple example is given:


* A node declares it has 4 CPUs and 4GB RAM
* A resource offer is made to a framework
* The framework scheduler accepts (2 CPU, 1GB) for one task and (1 CPU, 
2GB) for another tasks
* Mesos then is able to offer the remaining 1 CPU and 1GB of RAM to 
another framework.


My question is, is this the total resource accounting which Mesos does, 
or is it able to take account of the *measured* resource utilisation on 
a particular node at the current point in time?


Here are some examples of what I'm thinking of.

(1) A node has 12 CPUs. If tasks each declare a usage of 1 CPU, will 
Mesos only run 12 of them?


Let's say some of those tasks only use 50% of a CPU on average (because 
part of the time they are waiting on network or disk I/O); some tasks 
use 0% CPU most of the time (because they are interactive and are 
waiting for incoming queries)


Clearly, if the remaining tasks are multi-threaded they will be able to 
take up the spare CPU capacity. But if those 12 tasks were all 
single-threaded processes, it seems to me that the CPU would be 
underutilised.


A partial solution might be if each task could request a fraction of a CPU.

(2) A task requests 2GB of RAM, but when running it actually uses only 
1GB of RAM. This means there is spare RAM, at least at the current point 
in time, bearing in mind that the task could take up its allocation 
whenever it choses.


Conversely: a task requests 2GB of RAM usage, but actually uses 3GB 
(assuming there is no hard cgroups enforcement to prevent it doing so).


Third case: a sysadmin manually runs some process on the node which uses 
some RAM, but the process was not started via Mesos.


Would any of these scenarios result in the Mesos allocator making larger 
or smaller resource offers, or would it always calculate RAM available = 
(initial system RAM) - (total of accepted resource offers) ?


(3) Can the executor set execution priorities for different tasks, so 
that if the CPU is overcommitted, real time tasks can take precedence 
over batch tasks?


Where I'm coming from is trying to arrange for a single cluster which 
runs some real time jobs, and all the remaining available resources are 
soaked up running background batch jobs.


The real-time jobs may be bursty interactive workloads - when not being 
queried they are idle and the resources, but when the real-time jobs are 
being queried they should not be slowed down significantly by the batch 
jobs. The background jobs are mixed CPU and I/O, so to achieve full CPU 
utilisation you need to run more of them than the number of CPU cores 
you have.


At the moment I'm using HTCondor but this only really supports batch 
workloads where each task has a defined lifetime; I can't run the 
long-running, real-time query tasks under the same framework. To get 
full CPU utilisation I have to get Condor to lie about the number of CPU 
cores are on each node, pretending there are more than there really are.


Therefore I'm wondering whether Mesos would suit this better, maybe 
using one framework for the persistent real-time workflows and one for 
the batch workloads, but I really want to understand better how the 
resources would be divided up between the frameworks and how to achieve 
both real-time operation and good utilisation of remaining resources.


Also, with HTCondor I have to take a lot of care to calculate (guess) 
up-front the amount of RAM each task will take, and that is difficult, 
as it depends on the algorithms used and the size of the incoming data 
sets. Set the estimate too low, and processes can be killed by the OOM 
killer due to RAM exhaustion. Set the estimate too high and the machine 
is underutilised. It's made more complicated by the fact that the jobs 
use mmap() on large shared databases, so running multiple instances of 
the same task doesn't use N times as much memory as one task.


I'm guessing that Mesos isn't going to free me from having to 
pre-calculate resource requirements for tasks, but if it were able to 
respond to *actual* RAM usage on the system, allow for some 
overcommitment, and terminate lower-priority tasks if RAM is exhausted, 
that would be extremely helpful.


Thanks,

Brian Candler.