I have a use case where my Scheduler gets an externally-generated request
to produce an image.  This is a CPU-intensive task that I can divide up
into, say, 20 largely independent jobs, and I have an application that
takes the input filename and a slot number (one of the 20) and produces
1/20th of the output image.  Each job runs on its own machine, using all
CPUs and memory on the machine.  The final output image isn't finished
until all 20 jobs are complete, so I don't want to send an external 'job
complete' message until these 20 jobs all finish.

I can do this in Mesos by accepting 20 resource offers and launching a
task on each, where each task declares that it needs all of the
machine's resources.  I then do bookkeeping in the Scheduler as tasks
complete so I know when all 20 have finished, at which point I can send
my external 'job complete' message.
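
Roughly, the per-request bookkeeping I have in mind looks like the
sketch below (just a sketch: the class and method names are my own, and
I'm assuming the Scheduler's statusUpdate() callback forwards terminal
task states into it):

NUM_SLOTS = 20

class ImageRequest(object):
    """Tracks the 20 slots of one externally-requested image."""

    def __init__(self, request_id, num_slots=NUM_SLOTS):
        self.request_id = request_id
        self.pending = set(range(num_slots))  # slots not yet finished
        self.failed = False

    def on_task_finished(self, slot):
        # Called when a slot's task reaches TASK_FINISHED.
        self.pending.discard(slot)
        if not self.pending and not self.failed:
            self.send_job_complete()

    def send_job_complete(self):
        # Stand-in for the external 'job complete' message.
        print("request %s: all slots finished" % self.request_id)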

This is all doable, but there are some obvious complications here (for
example, if any of the 20 jobs fails, I want to fail all 20, but I have
to keep track of that myself).
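
Concretely, the extra failure bookkeeping would be something like the
following (again only a sketch, continuing the class above; kill_task
is a stand-in for however the framework kills a running task, e.g. a
wrapper around SchedulerDriver.killTask):

def fail_whole_request(request, failed_slot, kill_task):
    # Called when any slot's task reaches TASK_FAILED or TASK_LOST.
    request.pending.discard(failed_slot)
    if request.failed:
        return  # already torn down
    request.failed = True
    for slot in sorted(request.pending):
        kill_task(request.request_id, slot)  # cancel the sibling tasks
    request.pending.clear()
    print("request %s: slot %d failed, failing the whole request"
          % (request.request_id, failed_slot))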

AWS Batch has Array Jobs which would give me the kind of functionality I
want (http://docs.aws.amazon.com/batch/latest/userguide/array_jobs.html).
I'm wondering if there's any way to do this - specifically running a single
logical task across multiple machines - using either Mesos or an additional
framework that lives on top of Mesos.

Thanks.
-Adam
