What you are looking for is probably a workflow manager. It is more or less independent from a cluster management system, such as Mesos.
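To make the term concrete: the core of a workflow manager is that it records which tasks have completed and runs a task only after its dependencies have succeeded, so a restart skips finished work. The following is a minimal Python sketch of that idea only, not the API of any particular tool. The names `Task`, `run` and `demo`, and the convention of recording completion as a `.done` marker file, are assumptions made up for illustration.

```python
import os
import tempfile


class Task:
    """A unit of work; completion is recorded as a marker file (hypothetical convention)."""

    def __init__(self, name, action, deps=()):
        self.name = name
        self.action = action    # callable that does the actual work
        self.deps = list(deps)  # tasks that must complete first

    def marker(self, state_dir):
        return os.path.join(state_dir, self.name + ".done")

    def is_complete(self, state_dir):
        return os.path.exists(self.marker(state_dir))


def run(task, state_dir, executed=None):
    """Run `task` after its dependencies, skipping anything already complete.

    Returns the names of the tasks actually executed, in order.
    """
    if executed is None:
        executed = []
    if task.is_complete(state_dir):
        return executed
    for dep in task.deps:
        run(dep, state_dir, executed)
    task.action()
    # The marker is written only after the action succeeded, so a failed
    # task is retried on the next run while its finished precursors are not.
    with open(task.marker(state_dir), "w") as f:
        f.write("ok\n")
    executed.append(task.name)
    return executed


def demo():
    """Simulate a failure in the middle task, then restart the whole job."""
    state_dir = tempfile.mkdtemp()
    attempts = {"transform": 0}

    def flaky():
        attempts["transform"] += 1
        if attempts["transform"] == 1:
            raise RuntimeError("simulated failure")

    extract = Task("extract", lambda: None)
    transform = Task("transform", flaky, deps=[extract])
    load = Task("load", lambda: None, deps=[transform])

    try:
        run(load, state_dir)      # first attempt: extract succeeds, transform fails
    except RuntimeError:
        pass
    return run(load, state_dir)   # restart: extract is skipped
```

Calling `demo()` restarts the job after the simulated failure; the second run executes only `transform` and `load`, because `extract` already has its marker. Real workflow managers add queuing, parallel workers, scheduling and monitoring on top of this basic bookkeeping.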
Here is a suggestion for a tool shopping list:

https://github.com/spotify/luigi
https://azkaban.github.io/
https://github.com/airbnb/airflow
https://github.com/pinterest/pinball
https://github.com/sailthru/stolos

Luigi is probably the least risky choice: it is easy to get started with and battle-tested. I am biased, though.

In batch processing environments, the workflow managers typically run on a small cluster of "edge nodes", which in turn schedule jobs on Hadoop or Spark. One could conceive of scheduling jobs from edge nodes onto both Hadoop/Spark and Mesos; the latter would be appropriate for jobs that fit on a single machine. Hadoop or Spark are often also used for such simpler jobs, at a high cost in hardware and complexity. I have not heard of any such hybrid integrations, however.

If you go down that path, you may want to look at Aurora for Mesos scheduling and resource allocation. Unlike Marathon and Kubernetes, it supports batch jobs. You can build a batch worker farm on Mesos with e.g. Marathon + RabbitMQ, but you would likely reinvent what Aurora does.

I answered a related question on the Spark mailing list, which may provide some useful additional information:
https://www.mail-archive.com/[email protected]/msg34417.html

Regards,

Lars Albertsson

On Wed, Oct 7, 2015 at 9:56 AM, Brian Candler <[email protected]> wrote:
> Are there any open-source job queue/batch systems which run under Mesos? I
> am thinking of things like HTCondor, Torque etc.
>
> The requirement is to be able to:
> - define an overall job as a set of sub-tasks (could be many thousands)
> - put sub-tasks into a queue; execute tasks from the queue
> - dependencies: don't add a sub-task into the queue until its precursors
> have completed successfully
> - restart: after an error, be able to restart the job but skipping those
> sub-tasks which completed successfully
> - preferably handle short-lived tasks efficiently (of order of 10 seconds
> duration)
>
> Clearly it's possible to write a framework to do this, but I don't want to
> re-invent the wheel if it has been done already.
>
> Thanks,
>
> Brian.
>
> P.S. I found Chronos, but it doesn't seem a good match. As far as I can see,
> it's intended for applications where you pre-define a bunch of tasks (via
> GUI? via REST?) and then trigger them periodically.

