@Tim Chen Thank you for your comment! We use Stolos to run about 200k jobs a day, where some jobs are themselves Mesos frameworks (i.e. Spark jobs). So far the tool seems to scale because it offloads all scaling problems to established tools. Queue and job state is stored in a user-chosen queue backend, configuration is stored in a user-chosen configuration backend, and jobs themselves can be huge, parallelized jobs (Hadoop or Spark jobs) or small one-off reports.
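To make the "offload scaling to established tools" point above a bit more concrete, here is a minimal Python sketch. It is not the Stolos API: `QueueBackend`, `InMemoryBackend`, `run_next`, and the state names are hypothetical, made up for illustration. The idea it tries to show is that the wrapper keeps no state of its own, so swapping the backend for a real distributed service would change how well it scales without changing the wrapper.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Callable, Dict, Optional


class QueueBackend(ABC):
    """Hypothetical interface for an external store of queue and job state."""

    @abstractmethod
    def enqueue(self, job_id: str) -> None: ...

    @abstractmethod
    def dequeue(self) -> Optional[str]: ...

    @abstractmethod
    def set_state(self, job_id: str, state: str) -> None: ...

    @abstractmethod
    def get_state(self, job_id: str) -> Optional[str]: ...


class InMemoryBackend(QueueBackend):
    """Stand-in for a real service; only suitable for local experiments."""

    def __init__(self) -> None:
        self._queue: deque = deque()
        self._state: Dict[str, str] = {}

    def enqueue(self, job_id: str) -> None:
        # Skip jobs that are already tracked so a job is never queued twice.
        if job_id not in self._state:
            self._queue.append(job_id)
            self._state[job_id] = "pending"

    def dequeue(self) -> Optional[str]:
        return self._queue.popleft() if self._queue else None

    def set_state(self, job_id: str, state: str) -> None:
        self._state[job_id] = state

    def get_state(self, job_id: str) -> Optional[str]:
        return self._state.get(job_id)


def run_next(backend: QueueBackend, script: Callable[[str], None]) -> Optional[str]:
    """Run one queued job; all bookkeeping lives in the backend."""
    job_id = backend.dequeue()
    if job_id is None:
        return None
    backend.set_state(job_id, "running")
    try:
        script(job_id)
    except Exception:
        backend.set_state(job_id, "failed")
    else:
        backend.set_state(job_id, "completed")
    return job_id
```

Usage would look like `backend = InMemoryBackend(); backend.enqueue("report_20150513"); run_next(backend, my_script)`, where `my_script` is whatever callable launches the actual job.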
@Tim St. Clair Pegasus and Makeflow look really neat! Are there any plans to integrate them with Mesos? I am very excited to check these out further, as Pegasus's documentation on job clustering <http://pegasus.isi.edu/wms/docs/latest/job_clustering.php> looks particularly interesting! There are some very interesting algorithms in this space.

To your question about integrating with batch queuing systems, I'm not sure to what extent Stolos integrates with, competes with, or is compatible with batch queuing systems like Condor, Quincy, etc. Stolos solves only one problem: as a wrapper around scripts, it chooses whether to run the wrapped script or queue another wrapped script. I wrote and still actively develop Stolos because I haven't found anything else that works with Mesos and meets our needs. That said, it's totally worth looking at your suggestions more deeply!

@Tom I would definitely agree with you that there is a "gap" in the Mesos world for a really good DAG scheduler. Chronos sort of gets there, but when it comes down to it, I don't currently believe a tool should combine job dependency management (which gets really complicated) with distributed crontab (which is a different set of problems).

Alex

On Wed, May 13, 2015 at 6:50 PM Sharma Podila <[email protected]> wrote:

>> I keep longing for folks with decades of experience in HTC&HPC to chime in "on-list".
>
> FWIW, I come from that background, but am not in that space at this time. My prior life was in developing a (not open source) distributed job scheduler and management system for batch and interactive jobs that handled dependencies, deadlines, preemptions, advance reservation of resources, etc. with multi-level priority and share tree hierarchy based allocation. Typically, dependencies and deadlines are handled outside of schedulers and fed into schedulers as task submission after dependencies have been met.
> We found it more optimal to have the scheduler resolve dependencies and deadlines inherently. This way, a high priority job dependent on another low priority job can induce higher priority on that dependent job. Similarly, a job with a deadline depending on another job's completion can induce an earlier launch of the latter job in order to meet its deadline. Also, a dependent job can reserve its resources in advance, knowing the expected completion time of its dependent jobs. This was important because in that environment we always had more jobs to run than can run on available resources. It wasn't unusual to have 10s of 1000s of jobs waiting in queue to run during the day.
>
> Not sure if this helps the original question in this thread in any way. But, I am glad to share my learning, if that helps.
>
> Sharma
>
> On Wed, May 13, 2015 at 1:12 PM, Tim St Clair <[email protected]> wrote:
>
>> Hi Alex,
>>
>> Have you by chance integrated with any of the traditional batch DAG systems?
>>
>> http://pegasus.isi.edu/ , http://ccl.cse.nd.edu/software/makeflow/
>>
>> I keep longing for folks with decades of experience in HTC&HPC to chime in "on-list".
>>
>> Subtle nudge ;-)
>> Tim
>>
>> ------------------------------
>>
>> *From: *"Alex Gaudio" <[email protected]>
>> *To: *[email protected]
>> *Sent: *Wednesday, May 13, 2015 3:04:20 PM
>> *Subject: *Re: Batch Scheduler with dependency support
>>
>> Hi Tim (and everyone else!),
>>
>> I am the primary author of Stolos. We use Stolos to run all of our batch jobs on Mesos. The batch jobs are scripts we can run from the command-line. Scripts range from bash scripts to Spark jobs and R scripts.
>>
>> It's a great tool for us because, unlike Chronos, it lets us define a script as a stage in a dependency chain, where the script can run with different parameters for different dependency contexts.
>> (The closest usage of this would be to have many Chronos servers, though this does not work in all cases).
>>
>> The tool is a critical component of Sailthru's data science infrastructure, but I believe we are the only people who use the tool right now.
>>
>> If you are interested in learning more, I'm happy to invest time to talk more about Stolos, what it does and how we use it!
>>
>> Alex
>>
>> On Wed, May 13, 2015 at 2:02 PM Tim Chen <[email protected]> wrote:
>>
>>> How are you running your batch jobs? Is the batch job script/executable an in-house app?
>>>
>>> Tim
>>>
>>> On Wed, May 13, 2015 at 9:46 AM, Andras Kerekes <[email protected]> wrote:
>>>
>>>> You might want to have a look at stolos too:
>>>>
>>>> https://github.com/sailthru/stolos
>>>>
>>>> Andras
>>>>
>>>> *From:* Aaron Carey [mailto:[email protected]]
>>>> *Sent:* Wednesday, May 13, 2015 11:54 AM
>>>> *To:* [email protected]
>>>> *Subject:* RE: Batch Scheduler with dependency support
>>>>
>>>> Thanks! I hadn't come across that one before :)
>>>> ------------------------------
>>>>
>>>> *From:* [email protected] [[email protected]] on behalf of Jeff Schroeder [[email protected]]
>>>> *Sent:* 13 May 2015 16:39
>>>> *To:* [email protected]
>>>> *Subject:* Re: Batch Scheduler with dependency support
>>>>
>>>> Lookup Hubspot's Singularity
>>>>
>>>> On Wednesday, May 13, 2015, Aaron Carey <[email protected]> wrote:
>>>>
>>>> Thanks Jeff,
>>>>
>>>> Any other options around as well?
>>>> ------------------------------
>>>>
>>>> *From:* [email protected] [[email protected]] on behalf of Jeff Schroeder [[email protected]]
>>>> *Sent:* 13 May 2015 14:12
>>>> *To:* [email protected]
>>>> *Subject:* Batch Scheduler with dependency support
>>>>
>>>> It does both just as well, along with cron-like functionality. It is harder to install and takes a bit more understanding, however. The official tutorial is a process that loops 100 times and then exits.
>>>>
>>>> http://aurora.apache.org/documentation/latest/tutorial/#the-script
>>>>
>>>> Aurora is pretty much a superset of most other generic frameworks sans maybe hubspot's singularity.
>>>>
>>>> On Wednesday, May 13, 2015, Aaron Carey <[email protected]> wrote:
>>>>
>>>> I was under the impression Aurora was for long running services? Is it suitable for scheduling one-off batch processes too?
>>>>
>>>> thanks,
>>>> Aaron
>>>> ------------------------------
>>>>
>>>> *From:* [email protected] [[email protected]] on behalf of Jeff Schroeder [[email protected]]
>>>> *Sent:* 13 May 2015 13:12
>>>> *To:* [email protected]
>>>> *Subject:* Re: Batch Scheduler with dependency support
>>>>
>>>> Apache Aurora does this and you can be explicit about the ordering
>>>>
>>>> On Wednesday, May 13, 2015, Aaron Carey <[email protected]> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I was just wondering if anyone out there knew of a good Mesos batch scheduler which supports dependencies between tasks? (i.e. Task B cannot run until Task A is complete)
>>>>
>>>> Thanks,
>>>> Aaron
>>>>
>>>> --
>>>> Text by Jeff, typos by iPhone
>>
>> --
>> Cheers,
>> Timothy St. Clair
>> Red Hat Inc.

