Hi everyone,

I've been working on a simple programming language for building large data pipelines on Mesos. The language is called BDS, which stands for BigDataScript (yes, the name is kind of a joke for all the jargon-lovers out there), and here is the web page:
http://pcingola.github.io/BigDataScript/

Needless to say, it's open source and the code is available on GitHub.

At the moment I'm using BDS mostly for analysis of large genetic datasets on our 25,000 core cluster, but it should scale to larger clusters as well.

BDS has a few interesting features:

- Runs on Mesos (obviously) as well as SunGridEngine, Torque, MOAB, a large server or just your laptop.
- You can develop on your laptop (without having to install Mesos or any cluster management system) and then deploy your script to a Mesos cluster/datacenter without modification.
- It works out task dependencies automatically and schedules tasks according to the implicit (or explicit) DAG.
- It has lazy processing: it checks whether performing a task is necessary and skips tasks whose output does not need to be updated (make-style).
- It performs automatic checkpointing and has absolute serialization, so you can copy the checkpoint file to another computer and continue running exactly where you left off.
- It can handle several parallel pipeline branches (threads).
- It lets you define DAGs in a declarative form (using 'goals').
- It cleans up stale files (and queues tasks on non-Mesos clusters).

Other cool features:

- Automatically parses command line options in your scripts (it also creates "help" for you)
- Logs every process's stdout / stderr and exit status
- It has a built-in debugger
- It has a built-in unit testing framework

You can read more about all these features here:
http://pcingola.github.io/BigDataScript/bigDataScript_manual.html

I hope you find it useful, and please do send me any feedback you have.

Yours,
Pablo
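
P.S. For anyone unfamiliar with the "make-style" check mentioned in the lazy processing item, here is a rough sketch of the general idea. It is written in Python, not BDS, and it is only an illustration under simple assumptions (plain files compared by modification time); it is not how BDS implements the check, and the function name `needs_update` is made up for this example.

```python
import os


def needs_update(outputs, inputs):
    """Return True if a task producing `outputs` from `inputs` should run.

    Hypothetical helper illustrating a make-style timestamp check.
    """
    # A task that declares no outputs cannot be skipped safely.
    if not outputs:
        return True
    # Any missing output means the task has to run.
    if not all(os.path.exists(path) for path in outputs):
        return True
    # With no inputs and all outputs present, there is nothing to redo.
    if not inputs:
        return False
    # Run again only if some input is newer than the oldest output.
    newest_input = max(os.path.getmtime(path) for path in inputs)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return oldest_output < newest_input
```

A scheduler built on this idea would call something like `needs_update(["out.txt"], ["in.txt"])` before each task and skip the task whenever the answer is False.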

