CDH5.2.3 on mesos
Hi, I was trying to set up mesos/hadoop with the latest CDH version (MR1), and it seems like the instructions are somewhat out of date. I also tried the suggestions in https://github.com/mesos/hadoop/issues/25, but after 4 hours of flailing around I am still stuck :-/ The configuration/installation instructions don't seem complete, and I am too new to Hadoop to figure out what's missing or going wrong. Does anyone know of a good resource I can use to get going? -- Ankur
Re: CDH5.2.3 on mesos
Hi Ankur, There aren't any getting-started resources other than the documentation there, as far as I know. Could you share your Hadoop configuration and perhaps a description of the problems you're having? Tom. On Tue, Oct 28, 2014 at 8:53 AM, Ankur Chauhan an...@malloc64.com wrote: snip
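For readers hitting the same wall: the mesos/hadoop README boils down to a handful of mapred-site.xml properties. The property names below are my reading of that README, and every value (master address, executor URI, CDH tarball name) is a placeholder, so double-check both against the repo before using them:

```shell
# Write the mesos-specific mapred-site.xml snippet to a scratch file
# so the shape of the config is easy to see (values are placeholders):
cat > /tmp/mapred-site-snippet.xml <<'EOF'
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.MesosScheduler</value>
</property>
<property>
  <name>mapred.mesos.master</name>
  <value>zk://localhost:2181/mesos</value>
</property>
<property>
  <name>mapred.mesos.executor.uri</name>
  <value>hdfs://localhost:9000/hadoop-2.5.0-cdh5.2.0.tar.gz</value>
</property>
EOF
# Sanity-check that all four properties made it in:
grep -c '<property>' /tmp/mapred-site-snippet.xml
```

The JobTracker then schedules TaskTrackers through Mesos instead of relying on statically started ones; the executor URI must point at a Hadoop distribution with the mesos jar on its classpath.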
0.21.0-pre Spark latest
Folks - We have some automated tests that run the latest Mesos against the latest Spark, and we've run across a series of issues, in both fine- and coarse-grained mode, that I believe stem from a series of changes in the 0.21 cycle. I'm not certain if anyone owns this integration, but we should probably ensure they're fixed before we push out 0.21. -- Cheers, Timothy St. Clair Red Hat Inc.
Re: Does Mesos support Hadoop MR V2
Porting YARN to run atop Mesos is quite reasonable. Some folks at eBay have started some work on this (https://github.com/mesos/myriad). If you're interested, you should check it out and contribute to the project. On Tue, Oct 28, 2014 at 5:21 AM, Yaneeve Shekel yaneeve.she...@sizmek.com wrote: To quote John below, “So excuse my naivety… but…”, I am also confused by the version/naming conventions going on at the Hadoop project. I would like to run Hadoop over Mesos as opposed to over YARN. I would also like to use the *“new”* mapreduce packages. https://github.com/mesos/hadoop mentions that “The pom.xml included is configured and tested against CDH5 and MRv1. Hadoop on Mesos does not currently support YARN (and MRv2).” Does this all mean that the mapreduce package is not available? I think it does not; I think I should be able to use the “new” API over any scheduling system, just as I could over plain vanilla CDH (where I could configure and use any combination of the cross product (mapred, mapreduce) X (MRv1, YARN)). Could anyone verify this? Second, has any work been done, as pertains to the original thread, with regard to what John has suggested below? Thanks a lot, Yaneeve On Jul 27, 2014 7:00 PM, John Omernik j...@omernik.com wrote: So excuse my naivety in this space, but my ignorance has never really stopped me from asking questions. I see YARN (Yet Another Resource Negotiator) as very similar to Mesos, i.e. something to manage resources on a cluster of machines. So when I hear talk of running YARN on Mesos it seems very redundant indeed, and I ask myself: what are we actually getting out of this setup? So, going to the map/reduce question, I see MapReduce V1 and MapReduce V2 like this: MapReduce V2 is an application that runs on YARN, i.e. if you run a job, it creates an application master, that application master requests resources, and the job gets run.
It differs from MapReduce V1 in that there is no long-running Job Tracker (other than the YARN Resource Manager, but that is managing resources for all applications, not just MapReduce applications). OK, so Mesos: why can't there be a Mesos application that is similar to a MapReduce V2 application in YARN? Why do we need to run YARN on Mesos? That doesn't really make sense. Basically, for M/R V2 vs. M/R V1, the only difference is that to mimic M/R V1 we need task trackers and job trackers running as Mesos applications (which we have). So in M/R V2, we just need the equivalent of an application master running on YARN, requesting resources across the cluster. Fundamentally, YARN is confusing because I think they coupled running MapReduce jobs with the resource manager and called it Hadoop v2. By coupling the two, people look at YARN as MapReduce V2, but it's not really. It's a way of running jobs on a cluster of machines (a la Mesos) with an application that is the equivalent of MapReduce V1. The names being given seem confusing to me; they make people who have invested in Hadoop (MapReduce V1) very interested in YARN because it's called Hadoop V2, while Mesos is seen as the "other". Just for my sake, I summarized a TL;DR form so that if someone wants to correct my understanding they can:
Mesos = tool to manage resources.
YARN = tool to manage resources; it's also called Hadoop v2.
MapReduce V1 = job trackers/task trackers; it's what we know. It can run on Hadoop clusters, and on Mesos. It's also called Hadoop v1.
MapReduce V2 = application that can run on YARN that mimics MapReduce V1 on a YARN cluster. This + YARN has been called Hadoop v2.
On Sun, Jul 27, 2014 at 4:10 AM, Maxime Brugidou maxime.brugi...@gmail.com wrote: When I said that running YARN over Mesos did not make sense, I meant that running a resource manager inside a resource manager is very sub-optimal.
You will eventually do static allocation of resources for the YARN framework in Mesos, or have complex logic to determine how much resource should be given to YARN. You will also have the same burden of managing 2 different clusters instead of one, even if YARN is sort of hidden as a Mesos framework. However, yes, I believe it's easier to run YARN on Mesos than to run MRv2 on top of Mesos. The solution I was discussing was obviously the ideal one, and I have looked at the MRAppMaster since, and it discouraged me :) On Jul 27, 2014 12:41 AM, Rick Richardson rick.richard...@gmail.com wrote: FWIW I also think the fastest approach here is porting YARN onto Mesos. In a perfect world, writing an implementation layer for the YARN interface on Mesos would certainly be the optimal approach, but looking at the MRv2 code, it is very, very coupled to many YARN modules. If someone wanted to take on the project of
Re: 0.21.0-pre Spark latest
Since we've recently adopted Spark, I'll second Tim's comment. We had an issue with 0.20.1 that was possibly related to Spark[1], so it's important for us to get this stuff fixed in 0.21.0. Tim, can you elaborate on the issues you saw? Have you tested with my recent Spark patches[2][3]? [1]: https://issues.apache.org/jira/browse/MESOS-1973 [2]: https://github.com/apache/spark/pull/2401 [3]: https://github.com/apache/spark/pull/2453 On Tue, Oct 28, 2014 at 7:46 AM, Tim St Clair tstcl...@redhat.com wrote: snip
Re: 0.21.0-pre Spark latest
inline - Original Message - From: Brenden Matthews brenden.matth...@airbedandbreakfast.com To: user@mesos.apache.org Cc: mesos-devel d...@mesos.apache.org, RJ Nowling rnowl...@redhat.com, Erik Erlandson e...@redhat.com Sent: Tuesday, October 28, 2014 9:51:58 AM Subject: Re: 0.21.0-pre Spark latest snip Tim, can you elaborate on the issues you saw? Have you tested with my recent Spark patches[2][3]? We are building against Spark 1.1.0 unpatched:
- Fine-grained mode appears broken.
- Coarse-grained mode appears to work via normal runs, but crashes in the REPL. http://fpaste.org/145782/14506564/
[1]: https://issues.apache.org/jira/browse/MESOS-1973 [2]: https://github.com/apache/spark/pull/2401 [3]: https://github.com/apache/spark/pull/2453 -- Cheers, Timothy St. Clair Red Hat Inc.
Re: 0.21.0-pre Spark latest
Hi Tim, Thanks for doing the integration tests; that's something I wanted to do but never got to. I have a great interest in ensuring Spark and Mesos work together, and I know Brenden does as well. I have been tracking these Spark/Mesos problems in the Spark JIRA, labeling them 'mesos'. Can you create these bugs in JIRA so we can dig more into each one? Also, is this integration test automated? Thanks! Tim On Oct 28, 2014, at 8:03 AM, Tim St Clair tstcl...@redhat.com wrote: snip
Re: 0.21.0-pre Spark latest
Hi RJ, I see; are you or the team working on this problem already? If not, I'd like to take a look as well. Tim On Tue, Oct 28, 2014 at 8:47 AM, RJ Nowling rnowl...@redhat.com wrote: Hi Tim, The integration test is simply to open the Spark shell (Spark 1.1.0) using Mesos 0.21 in coarse-grained mode. We didn't even have to run any commands. RJ - Original Message - From: Timothy Chen tnac...@gmail.com To: d...@mesos.apache.org Cc: user@mesos.apache.org, RJ Nowling rnowl...@redhat.com, Erik Erlandson e...@redhat.com Sent: Tuesday, October 28, 2014 11:40:19 AM Subject: Re: 0.21.0-pre Spark latest snip
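For anyone wanting to try RJ's reproduction themselves, the setup is roughly the following. The master address and executor URI are placeholders, and `spark.mesos.coarse` is my understanding of the Spark 1.1 property name for coarse-grained mode; verify both against your deployment:

```shell
# Properties file enabling Mesos coarse-grained mode (values are placeholders):
cat > /tmp/spark-mesos.conf <<'EOF'
spark.mesos.coarse   true
spark.executor.uri   hdfs://namenode/spark-1.1.0-bin.tgz
EOF
# The actual reproduction step -- open the shell, no commands needed
# (commented out here since it requires a running Mesos 0.21 cluster):
# MASTER=mesos://zk://zkhost:2181/mesos \
#   ./bin/spark-shell --properties-file /tmp/spark-mesos.conf
grep 'spark.mesos.coarse' /tmp/spark-mesos.conf
```

In coarse-grained mode Spark holds one long-lived Mesos executor per slave, so a crash on mere shell startup points at registration/executor launch rather than task scheduling.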
Re: Multiple schedulers on one machine?
Sorry for the delay in response. It looks like the master received the registration request from the scheduler at 20:13, at which point it sent a registration ack back to the scheduler. What is not clear is why no registration requests were sent/received from 19:47, when the framework apparently started, until 20:13. You can add GLOG_v=1 to the scheduler's environment to print more verbose logging from the driver; it will tell us what the driver has been doing. Are you sure you didn't have any network partitions around that time? On Fri, Oct 10, 2014 at 1:30 PM, Colleen Lee c...@coursera.org wrote: 2014-09-19 19:46:49,361 INFO [JobManager] Running job 1uZArT-yEeS7gCIACpcfeA snip 2014-09-19 20:13:48,134 INFO [JobScheduler] Job 1uZArT-yEeS7gCIACpcfeA: Registered as 20140818-235718-3165886730-5050-901-1507 to master '20140818-235718-3165886730-5050-901' The snipped code is for unrelated internals of our client. Going back to the implementation: we output the 'Running job ...' log line immediately before calling driver.run(), and our implementation of the registered() method in the scheduler simply prints the second log line above. During this time, from the Mesos master logs, the master continues to function as normal, sending offers to (other) frameworks, processing the replies, adding/launching tasks, completing/removing tasks, unregistering/removing frameworks, etc.
Here are the log lines that may be suspicious during that window: W0919 19:47:00.258894 938 master.cpp:2718] Ignoring unknown exited executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@ 10.101.195.45:5051 (ip-10-101-195-45.ec2.internal) W0919 19:47:00.260349 939 master.cpp:2718] Ignoring unknown exited executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@ 10.101.195.45:5051 (ip-10-101-195-45.ec2.internal) I0919 20:07:02.690067 940 master.cpp:1041] Received registration request from scheduler(316)@10.151.31.120:36446 I0919 20:07:02.690192 940 master.cpp:1059] Registering framework 20140818-235718-3165886730-5050-901-1502 at scheduler(316)@ 10.151.31.120:36446 The log line for framework in master is for a different framework (20140818-235718-3165886730-5050-901-1502) than the problematic job/framework (20140818-235718-3165886730-5050-901-1507). Can you grep for the log lines corresponding to the problematic framework (grep for ip:port) in the master logs? That should tell us what's happening. Good catch, here are the relevant lines from the master logs (I grepped for the framework id -- how is the port determined?). 
I0919 20:13:48.122954 941 master.cpp:1041] Received registration request from scheduler(13)@10.45.199.181:43573 I0919 20:13:48.123080 941 master.cpp:1059] Registering framework 20140818-235718-3165886730-5050-901-1507 at scheduler(13)@ 10.45.199.181:43573 I0919 20:13:48.123914 941 hierarchical_allocator_process.hpp:331] Added framework 20140818-235718-3165886730-5050-901-1507 I0919 20:13:48.124739 941 master.cpp:2933] Sending 2 offers to framework 20140818-235718-3165886730-5050-901-1507 I0919 20:13:48.129932 940 master.cpp:1889] Processing reply for offers: [ 20140818-235718-3165886730-5050-901-486420 ] on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal) for framework 20140818-235718-3165886730-5050-901-1507 I0919 20:13:48.130028 940 master.hpp:655] Adding task 1uZArT-yEeS7gCIACpcfeA with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-5 (ip-10-101-195-45.ec2.internal) I0919 20:13:48.130630 940 master.cpp:3111] Launching task 1uZArT-yEeS7gCIACpcfeA of framework 20140818-235718-3165886730-5050-901-1507 with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@ 10.101.195.45:5051 (ip-10-101-195-45.ec2.internal) I0919 20:13:48.131531 940 master.cpp:1889] Processing reply for offers: [ 20140818-235718-3165886730-5050-901-486421 ] on slave 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 (ip-10-51-165-231.ec2.internal) for framework 20140818-235718-3165886730-5050-901-1507 I0919 20:13:48.131589 937 hierarchical_allocator_process.hpp:589] Framework 20140818-235718-3165886730-5050-901-1507 filtered slave 20140818-235718-3165886730-5050-901-5 for 5secs I0919 20:13:48.133910 937 hierarchical_allocator_process.hpp:589] Framework 20140818-235718-3165886730-5050-901-1507 filtered slave 20140818-235718-3165886730-5050-901-4 for 5secs I0919 20:13:49.833616 940 master.cpp:2628] Status update TASK_RUNNING (UUID: 
ea495e2b-2d19-4206-ae08-9daad311a525) for task 1uZArT-yEeS7gCIACpcfeA of framework 20140818-235718-3165886730-5050-901-1507 from slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal) I0919 20:13:53.184873 941 master.cpp:2933] Sending 2 offers to framework
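On the two practical questions in this thread: the `scheduler(13)@10.45.199.181:43573` address is the driver's libprocess endpoint, which binds an ephemeral port by default (the LIBPROCESS_PORT / LIBPROCESS_IP environment variables can pin it). A sketch of both suggestions, with the sample log lines taken from this thread:

```shell
# Verbose driver logging, set in the scheduler's environment before launch:
export GLOG_v=1
export GLOG_logtostderr=1
# LIBPROCESS_PORT=9090 would pin the scheduler port for stable grep targets.

# Grepping the master log for one framework's scheduler address:
cat > /tmp/master.log <<'EOF'
I0919 20:07:02.690067 940 master.cpp:1041] Received registration request from scheduler(316)@10.151.31.120:36446
I0919 20:13:48.122954 941 master.cpp:1041] Received registration request from scheduler(13)@10.45.199.181:43573
EOF
grep '10.45.199.181:43573' /tmp/master.log
```

Grepping by ip:port rather than framework id also catches the registration attempts made *before* the master assigned an id, which is exactly the 19:47-20:13 window in question.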
Re: Running mesos-slave in Docker
Hi Tim, Yes, you are right: I was running the latest automated build from the master branch; with v0.7.1 everything works fine. I am running Marathon, the Mesos master and slave, and ZooKeeper in containers on CoreOS. My plan is to create the needed unit files to run it on CoreOS and to write a simple HTTP endpoint (probably written in Node) to use Hipache to route the requests (as I will have to run a lot of containers). Are you interested in putting this in the main repo, or should I create a dedicated side project? Thanks again, Alessandro On 28 Oct 2014, at 22:19, Tim Chen t...@mesosphere.io wrote: Hi Alessandro, I think Mesos is running your task fine, but Marathon is killing your task. Are you launching Marathon through a Docker container as well? And what version of Marathon are you using? Tim On Tue, Oct 28, 2014 at 2:07 PM, Alessandro Siragusa alessandro.sirag...@gmail.com wrote: Hi guys, I still have a problem running mesos-slave in a Docker container. It continuously kills and starts the containers on all three slave nodes. In the Marathon UI I can see multiple instances at the same time on all the nodes.
I1028 20:43:19.572377 8 slave.cpp:1002] Got assigned task ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c for framework 20141027-004948-352326060-38124-1- I1028 20:43:19.572691 8 slave.cpp:1112] Launching task ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c for framework 20141027-004948-352326060-38124-1- I1028 20:43:19.573457 8 slave.cpp:1222] Queuing task 'ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c' for executor ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c of framework '20141027-004948-352326060-38124-1- I1028 20:43:19.575451 13 docker.cpp:743] Starting container '7f23db8e-9fb5-4e20-9f06-eb4caf361d86' for task 'ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c' (and executor 'ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c') of framework '20141027-004948-352326060-38124-1-' I1028 20:43:20.936192 8 slave.cpp:2538] Monitoring executor 'ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c' of framework '20141027-004948-352326060-38124-1-' in container '7f23db8e-9fb5-4e20-9f06-eb4caf361d86' I1028 20:43:20.947391 13 slave.cpp:1733] Got registration for executor 'ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c' of framework 20141027-004948-352326060-38124-1- from executor(1)@176.31.235.180:42593 I1028 20:43:20.947986 13 slave.cpp:1853] Flushing queued task ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c for executor 'ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c' of framework 20141027-004948-352326060-38124-1- I1028 20:43:20.949553 9 slave.cpp:2088] Handling status update TASK_RUNNING (UUID: ebb06849-c0ed-470e-95f0-3c652f6a2eee) for task ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c of framework 20141027-004948-352326060-38124-1- from executor(1)@176.31.235.180:42593 I1028 20:43:20.949733 6 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: ebb06849-c0ed-470e-95f0-3c652f6a2eee) for task ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c of framework 20141027-004948-352326060-38124-1- I1028 20:43:20.949831 6 status_update_manager.cpp:373]
Forwarding status update TASK_RUNNING (UUID: ebb06849-c0ed-470e-95f0-3c652f6a2eee) for task ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c of framework 20141027-004948-352326060-38124-1- to master@176.31.225.8:5050 I1028 20:43:20.949935 6 slave.cpp:2252] Sending acknowledgement for status update TASK_RUNNING (UUID: ebb06849-c0ed-470e-95f0-3c652f6a2eee) for task ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c of framework 20141027-004948-352326060-38124-1- to executor(1)@176.31.235.180:42593 I1028 20:43:20.955905 10 status_update_manager.cpp:398] Received status update acknowledgement (UUID: ebb06849-c0ed-470e-95f0-3c652f6a2eee) for task ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c of framework 20141027-004948-352326060-38124-1- I1028 20:43:21.938161 9 docker.cpp:1286] Updated 'cpu.shares' to 614 at /sys/fs/cgroup/cpu,cpuacct/system.slice/docker-e84f6042adeb23101e6a147f6f5ed1a01f748b432d7e33c8c8d1e9c091487095.scope for container 7f23db8e-9fb5-4e20-9f06-eb4caf361d86 I1028 20:43:21.938460 9 docker.cpp:1321] Updated 'memory.soft_limit_in_bytes' to 544MB for container 7f23db8e-9fb5-4e20-9f06-eb4caf361d86 I1028 20:43:21.938865 9 docker.cpp:1347] Updated 'memory.limit_in_bytes' to 544MB at /sys/fs/cgroup/memory/system.slice/docker-e84f6042adeb23101e6a147f6f5ed1a01f748b432d7e33c8c8d1e9c091487095.scope for container 7f23db8e-9fb5-4e20-9f06-eb4caf361d86 I1028 20:43:25.571907 10 slave.cpp:1002] Got assigned task ubuntu.73fb8c9e-5ee3-11e4-8b6f-42c8cb288d5c for framework 20141027-004948-352326060-38124-1-
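Alessandro's planned CoreOS unit files might look roughly like the sketch below. This is entirely hypothetical: the image name, Mesos flags, and ZooKeeper address are placeholders, and the host-mounted cgroup and Docker-socket volumes reflect the usual requirements for running mesos-slave with the Docker containerizer inside a container; adjust to your setup:

```ini
# /etc/systemd/system/mesos-slave.service -- hypothetical sketch
[Unit]
Description=Mesos slave in a Docker container
After=docker.service
Requires=docker.service

[Service]
Restart=always
ExecStartPre=-/usr/bin/docker rm -f mesos-slave
ExecStart=/usr/bin/docker run --name mesos-slave --net=host --privileged \
  -v /sys/fs/cgroup:/sys/fs/cgroup \
  -v /var/run/docker.sock:/var/run/docker.sock \
  example/mesos-slave:0.20.1 \
  --master=zk://zkhost:2181/mesos --containerizers=docker,mesos
ExecStop=/usr/bin/docker stop mesos-slave

[Install]
WantedBy=multi-user.target
```

Mounting the host's Docker socket means task containers become siblings of the slave container rather than children, which avoids losing them when the slave container restarts.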
MPI on Mesos
Hi, I am having a couple of issues trying to run MPI over Mesos. I am using Mesos 0.20.0 on Ubuntu 12.04, with MPICH2. - I was able to successfully (?) run a hello-world MPI program, but the task still appears as lost in the GUI. Here is the output from the MPI execution: We've launched all our MPDs; waiting for them to come up Got 1 mpd(s), running mpiexec Running mpiexec *** Hello world from processor euca-10-2-235-206, rank 0 out of 1 processors *** mpiexec completed, calling mpdallexit euca-10-2-248-74_57995 Task 0 in state 5 A task finished unexpectedly, calling mpdexit on euca-10-2-248-74_57995 mpdroot: perror msg: No such file or directory mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root probable cause: no mpd daemon on this machine possible cause: unix socket /tmp/mpd2.console_root has been removed mpdexit (__init__ 1208): forked process failed; status=255 I1028 22:15:04.774554 4859 sched.cpp:747] Stopping framework '20141028-203440-1257767434-5050-3638-0006' 2014-10-28 22:15:04,795:4819(0x7fd7b1422700):ZOO_INFO@zookeeper_close@2505: Closing zookeeper sessionId=0x14959388d4e0020 And also in the *executor stdout* I get: sh -c 'mpd --noconsole --ncpus=1 --host=euca-10-2-248-74 --port=39237' Command exited with status 127 → command not found and on *stderr*: sh: 1 mpd: not found I am assuming the messages in the executor's log files appear because, after mpiexec completes, the task is finished and the mpd ring is no longer running - so it complains about not finding the mpd command, which normally works fine. - Another thing I would like to ask has to do with the procedure to follow for running MPI on Mesos. So far, using Spark and Hadoop on Mesos, I was used to having an executor shared on HDFS, and there was no need to distribute the code to the slaves. With MPI I had to distribute the hello-world executable to the slaves, because having it on HDFS didn't work.
Moreover, I was expecting that the mpd ring would be started by Mesos (in the same way that the Hadoop JobTracker is started by Mesos in the Hadoop-on-Mesos implementations). As it is, I have to first run mpdboot before being able to run MPI on Mesos. Is the above procedure what I should follow, or am I missing something? - Finally, in order to make MPI work I had to install mesos.interface with pip and manually copy the native directory from python/dist-packages (native doesn't exist in the pip repo). And then I realized there is the mpiexec-mesos.in file that does all that - I can update the README to be a little clearer if you want - I am guessing someone else might also get confused by this. thanks, Stratos
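One concrete takeaway from the executor logs above: exit status 127 with "sh: 1 mpd: not found" means the slave's shell could not find MPICH2's mpd on its PATH at launch time, independent of the later mpdallexit noise. A tiny preflight check along these lines (a hypothetical helper, not part of the framework) run on each slave before mpdboot can catch this early:

```shell
# Report whether MPICH2's mpd is resolvable from the launching shell,
# mirroring how the Mesos executor invokes it via `sh -c 'mpd ...'`.
check_mpd() {
  if command -v mpd >/dev/null 2>&1; then
    echo "mpd found at $(command -v mpd)"
  else
    echo "mpd missing: install MPICH2 or extend PATH on this slave"
  fi
}
check_mpd
```

If mpd lives in a non-default location, exporting the right PATH in the slave's environment (so it is inherited by the executor's `sh -c`) is usually enough.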