Hi Stratos,

Were you using mesos-hydra? https://github.com/mesosphere/mesos-hydra
That should distribute the binaries to the slaves for you. Try it out and
let us know if things go better/worse that way.
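For context, "distributing the binaries" is the step you describe doing by
hand further down in your mail; done manually it looks roughly like the loop
below (the hostnames are just the ones from your log and the target path is
an assumption), and mesos-hydra is meant to do that copy for you when it
launches the job:

    # Manual distribution of the MPI executable to each slave
    # (example hostnames from the log; target path is assumed):
    for host in euca-10-2-235-206 euca-10-2-248-74; do
        scp ./helloworld "$host":/tmp/helloworld
    done
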
Thanks,
-Adam-

On Tue, Oct 28, 2014 at 10:50 PM, Stratos Dimopoulos <
[email protected]> wrote:

> Hi,
>
> I am having a couple of issues trying to run MPI over Mesos. I am using
> Mesos 0.20.0 on Ubuntu 12.04 and MPICH2.
>
> - I was able to successfully (?) run a helloworld MPI program, but the
> task still appears as lost in the GUI. Here is the output from the mpi
> execution:
>
>> We've launched all our MPDs; waiting for them to come up
> Got 1 mpd(s), running mpiexec
> Running mpiexec
>
> *** Hello world from processor euca-10-2-235-206, rank 0 out of 1
> processors ***
>
> mpiexec completed, calling mpdallexit euca-10-2-248-74_57995
> Task 0 in state 5
> A task finished unexpectedly, calling mpdexit on euca-10-2-248-74_57995
> mpdroot: perror msg: No such file or directory
> mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
> probable cause: no mpd daemon on this machine
> possible cause: unix socket /tmp/mpd2.console_root has been removed
> mpdexit (__init__ 1208): forked process failed; status=255
> I1028 22:15:04.774554 4859 sched.cpp:747] Stopping framework
> '20141028-203440-1257767434-5050-3638-0006'
> 2014-10-28 22:15:04,795:4819(0x7fd7b1422700):ZOO_INFO@zookeeper_close@2505:
> Closing zookeeper sessionId=0x14959388d4e0020
>
> And also in *executor stdout* I get:
> sh -c 'mpd --noconsole --ncpus=1 --host=euca-10-2-248-74 --port=39237'
> Command exited with status 127 → command not found
>
> and on *stderr*:
> sh: 1 mpd: not found
>
> I am assuming the messages in the executor's log files appear because,
> after mpiexec completes, the task is finished and the mpd ring is no
> longer running - so it complains about not finding the mpd command, which
> normally works fine.
>
> - Another thing I would like to ask is about the procedure to follow for
> running MPI on Mesos. So far, using Spark and Hadoop on Mesos, I was used
> to having the executor shared on HDFS, with no need to distribute the code
> to the slaves. With MPI I had to distribute the helloworld executable to
> the slaves, because having it on HDFS didn't work. Moreover, I was
> expecting that the mpd ring would be started from Mesos (in the same way
> that the Hadoop JobTracker is started from Mesos in the Hadoop-on-Mesos
> implementations). As it is, I have to run mpdboot first before I can run
> MPI on Mesos. Is this the procedure I should follow, or am I missing
> something?
>
> - Finally, in order to make MPI work I had to install mesos.interface
> with pip and manually copy the native directory from python/dist-packages
> (native doesn't exist in the pip repo). Then I realized there is the
> mpiexec-mesos.in file that does all that - I can update the README to make
> this a little clearer if you want - I am guessing someone else might also
> get confused by this.
>
> thanks,
> Stratos
>
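On the mpd ring question in the quoted mail: with MPICH2's MPD process
manager the ring does have to be up before mpiexec can place processes, and
the framework's executors appear to be starting "mpd --noconsole ..." on the
slaves for you (that is the command visible in your executor stdout). If you
want to sanity-check the MPD layer completely outside of Mesos first, the
usual by-hand sequence is roughly the sketch below, assuming an mpd.hosts
file that lists one slave hostname per line:

    # Bring up an MPD ring by hand and run the program across it
    mpdboot -n 2 -f mpd.hosts    # start mpd on this host plus the listed slaves
    mpdtrace                     # list the hosts that actually joined the ring
    mpiexec -n 2 ./helloworld    # run the MPI job on the ring
    mpdallexit                   # tear the ring down when done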
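On the executor errors: exit status 127 from "sh -c 'mpd ...'" means the
shell could not find the mpd binary at all, which matches the "mpd: not
found" line on stderr. So on that slave the mpd was never started, and that
may also be why the task shows up as lost in the GUI rather than its just
being shutdown noise. It is worth checking on every slave, as the user the
Mesos executor runs as, that MPICH2 is installed and on the PATH:

    # Quick per-slave check (run as the executor's user):
    which mpd || echo "mpd not on PATH for $(whoami) on $(hostname)"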
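On the Python bindings: as you say, mesos.interface is on PyPI but the
native bindings are not, so they have to come from a system install of
Mesos. Copying them next to the pip-installed package looks roughly like
this (the source and destination paths are assumptions for Ubuntu 12.04 --
adjust to wherever the packages actually live on your machines):

    pip install mesos.interface
    # The native package is not on PyPI; copy it from the system install
    # (assumed paths -- adjust to your layout):
    cp -r /usr/lib/python2.7/dist-packages/mesos/native \
          /usr/local/lib/python2.7/dist-packages/mesos/

Or just use the mpiexec-mesos script built from mpiexec-mesos.in, which, as
you found, is meant to take care of this already.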

