Hi Stratos,

Were you using mesos-hydra? https://github.com/mesosphere/mesos-hydra
That should distribute the binaries to the slaves for you.
Try it out and let us know if things go better/worse that way.
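
Untested sketch of what a launch through mesos-hydra might look like (the
script name and flags below are my guesses, so check the repo's README for
the real invocation):

# hydra ships the binary to the slaves for you, no manual scp needed
./mrun -m <mesos-master> -n 4 ./helloworld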

Thanks,
-Adam-

On Tue, Oct 28, 2014 at 10:50 PM, Stratos Dimopoulos <
[email protected]> wrote:

> Hi,
>
> I am having a couple of issues trying to run MPI over Mesos. I am using
> Mesos 0.20.0 on Ubuntu 12.04 and MPICH2.
>
> - I was able to successfully (?) run a helloworld MPI program, but the
> task still appears as LOST in the GUI. Here is the output from the mpi
> execution:
>
> >> We've launched all our MPDs; waiting for them to come up
> Got 1 mpd(s), running mpiexec
> Running mpiexec
>
>
>  *** Hello world from processor euca-10-2-235-206, rank 0 out of 1
> processors ***
>
> mpiexec completed, calling mpdallexit euca-10-2-248-74_57995
> Task 0 in state 5
> A task finished unexpectedly, calling mpdexit on euca-10-2-248-74_57995
> mpdroot: perror msg: No such file or directory
> mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
>     probable cause:  no mpd daemon on this machine
>     possible cause:  unix socket /tmp/mpd2.console_root has been removed
> mpdexit (__init__ 1208): forked process failed; status=255
> I1028 22:15:04.774554  4859 sched.cpp:747] Stopping framework
> '20141028-203440-1257767434-5050-3638-0006'
> 2014-10-28 22:15:04,795:4819(0x7fd7b1422700):ZOO_INFO@zookeeper_close@2505:
> Closing zookeeper sessionId=0x14959388d4e0020
>
>
> And also in *executor stdout* I get:
> sh -c 'mpd --noconsole --ncpus=1 --host=euca-10-2-248-74 --port=39237'
> Command exited with status 127 (i.e. command not found)
>
> and on *stderr*:
> sh: 1 mpd: not found
>
> I am assuming the messages in the executor's log files appear because,
> after mpiexec completes, the task is finished and the mpd ring is no
> longer running - so it complains about not finding the mpd command, which
> normally works fine.
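>
> In other words, per task the framework seems to drive roughly this
> sequence (reconstructed from the logs above, so approximate):
>
> # on the selected slave: start an mpd daemon for the ring
> mpd --noconsole --ncpus=1 --host=euca-10-2-248-74 --port=39237 &
> # once the ring is up, run the actual MPI program
> mpiexec -n 1 ./helloworld
> # after mpiexec returns, tear the ring down
> mpdallexit
> # any later mpd* call (like the mpdexit above) then fails with
> # "no mpd daemon on this machine"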
>
>
> - Another thing I would like to ask about is the procedure to follow for
> running MPI on Mesos. So far, using Spark and Hadoop on Mesos, I was used
> to having an executor shared on HDFS, so there was no need to distribute
> the code to the slaves. With MPI I had to distribute the helloworld
> executable to the slaves, because having it on HDFS didn't work.
> Moreover, I was expecting the mpd ring to be started by Mesos (in the
> same way that the Hadoop JobTracker is started by Mesos in the
> Hadoop-on-Mesos implementations). Now I have to run mpdboot first, before
> I can run MPI on Mesos. Is this the procedure I should follow, or am I
> missing something?
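>
> For reference, the manual sequence I am following now looks roughly like
> this (hostnames and paths are just placeholders):
>
> # 1. copy the MPI binary to every slave by hand, since HDFS didn't work
> scp helloworld euca-10-2-248-74:~/
> scp helloworld euca-10-2-235-206:~/
> # 2. bring up the mpd ring myself before touching Mesos
> mpdboot -n 2 -f mpd.hosts
> mpdtrace   # verify the ring is up
> # 3. only then submit through the Mesos MPI framework script
> ./mpiexec-mesos <mesos-master> ./helloworld   # arguments as in its README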
>
> - Finally, in order to make MPI work I had to install mesos.interface
> with pip and manually copy the native directory from python/dist-packages
> (native doesn't exist in the pip repo). And then I realized there is the
> mpiexec-mesos.in file that does all of that - I can update the README to
> be a little clearer if you want - I am guessing someone else might also
> get confused by this.
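>
> Roughly the workaround I ended up with before finding mpiexec-mesos.in
> (exact paths are from memory, so approximate):
>
> # mesos.interface is on pip, but mesos.native is not
> pip install mesos.interface
> # so the native package has to be copied from a host that has the
> # built/installed Mesos Python bindings
> cp -r /usr/lib/python2.7/dist-packages/mesos/native \
>       /usr/local/lib/python2.7/dist-packages/mesos/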
>
> thanks,
> Stratos
>
