Hi Adam,

No, I haven't used the Mesosphere version - it requires downloading packages
from AWS, which I was hoping to avoid. I am using the one in the Apache git
repo: https://github.com/apache/mesos/tree/master/mpi, but it has the
problems I mentioned. I'll probably try out the Mesosphere version this
week, since it doesn't seem I am getting an answer for the other one
anyway.

thanks,
Stratos


On Mon, Nov 3, 2014 at 12:00 PM, Adam Bordelon <[email protected]> wrote:

> Hi Stratos,
>
> Were you using mesos-hydra? https://github.com/mesosphere/mesos-hydra
> That should distribute the binaries to the slaves for you.
> Try it out and let us know if things go better/worse that way.
>
> Thanks,
> -Adam-
>
> On Tue, Oct 28, 2014 at 10:50 PM, Stratos Dimopoulos <
> [email protected]> wrote:
>
>> Hi,
>>
>> I am having a couple of issues trying to run MPI over Mesos. I am using
>> Mesos 0.20.0 on Ubuntu 12.04 and MPICH2.
>>
>> - I was able to successfully (?) run a helloworld MPI program, but the
>> task still appears as LOST in the GUI. Here is the output from the mpi
>> execution:
>>
>> We've launched all our MPDs; waiting for them to come up
>> Got 1 mpd(s), running mpiexec
>> Running mpiexec
>>
>>
>>  *** Hello world from processor euca-10-2-235-206, rank 0 out of 1
>> processors ***
>>
>> mpiexec completed, calling mpdallexit euca-10-2-248-74_57995
>> Task 0 in state 5
>> A task finished unexpectedly, calling mpdexit on euca-10-2-248-74_57995
>> mpdroot: perror msg: No such file or directory
>> mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
>>     probable cause:  no mpd daemon on this machine
>>     possible cause:  unix socket /tmp/mpd2.console_root has been removed
>> mpdexit (__init__ 1208): forked process failed; status=255
>> I1028 22:15:04.774554  4859 sched.cpp:747] Stopping framework
>> '20141028-203440-1257767434-5050-3638-0006'
>> 2014-10-28 22:15:04,795:4819(0x7fd7b1422700):ZOO_INFO@zookeeper_close@2505:
>> Closing zookeeper sessionId=0x14959388d4e0020
>>
>>
>> And also in *executor stdout* I get:
>> sh -c 'mpd --noconsole --ncpus=1 --host=euca-10-2-248-74 --port=39237'
>> Command exited with status 127 → command not found
>>
>> and on *stderr*:
>> sh: 1 mpd: not found
>>
>> I am assuming these messages appear in the executor's log files because
>> once mpiexec completes the task is finished and the mpd ring is no
>> longer running - so the shell complains about not finding the mpd
>> command, which otherwise works fine.
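>>
>> If it helps, something like this could confirm whether mpd is on the
>> PATH that the executor's shell sees on each slave (a little sanity
>> check of my own, not part of the framework):
>>
>> import distutils.spawn
>>
>> mpd_path = distutils.spawn.find_executable("mpd")
>> if mpd_path is None:
>>     print("mpd not on PATH - sh -c 'mpd ...' will exit with 127")
>> else:
>>     print("mpd found at %s" % mpd_path)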
>>
>>
>> - Another thing I would like to ask has to do with the procedure to
>> follow for running MPI on Mesos. So far, using Spark and Hadoop on
>> Mesos, I was used to having the executor shared on HDFS, so there was
>> no need to distribute the code to the slaves. With MPI I had to
>> distribute the helloworld executable to the slaves myself, because
>> having it on HDFS didn't work. Moreover, I was expecting the mpd ring
>> to be started by Mesos (in the same way that the Hadoop JobTracker is
>> started by Mesos in the Hadoop-on-Mesos implementations). Now I have to
>> run mpdboot first before being able to run MPI on Mesos. Is the above
>> procedure what I should do, or am I missing something?
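>>
>> For what it's worth, here is roughly what I expected to be able to do:
>> point the task's CommandInfo at an HDFS URI and let the Mesos fetcher
>> pull it into the sandbox. Just a sketch assuming the stock
>> mesos.interface protobuf bindings; the hdfs:// path and binary name
>> are placeholders:
>>
>> from mesos.interface import mesos_pb2
>>
>> task = mesos_pb2.TaskInfo()
>> # ... task_id, slave_id and resources filled in by the scheduler ...
>> uri = task.command.uris.add()
>> uri.value = "hdfs://namenode:8020/apps/mpi/helloworld"  # placeholder
>> uri.executable = True  # fetcher marks the downloaded file executable
>> task.command.value = "./helloworld"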
>>
>> - Finally, in order to make MPI work I had to install mesos.interface
>> with pip and manually copy the native directory from
>> python/dist-packages (native doesn't exist in the pip repo). Then I
>> realized that the mpiexec-mesos.in file does all of that already - I
>> can update the README to make it a little clearer if you want; I am
>> guessing someone else might also get confused by this.
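>>
>> In case it helps for the README, a quick import check along these
>> lines could confirm the bindings are complete before running
>> mpiexec-mesos (my own snippet, not part of the framework):
>>
>> try:
>>     from mesos.interface import mesos_pb2  # installed via pip
>>     import mesos.native  # copied from python/dist-packages
>> except ImportError as e:
>>     raise SystemExit("Mesos Python bindings incomplete: %s" % e)
>> print("mesos.interface and mesos.native are both importable")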
>>
>> thanks,
>> Stratos
>>
>
>
