Hi Stratos,

Sorry for the tardy reply. We wrote mesos-hydra because newer versions of
MPICH2 no longer provide mpdtrace (see
https://github.com/apache/mesos/blob/master/mpi/README), and it assumes no
POSIX parallel file system is available.
mesos-hydra starts mpirun with the 'manual' launcher, and the jobs start
hydra_pmi_proxy with the settings announced by mpirun.
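
For context, the manual-launcher flow looks roughly like this (the proxy
command below is paraphrased, not the exact output):

  # on the node where the framework runs mpirun:
  mpirun -launcher manual -n 2 ./a.out
  # mpirun prints one proxy command per host, something like
  #   HYDRA_LAUNCH: hydra_pmi_proxy --control-port <host>:<port> --proxy-id 0
  # and then waits; the framework runs those hydra_pmi_proxy commands on the
  # allocated slaves, at which point the MPI job actually starts.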

I wrote a different way of doing this (assuming a parallel filesystem is
available, which is usually the case for an MPI environment):
https://github.com/nqn/gasc

This framework just starts ssh daemons, contained and kept in the process
tree, and announces the node list once the desired allocation has been
established. I tested it with MPICH2, but it should be usable in a generic
way (exposing the node list through environment variables, files, etc.)
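
As a rough sketch of that generic usage (the node-list path and counts are
placeholders, not something gasc prescribes):

  # gasc establishes the allocation, starts sshd on each slave, and
  # announces the allocated hosts, e.g. one hostname per line:
  #   node01
  #   node02
  # MPICH2's hydra launcher can then consume that list over ssh:
  mpiexec -f /path/to/nodelist -n 16 ./a.out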

Hope this helps.

Niklas

On 3 November 2014 12:08, Stratos Dimopoulos <[email protected]>
wrote:

> Hi Adam,
>
> No, I haven't used the Mesosphere version - it requires downloading
> packages from AWS, and I was hoping to avoid this. I am using the one in
> the Apache git repo: https://github.com/apache/mesos/tree/master/mpi, but
> it has the problems I mentioned. I'll probably try out the Mesosphere
> version this week, since it doesn't seem I am getting an answer for the
> other one anyway.
>
> thanks,
> Stratos
>
>
> On Mon, Nov 3, 2014 at 12:00 PM, Adam Bordelon <[email protected]> wrote:
>
>> Hi Stratos,
>>
>> Were you using mesos-hydra? https://github.com/mesosphere/mesos-hydra
>> That should distribute the binaries to the slaves for you.
>> Try it out and let us know if things go better/worse that way.
>>
>> Thanks,
>> -Adam-
>>
>> On Tue, Oct 28, 2014 at 10:50 PM, Stratos Dimopoulos <
>> [email protected]> wrote:
>>
>>> Hi,
>>>
>>> I am having a couple of issues trying to run MPI over Mesos. I am using
>>> Mesos 0.20.0 on Ubuntu 12.04 and MPICH2.
>>>
>>> - I was able to successfully (?) run a hello-world MPI program, but the
>>> task still appears as LOST in the GUI. Here is the output from the MPI
>>> execution:
>>>
>>> >> We've launched all our MPDs; waiting for them to come up
>>> Got 1 mpd(s), running mpiexec
>>> Running mpiexec
>>>
>>>
>>>  *** Hello world from processor euca-10-2-235-206, rank 0 out of 1
>>> processors ***
>>>
>>> mpiexec completed, calling mpdallexit euca-10-2-248-74_57995
>>> Task 0 in state 5
>>> A task finished unexpectedly, calling mpdexit on euca-10-2-248-74_57995
>>> mpdroot: perror msg: No such file or directory
>>> mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
>>>     probable cause:  no mpd daemon on this machine
>>>     possible cause:  unix socket /tmp/mpd2.console_root has been removed
>>> mpdexit (__init__ 1208): forked process failed; status=255
>>> I1028 22:15:04.774554  4859 sched.cpp:747] Stopping framework
>>> '20141028-203440-1257767434-5050-3638-0006'
>>> 2014-10-28 22:15:04,795:4819(0x7fd7b1422700):ZOO_INFO@zookeeper_close@2505:
>>> Closing zookeeper sessionId=0x14959388d4e0020
>>>
>>>
>>> And also in *executor stdout* I get:
>>> sh -c 'mpd --noconsole --ncpus=1 --host=euca-10-2-248-74 --port=39237'
>>> Command exited with status 127 (→ command not found)
>>>
>>> and on *stderr*:
>>> sh: 1 mpd: not found
>>>
>>> I am assuming these messages appear in the executor's log files because,
>>> after mpiexec completes, the task is finished and the mpd ring is no
>>> longer running - so it complains about not finding the mpd command, which
>>> normally works fine.
>>>
>>>
>>> - Another thing I would like to ask has to do with the procedure to
>>> follow for running MPI on Mesos. So far, using Spark and Hadoop on Mesos,
>>> I was used to having the executor shared on HDFS, so there was no need to
>>> distribute the code to the slaves. With MPI I had to distribute the
>>> hello-world executable to the slaves, because having it on HDFS didn't
>>> work. Moreover, I was expecting that the mpd ring would be started by
>>> Mesos (in the same way that the Hadoop JobTracker is started by Mesos in
>>> the Hadoop-on-Mesos implementations). Now I have to run mpdboot first
>>> before being able to run MPI on Mesos. Is this the procedure I should
>>> follow, or am I missing something?
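>>>
>>> For reference, what I currently do is roughly the following (the host
>>> file and counts are placeholders for my setup):
>>>
>>>   mpdboot -n 3 -f mpd.hosts   # bring up the mpd ring by hand first
>>>   mpdtrace                    # verify the ring is up
>>>   # only then launch the MPI job through the framework in mpi/ from the
>>>   # Apache repo, following its README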
>>>
>>> - Finally, in order to make MPI work I had to install mesos.interface
>>> with pip and manually copy the native directory from python/dist-packages
>>> (native doesn't exist in the pip repo). Then I realized there is the
>>> mpiexec-mesos.in file that does all of that. I can update the README to
>>> be a little clearer if you want - I am guessing someone else might also
>>> get confused by this.
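>>>
>>> The manual steps were roughly the following (the source path is a
>>> placeholder for wherever the Mesos install puts python/dist-packages):
>>>
>>>   pip install mesos.interface
>>>   # 'native' is not in the pip repo, so copy it by hand:
>>>   cp -r /path/to/python/dist-packages/mesos/native <site-packages>/mesos/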
>>>
>>> thanks,
>>> Stratos
>>>
>>
>>
>
