Thanks to both Adam and Niklas for your answers!
I tried the Mesosphere version for MPI (
https://github.com/mesosphere/mesos-hydra) on my cluster running Mesos 0.20
with HDFS.
The Python egg bundled with mesos-hydra seems to be compiled against Mesos
0.16, which has a different API than 0.20. It also seems to be compiled with
GLIBC_2.16, while the Ubuntu 12.04 I am running only supports up to glibc
2.15. For this reason, I copied the interface, native, and dist Python eggs
from my own Mesos build and changed the PYTHONPATH in the mesos-hydra mrun
executable accordingly (a rough sys.path equivalent is sketched after the
imports below). I also changed the imports in mrun.py as follows:
#import mesos
#import mesos_pb2
import mesos.interface
import mesos.native
from mesos.interface import mesos_pb2
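
(As a rough alternative to editing PYTHONPATH, one could also prepend the
eggs at the top of mrun.py; this is just a sketch, and the egg paths and
names below are examples from my build, so they will differ on other setups:)

# Sketch: put the locally built Mesos 0.20 eggs ahead of anything bundled
# with mesos-hydra. Adjust the paths/names to your own build output.
import sys
sys.path.insert(0, "/path/to/mesos/build/src/python/dist/mesos.interface-0.20.0-py2.7.egg")
sys.path.insert(0, "/path/to/mesos/build/src/python/dist/mesos.native-0.20.0-py2.7-linux-x86_64.egg")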
With these changes in place, however, I get the following error:
Traceback (most recent call last):
  File "mrun.py", line 6, in <module>
    import mesos.native
  File "build/bdist.linux-x86_64/egg/mesos/native/__init__.py", line 17, in <module>
    # See include/mesos/scheduler.hpp, include/mesos/executor.hpp and
  File "build/bdist.linux-x86_64/egg/mesos/native/_mesos.py", line 7, in <module>
  File "build/bdist.linux-x86_64/egg/mesos/native/_mesos.py", line 6, in __bootstrap__
  File "build/bdist.linux-x86_64/egg/mesos/interface/mesos_pb2.py", line 4, in <module>
ImportError: cannot import name enum_type_wrapper
It seems this might be related to the Python protobuf version, but I am
still trying to debug it. If you see an easy solution that I am missing,
please let me know. It looks like a dependency / Python packaging issue that
would probably be trivial for someone who has been using Mesos for a while...
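
In case it helps with the debugging, here is the small check I am using
(just a sketch; the "protobuf" distribution name in the version lookup is an
assumption on my part) to see which google.protobuf the interpreter actually
picks up and whether enum_type_wrapper can be imported from it, which is the
import that mesos_pb2.py fails on:

# Diagnostic sketch: report where google.protobuf is loaded from, its
# version, and whether enum_type_wrapper (needed by mesos_pb2.py) imports.
import google.protobuf
print("google.protobuf loaded from: %s" % google.protobuf.__file__)

try:
    import pkg_resources
    # Assumes the package is installed under the distribution name "protobuf".
    print("protobuf version: %s" % pkg_resources.get_distribution("protobuf").version)
except Exception as e:
    print("could not determine protobuf version: %s" % e)

try:
    from google.protobuf.internal import enum_type_wrapper
    print("enum_type_wrapper import OK")
except ImportError as e:
    print("enum_type_wrapper import failed: %s" % e)

My suspicion is that an older or bundled copy of protobuf is shadowing the
one that mesos_pb2.py expects, but I have not confirmed this yet.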
thanks,
Stratos
On Thu, Nov 6, 2014 at 1:54 PM, Niklas Nielsen <[email protected]> wrote:
> Hi Stratos,
>
> Sorry for the tardy reply. We wrote mesos-hydra because newer versions of
> mpich2 no longer provide mpdtrace (see
> https://github.com/apache/mesos/blob/master/mpi/README) and because we
> assumed no POSIX parallel file system is available.
> mesos-hydra starts mpirun with the 'manual' launcher, and the jobs start
> hydra_pmi_proxy with the settings announced by mpirun.
>
> I wrote a different way of doing this (assuming a parallel filesystem is
> available, which is usually the case for an MPI environment):
> https://github.com/nqn/gasc
>
> This framework just starts ssh daemons, contained and in the process tree,
> and announces the node list after establishing the desired allocation. I
> tested it with mpich2, but it should be usable in a generic way (exposing
> the node list in environment variables, files, etc.).
>
> Hope this helps.
>
> Niklas
>
> On 3 November 2014 12:08, Stratos Dimopoulos <[email protected]
> > wrote:
>
>> Hi Adam,
>>
>> No, I haven't used the Mesosphere version; it requires downloading
>> packages from AWS, and I was hoping to avoid that. I am using the one in
>> the Apache git repo: https://github.com/apache/mesos/tree/master/mpi, but
>> it has the problems I mentioned. I'll probably try out the Mesosphere
>> version this week, since it doesn't seem I am getting an answer for the
>> other one anyway.
>>
>> thanks,
>> Stratos
>>
>>
>> On Mon, Nov 3, 2014 at 12:00 PM, Adam Bordelon <[email protected]>
>> wrote:
>>
>>> Hi Stratos,
>>>
>>> Were you using mesos-hydra? https://github.com/mesosphere/mesos-hydra
>>> That should distribute the binaries to the slaves for you.
>>> Try it out and let us know if things go better/worse that way.
>>>
>>> Thanks,
>>> -Adam-
>>>
>>> On Tue, Oct 28, 2014 at 10:50 PM, Stratos Dimopoulos <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am having a couple of issues trying to run MPI over Mesos. I am using
>>>> Mesos 0.20.0 on Ubuntu 12.04 and MPICH2.
>>>>
>>>> - I was able to successfully (?) run a helloworld MPI program, but the
>>>> task still appears as lost in the GUI. Here is the output from the MPI
>>>> execution:
>>>>
>>>> >> We've launched all our MPDs; waiting for them to come up
>>>> Got 1 mpd(s), running mpiexec
>>>> Running mpiexec
>>>>
>>>>
>>>> *** Hello world from processor euca-10-2-235-206, rank 0 out of 1
>>>> processors ***
>>>>
>>>> mpiexec completed, calling mpdallexit euca-10-2-248-74_57995
>>>> Task 0 in state 5
>>>> A task finished unexpectedly, calling mpdexit on euca-10-2-248-74_57995
>>>> mpdroot: perror msg: No such file or directory
>>>> mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
>>>> probable cause: no mpd daemon on this machine
>>>> possible cause: unix socket /tmp/mpd2.console_root has been removed
>>>> mpdexit (__init__ 1208): forked process failed; status=255
>>>> I1028 22:15:04.774554 4859 sched.cpp:747] Stopping framework
>>>> '20141028-203440-1257767434-5050-3638-0006'
>>>> 2014-10-28 22:15:04,795:4819(0x7fd7b1422700):ZOO_INFO@zookeeper_close@2505:
>>>> Closing zookeeper sessionId=0x14959388d4e0020
>>>>
>>>>
>>>> And also in *executor stdout* I get:
>>>> sh -c 'mpd --noconsole --ncpus=1 --host=euca-10-2-248-74 --port=39237'
>>>> Command exited with status 127 → command not found
>>>>
>>>> and on *stderr*:
>>>> sh: 1 mpd: not found
>>>>
>>>> I am assuming the messages in the executor's log files appear because,
>>>> once mpiexec completes, the task is finished and the mpd ring is no
>>>> longer running, so it complains about not finding the mpd command, which
>>>> otherwise works fine.
>>>>
>>>>
>>>> - Another thing I would like to ask has to do with the procedure for
>>>> running MPI on Mesos. So far, using Spark and Hadoop on Mesos, I was
>>>> used to having an executor shared on HDFS, with no need to distribute
>>>> the code to the slaves. With MPI I had to distribute the helloworld
>>>> executable to the slaves, because having it on HDFS didn't work.
>>>> Moreover, I was expecting that the mpd ring would be started by Mesos
>>>> (in the same way that the hadoop jobtracker is started by Mesos in the
>>>> HadoopOnMesos implementations). Now I have to run mpdboot first before
>>>> being able to run MPI on Mesos. Is this the procedure I should follow,
>>>> or am I missing something?
>>>>
>>>> - Finally, in order to make MPI work I had to install mesos.interface
>>>> with pip and manually copy the native directory from
>>>> python/dist-packages (native doesn't exist in the pip repo). I then
>>>> realized there is the mpiexec-mesos.in file that does all of that. I can
>>>> update the README to be a little clearer if you want; I am guessing
>>>> someone else might also get confused by this.
>>>>
>>>> thanks,
>>>> Stratos
>>>>
>>>
>>>
>>
>