Hi Stratos,

Sorry for the tardy reply. We wrote mesos-hydra because newer versions of MPICH2 no longer provide mpdtrace (see https://github.com/apache/mesos/blob/master/mpi/README) and because we assumed no POSIX parallel file system is available. mesos-hydra starts mpirun with the 'manual' launcher, and the job then starts hydra_pmi_proxy with the settings announced by mpirun.
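Roughly, the flow is the sketch below. This is only illustrative and from memory: the exact flags and the precise format of the launch lines depend on your MPICH build, and the real hydra_pmi_proxy arguments are whatever mpirun itself announces:

    # A hand-written sketch of what mesos-hydra automates:
    mpiexec -launcher manual -n 2 ./hello
    # mpiexec prints one launch line per proxy it expects, roughly
    #   HYDRA_LAUNCH: hydra_pmi_proxy --control-port <host>:<port> ... --proxy-id 0
    #   HYDRA_LAUNCH_END
    # and then waits; the framework runs each announced hydra_pmi_proxy command
    # as a Mesos task on the slaves it was offered.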
I wrote a different way of doing this (assuming a parallel filesystem is around, which is usually the case for an MPI environment): https://github.com/nqn/gasc

This framework just starts ssh daemons, contained and inside the process tree, and announces the node list once the desired allocation has been established. I tested it with MPICH2, but it should be usable in a generic way (exposing the node list through environment variables, files, etc.), as in the sketch below.
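For example, once the node list has been announced (say, written to a hostfile on the shared filesystem), a plain ssh-based MPICH launch works against that allocation. The path, host names, and process count below are made up for illustration:

    # hypothetical hostfile derived from the announced node list, one host per line:
    #   euca-10-2-235-206
    #   euca-10-2-248-74
    mpiexec -launcher ssh -f /shared/nodes.txt -n 8 ./hello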
Hope this helps.

Niklas

On 3 November 2014 12:08, Stratos Dimopoulos <[email protected]> wrote:

> Hi Adam,
>
> No, I haven't used the Mesosphere version - it requires downloading packages from AWS and I was hoping to avoid this. I am using the one in the Apache git repo: https://github.com/apache/mesos/tree/master/mpi but it has the problems I mentioned. I'll probably try out the Mesosphere version this week, since it doesn't seem I am getting an answer for the other one anyway.
>
> thanks,
> Stratos
>
> On Mon, Nov 3, 2014 at 12:00 PM, Adam Bordelon <[email protected]> wrote:
>
>> Hi Stratos,
>>
>> Were you using mesos-hydra? https://github.com/mesosphere/mesos-hydra
>> That should distribute the binaries to the slaves for you.
>> Try it out and let us know if things go better/worse that way.
>>
>> Thanks,
>> -Adam-
>>
>> On Tue, Oct 28, 2014 at 10:50 PM, Stratos Dimopoulos <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I am having a couple of issues trying to run MPI over Mesos. I am using Mesos 0.20.0 on Ubuntu 12.04 and MPICH2.
>>>
>>> - I was able to successfully (?) run a helloworld MPI program, but the task still appears as lost in the GUI. Here is the output from the mpi execution:
>>>
>>> We've launched all our MPDs; waiting for them to come up
>>> Got 1 mpd(s), running mpiexec
>>> Running mpiexec
>>>
>>> *** Hello world from processor euca-10-2-235-206, rank 0 out of 1 processors ***
>>>
>>> mpiexec completed, calling mpdallexit euca-10-2-248-74_57995
>>> Task 0 in state 5
>>> A task finished unexpectedly, calling mpdexit on euca-10-2-248-74_57995
>>> mpdroot: perror msg: No such file or directory
>>> mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
>>> probable cause: no mpd daemon on this machine
>>> possible cause: unix socket /tmp/mpd2.console_root has been removed
>>> mpdexit (__init__ 1208): forked process failed; status=255
>>> I1028 22:15:04.774554 4859 sched.cpp:747] Stopping framework '20141028-203440-1257767434-5050-3638-0006'
>>> 2014-10-28 22:15:04,795:4819(0x7fd7b1422700):ZOO_INFO@zookeeper_close@2505: Closing zookeeper sessionId=0x14959388d4e0020
>>>
>>> And also in *executor stdout* I get:
>>> sh -c 'mpd --noconsole --ncpus=1 --host=euca-10-2-248-74 --port=39237'
>>> Command exited with status 127 → command not found
>>>
>>> and on *stderr*:
>>> sh: 1 mpd: not found
>>>
>>> I am assuming the messages in the executor's log files appear because, after mpiexec completes, the task is finished and the mpd ring is no longer running - so it complains about not finding the mpd command, which normally works fine.
>>>
>>> - Another thing I would like to ask has to do with the procedure to follow for running MPI on Mesos. So far, using Spark and Hadoop on Mesos, I was used to having an executor shared on HDFS, and there was no need to distribute the code to the slaves. With MPI I had to distribute the helloworld executable to the slaves, because having it on HDFS didn't work. Moreover, I was expecting that the mpd ring would be started from Mesos (in the same way that the Hadoop JobTracker is started from Mesos in the Hadoop-on-Mesos implementations). Now I have to run mpdboot first before being able to run MPI on Mesos. Is the above procedure what I should do, or am I missing something?
>>>
>>> - Finally, in order to make MPI work I had to install mesos.interface with pip and manually copy the native directory from python/dist-packages (native doesn't exist in the pip repo). Then I realized there is the mpiexec-mesos.in file that does all that - I can update the README to be a little clearer if you want, as I am guessing someone else might also get confused by this.
>>>
>>> thanks,
>>> Stratos

