CDH5.2.3 on mesos

2014-10-28 Thread Ankur Chauhan
Hi,

I was trying to set up mesos/hadoop with the latest CDH version (MR1), but the 
instructions seem to be somewhat out of date. I also tried the suggestions in 
https://github.com/mesos/hadoop/issues/25, but after 4 hours of flailing 
around I am still kind of stuck :-/

It seems like the configuration/installation instructions aren't complete, and I 
am just too new to Hadoop to figure out what's missing or going wrong. Does 
anyone know of a good resource I can use to get going?

-- Ankur

Re: CDH5.2.3 on mesos

2014-10-28 Thread Tom Arnfeld
Hi Ankur,


There aren't any getting-started resources other than the documentation there, as 
far as I know. Could you share your Hadoop configuration and perhaps a 
description of the problems you're having?
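For reference, a working setup usually comes down to a handful of properties in 
mapred-site.xml on the JobTracker node. A rough sketch (the property names are 
the ones the mesos/hadoop README uses; the hostnames, ZooKeeper address and 
executor tarball URI are placeholders you'd replace with your own):

  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
  </property>
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.MesosScheduler</value>
  </property>
  <property>
    <name>mapred.mesos.taskScheduler</name>
    <value>org.apache.hadoop.mapred.JobQueueTaskScheduler</value>
  </property>
  <property>
    <name>mapred.mesos.master</name>
    <value>zk://zk.example.com:2181/mesos</value>
  </property>
  <property>
    <name>mapred.mesos.executor.uri</name>
    <value>hdfs://namenode.example.com/hadoop-2.5.0-cdh5.2.0.tar.gz</value>
  </property>

That, plus the hadoop-mesos jar on the JobTracker's classpath, is roughly all 
the Mesos-specific configuration there is, so seeing your mapred-site.xml should 
narrow things down quickly.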




Tom.

On Tue, Oct 28, 2014 at 8:53 AM, Ankur Chauhan an...@malloc64.com wrote:

 Hi,
 I was trying to set up mesos/hadoop with the latest CDH version (MR1), but the 
 instructions seem to be somewhat out of date. I also tried the suggestions in 
 https://github.com/mesos/hadoop/issues/25, but after 4 hours of flailing 
 around I am still kind of stuck :-/
 It seems like the configuration/installation instructions aren't complete, and 
 I am just too new to Hadoop to figure out what's missing or going wrong. Does 
 anyone know of a good resource I can use to get going?
 -- Ankur

0.21.0-pre & Spark latest

2014-10-28 Thread Tim St Clair
Folks - 

We have some automated tests that run the latest Mesos against the latest 
Spark, and we've run across a series of issues in both fine- and coarse-grained 
mode that I believe stem from a series of changes in the 0.21 cycle.

I'm not certain if anyone owns this integration, but we should probably 
ensure it's fixed before we push out 0.21. 

-- 
Cheers,
Timothy St. Clair
Red Hat Inc.


Re: Does Mesos support Hadoop MR V2

2014-10-28 Thread Brenden Matthews
Porting YARN to run atop Mesos is quite reasonable.  Some folks at eBay
have started some work on this (https://github.com/mesos/myriad).  If
you're interested, you should check it out, and contribute to the project.

On Tue, Oct 28, 2014 at 5:21 AM, Yaneeve Shekel yaneeve.she...@sizmek.com
wrote:

 To quote John below, “So excuse my naivety… but…”, I am also confused as to 
 the version/naming conventions going on in the Hadoop project.

 I would like to run Hadoop over Mesos as opposed to over YARN. I would also 
 like to use the *“new”* mapreduce packages.

 https://github.com/mesos/hadoop mentions that “The pom.xml included is 
 configured and tested against CDH5 and MRv1. Hadoop on Mesos does not 
 currently support YARN (and MRv2).” Does this mean that the mapreduce 
 package is not available? I think it does not; I think I should be able to 
 use the “new” API over any scheduling system, just as I could over plain 
 vanilla CDH (where I could configure and use any combination of the cross 
 product (mapred, mapreduce) X (MRv1, YARN)). Could anyone verify this?
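 (For concreteness: I would expect a driver written purely against the new 
 org.apache.hadoop.mapreduce API, along the lines of the sketch below, to 
 compile and run unchanged whether the cluster's client configuration points 
 at an MRv1 JobTracker -- and therefore at Hadoop on Mesos -- or at YARN. The 
 mapper and reducer are Hadoop's stock library classes; the input/output paths 
 are placeholders taken from the command line.)

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

 public class NewApiWordCount {
   public static void main(String[] args) throws Exception {
     // Which framework runs this (MRv1 JobTracker vs. YARN) is decided by the
     // client-side configuration, not by anything in this code.
     Configuration conf = new Configuration();
     Job job = Job.getInstance(conf, "new-api word count");
     job.setJarByClass(NewApiWordCount.class);
     job.setMapperClass(TokenCounterMapper.class); // stock "new API" mapper
     job.setCombinerClass(IntSumReducer.class);
     job.setReducerClass(IntSumReducer.class);     // stock "new API" reducer
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(IntWritable.class);
     FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
     FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output
     System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
 }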

 Second, has any work been done, pertaining to the original thread, with regard 
 to what John suggested below?



 Thanks a lot,

 Yaneeve



 On Jul 27, 2014 7:00 PM, John Omernik j...@omernik.com wrote:



  So excuse my naivety in this space, but my ignorance has never really
  stopped me from asking questions:

  I see YARN (Yet Another Resource Negotiator) as very similar to Mesos,
  i.e. something to manage resources on a cluster of machines. So when I hear
  talk of running YARN on Mesos it seems very redundant indeed, and I ask
  myself, what are we actually getting out of this setup?

  So, going to the map/reduce question, I see Map Reduce V1 and Map Reduce
  V2 like this: Map Reduce V2 is an application that runs on YARN. I.e. if
  you run a job, it creates an application master, that application master
  requests resources, and the job gets run. It differs from Map Reduce V1 in
  that there is no long-running Job Tracker (other than the YARN Resource
  Manager, but that is managing resources for all applications, not just Map
  Reduce applications). Ok, so Mesos: why can't there be a Mesos application
  that is similar to a Map Reduce V2 application in YARN? Why do we need to
  run YARN on Mesos? That doesn't really make sense. Basically, for M/R V2
  vs. M/R V1, the only difference is that to mimic M/R V1 we need task
  trackers and job trackers running as Mesos applications (which we have).
  So in M/R V2, we just need the equivalent of the application master that
  runs on YARN, requesting resources across the cluster.

  Fundamentally, YARN is confusing because I think they coupled running Map
  Reduce jobs with the resource manager and called it Hadoop v2. By coupling
  the two, people look at YARN as Map Reduce V2, but it's not, really. It's
  a way of running jobs on a cluster of machines (a la Mesos) with an
  application that is the equivalent of Map Reduce V1. The naming seems
  confusing to me; it makes people who have invested in Hadoop (Map Reduce
  V1) very interested in YARN because it's called Hadoop V2, while Mesos is
  seen as the Other.

  Just for my sake, I summarized a TL;DR form so that if someone wants to
  correct my understanding they can:

  Mesos = Tool to manage resources.

  YARN = Tool to manage resources; it's also called Hadoop v2.

  Map Reduce V1 = Job Trackers/Task Trackers; it's what we know. It can run
  on Hadoop clusters, and on Mesos. It's also called Hadoop v1.

  Map Reduce V2 = Application that runs on YARN and mimics Map Reduce V1 on
  a YARN cluster. This + YARN has been called Hadoop v2.

  On Sun, Jul 27, 2014 at 4:10 AM, Maxime Brugidou 
  maxime.brugi...@gmail.com wrote:

  When I said that running YARN over Mesos did not make sense, I meant that
  running a resource manager inside a resource manager is very sub-optimal.
  You will eventually do static allocation of resources for the YARN
  framework in Mesos, or have complex logic to determine how much resource
  should be given to YARN. You will also have the same burden of managing 2
  different clusters instead of one, even if YARN is sort of hidden as a
  Mesos framework.

  However, yes, I believe it's easier to run YARN on Mesos than to run MRv2
  on top of Mesos. The solution I was discussing was obviously ideal, and I
  have looked at the MRAppMaster since, and it discouraged me :)

   On Jul 27, 2014 12:41 AM, Rick Richardson rick.richard...@gmail.com
  wrote:

   FWIW I also think the fastest approach here is porting YARN onto Mesos.

   In a perfect world, writing an implementation layer for the YARN
   interface on Mesos would certainly be the optimal approach, but looking
   at the MRv2 code, it is very, very coupled to many YARN modules.

   If someone wanted to take on the project of 

Re: 0.21.0-pre & Spark latest

2014-10-28 Thread Brenden Matthews
Since we've recently adopted Spark, I'll second Tim's comment.  We had an
issue with 0.20.1 that was possibly related to Spark[1], so it's important
for us to get this stuff fixed in 0.21.0.

Tim, can you elaborate on the issues you saw?  Have you tested with my
recent Spark patches[2][3]?

[1]: https://issues.apache.org/jira/browse/MESOS-1973
[2]: https://github.com/apache/spark/pull/2401
[3]: https://github.com/apache/spark/pull/2453


On Tue, Oct 28, 2014 at 7:46 AM, Tim St Clair tstcl...@redhat.com wrote:

 Folks -

 We have some automated tests that run the latest Mesos against the latest
 Spark, and we've run across a series of issues in both fine- and coarse-grained
 mode that I believe stem from a series of changes in the 0.21 cycle.

 I'm not certain if anyone owns this integration, but we should probably
 ensure it's fixed before we push out 0.21.

 --
 Cheers,
 Timothy St. Clair
 Red Hat Inc.



Re: 0.21.0-pre & Spark latest

2014-10-28 Thread Tim St Clair
inline 

- Original Message -

 From: Brenden Matthews brenden.matth...@airbedandbreakfast.com
 To: user@mesos.apache.org
 Cc: mesos-devel d...@mesos.apache.org, RJ Nowling rnowl...@redhat.com,
 Erik Erlandson e...@redhat.com
 Sent: Tuesday, October 28, 2014 9:51:58 AM
 Subject: Re: 0.21.0-pre & Spark latest

 Since we've recently adopted Spark, I'll second Tim's comment. We had an
 issue with 0.20.1 that was possibly related to Spark[1], so it's important
 for us to get this stuff fixed in 0.21.0.

 Tim, can you elaborate on the issues you saw? Have you tested with my recent
 Spark patches[2][3]?

We are building against Spark 1.1.0 unpatched: 
- Fine-grained mode appears broken. 

- Coarse-grained mode appears to work via normal runs, but crashes in the REPL: 
http://fpaste.org/145782/14506564/ 
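
To reproduce the coarse-grained case it should be enough to open the shell 
against Mesos with coarse mode turned on, e.g. via conf/spark-defaults.conf 
(the ZooKeeper address and executor tarball URI below are placeholders):

  spark.master        mesos://zk://zk.example.com:2181/mesos
  spark.mesos.coarse  true
  spark.executor.uri  hdfs://namenode.example.com/spark-1.1.0-bin-hadoop2.4.tgz

and then just running ./bin/spark-shell.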

 [1]: https://issues.apache.org/jira/browse/MESOS-1973
 [2]: https://github.com/apache/spark/pull/2401
 [3]: https://github.com/apache/spark/pull/2453

 On Tue, Oct 28, 2014 at 7:46 AM, Tim St Clair  tstcl...@redhat.com  wrote:

  Folks -
 

  We have some automated tests that run the latest Mesos against the latest
  Spark, and we've run across a series of issues in both fine- and
  coarse-grained mode that I believe stem from a series of changes in the 0.21
  cycle.
 

  I'm not certain if anyone owns this integration, but we should probably
  ensure it's fixed before we push out 0.21.
 

  --
 
  Cheers,
 
  Timothy St. Clair
 
  Red Hat Inc.
 

-- 
Cheers, 
Timothy St. Clair 
Red Hat Inc. 


Re: 0.21.0-pre & Spark latest

2014-10-28 Thread Timothy Chen
Hi Tim,

Thanks for doing the integration tests; that's something I wanted to do 
but never got around to yet.

I have great interest in ensuring Spark and Mesos work together, and I know 
Brenden does as well.

I have been tracking these Spark-on-Mesos problems in the Spark JIRA and 
labeling them mesos. Can you create these bugs in JIRA so we can dig more 
into each one?

Also, is this integration test automated? 

Thanks!

Tim

 On Oct 28, 2014, at 8:03 AM, Tim St Clair tstcl...@redhat.com wrote:
 
 inline 
 
 - Original Message -
 
 From: Brenden Matthews brenden.matth...@airbedandbreakfast.com
 To: user@mesos.apache.org
 Cc: mesos-devel d...@mesos.apache.org, RJ Nowling 
 rnowl...@redhat.com,
 Erik Erlandson e...@redhat.com
 Sent: Tuesday, October 28, 2014 9:51:58 AM
 Subject: Re: 0.21.0-pre & Spark latest
 
 Since we've recently adopted Spark, I'll second Tim's comment. We had an
 issue with 0.20.1 that was possibly related to Spark[1], so it's important
 for us to get this stuff fixed in 0.21.0.
 
 Tim, can you elaborate on the issues you saw? Have you tested with my recent
 Spark patches[2][3]?
 
 We are building against Spark 1.1.0 unpatched: 
 - Fine-grained mode appears broken. 
 
 - Coarse-grained mode appears to work via normal runs, but crashes in the 
 REPL. 
 http://fpaste.org/145782/14506564/
 
 [1]: https://issues.apache.org/jira/browse/MESOS-1973
 [2]: https://github.com/apache/spark/pull/2401
 [3]: https://github.com/apache/spark/pull/2453
 
 On Tue, Oct 28, 2014 at 7:46 AM, Tim St Clair  tstcl...@redhat.com  wrote:
 
 Folks -
 
 We have some automated tests that run the latest Mesos against the latest
 Spark, and we've run across a series of issues in both fine- and
 coarse-grained mode that I believe stem from a series of changes in the 0.21
 cycle.
 
 I'm not certain if anyone owns this integration, but we should probably
 ensure it's fixed before we push out 0.21.
 
 --
 
 Cheers,
 
 Timothy St. Clair
 
 Red Hat Inc.
 
 -- 
 Cheers, 
 Timothy St. Clair 
 Red Hat Inc. 


Re: 0.21.0-pre & Spark latest

2014-10-28 Thread Timothy Chen
Hi RJ,

I see. Are you or the team working on this problem already? If not,
I'd like to take a look as well.

Tim

On Tue, Oct 28, 2014 at 8:47 AM, RJ Nowling rnowl...@redhat.com wrote:
 Hi Tim,

 The integration test is simply to open the spark shell (Spark 1.1.0) using 
 mesos 0.21 in coarse-grained mode.  We didn't even have to run any commands.

 RJ

 - Original Message -
 From: Timothy Chen tnac...@gmail.com
 To: d...@mesos.apache.org
 Cc: user@mesos.apache.org, RJ Nowling rnowl...@redhat.com, Erik 
 Erlandson e...@redhat.com
 Sent: Tuesday, October 28, 2014 11:40:19 AM
  Subject: Re: 0.21.0-pre & Spark latest

 Hi Tim,

 Thanks for doing the integration tests; that's something I wanted to do
 but never got around to yet.

 I have great interest in ensuring Spark and Mesos work together, and I know
 Brenden does as well.

 I have been tracking these Spark-on-Mesos problems in the Spark JIRA and
 labeling them mesos. Can you create these bugs in JIRA so we can dig more
 into each one?

 Also, is this integration test automated?

 Thanks!

 Tim

  On Oct 28, 2014, at 8:03 AM, Tim St Clair tstcl...@redhat.com wrote:
 
  inline
 
  - Original Message -
 
  From: Brenden Matthews brenden.matth...@airbedandbreakfast.com
  To: user@mesos.apache.org
  Cc: mesos-devel d...@mesos.apache.org, RJ Nowling
  rnowl...@redhat.com,
  Erik Erlandson e...@redhat.com
  Sent: Tuesday, October 28, 2014 9:51:58 AM
   Subject: Re: 0.21.0-pre & Spark latest
 
  Since we've recently adopted Spark, I'll second Tim's comment. We had an
  issue with 0.20.1 that was possibly related to Spark[1], so it's important
  for us to get this stuff fixed in 0.21.0.
 
  Tim, can you elaborate on the issues you saw? Have you tested with my
  recent
  Spark patches[2][3]?
 
  We are building against Spark 1.1.0 unpatched:
   - Fine-grained mode appears broken.
  
   - Coarse-grained mode appears to work via normal runs, but crashes in the
  REPL.
  http://fpaste.org/145782/14506564/
 
  [1]: https://issues.apache.org/jira/browse/MESOS-1973
  [2]: https://github.com/apache/spark/pull/2401
  [3]: https://github.com/apache/spark/pull/2453
 
  On Tue, Oct 28, 2014 at 7:46 AM, Tim St Clair  tstcl...@redhat.com 
  wrote:
 
  Folks -
 
  We have some automated tests that run the latest Mesos against the latest
   Spark, and we've run across a series of issues in both fine- and
   coarse-grained mode that I believe stem from a series of changes in the 0.21
  cycle.
 
  I'm not certain if anyone owns this integration, but we should probably
  ensure it's fixed before we push out 0.21.
 
  --
 
  Cheers,
 
  Timothy St. Clair
 
  Red Hat Inc.
 
  --
  Cheers,
  Timothy St. Clair
  Red Hat Inc.



Re: Multiple schedulers on one machine?

2014-10-28 Thread Vinod Kone
Sorry for the delay in response.

It looks like the master received the registration request from the scheduler
at 20:13, at which point it sent a registration ack back to the scheduler.
What is not clear is why no registration requests were sent/received between
19:47, when the framework apparently started, and 20:13. You can add GLOG_v=1
to the scheduler's environment to print more verbose logging for the driver.
It will tell us what the driver has been doing. Are you sure you didn't have
any network partitions around that time?
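
For example, if the scheduler is launched from a shell or an init script, it is 
just a matter of prefixing whatever command starts it (the command itself below 
is only a stand-in):

  GLOG_v=1 <your usual scheduler start command>

libmesos, which the driver lives in, picks GLOG_v up from the environment.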

On Fri, Oct 10, 2014 at 1:30 PM, Colleen Lee c...@coursera.org wrote:

 2014-09-19 19:46:49,361 INFO [JobManager]  Running job
 1uZArT-yEeS7gCIACpcfeA
 snip
 2014-09-19 20:13:48,134 INFO [JobScheduler]  Job
 1uZArT-yEeS7gCIACpcfeA: Registered as
 20140818-235718-3165886730-5050-901-1507 to master
 '20140818-235718-3165886730-5050-901'

 The snipped code is for unrelated internals of our client. Going back to
 the implementation: the Running job ... log line is output
 immediately before we call driver.run(), and our implementation of the
 registered() method in the scheduler simply prints out the second log
 line above. During this time, from the Mesos master logs, the master
 continues to function as normal, sending offers to (other) frameworks,
 processing the replies, adding/launching tasks, completing/removing tasks,
 unregistering/removing frameworks, etc. Here are the log lines that may be
 suspicious during that window:

 W0919 19:47:00.258894   938 master.cpp:2718] Ignoring unknown exited
 executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@
 10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
 W0919 19:47:00.260349   939 master.cpp:2718] Ignoring unknown exited
 executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@
 10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
 I0919 20:07:02.690067   940 master.cpp:1041] Received registration
 request from scheduler(316)@10.151.31.120:36446
 I0919 20:07:02.690192   940 master.cpp:1059] Registering framework
 20140818-235718-3165886730-5050-901-1502 at scheduler(316)@
 10.151.31.120:36446


 The framework registration log line in the master is for a different framework
 (20140818-235718-3165886730-5050-901-1502) than the problematic
 job/framework (20140818-235718-3165886730-5050-901-1507). Can you grep for
 the log lines corresponding to the problematic framework (grep for its ip:port)
 in the master logs? That should tell us what's happening.


 Good catch, here are the relevant lines from the master logs (I grepped
 for the framework id -- how is the port determined?).
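 (I.e., something along the lines of grep 20140818-235718-3165886730-5050-901-1507 
 mesos-master.INFO; the exact log file name depends on how the master's logging 
 is set up.)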

 I0919 20:13:48.122954   941 master.cpp:1041] Received registration request
 from scheduler(13)@10.45.199.181:43573
 I0919 20:13:48.123080   941 master.cpp:1059] Registering framework
 20140818-235718-3165886730-5050-901-1507 at scheduler(13)@
 10.45.199.181:43573
 I0919 20:13:48.123914   941 hierarchical_allocator_process.hpp:331] Added
 framework 20140818-235718-3165886730-5050-901-1507
 I0919 20:13:48.124739   941 master.cpp:2933] Sending 2 offers to framework
 20140818-235718-3165886730-5050-901-1507
 I0919 20:13:48.129932   940 master.cpp:1889] Processing reply for offers:
 [ 20140818-235718-3165886730-5050-901-486420 ] on slave
 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051
 (ip-10-101-195-45.ec2.internal) for framework
 20140818-235718-3165886730-5050-901-1507
 I0919 20:13:48.130028   940 master.hpp:655] Adding task
 1uZArT-yEeS7gCIACpcfeA with resources cpus(*):1; mem(*):1536 on slave
 20140818-235718-3165886730-5050-901-5 (ip-10-101-195-45.ec2.internal)
 I0919 20:13:48.130630   940 master.cpp:3111] Launching task
 1uZArT-yEeS7gCIACpcfeA of framework
 20140818-235718-3165886730-5050-901-1507 with resources cpus(*):1;
 mem(*):1536 on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@
 10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
 I0919 20:13:48.131531   940 master.cpp:1889] Processing reply for offers:
 [ 20140818-235718-3165886730-5050-901-486421 ] on slave
 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051
 (ip-10-51-165-231.ec2.internal) for framework
 20140818-235718-3165886730-5050-901-1507
 I0919 20:13:48.131589   937 hierarchical_allocator_process.hpp:589]
 Framework 20140818-235718-3165886730-5050-901-1507 filtered slave
 20140818-235718-3165886730-5050-901-5 for 5secs
 I0919 20:13:48.133910   937 hierarchical_allocator_process.hpp:589]
 Framework 20140818-235718-3165886730-5050-901-1507 filtered slave
 20140818-235718-3165886730-5050-901-4 for 5secs
 I0919 20:13:49.833616   940 master.cpp:2628] Status update TASK_RUNNING
 (UUID: ea495e2b-2d19-4206-ae08-9daad311a525) for task
 1uZArT-yEeS7gCIACpcfeA of framework
 20140818-235718-3165886730-5050-901-1507 from slave
 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051
 (ip-10-101-195-45.ec2.internal)
 I0919 20:13:53.184873   941 master.cpp:2933] Sending 2 offers to framework
 

Re: Running mesos-slave in Docker

2014-10-28 Thread Alessandro Siragusa
Hi Tim,

Yes, you are right; I was running the latest automated build from the master 
branch. With v0.7.1 everything works fine.

I am running Marathon, the Mesos master and slave, and ZooKeeper in containers 
on CoreOS.

My plan is to create the needed unit files to run it on CoreOS and to write a 
simple HTTP endpoint (probably written in Node) that uses Hipache to route the 
requests (as I will have to run a lot of containers).

Are you interested in putting this in the main repo, or should I create a 
dedicated side project?

Thanks again,

Alessandro

 On 28 Oct 2014, at 22:19, Tim Chen t...@mesosphere.io wrote:
 
 Hi Alessandro,
 
 I think Mesos is running your task fine, but Marathon is killing your task.
 
 Are you launching Marathon through a docker container as well? And what 
 version of Marathon are you using?
 
 Tim
 
 On Tue, Oct 28, 2014 at 2:07 PM, Alessandro Siragusa 
 alessandro.sirag...@gmail.com wrote:
 Hi guys,
 
 I still have a problem running mesos-slave in a Docker container. It 
 continuously kills and restarts the containers on all three slave nodes. In 
 the Marathon UI I can see multiple instances at the same time on all the 
 nodes.
 
 I1028 20:43:19.572377 8 slave.cpp:1002] Got assigned task 
 ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c for framework 
 20141027-004948-352326060-38124-1-
 I1028 20:43:19.572691 8 slave.cpp:1112] Launching task 
 ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c for framework 
 20141027-004948-352326060-38124-1-
 I1028 20:43:19.573457 8 slave.cpp:1222] Queuing task 
 'ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c' for executor 
 ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c of framework 
 '20141027-004948-352326060-38124-1-
 I1028 20:43:19.575451 13 docker.cpp:743] Starting container 
 '7f23db8e-9fb5-4e20-9f06-eb4caf361d86' for task 
 'ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c' (and executor 
 'ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c') of framework 
 '20141027-004948-352326060-38124-1-'
 I1028 20:43:20.936192 8 slave.cpp:2538] Monitoring executor 
 'ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c' of framework 
 '20141027-004948-352326060-38124-1-' in container 
 '7f23db8e-9fb5-4e20-9f06-eb4caf361d86'
 I1028 20:43:20.947391 13 slave.cpp:1733] Got registration for executor 
 'ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c' of framework 
 20141027-004948-352326060-38124-1- from executor(1)@176.31.235.180:42593
 I1028 20:43:20.947986 13 slave.cpp:1853] Flushing queued task 
 ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c for executor 
 'ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c' of framework 
 20141027-004948-352326060-38124-1-
 I1028 20:43:20.949553 9 slave.cpp:2088] Handling status update 
 TASK_RUNNING (UUID: ebb06849-c0ed-470e-95f0-3c652f6a2eee) for task 
 ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c of framework 
 20141027-004948-352326060-38124-1- from executor(1)@176.31.235.180:42593 
 I1028 20:43:20.949733 6 status_update_manager.cpp:320] Received status 
 update TASK_RUNNING (UUID: ebb06849-c0ed-470e-95f0-3c652f6a2eee) for task 
 ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c of framework 
 20141027-004948-352326060-38124-1-
 I1028 20:43:20.949831 6 status_update_manager.cpp:373] Forwarding status 
 update TASK_RUNNING (UUID: ebb06849-c0ed-470e-95f0-3c652f6a2eee) for task 
 ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c of framework 
 20141027-004948-352326060-38124-1- to master@176.31.225.8:5050
 I1028 20:43:20.949935 6 slave.cpp:2252] Sending acknowledgement for 
 status update TASK_RUNNING (UUID: ebb06849-c0ed-470e-95f0-3c652f6a2eee) for 
 task ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c of framework 
 20141027-004948-352326060-38124-1- to executor(1)@176.31.235.180:42593 
 I1028 20:43:20.955905 10 status_update_manager.cpp:398] Received status 
 update acknowledgement (UUID: ebb06849-c0ed-470e-95f0-3c652f6a2eee) for task 
 ubuntu.70687acd-5ee3-11e4-8b6f-42c8cb288d5c of framework 
 20141027-004948-352326060-38124-1-
 I1028 20:43:21.938161 9 docker.cpp:1286] Updated 'cpu.shares' to 614 at 
 /sys/fs/cgroup/cpu,cpuacct/system.slice/docker-e84f6042adeb23101e6a147f6f5ed1a01f748b432d7e33c8c8d1e9c091487095.scope
  for container 7f23db8e-9fb5-4e20-9f06-eb4caf361d86
 I1028 20:43:21.938460 9 docker.cpp:1321] Updated 
 'memory.soft_limit_in_bytes' to 544MB for container 
 7f23db8e-9fb5-4e20-9f06-eb4caf361d86
 I1028 20:43:21.938865 9 docker.cpp:1347] Updated 'memory.limit_in_bytes' 
 to 544MB at 
 /sys/fs/cgroup/memory/system.slice/docker-e84f6042adeb23101e6a147f6f5ed1a01f748b432d7e33c8c8d1e9c091487095.scope
  for container 7f23db8e-9fb5-4e20-9f06-eb4caf361d86
 I1028 20:43:25.571907 10 slave.cpp:1002] Got assigned task 
 ubuntu.73fb8c9e-5ee3-11e4-8b6f-42c8cb288d5c for framework 
 20141027-004948-352326060-38124-1-
 

MPI on Mesos

2014-10-28 Thread Stratos Dimopoulos
Hi,

I am having a couple of issues trying to run MPI over Mesos. I am using
Mesos 0.20.0 on Ubuntu 12.04 and MPICH2.

- I was able to successfully (?) run a helloworld MPI program, but the
task still appears as LOST in the GUI. Here is the output from the MPI
run:

 We've launched all our MPDs; waiting for them to come up
Got 1 mpd(s), running mpiexec
Running mpiexec


 *** Hello world from processor euca-10-2-235-206, rank 0 out of 1
processors ***

mpiexec completed, calling mpdallexit euca-10-2-248-74_57995
Task 0 in state 5
A task finished unexpectedly, calling mpdexit on euca-10-2-248-74_57995
mpdroot: perror msg: No such file or directory
mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
probable cause:  no mpd daemon on this machine
possible cause:  unix socket /tmp/mpd2.console_root has been removed
mpdexit (__init__ 1208): forked process failed; status=255
I1028 22:15:04.774554  4859 sched.cpp:747] Stopping framework
'20141028-203440-1257767434-5050-3638-0006'
2014-10-28 22:15:04,795:4819(0x7fd7b1422700):ZOO_INFO@zookeeper_close@2505:
Closing zookeeper sessionId=0x14959388d4e0020


And also in *executor stdout* I get:
sh -c 'mpd --noconsole --ncpus=1 --host=euca-10-2-248-74 --port=39237'
Command exited with status 127 → command not found

and on *stderr*:
sh: 1 mpd: not found

I am assuming the messages in the executor's log files appear because, after
mpiexec has completed, the task is finished and the mpd ring is no longer
running - so it complains about not finding the mpd command, which normally
works fine.


- Another thing I would like to ask has to do with the procedure to follow
for running MPI on Mesos. So far, using Spark and Hadoop on Mesos, I was
used to having the executor shared on HDFS, and there was no need to
distribute the code to the slaves. With MPI I had to distribute the
helloworld executable to the slaves, because having it on HDFS didn't work.
Moreover, I was expecting that the mpd ring would be started from Mesos (in
the same way that the Hadoop JobTracker is started from Mesos in the
Hadoop-on-Mesos implementations). Now I have to run mpdboot first before
being able to run MPI on Mesos. Is the above procedure what I should do, or
am I missing something?

- Finally, in order to make MPI work I had to install mesos.interface with
pip and manually copy the native directory from python/dist-packages
(native doesn't exist in the pip repo). Then I realized there is the
mpiexec-mesos.in file that does all of that - I can update the README to be
a little clearer if you want - I am guessing someone else might also get
confused by this.
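
For reference, what I ended up doing was roughly the following; the paths are
only examples and will differ per distro and Python version:

  pip install mesos.interface
  cp -r /usr/lib/python2.7/dist-packages/mesos/native \
        /usr/local/lib/python2.7/dist-packages/mesos/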

thanks,
Stratos