Re: Custom Scheduler: Diagnosing cause of container task failures

2015-08-25 Thread Alex Rukletsov
It looks like we can have a better error message here.

@Jay, mind filing a JIRA ticket with a description, the status update, and
your fix attached? Thanks!
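Jay's eventual fix, quoted below, was to set FrameworkInfo.User to a user that actually exists. A scheduler could catch that mistake before registering. What follows is a minimal sketch, not part of any Mesos or mesos-go API: `validate_framework_user` is a hypothetical helper that checks the name against the local passwd database through Python's standard `pwd` module. It only helps if the agents have the same accounts as the scheduler host, which is an assumption you should verify on your cluster.

```python
import pwd


def validate_framework_user(user):
    """Fail fast if `user` is not a known account on this host.

    Hypothetical helper. In this thread the invalid user only surfaced later,
    as a TASK_FAILED with REASON_COMMAND_EXECUTOR_FAILED reported by the agent.
    """
    if not user:
        raise ValueError("FrameworkInfo.user is empty")
    try:
        pwd.getpwnam(user)  # raises KeyError for unknown accounts
    except KeyError:
        raise ValueError(
            f"FrameworkInfo.user {user!r} does not exist on this host"
        ) from None
    return user
```

You would call it on the value you are about to put into FrameworkInfo, e.g. `validate_framework_user("root")`, and refuse to start the scheduler if it raises.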

On Fri, Aug 21, 2015 at 7:36 PM, Jay Taylor  wrote:

> Eventually I was able to isolate what was going on; in this case the
> FrameworkInfo.User was set to an invalid value and setting it to "root" did
> the trick.
>
> My scheduler is now working [in a basic form]!!!
>
> Cheers,
> Jay
>
> On Thu, Aug 20, 2015 at 4:15 PM, Jay Taylor  wrote:
>
>> Hey Tim,
>>
>> Thank you for the quick response!
>>
>> Just checked the sandbox logs and they are all empty (stdout and stderr
>> are both 0 bytes).
>>
>> I have discovered a little bit more information from the StatusUpdate
>> event posted back to my scheduler:
>>
>> &TaskStatus{
>> TaskId: &TaskID{
>> Value:*fluxCapacitor-test-1,XXX_unrecognized:[],
>> },
>> State: *TASK_FAILED,
>> Message: *Abnormal executor termination,
>> Source: *SOURCE_SLAVE,
>> Reason: *REASON_COMMAND_EXECUTOR_FAILED,
>> Data:nil,
>> SlaveId: &SlaveID{
>> Value: *20150804-211459-1407297728-5050-5855-S1,
>> XXX_unrecognized: [],
>> },
>> ExecutorId: nil,
>> Timestamp: *1.440112075509318e+09,
>> Uuid: *[102 75 82 85 38 139 68 94 153 189 210 87 218 235 147 166],
>> Healthy: nil,
>> XXX_unrecognized: [],
>> }
>>
>> How can I find out why the command executor is failing?
>>
>>
>> On Thu, Aug 20, 2015 at 4:08 PM, Tim Chen  wrote:
>>
>>> It received a TASK_FAILED from the executor, so you'll need to look at
>>> the stdout and stderr files in your task's sandbox to see what went
>>> wrong.
>>>
>>> These files should be reachable through the Mesos UI.
>>>
>>> Tim
>>>
>>> On Thu, Aug 20, 2015 at 4:01 PM, Jay Taylor  wrote:
>>>
 Hey everyone,

 I am writing a scheduler for Mesos and one of my first goals is to get
 a simple Docker container to run.

 The tasks get marked as failed with the failure messages originating
 from the slave logs.  Now I'm not sure how to determine exactly what is
 causing the failure.

 The most informative log messages I've found were in the slave log:

 ==> /var/log/mesos/mesos-slave.INFO <==
 W0820 20:44:25.242230 29639 docker.cpp:994] Ignoring updating unknown
 container: e190037a-b011-4681-9e10-dcbacf6cb819
 I0820 20:44:25.242270 29639 status_update_manager.cpp:322] Received
 status update TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for
 task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.242377 29639 slave.cpp:2961] Forwarding the update
 TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for task
 jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060 to
 master@63.198.215.105:5050
 I0820 20:44:25.247926 29636 status_update_manager.cpp:394] Received
 status update acknowledgement (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60)
 for task jay-test-29 of framework 
 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.248108 29636 slave.cpp:3502] Cleaning up executor
 'jay-test-29' of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.248342 29636 slave.cpp:3591] Cleaning up framework
 20150804-211741-1608624320-5050-18273-0060

 And this doesn't really tell me much about *why* it's failed.

 Is there somewhere else I should be looking or an option that needs to
 be turned on to show more information?

 Your assistance is greatly appreciated!

 Jay

>>>
>>>
>>
>
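Besides the Mesos UI, the sandbox files Tim mentions can be read directly on the agent host. A minimal sketch, assuming the agent directory layout of this era, `<work_dir>/slaves/<agent_id>/frameworks/<framework_id>/executors/<executor_id>/runs/<run>`, where `latest` is expected to be a symlink to the newest run. `sandbox_dir` is a hypothetical helper, and the layout should be confirmed against your agent's `--work_dir`.

```python
from pathlib import PurePosixPath


def sandbox_dir(work_dir, agent_id, framework_id, executor_id, run="latest"):
    """Return the sandbox directory of one executor run on a Mesos agent.

    Assumes the work_dir/slaves/.../runs/<run> layout described above;
    "latest" should point at the most recent run of that executor.
    """
    return PurePosixPath(
        work_dir, "slaves", agent_id, "frameworks", framework_id,
        "executors", executor_id, "runs", run,
    )
```

For the failure in the slave log above, the framework ID would be `20150804-211741-1608624320-5050-18273-0060` and the executor ID `jay-test-29`; the agent ID and work directory are placeholders you would fill in from your own agent, and `stdout`/`stderr` live inside the returned directory.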


Re: SSL in Mesos 0.23

2015-08-25 Thread Joris Van Remoortere
@Carlos
Mesosphere currently doesn't build packages with SSL enabled.

On Tue, Aug 25, 2015 at 3:12 PM, Carlos Sanchez  wrote:

> Hi Joris,
>
> I did build from sources, following instructions in
> http://mesos.apache.org/gettingstarted/
>
> Is the Mesosphere binary compiled with libevent and SSL enabled, as
> mentioned previously? It would make debugging easier if I didn't have to rebuild.
>
>
>
> On Tue, Aug 25, 2015 at 8:52 PM, Joris Van Remoortere wrote:
>
>> @carlos
>> Are you building 0.23.0 from source?
>> Just so we don't miss anything: Can you make sure to run ./bootstrap,
>> and build in a clean directory with your configuration similar to this:
>>
>> ../configure --enable-libevent --enable-ssl
>>
>> Here  is the
>> document I am using as a reference
>>
>> When you start up a master, if you just specify SSL_ENABLED=true it
>> should error out and notify you that other required flags such as
>> SSL_KEY_FILE are not provided. Can you verify this? If that is not
>> happening, then the 2 options are:
>> 1. Your environment variables are not making it to the binary: See Jeff
>> Schroeder's comments
>> 2. The binary is not actually the one you expect. Double-check its
>> checksum against the binary you built after configuring with SSL.
>>
>>
>>
>> On Fri, Aug 14, 2015 at 12:55 PM, Carlos Sanchez wrote:
>>
>>> looking forward to it, thanks!
>>> running out of ideas here on what I am doing wrong
>>>
>>> On Fri, Aug 14, 2015 at 6:53 PM, Marco Massenzio wrote:
>>> > FYI - Joris is out this week, he'll probably be able to get back to you
>>> > early next week (modulo MesosCon craziness :)
>>> >
>>> > Marco Massenzio
>>> > Distributed Systems Engineer
>>> >
>>> > On Fri, Aug 14, 2015 at 9:14 AM, Carlos Sanchez wrote:
>>> >>
>>> >> no suggestions?
>>> >>
>>> >> On Tue, Aug 11, 2015 at 6:47 PM, Vinod Kone wrote:
>>> >> > @joris, can you help out here?
>>> >> >
>>> >> > On Tue, Aug 11, 2015 at 9:43 AM, Carlos Sanchez wrote:
>>> >> >>
>>> >> >> I have tried to enable SSL with no success, even compiling from
>>> >> >> source with the SSL flags --enable-libevent --enable-ssl
>>> >> >>
>>> >> >> export SSL_ENABLED=true
>>> >> >> export SSL_SUPPORT_DOWNGRADE=false
>>> >> >> export SSL_REQUIRE_CERT=true
>>> >> >> export SSL_CERT_FILE=/etc/mesos/...
>>> >> >> export SSL_KEY_FILE=/etc/mesos/...
>>> >> >> export SSL_CA_FILE=/etc/mesos/...
>>> >> >>
>>> >> >>
>>> >> >> /home/ubuntu/mesos-deb-packaging/mesos-repo/build/src/mesos-master
>>> >> >> --work_dir="/var/lib/mesos"
>>> >> >>
>>> >> >> Port 5050 is still served as plain http, no SSL
>>> >> >>
>>> >> >> Nothing about ssl shows up in the logs, any ideas?
>>> >> >>
>>> >> >> Thanks
>>> >> >>
>>> >> >>
>>> >> >> >
>>> >> >> > From: Dharmit Shah 
>>> >> >> > To: user@mesos.apache.org
>>> >> >> > Cc:
>>> >> >> > Date: Mon, 10 Aug 2015 14:13:04 +0530
>>> >> >> > Subject: Re: SSL in Mesos 0.23
>>> >> >> > Hi Jeff,
>>> >> >> >
>>> >> >> > Thanks for the suggestion.
>>> >> >> >
>>> >> >> > I modified the systemd service file to use
>>> >> >> > `/etc/sysconfig/mesos-master` and `/etc/sysconfig/mesos-slave` as
>>> >> >> > environment files for master and slave services respectively. In
>>> >> >> > these
>>> >> >> > files, I specified the environment variables that I used to
>>> specify
>>> >> >> > on
>>> >> >> > the command line.
>>> >> >> >
>>> >> >> > Now if I check `strings /proc//environ | grep SSL` for pids
>>> of
>>> >> >> > master and slave services, I see the environment variables that
>>> I set
>>> >> >> > in the /etc/sysconfig/.
>>> >> >> >
>>> >> >> > Now that it looks like I have started the master and slave
>>> services
>>> >> >> > with SSL enabled, how do I really confirm that communication
>>> between
>>> >> >> > master and slaves is really happening over SSL?
>>> >> >> >
>>> >> >> > Also, how do I enable SSL communication for a framework like
>>> >> >> > Marathon?
>>> >> >> >
>>> >> >> > Regards,
>>> >> >> > Dharmit.
>>> >> >> >
>>> >> >> > On Fri, Aug 7, 2015 at 10:56 PM, Jeff Schroeder
>>> >> >> >  wrote:
>>> >> >> > > The sudo command defaults to envreset (look for that in the man
>>> >> >> > > page)
>>> >> >> > > which
>>> >> >> > > strips all env variables sans a select few. I'd almost bet that
>>> >> >> > > your
>>> >> >> > > SSL_*
>>> >> >> > > variables are not present and were not passed to the slave.
>>> Just
>>> >> >> > > sudo
>>> >> >> > > -i and
>>> >> >> > > start the slaves *as root* without sudo. There is no benefit to
>>> >> >> > > starting
>>> >> >> > > them with sudo. You can verify what I'm saying with something
>>> along
>>> >> >> > > the
>>> >> >> > > lines of:
>>> >> >> > >
>>> >> >> > > strings /proc/$(pidof mesos-slave)/environ | grep ^SSL_
>>> >> >> > >
>>> >> >> > >
>>> >> >> > > On Friday, August 7, 2015, Dharmit Shah >> >
>>> >> >> > > wrote:
>>> >> >> > >>
>>> >> >> > >> Hello again,
>>> >> >> > >>
>>> >> >> > >> Thanks for your responses. I will sha

RE: slave_ping_timeout <1secs

2015-08-25 Thread Nastooh Avessta (navesta)
I see. Thank you for the clarification. Can I just change the boundaries in the
source code to suit my needs, or is there more to it?
Cheers,


Nastooh Avessta
ENGINEER.SOFTWARE ENGINEERING
nave...@cisco.com
Phone: +1 604 647 1527

Cisco Systems Limited
595 Burrard Street, Suite 2123 Three Bentall Centre, PO Box 49121
VANCOUVER
BRITISH COLUMBIA
V7X 1J1
CA
Cisco.com






This email may contain confidential and privileged material for the sole use of 
the intended recipient. Any review, use, distribution or disclosure by others 
is strictly prohibited. If you are not the intended recipient (or authorized to 
receive for the recipient), please contact the sender by reply email and delete 
all copies of this message.
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

Cisco Systems Canada Co, 181 Bay St., Suite 3400, Toronto, ON, Canada, M5J 2T3. 
Phone: 416-306-7000; Fax: 416-306-7099. 

From: Yan Xu [mailto:y...@jxu.me]
Sent: Tuesday, August 25, 2015 5:49 PM
To: user@mesos.apache.org
Subject: Re: slave_ping_timeout <1secs

Yes: 
https://github.com/apache/mesos/blob/5de7ea455ec577e19c67a75b1cf98493b40c53fb/src/master/flags.cpp#L383

Was the error message not shown in stderr?

--
Jiang Yan Xu <y...@jxu.me>
@xujyan

On Tue, Aug 25, 2015 at 5:41 PM, Nastooh Avessta (navesta)
<nave...@cisco.com> wrote:
Hi
I'm running Mesos 0.23.0 and noticed that I cannot start mesos-master with
a slave_ping_timeout of less than 1 second; I tried 0.5secs, 500ms, 50us, etc.
Is this by design or am I missing something?
Cheers,





Re: slave_ping_timeout <1secs

2015-08-25 Thread Yan Xu
Yes:
https://github.com/apache/mesos/blob/5de7ea455ec577e19c67a75b1cf98493b40c53fb/src/master/flags.cpp#L383

Was the error message not shown in stderr?
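The reply confirms a lower bound of 1 second on slave_ping_timeout, which is why 0.5secs, 500ms and 50us are all rejected. The check can be mimicked outside Mesos to see what a given value works out to. This is a rough Python sketch, not a port of the Mesos flag parser: the unit list (ns, us, ms, secs, mins, hrs, days, weeks) is an assumption about the duration syntax Mesos flags accept, and `parse_duration_ns`, `check_slave_ping_timeout` and `MIN_PING_TIMEOUT_NS` are hypothetical names.

```python
import re

# Assumed duration units accepted by Mesos flags, converted to nanoseconds.
UNITS_NS = {
    "ns": 1,
    "us": 1_000,
    "ms": 1_000_000,
    "secs": 1_000_000_000,
    "mins": 60 * 1_000_000_000,
    "hrs": 3_600 * 1_000_000_000,
    "days": 86_400 * 1_000_000_000,
    "weeks": 7 * 86_400 * 1_000_000_000,
}

MIN_PING_TIMEOUT_NS = 1_000_000_000  # the 1-second lower bound from this thread


def parse_duration_ns(text):
    """Parse strings such as '0.5secs' or '500ms' into integer nanoseconds."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([a-z]+)\s*", text)
    if not match or match.group(2) not in UNITS_NS:
        raise ValueError(f"unparseable duration: {text!r}")
    value, unit = match.groups()
    return round(float(value) * UNITS_NS[unit])


def check_slave_ping_timeout(text):
    """Return the timeout in ns, or raise if it is below the 1-second minimum."""
    ns = parse_duration_ns(text)
    if ns < MIN_PING_TIMEOUT_NS:
        raise ValueError(f"--slave_ping_timeout={text} is below the 1secs minimum")
    return ns
```

Every value tried in the original question converts to less than 1,000,000,000 ns, so each would fail this check.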

--
Jiang Yan Xu  @xujyan 

On Tue, Aug 25, 2015 at 5:41 PM, Nastooh Avessta (navesta) <
nave...@cisco.com> wrote:

> Hi
>
> I'm running Mesos 0.23.0 and noticed that I cannot start mesos-master with
> a slave_ping_timeout of less than 1 second; I tried 0.5secs, 500ms, 50us,
> etc. Is this by design or am I missing something?
>
> Cheers,
>


slave_ping_timeout <1secs

2015-08-25 Thread Nastooh Avessta (navesta)
Hi
I'm running Mesos 0.23.0 and noticed that I cannot start mesos-master with
a slave_ping_timeout of less than 1 second; I tried 0.5secs, 500ms, 50us, etc.
Is this by design or am I missing something?
Cheers,




Re: Mesos/Marathon/HAProxy Logging

2015-08-25 Thread John Omernik
So I agree that is how it should be done; however, the current
implementation on Mesos requires me to code something like that manually.
In addition, this only works for HTTP traffic, not TCP... what happens when
the service running on Mesos isn't HTTP? I was hoping for some discussion
beyond just manually editing the HAProxy script to make it HTTP and add
the headers...


On Tue, Aug 25, 2015 at 12:46 PM, Jeff Schroeder  wrote:

> This is the header that should be passed:
>
> https://en.m.wikipedia.org/wiki/X-Forwarded-For
>
> Most of the modern internet routes through reverse proxies and this is how
> we log the actual source clients to solve similar auditing and compliance
> needs.
>
>
> On Tuesday, August 25, 2015, John Omernik  wrote:
>
>> I have been playing with a very simple app: a web service written in
>> Python. I've created a Docker container, the app runs in the container, I
>> set up Marathon to run it, I use mesos-dns and HAProxy, and I can access
>> the service just fine anywhere in the cluster.
>>
>> First let me say this is VERY cool. The capabilities here are awesome.
>>
>> Now the challenge: the security guy in me wants to take good logs from my
>> app. It was set up to do its own logging through a custom module, and I am
>> very happy with it. I set up the app in the container to mount a volume
>> that's in my MapRFS via NFS so I can log directly to a clustered
>> filesystem. This is awesome: I can read my logs in Apache Drill as they are
>> written!!!
>>
>> However, HAProxy threw me for a loop. Once I started running the
>> app in Marathon with a service port and routed traffic through HAProxy, I
>> realized something: I lost the source IPs in my logs!
>>
>> Why?
>>
>> Because once HAProxy takes over, it no longer needs to keep the source
>> IP; the next hop only sees the previous connection's IP. From a
>> service discovery perspective it works great, but with this setup I lose
>> the previous hop. Perhaps I could manually configure HAProxy to add an
>> X-Forwarded-For header, which would be nice; however, that only works for
>> HTTP apps. What about other TCP apps that are not HTTP?
>>
>> This is an interesting problem, because apps need good logging for
>> security, performance, and troubleshooting, and if I can't get the source
>> IP, that could be a problem.
>>
>> So, my question is this: has anyone run into this? How are you handling it?
>> Any brainstorms here we may be able to work off of?
>>
>> One thing I wondered: why are we using HAProxy at all? Couldn't the same
>> HAProxy script instead put forwarding rules into iptables? This sounds
>> messy, but could it work? Has anyone explored that? If the data were
>> forwarded, then it wouldn't lose the IP information (and timeouts wouldn't
>> be a concern either; I think I posted before on how long-running TCP
>> connections can be closed by HAProxy if they don't implement TCP
>> keepalives).
>>
>> Other ideas?  This is interesting to me, and likely others.
>>
>
>
> --
> Text by Jeff, typos by iPhone
>
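For the HTTP case Jeff describes, the application has to take the client address from the header rather than from the socket's peer, which behind HAProxy is HAProxy itself. A minimal sketch, assuming HAProxy has been configured (for example with its `option forwardfor` directive) to append the address it saw to X-Forwarded-For; `client_ip` is a hypothetical helper. As John points out, this does nothing for non-HTTP TCP services.

```python
def client_ip(x_forwarded_for, peer_ip, trusted_proxies=1):
    """Return the best guess at the original client address.

    x_forwarded_for: the X-Forwarded-For value(s), comma-joined, or None.
    peer_ip: the TCP peer address of the connection (HAProxy, when proxied).
    trusted_proxies: how many proxies you control in front of the app.

    Each proxy appends the address it saw, so only the right-most
    `trusted_proxies` entries were written by your own infrastructure;
    entries further left are client-supplied and can be forged.
    """
    hops = [h.strip() for h in (x_forwarded_for or "").split(",") if h.strip()]
    if trusted_proxies > 0 and len(hops) >= trusted_proxies:
        return hops[-trusted_proxies]
    return peer_ip
```

Taking the right-most trusted entry rather than the left-most one matters for the auditing use case: a client can send its own X-Forwarded-For header, and HAProxy will append to it rather than replace it. In a `http.server` handler, for example, the arguments could be `", ".join(self.headers.get_all("X-Forwarded-For", []))` and `self.client_address[0]`.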


Re: Not getting resource offers for 20 min

2015-08-25 Thread Hans van den Bogert
I wanted to add that, even if there weren’t a preview package, you can clone from
Git and check out a tag; in this case v1.5.0-rc1 is tagged. Then
proceed as you would with a source distribution, as described in the
already mentioned http://spark.apache.org/docs/latest/building-spark.html

> On 25 Aug 2015, at 19:26, CCAAT  wrote:
> 
> THANKS, as I have not kept up on the spark lists
> 
> James
> 
> 
> On 08/25/2015 04:28 AM, Iulian Dragoș wrote:
>> 
>> 
>> On Mon, Aug 24, 2015 at 7:16 PM, CCAAT wrote:
>> 
>>On 08/24/2015 05:33 AM, Iulian Dragoș wrote:
>> 
>> 
>>Hello Iulian,
>> 
>>Ok, so I eventually built Spark from 100% sources, after some
>>intermediate builds on Gentoo. Gentoo is not the best platform for
>>Java development, but the issues related to Spark builds are
>>slowly being fixed on Gentoo. Where (and how) do you download the
>>complete spark-1.5.x source tree? It does not seem to be available on
>>this page:
>> 
>>http://spark.apache.org/downloads.html
>> 
>> 
>> It's not yet a final release, but there's a preview:
>> 
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-ANNOUNCE-Spark-1-5-0-preview-package-td13683.html
>> 
>> Building Spark from sources isn't too hard; there's a
>> `make-distribution.sh` script in the root directory. There are a few
>> parameters (like the Hadoop version to build against), but it should be
>> fairly straightforward. More info here:
>> 
>> http://spark.apache.org/docs/latest/building-spark.html
>> 
>> iulian
>> 
>> 
>> 
>>Any other related information or tips on building out spark from sources
>>are keenly received.
>> 
>>James
>> 
>>Unfortunately I don't have access to the cluster anymore, but I
>>think Chronos wasn't the culprit. After updating Spark to 1.5 and
>>setting a framework role, offers started to arrive (while still using
>>Chronos).
>> 
>>iulian
>> 
>> 
>>Iulian Dragos
>> 
>>--
>>Reactive Apps on the JVM
>>www.typesafe.com  
>> 
>> 
>> 
>> 
>> 
>> 
> 



Re: SSL in Mesos 0.23

2015-08-25 Thread Carlos Sanchez
Hi Joris,

I did build from sources, following instructions in
http://mesos.apache.org/gettingstarted/

Is the Mesosphere binary compiled with libevent and SSL enabled, as
mentioned previously? It would make debugging easier if I didn't have to rebuild.



On Tue, Aug 25, 2015 at 8:52 PM, Joris Van Remoortere wrote:

> @carlos
> Are you building 0.23.0 from source?
> Just so we don't miss anything: Can you make sure to run ./bootstrap, and
> build in a clean directory with your configuration similar to this:
>
> ../configure --enable-libevent --enable-ssl
>
> Here  is the
> document I am using as a reference
>
> When you start up a master, if you just specify SSL_ENABLED=true it
> should error out and notify you that other required flags such as SSL_KEY_FILE
> are not provided. Can you verify this? If that is not happening, then the
> 2 options are:
> 1. Your environment variables are not making it to the binary: See Jeff
> Schroeder's comments
> 2. The binary is not actually the one you expect. Double-check its
> checksum against the binary you built after configuring with SSL.
>
>
>
> On Fri, Aug 14, 2015 at 12:55 PM, Carlos Sanchez wrote:
>
>> looking forward to it, thanks!
>> running out of ideas here on what I am doing wrong
>>
>> On Fri, Aug 14, 2015 at 6:53 PM, Marco Massenzio wrote:
>> > FYI - Joris is out this week, he'll probably be able to get back to you
>> > early next week (modulo MesosCon craziness :)
>> >
>> > Marco Massenzio
>> > Distributed Systems Engineer
>> >
>> > On Fri, Aug 14, 2015 at 9:14 AM, Carlos Sanchez wrote:
>> >>
>> >> no suggestions?
>> >>
>> >> On Tue, Aug 11, 2015 at 6:47 PM, Vinod Kone wrote:
>> >> > @joris, can you help out here?
>> >> >
>> >> > On Tue, Aug 11, 2015 at 9:43 AM, Carlos Sanchez wrote:
>> >> >>
>> >> >> I have tried to enable SSL with no success, even compiling from
>> >> >> source with the SSL flags --enable-libevent --enable-ssl
>> >> >>
>> >> >> export SSL_ENABLED=true
>> >> >> export SSL_SUPPORT_DOWNGRADE=false
>> >> >> export SSL_REQUIRE_CERT=true
>> >> >> export SSL_CERT_FILE=/etc/mesos/...
>> >> >> export SSL_KEY_FILE=/etc/mesos/...
>> >> >> export SSL_CA_FILE=/etc/mesos/...
>> >> >>
>> >> >>
>> >> >> /home/ubuntu/mesos-deb-packaging/mesos-repo/build/src/mesos-master
>> >> >> --work_dir="/var/lib/mesos"
>> >> >>
>> >> >> Port 5050 is still served as plain http, no SSL
>> >> >>
>> >> >> Nothing about ssl shows up in the logs, any ideas?
>> >> >>
>> >> >> Thanks
>> >> >>
>> >> >>
>> >> >> >
>> >> >> > From: Dharmit Shah 
>> >> >> > To: user@mesos.apache.org
>> >> >> > Cc:
>> >> >> > Date: Mon, 10 Aug 2015 14:13:04 +0530
>> >> >> > Subject: Re: SSL in Mesos 0.23
>> >> >> > Hi Jeff,
>> >> >> >
>> >> >> > Thanks for the suggestion.
>> >> >> >
>> >> >> > I modified the systemd service file to use
>> >> >> > `/etc/sysconfig/mesos-master` and `/etc/sysconfig/mesos-slave` as
>> >> >> > environment files for master and slave services respectively. In
>> >> >> > these
>> >> >> > files, I specified the environment variables that I used to
>> specify
>> >> >> > on
>> >> >> > the command line.
>> >> >> >
>> >> >> > Now if I check `strings /proc//environ | grep SSL` for pids
>> of
>> >> >> > master and slave services, I see the environment variables that I
>> set
>> >> >> > in the /etc/sysconfig/.
>> >> >> >
>> >> >> > Now that it looks like I have started the master and slave
>> services
>> >> >> > with SSL enabled, how do I really confirm that communication
>> between
>> >> >> > master and slaves is really happening over SSL?
>> >> >> >
>> >> >> > Also, how do I enable SSL communication for a framework like
>> >> >> > Marathon?
>> >> >> >
>> >> >> > Regards,
>> >> >> > Dharmit.
>> >> >> >
>> >> >> > On Fri, Aug 7, 2015 at 10:56 PM, Jeff Schroeder
>> >> >> >  wrote:
>> >> >> > > The sudo command defaults to envreset (look for that in the man
>> >> >> > > page)
>> >> >> > > which
>> >> >> > > strips all env variables sans a select few. I'd almost bet that
>> >> >> > > your
>> >> >> > > SSL_*
>> >> >> > > variables are not present and were not passed to the slave. Just
>> >> >> > > sudo
>> >> >> > > -i and
>> >> >> > > start the slaves *as root* without sudo. There is no benefit to
>> >> >> > > starting
>> >> >> > > them with sudo. You can verify what I'm saying with something
>> along
>> >> >> > > the
>> >> >> > > lines of:
>> >> >> > >
>> >> >> > > strings /proc/$(pidof mesos-slave)/environ | grep ^SSL_
>> >> >> > >
>> >> >> > >
>> >> >> > > On Friday, August 7, 2015, Dharmit Shah 
>> >> >> > > wrote:
>> >> >> > >>
>> >> >> > >> Hello again,
>> >> >> > >>
>> >> >> > >> Thanks for your responses. I will share what I tried after your
>> >> >> > >> suggestions.
>> >> >> > >>
>> >> >> > >> 1. `ldd /usr/sbin/mesos-master` and `ldd /usr/sbin/mesos-slave`
>> >> >> > >> returned similar output as one suggested by Craig. So, I guess,
>> >> >> > >> the
>> >> >> > >> Mesosphere repo binaries have SSL ena

Re: SSL in Mesos 0.23

2015-08-25 Thread Joris Van Remoortere
@Dharmit

If you want to be really sure that the communication is happening over SSL,
you can use a packet sniffer like Wireshark or, depending on your
operating system, dump the packet stream directly to a file, for
example with tcpdump.
Another thing you can do is to try to hit the HTTP endpoints with curl
using http as opposed to https.

Remember that if you have SSL_SUPPORT_DOWNGRADE=true, you should be able to
connect even without SSL. If it is false (the default), you will not be able
to connect.
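The curl comparison above can also be scripted from the other direction: instead of trying plain HTTP, check whether the port completes a TLS handshake at all. A minimal sketch using Python's standard `ssl` and `socket` modules; `speaks_tls` is a hypothetical helper, and certificate verification is deliberately switched off, so a True result says nothing about whether the certificates are trustworthy. If SSL_REQUIRE_CERT=true, the server may reject a client that presents no certificate, in which case you would need to load one with `context.load_cert_chain(...)` (an assumption about your setup).

```python
import socket
import ssl


def speaks_tls(host, port, timeout=5.0):
    """Return True if host:port completes a TLS handshake, False otherwise.

    Certificate checks are switched off on purpose: this only asks whether
    the endpoint speaks TLS at all, not whether it should be trusted.
    """
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before CERT_NONE
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:  # covers ssl.SSLError, refused connections and timeouts
        return False
```

Against a master started with SSL_ENABLED=true and SSL_SUPPORT_DOWNGRADE=false, `speaks_tls("172.19.10.111", 5050)` should return True while a plain `curl http://...` fails; if the function returns False, the master is still serving plain HTTP.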

On Mon, Aug 10, 2015 at 4:43 AM, Dharmit Shah  wrote:

> Hi Jeff,
>
> Thanks for the suggestion.
>
> I modified the systemd service file to use
> `/etc/sysconfig/mesos-master` and `/etc/sysconfig/mesos-slave` as
> environment files for master and slave services respectively. In these
> files, I specified the environment variables that I used to specify on
> the command line.
>
> Now if I check `strings /proc//environ | grep SSL` for pids of
> master and slave services, I see the environment variables that I set
> in the /etc/sysconfig/.
>
> Now that it looks like I have started the master and slave services
> with SSL enabled, how do I really confirm that communication between
> master and slaves is really happening over SSL?
>
> Also, how do I enable SSL communication for a framework like Marathon?
>
> Regards,
> Dharmit.
>
> On Fri, Aug 7, 2015 at 10:56 PM, Jeff Schroeder
>  wrote:
> > The sudo command defaults to env_reset (look for that in the man page),
> > which strips all env variables sans a select few. I'd almost bet that your
> > SSL_* variables are not present and were not passed to the slave. Just
> > sudo -i and start the slaves *as root* without sudo. There is no benefit
> > to starting them with sudo. You can verify what I'm saying with something
> > along the lines of:
> >
> > strings /proc/$(pidof mesos-slave)/environ | grep ^SSL_
> >
> >
> > On Friday, August 7, 2015, Dharmit Shah  wrote:
> >>
> >> Hello again,
> >>
> >> Thanks for your responses. I will share what I tried after your
> >> suggestions.
> >>
> >> 1. `ldd /usr/sbin/mesos-master` and `ldd /usr/sbin/mesos-slave`
> >> returned similar output as one suggested by Craig. So, I guess, the
> >> Mesosphere repo binaries have SSL enabled. Right?
> >>
> >> 2. I created an SSL private key and cert on one system in my cluster by
> >> referring to this guide on DigitalOcean [1]. Admittedly, my knowledge of
> >> SSL is limited.
> >>
> >> 3. Next, I copied the key and cert to all three mesos-master nodes and
> >> four mesos-slave nodes. Shouldn't slave nodes be provided only with
> >> the cert and not the private key? Whereas all master nodes may have
> >> the private key and cert both. Or am I understanding SSL incorrectly
> >> here?
> >>
> >> 4. After copying the cert and key, I started the mesos-master service
> >> on the master nodes with the command below:
> >>
> >> $ sudo SSL_ENABLED=true SSL_KEY_FILE=~/ssl/mesos.key
> >> SSL_CERT_FILE=~/ssl/mesos.crt /usr/sbin/mesos-master
> >> --zk=zk://172.19.10.111:2181,172.19.10.112:2181,172.19.10.193:2181/mesos
> >> --port=5050 --log_dir=/var/log/mesos --acls=file:///root/acls.json
> >> --credentials=/home/isys/mesos --quorum=2 --work_dir=/var/lib/mesos
> >>
> >> I checked the web UI and things look good. I am not completely sure
> >> whether "https" should have worked for the Mesos web UI, but it didn't.
> >>
> >> 5. Next, I started the slave nodes with the command below:
> >>
> >>   $ sudo SSL_ENABLED=true SSL_CERT_FILE=~/mesos.crt
> >> SSL_KEY_FILE=~/mesos.key /usr/sbin/mesos-slave
> >> --master=zk://172.19.10.111:2181,172.19.10.112:2181,172.19.10.193:2181/mesos
> >> --log_dir=/var/log/mesos --containerizers=docker,mesos
> >> --executor_registration_timeout=15mins
> >>
> >> Mesos web UI reported four mesos-slave nodes in "Activated" mode. So
> >> far so good. I am still wondering how I should verify if communication
> >> is happening over SSL.
> >>
> >> 6. To check if SSL is indeed working, I stopped one slave node and
> >> started it without SSL using `systemctl start mesos-slave`. I was
> >> expecting it to not get into "Activated" state on Mesos web UI but it
> >> did. So, I think SSL is not configured properly by me.
> >>
> >> I am attaching logs from the master nodes. These logs were generated
> >> after starting masters with command specified in point 4.
> >>
> >> Let me know if I am doing something wrong or if you need more logs or
> >> need me to execute some specific commands.
> >>
> >> [1] https://www.digitalocean.com/community/tutorials/openssl-essentials-working-with-ssl-certificates-private-keys-and-csrs
> >>
> >> Regards,
> >> Dharmit.
> >>
> >> On Fri, Aug 7, 2015 at 2:52 AM, Michael Park  wrote:
> >> > Hi Dharmit,
> >> >
> >> > I'm not certain whether the Mesosphere deb packages have SSL enabled or
> >> > not, although based on Craig's observation it looks like they do.
> >> >
> >> > I think the correct way to enable SSL is to set the SSL_ENABLED environment
> >> > variable, ra
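Jeff's `strings /proc/$(pidof mesos-slave)/environ | grep ^SSL_` check can be done the same way from Python on Linux. A small sketch: `/proc/<pid>/environ` holds NUL-separated `KEY=value` entries reflecting the environment the process was started with, and reading another user's process generally requires root. `parse_environ` and `ssl_environment` are hypothetical helper names.

```python
import os


def parse_environ(raw, prefix=""):
    """Parse NUL-separated KEY=value bytes into a dict, keeping keys with `prefix`."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue
        key, sep, value = entry.decode("utf-8", errors="replace").partition("=")
        if sep and key.startswith(prefix):
            env[key] = value
    return env


def ssl_environment(pid):
    """Return the SSL_* variables found in /proc/<pid>/environ (Linux only)."""
    with open(os.path.join("/proc", str(pid), "environ"), "rb") as f:
        return parse_environ(f.read(), prefix="SSL_")
```

Pass it the PID printed by `pidof mesos-slave` (or `pidof mesos-master`); an empty dict means the SSL_* variables never reached the process, which is the sudo env_reset symptom Jeff describes.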

Re: SSL in Mesos 0.23

2015-08-25 Thread Joris Van Remoortere
@carlos
Are you building 0.23.0 from source?
Just so we don't miss anything: Can you make sure to run ./bootstrap, and
build in a clean directory with your configuration similar to this:

../configure --enable-libevent --enable-ssl

Here  is the
document I am using as a reference

When you start up a master, if you just specify SSL_ENABLED=true it should
error out and notify you that other required flags such as SSL_KEY_FILE are
not provided. Can you verify this? If that is not happening, then the 2
options are:
1. Your environment variables are not making it to the binary: See Jeff
Schroeder's comments
2. The binary is not actually the one you expect. Double-check the checksum
with the binary you built after configuring with SSL.
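The first check, that SSL_ENABLED=true without a key file should make the master refuse to start, can be approximated before launching anything, which helps separate "the variables are wrong" from "the binary was not built with SSL". A minimal Python sketch, assuming the variable names used in this thread and that both SSL_KEY_FILE and SSL_CERT_FILE are required once SSL is enabled; `check_ssl_env` is a hypothetical helper, not part of Mesos, and it cannot tell whether the binary was built with --enable-ssl.

```python
import os


def check_ssl_env(env):
    """Return a list of problems with the SSL_* settings in `env` (empty if none).

    Only "true" is treated as enabling SSL here, matching the values used in
    this thread; the Mesos flag parser may accept other spellings.
    """
    if env.get("SSL_ENABLED", "").strip().lower() != "true":
        return ["SSL_ENABLED is not set to true"]
    problems = []
    for name in ("SSL_KEY_FILE", "SSL_CERT_FILE"):
        path = env.get(name)
        if not path:
            problems.append(f"{name} is not set")
        elif not os.path.isfile(path):
            problems.append(f"{name}={path} does not point to a file")
    return problems
```

Running `check_ssl_env(os.environ)` in the same shell or service environment that launches mesos-master shows whether the settings are at least present; if it reports no problems but the master still starts without complaint and serves plain HTTP, option 2 (the wrong binary) becomes the likelier explanation.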



On Fri, Aug 14, 2015 at 12:55 PM, Carlos Sanchez  wrote:

> looking forward to it, thanks!
> running out of ideas here on what I am doing wrong
>
> On Fri, Aug 14, 2015 at 6:53 PM, Marco Massenzio wrote:
> > FYI - Joris is out this week, he'll probably be able to get back to you
> > early next week (modulo MesosCon craziness :)
> >
> > Marco Massenzio
> > Distributed Systems Engineer
> >
> > On Fri, Aug 14, 2015 at 9:14 AM, Carlos Sanchez wrote:
> >>
> >> no suggestions?
> >>
> >> On Tue, Aug 11, 2015 at 6:47 PM, Vinod Kone wrote:
> >> > @joris, can you help out here?
> >> >
> >> > On Tue, Aug 11, 2015 at 9:43 AM, Carlos Sanchez wrote:
> >> >>
> >> >> I have tried to enable SSL with no success, even compiling from
> >> >> source with the SSL flags --enable-libevent --enable-ssl
> >> >>
> >> >> export SSL_ENABLED=true
> >> >> export SSL_SUPPORT_DOWNGRADE=false
> >> >> export SSL_REQUIRE_CERT=true
> >> >> export SSL_CERT_FILE=/etc/mesos/...
> >> >> export SSL_KEY_FILE=/etc/mesos/...
> >> >> export SSL_CA_FILE=/etc/mesos/...
> >> >>
> >> >>
> >> >> /home/ubuntu/mesos-deb-packaging/mesos-repo/build/src/mesos-master
> >> >> --work_dir="/var/lib/mesos"
> >> >>
> >> >> Port 5050 is still served as plain http, no SSL
> >> >>
> >> >> Nothing about ssl shows up in the logs, any ideas?
> >> >>
> >> >> Thanks
> >> >>
> >> >>
> >> >> >
> >> >> > From: Dharmit Shah 
> >> >> > To: user@mesos.apache.org
> >> >> > Cc:
> >> >> > Date: Mon, 10 Aug 2015 14:13:04 +0530
> >> >> > Subject: Re: SSL in Mesos 0.23
> >> >> > Hi Jeff,
> >> >> >
> >> >> > Thanks for the suggestion.
> >> >> >
> >> >> > I modified the systemd service file to use
> >> >> > `/etc/sysconfig/mesos-master` and `/etc/sysconfig/mesos-slave` as
> >> >> > environment files for master and slave services respectively. In
> >> >> > these files, I specified the environment variables that I used to
> >> >> > specify on the command line.
> >> >> >
> >> >> > Now if I check `strings /proc//environ | grep SSL` for pids of
> >> >> > master and slave services, I see the environment variables that I
> >> >> > set in the /etc/sysconfig/.
> >> >> >
> >> >> > Now that it looks like I have started the master and slave services
> >> >> > with SSL enabled, how do I really confirm that communication
> >> >> > between master and slaves is really happening over SSL?
> >> >> >
> >> >> > Also, how do I enable SSL communication for a framework like
> >> >> > Marathon?
> >> >> >
> >> >> > Regards,
> >> >> > Dharmit.
> >> >> >
> >> >> > On Fri, Aug 7, 2015 at 10:56 PM, Jeff Schroeder
> >> >> >  wrote:
> >> >> > > The sudo command defaults to env_reset (look for that in the man
> >> >> > > page), which strips all environment variables except a select
> >> >> > > few. I'd almost bet that your SSL_* variables are not present and
> >> >> > > were not passed to the slave. Just `sudo -i` and start the slaves
> >> >> > > *as root* without sudo. There is no benefit to starting them with
> >> >> > > sudo. You can verify what I'm saying with something along the
> >> >> > > lines of:
> >> >> > >
> >> >> > > strings /proc/$(pidof mesos-slave)/environ | grep ^SSL_
> >> >> > >
> >> >> > >
> >> >> > > On Friday, August 7, 2015, Dharmit Shah 
> >> >> > > wrote:
> >> >> > >>
> >> >> > >> Hello again,
> >> >> > >>
> >> >> > >> Thanks for your responses. I will share what I tried after your
> >> >> > >> suggestions.
> >> >> > >>
> >> >> > >> 1. `ldd /usr/sbin/mesos-master` and `ldd /usr/sbin/mesos-slave`
> >> >> > >> returned output similar to the one suggested by Craig. So, I
> >> >> > >> guess, the Mesosphere repo binaries have SSL enabled. Right?
> >> >> > >>
> >> >> > >> 2. I created an SSL private key and cert on one system in my
> >> >> > >> cluster by referring to this guide on DO [1]. Admittedly, my
> >> >> > >> knowledge of SSL is limited.
> >> >> > >>
> >> >> > >> 3. Next, I copied the key and cert to all three mesos-master
> >> >> > >> nodes and four mesos-slave nodes. Shouldn't slave nodes be
> >> >> > >> provided only with the cert and not the private k

Re: Allocation algorithm

2015-08-25 Thread Vinod Kone
The hierarchical allocator looks at one agent's resources at a time. For
each agent, it runs DRF to figure out the candidate framework.

More details here:

https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L935

Regarding the starvation you observed: yes, that is possible with DRF. We
plan to address this with optimistic offers (not implemented yet) and quotas
(work in progress).
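A toy model of the loop described above may help with Hans's starvation question. It is a hypothetical simplification and not the HierarchicalAllocator code linked above. It has no roles, weights, filters or declined offers, and it assumes the chosen framework accepts everything on the agent. The function names are illustrative only.

```python
from typing import Dict, List, Tuple

Resources = Dict[str, float]


def dominant_share(allocated: Resources, total: Resources) -> float:
    """Largest fraction of any single cluster resource that a framework holds."""
    return max((allocated.get(name, 0.0) / amount for name, amount in total.items() if amount > 0),
               default=0.0)


def drf_offers(total: Resources,
               free_by_agent: Dict[str, Resources],
               allocations: Dict[str, Resources]) -> List[Tuple[str, str]]:
    """Walk agents one at a time and offer each agent's free resources to the
    framework with the lowest dominant share (ties broken by name)."""
    alloc = {fw: dict(res) for fw, res in allocations.items()}  # don't mutate the caller's data
    offers: List[Tuple[str, str]] = []
    for agent in sorted(free_by_agent):
        chosen = min(alloc, key=lambda fw: (dominant_share(alloc[fw], total), fw))
        offers.append((agent, chosen))
        # Toy assumption: the framework accepts the whole offer.
        for name, amount in free_by_agent[agent].items():
            alloc[chosen][name] = alloc[chosen].get(name, 0.0) + amount
    return offers


total = {"cpus": 10, "mem": 100}
allocations = {"fw1": {"cpus": 1, "mem": 60},   # dominant share 0.6 (memory)
               "fw2": {"cpus": 2, "mem": 10},   # dominant share 0.2
               "fw3": {"cpus": 1, "mem": 10}}   # dominant share 0.1
free = {"a1": {"cpus": 2, "mem": 5}, "a2": {"cpus": 2, "mem": 5}}
print(drf_offers(total, free, allocations))  # [('a1', 'fw3'), ('a2', 'fw2')]
```

In this example, fw1 holds 60% of the memory and is never picked, because the other two frameworks stay below its dominant share. Hans describes the same pattern.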


On Mon, Aug 24, 2015 at 8:42 AM, Hans van den Bogert 
wrote:

> Can anyone tell me how the Mesos allocation algorithm works?
> Does Mesos offer every free resource it has to one framework at a time? Or
> does the allocator divide the max offer size by the number of
> active/registered frameworks?
>
> And what happens in this case:
> FW1 has a high dominant resource fraction (>50%), which it does not
> release. FW2 and FW3 have a lot of churn for their tasks; both have
> outstanding short-lived tasks in their queues (shorter than the Mesos
> allocation interval), and these 2 FWs accept all resources Mesos has to
> offer, if they get the offer.
>
> Reading the DRF paper and presentation, am I to assume the online DRF
> algorithm would always favour FW2 and FW3 over FW1? One of the two
> (FW2/FW3) will always (or at least is more likely to) have a lower dominant
> share than FW1. According to the presentation on DRF, the framework with
> the lowest dominant share gets the offer. But this can lead to starvation,
> e.g., if a framework has allocated memory but needs a new offer with CPUs
> to actually do something. You might wonder why the framework didn’t use
> memory AND CPU from the same offer, but Spark, for example, does exactly
> this.
>
> To give some context, I think I’m seeing this behaviour with Spark in
> fine-grained mode. I have 4 Spark instances which are long-lived, emulating
> interactive queries. The first Spark instance to get an offer “installs”
> executors (with high memory demand) on every slave node it sees. The next
> framework tries to do the same, but for these later instances, there's not
> always enough executor memory. That’s why I end up with an instance, which
> was first to get the offer, holding a lot of memory it doesn’t let go, but
> it also gets far fewer offers for CPU afterwards. In contrast, the later
> Spark instances with fewer long-lived executors do not have high memory
> usage, and get relatively more CPU offers.
> Of course, setting a maximum number of Spark executors per framework
> instance would mitigate this, but then I’m basically back to static
> allocation of resources.
>
> Thanks in advance,
>
> Hans
>
>
>
>
>


Re: Are the resource options documented?

2015-08-25 Thread haosdent
There is also a "disk" resource. The resource options are documented in
attributes-resources.md:
https://github.com/apache/mesos/blob/master/docs/attributes-resources.md

On Wed, Aug 26, 2015 at 1:31 AM, craig w  wrote:

> When configuring a mesos-slave with "--resources", I know "cpu", "mem" and
> "ports" are available. Are there others? Are these documented somewhere?
>
> I've found some examples here
> https://open.mesosphere.com/reference/mesos-slave/ and the configuration
> page (http://mesos.apache.org/documentation/latest/configuration/) is
> generic with its description of "--resources".
>
> Thanks
> craig
>



-- 
Best Regards,
Haosdent Huang


Re: Are the resource options documented?

2015-08-25 Thread Alex Rukletsov
From Mesos' point of view, a resource is just a string: your agents may
advertise "gpu", "bananas", "pandas" and so on. However, some resources are
known to Mesos, and for those, isolation is possible. A good example is the
cgroups isolator for "mem" resources, which will invoke the OOM killer if
necessary. Compare this with GPU resources: if your agent advertises, say,
1GB of "gpu" to the master, a task may accept 100MB, but the agent has no
way to enforce that the task uses no more than 100MB, because there is no
isolator for this resource. The good news is that you can write an isolator
for your resource, wrap it in a Mesos module, and let the Mesos agent use it!

P.S. "cpu" is not a known resource, but "cpus" is.
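The sketch below makes this point concrete by breaking a `--resources` string into its parts. It assumes the `name(role):value;...` syntax described in attributes-resources.md, with scalars, `[lo-hi,...]` ranges and `{a,b}` sets. `parse_resources` and `custom_names` are hypothetical helpers, not Mesos's own parser. `STANDARD` lists only the names Mesos predefines. Whether a given resource is actually enforced still depends on which isolators the agent runs.

```python
import re
from typing import Dict, List, Set, Tuple, Union

Value = Union[float, List[Tuple[int, int]], List[str]]

# Names Mesos predefines; anything else is just an opaque string to the master.
STANDARD = {"cpus", "mem", "disk", "ports"}

_KEY = re.compile(r"^\s*([^():]+?)\s*(?:\(([^)]*)\))?\s*$")


def parse_resources(spec: str) -> Dict[Tuple[str, str], Value]:
    """Parse 'name(role):value;...' into {(name, role): value}; role defaults to '*'."""
    parsed: Dict[Tuple[str, str], Value] = {}
    for item in filter(None, (s.strip() for s in spec.split(";"))):
        key, sep, raw = item.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in {item!r}")
        match = _KEY.match(key)
        if not match:
            raise ValueError(f"bad resource name {key!r}")
        name, role = match.group(1), match.group(2) or "*"
        raw = raw.strip()
        if raw.startswith("[") and raw.endswith("]"):      # ranges, e.g. ports
            value: Value = [tuple(int(x) for x in part.split("-")) for part in raw[1:-1].split(",")]
        elif raw.startswith("{") and raw.endswith("}"):    # sets
            value = [s.strip() for s in raw[1:-1].split(",")]
        else:                                              # scalars
            value = float(raw)
        parsed[(name, role)] = value
    return parsed


def custom_names(parsed: Dict[Tuple[str, str], Value]) -> Set[str]:
    """Resource names outside the predefined set (no built-in isolation for these)."""
    return {name for name, _role in parsed if name not in STANDARD}
```

For instance, `custom_names(parse_resources("cpu:4"))` returns `{"cpu"}`. That result flags the same typo pointed out in the P.S. above.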

On Tue, Aug 25, 2015 at 7:31 PM, craig w  wrote:

> When configuring a mesos-slave with "--resources", I know "cpu", "mem" and
> "ports" are available. Are there others? Are these documented somewhere?
>
> I've found some examples here
> https://open.mesosphere.com/reference/mesos-slave/ and the configuration
> page (http://mesos.apache.org/documentation/latest/configuration/) is
> generic with its description of "--resources".
>
> Thanks
> craig
>


Re: Mesos/Marathon/HAProxy Logging

2015-08-25 Thread Jeff Schroeder
This is the header that should be passed:

https://en.m.wikipedia.org/wiki/X-Forwarded-For

Most of the modern internet routes through reverse proxies, and this is how
we log the actual client source addresses to meet similar auditing and
compliance needs.
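Once the proxy adds the header, the app still has to pick the client out of it. The sketch below is one minimal way to do that. It assumes HAProxy runs in HTTP mode and appends the header (for example via its `option forwardfor` directive), and that you know your HAProxy nodes' addresses. `client_ip` and the addresses are hypothetical. Only hops appended by trusted proxies are skipped, because a client can send its own forged X-Forwarded-For value. As John notes, this only helps for HTTP traffic.

```python
from typing import Iterable, Optional


def client_ip(xff: Optional[str], peer_ip: str, trusted_proxies: Iterable[str]) -> str:
    """Return the original client address for a request.

    Walk the X-Forwarded-For chain from right to left, skipping hops added by
    proxies we trust; the first untrusted address is taken as the client.
    """
    trusted = set(trusted_proxies)
    if peer_ip not in trusted:
        # The connection did not come through our proxy, so the header can't be trusted.
        return peer_ip
    hops = [hop.strip() for hop in (xff or "").split(",") if hop.strip()]
    for hop in reversed(hops):
        if hop not in trusted:
            return hop
    # Every hop was one of our proxies (or there was no header at all).
    return hops[0] if hops else peer_ip
```

In a Python webservice, `xff` would come from the request's `X-Forwarded-For` header and `peer_ip` from the socket's remote address.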

On Tuesday, August 25, 2015, John Omernik  wrote:

> I have been playing with an application that is a very simple app: a
> webservice running in Python. I've created a Docker container, it runs in
> the container, I set up Marathon to run it, I use mesos-dns and HAProxy, and
> I can access the service just fine anywhere in the cluster.
>
> First let me say this is VERY cool. The capabilities here are awesome.
>
> Now the challenge: the security guy in me wants to take good logs from my
> app. It was set up to do its own logging through a custom module. I am
> very happy with it. I set up the app in the container to mount a volume
> that's in my MapRFS via NFS so I can log directly to a clustered
> filesystem. This is awesome, I can read my logs in Apache Drill as they are
> written!!!
>
> However, HAProxy threw me for a loop. Once I started running the app
> in Marathon with a service port and routed it through HAProxy, I realized
> something: I lost my source IPs in my logs.
>
> Why?
>
> Because once HAProxy takes over, it no longer needs to keep the source IP,
> and the next hop only sees the previous connection's IP. From a
> service discovery perspective it works great, but with this setup, I lose
> the previous hop. I could manually configure HAProxy to add an
> X-Forwarded-For header, which would be nice; however, that only works for
> HTTP apps. What about other TCP apps that are not HTTP?
>
> This is an interesting problem, because apps should have good logging for
> security, performance, and troubleshooting, and if I can't get the source
> IP, that could be a problem.
>
> So, my question is this: has anyone run into this? How are you handling it?
> Any brainstorms here we may be able to work off of?
>
> One thing I thought was: why are we using HAProxy at all? Couldn't the same
> HAProxy script put forwarding rules into iptables instead? This sounds
> messy, but could it work? Has anyone explored that? If the data was
> forwarded, then it wouldn't lose the IP information (and timeouts wouldn't
> be a concern either; I think I posted before about how HAProxy can close
> long-running TCP connections if they don't implement TCP keepalives).
>
> Other ideas? This is interesting to me, and likely to others.
>


-- 
Text by Jeff, typos by iPhone


Are the resource options documented?

2015-08-25 Thread craig w
When configuring a mesos-slave with "--resources", I know "cpu", "mem" and
"ports" are available. Are there others? Are these documented somewhere?

I've found some examples here
https://open.mesosphere.com/reference/mesos-slave/ and the configuration
page (http://mesos.apache.org/documentation/latest/configuration/) is
generic with its description of "--resources".

Thanks
craig


Re: Mesos/Marathon/HAProxy Logging

2015-08-25 Thread Ankur Chauhan
This may help: 

http://serverfault.com/questions/331079/haproxy-and-forwarding-client-ip-address-to-servers

We use similar options to ensure we have the remote ip.
> On 25 Aug 2015, at 09:30, John Omernik  wrote:
> 
> I have been playing with an application that is a very simple app: a
> webservice running in Python. I've created a Docker container, it runs in
> the container, I set up Marathon to run it, I use mesos-dns and HAProxy, and
> I can access the service just fine anywhere in the cluster.
>
> First let me say this is VERY cool. The capabilities here are awesome.
>
> Now the challenge: the security guy in me wants to take good logs from my
> app. It was set up to do its own logging through a custom module. I am
> very happy with it. I set up the app in the container to mount a volume
> that's in my MapRFS via NFS so I can log directly to a clustered
> filesystem. This is awesome, I can read my logs in Apache Drill as they are
> written!!!
>
> However, HAProxy threw me for a loop. Once I started running the app
> in Marathon with a service port and routed it through HAProxy, I realized
> something: I lost my source IPs in my logs.
>
> Why?
>
> Because once HAProxy takes over, it no longer needs to keep the source IP,
> and the next hop only sees the previous connection's IP. From a
> service discovery perspective it works great, but with this setup, I lose
> the previous hop. I could manually configure HAProxy to add an
> X-Forwarded-For header, which would be nice; however, that only works for
> HTTP apps. What about other TCP apps that are not HTTP?
>
> This is an interesting problem, because apps should have good logging for
> security, performance, and troubleshooting, and if I can't get the source
> IP, that could be a problem.
>
> So, my question is this: has anyone run into this? How are you handling it?
> Any brainstorms here we may be able to work off of?
>
> One thing I thought was: why are we using HAProxy at all? Couldn't the same
> HAProxy script put forwarding rules into iptables instead? This sounds
> messy, but could it work? Has anyone explored that? If the data was
> forwarded, then it wouldn't lose the IP information (and timeouts wouldn't
> be a concern either; I think I posted before about how HAProxy can close
> long-running TCP connections if they don't implement TCP keepalives).
>
> Other ideas? This is interesting to me, and likely to others.



Mesos/Marathon/HAProxy Logging

2015-08-25 Thread John Omernik
I have been playing with an application that is a very simple app: a
webservice running in Python. I've created a Docker container, it runs in
the container, I set up Marathon to run it, I use mesos-dns and HAProxy, and
I can access the service just fine anywhere in the cluster.

First let me say this is VERY cool. The capabilities here are awesome.

Now the challenge: the security guy in me wants to take good logs from my
app. It was set up to do its own logging through a custom module. I am
very happy with it. I set up the app in the container to mount a volume
that's in my MapRFS via NFS so I can log directly to a clustered
filesystem. This is awesome, I can read my logs in Apache Drill as they are
written!!!

However, HAProxy threw me for a loop. Once I started running the app
in Marathon with a service port and routed it through HAProxy, I realized
something: I lost my source IPs in my logs.

Why?

Because once HAProxy takes over, it no longer needs to keep the source IP,
and the next hop only sees the previous connection's IP. From a
service discovery perspective it works great, but with this setup, I lose
the previous hop. I could manually configure HAProxy to add an
X-Forwarded-For header, which would be nice; however, that only works for
HTTP apps. What about other TCP apps that are not HTTP?

This is an interesting problem, because apps should have good logging for
security, performance, and troubleshooting, and if I can't get the source
IP, that could be a problem.

So, my question is this: has anyone run into this? How are you handling it?
Any brainstorms here we may be able to work off of?

One thing I thought was: why are we using HAProxy at all? Couldn't the same
HAProxy script put forwarding rules into iptables instead? This sounds
messy, but could it work? Has anyone explored that? If the data was
forwarded, then it wouldn't lose the IP information (and timeouts wouldn't
be a concern either; I think I posted before about how HAProxy can close
long-running TCP connections if they don't implement TCP keepalives).

Other ideas? This is interesting to me, and likely to others.


Re: Not getting resource offers for 20 min

2015-08-25 Thread CCAAT

THANKS, as I have not kept up on the spark lists

James


On 08/25/2015 04:28 AM, Iulian Dragoș wrote:



On Mon, Aug 24, 2015 at 7:16 PM, CCAAT <cc...@tampabay.rr.com> wrote:

On 08/24/2015 05:33 AM, Iulian Dragoș wrote:


Hello Iulian,

Ok, so I eventually built Spark from 100% sources, after some
intermediate builds on Gentoo. Gentoo is not the best platform for
Java development, but those issues related to Spark builds are
slowly being fixed on Gentoo. Where (how) do you download the
complete spark-1.5.x source tree, as it does not seem to be available
on this page:

http://spark.apache.org/downloads.html


It's not yet a final release, but there's a preview:

http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-ANNOUNCE-Spark-1-5-0-preview-package-td13683.html

Building Spark from sources isn't too hard; there's a
`make-distribution.sh` script in the root directory. There are a few
parameters (like the Hadoop dependency version), but it should be fairly
straightforward. More info here:

http://spark.apache.org/docs/latest/building-spark.html

iulian



Any other related information or tips on building out spark from sources
are keenly received.

James

Unfortunately I don't have access to the cluster anymore, but I think
Chronos wasn't the culprit. After updating Spark to 1.5 and setting a
framework role, offers started to come in (while still using Chronos).

iulian


Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com  










Re: Not getting resource offers for 20 min

2015-08-25 Thread Iulian Dragoș
On Mon, Aug 24, 2015 at 7:16 PM, CCAAT  wrote:

> On 08/24/2015 05:33 AM, Iulian Dragoș wrote:
>
>
> Hello Iulian,
>
> Ok, so I eventually built Spark from 100% sources, after some intermediate
> builds on Gentoo. Gentoo is not the best platform for Java development,
> but those issues related to Spark builds are slowly being fixed on Gentoo.
> Where (how) do you download the complete spark-1.5.x source tree, as it
> does not seem to be available on this page:
>
> http://spark.apache.org/downloads.html


It's not yet a final release, but there's a preview:

http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-ANNOUNCE-Spark-1-5-0-preview-package-td13683.html

Building Spark from sources isn't too hard; there's a
`make-distribution.sh` script in the root directory. There are a few
parameters (like the Hadoop dependency version), but it should be fairly
straightforward. More info here:

http://spark.apache.org/docs/latest/building-spark.html

iulian


>
>
> Any other related information or tips on building out spark from sources
> are keenly received.
>
> James
>
>> Unfortunately I don't have access to the cluster anymore, but I think
>> Chronos wasn't the culprit. After updating Spark to 1.5 and setting a
>> framework role, offers started to come in (while still using Chronos).
>>
>> iulian
>>
>
> Iulian Dragos
>>
>> --
>> Reactive Apps on the JVM
>> www.typesafe.com 
>>
>>
>


-- 

--
Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com


Re: Allocation algorithm

2015-08-25 Thread Iulian Dragoș
On Mon, Aug 24, 2015 at 5:42 PM, Hans van den Bogert 
wrote:

> Can anyone tell me how the Mesos allocation algorithm works?
> Does Mesos offer every free resource it has to one framework at a time? Or
> does the allocator divide the max offer size by the number of
> active/registered frameworks?
>
> And what happens in this case:
> FW1 has a high dominant resource fraction (>50%), which it does not
> release. FW2 and FW3 have a lot of churn for their tasks; both have
> outstanding short-lived tasks in their queues (shorter than the Mesos
> allocation interval), and these 2 FWs accept all resources Mesos has to
> offer, if they get the offer.
>
> Reading the DRF paper and presentation, am I to assume the online DRF
> algorithm would always favour FW2 and FW3 over FW1? One of the two
> (FW2/FW3) will always (or at least is more likely to) have a lower dominant
> share than FW1. According to the presentation on DRF, the framework with
> the lowest dominant share gets the offer. But this can lead to starvation,
> e.g., if a framework has allocated memory but needs a new offer with CPUs
> to actually do something. You might wonder why the framework didn’t use
> memory AND CPU from the same offer, but Spark, for example, does exactly
> this.
>

I'd love to learn more from Mesos devs about the allocation algorithm. In
my limited understanding, you are correct.


>
> To give some context, I think I’m seeing this behaviour with Spark in
> fine-grained mode. I have 4 Spark instances which are long-lived, emulating
> interactive queries. The first Spark instance to get an offer “installs”
> executors (with high memory demand) on every slave node it sees. The next
> framework tries to do the same, but for these later instances, there's not
> always enough executor memory. That’s why I end up with an instance, which
> was first to get the offer, holding a lot of memory it doesn’t let go, but
> it also gets far fewer offers for CPU afterwards. In contrast, the later
> Spark instances with fewer long-lived executors do not have high memory
> usage, and get relatively more CPU offers.
> Of course, setting a maximum number of Spark executors per framework
> instance would mitigate this, but then I’m basically back to static
> allocation of resources.
>

I've seen similar behavior with Spark's fine-grained mode. See my thread
from a couple of days ago.

I would recommend using coarse-grained mode with dynamic allocation
(available in the upcoming 1.5 release). We worked around this by using
Mesos roles and assigning Spark to a specific role. It seems Mesos will
allocate resources based on roles, if configured. Unfortunately,
`spark.mesos.role` is a new configuration parameter to be added in 1.5 as
well, so we needed to use the Spark 1.5 preview.
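The settings mentioned above can be collected into a `spark-defaults.conf`-style snippet, as in the minimal sketch below. `spark.mesos.coarse`, `spark.dynamicAllocation.enabled`, `spark.shuffle.service.enabled` and `spark.mesos.role` are Spark configuration keys; the last one is only available from 1.5, as noted. The helper functions and the role name "spark" are hypothetical. The role generally also has to be allowed on the Mesos master (via its `--roles` flag in this release line).

```python
from typing import Dict


def spark_mesos_conf(role: str, dynamic_allocation: bool = True) -> Dict[str, str]:
    """Coarse-grained Mesos mode pinned to a Mesos role (hypothetical helper)."""
    conf = {
        "spark.mesos.coarse": "true",  # coarse-grained instead of fine-grained mode
        "spark.mesos.role": role,      # Spark 1.5+: register the framework under this role
    }
    if dynamic_allocation:
        # Dynamic allocation relies on the external shuffle service on each agent.
        conf["spark.dynamicAllocation.enabled"] = "true"
        conf["spark.shuffle.service.enabled"] = "true"
    return conf


def to_spark_defaults(conf: Dict[str, str]) -> str:
    """Render as whitespace-separated 'key value' lines, as spark-defaults.conf uses."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))


print(to_spark_defaults(spark_mesos_conf("spark")))
```

The same key/value pairs can also be passed one by one as `--conf key=value` arguments to `spark-submit`.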

iulian


>
> Thanks in advance,
>
> Hans
>
>
>
>
>


-- 

--
Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com


Re: Master UI - Tasks section is empty

2015-08-25 Thread Rogier Dikkes

Thank you, Craig; I just ran into this myself when updating to 0.23.

On 8/23/15 7:59 PM, craig w wrote:


See https://issues.apache.org/jira/browse/MESOS-3282

On Aug 23, 2015 12:28 PM, "Jeremy Olexa" wrote:


Hi all,

On a new cluster, the Tasks section of the left sidebar is
populated as jobs are staged, started, killed, etc. I've noticed
that after a rolling restart of the cluster (for example, taking a node
out for maintenance, or restarting instances in an ASG), the
Tasks section of the UI stops working. There are no longer any
values in the UI.

It appears that this part of the UI is in js/controllers.js, but I
don't understand the internals quite yet. Is this issue related to
MESOS-527? Any other insight into this problem?

Thanks,
Jeremy



--
Rogier Dikkes
Systeem Programmeur Hadoop & HPC Cloud
e-mail: rogier.dik...@surfsara.nl | M: +31 6 47 48 93 28
SURFsara | Science Park 140 | 1098 XG Amsterdam