Are the resource options documented?
When configuring a mesos-slave with --resources, I know cpu, mem and ports are available. Are there others? Are these documented somewhere? I've found some examples here https://open.mesosphere.com/reference/mesos-slave/ and the configuration page (http://mesos.apache.org/documentation/latest/configuration/) is generic in its description of --resources.
Thanks,
craig
Mesos/Marathon/HAProxy Logging
I have been playing with a very simple application: a web service running in Python. I've created a Docker container for it, I set up Marathon to run it, and I use mesos-dns and HAProxy, so I can access the service just fine from anywhere in the cluster. First, let me say this is VERY cool. The capabilities here are awesome.
Now the challenge: the security guy in me wants to take good logs from my app. It was set up to do its own logging through a custom module, and I am very happy with it. I set up the app in the container to mount a volume that's in my MapRFS via NFS so I can log directly to a clustered filesystem. This is awesome: I can read my logs in Apache Drill as they are written!
However, HAProxy threw me for a loop. Once I started running the app in Marathon with a service port and routing through HAProxy, I realized something: I had lost the source IPs in my logs. Why? Because once HAProxy takes over, it no longer needs to keep the source IP, and the next hop only sees the IP of the previous connection. From a service discovery perspective it works great, but with this setup I lose the previous hop. Perhaps I could manually configure HAProxy to add an X-Forwarded-For header. That would be nice, but it only works for HTTP apps; what about other TCP apps that are not HTTP?
This is an interesting problem, because apps should have good logging, security, performance, and troubleshooting, and if I can't get the source IP it could be a problem. So my question is this: has anyone run into this? How are you handling it? Any brainstorms we may be able to work off of?
One thing I thought was: why are we using HAProxy? Couldn't the same HAProxy script put forwarding rules into iptables instead? This sounds messy, but could it work? Has anyone explored that? If the data were forwarded, then it wouldn't lose the IP information, and timeouts wouldn't be a concern either (I think I posted before about how long-running TCP connections can be closed by HAProxy if they don't implement TCP keepalives).
Other ideas? This is interesting to me, and likely to others.
Re: Not getting resource offers for 20 min
THANKS, as I have not kept up on the spark lists
James
Re: Mesos/Marathon/HAProxy Logging
This is the header that should be passed: https://en.m.wikipedia.org/wiki/X-Forwarded-For
Most of the modern internet routes through reverse proxies and this is how we log the actual source clients to solve similar auditing and compliance needs.
-- Text by Jeff, typos by iPhone
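For HTTP services, the X-Forwarded-For approach from this reply could look roughly like the sketch below on the application side. It assumes the proxy appends the connecting client's address to the header (in HAProxy this is typically enabled with `option forwardfor`) and that the app knows the addresses of its own proxies. The function name `client_ip_from_xff` and the `trusted_proxies` parameter are made up for illustration; it does not help with non-HTTP TCP services.

```python
import ipaddress


def client_ip_from_xff(xff_header, peer_ip, trusted_proxies):
    """Pick the address to log as the client for one HTTP request.

    xff_header      -- raw X-Forwarded-For value, or None if absent
    peer_ip         -- address of the TCP peer (usually the proxy)
    trusted_proxies -- addresses of proxies we operate ourselves (assumption)
    """
    trusted = {ipaddress.ip_address(p) for p in trusted_proxies}
    peer = ipaddress.ip_address(peer_ip)

    # Only believe the header when the connection came from our own proxy;
    # otherwise any client could forge it.
    if peer not in trusted or not xff_header:
        return str(peer)

    # Header format is "client, proxy1, proxy2": each hop appends the
    # address it saw, so walk from the right and skip trusted proxies.
    hops = [h.strip() for h in xff_header.split(",") if h.strip()]
    for hop in reversed(hops):
        try:
            addr = ipaddress.ip_address(hop)
        except ValueError:
            break  # malformed entry: stop trusting the header
        if addr not in trusted:
            return str(addr)
    return str(peer)


print(client_ip_from_xff("203.0.113.7", "10.0.0.5", ["10.0.0.5"]))
```

Walking the header from the right is a deliberate choice: the leftmost entries are supplied by the client and can be forged, while the rightmost ones were added by proxies you control.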
Re: Are the resource options documented?
From Mesos's point of view, a resource is just a string: your agents may advertise gpu, bananas, pandas, and so on. However, some resources are known to Mesos, and for them isolation is possible. A good example is the cgroups isolator for mem resources, which will invoke the OOM killer if necessary. Compare this with GPU resources: if your agent advertises, say, 1GB of gpu to the master, a task may accept 100MB, but the agent has no control over whether the task uses no more than 100MB, because there is no isolator for this resource. The good news is that you can write an isolator for your resource, wrap it into a Mesos module, and let the Mesos agent use it!
P.S. cpu is not a known resource, but cpus is.
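To make the "a resource is just a string" point concrete, here is a minimal sketch that parses a --resources style value into Python data. It assumes the semicolon-separated `name:value` form with scalars, bracketed ranges and brace-delimited sets; the helper name `parse_resources` is made up, and role annotations such as `cpus(role):4` are deliberately not handled.

```python
def parse_resources(spec):
    """Parse 'name:value;name:value' into a dict of Python values."""
    resources = {}
    for item in filter(None, (part.strip() for part in spec.split(";"))):
        name, _, value = item.partition(":")
        name, value = name.strip(), value.strip()
        if value.startswith("["):
            # Ranges, e.g. ports:[31000-32000, 8080-8090]
            resources[name] = [
                tuple(int(n) for n in rng.split("-"))
                for rng in value.strip("[]").split(",")
            ]
        elif value.startswith("{"):
            # Sets, e.g. disks:{a,b,c}
            resources[name] = {v.strip() for v in value.strip("{}").split(",")}
        else:
            # Scalars, e.g. cpus:4 or mem:2048
            resources[name] = float(value)
    return resources


# "bananas" is accepted like any other name; there is simply no isolator for it.
print(parse_resources("cpus:4;mem:2048;ports:[31000-32000];bananas:3"))
```

Note that nothing in the parser distinguishes cpus from bananas, which mirrors the reply: the difference only appears when an isolator exists to enforce the limit.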
Re: Mesos/Marathon/HAProxy Logging
This may help: http://serverfault.com/questions/331079/haproxy-and-forwarding-client-ip-address-to-servers
We use similar options to ensure we have the remote ip.
Re: Custom Scheduler: Diagnosing cause of container task failures
It looks like we can have a better error message here. @Jay, mind filing a JIRA ticket with a description, the status update, and your fix attached? Thanks!
On Fri, Aug 21, 2015 at 7:36 PM, Jay Taylor j...@jaytaylor.com wrote:
Eventually I was able to isolate what was going on; in this case the FrameworkInfo.User was set to an invalid value and setting it to root did the trick. My scheduler is now working [in a basic form]!!!
Cheers,
Jay
On Thu, Aug 20, 2015 at 4:15 PM, Jay Taylor j...@jaytaylor.com wrote:
Hey Tim,
Thank you for the quick response! Just checked the sandbox logs and they are all empty (stdout and stderr are both 0 bytes). I have discovered a little bit more information from the StatusUpdate event posted back to my scheduler:
TaskStatus{
  TaskId: TaskID{ Value: *fluxCapacitor-test-1, XXX_unrecognized: [], },
  State: *TASK_FAILED,
  Message: *Abnormal executor termination,
  Source: *SOURCE_SLAVE,
  Reason: *REASON_COMMAND_EXECUTOR_FAILED,
  Data: nil,
  SlaveId: SlaveID{ Value: *20150804-211459-1407297728-5050-5855-S1, XXX_unrecognized: [], },
  ExecutorId: nil,
  Timestamp: *1.440112075509318e+09,
  Uuid: *[102 75 82 85 38 139 68 94 153 189 210 87 218 235 147 166],
  Healthy: nil,
  XXX_unrecognized: [],
}
How can I find out why the command executor is failing?
On Thu, Aug 20, 2015 at 4:08 PM, Tim Chen t...@mesosphere.io wrote:
It received a TASK_FAILED from the executor, so you'll need to look at the sandbox logs of your task (the stdout and stderr files) to see what went wrong. These files should be reachable from the Mesos UI.
Tim
On Thu, Aug 20, 2015 at 4:01 PM, Jay Taylor outtat...@gmail.com wrote:
Hey everyone,
I am writing a scheduler for Mesos and one of my first goals is to get a simple Docker container to run. The tasks get marked as failed, with the failure messages originating from the slave logs. Now I'm not sure how to determine exactly what is causing the failure.
The most informative log messages I've found were in the slave log:
== /var/log/mesos/mesos-slave.INFO ==
W0820 20:44:25.242230 29639 docker.cpp:994] Ignoring updating unknown container: e190037a-b011-4681-9e10-dcbacf6cb819
I0820 20:44:25.242270 29639 status_update_manager.cpp:322] Received status update TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
I0820 20:44:25.242377 29639 slave.cpp:2961] Forwarding the update TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060 to master@63.198.215.105:5050
I0820 20:44:25.247926 29636 status_update_manager.cpp:394] Received status update acknowledgement (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
I0820 20:44:25.248108 29636 slave.cpp:3502] Cleaning up executor 'jay-test-29' of framework 20150804-211741-1608624320-5050-18273-0060
I0820 20:44:25.248342 29636 slave.cpp:3591] Cleaning up framework 20150804-211741-1608624320-5050-18273-0060
And this doesn't really tell me much about *why* it's failed. Is there somewhere else I should be looking or an option that needs to be turned on to show more information? Your assistance is greatly appreciated!
Jay
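When digging through a slave log like the excerpt in this thread, it can help to pull out only the warnings, errors, and lines that mention one task. The sketch below is an illustration, not a Mesos tool: the regular expression is inferred from the log lines quoted above (level letter, MMDD date, time, thread id, source location, message), and the name `interesting_lines` is hypothetical.

```python
import re

# Line layout inferred from the quoted excerpt, e.g.
# "W0820 20:44:25.242230 29639 docker.cpp:994] Ignoring updating ..."
LOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<thread>\d+) (?P<source>[^\]]+)\] (?P<message>.*)$"
)


def interesting_lines(log_lines, task_id):
    """Keep warnings/errors plus any line that mentions task_id."""
    hits = []
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # not a log record (e.g. a file header line)
        if match.group("level") in "WEF" or task_id in match.group("message"):
            hits.append(
                (match.group("level"), match.group("source"), match.group("message"))
            )
    return hits


sample = [
    "== /var/log/mesos/mesos-slave.INFO ==",
    "W0820 20:44:25.242230 29639 docker.cpp:994] Ignoring updating unknown "
    "container: e190037a-b011-4681-9e10-dcbacf6cb819",
    "I0820 20:44:25.248342 29636 slave.cpp:3591] Cleaning up framework "
    "20150804-211741-1608624320-5050-18273-0060",
]
for level, source, message in interesting_lines(sample, "jay-test-29"):
    print(level, source, message)
```

As the thread shows, the log alone may still not name the root cause (here an invalid FrameworkInfo.User), so this only narrows down where to look.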
Re: Master UI - Tasks section is empty
Thank you Craig, just ran into this myself when updating to 0.23
On 8/23/15 7:59 PM, craig w wrote:
See https://issues.apache.org/jira/browse/MESOS-3282
On Aug 23, 2015 12:28 PM, Jeremy Olexa jol...@spscommerce.com wrote:
Hi all,
On a new cluster, the Tasks section of the left sidebar is populated as jobs are staged, started, killed, etc. I've noticed that after a rolling restart of the cluster (such as taking a node out for maintenance, or restarting instances in an ASG), the Tasks section of the UI stops working. There are no longer any values in the UI. It appears that this part of the UI is in js/controllers.js, but I don't understand the internals quite yet. Is this issue related to MESOS-527? Any other insight into this problem?
Thanks,
Jeremy
--
Rogier Dikkes
Systeem Programmeur Hadoop HPC Cloud
e-mail: rogier.dik...@surfsara.nl | M: +31 6 47 48 93 28
SURFsara | Science Park 140 | 1098 XG Amsterdam
Re: SSL in Mesos 0.23
@Dharmit If you want to be really sure that the communication is happening over SSL, you can use a packet-sniffing tool like Wireshark, or, depending on your operating system, dump the packet streams directly to a file, for example with tcpdump. Another thing you can do is try to hit the HTTP endpoints from curl using http as opposed to https. Remember that if you have SSL_SUPPORT_DOWNGRADE=true you should be able to connect even without SSL. If it is false (the default) you will not be able to connect.
On Mon, Aug 10, 2015 at 4:43 AM, Dharmit Shah shahdhar...@gmail.com wrote:
Hi Jeff,
Thanks for the suggestion. I modified the systemd service file to use `/etc/sysconfig/mesos-master` and `/etc/sysconfig/mesos-slave` as environment files for the master and slave services respectively. In these files, I specified the environment variables that I used to specify on the command line. Now if I check `strings /proc/pid/environ | grep SSL` for the pids of the master and slave services, I see the environment variables that I set in the /etc/sysconfig/environment-file.
Now that it looks like I have started the master and slave services with SSL enabled, how do I confirm that communication between master and slaves is really happening over SSL? Also, how do I enable SSL communication for a framework like Marathon?
Regards,
Dharmit.
On Fri, Aug 7, 2015 at 10:56 PM, Jeff Schroeder jeffschroe...@computer.org wrote:
The sudo command defaults to env_reset (look for that in the man page), which strips all env variables sans a select few. I'd almost bet that your SSL_* variables are not present and were not passed to the slave. Just sudo -i and start the slaves *as root* without sudo. There is no benefit to starting them with sudo. You can verify what I'm saying with something along the lines of:
strings /proc/$(pidof mesos-slave)/environ | grep ^SSL_
On Friday, August 7, 2015, Dharmit Shah shahdhar...@gmail.com wrote:
Hello again,
Thanks for your responses. I will share what I tried after your suggestions.
1. `ldd /usr/sbin/mesos-master` and `ldd /usr/sbin/mesos-slave` returned output similar to the one suggested by Craig. So, I guess, the Mesosphere repo binaries have SSL enabled. Right?
2. I created an SSL private key and cert on one system in my cluster by referring to this guide on DO [1]. Admittedly, my knowledge of SSL is limited.
3. Next, I copied the key and cert to all three mesos-master nodes and four mesos-slave nodes. Shouldn't slave nodes be provided only with the cert and not the private key, whereas all master nodes may have both the private key and the cert? Or am I understanding SSL incorrectly here?
4. After copying the cert and key, I started the mesos-master service on the master nodes with the command below:
$ sudo SSL_ENABLED=true SSL_KEY_FILE=~/ssl/mesos.key SSL_CERT_FILE=~/ssl/mesos.crt /usr/sbin/mesos-master --zk=zk://172.19.10.111:2181,172.19.10.112:2181,172.19.10.193:2181/mesos --port=5050 --log_dir=/var/log/mesos --acls=file:///root/acls.json --credentials=/home/isys/mesos --quorum=2 --work_dir=/var/lib/mesos
I checked the web UI and things look good. I am not completely sure whether https should have worked for the Mesos web UI, but it didn't.
5. Next, I started the slave nodes with the command below:
$ sudo SSL_ENABLED=true SSL_CERT_FILE=~/mesos.crt SSL_KEY_FILE=~/mesos.key /usr/sbin/mesos-slave --master=zk://172.19.10.111:2181,172.19.10.112:2181,172.19.10.193:2181/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --executor_registration_timeout=15mins
The Mesos web UI reported four mesos-slave nodes in Activated mode. So far so good. I am still wondering how I should verify that communication is happening over SSL.
6. To check if SSL is indeed working, I stopped one slave node and started it without SSL using `systemctl start mesos-slave`. I was expecting it not to get into the Activated state on the Mesos web UI, but it did. So, I think I have not configured SSL properly.
I am attaching logs from the master nodes. These logs were generated after starting the masters with the command specified in point 4. Let me know if I am doing something wrong or if you need more logs or need me to execute some specific commands.
[1] https://www.digitalocean.com/community/tutorials/openssl-essentials-working-with-ssl-certificates-private-keys-and-csrs
Regards,
Dharmit.
On Fri, Aug 7, 2015 at 2:52 AM, Michael Park mcyp...@gmail.com wrote:
Hi Dharmit,
I'm not certain whether the Mesosphere deb packages have SSL enabled or not, although based on Craig's observation it looks like they do. I think the correct way to enable SSL is to set the SSL_ENABLED environment variable, rather than /etc/mesos-master/ssl_enabled, along with the rest of the SSL_ environment variables, of course. e.g. SSL_ENABLED=true SSL_KEY_FILE=path-to-your-private-key
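The "try http versus https" check suggested at the top of this message can also be scripted. The following is a minimal sketch using only the Python standard library: it attempts a TLS handshake against a host and port and reports whether it succeeded. The function name `speaks_tls` is made up for illustration, certificate validation is intentionally skipped because the only question asked is "TLS or plain text?", and a server that demands a client certificate may also show up as a failed handshake.

```python
import socket
import ssl


def speaks_tls(host, port, timeout=3.0):
    """Return True if a TLS handshake with host:port succeeds."""
    ctx = ssl.create_default_context()
    # Only the handshake matters here, so skip certificate checks.
    # check_hostname must be disabled before verify_mode is relaxed.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host):
                return True
    except OSError:
        # Covers ssl.SSLError (e.g. a plain-text HTTP reply), refused
        # connections, and timeouts.
        return False


if __name__ == "__main__":
    # 5050 is the master port used in this thread; the host is a placeholder.
    print(speaks_tls("127.0.0.1", 5050))
```

A False result for a master that should have SSL enabled matches the symptom described later in the thread ("Port 5050 is still served as plain http"); a packet capture remains the more definitive check for master-to-slave traffic.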
Re: SSL in Mesos 0.23
@carlos Are you building 0.23.0 from source? Just so we don't miss anything: can you make sure to run ./bootstrap, and build in a clean directory with your configuration similar to this:
../configure --enable-libevent --enable-ssl
Here http://mesos.apache.org/documentation/latest/mesos-ssl/ is the document I am using as a reference.
When you start up a master, if you just specify SSL_ENABLED=true it should error out and notify you that other required flags such as SSL_KEY_FILE are not provided. Can you verify this? If that is not happening, then the 2 options are:
1. Your environment variables are not making it to the binary: see Jeff Schroeder's comments.
2. The binary is not actually the one you expect. Double check the checksum with the binary you built after configuring with SSL.
On Fri, Aug 14, 2015 at 12:55 PM, Carlos Sanchez car...@apache.org wrote:
Looking forward to it, thanks! Running out of ideas here on what I am doing wrong.
On Fri, Aug 14, 2015 at 6:53 PM, Marco Massenzio ma...@mesosphere.io wrote:
FYI - Joris is out this week; he'll probably be able to get back to you early next week (modulo MesosCon craziness :)
Marco Massenzio
Distributed Systems Engineer
On Fri, Aug 14, 2015 at 9:14 AM, Carlos Sanchez car...@apache.org wrote:
No suggestions?
On Tue, Aug 11, 2015 at 6:47 PM, Vinod Kone vinodk...@apache.org wrote:
@joris, can you help out here?
On Tue, Aug 11, 2015 at 9:43 AM, Carlos Sanchez car...@apache.org wrote:
I have tried to enable SSL with no success, even compiling from source with the ssl flags --enable-libevent --enable-ssl
export SSL_ENABLED=true
export SSL_SUPPORT_DOWNGRADE=false
export SSL_REQUIRE_CERT=true
export SSL_CERT_FILE=/etc/mesos/...
export SSL_KEY_FILE=/etc/mesos/...
export SSL_CA_FILE=/etc/mesos/...
/home/ubuntu/mesos-deb-packaging/mesos-repo/build/src/mesos-master --work_dir=/var/lib/mesos
Port 5050 is still served as plain http, no SSL. Nothing about ssl shows up in the logs. Any ideas?
Thanks
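For the first option in this thread (environment variables not reaching the binary), the `strings /proc/$(pidof mesos-slave)/environ | grep ^SSL_` check can be written as a small Python sketch. It relies on the Linux convention that /proc/<pid>/environ holds NUL-separated KEY=VALUE entries; the function names are made up for illustration, and reading another user's process environment usually requires root.

```python
def ssl_env_from_bytes(environ_bytes):
    """Extract SSL_* variables from the raw contents of /proc/<pid>/environ."""
    env = {}
    for entry in environ_bytes.split(b"\0"):
        if not entry:
            continue
        key, sep, value = entry.partition(b"=")
        if sep and key.startswith(b"SSL_"):
            env[key.decode()] = value.decode(errors="replace")
    return env


def ssl_env_of_pid(pid):
    """Read the SSL_* variables a running process was started with (Linux only)."""
    with open(f"/proc/{pid}/environ", "rb") as handle:
        return ssl_env_from_bytes(handle.read())


# Example with a hand-written environment block instead of a live process.
sample = b"PATH=/usr/bin\0SSL_ENABLED=true\0SSL_KEY_FILE=/tmp/example.key\0"
print(ssl_env_from_bytes(sample))
```

An empty result for the master or slave pid would point at the launcher (sudo, systemd unit, wrapper script) dropping the variables rather than at the Mesos build.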
Re: SSL in Mesos 0.23
Hi Joris,
I did build from sources, following instructions in http://mesos.apache.org/gettingstarted/
Is the mesosphere binary compiled with libevent and ssl enabled as mentioned previously? would make debugging easier if I don't have to rebuild
Re: Are the resource options documented?
There is also a disk resource. It is documented in attributes-resources.md: https://github.com/apache/mesos/blob/master/docs/attributes-resources.md
-- Best Regards, Haosdent Huang
Re: Allocation algorithm
The hierarchical allocator looks at one agent's resources at a time. For each agent, it runs DRF to figure out the candidate framework. More details here: https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L935
Regarding the starvation you observed: yes, that is possible with DRF. We plan to address this with optimistic offers (no implementation yet) and quotas (WIP).
On Mon, Aug 24, 2015 at 8:42 AM, Hans van den Bogert hansbog...@gmail.com wrote:
Can anyone tell me how the Mesos allocation algorithm works? Does Mesos offer every free resource it has to one framework at a time? Or does the allocator divide the max offer size by the number of active/registered frameworks?
And in this case: FW1 has a high dominant resource fraction (50%), which it does not release. FW2 and FW3 have a lot of churn in their tasks; both have outstanding short-lived tasks in their queues (shorter than the Mesos allocation interval), and these 2 FWs accept all resources Mesos has to offer, if they get the offer. Reading the DRF paper and presentation, am I to assume the online DRF algorithm would always favour FW2 and FW3 over FW1? One of the two (FW2/3) will always (or at least be more likely to) have a lower dominant resource share than FW1. According to the presentation on DRF, the framework with the lowest dominant resource gets the offer. But this is a potential starvation, e.g. if a framework has allocated memory but needs a new offer with CPUs to actually do something. You might wonder why the framework didn’t use memory AND cpu from the same offer, but Spark for example does exactly this.
To give some context, I think I’m seeing this behaviour with Spark in fine-grained mode. I have 4 Spark instances which are long-lived, emulating interactive queries. The first Spark instance to get an offer “installs” executors (with high memory demand) on every slave node it sees. The next framework tries to do the same, but for these later instances there's not always enough executor memory. That’s why I end up with one instance, the first to get the offer, holding a lot of memory it doesn’t let go, while it also gets far fewer CPU offers afterwards. In contrast, the later Spark instances, with fewer long-living executors, do not have a high memory usage and get relatively more CPU offers. Of course setting a max number of Spark executors per framework instance would mitigate this, but then I’m basically back to static allocation of resources.
Thanks in advance,
Hans
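The selection rule discussed in this thread (the framework with the lowest dominant share gets the next offer) can be illustrated with a toy sketch. This is not the Mesos hierarchical allocator, which works per agent and also considers roles and weights; the function names and the cluster numbers below are made up for illustration.

```python
def dominant_share(allocated, total):
    """Largest fraction of any single resource that a framework holds."""
    return max(allocated.get(name, 0.0) / total[name] for name in total)


def next_framework(allocations, total):
    """Pick the framework with the lowest dominant share (ties: first listed)."""
    return min(allocations, key=lambda fw: dominant_share(allocations[fw], total))


# Hypothetical cluster: FW1 sits on half of the memory and does not release it.
total = {"cpus": 9.0, "mem": 18.0}
allocations = {
    "FW1": {"cpus": 1.0, "mem": 9.0},   # dominant share 0.50 (mem)
    "FW2": {"cpus": 3.0, "mem": 1.0},   # dominant share 0.33 (cpus)
    "FW3": {"cpus": 0.5, "mem": 0.5},   # dominant share 0.06 (cpus)
}
print(next_framework(allocations, total))
```

With these numbers FW1 is offered resources only after FW2 and FW3 have each climbed past a 50% dominant share, which is the starvation pattern described in the question: holding memory keeps FW1's share high even while it is short of CPUs.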
Re: Allocation algorithm
On Mon, Aug 24, 2015 at 5:42 PM, Hans van den Bogert hansbog...@gmail.com wrote: Can anyone tell how the Mesos allocation algorithm works: Does Mesos offer every free resource it has to one framework at a time? Or does the allocator divide the max offer size by the amount of active/registered frameworks? and in case of: FW1 has a high dominant resource fraction (50%), which it does not release. FW2 and FW3 have a lot of churn for their tasks, both have outstanding short lived tasks in their queue (shorter than the mesos allocation interval), these 2 FWs accept all resources Mesos has to offer - if they get the offer. Reading the DRF paper and presentation, am I to assume the online DRF algorithm would favour FW2 and FW3 always before FW1? As one of the two (FW2/3) will always (or at least more likely to,) have a lower dominant resource than FW1. According to the presentation on DRF, the framework with the lowest dominant resource gets the offer. But this is a potential starvation e.g., if a framework has allocated memory, but needs a new offer with CPUs to actually do something. You might wonder why the framework didn’t use memory AND cpu from the same offer, but Spark for example does exactly this. I'd love to learn more from Mesos devs about the allocation algorithm. In my limited understanding, you are correct. To give some context, I think I’m seeing this behaviour with Spark in fine-grained mode. I have 4 spark instances which are long-lived, emulating interactive queries. The first Spark instance to get an offer “installs” executors (with high memory demand) on every slave node it sees. The next framework tries to do the same, but for these later instances, theres not always enough executor memory, that’s why I end up with an instance, which was first to get the offer, with a lot of memory it doesn’t let go, but it also gets way less offers for CPU afterwards. 
In contrast, the later Spark instances with fewer long-living executors do not have high memory usage, and get relatively more CPU offers. Of course, setting a maximum number of Spark executors per framework instance would mitigate this, but then I'm basically back to static allocation of resources. I've seen similar behavior with Spark's fine-grained mode. See my thread from a couple of days ago. I would recommend using coarse-grained mode with dynamic allocation (available in the upcoming 1.5 version). We worked around this by using Mesos roles, and assigning Spark to a specific role. It seems Mesos will allocate resources based on roles, if configured. Unfortunately, `spark.mesos.role` is a new configuration parameter to be added in 1.5 as well, so we needed to use the Spark 1.5 preview. iulian Thanks in advance, Hans -- Iulian Dragos -- Reactive Apps on the JVM www.typesafe.com
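For readers following the DRF discussion above, here is a minimal sketch of the selection rule Hans describes (the framework with the lowest dominant share gets the next offer). It is an illustration only, not the Mesos allocator: the names `dominant_share`, `next_framework`, `TOTAL` and `ALLOCATIONS` are hypothetical, the cluster size and allocations are made-up numbers, and roles, weights and offer filters are ignored.

```python
from typing import Dict

Resources = Dict[str, float]


def dominant_share(allocated: Resources, total: Resources) -> float:
    """Largest fraction of any single cluster resource held by a framework."""
    return max(
        (allocated.get(name, 0.0) / capacity
         for name, capacity in total.items() if capacity > 0),
        default=0.0,
    )


def next_framework(allocations: Dict[str, Resources], total: Resources) -> str:
    """Pick the framework with the lowest dominant share (simplified DRF rule)."""
    return min(allocations, key=lambda fw: dominant_share(allocations[fw], total))


# Hypothetical cluster: 16 CPUs and 64 GB of memory in total.
TOTAL = {"cpus": 16.0, "mem": 64.0}

# FW1 holds half of the memory (long-lived executors) but hardly any CPU;
# FW2 and FW3 run short-lived tasks and hold little of anything.
ALLOCATIONS = {
    "FW1": {"cpus": 1.0, "mem": 32.0},
    "FW2": {"cpus": 2.0, "mem": 4.0},
    "FW3": {"cpus": 4.0, "mem": 2.0},
}

if __name__ == "__main__":
    for name, alloc in ALLOCATIONS.items():
        print(name, dominant_share(alloc, TOTAL))
    print("next offer goes to", next_framework(ALLOCATIONS, TOTAL))
```

Under these assumed numbers FW1's dominant share stays at 0.5 because of the memory it holds, so as long as FW2 and FW3 stay below that, this simplified rule keeps ranking them ahead of FW1, which matches the starvation concern raised in the thread.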
Re: Not getting resource offers for 20 min
On Mon, Aug 24, 2015 at 7:16 PM, CCAAT cc...@tampabay.rr.com wrote: On 08/24/2015 05:33 AM, Iulian Dragoș wrote: Hello Iulian, Ok, so I eventually built Spark from 100% sources, after some intermediate builds on Gentoo. Gentoo is not the best platform for Java development, but those issues related to Spark builds are slowly being fixed on Gentoo. Where (how) do you download the spark-1.5.x complete source tree, as it does not seem to be available on this page: http://spark.apache.org/downloads.html It's not yet a final release, but there's a preview: http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-ANNOUNCE-Spark-1-5-0-preview-package-td13683.html Building Spark from sources isn't too hard; there's a `make-distribution.sh` script in the root directory. There are a few parameters (like the dependency Hadoop version), but it should be fairly straightforward. More info here: http://spark.apache.org/docs/latest/building-spark.html iulian Any other related information or tips on building Spark from sources are keenly received. James Unfortunately I don't have access to the cluster anymore, but I think Chronos wasn't the culprit. After updating Spark to 1.5 and setting a framework role, offers started to come (while still using Chronos). iulian -- Iulian Dragos -- Reactive Apps on the JVM www.typesafe.com
Re: Mesos/Marathon/HAProxy Logging
So I agree that is how it should be done, however the current implementation on Mesos requires me to manually code something like. In addition, this is only for HTTP traffic, not TCP... what happens when the service running on Mesos isn't HTTP? I was hoping for some discussion beyond just manually editing the HAProxy script to make it HTTP and add the headers... On Tue, Aug 25, 2015 at 12:46 PM, Jeff Schroeder jeffschroe...@computer.org wrote: This is the header that should be passed: https://en.m.wikipedia.org/wiki/X-Forwarded-For Most of the modern internet routes through reverse proxies and this is how we log the actual source clients to solve similar auditing and compliance needs. On Tuesday, August 25, 2015, John Omernik j...@omernik.com wrote: I have been playing with an application that is a very simple app: a webservice running in Python. I've created a Docker container, it runs in the container, I set up Marathon to run it, I use mesos-dns and HAProxy, and I can access the service just fine anywhere in the cluster. First let me say this is VERY cool. The capabilities here are awesome. Now the challenge: the security guy in me wants to take good logs from my app. It was set up to do its own logging through a custom module. I am very happy with it. I set up the app in the container to mount a volume that's in my MapRFS via NFS so I can log directly to a clustered filesystem. This is awesome, I can read my logs in Apache Drill as they are written!!! However, HAProxy threw me for a loop. Once I started running the app in Marathon with a service port and routed around via HAProxy, I realized something: I lost my source IPs in my logs. Why? Because once HAProxy takes over, it no longer needs to keep the source IP, and instead the next hop only sees the previous connection IP. From a service discovery perspective it works great, but with this setup, I'd lose the previous hop.
Perhaps I manually add something in HAProxy to add an X-Forwarded-For header; that would be nice, however, that only works for HTTP apps. What about other TCP apps that are not HTTP? This is an interesting problem, because apps should have good logging, security, performance, and troubleshooting, and if I can't get the source IP it could be a problem. So, my question is this: has anyone run into this? How are you handling it? Any brainstorms here we may be able to work off of? One thing I thought was: why are we using HAProxy? Couldn't the same HAProxy script actually put forwarding rules into iptables? This sounds messy, but could it work? Has anyone explored that? If the data was forwarded, then it wouldn't lose the IP information, and timeouts wouldn't be a concern either (I think I posted before on how long-running TCP connections can be closed down by HAProxy if they don't implement TCP keepalives). Other ideas? This is interesting to me, and likely others. -- Text by Jeff, typos by iPhone
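For the HTTP case discussed above, a minimal sketch of how the Python webservice could recover the original client address from X-Forwarded-For for its own logs. This assumes the proxy has been configured to add the header (in HAProxy's HTTP mode that is what `option forwardfor` does); the function name `client_ip` and the `trusted_proxies` set are hypothetical, and the sketch does nothing for non-HTTP TCP services, which is the open question in this thread.

```python
from typing import Mapping, Set


def client_ip(headers: Mapping[str, str], peer_ip: str,
              trusted_proxies: Set[str]) -> str:
    """Return the address to log as the source of an HTTP request.

    The header is only honoured when the TCP peer is a proxy we run
    ourselves; otherwise any client could spoof its logged address.
    """
    if peer_ip not in trusted_proxies:
        return peer_ip

    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    forwarded = lowered.get("x-forwarded-for", "")
    hops = [hop.strip() for hop in forwarded.split(",") if hop.strip()]

    # Each proxy appends the address it received the request from, so walk
    # from the right and stop at the first hop that is not one of ours.
    for hop in reversed(hops):
        if hop not in trusted_proxies:
            return hop
    return peer_ip


if __name__ == "__main__":
    proxies = {"10.0.0.5", "10.0.0.9"}  # hypothetical HAProxy node addresses
    request_headers = {"X-Forwarded-For": "203.0.113.7, 10.0.0.5"}
    print(client_ip(request_headers, "10.0.0.9", proxies))
```

Walking the list from the right rather than taking the first entry is a deliberate choice: the left-most entries are supplied by the client and cannot be trusted on their own.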
Re: Not getting resource offers for 20 min
Wanted to add that, even if there wasn't a preview package, you can clone from Git and check out a tag; in this case v1.5.0-rc1 is tagged. Then proceed as you would with a source distribution, as described in the already mentioned http://spark.apache.org/docs/latest/building-spark.html On 25 Aug 2015, at 19:26, CCAAT cc...@tampabay.rr.com wrote: THANKS, as I have not kept up on the Spark lists James On 08/25/2015 04:28 AM, Iulian Dragoș wrote: On Mon, Aug 24, 2015 at 7:16 PM, CCAAT cc...@tampabay.rr.com wrote: On 08/24/2015 05:33 AM, Iulian Dragoș wrote: Hello Iulian, Ok, so I eventually built Spark from 100% sources, after some intermediate builds on Gentoo. Gentoo is not the best platform for Java development, but those issues related to Spark builds are slowly being fixed on Gentoo. Where (how) do you download the spark-1.5.x complete source tree, as it does not seem to be available on this page: http://spark.apache.org/downloads.html It's not yet a final release, but there's a preview: http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-ANNOUNCE-Spark-1-5-0-preview-package-td13683.html Building Spark from sources isn't too hard; there's a `make-distribution.sh` script in the root directory. There are a few parameters (like the dependency Hadoop version), but it should be fairly straightforward. More info here: http://spark.apache.org/docs/latest/building-spark.html iulian Any other related information or tips on building Spark from sources are keenly received. James Unfortunately I don't have access to the cluster anymore, but I think Chronos wasn't the culprit. After updating Spark to 1.5 and setting a framework role, offers started to come (while still using Chronos). iulian -- Iulian Dragos -- Reactive Apps on the JVM www.typesafe.com
slave_ping_timeout 1secs
Hi, I am running Mesos 0.23.0 and noticed that I cannot start mesos-master with slave_ping_timeout set to less than 1 second; I tried 0.5secs, 500ms, 50us, etc. Is this by design or am I missing something? Cheers, Nastooh Avessta ENGINEER.SOFTWARE ENGINEERING nave...@cisco.com Phone: +1 604 647 1527 Cisco Systems Limited 595 Burrard Street, Suite 2123 Three Bentall Centre, PO Box 49121 VANCOUVER BRITISH COLUMBIA V7X 1J1 CA Cisco.com http://www.cisco.com/ This email may contain confidential and privileged material for the sole use of the intended recipient. Any review, use, distribution or disclosure by others is strictly prohibited. If you are not the intended recipient (or authorized to receive for the recipient), please contact the sender by reply email and delete all copies of this message. For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html Cisco Systems Canada Co, 181 Bay St., Suite 3400, Toronto, ON, Canada, M5J 2T3. Phone: 416-306-7000; Fax: 416-306-7099.
Re: slave_ping_timeout 1secs
Yes: https://github.com/apache/mesos/blob/5de7ea455ec577e19c67a75b1cf98493b40c53fb/src/master/flags.cpp#L383 Was the error message not shown in stderr? -- Jiang Yan Xu y...@jxu.me @xujyan http://twitter.com/xujyan On Tue, Aug 25, 2015 at 5:41 PM, Nastooh Avessta (navesta) nave...@cisco.com wrote: Hi Running Mesos 0.23.0 and noted that cannot start mesos-master with slave_ping_timeout less than 1 second, tried 0.5secs, 500ms and 50us, etc. Is this by design or am I missing something? Cheers, Nastooh Avessta
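To make the behaviour in this thread concrete, here is a small sketch of the kind of check that rejects the values Nastooh tried. It is not the Mesos source: the parser and the names `parse_duration_ns` and `is_valid_ping_timeout` are hypothetical, the unit suffixes are an assumption about what Mesos duration flags accept, and only the 1-second lower bound reported above is modelled (any upper bound in the linked flags code is not).

```python
import re

# Nanoseconds per unit, for the suffixes assumed to be accepted.
_UNIT_NS = {
    "ns": 1,
    "us": 1_000,
    "ms": 1_000_000,
    "secs": 1_000_000_000,
    "mins": 60 * 1_000_000_000,
    "hrs": 3_600 * 1_000_000_000,
    "days": 86_400 * 1_000_000_000,
    "weeks": 7 * 86_400 * 1_000_000_000,
}

_DURATION = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(ns|us|ms|secs|mins|hrs|days|weeks)\s*$")

# Lower bound taken from the thread: values below 1 second are rejected.
MIN_PING_TIMEOUT_NS = _UNIT_NS["secs"]


def parse_duration_ns(text: str) -> float:
    """Convert a string such as '500ms' or '0.5secs' to nanoseconds."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_NS[unit]


def is_valid_ping_timeout(text: str) -> bool:
    """True if the value meets the 1-second minimum described in the thread."""
    return parse_duration_ns(text) >= MIN_PING_TIMEOUT_NS


if __name__ == "__main__":
    for candidate in ("0.5secs", "500ms", "50us", "1secs", "15secs"):
        print(candidate, is_valid_ping_timeout(candidate))
```

All three values from the original question fall below the minimum in this sketch, which is consistent with the master refusing to start.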
RE: slave_ping_timeout 1secs
I see. Thank you for the clarification. Can I just change the boundaries in the source code to suit my needs, or is there more to it? Cheers, Nastooh Avessta From: Yan Xu [mailto:y...@jxu.me] Sent: Tuesday, August 25, 2015 5:49 PM To: user@mesos.apache.org Subject: Re: slave_ping_timeout 1secs Yes: https://github.com/apache/mesos/blob/5de7ea455ec577e19c67a75b1cf98493b40c53fb/src/master/flags.cpp#L383 Was the error message not shown in stderr? -- Jiang Yan Xu y...@jxu.me @xujyan http://twitter.com/xujyan On Tue, Aug 25, 2015 at 5:41 PM, Nastooh Avessta (navesta) nave...@cisco.com wrote: Hi Running Mesos 0.23.0 and noted that cannot start mesos-master with slave_ping_timeout less than 1 second, tried 0.5secs, 500ms and 50us, etc. Is this by design or am I missing something?
Re: SSL in Mesos 0.23
@Carlos Mesosphere currently doesn't build packages with SSL enabled. On Tue, Aug 25, 2015 at 3:12 PM, Carlos Sanchez car...@apache.org wrote: Hi Joris, I did build from sources, following the instructions in http://mesos.apache.org/gettingstarted/ Is the Mesosphere binary compiled with libevent and SSL enabled as mentioned previously? It would make debugging easier if I don't have to rebuild. On Tue, Aug 25, 2015 at 8:52 PM, Joris Van Remoortere jo...@mesosphere.io wrote: @carlos Are you building 0.23.0 from source? Just so we don't miss anything: can you make sure to run ./bootstrap, and build in a clean directory with your configuration similar to this: ../configure --enable-libevent --enable-ssl Here http://mesos.apache.org/documentation/latest/mesos-ssl/ is the document I am using as a reference. When you start up a master, if you just specify SSL_ENABLED=true it should error out and notify you that other required flags such as SSL_KEY_FILE are not provided. Can you verify this? If that is not happening, then the 2 options are: 1. Your environment variables are not making it to the binary: see Jeff Schroeder's comments. 2. The binary is not actually the one you expect. Double-check the checksum against the binary you built after configuring with SSL. On Fri, Aug 14, 2015 at 12:55 PM, Carlos Sanchez car...@apache.org wrote: Looking forward to it, thanks! I'm running out of ideas here on what I am doing wrong. On Fri, Aug 14, 2015 at 6:53 PM, Marco Massenzio ma...@mesosphere.io wrote: FYI - Joris is out this week, he'll probably be able to get back to you early next week (modulo MesosCon craziness :) Marco Massenzio Distributed Systems Engineer On Fri, Aug 14, 2015 at 9:14 AM, Carlos Sanchez car...@apache.org wrote: no suggestions? On Tue, Aug 11, 2015 at 6:47 PM, Vinod Kone vinodk...@apache.org wrote: @joris, can you help out here?
On Tue, Aug 11, 2015 at 9:43 AM, Carlos Sanchez car...@apache.org wrote: I have tried to enable SSL with no success, even compiling from source with the ssl flags --enable-libevent --enable-ssl

    export SSL_ENABLED=true
    export SSL_SUPPORT_DOWNGRADE=false
    export SSL_REQUIRE_CERT=true
    export SSL_CERT_FILE=/etc/mesos/...
    export SSL_KEY_FILE=/etc/mesos/...
    export SSL_CA_FILE=/etc/mesos/...
    /home/ubuntu/mesos-deb-packaging/mesos-repo/build/src/mesos-master --work_dir=/var/lib/mesos

Port 5050 is still served as plain http, no SSL. Nothing about ssl shows up in the logs, any ideas? Thanks From: Dharmit Shah shahdhar...@gmail.com To: user@mesos.apache.org Cc: Date: Mon, 10 Aug 2015 14:13:04 +0530 Subject: Re: SSL in Mesos 0.23 Hi Jeff, Thanks for the suggestion. I modified the systemd service file to use `/etc/sysconfig/mesos-master` and `/etc/sysconfig/mesos-slave` as environment files for master and slave services respectively. In these files, I specified the environment variables that I used to specify on the command line. Now if I check `strings /proc/pid/environ | grep SSL` for pids of master and slave services, I see the environment variables that I set in the /etc/sysconfig/environment-file. Now that it looks like I have started the master and slave services with SSL enabled, how do I really confirm that communication between master and slaves is really happening over SSL? Also, how do I enable SSL communication for a framework like Marathon? Regards, Dharmit. On Fri, Aug 7, 2015 at 10:56 PM, Jeff Schroeder jeffschroe...@computer.org wrote: The sudo command defaults to envreset (look for that in the man page) which strips all env variables sans a select few. I'd almost bet that your SSL_* variables are not present and were not passed to the slave. Just sudo -i and start the slaves *as root* without sudo. There is no benefit to starting them with sudo.
You can verify what I'm saying with something along the lines of: strings /proc/$(pidof mesos-slave)/environ | grep ^SSL_ On Friday, August 7, 2015, Dharmit Shah shahdhar...@gmail.com wrote: Hello again, Thanks for your responses. I will share what I tried after your suggestions. 1. `ldd /usr/sbin/mesos-master` and `ldd /usr/sbin/mesos-slave` returned output similar to the one suggested by Craig. So, I guess, the Mesosphere repo binaries have SSL enabled. Right? 2. I created an SSL private key and cert on one system in my cluster by referring to this guide on DO [1]. Admittedly, my knowledge of SSL is limited. 3. Next, I copied the key and cert to all three mesos-master nodes and four mesos-slave nodes. Shouldn't slave nodes be provided only with the cert and not the