Re: mesos cluster can't fit federation cluster
Hi Marco, I want fault-tolerant slave nodes across multiple datacenters, but I found that the possible setup methods are not production-ready.

2015-07-02 1:38 GMT+08:00 Marco Massenzio ma...@mesosphere.io:

Hi Tommy, not sure what your use-case is, but you are correct: the master/slave nodes need to have bi-directional connectivity. However, there is no fundamental reason why those have to be public IPs; so long as they are routable (via DNS discovery, VPN, or other network-layer mechanisms) that will work. (Without even thinking too hard about this — so I may be entirely wrong here — you could place a couple of Nginx/HAProxy nodes with two NICs, one visible to the slaves, the other in the VPC subnet, and forward all traffic? I'm sure I'm missing something here :)

When you launch the master nodes, you specify the NICs they need to listen on via the --ip option, while the slave nodes have the --master flag that takes either a hostname:port or ip:port argument: so long as they are routable, this *should* work (although, admittedly, I've never tried this personally). One concern I would have with such an arrangement, though, is network partitioning: if the DC-to-DC connectivity were to drop, you'd suddenly lose all master/slave connectivity. It's also not clear to me that separating the masters from the slaves would give you better availability, reliability, or security. It would be great to understand the use-case, so we could see what could be added (if anything) to Mesos going forward.

*Marco Massenzio*
*Distributed Systems Engineer*

On Wed, Jul 1, 2015 at 9:15 AM, tommy xiao xia...@gmail.com wrote:

Hello, I would like to deploy master nodes in a private zone and set up Mesos slaves in another datacenter, but the multi-datacenter mode can't work: it needs the slave nodes to reach the master nodes on a public network IP. In our production zone, however, the gateway IP does not belong to the master nodes.
Does anyone have the same experience with a multi-datacenter deployment? I prefer the Kubernetes federation proposal: https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/proposals/federation-high-level-arch.png

-- Deshi Xiao Twitter: xds2000 E-mail: xiaods(AT)gmail.com
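Marco's flag description above could translate into invocations like the following — a sketch only: the addresses are made up, and, as Marco says himself, this exact cross-DC arrangement is untested:

```shell
# Master listens on a routable (not necessarily public) NIC:
mesos-master --ip=10.0.1.5 --port=5050 --work_dir=/var/lib/mesos

# Slave in the other datacenter points at a routable address of the master;
# --master accepts either hostname:port or ip:port:
mesos-slave --master=10.0.1.5:5050 --ip=10.0.2.7 --work_dir=/var/lib/mesos
```

The key point is that "routable" can mean a VPN or gateway path between the datacenters; nothing here requires public IPs.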
Re: Mesos Slave Port Change Fails Recovery
Checkpointing has been enabled since 0.18 on these slaves. The only other setting that changed during the upgrade was that we added --gc_delay=1days. Otherwise, it's an in-place upgrade without any changes to the work directory... Philippe

On Thu, Jul 2, 2015 at 8:59 PM, Vinod Kone vinodk...@gmail.com wrote:

It is surprising that the slave didn't bail out during the initial phase of recovery when the port changed. I'm assuming you enabled checkpointing in 0.20.0 and that you didn't wipe the meta data directory or anything when upgrading to 0.21.0?

On Thu, Jul 2, 2015 at 3:06 PM, Philippe Laflamme phili...@hopper.com wrote:

Here you are: https://gist.github.com/plaflamme/9cd056dc959e0597fb1c You can see in the mesos-master.INFO log that it re-registers the slave using port :5050 (line 9) and fails the health checks on port :5051 (line 10). So it might be the slave that re-uses the old configuration? Thanks, Philippe

On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone vinodk...@gmail.com wrote:

Can you paste some logs?

On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme phili...@hopper.com wrote:

Ok, that's reasonable, but I'm not sure why it would successfully re-register with the master if it's not supposed to in the first place. I think changing the resources (for example) will dump the old configuration in the logs and tell you why recovery is bailing out. It's not doing that in this case. It looks as though this doesn't work only because the master can't ping the slave on the old port, because the whole recovery process was successful otherwise. I'm not sure if the slave could have picked up its configuration change and failed the recovery early, but that would definitely be a better experience. Philippe

On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone vinodk...@gmail.com wrote:

For slave recovery to work, the slave is expected not to change its config.

On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme phili...@hopper.com wrote:

Hi, I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves configured with checkpointing and with reconnect recovery. I was investigating why the slaves would successfully re-register with the master and recover, but would subsequently be asked to shut down (health check timeout). It turns out that our slaves had been unintentionally configured to use port 5050 in the previous configuration. We decided to fix that during the upgrade and have them use the default 5051 port. This change seems to make the health checks fail and eventually kills the slave due to inactivity. I've confirmed that leaving the port at what it was in the previous configuration makes the slave successfully re-register, and it is not asked to shut down later on. Is this a known issue? I haven't been able to find a JIRA ticket for this. Maybe it's the expected behaviour? Should I create a ticket? Thanks, Philippe
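Keeping the slave's configuration stable across the upgrade — which, as Vinod notes, is what recovery requires — would look roughly like this. A sketch only: the master address is invented, and the flags are the 0.21-era ones discussed in this thread:

```shell
# In-place 0.20.0 -> 0.21.0 upgrade: keep --port identical to the value the
# old slave checkpointed. Changing it (e.g. 5050 -> 5051) lets recovery and
# re-registration appear to succeed, but the master keeps health-checking the
# old port and eventually shuts the slave down for inactivity.
mesos-slave \
  --master=master.example.com:5050 \
  --port=5050 \
  --checkpoint \
  --recover=reconnect \
  --work_dir=/var/lib/mesos
```

Change the port only after draining the slave (or by wiping the meta directory, at the cost of losing the recoverable tasks).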
Re: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and Gatling
On 07/02/2015 12:10 PM, Carlos Torres wrote:

From: CCAAT cc...@tampabay.rr.com
Sent: Thursday, July 2, 2015 12:00 PM
To: user@mesos.apache.org
Cc: cc...@tampabay.rr.com
Subject: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and Gatling

On 07/01/2015 01:17 PM, Carlos Torres wrote:

Hi all, in the past weeks I've been thinking about leveraging Mesos to schedule distributed load tests.

An excellent idea.

One problem, at least for me, with this approach is that the load testing tool needs to coordinate the distributed scenario and combine the data; if it doesn't, the load clients will trigger at different times, and an aggregation step over the data later has to be handled by the user, some external batch job, or a script. This is not a problem for load generators like Tsung or Locust, since they already provide a distributed model and coordinate the distributed tasks, but it could be a little more complicated for Gatling, which does not. To me, the approach the Kubernetes team suggests is really a hack using the 'Replication Controller' to spawn multiple replicas, which could be easily achieved with the same approach using Marathon (or Kubernetes on Mesos).

I was thinking of building a Mesos framework that would take the input, or load simulation file, and would schedule jobs across the cluster (perhaps with dedicated resources to minimize variance) using Gatling. A Mesos framework would be able to provide a UI/API to take the input jobs and report the status of multiple jobs. It could also provide a way to sync/orchestrate the simulation, and finally a way to aggregate the simulation data in one place and serve the generated HTML report. Boiled down to its primitive parts, it would spin up multiple Gatling (Java) processes across the cluster, use something like a barrier (not sure what to use here) to wait for all processes to be ready to execute, and finally copy and rename the generated simulation logs from each Gatling process to one node/place, where they are aggregated and compiled into an HTML report by a single Gatling process.

First of all, is there anything in the Mesos community that does this already? If not, do you think this is feasible to accomplish with a Mesos framework, and would you recommend going with this approach? Does Mesos offer barrier-like features to coordinate jobs, and can I somehow move files to a single node to be processed?

This all sounds workable, but I do not have all the experience necessary to qualify your ideas. What I would suggest is a solution that lends itself to testing similarly configured cloud/cluster offerings, so that we, the cloud/cluster community, have a way to test and evaluate new releases, substitute component codes, forks, and even competitive offerings. A ubiquitous and robust testing semantic based on your ideas does seem to be an overwhelmingly positive idea, imho. As such, some organizational structure to allow results to be maintained and quickly compared to other 'test-runs' would greatly encourage usage. Hopefully 'Gatling' and such have many, if not most, of the features needed to automate the evaluation of results.

Finally, I've never written a non-trivial Mesos framework; how should I go about it, or find more documentation, to get started? I'm looking for best practices, pitfalls, etc. Thank you for your time, Carlos

hth, James

Thanks for your feedback. I like your idea about having the ability to swap out the different components (e.g. load generators) and perhaps even providing an abstraction over the charting and data-reporting mechanism.

I'll probably start with the simplest way possible, though: having the framework deploy Gatling across the cluster in a scale-out fashion and retrieve each instance's results. Once I get that working, I'll start experimenting with abstracting out certain functionality. I know Twitter has a distributed load generator, called Iago, that apparently works on Mesos; it'd be awesome if any of its contributors chimed in and shared what worked great, good, and not so good. The few things I'm concerned about in implementing such a framework on Mesos are:

* Noisy neighbors, or resource isolation.
  - Rationale: It can introduce noise into the results if a load generator competes for shared resources (e.g. network) with other tasks.

* Coordination of execution.
  - Rationale: Need the ability to control the execution of groups of related tasks. User A submits a simulation that might create 5 load clients (tasks?); right after that, User B submits a different simulation that creates 10 load clients. Ideally, all of User A's load clients should be on independent nodes and should not share the same slaves as User B's load clients; if not enough slaves are available on the cluster, then User B's simulation queues until slaves are available. There might be enough resources to
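The barrier step Carlos is unsure about can be illustrated in miniature with Python's stdlib threading.Barrier. This is a local, single-process sketch of the rendezvous idea only; in a real Mesos framework the barrier would have to be distributed (e.g. built on ZooKeeper), and all names below are invented:

```python
import threading
import time

NUM_CLIENTS = 3
results = []
# Every simulated load client blocks at the barrier until all of them are
# ready, so the "load" starts at (approximately) the same instant.
barrier = threading.Barrier(NUM_CLIENTS)

def load_client(client_id: int) -> None:
    # Setup work of varying length (connecting, parsing the simulation, ...).
    time.sleep(0.01 * client_id)
    barrier.wait()             # rendezvous: nobody fires until all are ready
    results.append(client_id)  # stand-in for "generate load"

threads = [threading.Thread(target=load_client, args=(i,))
           for i in range(NUM_CLIENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # -> [0, 1, 2]
```

The same shape — register, wait for the full group, fire together — is what a ZooKeeper double barrier provides across machines.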
Re: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and Gatling
Yes, I agree; I think starting out with the scale-out approach, while naive, will be a good starting point. I actually have this automated with Jenkins and a bunch of dedicated slaves, using the Workflow plugin; it works kind of OK, since I can't really control their execution. If you are interested, here's my workflow script for Jenkins: https://github.com/meteorfox/gatling-workflow/blob/master/gatling_flow.groovy -- Carlos

From: Joao Ribeiro jonnyb...@gmail.com
Sent: Thursday, July 2, 2015 11:33 AM
To: user@mesos.apache.org
Subject: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and Gatling

This sounds like a really cool project. I am still a very green user of Mesos and have never used Gatling at all, but a quick search took me to http://gatling.io/docs/2.1.6/cookbook/scaling_out.html With this it shouldn't be too difficult to create a master/slave or scheduler/executors approach where you would have the master launch several slaves to do the work, wait for them to finish, collect the logs, and generate the report. For better synchronisation you could make the slaves register to ZooKeeper while the master waits for all slaves to be up, then trigger a "start test" command on all slaves simultaneously. You could then easily time out if it takes too long to get all slaves up, or use other more fault-tolerant strategies, e.g.: run with the slaves you got; bump each slave that is up with more load to try to make up for missing slaves. It might be a naive approach, but it would be a starting point in my opinion.
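The scaling-out cookbook Joao links to amounts to running the same simulation on every injector node, gathering each injector's simulation.log onto one machine, and rebuilding a single report with Gatling's reports-only mode. A rough sketch — the host names and paths here are invented, not taken from the cookbook:

```shell
# Run the same simulation on each injector node (e.g. via the framework's
# executors), then pull every injector's simulation.log back to one machine:
for host in injector1 injector2 injector3; do
  scp "$host:gatling/results/latest/simulation.log" \
      "gatling/results/collected/simulation-$host.log"
done

# Reports-only mode (-ro) aggregates all logs in the folder into one HTML report:
gatling.sh -ro collected
```

In a Mesos framework, the scp step would become the "copy and rename the generated simulation logs to one node" step Carlos describes, and the final aggregation would run as a single task.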
Re: Apache Mesos Community Sync
Reminder: The Mesos Community Developer Sync will be happening today at 3pm Pacific. To participate remotely, join the Google hangout: https://plus.google.com/hangouts/_/twitter.com/mesos-sync On Thu, Jun 18, 2015 at 7:22 AM, Adam Bordelon a...@mesosphere.io wrote: Reminder: We're hosting a developer community sync at Mesosphere HQ this morning from 9-11am Pacific. The agenda is pretty bare, so please add more topics you would like to discuss: https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit If you want to join in person, just show up to 88 Stevenson St, ring the buzzer, take the elevator up to 2nd floor, and then you can take the stairs up to the 3rd floor dining room, or ask somebody to let you up the elevator to the 3rd floor. To participate remotely, join the Google hangout: https://plus.google.com/hangouts/_/mesosphere.io/mesos-developer On Mon, Jun 15, 2015 at 10:46 AM, Adam Bordelon a...@mesosphere.io wrote: As previously mentioned, we would like to host additional Mesos developer syncs at our new Mesosphere HQ at 88 Stevenson St (tucked behind Market 2nd), starting this Thursday from 9-11am Pacific. We opted for an earlier slot so that the European developer community can participate. Now that we are having these more frequently, it would be great to dive deeper into designs for upcoming features as well as discuss longstanding issues. While high-level status updates are useful, they should be a small part of these meetings so that we can address issues currently facing our developers. Please add agenda items to the same doc we've been using for previous meetings' Agenda/Notes: https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit Join in person if you can, or join remotely via hangout: https://plus.google.com/hangouts/_/mesosphere.io/mesos-developer Thanks, -Adam- On Thu, May 28, 2015 at 10:08 AM, Vinod Kone vinodk...@gmail.com wrote: Cool. 
Here's the agenda doc https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit# for next week that folks can fill in. On Thu, May 28, 2015 at 9:52 AM, Adam Bordelon a...@mesosphere.io wrote: Looks like next week, Thursday June 4th on my calendar. I thought it was always the first Thursday of the month. On Thu, May 28, 2015 at 9:33 AM, Vinod Kone vinodk...@gmail.com wrote: Do we have community sync today or next week? I'm a bit confused. @vinodkone On Apr 1, 2015, at 3:18 AM, Adam Bordelon a...@mesosphere.io wrote: Reminder: We're having another Mesos Developer Community Sync this Thursday, April 2nd from 3-5pm Pacific. Agenda: https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit?usp=sharing To Join: follow the BlueJeans instructions from the recurring meeting invite at the start of this thread. On Fri, Mar 6, 2015 at 11:11 AM, Vinod Kone vinodk...@apache.org wrote: Hi folks, We are planning to do monthly Mesos community meetings. Tentatively these are scheduled to occur on 1st Thursday of every month at 3 PM PST. See below for details to join the meeting remotely. This is a forum to ask questions/discuss about upcoming features, process etc. Everyone is welcome to join. Feel free to add items to the agenda for the next meeting here https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit?usp=sharing . Cheers, On Thu, Mar 5, 2015 at 11:23 AM, Vinod Kone via Blue Jeans Network inv...@bluejeans.com wrote: [image: Blue Jeans] http://bluejeans.com Vinod Kone vi...@twitter.com has invited you to a video meeting. Meeting Title: Apache Mesos Community Sync Meeting Time: Every 4th week on Thursday • from March 5, 2015 • 3 p.m. PST / 2 hrs Join Meeting https://bluejeans.com/272369669?ll=eng=mrsxmqdnmvzw64zomfygcy3imuxg64th -- Connecting directly from a room system? 1) Dial: 199.48.152.152 or bjn.vc 2) Enter Meeting ID: 272369669 -or- use the pairing code Just want to dial in? 
(all numbers: http://bluejeans.com/premium-numbers ) 1) Direct dial: +1 408 740 7256, +1 888 240 2560 (US Toll Free), or +1 408 317 9253 (Alternate Number) 2) Enter Meeting ID: 272369669 -- Description: We will try BlueJeans VC this time for our monthly community sync. If BlueJeans *doesn't* work out we will use the Google Hangout link (https://plus.google.com/hangouts/_/twitter.com/mesos-sync) instead. *Note:* No moderator is required to start this
Re: mesos cluster can't fit federation cluster
On Wed, Jul 1, 2015 at 11:38 PM, tommy xiao xia...@gmail.com wrote:

Hi Marco, I want fault-tolerant slave nodes across multiple datacenters, but the setup methods I found are not production-ready.

What kind of fault tolerance are you looking for here? Against one (or either) of the DCs going away, or a network partition? Or one (or more) of the racks in one DC going away? Depending on what you want to protect yourself against, there may be different ways to achieve it. I'm sorry I haven't been around Mesos long enough to really be knowledgeable about the specifics here, but I have built HA systems before around VPCs and on-prem solutions, and I know bi-directional routing can be achieved using gateways and/or dedicated VPN links (we also solved that very issue at Google, but I can't talk about that :). I'm sure the Twitter folks have solved that same problem too, but I'm guessing they may not be able to share much either?

2015-07-02 1:38 GMT+08:00 Marco Massenzio ma...@mesosphere.io:

Hi Tommy, not sure what your use-case is, but you are correct: the master and slave nodes need to have bi-directional connectivity. However, there is no fundamental reason why those have to be public IPs; so long as they are routable (via DNS discovery and/or VPN or other network-layer mechanisms), that will work. (Without even thinking too hard about this, so I may be entirely wrong here: you could place a couple of Nginx/HAProxy nodes with two NICs, one visible to the slaves, the other in the VPC subnet, and forward all traffic? I'm sure I'm missing something here :)

When you launch the master nodes, you specify the NICs they need to listen on via the --ip option, while the slave nodes have the --master flag that takes either a hostname:port or ip:port argument: so long as they are routable, this *should* work (although, admittedly, I've never tried this personally).
One concern I would have in such an arrangement, though, would be about network partitioning: if the DC-to-DC connectivity were to drop, you'd suddenly lose all master/slave connectivity. It's also not clear to me that sectioning the masters off from the slaves would give you better availability and/or reliability and/or security. It would be great to understand the use-case, so we could see what could be added (if anything) to Mesos going forward.

*Marco Massenzio* *Distributed Systems Engineer*

On Wed, Jul 1, 2015 at 9:15 AM, tommy xiao xia...@gmail.com wrote:

Hello, I would like to deploy master nodes in a private zone and set up Mesos slaves in another datacenter, but this multi-datacenter mode doesn't work: it requires that the slave nodes be able to reach the master nodes on a public network IP, and in our production zone the gateway IP does not belong to the master nodes. Does anyone have experience with this multi-datacenter deployment case? I prefer the Kubernetes cluster federation proposal: https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/proposals/federation-high-level-arch.png

-- Deshi Xiao Twitter: xds2000 E-mail: xiaods(AT)gmail.com
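For concreteness, the flags Marco mentions amount to something like the following sketch. The addresses, ZooKeeper setup, and work directories here are illustrative assumptions, not from the thread; the only requirement discussed above is that the addresses be routable in both directions (VPN, gateway, etc.), not that they be public:

```shell
# On each master node: bind to the DC-internal NIC that is routable from
# the other datacenter (10.0.1.10 is a made-up example address).
mesos-master --ip=10.0.1.10 --port=5050 \
    --zk=zk://10.0.1.10:2181/mesos --quorum=2 \
    --work_dir=/var/lib/mesos

# On each slave node in the other datacenter: point --master at a routable
# hostname:port or ip:port of the master (again, an illustrative address).
mesos-slave --master=10.0.1.10:5050 --work_dir=/var/lib/mesos
```

This is a configuration sketch only; as Marco notes, he had not tried this cross-DC arrangement personally.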
Re: [Question] Distributed Load Testing with Mesos and Gatling
Hi Carlos, This sounds like a great idea and something that many people would like to be able to leverage. Mesosphere has a pretty good starting example that could be used to bootstrap the process of authoring your Gatling framework. The project is called RENDLER [1] and has several different language implementations; RENDLER is essentially a distributed web-page renderer. Hope this helps, Ben Whitehead

[1] https://github.com/mesosphere/RENDLER/tree/master/java

On Thu, Jul 2, 2015 at 9:33 AM, Joao Ribeiro jonnyb...@gmail.com wrote:

This sounds like a really cool project. I am still a very green user of Mesos and have never used Gatling at all, but a quick search took me to http://gatling.io/docs/2.1.6/cookbook/scaling_out.html With this it shouldn't be too difficult to create a master/slave or scheduler/executors approach where you would have the master launch several slaves to do the work, wait for them to finish, collect the logs, and generate the report. For better synchronisation you could make the slaves register with ZooKeeper while the master waits for all slaves to be up, then trigger a "start test" command on all slaves simultaneously. You could then easily time out if it takes too long to get all slaves up, or use other more fault-tolerant strategies, e.g.: run with the slaves that you got; bump each slave that is up with more load to try to make up for the missing slaves. It might be a naive approach, but it would be a starting point in my opinion.

On 02 Jul 2015, at 18:00, CCAAT cc...@tampabay.rr.com wrote:

On 07/01/2015 01:17 PM, Carlos Torres wrote: Hi all, In the past weeks, I've been thinking about leveraging Mesos to schedule distributed load tests.

An excellent idea.
One problem, at least for me, with this approach is that the load-testing tool needs to coordinate the distributed scenario and combine the data; if it doesn't, the load clients will trigger at different times, and a later aggregation step over the data has to be handled by the user, or by some external batch job or script. This is not a problem for load generators like Tsung or Locust, since they already provide a distributed model and coordinate the distributed tasks, but it could be a little more complicated for Gatling, which does not. To me, the approach the Kubernetes team suggests is really a hack, using the 'Replication Controller' to spawn multiple replicas; the same thing could easily be achieved with Marathon (or Kubernetes on Mesos).

I was thinking of building a Mesos framework that would take the input, or load-simulation file, and would schedule jobs across the cluster (perhaps with dedicated resources, to minimize variance) using Gatling. A Mesos framework would be able to provide a UI/API to take the input jobs and report the status of multiple jobs. It could also provide a way to sync/orchestrate the simulation, and finally a way to aggregate the simulation data in one place and serve the generated HTML report. Boiled down to its primitive parts, it would spin up multiple Gatling (Java) processes across the cluster, use something like a barrier (not sure what to use here) to wait for all processes to be ready to execute, and finally copy and rename the generated simulation logs from each Gatling process to one node/place, where they are aggregated and compiled into an HTML report by a single Gatling process.

First of all, is there anything in the Mesos community that does this already? If not, do you think this is feasible to accomplish with a Mesos framework, and would you recommend going with this approach?
Does Mesos offer barrier-like features to coordinate jobs, and can I somehow move files to a single node to be processed?

This all sounds workable, but I do not have all the experience necessary to qualify your ideas. What I would suggest is a solution that lends itself to testing similarly configured cloud/cluster offerings, so that the cloud/cluster community has a way to test and evaluate new releases, substituted component codes, forks, and even competitive offerings. A ubiquitous and robust testing semantic based on your ideas does seem to be an overwhelmingly positive idea, imho. As such, some organizational structure to allow results to be maintained and quickly compared to other 'test-runs' would greatly encourage usage. Hopefully Gatling and such have many, if not most, of the features needed to automate the evaluation of results.

Finally, I've never written a non-trivial Mesos framework; how should I go about it, or find more documentation, to get started? I'm looking for best practices, pitfalls, etc. Thank you for your time, Carlos

hth, James
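Mesos itself offers no barrier primitive; the coordination Carlos asks about would live in the framework (e.g. via a ZooKeeper barrier, as Joao suggests). As a minimal single-process sketch of the start-together pattern only, here is the idea with Python's threading.Barrier standing in for the distributed barrier; the worker names and counts are illustrative, not part of the thread:

```python
import threading
import time

NUM_WORKERS = 4  # illustrative: one per Gatling executor in the real design

# Every worker blocks on the barrier until all of them are ready,
# then all release together, so the load starts simultaneously.
start_barrier = threading.Barrier(NUM_WORKERS)
results = []
results_lock = threading.Lock()

def run_load_worker(worker_id: int) -> None:
    # ...in the real framework this is where a Gatling process would be set up...
    start_barrier.wait()            # rendezvous: wait until all workers are ready
    started_at = time.monotonic()   # all workers pass this point together
    with results_lock:
        results.append((worker_id, started_at))

threads = [threading.Thread(target=run_load_worker, args=(i,))
           for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every worker crossed the barrier, so every result is present.
print(len(results))  # prints 4
```

In a cluster the same rendezvous would be done against shared state (ZooKeeper znodes or the framework's own API) rather than an in-process barrier, but the control flow is the same: register, wait for the full set, then fire.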
Re: Mesos Slave Port Change Fails Recovery
For slave recovery to work, the slave is expected not to change its config.

On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme phili...@hopper.com wrote:

Hi, I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves configured with checkpointing and with reconnect recovery. I was investigating why the slaves would successfully re-register with the master and recover, but would subsequently be asked to shut down (health-check timeout). It turns out that our slaves had unintentionally been configured to use port 5050 in the previous configuration. We decided to fix that during the upgrade and have them use the default 5051 port. This change seems to make the health checks fail, and eventually kills the slave due to inactivity. I've confirmed that leaving the port at what it was in the previous configuration lets the slave successfully re-register, and it is not asked to shut down later on. Is this a known issue? I haven't been able to find a JIRA ticket for it. Maybe it's the expected behaviour? Should I create a ticket? Thanks, Philippe
Mesos Slave Port Change Fails Recovery
Hi, I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves configured with checkpointing and with reconnect recovery. I was investigating why the slaves would successfully re-register with the master and recover, but would subsequently be asked to shut down (health-check timeout). It turns out that our slaves had unintentionally been configured to use port 5050 in the previous configuration. We decided to fix that during the upgrade and have them use the default 5051 port. This change seems to make the health checks fail, and eventually kills the slave due to inactivity. I've confirmed that leaving the port at what it was in the previous configuration lets the slave successfully re-register, and it is not asked to shut down later on. Is this a known issue? I haven't been able to find a JIRA ticket for it. Maybe it's the expected behaviour? Should I create a ticket? Thanks, Philippe
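The configuration change described above boils down to the slave's --port flag. A sketch, with the master address and work directory as illustrative assumptions (only the ports come from the thread):

```shell
# Previous (unintentional) slave configuration: listening on 5050,
# which is the master's default port.
mesos-slave --master=master.example.com:5050 --port=5050 --work_dir=/var/lib/mesos

# Configuration after the upgrade: the default slave port 5051.
# Per the thread, recovery then re-registers but health checks fail
# and the slave is eventually shut down for inactivity.
mesos-slave --master=master.example.com:5050 --port=5051 --work_dir=/var/lib/mesos
```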
Re: Mesos Slave Port Change Fails Recovery
Here you are: https://gist.github.com/plaflamme/9cd056dc959e0597fb1c You can see in the mesos-master.INFO log that it re-registers the slave using port :5050 (line 9) and fails the health checks on port :5051 (line 10). So it might be the slave that re-uses the old configuration? Thanks, Philippe

On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone vinodk...@gmail.com wrote:

Can you paste some logs?

On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme phili...@hopper.com wrote:

Ok, that's reasonable, but I'm not sure why it would successfully re-register with the master if it's not supposed to in the first place. I think changing the resources (for example) will dump the old configuration in the logs and tell you why recovery is bailing out. It's not doing that in this case. It looks as though this doesn't work only because the master can't ping the slave on the old port; the whole recovery process was successful otherwise. I'm not sure whether the slave could have picked up its configuration change and failed the recovery early, but that would definitely be a better experience. Philippe

On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone vinodk...@gmail.com wrote:

For slave recovery to work, the slave is expected not to change its config.

On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme phili...@hopper.com wrote:

Hi, I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves configured with checkpointing and with reconnect recovery. I was investigating why the slaves would successfully re-register with the master and recover, but would subsequently be asked to shut down (health-check timeout). It turns out that our slaves had unintentionally been configured to use port 5050 in the previous configuration. We decided to fix that during the upgrade and have them use the default 5051 port. This change seems to make the health checks fail, and eventually kills the slave due to inactivity.

I've confirmed that leaving the port at what it was in the previous configuration lets the slave successfully re-register, and it is not asked to shut down later on. Is this a known issue? I haven't been able to find a JIRA ticket for it. Maybe it's the expected behaviour? Should I create a ticket? Thanks, Philippe