Re: mesos cluster can't fit federation cluster

2015-07-02 Thread tommy xiao
Hi Marco,

I want fault tolerance for slave nodes across multiple datacenters, but I found
that the possible setup methods are not production-ready.

2015-07-02 1:38 GMT+08:00 Marco Massenzio ma...@mesosphere.io:

 Hi Tommy,

 not sure what your use-case is, but you are correct, the master/slave
 nodes need to have bi-directional connectivity.
 However, there is no fundamental reason why those have to be public IPs
 - so long as they are routable (either via DNS discovery and / or VPN or
 other network-layer mechanisms) that will work.
 (I mean, without even thinking too hard about this - so I may be entirely
 wrong here - you could place a couple of Nginx/HAproxy nodes with two NICs,
 one visible to the Slaves, the other in the VPC subnet, and forward all
 traffic? I'm sure I'm missing something here :)

 When you launch the master nodes, you specify the NICs they need to listen
 to via the --ip option, while the slave nodes have the --master flag that
 should have either a hostname:port or ip:port argument: so long as they are
 routable, this *should* work (although, admittedly, I've never tried this
 personally).
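 For illustration only (hypothetical addresses; flag names as used by the
 mesos-master and mesos-slave binaries), the two invocations might look like:

     # Master: bind to the private NIC and register in ZooKeeper
     mesos-master --ip=10.0.1.5 --zk=zk://10.0.1.10:2181/mesos --quorum=1

     # Slave in the other DC: point at a routable master address
     # (hostname:port or ip:port)
     mesos-slave --master=master1.example.com:5050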

 One concern I would have in such an arrangement though, would be about
 network partitioning: if the DC/DC connectivity were to drop, you'd
 suddenly lose all master/slave connectivity; it's also not clear to me that
 having sectioned the Masters from the Slaves would give you better
 availability and/or reliability and/or security?
 It would be great to understand the use-case, so we could see what could
 be added (if anything) to Mesos going forward.


 *Marco Massenzio*
 *Distributed Systems Engineer*

 On Wed, Jul 1, 2015 at 9:15 AM, tommy xiao xia...@gmail.com wrote:

 Hello,

 I would like to deploy master nodes in a private zone and set up Mesos
 slaves in another datacenter, but this multi-datacenter mode can't work: it
 requires that slave nodes can reach the master nodes on a public IP. In our
 production zone, however, the gateway IP does not belong to the master nodes.
 Does anyone have experience with a similar multi-datacenter deployment?

 I prefer the Kubernetes federation cluster proposal:

 https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/proposals/federation-high-level-arch.png


 --
 Deshi Xiao
 Twitter: xds2000
 E-mail: xiaods(AT)gmail.com





-- 
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com


Re: Mesos Slave Port Change Fails Recovery

2015-07-02 Thread Philippe Laflamme
Checkpointing has been enabled since 0.18 on these slaves. The only other
setting that changed during the upgrade was that we added --gc_delay=1days.
Otherwise, it's an in-place upgrade without any changes to the work
directory...

Philippe

On Thu, Jul 2, 2015 at 8:59 PM, Vinod Kone vinodk...@gmail.com wrote:

 It is surprising that the slave didn't bail out during the initial phase
 of recovery when the port changed. I'm assuming you enabled checkpointing
 in 0.20.0 and that you didn't wipe the metadata directory or anything when
 upgrading to 0.21.0?

 On Thu, Jul 2, 2015 at 3:06 PM, Philippe Laflamme phili...@hopper.com
 wrote:

 Here you are:

 https://gist.github.com/plaflamme/9cd056dc959e0597fb1c

 You can see in the mesos-master.INFO log that it re-registers the slave
 using port :5050 (line 9) and fails the health checks on port :5051 (line
 10). So it might be the slave that re-uses the old configuration?

 Thanks,
 Philippe

 On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone vinodk...@gmail.com wrote:

 Can you paste some logs?

 On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme phili...@hopper.com
 wrote:

 Ok, that's reasonable, but I'm not sure why it would successfully
 re-register with the master if it's not supposed to in the first place. I
 think changing the resources (for example) will dump the old configuration
 in the logs and tell you why recovery is bailing out. It's not doing that
 in this case.

 It looks as though this fails only because the master can't ping
 the slave on the old port; the whole recovery process was otherwise
 successful.

 I'm not sure if the slave could have picked up its configuration change
 and failed the recovery early, but that would definitely be a better
 experience.

 Philippe

 On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone vinodk...@gmail.com wrote:

 For slave recovery to work, the slave is expected not to change its config.

 On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme phili...@hopper.com
  wrote:

 Hi,

 I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
 configured with checkpointing and with reconnect recovery.

 I was investigating why the slaves would successfully re-register
 with the master and recover, but would subsequently be asked to shutdown
 (health check timeout).

 It turns out that our slaves had been unintentionally configured to
 use port 5050 in the previous configuration. We decided to fix that 
 during
 the upgrade and have them use the default 5051 port.

 This change seems to make the health checks fail and eventually kill
 the slave due to inactivity.

 I've confirmed that leaving the port to what it was in the previous
 configuration makes the slave successfully re-register and is not asked 
 to
 shutdown later on.

 Is this a known issue? I haven't been able to find a JIRA ticket for
 this. Maybe it's the expected behaviour? Should I create a ticket?

 Thanks,
 Philippe









Re: Mesos Slave Port Change Fails Recovery

2015-07-02 Thread Vinod Kone
It is surprising that the slave didn't bail out during the initial phase of
recovery when the port changed. I'm assuming you enabled checkpointing in
0.20.0 and that you didn't wipe the metadata directory or anything when
upgrading to 0.21.0?

On Thu, Jul 2, 2015 at 3:06 PM, Philippe Laflamme phili...@hopper.com
wrote:

 Here you are:

 https://gist.github.com/plaflamme/9cd056dc959e0597fb1c

 You can see in the mesos-master.INFO log that it re-registers the slave
 using port :5050 (line 9) and fails the health checks on port :5051 (line
 10). So it might be the slave that re-uses the old configuration?

 Thanks,
 Philippe

 On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone vinodk...@gmail.com wrote:

 Can you paste some logs?

 On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme phili...@hopper.com
 wrote:

 Ok, that's reasonable, but I'm not sure why it would successfully
 re-register with the master if it's not supposed to in the first place. I
 think changing the resources (for example) will dump the old configuration
 in the logs and tell you why recovery is bailing out. It's not doing that
 in this case.

 It looks as though this fails only because the master can't ping
 the slave on the old port; the whole recovery process was otherwise
 successful.

 I'm not sure if the slave could have picked up its configuration change
 and failed the recovery early, but that would definitely be a better
 experience.

 Philippe

 On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone vinodk...@gmail.com wrote:

 For slave recovery to work, the slave is expected not to change its config.

 On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme phili...@hopper.com
 wrote:

 Hi,

 I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
 configured with checkpointing and with reconnect recovery.

 I was investigating why the slaves would successfully re-register with
 the master and recover, but would subsequently be asked to shutdown
 (health check timeout).

 It turns out that our slaves had been unintentionally configured to
 use port 5050 in the previous configuration. We decided to fix that during
 the upgrade and have them use the default 5051 port.

 This change seems to make the health checks fail and eventually kill
 the slave due to inactivity.

 I've confirmed that leaving the port to what it was in the previous
 configuration makes the slave successfully re-register and is not asked to
 shutdown later on.

 Is this a known issue? I haven't been able to find a JIRA ticket for
 this. Maybe it's the expected behaviour? Should I create a ticket?

 Thanks,
 Philippe








Re: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and Gatling

2015-07-02 Thread CCAAT

On 07/02/2015 12:10 PM, Carlos Torres wrote:


From: CCAAT cc...@tampabay.rr.com
Sent: Thursday, July 2, 2015 12:00 PM
To: user@mesos.apache.org
Cc: cc...@tampabay.rr.com
Subject: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and 
Gatling

On 07/01/2015 01:17 PM, Carlos Torres wrote:

Hi all,

In the past weeks, I've been thinking of leveraging Mesos to schedule 
distributed load tests.


An excellent idea.


One problem, at least for me, with this approach is that the load testing tool
needs to coordinate the distributed scenario and combine the data. If it
doesn't, the load clients will trigger at different times, and a later
aggregation step would have to be handled by the user, some external batch
job, or a script. This is not a problem for load generators like Tsung or
Locust, since they already provide a distributed model and coordinate the
distributed tasks, but it could be a little more complicated for Gatling,
which does not. To me, the approach the Kubernetes team suggests is really a
hack, using the 'Replication Controller' to spawn multiple replicas, which
could easily be achieved with the same approach in Marathon (or Kubernetes on
Mesos).



I was thinking of building a Mesos framework that would take the input, or
load simulation file, and would schedule jobs across the cluster (perhaps with
dedicated resources to minimize variance) using Gatling. A Mesos framework
would be able to provide a UI/API to take the input jobs and report the status
of multiple jobs. It could also provide a way to sync/orchestrate the
simulation, and finally a way to aggregate the simulation data in one place
and serve the generated HTML report.



Boiled down to its primitive parts, it would spin up multiple Gatling (Java)
processes across the cluster, use something like a barrier (not sure what to
use here) to wait for all processes to be ready to execute, and finally copy
and rename the generated simulation logs from each Gatling process to one
node/place, where they are aggregated and compiled into an HTML report by a
single Gatling process.



First of all, is there anything in the Mesos community that does this
already? If not, do you think this is feasible to accomplish with a Mesos
framework, and would you recommend going with this approach? Does Mesos offer
a barrier-like feature to coordinate jobs, and can I somehow move files to a
single node to be processed?
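Mesos itself does not expose a barrier primitive, but a Mesos cluster already
runs ZooKeeper, and a ZooKeeper recipe can fill the gap. A minimal sketch using
Python's kazoo client (the ensemble address, znode path, and party size are
illustrative, and run_gatling_simulation is a placeholder):

    from kazoo.client import KazooClient
    from kazoo.recipe.barrier import DoubleBarrier

    zk = KazooClient(hosts="10.0.1.10:2181")  # hypothetical ZK ensemble
    zk.start()

    # Every load-client process enters the barrier; enter() blocks until
    # num_clients participants have arrived, so all Gatling processes start
    # their simulations at (roughly) the same moment.
    barrier = DoubleBarrier(zk, "/loadtest/barrier", num_clients=5)
    barrier.enter()
    run_gatling_simulation()  # placeholder for the actual load run
    barrier.leave()           # blocks until every participant has finished
    zk.stop()

Each process would then copy its simulation.log to a shared location (HDFS, an
object store, or plain scp) for the final aggregation pass; Mesos itself does
not move files between nodes.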


This all sounds workable, but I do not have all the experience
necessary to qualify your ideas. What I would suggest is a solution that
lends itself to testing similarly configured cloud/cluster offerings, so
that we, the cloud/cluster community, have a way to test and evaluate new
releases, substitute component code, forks, and even competing
offerings. A ubiquitous and robust testing semantic based on your ideas
seems an overwhelmingly positive idea, imho. As such, some
organizational structure to allow results to be maintained and quickly
compared to other 'test-runs' would greatly encourage usage.
Hopefully 'Gatling' and such have many, if not most, of the features
needed to automate the evaluation of results.



Finally, I've never written a non-trivial Mesos framework; how should I go
about it, and where can I find more documentation to get started? I'm looking
for best practices, pitfalls, etc.


Thank you for your time,
Carlos


hth,
James


Thanks for your feedback.

I like your idea about having the ability to swap out the different components
(e.g. load generators) and perhaps even providing an abstraction over the
charting and data reporting mechanisms.

I'll probably start with the simplest way possible, though: having the
framework deploy Gatling across the cluster in a scale-out fashion and
retrieve each instance's results. Once I get that working, I'll start
experimenting with abstracting out certain functionality.

I know Twitter has a distributed load generator, called Iago, that apparently
works on Mesos; it'd be awesome if any of its contributors chimed in and
shared what worked great, good, and not so well.


The few things I'm concerned about in implementing such a framework on Mesos
are:

* Noisy neighbors, or resource isolation.
 - Rationale: It can introduce noise in the results if a load generator
competes for shared resources (e.g. network) with other tasks.

* Coordination of execution
 - Rationale: Need the ability to control execution of groups of related
tasks. User A submits a simulation that might create 5 load clients (tasks?);
right after that, User B submits a different simulation that creates 10 load
clients. Ideally, all of User A's load clients should be on independent nodes
and should not share the same slaves with User B's load clients; if not enough
slaves are available on the cluster, then User B's simulation queues until
slaves are available. There might be enough resources to 

Re: [Question] Distributed Load Testing with Mesos and Gatling

2015-07-02 Thread CCAAT

On 07/01/2015 01:17 PM, Carlos Torres wrote:

Hi all,

In the past weeks, I've been thinking of leveraging Mesos to schedule 
distributed load tests.


An excellent idea.


One problem, at least for me, with this approach is that the load testing tool
needs to coordinate the distributed scenario and combine the data. If it
doesn't, the load clients will trigger at different times, and a later
aggregation step would have to be handled by the user, some external batch
job, or a script. This is not a problem for load generators like Tsung or
Locust, since they already provide a distributed model and coordinate the
distributed tasks, but it could be a little more complicated for Gatling,
which does not. To me, the approach the Kubernetes team suggests is really a
hack, using the 'Replication Controller' to spawn multiple replicas, which
could easily be achieved with the same approach in Marathon (or Kubernetes on
Mesos).



I was thinking of building a Mesos framework that would take the input, or
load simulation file, and would schedule jobs across the cluster (perhaps with
dedicated resources to minimize variance) using Gatling. A Mesos framework
would be able to provide a UI/API to take the input jobs and report the status
of multiple jobs. It could also provide a way to sync/orchestrate the
simulation, and finally a way to aggregate the simulation data in one place
and serve the generated HTML report.



Boiled down to its primitive parts, it would spin up multiple Gatling (Java)
processes across the cluster, use something like a barrier (not sure what to
use here) to wait for all processes to be ready to execute, and finally copy
and rename the generated simulation logs from each Gatling process to one
node/place, where they are aggregated and compiled into an HTML report by a
single Gatling process.



First of all, is there anything in the Mesos community that does this
already? If not, do you think this is feasible to accomplish with a Mesos
framework, and would you recommend going with this approach? Does Mesos offer
a barrier-like feature to coordinate jobs, and can I somehow move files to a
single node to be processed?


This all sounds workable, but I do not have all the experience
necessary to qualify your ideas. What I would suggest is a solution that
lends itself to testing similarly configured cloud/cluster offerings, so
that we, the cloud/cluster community, have a way to test and evaluate new
releases, substitute component code, forks, and even competing
offerings. A ubiquitous and robust testing semantic based on your ideas
seems an overwhelmingly positive idea, imho. As such, some
organizational structure to allow results to be maintained and quickly
compared to other 'test-runs' would greatly encourage usage.
Hopefully 'Gatling' and such have many, if not most, of the features
needed to automate the evaluation of results.




Finally, I've never written a non-trivial Mesos framework; how should I go
about it, and where can I find more documentation to get started? I'm looking
for best practices, pitfalls, etc.


Thank you for your time,
Carlos


hth,
James



Re: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and Gatling

2015-07-02 Thread Carlos Torres
Yes, I agree. I think starting out with the scale-out approach, while naive,
will be a good starting point.


I actually have this automated with Jenkins and a bunch of dedicated slaves,
using the Workflow plugin; it works only kind of OK since I can't really
control their execution.


If you are interested, here's my workflow script for Jenkins: 
https://github.com/meteorfox/gatling-workflow/blob/master/gatling_flow.groovy


-- Carlos


From: Joao Ribeiro jonnyb...@gmail.com
Sent: Thursday, July 2, 2015 11:33 AM
To: user@mesos.apache.org
Subject: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and 
Gatling

This sounds like a really cool project.

I am still a very green user of mesos and never used gatling at all but a quick 
search took me to http://gatling.io/docs/2.1.6/cookbook/scaling_out.html

With this it shouldn't be too difficult to create a master/slave or 
scheduler/executors approach where you would have the master launch several 
slaves to do the work, wait for it to finish, collect logs and generate the 
report.
For better synchronisation you could have the slaves register with ZooKeeper
while the master waits for all slaves to be up, then trigger a “start test”
command on all slaves simultaneously.
You could then easily time out if it takes too long to get all slaves up, or
use other, more fault-tolerant strategies, e.g.: run with the slaves you got;
bump each slave that is up with more load to try to make up for missing
slaves.

It might be a naive approach but would be a starting point in my opinion.

On 02 Jul 2015, at 18:00, CCAAT 
cc...@tampabay.rr.commailto:cc...@tampabay.rr.com wrote:

On 07/01/2015 01:17 PM, Carlos Torres wrote:
Hi all,

In the past weeks, I've been thinking of leveraging Mesos to schedule 
distributed load tests.

An excellent idea.

One problem, at least for me, with this approach is that the load testing tool
needs to coordinate the distributed scenario and combine the data. If it
doesn't, the load clients will trigger at different times, and a later
aggregation step would have to be handled by the user, some external batch
job, or a script. This is not a problem for load generators like Tsung or
Locust, since they already provide a distributed model and coordinate the
distributed tasks, but it could be a little more complicated for Gatling,
which does not. To me, the approach the Kubernetes team suggests is really a
hack, using the 'Replication Controller' to spawn multiple replicas, which
could easily be achieved with the same approach in Marathon (or Kubernetes on
Mesos).

I was thinking of building a Mesos framework that would take the input, or
load simulation file, and would schedule jobs across the cluster (perhaps with
dedicated resources to minimize variance) using Gatling. A Mesos framework
would be able to provide a UI/API to take the input jobs and report the status
of multiple jobs. It could also provide a way to sync/orchestrate the
simulation, and finally a way to aggregate the simulation data in one place
and serve the generated HTML report.

Boiled down to its primitive parts, it would spin up multiple Gatling (Java)
processes across the cluster, use something like a barrier (not sure what to
use here) to wait for all processes to be ready to execute, and finally copy
and rename the generated simulation logs from each Gatling process to one
node/place, where they are aggregated and compiled into an HTML report by a
single Gatling process.

First of all, is there anything in the Mesos community that does this
already? If not, do you think this is feasible to accomplish with a Mesos
framework, and would you recommend going with this approach? Does Mesos offer
a barrier-like feature to coordinate jobs, and can I somehow move files to a
single node to be processed?

This all sounds workable, but I do not have all the experience
necessary to qualify your ideas. What I would suggest is a solution that
lends itself to testing similarly configured cloud/cluster offerings, so
that we, the cloud/cluster community, have a way to test and evaluate new
releases, substitute component code, forks, and even competing
offerings. A ubiquitous and robust testing semantic based on your ideas
seems an overwhelmingly positive idea, imho. As such, some
organizational structure to allow results to be maintained and quickly
compared to other 'test-runs' would greatly encourage usage.
Hopefully 'Gatling' and such have many, if not most, of the features
needed to automate the evaluation of results.


Finally, I've never written a non-trivial Mesos framework; how should I go
about it, and where can I find more documentation to get started? I'm looking
for best practices, pitfalls, etc.


Thank you for your time,
Carlos

hth,
James




Re: [Question] Distributed Load Testing with Mesos and Gatling

2015-07-02 Thread Joao Ribeiro
This sounds like a really cool project.

I am still a very green user of mesos and never used gatling at all but a quick 
search took me to http://gatling.io/docs/2.1.6/cookbook/scaling_out.html 
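(For reference, the cookbook's aggregation step boils down to collecting every
simulation.log into one results directory and running Gatling once in
reports-only mode; in Gatling 2.x that is, with a made-up directory name:

    gatling.sh -ro merged-results

Gatling then compiles the combined logs into a single HTML report.)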

With this it shouldn't be too difficult to create a master/slave or 
scheduler/executors approach where you would have the master launch several 
slaves to do the work, wait for it to finish, collect logs and generate the 
report.
For better synchronisation you could have the slaves register with ZooKeeper
while the master waits for all slaves to be up, then trigger a “start test”
command on all slaves simultaneously.
You could then easily time out if it takes too long to get all slaves up, or
use other, more fault-tolerant strategies, e.g.: run with the slaves you got;
bump each slave that is up with more load to try to make up for missing
slaves.

It might be a naive approach but would be a starting point in my opinion.
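A rough sketch of that register-and-trigger idea, using ZooKeeper from Python
via the kazoo library (the ensemble address, znode paths, and timeout are made
up for illustration); this is the slave side:

    import threading
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk.example.com:2181")  # hypothetical ensemble
    zk.start()

    # Each load slave announces itself with an ephemeral sequential node,
    # so a crashed slave disappears from the registry automatically.
    zk.ensure_path("/loadtest/slaves")
    zk.create("/loadtest/slaves/slave-", ephemeral=True, sequence=True)

    # Block until the coordinator creates the start flag.
    started = threading.Event()

    @zk.DataWatch("/loadtest/start")
    def watch_start(data, stat):
        if stat is not None:  # the start node now exists
            started.set()

    started.wait(timeout=300)  # give up if the test never starts
    # ... run the Gatling simulation here ...

The master side would simply count the children of /loadtest/slaves and create
/loadtest/start once enough are up; the fallback strategies above are then just
different policies for when to create that flag.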

 On 02 Jul 2015, at 18:00, CCAAT cc...@tampabay.rr.com wrote:
 
 On 07/01/2015 01:17 PM, Carlos Torres wrote:
 Hi all,
 
 In the past weeks, I've been thinking of leveraging Mesos to schedule 
 distributed load tests.
 
 An excellent idea.
 
 One problem, at least for me, with this approach is that the load testing
 tool needs to coordinate the distributed scenario and combine the data. If it
 doesn't, the load clients will trigger at different times, and a later
 aggregation step would have to be handled by the user, some external batch
 job, or a script. This is not a problem for load generators like Tsung or
 Locust, since they already provide a distributed model and coordinate the
 distributed tasks, but it could be a little more complicated for Gatling,
 which does not. To me, the approach the Kubernetes team suggests is really a
 hack, using the 'Replication Controller' to spawn multiple replicas, which
 could easily be achieved with the same approach in Marathon (or Kubernetes on
 Mesos).
 
 I was thinking of building a Mesos framework that would take the input, or
 load simulation file, and would schedule jobs across the cluster (perhaps
 with dedicated resources to minimize variance) using Gatling. A Mesos
 framework would be able to provide a UI/API to take the input jobs and
 report the status of multiple jobs. It could also provide a way to
 sync/orchestrate the simulation, and finally a way to aggregate the
 simulation data in one place and serve the generated HTML report.
 
 Boiled down to its primitive parts, it would spin up multiple Gatling (Java)
 processes across the cluster, use something like a barrier (not sure what to
 use here) to wait for all processes to be ready to execute, and finally copy
 and rename the generated simulation logs from each Gatling process to one
 node/place, where they are aggregated and compiled into an HTML report by a
 single Gatling process.
 
 First of all, is there anything in the Mesos community that does this
 already? If not, do you think this is feasible to accomplish with a Mesos
 framework, and would you recommend going with this approach? Does Mesos offer
 a barrier-like feature to coordinate jobs, and can I somehow move files to a
 single node to be processed?
 
 This all sounds workable, but I do not have all the experience necessary to
 qualify your ideas. What I would suggest is a solution that lends itself to
 testing similarly configured cloud/cluster offerings, so that we, the
 cloud/cluster community, have a way to test and evaluate new releases,
 substitute component code, forks, and even competing offerings. A ubiquitous
 and robust testing semantic based on your ideas seems an overwhelmingly
 positive idea, imho. As such, some organizational structure to allow results
 to be maintained and quickly compared to other 'test-runs' would greatly
 encourage usage.
 Hopefully 'Gatling' and such have many, if not most, of the features needed
 to automate the evaluation of results.
 
 
 Finally, I've never written a non-trivial Mesos framework; how should I go
 about it, and where can I find more documentation to get started? I'm
 looking for best practices, pitfalls, etc.
 
 
 Thank you for your time,
 Carlos
 
 hth,
 James
 



Re: COMMERCIAL:Re: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and Gatling

2015-07-02 Thread Carlos Torres


From: CCAAT cc...@tampabay.rr.com
Sent: Thursday, July 2, 2015 2:03 PM
To: user@mesos.apache.org
Cc: cc...@tampabay.rr.com
Subject: COMMERCIAL:Re: COMMERCIAL:Re: [Question] Distributed Load Testing with 
Mesos and Gatling

On 07/02/2015 12:10 PM, Carlos Torres wrote:
 
 From: CCAAT cc...@tampabay.rr.com
 Sent: Thursday, July 2, 2015 12:00 PM
 To: user@mesos.apache.org
 Cc: cc...@tampabay.rr.com
 Subject: COMMERCIAL:Re: [Question] Distributed Load Testing with Mesos and 
 Gatling

 On 07/01/2015 01:17 PM, Carlos Torres wrote:
 Hi all,

 In the past weeks, I've been thinking of leveraging Mesos to schedule 
 distributed load tests.

 An excellent idea.

 One problem, at least for me, with this approach is that the load testing
 tool needs to coordinate the distributed scenario and combine the data. If it
 doesn't, the load clients will trigger at different times, and a later
 aggregation step would have to be handled by the user, some external batch
 job, or a script. This is not a problem for load generators like Tsung or
 Locust, since they already provide a distributed model and coordinate the
 distributed tasks, but it could be a little more complicated for Gatling,
 which does not. To me, the approach the Kubernetes team suggests is really a
 hack, using the 'Replication Controller' to spawn multiple replicas, which
 could easily be achieved with the same approach in Marathon (or Kubernetes on
 Mesos).

 I was thinking of building a Mesos framework that would take the input, or
 load simulation file, and would schedule jobs across the cluster (perhaps
 with dedicated resources to minimize variance) using Gatling. A Mesos
 framework would be able to provide a UI/API to take the input jobs and
 report the status of multiple jobs. It could also provide a way to
 sync/orchestrate the simulation, and finally a way to aggregate the
 simulation data in one place and serve the generated HTML report.

 Boiled down to its primitive parts, it would spin up multiple Gatling (Java)
 processes across the cluster, use something like a barrier (not sure what to
 use here) to wait for all processes to be ready to execute, and finally copy
 and rename the generated simulation logs from each Gatling process to one
 node/place, where they are aggregated and compiled into an HTML report by a
 single Gatling process.

 First of all, is there anything in the Mesos community that does this
 already? If not, do you think this is feasible to accomplish with a Mesos
 framework, and would you recommend going with this approach? Does Mesos offer
 a barrier-like feature to coordinate jobs, and can I somehow move files to a
 single node to be processed?

 This all sounds workable, but I do not have all the experience
 necessary to qualify your ideas. What I would suggest is a solution that
 lends itself to testing similarly configured cloud/cluster offerings, so
 that we, the cloud/cluster community, have a way to test and evaluate new
 releases, substitute component code, forks, and even competing
 offerings. A ubiquitous and robust testing semantic based on your ideas
 seems an overwhelmingly positive idea, imho. As such, some
 organizational structure to allow results to be maintained and quickly
 compared to other 'test-runs' would greatly encourage usage.
 Hopefully 'Gatling' and such have many, if not most, of the features
 needed to automate the evaluation of results.


 Finally, I've never written a non-trivial Mesos framework; how should I go
 about it, and where can I find more documentation to get started? I'm
 looking for best practices, pitfalls, etc.


 Thank you for your time,
 Carlos

 hth,
 James


 Thanks for your feedback.

 I like your idea about having the ability to swap out the different
 components (e.g. load generators) and perhaps even providing an abstraction
 over the charting and data reporting mechanisms.

 I'll probably start with the simplest way possible, though: having the
 framework deploy Gatling across the cluster in a scale-out fashion and
 retrieve each instance's results. Once I get that working, I'll start
 experimenting with abstracting out certain functionality.

 I know Twitter has a distributed load generator, called Iago, that
 apparently works on Mesos; it'd be awesome if any of its contributors chimed
 in and shared what worked great, good, and not so well.


 The few things I'm concerned about in implementing such a framework on
 Mesos are:

 * Noisy neighbors, or resource isolation.
  - Rationale: It can introduce noise in the results if a load generator
 competes for shared resources (e.g. network) with other tasks.

 * Coordination of execution
  - Rationale: Need the ability to control execution of groups of related
 tasks. User A submits a simulation that might create 5 load clients (tasks?);
 right after that, User B 

Re: Apache Mesos Community Sync

2015-07-02 Thread Joris Van Remoortere
Reminder: The Mesos Community Developer Sync will be happening today at 3pm
Pacific.

To participate remotely, join the Google hangout:
https://plus.google.com/hangouts/_/twitter.com/mesos-sync

On Thu, Jun 18, 2015 at 7:22 AM, Adam Bordelon a...@mesosphere.io wrote:

 Reminder: We're hosting a developer community sync at Mesosphere HQ this
 morning from 9-11am Pacific.

 The agenda is pretty bare, so please add more topics you would like to
 discuss:

 https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit

 If you want to join in person, just show up to 88 Stevenson St, ring the
 buzzer, take the elevator up to 2nd floor, and then you can take the stairs
 up to the 3rd floor dining room, or ask somebody to let you up the elevator
 to the 3rd floor.

 To participate remotely, join the Google hangout:
 https://plus.google.com/hangouts/_/mesosphere.io/mesos-developer

 On Mon, Jun 15, 2015 at 10:46 AM, Adam Bordelon a...@mesosphere.io
 wrote:

 As previously mentioned, we would like to host additional Mesos developer
 syncs at our new Mesosphere HQ at 88 Stevenson St (tucked behind Market &
 2nd), starting this Thursday from 9-11am Pacific. We opted for an earlier
 slot so that the European developer community can participate.

 Now that we are having these more frequently, it would be great to dive
 deeper into designs for upcoming features as well as discuss longstanding
 issues. While high-level status updates are useful, they should be a small
 part of these meetings so that we can address issues currently facing our
 developers.

 Please add agenda items to the same doc we've been using for previous
 meetings' Agenda/Notes:

 https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit

 Join in person if you can, or join remotely via hangout:
 https://plus.google.com/hangouts/_/mesosphere.io/mesos-developer

 Thanks,
 -Adam-


 On Thu, May 28, 2015 at 10:08 AM, Vinod Kone vinodk...@gmail.com wrote:

 Cool.

 Here's the agenda doc
 
 https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit#
 
 for next week that folks can fill in.

 On Thu, May 28, 2015 at 9:52 AM, Adam Bordelon a...@mesosphere.io
 wrote:

  Looks like next week, Thursday June 4th on my calendar.
  I thought it was always the first Thursday of the month.
 
  On Thu, May 28, 2015 at 9:33 AM, Vinod Kone vinodk...@gmail.com
 wrote:
 
   Do we have community sync today or next week? I'm a bit confused.
  
   @vinodkone
  
On Apr 1, 2015, at 3:18 AM, Adam Bordelon a...@mesosphere.io
 wrote:
   
Reminder: We're having another Mesos Developer Community Sync this
Thursday, April 2nd from 3-5pm Pacific.
   
Agenda:
https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit?usp=sharing
To Join: follow the BlueJeans instructions from the recurring meeting
invite at the start of this thread.
   
On Fri, Mar 6, 2015 at 11:11 AM, Vinod Kone vinodk...@apache.org
 
   wrote:
   
Hi folks,
   
We are planning to do monthly Mesos community meetings. Tentatively these
are scheduled to occur on the 1st Thursday of every month at 3 PM PST. See
below for details to join the meeting remotely.
   
This is a forum to ask questions/discuss about upcoming features, process,
etc. Everyone is welcome to join. Feel free to add items to the agenda for
the next meeting here:
https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit?usp=sharing
   
Cheers,
   
On Thu, Mar 5, 2015 at 11:23 AM, Vinod Kone via Blue Jeans
 Network 
inv...@bluejeans.com wrote:
   
   [image: Blue Jeans] http://bluejeans.com
   Vinod Kone (vi...@twitter.com) has invited you to a video meeting.
   Meeting Title: Apache Mesos Community Sync
   Meeting Time: Every 4th week on Thursday • from March 5, 2015 • 3 p.m. PST / 2 hrs
   Join Meeting:
   https://bluejeans.com/272369669?ll=eng=mrsxmqdnmvzw64zomfygcy3imuxg64th
  
--
 Connecting directly from a room system?
   
1) Dial: 199.48.152.152 or bjn.vc
2) Enter Meeting ID: 272369669 -or- use the pairing code
   
   
   Just want to dial in? (all numbers: http://bluejeans.com/premium-numbers)
   1) Direct-dial with my iPhone: +14087407256,,#272369669%23,%23
      or +1 408 740 7256
      +1 888 240 2560 (US Toll Free)
      +1 408 317 9253 (Alternate Number)

   2) Enter Meeting ID: 272369669
   
 --
 Description:
We will try BlueJeans VC this time for our monthly community sync.

If BlueJeans *doesn't* work out we will use the Google Hangout link
(https://plus.google.com/hangouts/_/twitter.com/mesos-sync) instead.
*Note:* No moderator is required to start this 

Re: mesos cluster can't fit federation cluster

2015-07-02 Thread Marco Massenzio
On Wed, Jul 1, 2015 at 11:38 PM, tommy xiao xia...@gmail.com wrote:

 Hi Marco,

 I want fault tolerance for slave nodes across multiple datacenters, but I
 found that the possible setup methods are not production-ready.


What kind of fault-tolerance are you looking for here?
Against one (or both) of the DCs going away, or network partitioning? Or
one (or more) of the racks in one DC going away?

Depending on what you want to protect against, there may be different ways to
achieve that.
I'm sorry I haven't been around Mesos long enough to really be knowledgeable
about the specifics here, but I have built HA systems before around VPCs and
on-prem solutions, and I know bi-directional routing can be achieved using
gateways and/or VPN (dedicated) links (we also solved that very issue at
Google, but I can't talk about that :).

I'm sure the Twitter folks have solved that same problem too, but I'm
guessing they may not be able to share much either?


 2015-07-02 1:38 GMT+08:00 Marco Massenzio ma...@mesosphere.io:

 Hi Tommy,

 not sure what your use-case is, but you are correct, the master/slave
 nodes need to have bi-directional connectivity.
 However, there is no fundamental reason why those have to be public IPs
 - so long as they are routable (either via DNS discovery and / or VPN or
 other network-layer mechanisms) that will work.
 (I mean, without even thinking too hard about this - so I may be entirely
 wrong here - you could place a couple of Nginx/HAproxy nodes with two NICs,
 one visible to the Slaves, the other in the VPC subnet, and forward all
 traffic? I'm sure I'm missing something here :)

 When you launch the master nodes, you specify the NICs they need to
 listen to via the --ip option, while the slave nodes have the --master flag
 that should have either a hostname:port or ip:port argument: so long as
 they are routable, this *should* work (although, admittedly, I've never
 tried this personally).

 One concern I would have in such an arrangement though, would be about
 network partitioning: if the DC/DC connectivity were to drop, you'd
 suddenly lose all master/slave connectivity; it's also not clear to me that
 having sectioned the Masters from the Slaves would give you better
 availability and/or reliability and/or security?
 It would be great to understand the use-case, so we could see what could
 be added (if anything) to Mesos going forward.


 *Marco Massenzio*
 *Distributed Systems Engineer*

 On Wed, Jul 1, 2015 at 9:15 AM, tommy xiao xia...@gmail.com wrote:

 Hello,

 I would like to deploy master nodes in a private zone and set up Mesos
 slaves in another datacenter, but this multi-datacenter mode can't work: it
 requires that slave nodes can reach the master nodes on a public IP. In our
 production zone, however, the gateway IP does not belong to the master nodes.
 Does anyone have experience with a similar multi-datacenter deployment?

 I prefer the Kubernetes federation cluster proposal:

 https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/proposals/federation-high-level-arch.png


 --
 Deshi Xiao
 Twitter: xds2000
 E-mail: xiaods(AT)gmail.com





 --
 Deshi Xiao
 Twitter: xds2000
 E-mail: xiaods(AT)gmail.com



Re: [Question] Distributed Load Testing with Mesos and Gatling

2015-07-02 Thread Ben Whitehead
Hi Carlos,

This sounds like a great idea and something that many people would like to
be able to leverage.

Mesosphere has a pretty good starting example that could be used to
bootstrap the process of authoring your Gatling Framework.  The project is
called Rendler[1] and has several different language implementations.
Rendler is essentially a distributed web page renderer.
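Rendler is Java, but the skeleton is small in any of the bindings. For a feel
of the moving parts, a bare-bones scheduler in the Python bindings of this era
might look like the following sketch (the framework name, master address, and
offer logic are placeholders, not working load-test code):

    from mesos.interface import Scheduler, mesos_pb2
    from mesos.native import MesosSchedulerDriver

    class GatlingScheduler(Scheduler):
        def registered(self, driver, framework_id, master_info):
            print "Registered with framework id %s" % framework_id.value

        def resourceOffers(self, driver, offers):
            # Real logic: match offers to pending load-client tasks and
            # launch them; this stub just declines everything.
            for offer in offers:
                driver.declineOffer(offer.id)

        def statusUpdate(self, driver, update):
            print "Task %s -> state %s" % (update.task_id.value, update.state)

    if __name__ == "__main__":
        framework = mesos_pb2.FrameworkInfo()
        framework.user = ""  # let Mesos fill in the current user
        framework.name = "gatling-framework"  # placeholder name
        driver = MesosSchedulerDriver(
            GatlingScheduler(), framework, "zk://zk.example.com:2181/mesos")
        driver.run()

The interesting parts of a Gatling framework would live in resourceOffers
(placement and anti-affinity for the load clients) and statusUpdate (noticing
when every client has finished so the aggregation task can be launched).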

Hope this helps,
Ben Whitehead


[1] https://github.com/mesosphere/RENDLER/tree/master/java

On Thu, Jul 2, 2015 at 9:33 AM, Joao Ribeiro jonnyb...@gmail.com wrote:

 This sounds like a really cool project.

 I am still a very green user of mesos and never used gatling at all but a
 quick search took me to
 http://gatling.io/docs/2.1.6/cookbook/scaling_out.html

 With this it shouldn't be too difficult to create a master/slave or
 scheduler/executors approach where you would have the master launch several
 slaves to do the work, wait for it to finish, collect logs and generate the
 report.
 For better synchronisation you could have the slaves register with ZooKeeper
 while the master waits for all slaves to be up, then trigger a “start test”
 command on all slaves simultaneously.
 You could then easily time out if it takes too long to get all slaves up, or
 use other, more fault-tolerant strategies, e.g.: run with the slaves you got;
 bump each slave that is up with more load to try to make up for missing
 slaves.

 It might be a naive approach but would be a starting point in my opinion.

 On 02 Jul 2015, at 18:00, CCAAT cc...@tampabay.rr.com wrote:

 On 07/01/2015 01:17 PM, Carlos Torres wrote:

 Hi all,

 In the past weeks, I've been thinking of leveraging Mesos to schedule
 distributed load tests.


 An excellent idea.


 One problem, at least for me, with this approach is that the load testing
 tool needs to coordinate the distributed scenario and combine the data. If it
 doesn't, the load clients will trigger at different times, and a later
 aggregation step would have to be handled by the user, some external batch
 job, or a script. This is not a problem for load generators like Tsung or
 Locust, since they already provide a distributed model and coordinate the
 distributed tasks, but it could be a little more complicated for Gatling,
 which does not. To me, the approach the Kubernetes team suggests is really a
 hack, using the 'Replication Controller' to spawn multiple replicas, which
 could easily be achieved with the same approach in Marathon (or Kubernetes on
 Mesos).


 I was thinking of building a Mesos framework that would take the input, or
 load simulation file, and would schedule jobs across the cluster (perhaps
 with dedicated resources to minimize variance) using Gatling. A Mesos
 framework would be able to provide a UI/API to take the input jobs and
 report the status of multiple jobs. It could also provide a way to
 sync/orchestrate the simulation, and finally a way to aggregate the
 simulation data in one place and serve the generated HTML report.


 Boiled down to its primitive parts, it would spin up multiple Gatling (Java)
 processes across the cluster, use something like a barrier (not sure what to
 use here) to wait for all processes to be ready to execute, and finally copy
 and rename the generated simulation logs from each Gatling process to one
 node/place, where they are aggregated and compiled into an HTML report by a
 single Gatling process.


 First of all, is there anything in the Mesos community that does this
 already? If not, do you think this is feasible to accomplish with a Mesos
 framework, and would you recommend going with this approach? Does Mesos offer
 a barrier-like feature to coordinate jobs, and can I somehow move files to a
 single node to be processed?


 This all sounds workable, but I do not have all the experience necessary
 to qualify your ideas. What I would suggest is a solution that lends itself
 to testing similarly configured cloud/cluster offerings, so that we, the
 cloud/cluster community, have a way to test and evaluate new releases,
 substitute component code, forks, and even competing offerings. A
 ubiquitous and robust testing semantic based on your ideas seems an
 overwhelmingly positive idea, imho. As such, some organizational
 structure to allow results to be maintained and quickly compared to other
 'test-runs' would greatly encourage usage.
 Hopefully 'Gatling' and such have many, if not most, of the features needed
 to automate the evaluation of results.


 Finally, I've never written a non-trivial Mesos framework; how should I go
 about it, and where can I find more documentation to get started? I'm
 looking for best practices, pitfalls, etc.


 Thank you for your time,
 Carlos


 hth,
 James





Re: Mesos Slave Port Change Fails Recovery

2015-07-02 Thread Vinod Kone
For slave recovery to work, the slave is expected not to change its config.

On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme phili...@hopper.com
wrote:

 Hi,

 I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
 configured with checkpointing and with reconnect recovery.

 I was investigating why the slaves would successfully re-register with the
 master and recover, but would subsequently be asked to shutdown (health
 check timeout).

 It turns out that our slaves had been unintentionally configured to use
 port 5050 in the previous configuration. We decided to fix that during the
 upgrade and have them use the default 5051 port.

 This change seems to make the health checks fail and eventually kill the
 slave due to inactivity.

 I've confirmed that leaving the port to what it was in the previous
 configuration makes the slave successfully re-register and is not asked to
 shutdown later on.

 Is this a known issue? I haven't been able to find a JIRA ticket for this.
 Maybe it's the expected behaviour? Should I create a ticket?

 Thanks,
 Philippe



Mesos Slave Port Change Fails Recovery

2015-07-02 Thread Philippe Laflamme
Hi,

I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
configured with checkpointing and with reconnect recovery.

I was investigating why the slaves would successfully re-register with the
master and recover, but would subsequently be asked to shutdown (health
check timeout).

It turns out that our slaves had been unintentionally configured to use
port 5050 in the previous configuration. We decided to fix that during the
upgrade and have them use the default 5051 port.
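(For concreteness, the slave invocation in question would look roughly like
this; the master address and work directory are placeholders:

    mesos-slave --master=zk://zk.example.com:2181/mesos \
                --checkpoint=true --recover=reconnect \
                --port=5051 --work_dir=/var/lib/mesos

where --port is the setting that changed from 5050 to 5051 across the
upgrade.)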

This change seems to make the health checks fail and eventually kill the
slave due to inactivity.

I've confirmed that leaving the port to what it was in the previous
configuration makes the slave successfully re-register and is not asked to
shutdown later on.

Is this a known issue? I haven't been able to find a JIRA ticket for this.
Maybe it's the expected behaviour? Should I create a ticket?

Thanks,
Philippe


Re: Mesos Slave Port Change Fails Recovery

2015-07-02 Thread Philippe Laflamme
Here you are:

https://gist.github.com/plaflamme/9cd056dc959e0597fb1c

You can see in the mesos-master.INFO log that it re-registers the slave
using port :5050 (line 9) and fails the health checks on port :5051 (line
10). So it might be the slave that re-uses the old configuration?

Thanks,
Philippe

On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone vinodk...@gmail.com wrote:

 Can you paste some logs?

 On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme phili...@hopper.com
 wrote:

 Ok, that's reasonable, but I'm not sure why it would successfully
 re-register with the master if it's not supposed to in the first place. I
 think changing the resources (for example) will dump the old configuration
 in the logs and tell you why recovery is bailing out. It's not doing that
 in this case.

 It looks as though this fails only because the master can't ping
 the slave on the old port; the whole recovery process was otherwise
 successful.

 I'm not sure if the slave could have picked up its configuration change
 and failed the recovery early, but that would definitely be a better
 experience.

 Philippe

 On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone vinodk...@gmail.com wrote:

 For slave recovery to work, the slave is expected not to change its config.

 On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme phili...@hopper.com
 wrote:

 Hi,

 I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
 configured with checkpointing and with reconnect recovery.

 I was investigating why the slaves would successfully re-register with
 the master and recover, but would subsequently be asked to shutdown
 (health check timeout).

 It turns out that our slaves had been unintentionally configured to use
 port 5050 in the previous configuration. We decided to fix that during the
 upgrade and have them use the default 5051 port.

 This change seems to make the health checks fail and eventually kill
 the slave due to inactivity.

 I've confirmed that leaving the port to what it was in the previous
 configuration makes the slave successfully re-register and is not asked to
 shutdown later on.

 Is this a known issue? I haven't been able to find a JIRA ticket for
 this. Maybe it's the expected behaviour? Should I create a ticket?

 Thanks,
 Philippe