Re: [openstack-dev] [zaqar] Juno Performance Testing (Round 2)

2014-09-17 Thread Joe Gordon
On Tue, Sep 16, 2014 at 8:02 AM, Kurt Griffiths 
kurt.griffi...@rackspace.com wrote:

  Right, graphing those sorts of variables has always been part of our
 test plan. What I’ve done so far was just some pilot tests, and I realize
 now that I wasn’t very clear on that point. I wanted to get a rough idea of
 where the Redis driver sat in case there were any obvious bug fixes that
 needed to be taken care of before performing more extensive testing. As it
 turns out, I did find one bug that has since been fixed.

  Regarding latency, saying that it is “not important” is an exaggeration;
 it is definitely important, just not the *only* thing that is important.
 I have spoken with a lot of prospective Zaqar users since the inception of
 the project, and one of the common threads was that latency needed to be
 reasonable. For the use cases where they see Zaqar delivering a lot of
 value, requests don't need to be as fast as, say, ZMQ, but they do need
 something that isn’t horribly *slow*, either. They also want HTTP,
 multi-tenant, auth, durability, etc. The goal is to find a reasonable
 amount of latency given our constraints and also, obviously, be able to
 deliver all that at scale.


Can you further quantify what you would consider too slow? Is 100ms too
slow?



  In any case, I’ve continued working through the test plan and will be
 publishing further test results shortly.

   graph latency versus number of concurrent active tenants

  By tenants do you mean in the sense of OpenStack Tenants/Project-IDs or
 in the sense of “clients/workers”? For the latter case, the pilot tests
 I’ve done so far used multiple clients (though not graphed), but in the
 former case only one “project” was used.


Multiple Tenant/Project-IDs.



   From: Joe Gordon joe.gord...@gmail.com
 Reply-To: OpenStack Dev openstack-dev@lists.openstack.org
 Date: Friday, September 12, 2014 at 1:45 PM
 To: OpenStack Dev openstack-dev@lists.openstack.org
 Subject: Re: [openstack-dev] [zaqar] Juno Performance Testing (Round 2)

  If zaqar is like Amazon SQS, then the latency for a single message and
 the throughput for a single tenant is not important. I wouldn't expect
 anyone who has latency-sensitive workloads or needs massive throughput to
 use zaqar, as these people wouldn't use SQS either. The consistency of the
 latency (shouldn't change under load) and zaqar's ability to scale
 horizontally matter much more. What would be great to see is some other
 things benchmarked instead:

  * graph latency versus number of concurrent active tenants
 * graph latency versus message size
 * How throughput scales as you scale up the number of assorted zaqar
 components. If one of the benefits of zaqar is its horizontal scalability,
 let's see it.
  * How does this change with message batching?
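
To make the first item concrete, here is the rough shape of a driver I have in
mind. The endpoint, headers, queue name, and message format below are
placeholders rather than anything zaqar-bench necessarily does today:

    import time
    import uuid

    import requests

    ZAQAR = "http://localhost:8888/v1"  # hypothetical endpoint; adjust to the deployment

    def post_one(project_id):
        # Assumes the "perf" queue already exists for each project (or that
        # the API version in use creates queues lazily on first post).
        headers = {"Client-ID": str(uuid.uuid4()), "X-Project-Id": project_id}
        body = [{"ttl": 300, "body": {"event": "x" * 1024}}]  # roughly 1 KB payload
        start = time.time()
        requests.post(ZAQAR + "/queues/perf/messages", json=body, headers=headers)
        return (time.time() - start) * 1000.0  # latency in ms

    for tenants in (1, 10, 50, 100):
        projects = ["project-%d" % i for i in range(tenants)]
        # A real run would issue these concurrently (e.g., one gevent worker
        # per tenant); this sketch keeps it serial just to show the sweep.
        latencies = [post_one(p) for p in projects for _ in range(20)]
        print(tenants, "tenants:", sum(latencies) / len(latencies), "ms avg")

Plotting that average (and the spread) against the tenant count would show how
latency holds up as more projects become active.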

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [zaqar] Juno Performance Testing (Round 2)

2014-09-17 Thread Kurt Griffiths
Great question. So, some use cases, like the guest agent, would like to see
something around ~20ms if the agent needs to respond to requests from a
control surface/panel while a user clicks around. I spoke with a social media
company that was also interested in low latency, simply because they have a big
volume of messages they need to slog through in a timely manner or they will
get behind (long-polling or websocket support was something they would like to
see).

Other use cases should be fine with, say, 100ms. I want to say Heat’s needs 
probably fall into that latter category, but I’m only speculating.

Some other feedback we got a while back was that people would like a knob to 
tweak queue attributes. E.g., the tradeoff between durability and performance. 
That led to work on queue “flavors”, which Flavio has been working on this past 
cycle, so I’ll let him chime in on that.
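
Purely as an illustration of the kind of knob I mean (this is not the final
flavors API, and the endpoint and field names below are guesses; Flavio's spec
is authoritative):

    import uuid

    import requests

    ZAQAR = "http://localhost:8888/v1.1"  # hypothetical endpoint
    HEADERS = {"Client-ID": str(uuid.uuid4()), "X-Project-Id": "demo"}

    # Operator registers a flavor backed by a fast, less-durable pool
    # (pool name is made up).
    requests.put(ZAQAR + "/flavors/low-latency",
                 json={"pool": "redis-fast"}, headers=HEADERS)

    # User asks for that flavor when creating a queue (field name is hypothetical).
    requests.put(ZAQAR + "/queues/agent-events",
                 json={"_flavor": "low-latency"}, headers=HEADERS)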

From: Joe Gordon joe.gord...@gmail.com
Reply-To: OpenStack Dev openstack-dev@lists.openstack.org
Date: Wednesday, September 17, 2014 at 2:32 PM
To: OpenStack Dev openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [zaqar] Juno Performance Testing (Round 2)

Can you further quantify what you would consider too slow? Is 100ms too slow?
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [zaqar] Juno Performance Testing (Round 2)

2014-09-16 Thread Kurt Griffiths
Right, graphing those sorts of variables has always been part of our test plan. 
What I’ve done so far was just some pilot tests, and I realize now that I 
wasn’t very clear on that point. I wanted to get a rough idea of where the 
Redis driver sat in case there were any obvious bug fixes that needed to be 
taken care of before performing more extensive testing. As it turns out, I did 
find one bug that has since been fixed.

Regarding latency, saying that it is “not important” is an exaggeration; it is
definitely important, just not the only thing that is important. I have spoken
with a lot of prospective Zaqar users since the inception of the project, and 
one of the common threads was that latency needed to be reasonable. For the use 
cases where they see Zaqar delivering a lot of value, requests don't need to be 
as fast as, say, ZMQ, but they do need something that isn’t horribly slow, 
either. They also want HTTP, multi-tenant, auth, durability, etc. The goal is 
to find a reasonable amount of latency given our constraints and also, 
obviously, be able to deliver all that at scale.

In any case, I’ve continued working through the test plan and will be publishing
further test results shortly.

 graph latency versus number of concurrent active tenants

By tenants do you mean in the sense of OpenStack Tenants/Project-IDs or in
the sense of “clients/workers”? For the latter case, the pilot tests I’ve done 
so far used multiple clients (though not graphed), but in the former case only 
one “project” was used.

From: Joe Gordon joe.gord...@gmail.com
Reply-To: OpenStack Dev openstack-dev@lists.openstack.org
Date: Friday, September 12, 2014 at 1:45 PM
To: OpenStack Dev openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [zaqar] Juno Performance Testing (Round 2)

If zaqar is like Amazon SQS, then the latency for a single message and the
throughput for a single tenant is not important. I wouldn't expect anyone who
has latency-sensitive workloads or needs massive throughput to use zaqar, as
these people wouldn't use SQS either. The consistency of the latency (shouldn't
change under load) and zaqar's ability to scale horizontally matter much more.
What would be great to see is some other things benchmarked instead:

* graph latency versus number of concurrent active tenants
* graph latency versus message size
* How throughput scales as you scale up the number of assorted zaqar 
components. If one of the benefits of zaqar is its horizontal scalability, let's
see it.
* How does this change with message batching?
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [zaqar] Juno Performance Testing (Round 2)

2014-09-12 Thread Flavio Percoco
On 09/12/2014 01:36 AM, Boris Pavlovic wrote:
 Kurt,
 
  Speaking generally, I’d like to see the project bake this in over time as
  part of the CI process. It’s definitely useful information not just for
  the developers but also for operators in terms of capacity planning. We’ve
  talked as a team about doing this with Rally (and in fact, some work has
  been started there), but it may be useful to also run a large-scale test
  on a regular basis (at least per milestone).
 
 I believe we will be able to generate distributed load of at least 20k rps
 in the K cycle. We've done a lot of work during J in this direction, but
 there is still a lot to do.
 
 So you'll be able to use the same tool for gates, local usage and
 large-scale tests.


Let's talk about it :)

Would it be possible to get an update from you at the summit (or mailing
list)? I'm interested to know where you guys are with this, what is
missing and most importantly, how we can help.

Thanks Boris,
Flavio


-- 
@flaper87
Flavio Percoco

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [zaqar] Juno Performance Testing (Round 2)

2014-09-12 Thread Joe Gordon
On Tue, Sep 9, 2014 at 12:19 PM, Kurt Griffiths 
kurt.griffi...@rackspace.com wrote:

 Hi folks,

 In this second round of performance testing, I benchmarked the new Redis
 driver. I used the same setup and tests as in Round 1 to make it easier to
 compare the two drivers. I did not test Redis in master-slave mode, but
 that likely would not make a significant difference in the results since
 Redis replication is asynchronous[1].

 As always, the usual benchmarking disclaimers apply (i.e., take these
 numbers with a grain of salt; they are only intended to provide a ballpark
 reference; you should perform your own tests, simulating your specific
 scenarios and using your own hardware; etc.).

 ## Setup ##

 Rather than VMs, I provisioned some Rackspace OnMetal[3] servers to
 mitigate noisy neighbor when running the performance tests:

 * 1x Load Generator
 * Hardware
 * 1x Intel Xeon E5-2680 v2 2.8Ghz
 * 32 GB RAM
 * 10Gbps NIC
 * 32GB SATADOM
 * Software
 * Debian Wheezy
 * Python 2.7.3
 * zaqar-bench
 * 1x Web Head
 * Hardware
 * 1x Intel Xeon E5-2680 v2 2.8Ghz
 * 32 GB RAM
 * 10Gbps NIC
 * 32GB SATADOM
 * Software
 * Debian Wheezy
 * Python 2.7.3
 * zaqar server
 * storage=mongodb
 * partitions=4
 * MongoDB URI configured with w=majority
 * uWSGI + gevent
 * config: http://paste.openstack.org/show/100592/
 * app.py: http://paste.openstack.org/show/100593/
 * 3x MongoDB Nodes
 * Hardware
 * 2x Intel Xeon E5-2680 v2 2.8Ghz
 * 128 GB RAM
 * 10Gbps NIC
 * 2x LSI Nytro WarpDrive BLP4-1600[2]
 * Software
 * Debian Wheezy
 * mongod 2.6.4
 * Default config, except setting replSet and enabling periodic
   logging of CPU and I/O
 * Journaling enabled
 * Profiling on message DBs enabled for requests over 10ms
 * 1x Redis Node
 * Hardware
 * 2x Intel Xeon E5-2680 v2 2.8Ghz
 * 128 GB RAM
 * 10Gbps NIC
 * 2x LSI Nytro WarpDrive BLP4-1600[2]
 * Software
 * Debian Wheezy
 * Redis 2.4.14
 * Default config (snapshotting and AOF enabled)
 * One process

 As in Round 1, Keystone auth is disabled and requests go over HTTP, not
 HTTPS. The latency introduced by enabling these is outside the control of
 Zaqar, but should be quite minimal (speaking anecdotally, I would expect
 an additional 1-3ms for cached tokens and assuming an optimized TLS
 termination setup).

 For generating the load, I again used the zaqar-bench tool. I would like
 to see the team complete a large-scale Tsung test as well (including a
 full HA deployment with Keystone and HTTPS enabled), but decided not to
 wait for that before publishing the results for the Redis driver using
 zaqar-bench.

 CPU usage on the Redis node peaked at around 75% for the one process. To
 better utilize the hardware, a production deployment would need to run
 multiple Redis processes and use Zaqar's backend pooling feature to
 distribute queues across the various instances.

 Several different messaging patterns were tested, taking inspiration
 from: https://wiki.openstack.org/wiki/Use_Cases_(Zaqar)

 Each test was executed three times and the best time recorded.

 A ~1K sample message (1398 bytes) was used for all tests.

 ## Results ##

 ### Event Broadcasting (Read-Heavy) ###

 OK, so let's say you have a somewhat low-volume source, but tons of event
 observers. In this case, the observers easily outpace the producer, making
 this a read-heavy workload.

 Options
 * 1 producer process with 5 gevent workers
 * 1 message posted per request
 * 2 observer processes with 25 gevent workers each
 * 5 messages listed per request by the observers
 * Load distributed across 4[6] queues
 * 10-second duration


10 seconds is way too short



 Results
 * Redis
 * Producer: 1.7 ms/req,  585 req/sec
 * Observer: 1.5 ms/req, 1254 req/sec
 * Mongo
 * Producer: 2.2 ms/req,  454 req/sec
 * Observer: 1.5 ms/req, 1224 req/sec


If zaqar is like Amazon SQS, then the latency for a single message and the
throughput for a single tenant is not important. I wouldn't expect anyone
who has latency-sensitive workloads or needs massive throughput to use
zaqar, as these people wouldn't use SQS either. The consistency of the
latency (shouldn't change under load) and zaqar's ability to scale
horizontally matter much more. What would be great to see is some other
things benchmarked instead:

* graph latency versus number of concurrent active tenants
* graph latency versus message size
* How throughput scales as you scale up the number of assorted zaqar
components. If one of the benefits of zaqar is its horizontal scalability,
let's see it.
* How does this change with message batching?

Re: [openstack-dev] [zaqar] Juno Performance Testing (Round 2)

2014-09-11 Thread Devananda van der Veen
On Wed, Sep 10, 2014 at 6:09 PM, Kurt Griffiths
kurt.griffi...@rackspace.com wrote:
 On 9/10/14, 3:58 PM, Devananda van der Veen devananda@gmail.com
 wrote:

I'm going to assume that, for these benchmarks, you configured all the
services optimally.

 Sorry for any confusion; I am not trying to hide anything about the setup.
 I thought I was pretty transparent about the way uWSGI, MongoDB, and Redis
 were configured. I tried to stick to mostly default settings to keep
 things simple, making it easier for others to reproduce/verify the results.

 Is there further information about the setup that you were curious about
 that I could provide? Was there a particular optimization that you didn’t
 see that you would recommend?


Nope.

I'm not going to question why you didn't run tests
with tens or hundreds of concurrent clients,

 If you review the different tests, you will note that a couple of them
 used at least 100 workers. That being said, I think we ought to try higher
 loads in future rounds of testing.


Perhaps I misunderstand what 2 processes with 25 gevent workers
means - I think this means you have two _processes_ which are using
greenthreads (gevent), and so each of those two Python processes
is swapping between 25 coroutines. From a load generation standpoint,
this is not the same as having 100 concurrent client _processes_.
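
Concretely, each such process is doing something like the following
(placeholder URL; gevent and requests assumed), which generates useful load
but is not the same as 100 independent client processes:

    # One load-generating process with "25 gevent workers": a single Python
    # process multiplexing 25 greenlets over cooperative I/O.
    from gevent import monkey
    monkey.patch_all()  # make blocking socket I/O yield to other greenlets

    import gevent
    import requests

    def worker(n):
        for _ in range(100):
            requests.get("http://localhost:8888/v1/health")  # placeholder request

    jobs = [gevent.spawn(worker, i) for i in range(25)]
    gevent.joinall(jobs)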

or why you only ran the
tests for 10 seconds.

 In Round 1 I did mention that I wanted to do a follow-up with a longer
 duration. However, as I alluded to in the preamble for Round 2, I kept
 things the same for the redis tests to compare with the mongo ones done
 previously.

 We’ll increase the duration in the next round of testing.


Sure - consistency between tests is good. But I don't believe that a
10-second benchmark is ever enough to suss out service performance.
Lots of things only appear after high load has been applied for a
period of time as, e.g., caches fill up, though this leads to my next
point below...

Instead, I'm actually going to question how it is that, even with
relatively beefy dedicated hardware (128 GB RAM in your storage
nodes), Zaqar peaked at around 1,200 messages per second.

 I went back and ran some of the tests and never saw memory go over ~20M
 (as observed with redis-top) so these same results should be obtainable on
 a box with a lot less RAM.

Whoa. So, that's a *really* important piece of information which was,
afaict, missing from your previous email(s). I hope you can understand
how, with the information you provided (the Redis server has 128GB
RAM) I was shocked at the low performance.

 Furthermore, the tests only used 1 CPU on the
 Redis host, so again, similar results should be achievable on a much more
 modest box.

You described fairly beefy hardware but didn't utilize it fully -- I
was expecting your performance test to attempt to stress the various
components of a Zaqar installation and, at least in some way, attempt
to demonstrate what the capacity of a Zaqar deployment might be on the
hardware you have available. Thus my surprise at the low numbers. If
that wasn't your intent (and given the CPU/RAM usage your tests
achieved, it's not what you achieved) then my disappointment in those
performance numbers is unfounded.

But I hope you can understand, if I'm looking at a service benchmark
to gauge how well that service might perform in production, seeing
expensive hardware perform disappointingly slowly is not a good sign.


 FWIW, I went back and ran a couple scenarios to get some more data points.
 First, I did one with 50 producers and 50 observers. In that case, the
 single CPU on which the OS scheduled the Redis process peaked at 30%. The
 second test I did was with 50 producers + 5 observers + 50 consumers
 (which claim messages and delete them rather than simply page through
 them). This time, Redis used 78% of its CPU. I suppose this should not be
 surprising because the consumers do a lot more work than the observers.
 Meanwhile, load on the web head was fairly high; around 80% for all 20
 CPUs. This tells me that python and/or uWSGI are working pretty hard to
 serve these requests, and there may be some opportunities to optimize that
 layer. I suspect there are also some opportunities to reduce the number of
 Redis operations and roundtrips required to claim a batch of messages.


OK - those resource usages sound better. At least you generated enough
load to saturate the uWSGI process CPU, which is a good point to look
at performance of the system.

At that peak, what was the:
- average msgs/sec
- min/max/avg/stdev time to [post|get|delete] a message

 The other thing to consider is that in these first two rounds I did not
 test increasing amounts of load (number of clients performing concurrent
 requests) and graph that against latency and throughput. Out of curiosity,
 I just now did a quick test to compare the messages enqueued with 50
 producers + 5 observers + 50 consumers vs. adding another 50 producer
 clients and found that the producers were able to post 2,181 messages per
 second while giving up only 0.3 ms.

Re: [openstack-dev] [zaqar] Juno Performance Testing (Round 2)

2014-09-11 Thread Kurt Griffiths
On 9/11/14, 2:11 PM, Devananda van der Veen devananda@gmail.com
wrote:

OK - those resource usages sound better. At least you generated enough
load to saturate the uWSGI process CPU, which is a good point to look
at performance of the system.

At that peak, what was the:
- average msgs/sec
- min/max/avg/stdev time to [post|get|delete] a message

To be honest, it was a quick test and I didn’t note the exact metrics
other than eyeballing them to see that they were similar to the results
that I published for the scenarios that used the same load options (e.g.,
I just re-ran some of the same test scenarios).

Some of the metrics you mention aren’t currently reported by zaqar-bench,
but could be added easily enough. In any case, I think zaqar-bench is
going to end up being mostly useful to track relative performance gains or
losses on a patch-by-patch basis, and also as an easy way to smoke-test
both python-marconiclient and the service. For large-scale testing and
detailed metrics, other tools (e.g., Tsung, JMeter) are better for the
job, so I’ve been considering using them in future rounds.
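
For instance, something along these lines per operation type would cover the
min/max/avg/stdev ask (a sketch, not current zaqar-bench output; plain
arithmetic so it runs on the Python 2.7 used in these tests):

    import math

    def summarize(latencies_ms):
        # Reduce a list of per-request latencies (in ms) to summary stats.
        n = len(latencies_ms)
        avg = sum(latencies_ms) / n
        var = sum((x - avg) ** 2 for x in latencies_ms) / (n - 1) if n > 1 else 0.0
        return {"min": min(latencies_ms), "max": max(latencies_ms),
                "avg": avg, "stdev": math.sqrt(var)}

    print(summarize([1.4, 1.7, 2.1, 1.5, 9.8]))  # made-up sample, in milliseconds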

Is that 2,181 msg/sec total, or per-producer?

That metric was a combined average rate for all producers.


I'd really like to see the total throughput and latency graphed as #
of clients increases. Or if graphing isn't your thing, even just post
a .csv of the raw numbers and I will be happy to graph it.

It would also be great to see how that scales as you add more Redis
instances until all the available CPU cores on your Redis host are in
use.

Yep, I’ve got a long list of things like this that I’d like to see in
future rounds of performance testing (and I welcome anyone in the
community with an interest to join in), but I have to balance that effort
with a lot of other things that are on my plate right now.

Speaking generally, I’d like to see the project bake this in over time as
part of the CI process. It’s definitely useful information not just for
the developers but also for operators in terms of capacity planning. We’ve
talked as a team about doing this with Rally (and in fact, some work has
been started there), but it may be useful to also run a large-scale test
on a regular basis (at least per milestone). Regardless, I think it would
be great for the Zaqar team to connect with other projects (at the
summit?) who are working on perf testing to swap ideas, collaborate on
code/tools, etc.
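
As a strawman for the per-patch piece, even something as simple as the check
below, run against zaqar-bench output, would be useful (the file names, metric
keys, and threshold are all made up):

    import json
    import sys

    THRESHOLD = 0.90  # fail the job if throughput drops below 90% of the baseline

    def check(results_path="bench_results.json", baseline_path="baseline.json"):
        with open(results_path) as f:
            results = json.load(f)
        with open(baseline_path) as f:
            baseline = json.load(f)
        failures = []
        for metric, base_value in baseline.items():
            new_value = results.get(metric, 0.0)
            if new_value < base_value * THRESHOLD:
                failures.append("%s: %.1f -> %.1f req/sec" % (metric, base_value, new_value))
        return failures

    if __name__ == "__main__":
        bad = check()
        if bad:
            print("Possible performance regression:\n" + "\n".join(bad))
            sys.exit(1)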

--KG


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [zaqar] Juno Performance Testing (Round 2)

2014-09-11 Thread Boris Pavlovic
Kurt,

Speaking generally, I’d like to see the project bake this in over time as
 part of the CI process. It’s definitely useful information not just for
 the developers but also for operators in terms of capacity planning. We’ve
 talked as a team about doing this with Rally (and in fact, some work has
 been started there), but it may be useful to also run a large-scale test
 on a regular basis (at least per milestone).


I believe we will be able to generate distributed load of at least 20k rps
in the K cycle. We've done a lot of work during J in this direction,
but there is still a lot to do.

So you'll be able to use the same tool for gates, local usage and
large-scale tests.

Best regards,
Boris Pavlovic



On Fri, Sep 12, 2014 at 3:17 AM, Kurt Griffiths 
kurt.griffi...@rackspace.com wrote:

 On 9/11/14, 2:11 PM, Devananda van der Veen devananda@gmail.com
 wrote:

 OK - those resource usages sound better. At least you generated enough
 load to saturate the uWSGI process CPU, which is a good point to look
 at performance of the system.
 
 At that peak, what was the:
 - average msgs/sec
 - min/max/avg/stdev time to [post|get|delete] a message

 To be honest, it was a quick test and I didn’t note the exact metrics
 other than eyeballing them to see that they were similar to the results
 that I published for the scenarios that used the same load options (e.g.,
 I just re-ran some of the same test scenarios).

 Some of the metrics you mention aren’t currently reported by zaqar-bench,
 but could be added easily enough. In any case, I think zaqar-bench is
 going to end up being mostly useful to track relative performance gains or
 losses on a patch-by-patch basis, and also as an easy way to smoke-test
 both python-marconiclient and the service. For large-scale testing and
 detailed metrics, other tools (e.g., Tsung, JMeter) are better for the
 job, so I’ve been considering using them in future rounds.

 Is that 2,181 msg/sec total, or per-producer?

 That metric was a combined average rate for all producers.

 
 I'd really like to see the total throughput and latency graphed as #
 of clients increases. Or if graphing isn't your thing, even just post
 a .csv of the raw numbers and I will be happy to graph it.
 
 It would also be great to see how that scales as you add more Redis
 instances until all the available CPU cores on your Redis host are in
 use.

 Yep, I’ve got a long list of things like this that I’d like to see in
 future rounds of performance testing (and I welcome anyone in the
 community with an interest to join in), but I have to balance that effort
 with a lot of other things that are on my plate right now.

 Speaking generally, I’d like to see the project bake this in over time as
 part of the CI process. It’s definitely useful information not just for
 the developers but also for operators in terms of capacity planning. We’ve
 talked as a team about doing this with Rally (and in fact, some work has
 been started there), but it may be useful to also run a large-scale test
 on a regular basis (at least per milestone). Regardless, I think it would
 be great for the Zaqar team to connect with other projects (at the
 summit?) who are working on perf testing to swap ideas, collaborate on
 code/tools, etc.

 --KG


 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [zaqar] Juno Performance Testing (Round 2)

2014-09-10 Thread Flavio Percoco
On 09/09/2014 09:19 PM, Kurt Griffiths wrote:
 Hi folks,
 
 In this second round of performance testing, I benchmarked the new Redis
 driver. I used the same setup and tests as in Round 1 to make it easier to
 compare the two drivers. I did not test Redis in master-slave mode, but
 that likely would not make a significant difference in the results since
 Redis replication is asynchronous[1].
 
 As always, the usual benchmarking disclaimers apply (i.e., take these
 numbers with a grain of salt; they are only intended to provide a ballpark
 reference; you should perform your own tests, simulating your specific
 scenarios and using your own hardware; etc.).
 
 ## Setup ##
 
 Rather than VMs, I provisioned some Rackspace OnMetal[3] servers to
 mitigate noisy neighbor when running the performance tests:
 
 * 1x Load Generator
 * Hardware
 * 1x Intel Xeon E5-2680 v2 2.8Ghz
 * 32 GB RAM
 * 10Gbps NIC
 * 32GB SATADOM
 * Software
 * Debian Wheezy
 * Python 2.7.3
 * zaqar-bench
 * 1x Web Head
 * Hardware
 * 1x Intel Xeon E5-2680 v2 2.8Ghz
 * 32 GB RAM
 * 10Gbps NIC
 * 32GB SATADOM
 * Software
 * Debian Wheezy
 * Python 2.7.3
 * zaqar server
 * storage=mongodb
 * partitions=4
 * MongoDB URI configured with w=majority
 * uWSGI + gevent
 * config: http://paste.openstack.org/show/100592/
 * app.py: http://paste.openstack.org/show/100593/
 * 3x MongoDB Nodes
 * Hardware
 * 2x Intel Xeon E5-2680 v2 2.8Ghz
 * 128 GB RAM
 * 10Gbps NIC
 * 2x LSI Nytro WarpDrive BLP4-1600[2]
 * Software
 * Debian Wheezy
 * mongod 2.6.4
 * Default config, except setting replSet and enabling periodic
   logging of CPU and I/O
 * Journaling enabled
 * Profiling on message DBs enabled for requests over 10ms
 * 1x Redis Node
 * Hardware
 * 2x Intel Xeon E5-2680 v2 2.8Ghz
 * 128 GB RAM
 * 10Gbps NIC
 * 2x LSI Nytro WarpDrive BLP4-1600[2]
 * Software
 * Debian Wheezy
 * Redis 2.4.14
 * Default config (snapshotting and AOF enabled)
 * One process
 
 As in Round 1, Keystone auth is disabled and requests go over HTTP, not
 HTTPS. The latency introduced by enabling these is outside the control of
 Zaqar, but should be quite minimal (speaking anecdotally, I would expect
 an additional 1-3ms for cached tokens and assuming an optimized TLS
 termination setup).
 
 For generating the load, I again used the zaqar-bench tool. I would like
 to see the team complete a large-scale Tsung test as well (including a
 full HA deployment with Keystone and HTTPS enabled), but decided not to
 wait for that before publishing the results for the Redis driver using
 zaqar-bench.
 
 CPU usage on the Redis node peaked at around 75% for the one process. To
 better utilize the hardware, a production deployment would need to run
 multiple Redis processes and use Zaqar's backend pooling feature to
 distribute queues across the various instances.
 
 Several different messaging patterns were tested, taking inspiration
 from: https://wiki.openstack.org/wiki/Use_Cases_(Zaqar)
 
 Each test was executed three times and the best time recorded.
 
 A ~1K sample message (1398 bytes) was used for all tests.
 
 ## Results ##
 
 ### Event Broadcasting (Read-Heavy) ###
 
 OK, so let's say you have a somewhat low-volume source, but tons of event
 observers. In this case, the observers easily outpace the producer, making
 this a read-heavy workload.
 
 Options
 * 1 producer process with 5 gevent workers
 * 1 message posted per request
 * 2 observer processes with 25 gevent workers each
 * 5 messages listed per request by the observers
 * Load distributed across 4[6] queues
 * 10-second duration
 
 Results
 * Redis
 * Producer: 1.7 ms/req,  585 req/sec
 * Observer: 1.5 ms/req, 1254 req/sec
 * Mongo
 * Producer: 2.2 ms/req,  454 req/sec
 * Observer: 1.5 ms/req, 1224 req/sec
 
 ### Event Broadcasting (Balanced) ###
 
 This test uses the same number of producers and consumers, but note that
 the observers are still listing (up to) 5 messages at a time[4], so they
 still outpace the producers, but not as quickly as before.
 
 Options
 * 2 producer processes with 25 gevent workers each
 * 1 message posted per request
 * 2 observer processes with 25 gevent workers each
 * 5 messages listed per request by the observers
 * Load distributed across 4 queues
 * 10-second duration
 
 Results
 * Redis
 * Producer: 1.4 ms/req, 1374 req/sec
 * Observer: 1.6 ms/req, 1178 req/sec
 * Mongo
 * Producer: 2.2 ms/req, 883 req/sec
 * Observer: 2.8 ms/req, 348 req/sec
 
 ### Point-to-Point Messaging ###
 

Re: [openstack-dev] [zaqar] Juno Performance Testing (Round 2)

2014-09-10 Thread Kurt Griffiths
Thanks! Looks good. Only thing I noticed was that footnotes were still
referenced, but did not appear at the bottom of the page.

On 9/10/14, 6:16 AM, Flavio Percoco fla...@redhat.com wrote:

I've collected the information from both performance tests and put it in
the project's wiki[0]. Please double-check :D

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [zaqar] Juno Performance Testing (Round 2)

2014-09-10 Thread Devananda van der Veen
On Tue, Sep 9, 2014 at 12:19 PM, Kurt Griffiths
kurt.griffi...@rackspace.com wrote:
 Hi folks,

 In this second round of performance testing, I benchmarked the new Redis
 driver. I used the same setup and tests as in Round 1 to make it easier to
 compare the two drivers. I did not test Redis in master-slave mode, but
 that likely would not make a significant difference in the results since
 Redis replication is asynchronous[1].

 As always, the usual benchmarking disclaimers apply (i.e., take these
 numbers with a grain of salt; they are only intended to provide a ballpark
 reference; you should perform your own tests, simulating your specific
 scenarios and using your own hardware; etc.).

 ## Setup ##

 Rather than VMs, I provisioned some Rackspace OnMetal[3] servers to
 mitigate noisy neighbor when running the performance tests:

 * 1x Load Generator
 * Hardware
 * 1x Intel Xeon E5-2680 v2 2.8Ghz
 * 32 GB RAM
 * 10Gbps NIC
 * 32GB SATADOM
 * Software
 * Debian Wheezy
 * Python 2.7.3
 * zaqar-bench
 * 1x Web Head
 * Hardware
 * 1x Intel Xeon E5-2680 v2 2.8Ghz
 * 32 GB RAM
 * 10Gbps NIC
 * 32GB SATADOM
 * Software
 * Debian Wheezy
 * Python 2.7.3
 * zaqar server
 * storage=mongodb
 * partitions=4
 * MongoDB URI configured with w=majority
 * uWSGI + gevent
 * config: http://paste.openstack.org/show/100592/
 * app.py: http://paste.openstack.org/show/100593/
 * 3x MongoDB Nodes
 * Hardware
 * 2x Intel Xeon E5-2680 v2 2.8Ghz
 * 128 GB RAM
 * 10Gbps NIC
 * 2x LSI Nytro WarpDrive BLP4-1600[2]
 * Software
 * Debian Wheezy
 * mongod 2.6.4
 * Default config, except setting replSet and enabling periodic
   logging of CPU and I/O
 * Journaling enabled
 * Profiling on message DBs enabled for requests over 10ms
 * 1x Redis Node
 * Hardware
 * 2x Intel Xeon E5-2680 v2 2.8Ghz
 * 128 GB RAM
 * 10Gbps NIC
 * 2x LSI Nytro WarpDrive BLP4-1600[2]
 * Software
 * Debian Wheezy
 * Redis 2.4.14
 * Default config (snapshotting and AOF enabled)
 * One process

 As in Round 1, Keystone auth is disabled and requests go over HTTP, not
 HTTPS. The latency introduced by enabling these is outside the control of
 Zaqar, but should be quite minimal (speaking anecdotally, I would expect
 an additional 1-3ms for cached tokens and assuming an optimized TLS
 termination setup).

 For generating the load, I again used the zaqar-bench tool. I would like
 to see the team complete a large-scale Tsung test as well (including a
 full HA deployment with Keystone and HTTPS enabled), but decided not to
 wait for that before publishing the results for the Redis driver using
 zaqar-bench.

 CPU usage on the Redis node peaked at around 75% for the one process. To
 better utilize the hardware, a production deployment would need to run
 multiple Redis processes and use Zaqar's backend pooling feature to
 distribute queues across the various instances.

 Several different messaging patterns were tested, taking inspiration
 from: https://wiki.openstack.org/wiki/Use_Cases_(Zaqar)

 Each test was executed three times and the best time recorded.

 A ~1K sample message (1398 bytes) was used for all tests.

 ## Results ##

 ### Event Broadcasting (Read-Heavy) ###

 OK, so let's say you have a somewhat low-volume source, but tons of event
 observers. In this case, the observers easily outpace the producer, making
 this a read-heavy workload.

 Options
 * 1 producer process with 5 gevent workers
 * 1 message posted per request
 * 2 observer processes with 25 gevent workers each
 * 5 messages listed per request by the observers
 * Load distributed across 4[6] queues
 * 10-second duration

 Results
 * Redis
 * Producer: 1.7 ms/req,  585 req/sec
 * Observer: 1.5 ms/req, 1254 req/sec
 * Mongo
 * Producer: 2.2 ms/req,  454 req/sec
 * Observer: 1.5 ms/req, 1224 req/sec

 ### Event Broadcasting (Balanced) ###

 This test uses the same number of producers and consumers, but note that
 the observers are still listing (up to) 5 messages at a time[4], so they
 still outpace the producers, but not as quickly as before.

 Options
 * 2 producer processes with 25 gevent workers each
 * 1 message posted per request
 * 2 observer processes with 25 gevent workers each
 * 5 messages listed per request by the observers
 * Load distributed across 4 queues
 * 10-second duration

 Results
 * Redis
 * Producer: 1.4 ms/req, 1374 req/sec
 * Observer: 1.6 ms/req, 1178 req/sec
 * Mongo
 * Producer: 2.2 ms/req, 883 req/sec
 * Observer: 2.8 ms/req, 348 req/sec

 ### 

Re: [openstack-dev] [zaqar] Juno Performance Testing (Round 2)

2014-09-10 Thread Kurt Griffiths
On 9/10/14, 3:58 PM, Devananda van der Veen devananda@gmail.com
wrote:

I'm going to assume that, for these benchmarks, you configured all the
services optimally.

Sorry for any confusion; I am not trying to hide anything about the setup.
I thought I was pretty transparent about the way uWSGI, MongoDB, and Redis
were configured. I tried to stick to mostly default settings to keep
things simple, making it easier for others to reproduce/verify the results.

Is there further information about the setup that you were curious about
that I could provide? Was there a particular optimization that you didn’t
see that you would recommend?

I'm not going to question why you didn't run tests
with tens or hundreds of concurrent clients,

If you review the different tests, you will note that a couple of them
used at least 100 workers. That being said, I think we ought to try higher
loads in future rounds of testing.

or why you only ran the
tests for 10 seconds.

In Round 1 I did mention that I wanted to do a follow-up with a longer
duration. However, as I alluded to in the preamble for Round 2, I kept
things the same for the redis tests to compare with the mongo ones done
previously.

We’ll increase the duration in the next round of testing.

Instead, I'm actually going to question how it is that, even with
relatively beefy dedicated hardware (128 GB RAM in your storage
nodes), Zaqar peaked at around 1,200 messages per second.

I went back and ran some of the tests and never saw memory go over ~20M
(as observed with redis-top) so these same results should be obtainable on
a box with a lot less RAM. Furthermore, the tests only used 1 CPU on the
Redis host, so again, similar results should be achievable on a much more
modest box.

FWIW, I went back and ran a couple scenarios to get some more data points.
First, I did one with 50 producers and 50 observers. In that case, the
single CPU on which the OS scheduled the Redis process peaked at 30%. The
second test I did was with 50 producers + 5 observers + 50 consumers
(which claim messages and delete them rather than simply page through
them). This time, Redis used 78% of its CPU. I suppose this should not be
surprising because the consumers do a lot more work than the observers.
Meanwhile, load on the web head was fairly high; around 80% for all 20
CPUs. This tells me that python and/or uWSGI are working pretty hard to
serve these requests, and there may be some opportunities to optimize that
layer. I suspect there are also some opportunities to reduce the number of
Redis operations and roundtrips required to claim a batch of messages.
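
For example, several of those round trips could likely be collapsed with a
Redis pipeline; a rough sketch with the redis-py client (the key names here
are made up, not the driver's actual schema):

    import redis

    r = redis.StrictRedis(host="localhost", port=6379)

    pipe = r.pipeline()
    pipe.lrange("q:perf:msgs", 0, 9)        # e.g., read up to 10 message ids
    pipe.sadd("q:perf:claimed", "claim-1")  # e.g., mark them as claimed
    pipe.expire("q:perf:claimed", 300)      # claim TTL, in seconds
    msg_ids, _, _ = pipe.execute()          # all three commands, one round trip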

The other thing to consider is that in these first two rounds I did not
test increasing amounts of load (number of clients performing concurrent
requests) and graph that against latency and throughput. Out of curiosity,
I just now did a quick test to compare the messages enqueued with 50
producers + 5 observers + 50 consumers vs. adding another 50 producer
clients and found that the producers were able to post 2,181 messages per
second while giving up only 0.3 ms.

--KG

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [zaqar] Juno Performance Testing (Round 2)

2014-09-09 Thread Kurt Griffiths
Hi folks,

In this second round of performance testing, I benchmarked the new Redis
driver. I used the same setup and tests as in Round 1 to make it easier to
compare the two drivers. I did not test Redis in master-slave mode, but
that likely would not make a significant difference in the results since
Redis replication is asynchronous[1].

As always, the usual benchmarking disclaimers apply (i.e., take these
numbers with a grain of salt; they are only intended to provide a ballpark
reference; you should perform your own tests, simulating your specific
scenarios and using your own hardware; etc.).

## Setup ##

Rather than VMs, I provisioned some Rackspace OnMetal[3] servers to
mitigate noisy neighbor when running the performance tests:

* 1x Load Generator
* Hardware
* 1x Intel Xeon E5-2680 v2 2.8Ghz
* 32 GB RAM
* 10Gbps NIC
* 32GB SATADOM
* Software
* Debian Wheezy
* Python 2.7.3
* zaqar-bench
* 1x Web Head
* Hardware
* 1x Intel Xeon E5-2680 v2 2.8Ghz
* 32 GB RAM
* 10Gbps NIC
* 32GB SATADOM
* Software
* Debian Wheezy
* Python 2.7.3
* zaqar server
* storage=mongodb
* partitions=4
* MongoDB URI configured with w=majority
* uWSGI + gevent
* config: http://paste.openstack.org/show/100592/
* app.py: http://paste.openstack.org/show/100593/
* 3x MongoDB Nodes
* Hardware
* 2x Intel Xeon E5-2680 v2 2.8Ghz
* 128 GB RAM
* 10Gbps NIC
* 2x LSI Nytro WarpDrive BLP4-1600[2]
* Software
* Debian Wheezy
* mongod 2.6.4
* Default config, except setting replSet and enabling periodic
  logging of CPU and I/O
* Journaling enabled
* Profiling on message DBs enabled for requests over 10ms
* 1x Redis Node
* Hardware
* 2x Intel Xeon E5-2680 v2 2.8Ghz
* 128 GB RAM
* 10Gbps NIC
* 2x LSI Nytro WarpDrive BLP4-1600[2]
* Software
* Debian Wheezy
* Redis 2.4.14
* Default config (snapshotting and AOF enabled)
* One process

As in Round 1, Keystone auth is disabled and requests go over HTTP, not
HTTPS. The latency introduced by enabling these is outside the control of
Zaqar, but should be quite minimal (speaking anecdotally, I would expect
an additional 1-3ms for cached tokens and assuming an optimized TLS
termination setup).

For generating the load, I again used the zaqar-bench tool. I would like
to see the team complete a large-scale Tsung test as well (including a
full HA deployment with Keystone and HTTPS enabled), but decided not to
wait for that before publishing the results for the Redis driver using
zaqar-bench.

CPU usage on the Redis node peaked at around 75% for the one process. To
better utilize the hardware, a production deployment would need to run
multiple Redis processes and use Zaqar's backend pooling feature to
distribute queues across the various instances.
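
To illustrate what pooling buys conceptually (this is not the driver's actual
code or configuration), the idea is simply a stable queue-to-instance mapping
so load spreads across processes and cores:

    import hashlib

    REDIS_POOLS = [
        "redis://127.0.0.1:6379",  # hypothetical URIs, one per Redis process
        "redis://127.0.0.1:6380",
        "redis://127.0.0.1:6381",
        "redis://127.0.0.1:6382",
    ]

    def pool_for(queue_name):
        # Hash the queue name so the mapping is stable across requests.
        digest = hashlib.md5(queue_name.encode("utf-8")).hexdigest()
        return REDIS_POOLS[int(digest, 16) % len(REDIS_POOLS)]

    print(pool_for("guest-agent-events"))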

Several different messaging patterns were tested, taking inspiration
from: https://wiki.openstack.org/wiki/Use_Cases_(Zaqar)

Each test was executed three times and the best time recorded.

A ~1K sample message (1398 bytes) was used for all tests.

## Results ##

### Event Broadcasting (Read-Heavy) ###

OK, so let's say you have a somewhat low-volume source, but tons of event
observers. In this case, the observers easily outpace the producer, making
this a read-heavy workload.

Options
* 1 producer process with 5 gevent workers
* 1 message posted per request
* 2 observer processes with 25 gevent workers each
* 5 messages listed per request by the observers
* Load distributed across 4[6] queues
* 10-second duration

Results
* Redis
* Producer: 1.7 ms/req,  585 req/sec
* Observer: 1.5 ms/req, 1254 req/sec
* Mongo
* Producer: 2.2 ms/req,  454 req/sec
* Observer: 1.5 ms/req, 1224 req/sec

### Event Broadcasting (Balanced) ###

This test uses the same number of producers and consumers, but note that
the observers are still listing (up to) 5 messages at a time[4], so they
still outpace the producers, but not as quickly as before.

Options
* 2 producer processes with 25 gevent workers each
* 1 message posted per request
* 2 observer processes with 25 gevent workers each
* 5 messages listed per request by the observers
* Load distributed across 4 queues
* 10-second duration

Results
* Redis
* Producer: 1.4 ms/req, 1374 req/sec
* Observer: 1.6 ms/req, 1178 req/sec
* Mongo
* Producer: 2.2 ms/req, 883 req/sec
* Observer: 2.8 ms/req, 348 req/sec

### Point-to-Point Messaging ###

In this scenario I simulated one client sending messages directly to a
different client. Only one queue is required in this case[5].

Options
* 1 producer process with 1 gevent