Re: [Openstack] Caching strategies in Nova ...
Thanks ... that's good feedback, and we were discussing cache invalidation issues today. Any tips or suggestions?

-S

On 03/22/2012 09:28 PM, Joshua Harlow wrote:
Just from experience. They do a great job. But the killer thing about caching is how you do the cache invalidation. Just caching stuff is easy-peasy; making sure it is invalidated on all servers in all conditions, not so easy...

On 3/22/12 4:26 PM, Sandy Walsh sandy.wa...@rackspace.com wrote:
We're doing tests to find out where the bottlenecks are; caching is the most obvious solution, but there may be others. Tools like memcache do a really good job of sharing memory across servers, so we don't have to reinvent the wheel or hit the db at all. In addition to looking into caching technologies/approaches, we're gluing together some tools for finding those bottlenecks. Our first step will be finding them, then squashing them ... however.

-S

On 03/22/2012 06:25 PM, Mark Washenberger wrote:
What problems are caching strategies supposed to solve? On the nova compute side, it seems like streamlining db access and api-view tables would solve any performance problems caching would address, while keeping the stale data management problem small.

___ Mailing list: https://launchpad.net/~openstack Post to : openstack@lists.launchpad.net Unsubscribe : https://launchpad.net/~openstack More help : https://help.launchpad.net/ListHelp
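The memcache approach Sandy mentions is essentially a read-through cache in front of the db. A minimal sketch of that pattern — `MemcacheStub` here stands in for a real memcached client (real clients expose the same get/set shape, with set taking an expiry time); the key scheme and the 60-second TTL are illustrative, not nova's actual API:

```python
import time

class MemcacheStub:
    """In-process stand-in for a memcached client."""
    def __init__(self):
        self._store = {}

    def set(self, key, value, ttl=0):
        expires = time.time() + ttl if ttl else None
        self._store[key] = (value, expires)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, expires = item
        if expires is not None and time.time() > expires:
            del self._store[key]  # lazily expire, as memcached does
            return None
        return value


def get_instance(cache, db_lookup, instance_id):
    """Read-through cache: try the shared cache first, fall back to the db."""
    key = "instance/%s" % instance_id
    value = cache.get(key)
    if value is None:
        value = db_lookup(instance_id)  # the expensive db hit we want to avoid
        cache.set(key, value, ttl=60)   # a short TTL bounds staleness
    return value
```

With several API servers pointed at the same memcached pool, the first server to miss pays the db cost and every other server gets the cached copy — which is exactly where the invalidation problem Joshua raises comes in.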
Re: [Openstack] Caching strategies in Nova ...
Yup, makes sense. Thanks for the feedback. I agree that the external caches are troublesome, and we'll likely be focusing on the internal ones. Whether that manifests itself as a memcache-like implementation or another db view is unknown. The other thing I like about in-process caching is the ability to have it in a common (nova-common?) library where we can easily compute hit/miss ratios and adjust accordingly.

-S

On 03/23/2012 12:02 AM, Mark Washenberger wrote:
This is precisely my concern. It must be brought up that with Rackspace Cloud Servers, nearly all client code routinely submits requests with a query parameter cache-busting=<some random string> just to get around problems with cache invalidation. And woe to the client that does not. I get the feeling that once trust like this is lost, a project has a hard time regaining it.

I'm not saying that we can avoid inconsistency entirely. Rather, I believe we will have to embrace some eventual-consistency models to enable the performance and scale we will ultimately attain. But I just get the feeling that generic caches are really only appropriate for write-once, or at least write-rarely, data. So personally I would rule out external caches entirely and try to be very judicious in selecting internal caches as well.
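The hit/miss bookkeeping Sandy wants from a common library could be as simple as a wrapper around whatever backend is in use. A sketch under that assumption — `CountingCache` is hypothetical, not an existing nova-common class, and a plain dict stands in for the backend:

```python
class CountingCache:
    """Hypothetical wrapper that tracks hit/miss ratios for any
    dict-like cache backend."""
    def __init__(self, backend=None):
        self._backend = backend if backend is not None else {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        value = self._backend.get(key)
        if value is None:
            self.misses += 1
        else:
            self.hits += 1
        return value

    def set(self, key, value):
        self._backend[key] = value

    @property
    def hit_ratio(self):
        total = self.hits + self.misses
        return float(self.hits) / total if total else 0.0
```

A low hit ratio on a given cache is a cheap signal that its TTL or key scheme needs adjusting, or that the cache isn't earning its staleness cost.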
Re: [Openstack] Caching strategies in Nova ...
(resent to list, as I realized I just did a Reply)

Cool! This is great stuff. Looking forward to seeing the branch. I started working on a similar tool that takes the data collected from Tach and fetches the data from Graphite to look at the performance issues (no changes to nova trunk required, since Tach is awesome). It's just a shell of an idea yet, but the basics work: https://github.com/ohthree/novaprof

But if there is something already existing, I'm happy to kill it off. I don't doubt for a second the db is the culprit for many of our woes. The thing I like about internal caching using established tools is that it works for db issues too, without having to resort to custom tables. SQL query optimization, I'm sure, will go equally far.

Thanks again for the great feedback ... keep it comin'!

-S

On 03/22/2012 11:53 PM, Mark Washenberger wrote:
Working on this independently, I created a branch with some simple performance logging around the nova-api, and individually around glance, nova.db, and nova.rpc calls. (Sorry, I only have a local copy and it's on a different computer right now, and probably needs a rebase. I will rebase and publish it on GitHub tomorrow.) With this logging, I could get some simple profiling that I found very useful. Here is a GH project with the analysis code as well as some nova-api logs I was using as input: https://github.com/markwash/nova-perflog

With these tools, you can get a wall-time profile for individual requests.
For example, looking at one server create request (and you can run this directly from the checkout, as the logs are saved there):

markw@poledra:perflogs$ cat nova-api.vanilla.1.5.10.log | python profile-request.py req-3cc0fe84-e736-4441-a8d6-ef605558f37f

key                                        count  avg
nova.api.openstack.wsgi.POST               1      0.657
nova.db.api.instance_update                1      0.191
nova.image.show                            1      0.179
nova.db.api.instance_add_security_group    1      0.082
nova.rpc.cast                              1      0.059
nova.db.api.instance_get_all_by_filters    1      0.034
nova.db.api.security_group_get_by_name     2      0.029
nova.db.api.instance_create                1      0.011
nova.db.api.quota_get_all_by_project       3      0.003
nova.db.api.instance_data_get_for_project  1      0.003

key                      count  total
nova.api.openstack.wsgi  1      0.657
nova.db.api              10     0.388
nova.image               1      0.179
nova.rpc                 1      0.059

All times are in seconds. The nova.rpc time is probably high since this was the first call after a server restart, so the connection handshake is probably included. This is also probably 1.5 months stale.

The conclusion I reached from this profiling is that we just plain overuse the db (and we might do the same in glance). For example, whenever we do updates, we actually re-retrieve the item from the database, update its dictionary, and save it. This is double the cost it needs to be. We also handle updates for data across tables inefficiently, where they could be handled in a single database round trip.

In particular, in the case of server listings, extensions are just rough on performance. Most extensions hit the database again at least once. This isn't really so bad, but it clearly is an area where we should improve, since these are the most frequent api queries.

I just see a ton of specific performance problems that are easier to address one by one, rather than diving into a general (albeit obvious) solution such as caching.
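Mark's point about updates costing double can be illustrated with a toy table: the fetch-modify-save pattern costs two round trips where a single UPDATE would do. A sketch using sqlite3 with a hypothetical `instances` table (nova's real schema and db.api layer are far larger; this only shows the shape of the problem):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (id TEXT PRIMARY KEY, vm_state TEXT)")
conn.execute("INSERT INTO instances VALUES ('i-1', 'building')")

# Fetch-modify-save, the pattern Mark describes: two round trips.
row = conn.execute("SELECT vm_state FROM instances WHERE id = 'i-1'").fetchone()
state = {"vm_state": row[0]}
state["vm_state"] = "active"  # mutate the retrieved dict
conn.execute("UPDATE instances SET vm_state = ? WHERE id = 'i-1'",
             (state["vm_state"],))

# Pushing the change down directly: one round trip, same effect.
conn.execute("UPDATE instances SET vm_state = 'error' WHERE id = 'i-1'")
```

Over a network to a real MySQL server, each eliminated round trip saves a full request/response latency, which is where the per-call numbers in the profile above come from.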
Re: [Openstack] Caching strategies in Nova ...
Was reading up some more on cache invalidation schemes last night. The best-practice approach seems to be using a sequence ID in the key: when you want to invalidate a large set of keys, just bump the sequence ID. This could easily be handled with a notifier that listens to instance state changes.

Thoughts?
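The sequence-ID scheme Sandy describes works by folding a per-namespace generation number into every cache key; bumping the number orphans all the old entries at once (a store like memcached then evicts them via LRU or TTL, so nothing needs to be deleted explicitly). A minimal in-process sketch — the dict stands in for a shared store, and all names are illustrative:

```python
class GenerationCache:
    """Sequence-ID ('generation') invalidation: bump one counter to
    orphan every cached key in a namespace at once."""
    def __init__(self):
        self._store = {}  # stand-in for a shared store like memcached
        self._gens = {}

    def _key(self, namespace, key):
        gen = self._gens.get(namespace, 0)
        return "%s/%d/%s" % (namespace, gen, key)

    def get(self, namespace, key):
        return self._store.get(self._key(namespace, key))

    def set(self, namespace, key, value):
        self._store[self._key(namespace, key)] = value

    def invalidate(self, namespace):
        """E.g. called from a notifier on instance state changes."""
        self._gens[namespace] = self._gens.get(namespace, 0) + 1
```

The appeal is that invalidation becomes a single atomic increment rather than a fan-out delete to every server, which sidesteps the "invalidated on all servers in all conditions" problem Joshua raised — at the cost of the generation counter itself needing to live somewhere shared.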
Re: [Openstack] Caching strategies in Nova ...
I'd prefer to just set a different expectation for the user. Rather than worrying about state change and invalidation, let's just set the expectation that the system as a whole is eventually consistent. I would love to prevent any cache-busting strategies or expectations, as well as anything that requires something other than time-based data refreshing.

We can all agree, I hope, that there is some level of eventual consistency even without caching in our current system. The fact is that db updates are not instantaneous with other changes in the system; see snapshotting, instance creation, etc.

What I'd like to see is additional fields included in the API response that say how old this particular piece of data is. This way the consumer can decide if they need to be concerned about the fact that this state hasn't changed, and it allows operators to tune their system to whatever their deployments can handle. If we are exploring caching, I think that gives us the advantage of not a lot of extra code that worries about invalidation, allows deployers to not use caching at all if it's unneeded, and paves the way for view tables in large deployments, which I think is important when we are thinking about this on a large scale.

Gabe
Re: [Openstack] Caching strategies in Nova ...
On 03/23/2012 09:44 AM, Gabe Westmaas wrote:
> I'd prefer to just set a different expectation for the user. Rather than worrying about state change and invalidation, let's just set the expectation that the system as a whole is eventually consistent. I would love to prevent any cache-busting strategies or expectations, as well as anything that requires something other than time-based data refreshing. We can all agree, I hope, that there is some level of eventual consistency even without caching in our current system. The fact is that db updates are not instantaneous with other changes in the system; see snapshotting, instance creation, etc.

I think that's completely valid. The in-process caching schemes are really just implementation techniques. The end result (of view tables vs. key/value in-memory dicts vs. whatever) is the same.

> What I'd like to see is additional fields included in the API response that say how old this particular piece of data is. This way the consumer can decide if they need to be concerned about the fact that this state hasn't changed, and it allows operators to tune their system to whatever their deployments can handle. If we are exploring caching, I think that gives us the advantage of not a lot of extra code that worries about invalidation, allows deployers to not use caching at all if it's unneeded, and paves the way for view tables in large deployments, which I think is important when we are thinking about this on a large scale.

My fear is clients will simply start to poll the system until new data magically appears. An alternative might be: rather than say how old the data is, say how long until the cache expires?
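Both proposals (Gabe's data-age field and Sandy's time-until-expiry alternative) amount to attaching a timestamp delta to the API response. A sketch with hypothetical field names and a hypothetical 30-second TTL — neither field exists in nova's actual API:

```python
import time

CACHE_TTL = 30.0  # hypothetical cache lifetime, in seconds

def annotate_staleness(response, fetched_at, now=None):
    """Attach both proposed fields to an API response dict:
    how old the data is, and how long until the cache entry expires."""
    now = time.time() if now is None else now
    age = now - fetched_at
    response["data_age_seconds"] = round(age, 3)
    response["expires_in_seconds"] = max(0.0, round(CACHE_TTL - age, 3))
    return response
```

The difference between the two fields is what they teach the client: age says "this is how stale you might be", while expires-in invites the polling-until-refresh behavior Sandy worries about, since it tells the client exactly when fresh data arrives.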
Re: [Openstack] Caching strategies in Nova ...
On 3/23/12 8:56 AM, Sandy Walsh sandy.wa...@rackspace.com wrote:
> I think that's completely valid. The in-process caching schemes are really just implementation techniques. The end result (of view tables vs. key/value in-memory dicts vs. whatever) is the same.

Agreed! As long as the interface doesn't imply one implementation over another (see below).

> My fear is clients will simply start to poll the system until new data magically appears. An alternative might be: rather than say how old the data is, say how long until the cache expires?

Definitely a valid concern. However, I kind of expect that many users will still poll even if they know they won't get new data until X time.
In addition, I think if we say how old the data is, it still implies too much knowledge unless we go with a strict caching system. I'd love for us to leave the ability to update that data asynchronously, and hopefully really quickly, except in the cases where the system is under unexpected load. Basically, if we give them that information and we miss it, that's a call in to support; not to say they won't call in if it takes too long to update, of course. Also, if it's hitting a cache or something optimized for GETs, hopefully we can handle lots of polling by adding more API nodes.

Gabe
Re: [Openstack] Caching strategies in Nova ...
Alas, I let my patch get too stale to rebase properly. However, it was a fairly dumb approach I took, and it can be demonstrated just from the patch. And in any case, I think the approach you're taking, profiling based on Tach, is going to be better in the long run and more shareable in the community.

+1 gazillion to getting good metrics!
Re: [Openstack] Caching strategies in Nova ...
On Fri, 2012-03-23 at 13:43, Gabe Westmaas wrote:
However, I kind of expect that many users will still poll even if they know they won't get new data until X time.

I wish there was some kind of way for us to issue push notifications to the client, i.e., have the client register some sort of callback and what piece of data / state change they're interested in; then nova would call that callback when the condition occurred. It probably wouldn't stop polling, but we could ratchet down rate limits to encourage users to use the callback mechanism. Of course, then there's the problem of: what if the user is behind a firewall or some sort of NAT... :/

-- Kevin L. Mitchell kevin.mitch...@rackspace.com
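Kevin's callback idea, setting aside the firewall/NAT problem, boils down to a subscription table keyed on the condition each client cares about. A toy sketch — in practice the callback would be an HTTP POST to a client-supplied URL rather than an in-process callable, and nothing like this registry exists in nova:

```python
class CallbackRegistry:
    """Hypothetical push-notification registry: clients register a
    callback for a specific instance state change they care about."""
    def __init__(self):
        self._subscriptions = []

    def register(self, instance_id, state, callback):
        self._subscriptions.append((instance_id, state, callback))

    def notify(self, instance_id, new_state):
        """Would be wired to nova's notifier; fires matching callbacks."""
        for inst, state, callback in self._subscriptions:
            if inst == instance_id and state == new_state:
                callback(instance_id, new_state)
```

The rate-limit lever Kevin mentions would then make polling expensive relative to registering, nudging well-behaved clients toward the push path.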
Re: [Openstack] Caching strategies in Nova ...
On Fri, 2012-03-23 at 08:55 -0300, Sandy Walsh wrote:
I don't doubt for a second the db is the culprit for many of our woes. The thing I like about internal caching using established tools is that it works for db issues too, without having to resort to custom tables. SQL query optimization, I'm sure, will go equally far.

For that matter, I wouldn't be surprised if there were things we could do to nova's DB to speed things up. For instance, what if we supported non-SQL data stores?

-- Kevin L. Mitchell kevin.mitch...@rackspace.com
Re: [Openstack] Caching strategies in Nova ...
On Fri, Mar 23, 2012, Kevin L. Mitchell kevin.mitch...@rackspace.com wrote:
I wish there was some kind of way for us to issue push notifications to the client, i.e., have the client register some sort of callback and what piece of data / state change they're interested in; then nova would call that callback when the condition occurred. It probably wouldn't stop polling, but we could ratchet down rate limits to encourage users to use the callback mechanism. Of course, then there's the problem of: what if the user is behind a firewall or some sort of NAT... :/

Long polling is always an option.

JE
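The long-polling option Johannes mentions works around firewalls and NAT because the client initiates the connection: the server simply parks the request until the state the client last saw changes, or a timeout passes. A minimal server-side sketch (the timeout and polling interval values are illustrative; a real implementation would block on an event rather than sleep-loop):

```python
import time

def long_poll(get_state, last_seen, timeout=30.0, interval=0.5):
    """Server-side long poll: hold the request open until the resource
    state differs from what the client last saw, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        if state != last_seen:
            return state       # state changed: answer immediately
        time.sleep(interval)   # otherwise keep the request parked
    return get_state()         # timed out: return current state anyway
```

The client loops on this endpoint, passing back whatever state it last received, so each "poll" costs one held-open request per timeout window instead of a burst of rapid polls.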
Re: [Openstack] Caching strategies in Nova ...
On 03/23/2012 11:36 AM, Johannes Erdfelt wrote:
Long polling is always an option.

Or WebSockets.

-- Russell Bryant
Re: [Openstack] Caching strategies in Nova ...
On Mar 23, 2012, at 11:22 AM, Kevin L. Mitchell wrote:
For that matter, I wouldn't be surprised if there were things we could do to nova's DB to speed things up. For instance, what if we supported non-SQL data stores?

Any database is going to be slow if you're talking to it more than necessary. Even if we replaced MySQL with the latest and greatest web-scale NoSQL database out there, we'd still be slow. I'd love to see a combined effort: improving the flexibility of the DB layer as well as reducing the sheer number of calls to the database.
Re: [Openstack] Caching strategies in Nova ...
Right, Lets fix the problem, not add a patch that hides the problem. U can't put lipstick on a pig, haha. Its still a pig... On 3/22/12 8:02 PM, Mark Washenberger mark.washenber...@rackspace.com wrote: This is precisely my concern. It must be brought up that with Rackspace Cloud Servers, nearly all client codes routinely submit requests with a query parameter cache-busting=some random string just to get around problems with cache invalidation. And woe to the client that does not. I get the feeling that once trust like this is lost, a project has a hard time regaining it. I'm not saying that we can avoid inconsistency entirely. Rather, I believe we will have to embrace some eventual-consistency models to enable the performance and scale we will ultimately attain. But I just get the feeling that generic caches are really only appropriate for write-once or at least write-rarely data. So personally I would rule out external caches entirely and try to be very judicious in selecting internal caches as well. [snip]
Re: [Openstack] Caching strategies in Nova ...
+ 100 On 3/23/12 10:50 AM, Brian Lamar brian.la...@rackspace.com wrote: [snip] I'd love to see a combination effort of improving the flexibility of the DB layer as well as improvements surrounding the sheer number of calls to the database.
Re: [Openstack] Caching strategies in Nova ...
+1 to DBs being slow. But what if we used a combo of memcache and db. Or use couch/mongo. Comparison: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis Anyone have experience on large deployments to see the kind of db traffic we need to optimize for? Another thing could be to avoid joins and then do sharding. debo -Original Message- From: openstack-bounces+dedutta=cisco@lists.launchpad.net On Behalf Of Brian Lamar Sent: Friday, March 23, 2012 10:51 AM To: openstack@lists.launchpad.net Subject: Re: [Openstack] Caching strategies in Nova ... [snip]
Re: [Openstack] Caching strategies in Nova ...
On Fri, Mar 23, 2012, Debo Dutta (dedutta) dedu...@cisco.com wrote: +1 to DBs being slow. But what if we used a combo of memcache and db. Or use couch/mongo. Comparision: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis Anyone has experience on large deployments to see the kind of db traffic we need to optimize for? Another thing could be to avoid joins and then do sharding. Seems like that's the opposite of what we want. MySQL isn't exactly slow and Nova doesn't have particularly large tables. It looks like the slowness is coming from the network and how many queries are being made. Avoiding joins would mean even more queries, which looks like it would slow it down even further. JE
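Johannes's point in miniature: fetching N instances' addresses with one query each (which is what "avoiding joins" implies) costs N+1 round trips, while a join gets everything in one. A runnable sketch with sqlite3 as the stand-in database; the table and column names are made up for the illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE instances (id INTEGER PRIMARY KEY);
    CREATE TABLE fixed_ips (instance_id INTEGER, address TEXT);
""")
for i in range(1, 101):
    db.execute("INSERT INTO instances VALUES (?)", (i,))
    db.execute("INSERT INTO fixed_ips VALUES (?, ?)", (i, "10.0.0.%d" % i))

# N+1 style: one query for the list, then one per instance for its address.
queries = 1
rows = db.execute("SELECT id FROM instances").fetchall()
for (instance_id,) in rows:
    db.execute("SELECT address FROM fixed_ips WHERE instance_id = ?",
               (instance_id,)).fetchone()
    queries += 1
print("N+1 style:", queries, "queries")

# Join style: the same data in a single round trip.
joined = db.execute("""
    SELECT instances.id, fixed_ips.address
    FROM instances JOIN fixed_ips ON fixed_ips.instance_id = instances.id
""").fetchall()
print("join style: 1 query,", len(joined), "rows")
```

On a remote MySQL server each of those 101 statements also pays a network round trip, which is exactly the cost Johannes is describing.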
Re: [Openstack] Caching strategies in Nova ...
You can. The sanctioned approach is to use Yagi with a feed into something like PubSubHubBub that lives on the public interweeb. It's just an optional component. -S On 03/23/2012 12:20 PM, Kevin L. Mitchell wrote: On Fri, 2012-03-23 at 13:43 +, Gabe Westmaas wrote: However, I kind of expect that many users will still poll even if they know they won't get new data until X time. I wish there was some kind of way for us to issue push notifications to the client, i.e., have the client register some sort of callback and what piece of data / state change they're interested in, then nova would call that callback when the condition occurred. It probably wouldn't stop polling, but we could ratchet down rate limits to encourage users to use the callback mechanism. Of course, then there's the problem of, what if the user is behind a firewall or some sort of NAT... :/
Re: [Openstack] Caching strategies in Nova ...
Ugh (reply vs reply-all again) On 03/23/2012 02:58 PM, Joshua Harlow wrote: Right, Lets fix the problem, not add a patch that hides the problem. U can’t put lipstick on a pig, haha. Its still a pig... When stuff is expensive to compute, caching is the only option (yes?). Whether that lives in memcache, a db or in a dict. Tuning sql queries will only get us so far. I think creating custom view tables is a laborious and error prone tack ... additionally you get developers that start to depend on the view tables as gospel. Or am I missing something here? -S On 3/22/12 8:02 PM, Mark Washenberger mark.washenber...@rackspace.com wrote: [snip]
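The "in a dict" option Sandy mentions is where Joshua's invalidation worry bites hardest: each API worker gets its own copy of the cache, and a write through one worker leaves the others serving stale data. A toy illustration (hypothetical names; two objects stand in for two nova-api processes sharing one backing store):

```python
backing_store = {"i-1": "BUILD"}        # stands in for the shared DB

class Worker:
    """One API process with its own in-process dict cache."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.store[key]   # cache miss: hit the "DB"
        return self.cache[key]

    def write(self, key, value):
        self.store[key] = value
        self.cache[key] = value          # invalidates only THIS worker's cache

a, b = Worker(backing_store), Worker(backing_store)
print(a.read("i-1"), b.read("i-1"))      # both warm their caches with BUILD
a.write("i-1", "ACTIVE")
print(a.read("i-1"), b.read("i-1"))      # worker b still answers BUILD: stale
```

This is why the thread keeps circling back to memcache (one shared cache) or no cache at all: making every worker's private dict consistent "on all servers in all conditions" is the hard part.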
Re: [Openstack] Caching strategies in Nova ...
Johannes Erdfelt johan...@erdfelt.com said: MySQL isn't exactly slow and Nova doesn't have particularly large tables. It looks like the slowness is coming from the network and how many queries are being made. Avoiding joins would mean even more queries, which looks like it would slow it down even further. This is exactly what I saw in my profiling. More complex queries did still seem to take longer than less complex ones, but it was a second order effect compared to the overall volume of queries. I'm not sure that network was the culprit though, since my ping roundtrip time was small relative to the wall time I measured for each nova.db.api call.
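The per-call wall-time measurement Mark describes can be done with a small decorator that accumulates count and average per key and prints a profile like the one later in the thread. A sketch under obvious assumptions (the key names and the fake DB call are illustrative, not Nova's actual entry points):

```python
import functools
import time
from collections import defaultdict

stats = defaultdict(lambda: [0, 0.0])   # key -> [call count, total seconds]

def timed(key):
    """Wrap an entry point and record its wall time under `key`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                entry = stats[key]
                entry[0] += 1
                entry[1] += time.monotonic() - start
        return wrapper
    return decorator

@timed("nova.db.api.instance_get")
def instance_get(instance_id):
    time.sleep(0.01)                    # stand-in for a real DB round trip
    return {"id": instance_id}

for i in range(3):
    instance_get(i)

for key, (count, total) in sorted(stats.items()):
    print("%-35s %5d %8.3f" % (key, count, total / count))
```

Wrapping `nova.db.api`, `nova.rpc`, and the WSGI entry points this way is enough to reproduce the count/avg table without a full profiler.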
Re: [Openstack] Caching strategies in Nova ...
On Mar 23, 2012, at 10:20 AM, Kevin L. Mitchell wrote: On Fri, 2012-03-23 at 13:43 +, Gabe Westmaas wrote: However, I kind of expect that many users will still poll even if they know they won't get new data until X time. I wish there was some kind of way for us to issue push notifications to the client, i.e., have the client register some sort of callback and what piece of data / state change they're interested in, then nova would call that callback when the condition occurred. It probably wouldn't stop polling, but we could ratchet down rate limits to encourage users to use the callback mechanism. Actually, that is (one) of the things the notifications system was designed to accommodate. If you attach a feed generator (like Yagi) to the notification queues, plus a PubSubHubbub hub, folks can subscribe to events by event type. (Other pubsub strategies would work too, like XMPP pubsub) Of course, then there's the problem of, what if the user is behind a firewall or some sort of NAT... :/ PSH pushes to a web callback supplied by the client. Presumably they could run the callback receiver somewhere else, or through some proxy. -- Kevin L. Mitchell kevin.mitch...@rackspace.com -- Monsyne M. Dragon OpenStack/Nova cell 210-441-0965 work x 5014190
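The register-a-callback flow Kevin wishes for, reduced to its shape: clients subscribe to an event type, and the notifier pushes matching events to them. In the PubSubHubbub version each callback would be an HTTP POST to a client-supplied URL; plain callables keep this sketch self-contained, and the event-type string follows Nova's notification naming only as an illustration.

```python
from collections import defaultdict

subscribers = defaultdict(list)          # event_type -> [callback, ...]

def subscribe(event_type, callback):
    subscribers[event_type].append(callback)

def publish(event_type, payload):
    # In a hub-based deployment this loop would POST to subscriber URLs.
    for callback in subscribers[event_type]:
        callback(event_type, payload)

received = []
subscribe("compute.instance.create.end",
          lambda etype, payload: received.append((etype, payload)))

publish("compute.instance.create.end",
        {"instance_id": "i-1", "state": "ACTIVE"})
publish("compute.instance.delete.end", {"instance_id": "i-2"})  # no subscriber

print(received)
```

The NAT problem from the thread lives entirely in the delivery step: a callable always succeeds, while a webhook POST needs the client's receiver to be reachable, hence the suggestion to run it somewhere public or behind a proxy.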
Re: [Openstack] Caching strategies in Nova ...
Was the db on a separate server or loopback? On 03/23/2012 05:26 PM, Mark Washenberger wrote: [snip] I'm not sure that network was the culprit though, since my ping roundtrip time was small relative to the wall time I measured for each nova.db.api call.
Re: [Openstack] Caching strategies in Nova ...
On 03/23/2012 01:26 PM, Mark Washenberger wrote: Johannes Erdfelt johan...@erdfelt.com said: MySQL isn't exactly slow and Nova doesn't have particularly large tables. It looks like the slowness is coming from the network and how many queries are being made. Avoiding joins would mean even more queries, which looks like it would slow it down even further. This is exactly what I saw in my profiling. More complex queries did still seem to take longer than less complex ones, but it was a second order effect compared to the overall volume of queries. I'm not sure that network was the culprit though, since my ping roundtrip time was small relative to the wall time I measured for each nova.db.api call. How much data would the queries return, and how long between queries? One networking thing that might come into play would be slow start after idle - if the query returns are > INITCWND (either 3 or 10 segments depending on which kernel) and they are separated by at least one RTO (or is it RTT?) then they will hit slow start each time. Now, the extent to which that matters is a function of how large the return is, and it is only adding RTTs so it wouldn't be minutes, but it could add up a bit I suppose. rick jones
Re: [Openstack] Caching strategies in Nova ...
This is great: hard numbers are exactly what we need. I would love to see a statement-by-statement SQL log with timings from someone that has a performance issue. I'm happy to look into any DB problems that it demonstrates. The nova database is small enough that it should always be in-memory (if you're running a million VMs, I don't think asking for one gigabyte of RAM on your DB is unreasonable!) If it isn't hitting disk, PostgreSQL or MySQL with InnoDB can serve 10k 'indexed' requests per second through SQL on a low-end ($1000) box. With tuning you can get 10x that. Using one of the SQL bypass engines (e.g. MySQL HandlerSocket) can supposedly give you 10x again. Throwing money at the problem in the form of multi-processor boxes (or disks if you're I/O bound) can probably get you 10x again. However, if you put a DB on a remote host, you'll have to wait for a network round-trip per query. If your ORM is doing a 1+N query, the total read time will be slow. If your DB is doing a sync on every write, writes will be slow. If the DB isn't tuned with a sensible amount of cache (at least as big as the DB size), it will be slow(er). Each of these has a very simple fix for OpenStack. Relational databases have very efficient caching mechanisms built in. Any out-of-process cache will have a hard time beating it. Let's make sure the bottleneck is the DB, and not (for example) RabbitMQ, before we go off on a huge rearchitecture. Justin On Thu, Mar 22, 2012 at 7:53 PM, Mark Washenberger mark.washenber...@rackspace.com wrote: Working on this independently, I created a branch with some simple performance logging around the nova-api, and individually around glance, nova.db, and nova.rpc calls. (Sorry, I only have a local copy and it's on a different computer right now, and probably needs a rebase. I will rebase and publish it on GitHub tomorrow.) With this logging, I could get some simple profiling that I found very useful.
Here is a GH project with the analysis code as well as some nova-api logs I was using as input. https://github.com/markwash/nova-perflog With these tools, you can get a wall-time profile for individual requests. For example, looking at one server create request (and you can run this directly from the checkout as the logs are saved there):

markw@poledra:perflogs$ cat nova-api.vanilla.1.5.10.log | python profile-request.py req-3cc0fe84-e736-4441-a8d6-ef605558f37f

key                                         count   avg
nova.api.openstack.wsgi.POST                    1   0.657
nova.db.api.instance_update                     1   0.191
nova.image.show                                 1   0.179
nova.db.api.instance_add_security_group         1   0.082
nova.rpc.cast                                   1   0.059
nova.db.api.instance_get_all_by_filters         1   0.034
nova.db.api.security_group_get_by_name          2   0.029
nova.db.api.instance_create                     1   0.011
nova.db.api.quota_get_all_by_project            3   0.003
nova.db.api.instance_data_get_for_project       1   0.003

key                          count   total
nova.api.openstack.wsgi          1   0.657
nova.db.api                     10   0.388
nova.image                       1   0.179
nova.rpc                         1   0.059

All times are in seconds. The nova.rpc time is probably high since this was the first call since server restart, so the connection handshake is probably included. This is also probably 1.5 months stale. The conclusion I reached from this profiling is that we just plain overuse the db (and we might do the same in glance). For example, whenever we do updates, we actually re-retrieve the item from the database, update its dictionary, and save it. This is double the cost it needs to be. We also handle updates for data across tables inefficiently, where they could be handled in a single database round trip. In particular, in the case of server listings, extensions are just rough on performance. Most extensions hit the database again at least once. This isn't really so bad, but it clearly is an area where we should improve, since these are the most frequent api queries.
I just see a ton of specific performance problems that are easier to address one by one, rather than diving into a general (albeit obvious) solution such as caching. Sandy Walsh sandy.wa...@rackspace.com said: [snip]
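The "double the cost" update pattern Mark flags can be put side by side with the single-round-trip version. A sketch using sqlite3 as the stand-in database (table and column names are illustrative): the read-modify-write issues two statements and can race with concurrent writers, while the direct UPDATE is one statement.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE instances (id TEXT PRIMARY KEY, vm_state TEXT)")
db.execute("INSERT INTO instances VALUES ('i-1', 'BUILD')")

# Pattern the profiling flagged: re-fetch the row, mutate the dict, write it back.
row = db.execute(
    "SELECT id, vm_state FROM instances WHERE id = 'i-1'").fetchone()
updated = dict(zip(("id", "vm_state"), row), vm_state="ACTIVE")
db.execute("UPDATE instances SET vm_state = ? WHERE id = ?",
           (updated["vm_state"], updated["id"]))     # 2 round trips total

# Same effect in one statement: no intermediate SELECT, no window for a race.
db.execute("UPDATE instances SET vm_state = 'ACTIVE' WHERE id = 'i-1'")

print(db.execute(
    "SELECT vm_state FROM instances WHERE id = 'i-1'").fetchone()[0])
```

Against a remote MySQL server the difference is one network round trip versus two per update, multiplied across every state transition Nova makes.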
Re: [Openstack] Caching strategies in Nova ...
Hi Mark, what workload and what setup do you have while you are profiling? e.g. how many compute nodes do you have, how many VMs do you have, are you creating/destroying/migrating VMs, volumes, networks? Thanks, Yun On Fri, Mar 23, 2012 at 4:26 PM, Mark Washenberger mark.washenber...@rackspace.com wrote: Johannes Erdfelt johan...@erdfelt.com said: MySQL isn't exactly slow and Nova doesn't have particularly large tables. It looks like the slowness is coming from the network and how many queries are being made. Avoiding joins would mean even more queries, which looks like it would slow it down even further. This is exactly what I saw in my profiling. More complex queries did still seem to take longer than less complex ones, but it was a second order effect compared to the overall volume of queries. I'm not sure that network was the culprit though, since my ping roundtrip time was small relative to the wall time I measured for each nova.db.api call.
Re: [Openstack] Caching strategies in Nova ...
+1 Documenting these findings would be nice too. best, Joe On Fri, Mar 23, 2012 at 2:15 PM, Justin Santa Barbara jus...@fathomdb.com wrote: This is great: hard numbers are exactly what we need. [snip]
Re: [Openstack] Caching strategies in Nova ...
Hmm. . it was definitely different xen virtual machines on either the same hypervisor or one that was adjacent to it in an L2 sense. On a similar environment I have set up now, I notice that the ping time from one vm to another on the same hypervisor is not noticeably less than the ping time to a vm on a different hypervisor. Not sure why that is the case! In any case it is trivial. . ~3 ms for first ping, ~0.3 ms for subsequent pings. Sandy Walsh sandy.wa...@rackspace.com said: Was the db on a separate server or loopback? [snip]
Re: [Openstack] Caching strategies in Nova ...
Yun, I was working with a very small but fairly realistic setup. In this case I had only 3 Xen hosts, no more than 10 nova vms up at a time. And the environment was very nearly fresh so I believe the db tables were as small as they could be. I believe the utilization across the board in my setup was very low, and indeed the numbers were very consistent (I ran a large number of times, but didn't save all of the data :-(). Also, there were only 2 compute nodes running, but as the workflow only had rpc casts, I'm not sure that really mattered very much. The profile I gave was for vm creation. But I also ran tests for deletion, listing, and showing vms in the OS API. Networks were static throughout the process. Volumes were absent. Yun Mao yun...@gmail.com said: Hi Mark, what workload and what setup do you have while you are profiling? e.g. how many compute nodes do you have, how many VMs do you have, are you creating/destroying/migrating VMs, volumes, networks? Thanks, Yun [snip]
Re: [Openstack] Caching strategies in Nova ...
Great suggestions guys ... we'll give some thought on how the community can share and compare performance measurements in a consistent way. -S On 03/23/2012 07:26 PM, Joe Gordon wrote: +1 Documenting these findings would be nice too. best, Joe On Fri, Mar 23, 2012 at 2:15 PM, Justin Santa Barbara jus...@fathomdb.com mailto:jus...@fathomdb.com wrote: This is great: hard numbers are exactly what we need. I would love to see a statement-by-statement SQL log with timings from someone that has a performance issue. I'm happy to look into any DB problems that demonstrates. The nova database is small enough that it should always be in-memory (if you're running a million VMs, I don't think asking for one gigabyte of RAM on your DB is unreasonable!) If it isn't hitting disk, PostgreSQL or MySQL with InnoDB can serve 10k 'indexed' requests per second through SQL on a low-end ($1000) box. With tuning you can get 10x that. Using one of the SQL bypass engines (e.g. MySQL HandlerSocket) can supposedly give you 10x again. Throwing money at the problem in the form of multi-processor boxes (or disks if you're I/O bound) can probably get you 10x again. However, if you put a DB on a remote host, you'll have to wait for a network round-trip per query. If your ORM is doing a 1+N query, the total read time will be slow. If your DB is doing a sync on every write, writes will be slow. If the DB isn't tuned with a sensible amount of cache (at least as big as the DB size), it will be slow(er). Each of these has a very simple fix for OpenStack. Relational databases have very efficient caching mechanisms built in. Any out-of-process cache will have a hard time beating it. Let's make sure the bottleneck is the DB, and not (for example) RabbitMQ, before we go off a huge rearchitecture. 
Justin

On Thu, Mar 22, 2012 at 7:53 PM, Mark Washenberger mark.washenber...@rackspace.com wrote: Working on this independently, I created a branch with some simple performance logging around the nova-api, and individually around glance, nova.db, and nova.rpc calls. (Sorry, I only have a local copy and it's on a different computer right now, and probably needs a rebase. I will rebase and publish it on GitHub tomorrow.) With this logging, I could get some simple profiling that I found very useful. Here is a GH project with the analysis code as well as some nova-api logs I was using as input. https://github.com/markwash/nova-perflog With these tools, you can get a wall-time profile for individual requests. For example, looking at one server create request (and you can run this directly from the checkout as the logs are saved there):

markw@poledra:perflogs$ cat nova-api.vanilla.1.5.10.log | python profile-request.py req-3cc0fe84-e736-4441-a8d6-ef605558f37f

key                                        count  avg
nova.api.openstack.wsgi.POST               1      0.657
nova.db.api.instance_update                1      0.191
nova.image.show                            1      0.179
nova.db.api.instance_add_security_group    1      0.082
nova.rpc.cast                              1      0.059
nova.db.api.instance_get_all_by_filters    1      0.034
nova.db.api.security_group_get_by_name     2      0.029
nova.db.api.instance_create                1      0.011
nova.db.api.quota_get_all_by_project       3      0.003
nova.db.api.instance_data_get_for_project  1      0.003

key                      count  total
nova.api.openstack.wsgi  1      0.657
nova.db.api              10     0.388
nova.image               1      0.179
nova.rpc                 1      0.059

All times are in seconds. The nova.rpc time is probably high since this was the first call since server restart, so the connection handshake is probably included. This is also probably 1.5 months stale. The conclusion I reached from this profiling is that we just plain overuse the db (and we might do the same in glance). For example, whenever we do updates, we actually re-retrieve the item from the database, update its dictionary, and save it. This is double the cost it needs to be.
We also handle updates for data across tables inefficiently, where they could be handled in a single database round trip. In particular, in the case of server listings, extensions are just rough on performance. Most extensions hit the database again at least once. This isn't really so bad, but it clearly is an area where we should improve, since these are the most frequent api queries.
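The re-retrieve-then-save pattern described above can be sketched as follows (illustrative sqlite3 table, not Nova's schema): the first function pays two round trips where one suffices.

```python
# Sketch of the double-cost update pattern versus a single-statement
# update; the table and columns are illustrative, not Nova's.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO instances VALUES (1, 'building')")

def update_read_modify_write(inst_id, state):
    # Round trip 1: re-retrieve the row as a dict.
    row = dict(zip(("id", "state"),
                   conn.execute("SELECT id, state FROM instances"
                                " WHERE id = ?", (inst_id,)).fetchone()))
    row["state"] = state
    # Round trip 2: write the whole row back.
    conn.execute("UPDATE instances SET state = ? WHERE id = ?",
                 (row["state"], row["id"]))

def update_direct(inst_id, state):
    # Single round trip: let the database update in place.
    conn.execute("UPDATE instances SET state = ? WHERE id = ?",
                 (state, inst_id))

update_read_modify_write(1, "active")
assert conn.execute("SELECT state FROM instances"
                    " WHERE id = 1").fetchone()[0] == "active"
update_direct(1, "deleted")
assert conn.execute("SELECT state FROM instances"
                    " WHERE id = 1").fetchone()[0] == "deleted"
```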
Re: [Openstack] Caching strategies in Nova ...
Got it. Thanks, If I read your number correctly, there are 10 db api calls, with total time 0.388 seconds. This is certainly not lightning fast. But it's not really slow, given that the user is expecting to have the VM created in more than 10 seconds. 0.5 s latency is tolerable. If most of the time is spent in network to db, then I'd say when we scale up a lot in compute/vm numbers, the latency won't increase much. One thing to note is that right now the DB APIs are all blocking calls. So it could be tricky to get the performance number right when measuring multiple concurrent requests. Yun On Fri, Mar 23, 2012 at 6:47 PM, Mark Washenberger mark.washenber...@rackspace.com wrote: Yun, I was working with a very small but fairly realistic setup. In this case I had only 3 Xen hosts, no more than 10 nova vms up at a time. And the environment was very nearly fresh so I believe the db tables were as small as they could be. I believe the utilization across the board in my setup was very low, and indeed the numbers were very consistent (I ran a large number of times, but didn't save all of the data :-(). Also, there were only 2 compute nodes running, but as the workflow only had rpc casts, I'm not sure that really mattered very much. The profile I gave was for vm creation. But I also ran tests for deletion, listing, and showing vms in the OS API. Networks were static throughout the process. Volumes were absent. Yun Mao yun...@gmail.com said: Hi Mark, what workload and what setup do you have while you are profiling? e.g. how many compute nodes do you have, how many VMs do you have, are you creating/destroying/migrating VMs, volumes, networks? Thanks, Yun On Fri, Mar 23, 2012 at 4:26 PM, Mark Washenberger mark.washenber...@rackspace.com wrote: Johannes Erdfelt johan...@erdfelt.com said: MySQL isn't exactly slow and Nova doesn't have particularly large tables. It looks like the slowness is coming from the network and how many queries are being made. 
Avoiding joins would mean even more queries, which looks like it would slow it down even further. This is exactly what I saw in my profiling. More complex queries did still seem to take longer than less complex ones, but it was a second order effect compared to the overall volume of queries. I'm not sure that network was the culprit though, since my ping roundtrip time was small relative to the wall time I measured for each nova.db.api call.
Re: [Openstack] Caching strategies in Nova ...
On Mar 22, 2012, at 8:06 AM, Sandy Walsh wrote: o/ Vek and I are looking into caching strategies in and around Nova. There are essentially two approaches: in-process and external (proxy). The in-process schemes sit in with the python code while the external ones basically proxy the HTTP requests. We may need http caches as well in some cases, but we already use memcached in a few places, so I think we need internal caching as well. There are some obvious pros and cons to each approach. The external is easier for operations to manage, but in-process allows us greater control over the caching (for things like caching db calls and not just HTTP calls). But, in-memory also means more code, more memory usage on the servers, monolithic services, limited to python-based solutions, etc. In-process also gives us access to tools like Tach https://github.com/ohthree/tach for profiling performance. I see Jesse recently landed a branch that touches on the in-process approach: https://github.com/openstack/nova/commit/1bcf5f5431d3c9620596f5329d7654872235c7ee#nova/common/memorycache.py I don't know if people think putting caching code inside nova is a good or bad idea. If we do continue down this road, it would be nice to make it a little more modular/plug-in-based (YAPI .. yet another plug-in). Perhaps a hybrid solution is required? openstack-common is where jesse was planning on putting memorycache We're looking at tools like memcache, beaker, varnish, etc. I kind of like keeping our caching simple, just talking to something that replicates the python-memcached api so that we can change out an in-memory cache or actual memcached or db cache, etc... This has a bit of promise: http://code.google.com/p/python-cache/ Vish
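The swap-out approach Vish describes can be sketched roughly like this: code against the python-memcached get/set/delete interface and fall back to a local dict when no memcached servers are configured. This is a hypothetical reconstruction, not Nova's actual nova/common/memorycache.py.

```python
# Minimal in-memory stand-in exposing a python-memcached-style API, so
# callers can swap in real memcached without code changes. Hypothetical
# sketch, not Nova's actual implementation.
import time as _time  # aliased because set() shadows the name 'time'

class Client(object):
    def __init__(self):
        self.cache = {}

    def get(self, key):
        value, timeout = self.cache.get(key, (None, 0))
        if timeout and _time.time() > timeout:
            del self.cache[key]
            return None
        return value

    def set(self, key, value, time=0):
        # 'time' mirrors python-memcached: seconds until expiry, 0 = never.
        timeout = _time.time() + time if time else 0
        self.cache[key] = (value, timeout)
        return True

    def delete(self, key):
        self.cache.pop(key, None)
        return True

def get_client(memcached_servers=None):
    # The swap point: real memcached when configured, local cache otherwise.
    if memcached_servers:
        import memcache  # python-memcached; same get/set/delete API
        return memcache.Client(memcached_servers)
    return Client()

c = get_client()
c.set("metadata-10.0.0.1", {"instance_id": "i-0001"}, time=60)
assert c.get("metadata-10.0.0.1") == {"instance_id": "i-0001"}
c.delete("metadata-10.0.0.1")
assert c.get("metadata-10.0.0.1") is None
```

The key names and the metadata-lookup use case here are illustrative; the point is only that the in-memory and memcached backends are interchangeable behind one interface.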
Re: [Openstack] Caching strategies in Nova ...
Agree that there are pros and cons to caching at different layers. As for plugins, in most places where we support memcache we revert to an in-memory cache if it isn't configured. The work that was done during essex was to make the metadata service use either an external cache or internal cache. When a cloud-init-based image boots, it makes dozens of calls to get data, which results in dozens of RPC calls to nova-network to map IP to instance and then to the DB to load data. - Background info: The in-process cache I committed is actually the code previously known as the fake memcache, moved from nova/tests to nova/common. The reason for the move is two-fold: 1) the code was already in use as an in-memory cache in other locations in the code 2) nova/tests isn't included in most (all?) packagings. Also Josh Harlow has been researching better alternatives for in-memory caches for python (rather than re-inventing the wheel - which I started here: https://github.com/cloudbuilders/millicache ...) -- So ya, I think a summit proposal would be good. Jesse On Thu, Mar 22, 2012 at 8:06 AM, Sandy Walsh sandy.wa...@rackspace.com wrote: o/ Vek and I are looking into caching strategies in and around Nova. There are essentially two approaches: in-process and external (proxy). The in-process schemes sit in with the python code while the external ones basically proxy the HTTP requests. There are some obvious pros and cons to each approach. The external is easier for operations to manage, but in-process allows us greater control over the caching (for things like caching db calls and not just HTTP calls). But, in-memory also means more code, more memory usage on the servers, monolithic services, limited to python-based solutions, etc. In-process also gives us access to tools like Tach https://github.com/ohthree/tach for profiling performance.
I see Jesse recently landed a branch that touches on the in-process approach: https://github.com/openstack/nova/commit/1bcf5f5431d3c9620596f5329d7654872235c7ee#nova/common/memorycache.py I don't know if people think putting caching code inside nova is a good or bad idea. If we do continue down this road, it would be nice to make it a little more modular/plug-in-based (YAPI .. yet another plug-in). Perhaps a hybrid solution is required? We're looking at tools like memcache, beaker, varnish, etc. Has anyone already started down this road? Any insights to share? Opinions? (summit talk?) What are Glance, Swift, Keystone (lite?) doing? -S
Re: [Openstack] Caching strategies in Nova ...
What problems are caching strategies supposed to solve? On the nova compute side, it seems like streamlining db access and api-view tables would solve any performance problems caching would address, while keeping the stale data management problem small. Sandy Walsh sandy.wa...@rackspace.com said: o/ Vek and I are looking into caching strategies in and around Nova. There are essentially two approaches: in-process and external (proxy). The in-process schemes sit in with the python code while the external ones basically proxy the HTTP requests. There are some obvious pros and cons to each approach. The external is easier for operations to manage, but in-process allows us greater control over the caching (for things like caching db calls and not just HTTP calls). But, in-memory also means more code, more memory usage on the servers, monolithic services, limited to python-based solutions, etc. In-process also gives us access to tools like Tach https://github.com/ohthree/tach for profiling performance. I see Jesse recently landed a branch that touches on the in-process approach: https://github.com/openstack/nova/commit/1bcf5f5431d3c9620596f5329d7654872235c7ee#nova/common/memorycache.py I don't know if people think putting caching code inside nova is a good or bad idea. If we do continue down this road, it would be nice to make it a little more modular/plug-in-based (YAPI .. yet another plug-in). Perhaps a hybrid solution is required? We're looking at tools like memcache, beaker, varnish, etc. Has anyone already started down this road? Any insights to share? Opinions? (summit talk?) What are Glance, Swift, Keystone (lite?) doing?
-S
Re: [Openstack] Caching strategies in Nova ...
We're doing tests to find out where the bottlenecks are; caching is the most obvious solution, but there may be others. Tools like memcache do a really good job of sharing memory across servers so we don't have to reinvent the wheel or hit the db at all. In addition to looking into caching technologies/approaches we're gluing together some tools for finding those bottlenecks. Our first step will be finding them, then squashing them ... however. -S On 03/22/2012 06:25 PM, Mark Washenberger wrote: What problems are caching strategies supposed to solve? On the nova compute side, it seems like streamlining db access and api-view tables would solve any performance problems caching would address, while keeping the stale data management problem small.
Re: [Openstack] Caching strategies in Nova ...
Just from experience. They do a great job. But the killer thing about caching is how u do the cache invalidation. Just caching stuff is easy-peasy; making sure it is invalidated on all servers in all conditions, not so easy... On 3/22/12 4:26 PM, Sandy Walsh sandy.wa...@rackspace.com wrote: We're doing tests to find out where the bottlenecks are; caching is the most obvious solution, but there may be others. Tools like memcache do a really good job of sharing memory across servers so we don't have to reinvent the wheel or hit the db at all. In addition to looking into caching technologies/approaches we're gluing together some tools for finding those bottlenecks. Our first step will be finding them, then squashing them ... however. -S On 03/22/2012 06:25 PM, Mark Washenberger wrote: What problems are caching strategies supposed to solve? On the nova compute side, it seems like streamlining db access and api-view tables would solve any performance problems caching would address, while keeping the stale data management problem small.
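The invalidation problem Joshua describes is the delete-on-write discipline: every code path that mutates a record must also drop the cached copy, on every server. A minimal sketch (names and data illustrative, the dict stands in for a shared memcached cluster):

```python
# Sketch of delete-on-write cache invalidation. The 'cache' dict stands
# in for a shared memcached cluster; 'db' for the database. Illustrative
# names only.
cache = {}
db = {"instance-1": {"state": "active"}}

def get_instance(instance_id):
    if instance_id in cache:
        return cache[instance_id]      # cache hit
    value = db[instance_id]            # cache miss: hit the db
    cache[instance_id] = value
    return value

def update_instance(instance_id, **changes):
    db[instance_id] = dict(db[instance_id], **changes)
    # The part that's easy to get wrong: drop the cached copy. Miss this
    # on any write path (or any server), and readers see stale data.
    cache.pop(instance_id, None)

assert get_instance("instance-1")["state"] == "active"
update_instance("instance-1", state="deleted")
assert get_instance("instance-1")["state"] == "deleted"
```

In a real deployment the hard cases are the ones this sketch glosses over: writers racing readers, a server that misses the delete, and mutations performed outside the code path that knows about the cache.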
Re: [Openstack] Caching strategies in Nova ...
Working on this independently, I created a branch with some simple performance logging around the nova-api, and individually around glance, nova.db, and nova.rpc calls. (Sorry, I only have a local copy and it's on a different computer right now, and probably needs a rebase. I will rebase and publish it on GitHub tomorrow.) With this logging, I could get some simple profiling that I found very useful. Here is a GH project with the analysis code as well as some nova-api logs I was using as input. https://github.com/markwash/nova-perflog With these tools, you can get a wall-time profile for individual requests. For example, looking at one server create request (and you can run this directly from the checkout as the logs are saved there):

markw@poledra:perflogs$ cat nova-api.vanilla.1.5.10.log | python profile-request.py req-3cc0fe84-e736-4441-a8d6-ef605558f37f

key                                        count  avg
nova.api.openstack.wsgi.POST               1      0.657
nova.db.api.instance_update                1      0.191
nova.image.show                            1      0.179
nova.db.api.instance_add_security_group    1      0.082
nova.rpc.cast                              1      0.059
nova.db.api.instance_get_all_by_filters    1      0.034
nova.db.api.security_group_get_by_name     2      0.029
nova.db.api.instance_create                1      0.011
nova.db.api.quota_get_all_by_project       3      0.003
nova.db.api.instance_data_get_for_project  1      0.003

key                      count  total
nova.api.openstack.wsgi  1      0.657
nova.db.api              10     0.388
nova.image               1      0.179
nova.rpc                 1      0.059

All times are in seconds. The nova.rpc time is probably high since this was the first call since server restart, so the connection handshake is probably included. This is also probably 1.5 months stale. The conclusion I reached from this profiling is that we just plain overuse the db (and we might do the same in glance). For example, whenever we do updates, we actually re-retrieve the item from the database, update its dictionary, and save it. This is double the cost it needs to be. We also handle updates for data across tables inefficiently, where they could be handled in a single database round trip.
In particular, in the case of server listings, extensions are just rough on performance. Most extensions hit the database again at least once. This isn't really so bad, but it clearly is an area where we should improve, since these are the most frequent api queries. I just see a ton of specific performance problems that are easier to address one by one, rather than diving into a general (albeit obvious) solution such as caching. Sandy Walsh sandy.wa...@rackspace.com said: We're doing tests to find out where the bottlenecks are; caching is the most obvious solution, but there may be others. Tools like memcache do a really good job of sharing memory across servers so we don't have to reinvent the wheel or hit the db at all. In addition to looking into caching technologies/approaches we're gluing together some tools for finding those bottlenecks. Our first step will be finding them, then squashing them ... however. -S On 03/22/2012 06:25 PM, Mark Washenberger wrote: What problems are caching strategies supposed to solve? On the nova compute side, it seems like streamlining db access and api-view tables would solve any performance problems caching would address, while keeping the stale data management problem small.
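The kind of per-call wall-time instrumentation Mark's branch adds can be sketched as a decorator that records each call's duration under a key, then aggregates into the count/avg/total columns shown in his profile. This is a rough reconstruction, not his actual nova-perflog code:

```python
# Sketch of per-call timing instrumentation aggregated into
# count/avg/total per key, in the style of the profile above.
# Reconstruction for illustration, not the actual nova-perflog code.
import time
from collections import defaultdict
from functools import wraps

timings = defaultdict(list)

def timed(key):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                # Record wall time even if the call raises.
                timings[key].append(time.time() - start)
        return wrapper
    return decorator

@timed("nova.db.api.instance_get")
def instance_get(instance_id):
    time.sleep(0.01)  # stand-in for a real db call
    return {"id": instance_id}

def report():
    # One row per key: (key, count, avg, total), like the tables above.
    return [(key, len(s), sum(s) / len(s), sum(s))
            for key, s in sorted(timings.items())]

instance_get(1)
instance_get(2)
(key, count, avg, total), = report()
assert key == "nova.db.api.instance_get" and count == 2
```

Wrapping the nova.db.api, nova.image, and nova.rpc entry points this way is what yields a wall-time profile per request without touching the code under measurement.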
Re: [Openstack] Caching strategies in Nova ...
This is precisely my concern. It must be brought up that with Rackspace Cloud Servers, nearly all client code routinely submits requests with a query parameter cache-busting=some random string just to get around problems with cache invalidation. And woe to the client that does not. I get the feeling that once trust like this is lost, a project has a hard time regaining it. I'm not saying that we can avoid inconsistency entirely. Rather, I believe we will have to embrace some eventual-consistency models to enable the performance and scale we will ultimately attain. But I just get the feeling that generic caches are really only appropriate for write-once or at least write-rarely data. So personally I would rule out external caches entirely and try to be very judicious in selecting internal caches as well. Joshua Harlow harlo...@yahoo-inc.com said: Just from experience. They do a great job. But the killer thing about caching is how u do the cache invalidation. Just caching stuff is easy-peasy; making sure it is invalidated on all servers in all conditions, not so easy... On 3/22/12 4:26 PM, Sandy Walsh sandy.wa...@rackspace.com wrote: We're doing tests to find out where the bottlenecks are; caching is the most obvious solution, but there may be others. Tools like memcache do a really good job of sharing memory across servers so we don't have to reinvent the wheel or hit the db at all. In addition to looking into caching technologies/approaches we're gluing together some tools for finding those bottlenecks. Our first step will be finding them, then squashing them ... however. -S On 03/22/2012 06:25 PM, Mark Washenberger wrote: What problems are caching strategies supposed to solve?
On the nova compute side, it seems like streamlining db access and api-view tables would solve any performance problems caching would address, while keeping the stale data management problem small.
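The client-side workaround Mark describes above, appending a random cache-busting query parameter so intermediary caches treat every request as unique, looks roughly like this (the endpoint URL is made up):

```python
# Hypothetical sketch of the "cache-busting" client workaround: append a
# random query parameter so every request looks unique to intermediary
# caches. The endpoint URL is illustrative.
import uuid
from urllib.parse import urlencode, urlparse

def cache_busted(url):
    sep = "&" if urlparse(url).query else "?"
    return url + sep + urlencode({"cache-busting": uuid.uuid4().hex})

u1 = cache_busted("https://servers.api.example.com/v1.0/servers")
u2 = cache_busted("https://servers.api.example.com/v1.0/servers")
assert "cache-busting=" in u1
assert u1 != u2  # every request carries a fresh random token
```

That clients feel forced to do this is Mark's point: it defeats the cache entirely, so the operator pays for the cache infrastructure while the users get none of the benefit.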