Re: [openstack-dev] [nova] Proposal for an Experiment
On 08/03/2015 02:24 PM, Jesse Cook wrote:
Performance tests against 1000 node clusters being setup by OSIC? Sounds like you have a playground for your tests.

Unfortunately, the consensus of the nova cores during the mid-cycle meetup was that while this is an interesting approach, and experimenting with novel approaches can be very worthwhile, it was not considered a priority, as there is too much work already on everyone's plate for Liberty. So the experiment isn't going to happen any time soon.

-- Ed Leafe
Re: [openstack-dev] [nova] Proposal for an Experiment
Rally? Something else? What can we do to measure this?

Of course, if you are looking for a tool to measure performance, Rally is the best choice!

On Tue, Aug 4, 2015 at 5:38 PM, Ed Leafe e...@leafe.com wrote:

snip
Re: [openstack-dev] [nova] Proposal for an Experiment
On 7/15/15, 9:18 AM, Ed Leafe e...@leafe.com wrote:

Changing the architecture of a complex system such as Nova is never easy, even when we know that the design isn't working as well as we need it to. And it's even more frustrating because when the change is complete, it's hard to know if the improvement, if any, was worth it. So I had an idea: what if we ran a test of that architecture change out-of-tree? In other words, create a separate deployment, and rip out the parts that don't work well, replacing them with an alternative design. There would be no Gerrit reviews or anything that would slow down the work or add load to the already overloaded reviewers. Then we could see if this modified system is a significant-enough improvement to justify investing the time in implementing it in-tree. And, of course, if the test doesn't show what was hoped for, it is scrapped and we start thinking anew.

+1

The important part in this process is defining up front what level of improvement would be needed to make such a change worth considering, and what sort of tests would demonstrate whether or not this level was met. I'd like to discuss such an experiment next week at the Nova mid-cycle. What I'd like to investigate is replacing the current design of having the compute nodes communicating with the scheduler via message queues. This design is overly complex and has several known scalability issues. My thought is to replace this with a Cassandra [1] backend. Compute nodes would update their state to Cassandra whenever they change, and that data would be read by the scheduler to make its host selection. When the scheduler chooses a host, it would post the claim to Cassandra wrapped in a lightweight transaction, which would ensure that no other scheduler has tried to claim those resources. When the host has built the requested VM, it will delete the claim and update Cassandra with its current state.

One main motivation for using Cassandra over the current design is that it will enable us to run multiple schedulers without increasing the raciness of the system. Another is that it will greatly simplify a lot of the internal plumbing we've set up to implement in Nova what we would get out of the box with Cassandra. A third is that if this proves to be a success, it would also be able to be used further down the road to simplify inter-cell communication (but this is getting ahead of ourselves...). I've worked with Cassandra before and it has been rock-solid to run and simple to set up. I've also had preliminary technical reviews with the engineers at DataStax [2], the company behind Cassandra, and they agreed that this was a good fit.

At this point I'm sure that most of you are filled with thoughts on how this won't work, or how much trouble it will be to switch, or how much more of a pain it will be, or how you hate non-relational DBs, or any of a zillion other negative thoughts. FWIW, I have them too. But instead of ranting, I would ask that we acknowledge for now that:

Call me an optimist, I think this can work :)

I would prefer a solution that avoids state management altogether and instead depends on each individual making rule-based decisions using their limited observations of their perceived environment. Of course, this has certain emergent behaviors you have to learn from, but on the upside, no more braiding state throughout the system.
I don't like the assumption that it has to be a global state management problem when it doesn't have to be. That being said, I'm not opposed to trying a solution like you described using Cassandra or something similar. I generally support improvements :)

a) it will be disruptive and painful to switch something like this at this point in Nova's development
b) it would have to provide *significant* improvement to make such a change worthwhile

So what I'm asking from all of you is to help define the second part: what we would want improved, and how to measure those benefits. In other words, what results would you have to see in order to make you reconsider your initial "nah, this'll never work" reaction, and start to think that this will be a worthwhile change to make to Nova.

I'd like to see n build requests within 1 second each be successfully scheduled to a host that has spare capacity, with only, say, a total system capacity of n * 1.10, where n >= 1, each cell having ~100 hosts, the number of hosts >= n * 0.10 and <= n * 0.90, and the number of schedulers >= 2.

For example:
  Build requests: 10000 in 1 second
  Slots for flavor requested: 11000
  Hosts that can build flavor: 7500
  Number of schedulers: 3
  Number of cells: 75 (each with 100 hosts)

I'm also asking that you refrain from talking about why this can't work for now. I know it'll be difficult to do that, since nobody likes ranting about stuff more than I do, but right now it won't be helpful. There will be plenty of time for that.
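(To make those acceptance criteria easy to re-derive for other cluster sizes, here is a tiny illustrative Python sketch; the function name and numbers are just the ones from the example above, nothing official:)

    # Illustrative only: derive the capacity targets above for a chosen n.
    def capacity_targets(n, cell_size=100):
        slots = int(n * 1.10)          # total slots for the requested flavor
        min_hosts = int(n * 0.10)      # lower bound on hosts that can build it
        max_hosts = int(n * 0.90)      # upper bound on hosts that can build it
        return {"build_requests": n,
                "flavor_slots": slots,
                "host_range": (min_hosts, max_hosts),
                "cells_for": lambda hosts: hosts // cell_size}

    targets = capacity_targets(10000)
    print(targets["flavor_slots"])     # 11000, matching the example
    print(targets["host_range"])       # (1000, 9000); 7500 falls inside
    print(targets["cells_for"](7500))  # 75 cells of ~100 hosts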
Re: [openstack-dev] [nova] Proposal for an Experiment
Excerpts from Jesse Cook's message of 2015-07-20 07:48:46 -0700:

snip

I'd like to see n build requests within 1 second each be successfully scheduled to a host that has spare capacity, with only, say, a total system capacity of n * 1.10, where n >= 1, each cell having ~100 hosts, the number of hosts >= n * 0.10 and <= n * 0.90, and the number of schedulers >= 2.

For example:
  Build requests: 10000 in 1 second
  Slots for flavor requested: 11000
  Hosts that can build flavor: 7500
  Number of schedulers: 3
  Number of cells: 75 (each with 100 hosts)

This is right on, though one thing missing is where the current code fails this
Re: [openstack-dev] [nova] Proposal for an Experiment
On 07/20/2015 02:04 PM, Clint Byrum wrote:

Excerpts from Chris Friesen's message of 2015-07-20 12:17:29 -0700:

Some questions:

1) Could you elaborate a bit on how this would work? I don't quite understand how you would handle a request for booting an instance with a certain set of resources--would you queue up a message for each resource?

Please be concrete on what you mean by resource. I'm suggesting if you only have flavors, which have cpu, ram, disk, and rx/tx ratios, then each flavor is a queue. That's the easiest problem to solve. Then if you have a single special thing that can only have one VM per host (let's say, a PCI pass-through thing), then that's another iteration of each flavor. So assuming 3 flavors:

  1=tiny   cpu=1,ram=1024m,disk=5gb,rxtx=1
  2=medium cpu=2,ram=4096m,disk=100gb,rxtx=2
  3=large  cpu=8,ram=16384m,disk=200gb,rxtx=2

This means you have these queues:

  reserve
  release
  compute,cpu=1,ram=1024m,disk=5gb,rxtx=1,pci=1
  compute,cpu=1,ram=1024m,disk=5gb,rxtx=1
  compute,cpu=2,ram=4096m,disk=100gb,rxtx=2,pci=1
  compute,cpu=2,ram=4096m,disk=100gb,rxtx=2
  compute,cpu=8,ram=16384m,disk=200gb,rxtx=2,pci=1
  compute,cpu=8,ram=16384m,disk=200gb,rxtx=2

snip

Now, I've made this argument in the past, and people have pointed out that the permutations can get into the tens of thousands very easily if you start adding lots of dimensions and/or flavors. I suggest that is no big deal, but maybe I'm biased because I have done something like that in Gearman and it was, in fact, no big deal.

Yeah, that's what I was worried about. We have things that can be specified per flavor, and things that can be specified per image, and things that can be specified per instance, and they all multiply together.

2) How would it handle stuff like weight functions where you could have multiple compute nodes that *could* satisfy the requirement but some of them would be better than others by some arbitrary criteria.

Can you provide a concrete example? Feels like I'm asking for a straw man to be built. ;)

Well, as an example we have a cluster that is aimed at high-performance network processing and so all else being equal they will choose the compute node with the least network traffic. You might also try to pack instances together for power efficiency (allowing you to turn off unused compute nodes), or choose the compute node that results in the tightest packing (to minimize unused resources).

3) The biggest improvement I'd like to see is in group scheduling. Suppose I want to schedule multiple instances, each with their own resource requirements, but also with interdependency between them (these ones on the same node, these ones not on the same node, these ones with this provider network, etc.) The scheduler could then look at the whole request all at once and optimize it rather than looking at each piece separately. That could also allow relocating multiple instances that want to be co-located on the same compute node.

So, if the grouping is arbitrary, then there's no way to pre-calculate the group size, I agree. I am wont to pursue something like this though, as I don't really think this is the kind of optimization that cloud workloads should be built on top of. If you need two processes to have low latency, why not just boot a bigger machine and do it all in one VM? There are a few reasons I can think of, but I wonder how many are in the general case?

It's a fair question.
:) I honestly don't know... I was just thinking that we allow the expression of affinity/anti-affinity policies via server groups, but the scheduler doesn't really do a good job of actually scheduling those groups.

Chris
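(For reference, this is roughly what expressing that looks like from the client side today with python-novaclient; a minimal sketch using the older positional-auth form for brevity, with placeholder credentials and IDs:)

    # Sketch: express anti-affinity via a server group, then pass the group
    # as a scheduler hint at boot time.  Endpoint/credentials are placeholders.
    from novaclient import client

    nova = client.Client('2', 'demo', 'secret', 'demo',
                         'http://keystone.example.com:5000/v2.0')

    group = nova.server_groups.create(name='db-cluster',
                                      policies=['anti-affinity'])

    for i in range(3):
        nova.servers.create(name='db-%d' % i,
                            image='IMAGE_UUID',
                            flavor='FLAVOR_ID',
                            scheduler_hints={'group': group.id})

The policy is recorded, but as noted above the scheduler only checks it host-by-host as each request arrives, rather than optimizing the group as a whole.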
Re: [openstack-dev] [nova] Proposal for an Experiment
I have a feeling that we really need to make sure that whatever this selection process is has clearly defined API boundaries, so that various 'implementation experiments' can be used (and researched on). Those API boundaries will be what scheduling entities must provide, but the implementations could be many things. I have a feeling that this is really an on-going area of research and no solution will likely be optimal 'yet' (maybe someday...). Without even defined API boundaries I start to wonder if this whole exploration will end up just burning people out (when said people find a possible solution but the code won't be accepted due to lack of API boundaries in the first place); I believe gantt was trying to fix this (but I'm not sure of the status of that)?

-Josh

Chris Friesen wrote:

snip
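(To make the "defined API boundaries" idea slightly more concrete, here is a purely hypothetical sketch of the kind of minimal interface scheduling entities could be asked to provide; the names are made up, this is not gantt's or Nova's actual interface:)

    # Hypothetical boundary that 'implementation experiments' would plug into.
    import abc


    class SchedulerDriver(abc.ABC):
        """Made-up minimal contract for swappable scheduler experiments."""

        @abc.abstractmethod
        def update_host_state(self, host, resources):
            """Record a compute host's latest available resources."""

        @abc.abstractmethod
        def select_destinations(self, request_spec, num_instances=1):
            """Return a list of (host, claim_id) tuples for the request."""

        @abc.abstractmethod
        def confirm_claim(self, claim_id):
            """Called by the compute host once the instance is built."""

        @abc.abstractmethod
        def abandon_claim(self, claim_id):
            """Release a claim that could not be fulfilled."""


    class CassandraScheduler(SchedulerDriver):
        ...  # experiment 1: claims as lightweight transactions


    class QueuePerFlavorScheduler(SchedulerDriver):
        ...  # experiment 2: worker/queue model

With something like this in place, the experiments discussed in this thread could be compared head-to-head without each one having to re-plumb Nova.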
Re: [openstack-dev] [nova] Proposal for an Experiment
On 07/20/2015 11:40 AM, Clint Byrum wrote:

To your earlier point about state being abused in the system, I totally 100% agree. In the past I've wondered a lot if there can be a worker model, where compute hosts all try to grab work off queues if they have available resources. So API requests for boot/delete don't change any state, they just enqueue a message. Queues would be matched up to resources, and the more filter choices, the more queues. Each time a compute node completed a task (create vm, destroy vm) it would re-evaluate all of the queues and subscribe to the ones it could satisfy right now. Quotas would simply be the first stop for the enqueued create messages, and a final stop for the enqueued delete messages (once it's done, release quota). If you haven't noticed, this would agree with Robert Collins's suggestion that something like Kafka is a technology more suited to this (or my favorite old-often-forgotten solution to this, Gearman ;)

This would have no global dynamic state, and very little local dynamic state. API, conductor, and compute nodes simply need to know all of the choices users are offered, and there is no scheduler at runtime, just a predictive queue-list-manager that only gets updated when choices are added or removed. This would relieve a ton of the burden currently put on the database by scheduling, since the only accesses would be simple read/writes (that includes 'server-list' type operations, since that would read a single index key).

Some questions:

1) Could you elaborate a bit on how this would work? I don't quite understand how you would handle a request for booting an instance with a certain set of resources--would you queue up a message for each resource?

2) How would it handle stuff like weight functions where you could have multiple compute nodes that *could* satisfy the requirement but some of them would be better than others by some arbitrary criteria.

3) The biggest improvement I'd like to see is in group scheduling. Suppose I want to schedule multiple instances, each with their own resource requirements, but also with interdependency between them (these ones on the same node, these ones not on the same node, these ones with this provider network, etc.) The scheduler could then look at the whole request all at once and optimize it rather than looking at each piece separately. That could also allow relocating multiple instances that want to be co-located on the same compute node.

Chris
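(A quick illustration of the "more filter choices, the more queues" point; a toy Python sketch, not anything in Nova, showing how queue names could be derived from flavors plus a few extra boolean dimensions and how quickly the count grows:)

    # Toy sketch: derive queue names from flavors x extra boolean dimensions.
    from itertools import product

    flavors = {
        'tiny':   'cpu=1,ram=1024m,disk=5gb,rxtx=1',
        'medium': 'cpu=2,ram=4096m,disk=100gb,rxtx=2',
        'large':  'cpu=8,ram=16384m,disk=200gb,rxtx=2',
    }
    extras = ['pci', 'numa_pinned', 'local_ssd']   # each either present or not

    queues = []
    for spec, combo in product(flavors.values(),
                               product([0, 1], repeat=len(extras))):
        suffix = ','.join('%s=1' % name
                          for name, present in zip(extras, combo) if present)
        queues.append('compute,%s%s' % (spec, ',' + suffix if suffix else ''))

    print(len(queues))   # 3 flavors * 2**3 extra dimensions = 24 queues already
    print(queues[0])     # e.g. 'compute,cpu=1,ram=1024m,disk=5gb,rxtx=1'

Every new flavor multiplies the count by one, and every new boolean dimension doubles it, which is exactly the permutation concern raised below.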
Re: [openstack-dev] [nova] Proposal for an Experiment
Excerpts from Chris Friesen's message of 2015-07-20 12:17:29 -0700:

On 07/20/2015 11:40 AM, Clint Byrum wrote:

snip

Some questions:

1) Could you elaborate a bit on how this would work? I don't quite understand how you would handle a request for booting an instance with a certain set of resources--would you queue up a message for each resource?

Please be concrete on what you mean by resource. I'm suggesting if you only have flavors, which have cpu, ram, disk, and rx/tx ratios, then each flavor is a queue. That's the easiest problem to solve. Then if you have a single special thing that can only have one VM per host (let's say, a PCI pass-through thing), then that's another iteration of each flavor. So assuming 3 flavors:

  1=tiny   cpu=1,ram=1024m,disk=5gb,rxtx=1
  2=medium cpu=2,ram=4096m,disk=100gb,rxtx=2
  3=large  cpu=8,ram=16384m,disk=200gb,rxtx=2

This means you have these queues:

  reserve
  release
  compute,cpu=1,ram=1024m,disk=5gb,rxtx=1,pci=1
  compute,cpu=1,ram=1024m,disk=5gb,rxtx=1
  compute,cpu=2,ram=4096m,disk=100gb,rxtx=2,pci=1
  compute,cpu=2,ram=4096m,disk=100gb,rxtx=2
  compute,cpu=8,ram=16384m,disk=200gb,rxtx=2,pci=1
  compute,cpu=8,ram=16384m,disk=200gb,rxtx=2

Also you have a delete queue per compute node (and migrate and and and... RPC is still pretty unchanged at the single-instance level).

So, compute nodes that have the pci device boot up, query the flavors table, and subscribe to the compute queues that they can satisfy now (which would be _all_ of them assuming they have 16G of ram available). A user asks for a tiny + pci pass-through. The API node injects a message into the reserve queue, a conductor receives it, checks the user's quota, bumps usage by 1, and then sends it to the appropriate compute queue. A compute node receives it. It starts the VM, ACKs the job (so it is dropped from the queue and won't be retried) and then looks at its capabilities vs. the queues, and unsubscribes from all of the pci=1 queues, since its one pci device is in use.

When the user deletes the node, the compute node receives that on its delete queue, removes the node, and then sends a message on the release queue so that the resources can be returned to the user's quota (or we can talk about whether to just release them earlier... when releasing happens is a sub-topic).

Now, I've made this argument in the past, and people have pointed out that the permutations can get into the tens of thousands very easily if you start adding lots of dimensions and/or flavors. I suggest that is no big deal, but maybe I'm biased because I have done something like that in Gearman and it was, in fact, no big deal.

2) How would it handle stuff like weight functions where you could have multiple compute nodes that *could* satisfy the requirement but some of them would be better than others by some arbitrary criteria.

Can you provide a concrete example? Feels like I'm asking for a straw man to be built. ;)

3) The biggest improvement I'd like to see is in group scheduling. Suppose I want to schedule multiple instances, each with their own resource requirements, but also with interdependency between them (these ones on the same node, these ones not on the same node, these ones with this provider network, etc.) The scheduler could then look at the whole request all at once and optimize it rather than looking at each piece separately. That could also allow relocating multiple instances that want to be co-located on the same compute node.

So, if the grouping is arbitrary, then there's no way to pre-calculate the group size, I agree.
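(Here is a small, self-contained toy model of that flow: in-memory queues, one compute node, no real messaging library, purely to illustrate the subscribe/ACK/re-evaluate cycle described above; none of this is Nova code:)

    # Toy model of the worker/queue flow above.  Purely illustrative.
    from collections import defaultdict, deque

    ALL_QUEUES = ['compute,cpu=1,ram=1024m,disk=5gb,rxtx=1,pci=1',
                  'compute,cpu=1,ram=1024m,disk=5gb,rxtx=1']
    PENDING = defaultdict(deque)          # queue name -> waiting boot requests


    def ram_of(queue):
        return int(queue.split('ram=')[1].split('m')[0])


    def wants_pci(queue):
        return ',pci=1' in queue


    class ComputeNode(object):
        def __init__(self, name, ram_mb, pci_devs):
            self.name, self.free_ram, self.free_pci = name, ram_mb, pci_devs
            self.resubscribe()

        def resubscribe(self):
            # Re-evaluate every queue after each completed task.
            self.subscriptions = {
                q for q in ALL_QUEUES
                if ram_of(q) <= self.free_ram
                and (not wants_pci(q) or self.free_pci > 0)}

        def poll(self):
            for q in sorted(self.subscriptions):
                if PENDING[q]:
                    req = PENDING[q].popleft()     # "ACK": drop it from the queue
                    self.free_ram -= ram_of(q)
                    self.free_pci -= 1 if wants_pci(q) else 0
                    self.resubscribe()             # e.g. unsubscribe from pci=1
                    return req
            return None


    node = ComputeNode('c1', ram_mb=16384, pci_devs=1)
    PENDING[ALL_QUEUES[0]].append({'user': 'demo', 'flavor': 'tiny', 'pci': True})
    print(node.poll())           # the tiny + pci request is handled
    print(node.subscriptions)    # the pci=1 queue is gone: its one device is used

The quota/reserve/release steps handled by the conductor are left out here; the point is only that no component ever needs a global view of host state.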
Re: [openstack-dev] [nova] Proposal for an Experiment
Excerpts from Chris Friesen's message of 2015-07-20 14:30:53 -0700:

On 07/20/2015 02:04 PM, Clint Byrum wrote:

snip

Yeah, that's what I was worried about. We have things that can be specified per flavor, and things that can be specified per image, and things that can be specified per instance, and they all multiply together.

So all that matters is the size of the set of permutations that people are using _now_ to request nodes. It's relatively low-cost to create the queues in a distributed manner and just have compute nodes listen to a broadcast for new ones that they should try to subscribe to. Even if there are 1 million queues possible, it's unlikely there will be 1 million legitimate unique boot arguments. This does complicate things quite a bit though, so part of me just wants to suggest "don't do that". ;)

snip

Well, as an example we have a cluster that is aimed at high-performance network processing and so all else being equal they will choose the compute node with the least network traffic. You might also try to pack instances together for power efficiency (allowing you to turn off unused compute nodes), or choose the compute node that results in the tightest packing (to minimize unused resources).

Least-utilized is hard since it requires knowledge of all of the nodes' state. It also breaks down and gives 0 benefit when all the nodes are fully bandwidth-utilized. However, "Below 20% utilized" is extremely easy and achieves the actual goal that the user stated, since each node can self-assess whether it is or is not in that group. In this way a user gets given an error "I don't have any fully available networking for you" instead of getting a node which is oversubscribed unknowingly.

Packing is kind of interesting. One can achieve it on an empty cluster simply by only turning on one node at a time and, whenever the queue has fewer than safety_margin workers, turning on more nodes. However, once nodes are full and workloads are being deleted, you want to assess which ones would be the least cost to migrate off of and turn off. I'm inclined to say I would do this from something outside the scheduler, as part of a power-reclaimer, but perhaps a centralized scheduler that always knows would do a better job here. It would need to do that in such a manner that is so efficient it would outweigh the benefit of not needing global state awareness. An external reclaimer can work in an eventually consistent manner and thus I would still lean toward that over the realtime scheduler, but this needs some experimentation to confirm.

3) The biggest improvement I'd like to see is in group scheduling. Suppose I want to schedule multiple instances, each with their own resource requirements, but also with interdependency between them (these ones on the same node, these ones not on the same node, these ones with this provider network, etc.) The scheduler could then look at the whole request all at once and optimize it rather than looking at each piece separately. That could also allow relocating multiple instances that want to be co-located on the same compute node.
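(Going back to the "each node can self-assess" point above, that predicate is easy to sketch; an illustrative toy only, with made-up threshold and link numbers:)

    # Toy sketch: a node decides for itself whether it belongs to the
    # "below 20% network-utilized" group, with no global view needed.
    def below_20_percent_utilized(tx_bytes_per_s, rx_bytes_per_s,
                                  link_capacity_bytes_per_s):
        used = tx_bytes_per_s + rx_bytes_per_s
        return used < 0.20 * link_capacity_bytes_per_s

    # A compute node would run this locally and stay subscribed to the
    # "low-network" queue only while the predicate holds.
    TEN_GBIT = 10 * 1000**3 / 8           # ~1.25 GB/s, made-up example link
    print(below_20_percent_utilized(40_000_000, 60_000_000, TEN_GBIT))   # True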
Re: [openstack-dev] [nova] Proposal for an Experiment
Excerpts from Joshua Harlow's message of 2015-07-20 14:57:48 -0700:

snip

I believe gantt was trying to fix this (but I'm not sure of the status of that)?

Yes, right now it's just too tightly wound into Nova to experiment without doing major surgery. If one can simply make the scheduler go faster, without having to change everything else around it, we get something that is easier to test, and easier for deployers to migrate to.
Re: [openstack-dev] [nova] Proposal for an Experiment
Clint Byrum wrote:

snip

Packing is kind of interesting. One can achieve it on an empty cluster simply by only turning on one node at a time and, whenever the queue has fewer than safety_margin workers, turning on more nodes. However, once nodes are full and workloads are being deleted, you want to assess which ones would be the least cost to migrate off of and turn off. I'm inclined to say I would do this from something outside the scheduler, as part of a power-reclaimer, but perhaps a centralized scheduler that always knows would do a better job here.

From what I've heard (I don't know how widely this is done in the industry), actually turning off nodes causes more problems than it solves in terms of power costs, cooling, hardware [disk, cpu, other] failures and so on, so maybe turning nodes off may not be the best idea. These are all things I've heard second-hand though, so it may not be what others do.
Re: [openstack-dev] [nova] Proposal for an Experiment
On 15 July 2015 at 19:25, Robert Collins robe...@robertcollins.net wrote:

snip

+1 for trying Kafka

I have tried to write up my thoughts on the Kafka approach (and a few related things) in here:
https://review.openstack.org/#/c/191914/5/specs/backlog/approved/parallel-scheduler.rst,cm

It's trying to describe what I want to prototype for the next scheduler; it's also possibly one of the worst specs I have ever seen. There may be some ideas worth nicking in there (there may not be!)

John

PS I also cover my want for multiple schedulers living in Nova, long term (we already have 2.5 schedulers, depending on how you count them). I can see some of these schedulers being the best for a subset of deployments.
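(For the Kafka idea, a rough sketch of the shape it might take with the kafka-python client; the topic name, record fields and the log-compaction idea are my assumptions, not anything taken from the spec linked above:)

    # Sketch: each compute node publishes its latest resource view to a Kafka
    # topic keyed by hostname; with log compaction, the newest record per host
    # is what a scheduler would replay on startup.  Assumes kafka-python.
    import json
    import socket
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers='kafka.example.com:9092',
        key_serializer=lambda k: k.encode('utf-8'),
        value_serializer=lambda v: json.dumps(v).encode('utf-8'))

    host = socket.gethostname()
    state = {'free_ram_mb': 15360, 'free_vcpus': 14, 'free_disk_gb': 400}

    # Key by host so a compacted topic keeps only the newest state per host.
    producer.send('compute-host-state', key=host, value=state)
    producer.flush()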
Re: [openstack-dev] [nova] Proposal for an Experiment
Chris Friesen wrote:

On 07/15/2015 09:31 AM, Joshua Harlow wrote:
I do like experiments! What about going even farther and trying to integrate somehow into mesos? https://mesos.apache.org/documentation/latest/mesos-architecture/ Replace the hadoop executor, MPI executor with a 'VM executor' and perhaps we could eliminate a large part of the scheduler code (just a thought)...

Is the mesos scheduler sufficiently generic as to encompass all the filters we currently have in nova?

IMHO some of these should probably have never existed in the first place, e.g. https://github.com/openstack/nova/blob/master/nova/scheduler/filters/json_filter.py since they are near impossible to ever migrate away from once created (a JSON-based grammar for selecting hosts, like woah). So if someone is going to do a comparison/experiment I'd hope that they can overlook some of the filters that should likely never have been created in the first place ;)

Chris
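(For anyone who hasn't run into it, this is roughly what a JsonFilter request looks like; the particular hint below is illustrative, but passing a JSON expression via the 'query' scheduler hint against $-prefixed host attributes is how that filter works, which is exactly the "like woah" part:)

    # Illustrative JsonFilter scheduler hint: "give me a host with at least
    # 4 GB of free RAM and 100 GB of free disk".  The grammar is the filter's;
    # the client call and IDs are placeholders.
    import json
    from novaclient import client

    nova = client.Client('2', 'demo', 'secret', 'demo',
                         'http://keystone.example.com:5000/v2.0')

    query = ['and',
             ['>=', '$free_ram_mb', 4096],
             ['>=', '$free_disk_mb', 100 * 1024]]

    nova.servers.create(name='picky-instance',
                        image='IMAGE_UUID',
                        flavor='FLAVOR_ID',
                        scheduler_hints={'query': json.dumps(query)})

Once users start encoding placement policy in ad-hoc expressions like this, any replacement scheduler has to either reimplement the grammar or break them, which is Joshua's migration concern.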
Re: [openstack-dev] [nova] Proposal for an Experiment
On 16 July 2015 at 02:18, Ed Leafe e...@leafe.com wrote:

... What I'd like to investigate is replacing the current design of having the compute nodes communicating with the scheduler via message queues. This design is overly complex and has several known scalability issues. My thought is to replace this with a Cassandra [1] backend. Compute nodes would update their state to Cassandra whenever they change, and that data would be read by the scheduler to make its host selection. When the scheduler chooses a host, it would post the claim to Cassandra wrapped in a lightweight transaction, which would ensure that no other scheduler has tried to claim those resources. When the host has built the requested VM, it will delete the claim and update Cassandra with its current state.

+1 on doing an experiment.

Some semi-random thoughts here. Well, not random at all, I've been mulling on this for a while. I think Kafka may fit our model significantly vis-a-vis updating state more closely than Cassandra does. It would be neat if we could do a few different sketchy implementations and head-to-head test them.

I love Cassandra in a lot of ways, but "lightweight transaction" are two words that I'd really not expect to see in Cassandra (yes, I know it has them in the official docs and design :)) - it's a full paxos interaction to do SERIAL consistency, which is more work than either QUORUM or LOCAL_QUORUM. A sharded approach - there is only one compute node in question for the update needed - can be less work than either and still race free.

I too also very much want to see us move to brokerless RPC, systematically, for all the reasons :). You might need a little of that mixed into the experiments, depending on the scale reached.

In terms of quantification: are you looking to test scalability (e.g. scheduling some N events per second without races) [there are huge improvements possible by rewriting the current scheduler's innards to be less wasteful, but that doesn't address active-active setups], latency (e.g. 99th percentile time-to-schedule), or ... ?

-Rob
--
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud
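(To make the lightweight-transaction cost concrete, this is roughly what the claim write would look like with the DataStax Python driver; the keyspace, table and columns are made up for illustration, and the IF NOT EXISTS is what triggers the Paxos round described above:)

    # Sketch of a scheduler claim as a Cassandra lightweight transaction.
    # Keyspace/table/columns are invented for illustration.
    import uuid
    from cassandra.cluster import Cluster

    session = Cluster(['cassandra.example.com']).connect('scheduler')

    claim = """
        INSERT INTO claims (host, instance_uuid, vcpus, ram_mb)
        VALUES (%s, %s, %s, %s)
        IF NOT EXISTS
    """
    rs = session.execute(claim, ('compute-17', uuid.uuid4(), 2, 4096))

    # The first column of an LWT result is the [applied] boolean: False means
    # another scheduler got there first and we must pick a different host.
    if not rs.one()[0]:
        print('claim lost the race; retry with another host')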
Re: [openstack-dev] [nova] Proposal for an Experiment
What you describe is a spike. It's a grand plan, and you don't need anyone's permission, so huzzah for the spike!

As far as what should be improved, I hear a lot that having multiple schedulers does not scale well, so I'd suggest that as a primary target (maybe measure the _current_ problem, and then set the target as a 10x improvement over what we have now). Things to consider while pushing on that goal:

* Do not backslide the resilience in the system. The code is just now starting to be fault tolerant when talking to RabbitMQ, so make sure to also consider how tolerant of failures this will be. Cassandra is typically chosen for its resilience and performance, but Cassandra does a neat trick in that clients can switch its CAP theorem profile from Consistent and Available (but slow) to Available and Performant when reading things. That might be useful in the context of trying to push the performance _UP_ for schedulers, while not breaking anything else.

* Consider the cost of introducing a brand new technology into the deployer space. If there _is_ a way to get the desired improvement with, say, just MySQL and some clever sharding, then that might be a smaller pill to swallow for deployers.

Anyway, I wish you well on this endeavor and hope to see your results soon!

Excerpts from Ed Leafe's message of 2015-07-15 07:18:42 -0700:

snip
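(On the first bullet above: with the DataStax Python driver, that consistency trade-off is a per-statement knob; just a sketch, the table and query are made up:)

    # Sketch: per-query consistency tuning with the DataStax Python driver.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['cassandra.example.com']).connect('scheduler')

    # Fast, possibly slightly stale host-state reads for building a candidate
    # list of hosts:
    fast_read = SimpleStatement(
        "SELECT host, free_ram_mb FROM host_state",
        consistency_level=ConsistencyLevel.ONE)

    # Claims can use a stronger level (and IF NOT EXISTS forces the Paxos
    # round) so two schedulers can't take the same slot:
    claim = SimpleStatement(
        "INSERT INTO claims (host, instance_uuid) VALUES (%s, %s) IF NOT EXISTS",
        consistency_level=ConsistencyLevel.QUORUM)

    rows = session.execute(fast_read)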
Re: [openstack-dev] [nova] Proposal for an Experiment
On Jul 15, 2015, at 1:08 PM, Maish Saidel-Keesing mais...@maishsk.com wrote:

* Consider the cost of introducing a brand new technology into the deployer space. If there _is_ a way to get the desired improvement with, say, just MySQL and some clever sharding, then that might be a smaller pill to swallow for deployers.

+1000 to this part regarding introducing a new technology

Yes, of course it has been considered. If it were trivial, I would just propose a blueprint. Again, I'd really like to hear ideas on what kind of results would be convincing enough to make it worthwhile to introduce a new technology.

--
Ed Leafe
Re: [openstack-dev] [nova] Proposal for an Experiment
On 16 July 2015 at 07:27, Ed Leafe e...@leafe.com wrote:

[...]

Yes, of course it has been considered. If it were trivial, I would just propose a blueprint. Again, I'd really like to hear ideas on what kind of results would be convincing enough to make it worthwhile to introduce a new technology.

We spent some summit time discussing just this: https://wiki.openstack.org/wiki/TechnologyChoices

The summary here is IMO:
- ops will follow where we lead BUT
- we need to take their needs into account - which includes robustness, operability, and so on
- things where an alternative implementation exists can be uptake-driven: e.g. we expand the choices, and observe what folk move onto.

That said, I think the fundamental thing today is that we have bugs and they're not getting fixed. LOTS of them. Where fixing them needs better plumbing, let's be bold - but not hasty.

-Rob

--
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud
Re: [openstack-dev] [nova] Proposal for an Experiment
On 07/15/15 20:40, Clint Byrum wrote:

[...]

* Consider the cost of introducing a brand new technology into the deployer space. If there _is_ a way to get the desired improvement with, say, just MySQL and some clever sharding, then that might be a smaller pill to swallow for deployers.

+1000 to this part regarding introducing a new technology

[remainder of Clint's message and the quoted proposal snipped]
Re: [openstack-dev] [nova] Proposal for an Experiment
On 07/15/2015 09:31 AM, Joshua Harlow wrote:

I do like experiments! What about going even farther and trying to integrate somehow into mesos? https://mesos.apache.org/documentation/latest/mesos-architecture/ Replace the hadoop executor, MPI executor with a 'VM executor' and perhaps we could eliminate a large part of the scheduler code (just a thought)...

Is the mesos scheduler sufficiently generic as to encompass all the filters we currently have in nova?

Chris
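For context, nova's scheduler filters are essentially per-host predicates run against every request; a toy sketch of that shape (the class and field names below are illustrative, not nova's actual filter API):

    # A toy stand-in for the kind of per-host predicate nova's filter
    # scheduler applies; a Mesos-based replacement would need some way
    # to express checks like this against resource offers.
    class RamFilter(object):
        def host_passes(self, host_state, request_spec):
            # Accept the host only if it has enough free RAM left.
            return host_state['free_ram_mb'] >= request_spec['ram_mb']

    hosts = [{'name': 'node1', 'free_ram_mb': 2048},
             {'name': 'node2', 'free_ram_mb': 512}]
    request = {'ram_mb': 1024}
    candidates = [h for h in hosts if RamFilter().host_passes(h, request)]
    print([h['name'] for h in candidates])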
Re: [openstack-dev] [nova] Proposal for an Experiment
Chris Friesen wrote:

On 07/15/2015 09:31 AM, Joshua Harlow wrote:

I do like experiments! What about going even farther and trying to integrate somehow into mesos? https://mesos.apache.org/documentation/latest/mesos-architecture/ Replace the hadoop executor, MPI executor with a 'VM executor' and perhaps we could eliminate a large part of the scheduler code (just a thought)...

Is the mesos scheduler sufficiently generic as to encompass all the filters we currently have in nova?

Unsure; if not, it's just another open-source project, right? I'm sure they'd love to collaborate, and maybe they will even do most of the work? Who knows...

Chris
Re: [openstack-dev] [nova] Proposal for an Experiment
On 07/15/2015 08:18 AM, Ed Leafe wrote:

What I'd like to investigate is replacing the current design of having the compute nodes communicating with the scheduler via message queues. This design is overly complex and has several known scalability issues. My thought is to replace this with a Cassandra [1] backend. Compute nodes would update their state to Cassandra whenever they change, and that data would be read by the scheduler to make its host selection. When the scheduler chooses a host, it would post the claim to Cassandra wrapped in a lightweight transaction, which would ensure that no other scheduler has tried to claim those resources. When the host has built the requested VM, it will delete the claim and update Cassandra with its current state. One main motivation for using Cassandra over the current design is that it will enable us to run multiple schedulers without increasing the raciness of the system.

It seems to me that the ability to run multiple schedulers comes from the fact that you're talking about claiming resources in the data store, and not from anything inherent in Cassandra itself. Why couldn't we just update the existing nova scheduler to claim resources in the existing database in order to get the same reduction of raciness? (Thus allowing multiple schedulers running in parallel.)

Chris
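To make that suggestion concrete, here is a minimal sketch of a claim done as an atomic conditional UPDATE against the existing SQL database, assuming a DB-API driver that uses %s parameters (e.g. PyMySQL); the compute_nodes table and free_ram_mb column are stand-ins, not nova's actual schema:

    def try_claim(conn, host_id, ram_mb):
        """Atomically claim ram_mb on a host; return True if we won."""
        cur = conn.cursor()
        cur.execute(
            "UPDATE compute_nodes"
            "   SET free_ram_mb = free_ram_mb - %s"
            " WHERE id = %s AND free_ram_mb >= %s",
            (ram_mb, host_id, ram_mb))
        conn.commit()
        # rowcount == 1 means the claim succeeded; 0 means another
        # scheduler got there first or the host no longer has room.
        return cur.rowcount == 1

A claim only succeeds if the row still has enough headroom at the instant the UPDATE runs, which is the same no-lost-update property a Cassandra lightweight transaction is meant to provide.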
[openstack-dev] [nova] Proposal for an Experiment
Changing the architecture of a complex system such as Nova is never easy, even when we know that the design isn't working as well as we need it to. And it's even more frustrating because when the change is complete, it's hard to know if the improvement, if any, was worth it.

So I had an idea: what if we ran a test of that architecture change out-of-tree? In other words, create a separate deployment, and rip out the parts that don't work well, replacing them with an alternative design. There would be no Gerrit reviews or anything that would slow down the work or add load to the already overloaded reviewers. Then we could see if this modified system is a significant-enough improvement to justify investing the time in implementing it in-tree. And, of course, if the test doesn't show what was hoped for, it is scrapped and we start thinking anew.

The important part in this process is defining up front what level of improvement would be needed to make actually making such a change worth considering, and what sort of tests would demonstrate whether or not this level was met. I'd like to discuss such an experiment next week at the Nova mid-cycle.

What I'd like to investigate is replacing the current design of having the compute nodes communicating with the scheduler via message queues. This design is overly complex and has several known scalability issues. My thought is to replace this with a Cassandra [1] backend. Compute nodes would update their state to Cassandra whenever they change, and that data would be read by the scheduler to make its host selection. When the scheduler chooses a host, it would post the claim to Cassandra wrapped in a lightweight transaction, which would ensure that no other scheduler has tried to claim those resources. When the host has built the requested VM, it will delete the claim and update Cassandra with its current state.

One main motivation for using Cassandra over the current design is that it will enable us to run multiple schedulers without increasing the raciness of the system. Another is that it will greatly simplify a lot of the internal plumbing we've set up to implement in Nova what we would get out of the box with Cassandra. A third is that if this proves to be a success, it would also be able to be used further down the road to simplify inter-cell communication (but this is getting ahead of ourselves...).

I've worked with Cassandra before and it has been rock-solid to run and simple to set up. I've also had preliminary technical reviews with the engineers at DataStax [2], the company behind Cassandra, and they agreed that this was a good fit.

At this point I'm sure that most of you are filled with thoughts on how this won't work, or how much trouble it will be to switch, or how much more of a pain it will be, or how you hate non-relational DBs, or any of a zillion other negative thoughts. FWIW, I have them too. But instead of ranting, I would ask that we acknowledge for now that:

a) it will be disruptive and painful to switch something like this at this point in Nova's development
b) it would have to provide *significant* improvement to make such a change worthwhile

So what I'm asking from all of you is to help define the second part: what we would want improved, and how to measure those benefits. In other words, what results would you have to see in order to make you reconsider your initial "nah, this'll never work" reaction, and start to think that this will be a worthwhile change to make to Nova.

I'm also asking that you refrain from talking about why this can't work for now. I know it'll be difficult to do that, since nobody likes ranting about stuff more than I do, but right now it won't be helpful. There will be plenty of time for that later, assuming that this experiment yields anything worthwhile. Instead, think of the current pain points in the scheduler design, and what sort of improvement you would have to see in order to seriously consider undertaking this change to Nova.

I've gotten the OK from my management to pursue this, and several people in the community have expressed support for both the approach and the experiment, even though most don't have spare cycles to contribute. I'd love to have anyone who is interested become involved. I hope that this will be a positive discussion at the Nova mid-cycle next week. I know it will be a lively one. :)

[1] http://cassandra.apache.org/
[2] http://www.datastax.com/

--
Ed Leafe
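A minimal sketch of the claim-as-lightweight-transaction idea described above, using the DataStax Python driver; the keyspace, table, and column names are invented for illustration and are not part of any real nova schema:

    from cassandra.cluster import Cluster

    cluster = Cluster(['cassandra.example.net'])
    session = cluster.connect('scheduler')

    def try_claim(host_id, instance_uuid, ram_mb):
        """Attempt to claim resources on a host for one instance."""
        result = session.execute(
            "INSERT INTO claims (host_id, instance_uuid, ram_mb)"
            " VALUES (%s, %s, %s) IF NOT EXISTS",
            (host_id, instance_uuid, ram_mb))
        row = result[0]
        # The first column of a lightweight-transaction result is the
        # [applied] flag: True means this scheduler won the claim,
        # False means another scheduler already claimed it.
        return row[0]

Whichever scheduler loses the race simply picks another host and retries, which is what would let several schedulers run in parallel without double-booking a node.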
Re: [openstack-dev] [nova] Proposal for an Experiment
I do like experiments!

What about going even farther and trying to integrate somehow into mesos? https://mesos.apache.org/documentation/latest/mesos-architecture/

Replace the hadoop executor, MPI executor with a 'VM executor' and perhaps we could eliminate a large part of the scheduler code (just a thought)...

I think a bunch of other ideas were also written down @ https://review.openstack.org/#/c/191914/ maybe you can try some of those too :)

Ed Leafe wrote:

[full proposal snipped]
Re: [openstack-dev] [nova] Proposal for an Experiment
On 7/15/2015 9:18 AM, Ed Leafe wrote:

[full proposal snipped]

Without reading the whole thread, couldn't you just do a feature branch (but that would require reviews in gerrit which we don't want), or fork the repo in github and just hack on it there without gerrit? I'm sure many will say it's not cool to fork the repo, but that's essentially what you'd be doing anyway, so meh. I think you just have to have an understanding that whatever you work on in the fork won't necessarily be accepted back in the main repo.
Re: [openstack-dev] [nova] Proposal for an Experiment
On 07/15/2015 09:49 AM, Matt Riedemann wrote:

Without reading the whole thread, couldn't you just do a feature branch (but that would require reviews in gerrit which we don't want), or fork the repo in github and just hack on it there without gerrit? I'm sure many will say it's not cool to fork the repo, but that's essentially what you'd be doing anyway, so meh.

It will be a temporary fork, not anything designed to live on forever.

I think you just have to have an understanding that whatever you work on in the fork won't necessarily be accepted back in the main repo.

Yes, that's sort of the whole point. First prove that it can work; then and only then do we sit down and discuss the best way to implement it. The odds that any of the changes made would be able to be pulled directly into master would be slim.

--
Ed Leafe