Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
Excerpts from Zane Bitter's message of 2015-01-09 14:57:21 -0800:

On 08/01/15 05:39, Anant Patil wrote:

1. The stack was failing when there were single disjoint resources or just one resource in the template. The graph did not include this resource due to a minor bug in dependency_names(). I have added a test case and fix here: https://github.com/anantpatil/heat-convergence-prototype/commit/b58abd77cf596475ecf3f19ed38adf8ad3bb6b3b

Thanks, sorry about that! I will push a patch to fix it up.

2. The resource graph is created with keys in both forward-order traversal and reverse-order traversal, and the update will finish the forward order and attempt the reverse order. If this is the case, then the update-replaced resources will be deleted before the update is complete, and if the update fails, the old resource is not available for rollback; a new resource has to be created then. I have added a test case at the above-mentioned location. In our PoC, the updates (concurrent updates) won't remove an update-replaced resource until all the resources are updated and the resource clean-up phase is started.

Hmmm, this is a really interesting question actually. That's certainly not how Heat works at the moment; we've always assumed that rollback is best-effort at recovering the exact resources you had before. It would be great to have users weigh in on how they expect this to behave. I'm curious now what CloudFormation does. I'm reluctant to change it, though, because I'm pretty sure this is definitely *not* how you would want e.g. a rolling update of an autoscaling group to happen.

It is unacceptable to remove the old resource to be rolled back to, since it may have changes which the user doesn't want to lose;

If they didn't want to lose it they shouldn't have tried an update that would replace it. If an update causes a replacement or an interruption to service, then I consider the same fair game for the rollback - the user has already given us permission for that kind of change.
(Whether the user's consent was informed is a separate question, addressed by Ryan's update-preview work.)

In the original vision we had for using scaled groups to manage, say, nova-compute nodes, you definitely can't create new servers, so you can't just create all the new instances without de-allocating some. That said, that's why we are using in-place methods like rebuild. I think it would be acceptable to have cleanup run asynchronously, and to have rollback re-create anything that has already been cleaned up.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 08/01/15 05:39, Anant Patil wrote:

1. The stack was failing when there were single disjoint resources or just one resource in the template. The graph did not include this resource due to a minor bug in dependency_names(). I have added a test case and fix here: https://github.com/anantpatil/heat-convergence-prototype/commit/b58abd77cf596475ecf3f19ed38adf8ad3bb6b3b

Thanks, sorry about that! I will push a patch to fix it up.

2. The resource graph is created with keys in both forward-order traversal and reverse-order traversal, and the update will finish the forward order and attempt the reverse order. If this is the case, then the update-replaced resources will be deleted before the update is complete, and if the update fails, the old resource is not available for rollback; a new resource has to be created then. I have added a test case at the above-mentioned location. In our PoC, the updates (concurrent updates) won't remove an update-replaced resource until all the resources are updated and the resource clean-up phase is started.

Hmmm, this is a really interesting question actually. That's certainly not how Heat works at the moment; we've always assumed that rollback is best-effort at recovering the exact resources you had before. It would be great to have users weigh in on how they expect this to behave. I'm curious now what CloudFormation does. I'm reluctant to change it, though, because I'm pretty sure this is definitely *not* how you would want e.g. a rolling update of an autoscaling group to happen.

It is unacceptable to remove the old resource to be rolled back to, since it may have changes which the user doesn't want to lose;

If they didn't want to lose it they shouldn't have tried an update that would replace it. If an update causes a replacement or an interruption to service, then I consider the same fair game for the rollback - the user has already given us permission for that kind of change.
(Whether the user's consent was informed is a separate question, addressed by Ryan's update-preview work.)

and that's why probably they use the roll-back flag.

I don't think there's any basis for saying that. People use the rollback flag because they want the stack left in a consistent state even if an error occurs.

cheers, Zane.
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 08-Jan-15 16:09, Anant Patil wrote: On 16-Dec-14 09:41, Zane Bitter wrote: On 15/12/14 09:32, Anant Patil wrote: On 12-Dec-14 06:29, Zane Bitter wrote: On 11/12/14 01:14, Anant Patil wrote: On 04-Dec-14 10:49, Zane Bitter wrote: On 01/12/14 02:02, Anant Patil wrote:

On GitHub: https://github.com/anantpatil/heat-convergence-poc

I'm trying to review this code at the moment, and finding some stuff I don't understand: https://github.com/anantpatil/heat-convergence-poc/blob/master/heat/engine/stack.py#L911-L916

This appears to loop through all of the resources *prior* to kicking off any actual updates to check if the resource will change. This is impossible to do in general, since a resource may obtain a property value from an attribute of another resource, and there is no way to know whether an update to said other resource would cause a change in the attribute value. In addition, no attempt to catch UpdateReplace is made. Although that looks like a simple fix, I'm now worried about the level to which this code has been tested.

We were working on a new branch, and as we discussed on Skype, we have handled all these cases. Please have a look at our current branch: https://github.com/anantpatil/heat-convergence-poc/tree/graph-version

When a new resource is taken for convergence, its children are loaded and the resource definition is re-parsed. The frozen resource definition will have all the get_attr calls resolved.

I'm also trying to wrap my head around how resources are cleaned up in dependency order. If I understand correctly, you store in the ResourceGraph table the dependencies between various resource names in the current template (presumably there could also be some left around from previous templates too?). For each resource name there may be a number of rows in the Resource table, each with an incrementing version. As far as I can tell, though, there's nowhere that the dependency graph for _previous_ templates is persisted?
So if the dependency order changes in the template, we have no way of knowing the correct order to clean up in any more? (There's not even a mechanism to associate a resource version with a particular template, which might be one avenue by which to recover the dependencies.)

I think this is an important case we need to be able to handle, so I added a scenario to my test framework to exercise it and discovered that my implementation was also buggy. Here's the fix: https://github.com/zaneb/heat-convergence-prototype/commit/786f367210ca0acf9eb22bea78fd9d51941b0e40

Thanks for pointing this out, Zane. We too had a buggy implementation for handling inverted dependencies. I had a hard look at our algorithm, where we were continuously merging the edges from the new template into the edges from previous updates. It was an optimized way of traversing the graph in both forward and reverse order without missing any resources. But when the dependencies are inverted, this wouldn't work. We have changed our algorithm: the changes in edges are noted down in the DB, and only the delta of edges from the previous template is calculated and kept. At any given point in time, the graph table has all the edges from the current template plus the delta from previous templates. Each edge has a template ID associated with it.

The thing is, the cleanup dependencies aren't really about the template. The real resources really depend on other real resources. You can't delete a Volume before its VolumeAttachment, not because it says so in the template but because it will fail if you try. The template can give us a rough guide in advance to what those dependencies will be, but if that's all we keep then we are discarding information. There may be multiple versions of a resource corresponding to one template version. Even worse, the actual dependencies of a resource change on a smaller time scale than an entire stack update (this is the reason the current implementation updates the template one resource at a time as we go).
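To make the edge-merging idea above concrete, here is a minimal sketch in Python. It is not the PoC's actual code; the function names, the edge representation as (requirer, required) name pairs, and the per-edge template-ID tagging are all assumptions for illustration. It shows keeping current-template edges plus the leftover delta from older templates, and deriving a cleanup order (dependents deleted before their requirements, e.g. a VolumeAttachment before its Volume), assuming the merged graph is acyclic.

```python
from collections import defaultdict

def merge_edges(current_tmpl_id, current_edges, stored_edges):
    """Keep current-template edges plus the delta from older templates.

    current_edges: set of (requirer, required) name pairs from the new template.
    stored_edges: dict mapping (requirer, required) -> template ID.
    Returns a dict edge -> template ID.
    """
    merged = {}
    for edge, tmpl_id in stored_edges.items():
        if edge not in current_edges:
            merged[edge] = tmpl_id      # leftover from an older template
    for edge in current_edges:
        merged[edge] = current_tmpl_id  # current template wins
    return merged

def cleanup_order(edges):
    """Reverse-topological order over (requirer, required) pairs:
    delete each dependent before the resource it requires."""
    requires = defaultdict(set)
    nodes = set()
    for requirer, required in edges:
        requires[required].add(requirer)  # requirer must go first
        nodes.update((requirer, required))
    order, done = [], set()
    def visit(node):
        if node in done:
            return
        done.add(node)
        for dependent in requires[node]:
            visit(dependent)
        order.append(node)
    for node in sorted(nodes):
        visit(node)
    return order
```

For example, with only the edge ('attachment', 'volume'), cleanup_order returns the attachment before the volume, which is the real-world constraint the discussion describes.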
Absolutely! The edges from the template are kept only for reference purposes. When we have a resource in the new template, its template ID will also be marked as the current template. At any point in time, a realized resource will be from the current template, even if it was found in previous templates. The template ID moves for a resource if it is found.

In theory (disclaimer: I didn't implement this yet) it can change on an even smaller timescale than that. The existing plugins are something of a black box to us: if a failure occurs, we don't necessarily know whether the real-world dependency is on the old or the new version of another resource.

Yes, and that's why we roll back the failed resource and its dependent resources to older versions, provided that the older resources are not deleted until the update is done. It is easier with a template ID, as we know the previous complete template.

Given that our Resource entries in the DB are in 1:1
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
-Original Message- From: Zane Bitter [mailto:zbit...@redhat.com] Sent: Thursday, December 18, 2014 7:42 AM To: openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown

On 17/12/14 13:05, Gurjar, Unmesh wrote:

I'm storing a tuple of its name and database ID. The data structure is resource.GraphKey. I was originally using the name for something, but I suspect I could probably drop it now and just store the database ID; I haven't tried it yet. (Having the name in there definitely makes debugging more pleasant, though ;)

I agree, having the name might come in handy while debugging!

When I build the traversal graph, each node is a tuple of the GraphKey and a boolean to indicate whether it corresponds to an update or a cleanup operation (both can appear for a single resource in the same graph).

Just to confirm my understanding, the cleanup operation takes care of both: 1. resources which are deleted as a part of the update, and 2. the previous versioned resource which was updated by replacing it with a new resource (the UpdateReplace scenario).

Yes, correct. Also: 3. resource versions which failed to delete for whatever reason on a previous traversal.

Also, the cleanup operation is performed after the update completes successfully.

NO! They are not separate things! https://github.com/openstack/heat/blob/stable/juno/heat/engine/update.py#L177-L198

If I am correct, you are updating all resources on update regardless of whether they changed, which will be inefficient if a stack contains a million resources.

I'm calling update() on all resources regardless of change, but update() will only call handle_update() if something has changed (unless the plugin has overridden Resource._needs_update()). There's no way to know whether a resource needs to be updated before you're ready to update it, so I don't think of this as 'inefficient', just 'correct'.
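The node shape described above can be sketched in a few lines of Python. This is illustrative only (the helper function and the dict-based inputs are assumptions, not the prototype's real code); it just shows how a single graph can hold both an update node and a cleanup node for the same resource, distinguished by the boolean flag.

```python
from collections import namedtuple

# Matches the description: a GraphKey is a (name, database ID) pair.
GraphKey = namedtuple('GraphKey', ['name', 'db_id'])

def traversal_nodes(resources, needs_cleanup):
    """Build the set of traversal-graph nodes.

    resources: {resource name: database ID} for the current traversal.
    needs_cleanup: names that also have an old version left to clean up.
    Each node is (GraphKey, is_update); both kinds can coexist for one
    resource in the same graph.
    """
    nodes = set()
    for name, db_id in resources.items():
        nodes.add((GraphKey(name, db_id), True))       # update operation
        if name in needs_cleanup:
            nodes.add((GraphKey(name, db_id), False))  # cleanup operation
    return nodes
```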
We have similar questions regarding other areas in your implementation, which we believe will be resolved once we understand the outline of your implementation. It is difficult to get a hold on your approach just by looking at the code. Doc strings / an Etherpad will help.

About streams: yes, in a million-resource stack the data will be huge, but less than the template.

No way, it's O(n^3) (cubed!) in the worst case to store streams for each resource.

Also, this stream is stored only in IN_PROGRESS resources.

Now I'm really confused. Where does it come from if the resource doesn't get it until it's already in progress? And how will that information help it?

When an operation on the stack is initiated, the stream will be identified.

OK, this may be one of the things I was getting confused about - I thought a 'stream' belonged to one particular resource and just contained all of the paths to reaching that resource. But here it seems like you're saying that a 'stream' is a representation of the entire graph? So it's essentially just a gratuitously bloated NIH serialisation of the Dependencies graph?

To begin the operation, the action is initiated on the leaf (or root) resource(s) and the stream is stored (only) in this/these IN_PROGRESS resource(s).

How does that work? Does it get deleted again when the resource moves to COMPLETE?

Yes, IMO, upon resource completion the stream can be deleted. I do not foresee any situation wherein storing the stream is required, since when another operation is initiated on the stack, that template should be parsed and the new stream should be identified and used.

The stream should then keep getting passed to the next/previous level of resource(s) as and when the dependencies for the next/previous level of resource(s) are met.

That sounds... identical to the way it's implemented in my prototype (passing a serialisation of the graph down through the notification triggers), except for the part about storing it in the Resource table.
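The "serialise the graph once per traversal" alternative being contrasted with per-resource streams can be sketched as follows. This is a hedged illustration, not Heat's schema: the stack row is modelled as a plain dict and the column name 'current_traversal' is invented. The point is that the graph is written once when the operation starts, and a different engine can reload it to resume after a failure, instead of each resource persisting its own copy.

```python
import json

def start_traversal(stack_row, edges):
    """Persist the traversal graph exactly once, on the stack row,
    when the operation is being prepared.  edges is a set of
    (requirer, required) name pairs."""
    stack_row['current_traversal'] = json.dumps(sorted(edges))
    return stack_row

def resume_traversal(stack_row):
    """Another engine reloads the same graph to take over a stack
    operation, rather than re-deriving it from old templates."""
    return {tuple(e) for e in json.loads(stack_row['current_traversal'])}
```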
Why would we persist to the database data that we only need for the duration that we already have it in memory anyway?

Earlier we thought of passing it along while initiating the next level of resource(s). However, for the million-resource stack it will be quite large, and passing it around will be inefficient. So we intended to have it stored in the database. Also, it can be used for resuming a stack operation when the processing engine goes down and another engine has to resume that stack operation.

If we're going to persist it, we should do so once, in the Stack table, at the time that we're preparing to start the traversal.

The reason to have the entire dependency list is to reduce DB queries during a stack update.

But we never need to know that. We only need to know what just happened and what to do next.

As mentioned earlier, each level of resources in a graph passes on the dependency list/stream to their next/previous level
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
Excerpts from Anant Patil's message of 2014-12-16 07:36:58 -0800:

On 16-Dec-14 00:59, Clint Byrum wrote: Excerpts from Anant Patil's message of 2014-12-15 07:15:30 -0800: On 13-Dec-14 05:42, Zane Bitter wrote: On 12/12/14 05:29, Murugan, Visnusaran wrote: -Original Message- From: Zane Bitter [mailto:zbit...@redhat.com] Sent: Friday, December 12, 2014 6:37 AM To: openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On 11/12/14 08:26, Murugan, Visnusaran wrote:

[Murugan, Visnusaran] In case of rollback, where we have to clean up earlier versions of resources, we could get the order from the old template. We'd prefer not to have a graph table.

In theory you could get it by keeping old templates around. But that means keeping a lot of templates, and it will be hard to keep track of when you want to delete them. It also means that when starting an update you'll need to load every existing previous version of the template in order to calculate the dependencies. It also leaves the dependencies in an ambiguous state when a resource fails, and although that can be worked around, it will be a giant pain to implement.

Agree that looking at all templates for a delete is not good. But barring complexity, we feel we could achieve it by way of having an update stream and a delete stream for a stack update operation. I will elaborate in detail in the etherpad sometime tomorrow :)

I agree that I'd prefer not to have a graph table. After trying a couple of different things, I decided to store the dependencies in the Resource table, where we can read or write them virtually for free, because it turns out that we are always reading or updating the Resource itself at exactly the same time anyway.

Not sure how this will work in an update scenario when a resource does not change and its dependencies do.

We'll always update the requirements, even when the properties don't change.

Can you elaborate a bit on rollback.
I didn't do anything special to handle rollback. It's possible that we need to - obviously the difference in the UpdateReplace + rollback case is that the replaced resource is now the one we want to keep, and yet the replaced_by/replaces dependency will force the newer (replacement) resource to be checked for deletion first, which is an inversion of the usual order.

This is where the version is so handy! For UpdateReplaced ones, there is an older version to go back to. This version could just be the template ID, as I mentioned in another e-mail. All resources are at the current template ID if they are found in the current template, even if there is no need to update them. Otherwise, they need to be cleaned up in the order given in the previous templates. I think the template ID is used as the version, as far as I can see in Zane's PoC. If the resource's template key doesn't match the current template key, the resource is deleted. 'Version' is a misnomer here, but that field (template ID) is used as though we had versions of resources.

However, I tried to think of a scenario where that would cause problems and I couldn't come up with one. Provided we know the actual, real-world dependencies of each resource, I don't think the ordering of those two checks matters. In fact, I currently can't think of a case where the dependency order between replacement and replaced resources matters at all. It matters in the current Heat implementation because resources are artificially segmented into the current and backup stacks, but with a holistic view of dependencies that may well not be required. I tried taking that line out of the simulator code and all the tests still passed. If anybody can think of a scenario in which it would make a difference, I would be very interested to hear it.

In any event, though, it should be no problem to reverse the direction of that one edge in these particular circumstances if it does turn out to be a problem.
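The "template ID as version" rule described above reduces to two small checks, sketched here with assumed field names (a resource modelled as a dict with a 'template_id' key; this is not the PoC's real data model): a resource whose template ID does not match the current traversal's is a leftover to clean up, and a resource still present in the current template has its template ID moved forward even when nothing else changed.

```python
def needs_cleanup(resource, current_template_id):
    """A resource whose template key doesn't match the current
    template key is a candidate for deletion."""
    return resource['template_id'] != current_template_id

def advance_template_id(resource, current_template_id, in_current_template):
    """If the resource is still in the current template, move its
    template ID forward even if no property changed."""
    if in_current_template:
        resource['template_id'] = current_template_id
    return resource
```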
We had an approach with depends_on and needed_by columns in the Resource table. But we dropped it when we figured out we had too many DB operations for an update.

Yeah, I initially ran into this problem too - you have a bunch of nodes that are waiting on the current node, and now you have to go look them all up in the database to see what else they're waiting on in order to tell if they're ready to be triggered.

It turns out the answer is to distribute the writes but centralise the reads. So at the start of the update, we read all of the Resources, obtain their dependencies and build one central graph[1]. We then make that graph available to each resource (either by passing it as a notification parameter, or storing it somewhere central in the DB that they will all have to read anyway, i.e. the Stack). But when we update a dependency we
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 15/12/14 07:47, Murugan, Visnusaran wrote:

We have similar questions regarding other areas in your implementation, which we believe will be resolved once we understand the outline of your implementation. It is difficult to get a hold on your approach just by looking at the code. Doc strings / an Etherpad will help.

I added a bunch of extra docstrings and comments: https://github.com/zaneb/heat-convergence-prototype/commit/5d79e009196dc224bd588e19edef5f0939b04607

I also implemented a --pdb option that will automatically set breakpoints at the beginning of all of the asynchronous events, so that you'll be dropped into the debugger and can single-step through the code, look at variables and so on: https://github.com/zaneb/heat-convergence-prototype/commit/2a7a56dde21cad979fae25acc9fb01c6b4d9c6f7

I hope that helps. If you have more questions, please feel free to ask.

cheers, Zane.
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
-Original Message- From: Zane Bitter [mailto:zbit...@redhat.com] Sent: Tuesday, December 16, 2014 9:43 AM To: openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown

On 15/12/14 07:47, Murugan, Visnusaran wrote:

Hi Zane, We have been going through this chain for quite some time now and we still feel a disconnect in our understanding.

Yes, I thought last week that we were on the same page, but now it looks like we're drifting off again :(

Can you put up an etherpad where we can understand your approach?

Maybe you could put up an etherpad with your questions. Practically all of the relevant code is in Stack._create_or_update, Stack._dependencies and Converger.check_resource. That's 134 lines of code by my count. There's not a lot more I can usefully say about it without knowing which parts exactly you're stuck on, but I can definitely answer specific questions. For example:

For storing resource dependencies, are you storing its (name, version) tuple or just its ID?

I'm storing a tuple of its name and database ID. The data structure is resource.GraphKey. I was originally using the name for something, but I suspect I could probably drop it now and just store the database ID; I haven't tried it yet. (Having the name in there definitely makes debugging more pleasant, though ;)

I agree, having the name might come in handy while debugging!

When I build the traversal graph, each node is a tuple of the GraphKey and a boolean to indicate whether it corresponds to an update or a cleanup operation (both can appear for a single resource in the same graph).

Just to confirm my understanding, the cleanup operation takes care of both: 1. resources which are deleted as a part of the update, and 2. the previous versioned resource which was updated by replacing it with a new resource (the UpdateReplace scenario). Also, the cleanup operation is performed after the update completes successfully.
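Since both an update node and a cleanup node for one resource live in the same graph, the edges between them determine when cleanup can begin. A hedged sketch (function name, inputs, and the simplified dependency rule are all assumptions, not the prototype's actual logic): cleanup of a leftover resource is ordered after the updates of everything that used to depend on it, so it can start as soon as those complete rather than in a separate post-update phase.

```python
def combined_edges(template_edges, leftovers):
    """Build edges for a single graph holding updates and cleanups.

    template_edges: (requirer, required) name pairs from the new template.
    leftovers: {name: [old dependent names]} for resources that exist
    only in the previous template.  Nodes are (name, is_update) tuples.
    """
    edges = []
    for requirer, required in template_edges:
        # update of the requirer waits on update of the required resource
        edges.append(((requirer, True), (required, True)))
    for name, old_dependents in leftovers.items():
        for dep in old_dependents:
            # cleanup of a leftover waits on its former dependents' updates
            edges.append(((name, False), (dep, True)))
    return edges
```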
If I am correct, you are updating all resources on update regardless of whether they changed, which will be inefficient if a stack contains a million resources.

I'm calling update() on all resources regardless of change, but update() will only call handle_update() if something has changed (unless the plugin has overridden Resource._needs_update()). There's no way to know whether a resource needs to be updated before you're ready to update it, so I don't think of this as 'inefficient', just 'correct'.

We have similar questions regarding other areas in your implementation, which we believe will be resolved once we understand the outline of your implementation. It is difficult to get a hold on your approach just by looking at the code. Doc strings / an Etherpad will help.

About streams: yes, in a million-resource stack the data will be huge, but less than the template.

No way, it's O(n^3) (cubed!) in the worst case to store streams for each resource.

Also, this stream is stored only in IN_PROGRESS resources.

Now I'm really confused. Where does it come from if the resource doesn't get it until it's already in progress? And how will that information help it?

When an operation on the stack is initiated, the stream will be identified. To begin the operation, the action is initiated on the leaf (or root) resource(s) and the stream is stored (only) in this/these IN_PROGRESS resource(s). The stream should then keep getting passed to the next/previous level of resource(s) as and when the dependencies for the next/previous level of resource(s) are met.

The reason to have the entire dependency list is to reduce DB queries during a stack update.

But we never need to know that. We only need to know what just happened and what to do next.

As mentioned earlier, each level of resources in a graph passes on the dependency list/stream to their next/previous level of resources. This information should further be used to determine what is to be done next and drive the operation to completion.
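The update()/handle_update() gating described above can be shown in a few lines. This is a toy stand-in, not heat.engine.resource.Resource: the class, its dict-based definitions, and the 'updated' flag are illustrative assumptions. The point is that calling update() on every resource is cheap, because only resources whose definition actually changed (per _needs_update(), which plugins may override) reach handle_update().

```python
class Resource:
    def __init__(self, definition):
        self.definition = definition
        self.updated = False  # stand-in for "a real API call happened"

    def _needs_update(self, new_definition):
        # default check; plugins may override to force/suppress updates
        return new_definition != self.definition

    def update(self, new_definition):
        # always called, for every resource in the traversal
        if self._needs_update(new_definition):
            self.handle_update(new_definition)
        self.definition = new_definition

    def handle_update(self, new_definition):
        self.updated = True  # only reached when something changed
```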
When you have a singular dependency on each resource, similar to your implementation, then we will end up loading dependencies one at a time and altering almost all resources' dependencies regardless of their change.

Regarding the two-template approach for delete: it is not actually 2 different templates. It's just that we have a delete stream to be taken up post-update.

That would be a regression from Heat's current behaviour, where we start cleaning up resources as soon as they have nothing depending on them. There's not even a reason to make it worse than what we already have, because it's actually a lot _easier_ to treat update and clean-up as the same kind of operation and throw both into the same big graph. The dual implementations and all of the edge cases go away, and you can just trust in the graph traversal to do the Right Thing in the most parallel way possible.

(Any post operation will be handled as an update) This approach is True
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 17/12/14 13:05, Gurjar, Unmesh wrote:

I'm storing a tuple of its name and database ID. The data structure is resource.GraphKey. I was originally using the name for something, but I suspect I could probably drop it now and just store the database ID; I haven't tried it yet. (Having the name in there definitely makes debugging more pleasant, though ;)

I agree, having the name might come in handy while debugging!

When I build the traversal graph, each node is a tuple of the GraphKey and a boolean to indicate whether it corresponds to an update or a cleanup operation (both can appear for a single resource in the same graph).

Just to confirm my understanding, the cleanup operation takes care of both: 1. resources which are deleted as a part of the update, and 2. the previous versioned resource which was updated by replacing it with a new resource (the UpdateReplace scenario).

Yes, correct. Also: 3. resource versions which failed to delete for whatever reason on a previous traversal.

Also, the cleanup operation is performed after the update completes successfully.

NO! They are not separate things! https://github.com/openstack/heat/blob/stable/juno/heat/engine/update.py#L177-L198

If I am correct, you are updating all resources on update regardless of whether they changed, which will be inefficient if a stack contains a million resources.

I'm calling update() on all resources regardless of change, but update() will only call handle_update() if something has changed (unless the plugin has overridden Resource._needs_update()). There's no way to know whether a resource needs to be updated before you're ready to update it, so I don't think of this as 'inefficient', just 'correct'.

We have similar questions regarding other areas in your implementation, which we believe will be resolved once we understand the outline of your implementation. It is difficult to get a hold on your approach just by looking at the code. Doc strings / an Etherpad will help.
About streams: yes, in a million-resource stack the data will be huge, but less than the template. No way, it's O(n^3) (cubed!) in the worst case to store streams for each resource. Also, this stream is stored only in IN_PROGRESS resources. Now I'm really confused. Where does it come from if the resource doesn't get it until it's already in progress? And how will that information help it? When an operation on a stack is initiated, the stream will be identified. OK, this may be one of the things I was getting confused about - I thought a 'stream' belonged to one particular resource and just contained all of the paths to reaching that resource. But here it seems like you're saying that a 'stream' is a representation of the entire graph? So it's essentially just a gratuitously bloated NIH serialisation of the Dependencies graph? To begin the operation, the action is initiated on the leaf (or root) resource(s) and the stream is stored (only) in this/these IN_PROGRESS resource(s). How does that work? Does it get deleted again when the resource moves to COMPLETE? The stream should then keep getting passed to the next/previous level of resource(s) as and when the dependencies for the next/previous level of resource(s) are met. That sounds... identical to the way it's implemented in my prototype (passing a serialisation of the graph down through the notification triggers), except for the part about storing it in the Resource table. Why would we persist to the database data that we only need for the duration that we already have it in memory anyway? If we're going to persist it we should do so once, in the Stack table, at the time that we're preparing to start the traversal. The reason to have the entire dependency list is to reduce DB queries during a stack update. But we never need to know that. We only need to know what just happened and what to do next. As mentioned earlier, each level of resources in a graph passes on the dependency list/stream to the next/previous level of resources. 
This information should further be used to determine what is to be done next and drive the operation to completion. Why would we store *and* forward? When you have a singular dependency on each resource, similar to your implementation, then we will end up loading Dependencies one at a time and altering almost every resource's dependency regardless of their change. Regarding a 2-template approach for delete, it is not actually 2 different templates. It's just that we have a delete stream to be taken up post-update. That would be a regression from Heat's current behaviour, where we start cleaning up resources as soon as they have nothing depending on them. There's not even a reason to make it worse than what we already have, because it's actually a lot _easier_ to treat update and clean up as the same kind of operation and throw both into the same big graph. The dual implementations and all of the edge cases go away and you can just trust in the graph traversal to do the Right Thing in the most parallel way possible.
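The "one big graph" idea can be sketched as follows (hypothetical code, not from either PoC): each resource contributes an update node and a cleanup node to a single graph, and one topological traversal then starts each cleanup as soon as nothing depends on that resource.

```python
from collections import defaultdict

# Hypothetical sketch of treating update and cleanup as nodes in one
# graph. deps maps each resource to the set of resources it requires,
# as parsed from the template.
def traversal_order(deps):
    graph = defaultdict(set)  # node -> set of nodes it must wait for
    for res, reqs in deps.items():
        upd, cln = (res, True), (res, False)
        graph[upd]            # make sure the update node exists
        graph[cln].add(upd)   # a resource's cleanup follows its update
        for req in reqs:
            graph[upd].add((req, True))   # update after requirements
            graph[(req, False)].add(cln)  # requirement cleaned up after us

    # Kahn-style topological sort: cleanups interleave with updates in
    # one pass, with no separate cleanup phase.
    order, ready = [], [n for n, waits in graph.items() if not waits]
    while ready:
        node = ready.pop()
        order.append(node)
        for other, waits in graph.items():
            if node in waits:
                waits.discard(node)
                if not waits:
                    ready.append(other)
    return order

# A Volume and its VolumeAttachment: the attachment is updated after
# the volume, and cleaned up before it.
order = traversal_order({'vol': set(), 'attach': {'vol'}})
```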
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 16-Dec-14 00:59, Clint Byrum wrote: Excerpts from Anant Patil's message of 2014-12-15 07:15:30 -0800: On 13-Dec-14 05:42, Zane Bitter wrote: On 12/12/14 05:29, Murugan, Visnusaran wrote: -Original Message- From: Zane Bitter [mailto:zbit...@redhat.com] Sent: Friday, December 12, 2014 6:37 AM To: openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On 11/12/14 08:26, Murugan, Visnusaran wrote: [Murugan, Visnusaran] In case of rollback where we have to clean up earlier versions of resources, we could get the order from the old template. We'd prefer not to have a graph table. In theory you could get it by keeping old templates around. But that means keeping a lot of templates, and it will be hard to keep track of when you want to delete them. It also means that when starting an update you'll need to load every existing previous version of the template in order to calculate the dependencies. It also leaves the dependencies in an ambiguous state when a resource fails, and although that can be worked around it will be a giant pain to implement. Agree that looking at all templates for a delete is not good. But barring complexity, we feel we could achieve it by way of having an update and a delete stream for a stack update operation. I will elaborate in detail in the etherpad sometime tomorrow :) I agree that I'd prefer not to have a graph table. After trying a couple of different things I decided to store the dependencies in the Resource table, where we can read or write them virtually for free because it turns out that we are always reading or updating the Resource itself at exactly the same time anyway. Not sure how this will work in an update scenario when a resource does not change and its dependencies do. We'll always update the requirements, even when the properties don't change. Can you elaborate a bit on rollback? I didn't do anything special to handle rollback. 
It's possible that we need to - obviously the difference in the UpdateReplace + rollback case is that the replaced resource is now the one we want to keep, and yet the replaced_by/replaces dependency will force the newer (replacement) resource to be checked for deletion first, which is an inversion of the usual order. This is where the version is so handy! For UpdateReplaced ones, there is an older version to go back to. This version could just be the template ID, as I mentioned in another e-mail. All resources are at the current template ID if they are found in the current template, even if there is no need to update them. Otherwise, they need to be cleaned up in the order given in the previous templates. I think the template ID is used as the version as far as I can see in Zane's PoC. If the resource template key doesn't match the current template key, the resource is deleted. The version is a misnomer here, but that field (template ID) is used as though we had versions of resources. However, I tried to think of a scenario where that would cause problems and I couldn't come up with one. Provided we know the actual, real-world dependencies of each resource I don't think the ordering of those two checks matters. In fact, I currently can't think of a case where the dependency order between replacement and replaced resources matters at all. It matters in the current Heat implementation because resources are artificially segmented into the current and backup stacks, but with a holistic view of dependencies that may well not be required. I tried taking that line out of the simulator code and all the tests still passed. If anybody can think of a scenario in which it would make a difference, I would be very interested to hear it. In any event though, it should be no problem to reverse the direction of that one edge in these particular circumstances if it does turn out to be a problem. We had an approach with depends_on and needed_by columns in the Resource table. 
But dropped it when we figured out we had too many DB operations for Update. Yeah, I initially ran into this problem too - you have a bunch of nodes that are waiting on the current node, and now you have to go look them all up in the database to see what else they're waiting on in order to tell if they're ready to be triggered. It turns out the answer is to distribute the writes but centralise the reads. So at the start of the update, we read all of the Resources, obtain their dependencies and build one central graph[1]. We then make that graph available to each resource (either by passing it as a notification parameter, or storing it somewhere central in the DB that they will all have to read anyway, i.e. the Stack). But when we update a dependency we don't update the central graph, we update the individual Resource so there's no global lock required. [1] https://github.com/zaneb/heat-convergence-prototype/blob
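A rough sketch of "distribute the writes but centralise the reads" (all names here are hypothetical; the real prototype passes the graph through asynchronous notifications rather than an in-process queue):

```python
# Sketch: the dependency graph is built once at the start of the
# traversal and travels with every notification; each worker writes
# only its own Resource row, so no global lock is needed.
pending = []   # stand-in for the async notification queue
db = {}        # stand-in for per-resource rows in the Resource table

def notify(name, graph):
    pending.append((name, graph))

def start_update(resources):
    # Centralised read: one pass over all resources builds the graph.
    graph = {r['name']: frozenset(r['requires']) for r in resources}
    for name, reqs in graph.items():
        if not reqs:                 # leaves can start immediately
            notify(name, graph)
    return graph

def check_resource(name, graph):
    # Distributed write: touch only this resource's own row.
    db[name] = 'COMPLETE'
    # Trigger any resource whose requirements are now all complete.
    for other, reqs in graph.items():
        if name in reqs and all(db.get(r) == 'COMPLETE' for r in reqs):
            notify(other, graph)

start_update([{'name': 'net', 'requires': []},
              {'name': 'port', 'requires': ['net']},
              {'name': 'server', 'requires': ['port']}])
while pending:
    name, graph = pending.pop(0)
    if db.get(name) != 'COMPLETE':
        check_resource(name, graph)
```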
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
Hi Zane, We have been going through this chain for quite some time now and we still feel a disconnect in our understanding. Can you put up an etherpad where we can understand your approach? For example: for storing resource dependencies, are you storing its name, version tuple or just its ID? If I am correct, you are updating all resources on update regardless of their change, which will be inefficient if a stack contains a million resources. We have similar questions regarding other areas in your implementation, which we believe will be resolved once we understand the outline of your implementation. It is difficult to get a hold on your approach just by looking at code. Docstrings / an etherpad will help. About streams: yes, in a million-resource stack the data will be huge, but less than the template. Also, this stream is stored only in IN_PROGRESS resources. The reason to have the entire dependency list is to reduce DB queries during a stack update. When you have a singular dependency on each resource, similar to your implementation, then we will end up loading Dependencies one at a time and altering almost every resource's dependency regardless of their change. Regarding a 2-template approach for delete, it is not actually 2 different templates. It's just that we have a delete stream to be taken up post-update. (Any post operation will be handled as an update.) This approach is true when Rollback==True. We can always fall back to the regular stream (non-delete stream) if Rollback==False. In our view we would like to have only one basic operation and that is UPDATE. 1. CREATE will be an update where the realized graph == Empty. 2. UPDATE will be an update where the realized graph == Full/Partial realized (possibly with a delete stream as a post operation if Rollback==True). 3. DELETE will be just another update with an empty to_be_realized_graph. It would be great if we can freeze a stable approach by mid-week as Christmas vacations are round the corner. 
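The "only one basic operation" view above could be sketched as a diff between the realized and target resource sets (illustrative only; the stream mechanics are omitted):

```python
# Hypothetical sketch of the "everything is an UPDATE" idea: CREATE is
# an update from an empty realized graph, DELETE is an update toward an
# empty target graph, and a plain UPDATE is the general case.
def plan(realized, target):
    """Both arguments are sets of resource names."""
    return {
        'create_or_update': sorted(target),     # converge on the target
        'clean_up': sorted(realized - target),  # left-overs to remove
    }

create = plan(set(), {'a', 'b'})        # CREATE: realized graph empty
update = plan({'a', 'b'}, {'b', 'c'})   # UPDATE: partial overlap
delete = plan({'a', 'b'}, set())        # DELETE: target graph empty
```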
:) :) -Original Message- From: Zane Bitter [mailto:zbit...@redhat.com] Sent: Saturday, December 13, 2014 5:43 AM To: openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On 12/12/14 05:29, Murugan, Visnusaran wrote: -Original Message- From: Zane Bitter [mailto:zbit...@redhat.com] Sent: Friday, December 12, 2014 6:37 AM To: openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On 11/12/14 08:26, Murugan, Visnusaran wrote: [Murugan, Visnusaran] In case of rollback where we have to clean up earlier versions of resources, we could get the order from the old template. We'd prefer not to have a graph table. In theory you could get it by keeping old templates around. But that means keeping a lot of templates, and it will be hard to keep track of when you want to delete them. It also means that when starting an update you'll need to load every existing previous version of the template in order to calculate the dependencies. It also leaves the dependencies in an ambiguous state when a resource fails, and although that can be worked around it will be a giant pain to implement. Agree that looking at all templates for a delete is not good. But barring complexity, we feel we could achieve it by way of having an update and a delete stream for a stack update operation. I will elaborate in detail in the etherpad sometime tomorrow :) I agree that I'd prefer not to have a graph table. After trying a couple of different things I decided to store the dependencies in the Resource table, where we can read or write them virtually for free because it turns out that we are always reading or updating the Resource itself at exactly the same time anyway. Not sure how this will work in an update scenario when a resource does not change and its dependencies do. We'll always update the requirements, even when the properties don't change. Can you elaborate a bit on rollback? 
I didn't do anything special to handle rollback. It's possible that we need to - obviously the difference in the UpdateReplace + rollback case is that the replaced resource is now the one we want to keep, and yet the replaced_by/replaces dependency will force the newer (replacement) resource to be checked for deletion first, which is an inversion of the usual order. However, I tried to think of a scenario where that would cause problems and I couldn't come up with one. Provided we know the actual, real-world dependencies of each resource I don't think the ordering of those two checks matters. In fact, I currently can't think of a case where the dependency order between replacement and replaced resources matters at all. It matters in the current Heat implementation because resources are artificially segmented into the current and backup stacks, but with a holistic view of dependencies that may well not be required.
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 12-Dec-14 06:29, Zane Bitter wrote: On 11/12/14 01:14, Anant Patil wrote: On 04-Dec-14 10:49, Zane Bitter wrote: On 01/12/14 02:02, Anant Patil wrote: On GitHub:https://github.com/anantpatil/heat-convergence-poc I'm trying to review this code at the moment, and finding some stuff I don't understand: https://github.com/anantpatil/heat-convergence-poc/blob/master/heat/engine/stack.py#L911-L916 This appears to loop through all of the resources *prior* to kicking off any actual updates to check if the resource will change. This is impossible to do in general, since a resource may obtain a property value from an attribute of another resource and there is no way to know whether an update to said other resource would cause a change in the attribute value. In addition, no attempt to catch UpdateReplace is made. Although that looks like a simple fix, I'm now worried about the level to which this code has been tested. We were working on a new branch and, as we discussed on Skype, we have handled all these cases. Please have a look at our current branch: https://github.com/anantpatil/heat-convergence-poc/tree/graph-version When a new resource is taken for convergence, its children are loaded and the resource definition is re-parsed. The frozen resource definition will have all the get_attr references resolved. I'm also trying to wrap my head around how resources are cleaned up in dependency order. If I understand correctly, you store in the ResourceGraph table the dependencies between various resource names in the current template (presumably there could also be some left around from previous templates too?). For each resource name there may be a number of rows in the Resource table, each with an incrementing version. As far as I can tell though, there's nowhere that the dependency graph for _previous_ templates is persisted? So if the dependency order changes in the template we have no way of knowing the correct order to clean up in any more? 
(There's not even a mechanism to associate a resource version with a particular template, which might be one avenue by which to recover the dependencies.) I think this is an important case we need to be able to handle, so I added a scenario to my test framework to exercise it and discovered that my implementation was also buggy. Here's the fix: https://github.com/zaneb/heat-convergence-prototype/commit/786f367210ca0acf9eb22bea78fd9d51941b0e40 Thanks for pointing this out Zane. We too had a buggy implementation for handling inverted dependency. I had a hard look at our algorithm where we were continuously merging the edges from the new template into the edges from previous updates. It was an optimized way of traversing the graph in both forward and reverse order without missing any resources. But, when the dependencies are inverted, this wouldn't work. We have changed our algorithm. The changes in edges are noted down in the DB, and only the delta of edges from the previous template is calculated and kept. At any given point in time, the graph table has all the edges from the current template and the delta from previous templates. Each edge has a template ID associated with it. The thing is, the cleanup dependencies aren't really about the template. The real resources really depend on other real resources. You can't delete a Volume before its VolumeAttachment, not because it says so in the template but because it will fail if you try. The template can give us a rough guide in advance to what those dependencies will be, but if that's all we keep then we are discarding information. There may be multiple versions of a resource corresponding to one template version. Even worse, the actual dependencies of a resource change on a smaller time scale than an entire stack update (this is the reason the current implementation updates the template one resource at a time as we go). Absolutely! The edges from the template are kept only for reference purposes. 
When we have a resource in the new template, its template ID will also be set to the current template. At any point in time, a realized resource will be from the current template, even if it were found in previous templates. The template ID moves for a resource if it is found. Given that our Resource entries in the DB are in 1:1 correspondence with actual resources (we create a new one whenever we need to replace the underlying resource), I found it makes the most conceptual and practical sense to store the requirements in the resource itself, and update them at the time they actually change in the real world (bonus: introduces no new locking issues and no extra DB writes). I settled on this after a legitimate attempt at trying other options, but they didn't work out: https://github.com/zaneb/heat-convergence-prototype/commit/a62958342e8583f74e2aca90f6239ad457ba984d I am okay with the notion of the graph stored in the resource table. For resource clean up, we start from the first
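The template-key comparison described in this exchange might be sketched as follows (field names are assumptions for illustration, not the PoC's schema):

```python
# Sketch of the template-ID-as-version check: a resource whose stored
# template ID no longer matches the current traversal's template ID is
# a candidate for cleanup, in dependency order.
def needs_cleanup(resource, current_template_id):
    return resource['template_id'] != current_template_id

old_server = {'name': 'server', 'template_id': 5}  # replaced last update
new_server = {'name': 'server', 'template_id': 7}  # realized this update
stale = [r for r in (old_server, new_server) if needs_cleanup(r, 7)]
```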
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 13-Dec-14 05:42, Zane Bitter wrote: On 12/12/14 05:29, Murugan, Visnusaran wrote: -Original Message- From: Zane Bitter [mailto:zbit...@redhat.com] Sent: Friday, December 12, 2014 6:37 AM To: openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On 11/12/14 08:26, Murugan, Visnusaran wrote: [Murugan, Visnusaran] In case of rollback where we have to clean up earlier versions of resources, we could get the order from the old template. We'd prefer not to have a graph table. In theory you could get it by keeping old templates around. But that means keeping a lot of templates, and it will be hard to keep track of when you want to delete them. It also means that when starting an update you'll need to load every existing previous version of the template in order to calculate the dependencies. It also leaves the dependencies in an ambiguous state when a resource fails, and although that can be worked around it will be a giant pain to implement. Agree that looking at all templates for a delete is not good. But barring complexity, we feel we could achieve it by way of having an update and a delete stream for a stack update operation. I will elaborate in detail in the etherpad sometime tomorrow :) I agree that I'd prefer not to have a graph table. After trying a couple of different things I decided to store the dependencies in the Resource table, where we can read or write them virtually for free because it turns out that we are always reading or updating the Resource itself at exactly the same time anyway. Not sure how this will work in an update scenario when a resource does not change and its dependencies do. We'll always update the requirements, even when the properties don't change. Can you elaborate a bit on rollback? I didn't do anything special to handle rollback. 
It's possible that we need to - obviously the difference in the UpdateReplace + rollback case is that the replaced resource is now the one we want to keep, and yet the replaced_by/replaces dependency will force the newer (replacement) resource to be checked for deletion first, which is an inversion of the usual order. This is where the version is so handy! For UpdateReplaced ones, there is an older version to go back to. This version could just be the template ID, as I mentioned in another e-mail. All resources are at the current template ID if they are found in the current template, even if there is no need to update them. Otherwise, they need to be cleaned up in the order given in the previous templates. I think the template ID is used as the version as far as I can see in Zane's PoC. If the resource template key doesn't match the current template key, the resource is deleted. The version is a misnomer here, but that field (template ID) is used as though we had versions of resources. However, I tried to think of a scenario where that would cause problems and I couldn't come up with one. Provided we know the actual, real-world dependencies of each resource I don't think the ordering of those two checks matters. In fact, I currently can't think of a case where the dependency order between replacement and replaced resources matters at all. It matters in the current Heat implementation because resources are artificially segmented into the current and backup stacks, but with a holistic view of dependencies that may well not be required. I tried taking that line out of the simulator code and all the tests still passed. If anybody can think of a scenario in which it would make a difference, I would be very interested to hear it. In any event though, it should be no problem to reverse the direction of that one edge in these particular circumstances if it does turn out to be a problem. We had an approach with depends_on and needed_by columns in the Resource table. 
But dropped it when we figured out we had too many DB operations for Update. Yeah, I initially ran into this problem too - you have a bunch of nodes that are waiting on the current node, and now you have to go look them all up in the database to see what else they're waiting on in order to tell if they're ready to be triggered. It turns out the answer is to distribute the writes but centralise the reads. So at the start of the update, we read all of the Resources, obtain their dependencies and build one central graph[1]. We then make that graph available to each resource (either by passing it as a notification parameter, or storing it somewhere central in the DB that they will all have to read anyway, i.e. the Stack). But when we update a dependency we don't update the central graph, we update the individual Resource so there's no global lock required. [1] https://github.com/zaneb/heat-convergence-prototype/blob/distributed-graph/converge/stack.py#L166-L168 A centralized graph and decision making will make
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
Excerpts from Anant Patil's message of 2014-12-15 07:15:30 -0800: On 13-Dec-14 05:42, Zane Bitter wrote: On 12/12/14 05:29, Murugan, Visnusaran wrote: -Original Message- From: Zane Bitter [mailto:zbit...@redhat.com] Sent: Friday, December 12, 2014 6:37 AM To: openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On 11/12/14 08:26, Murugan, Visnusaran wrote: [Murugan, Visnusaran] In case of rollback where we have to clean up earlier versions of resources, we could get the order from the old template. We'd prefer not to have a graph table. In theory you could get it by keeping old templates around. But that means keeping a lot of templates, and it will be hard to keep track of when you want to delete them. It also means that when starting an update you'll need to load every existing previous version of the template in order to calculate the dependencies. It also leaves the dependencies in an ambiguous state when a resource fails, and although that can be worked around it will be a giant pain to implement. Agree that looking at all templates for a delete is not good. But barring complexity, we feel we could achieve it by way of having an update and a delete stream for a stack update operation. I will elaborate in detail in the etherpad sometime tomorrow :) I agree that I'd prefer not to have a graph table. After trying a couple of different things I decided to store the dependencies in the Resource table, where we can read or write them virtually for free because it turns out that we are always reading or updating the Resource itself at exactly the same time anyway. Not sure how this will work in an update scenario when a resource does not change and its dependencies do. We'll always update the requirements, even when the properties don't change. Can you elaborate a bit on rollback? I didn't do anything special to handle rollback. 
It's possible that we need to - obviously the difference in the UpdateReplace + rollback case is that the replaced resource is now the one we want to keep, and yet the replaced_by/replaces dependency will force the newer (replacement) resource to be checked for deletion first, which is an inversion of the usual order. This is where the version is so handy! For UpdateReplaced ones, there is an older version to go back to. This version could just be the template ID, as I mentioned in another e-mail. All resources are at the current template ID if they are found in the current template, even if there is no need to update them. Otherwise, they need to be cleaned up in the order given in the previous templates. I think the template ID is used as the version as far as I can see in Zane's PoC. If the resource template key doesn't match the current template key, the resource is deleted. The version is a misnomer here, but that field (template ID) is used as though we had versions of resources. However, I tried to think of a scenario where that would cause problems and I couldn't come up with one. Provided we know the actual, real-world dependencies of each resource I don't think the ordering of those two checks matters. In fact, I currently can't think of a case where the dependency order between replacement and replaced resources matters at all. It matters in the current Heat implementation because resources are artificially segmented into the current and backup stacks, but with a holistic view of dependencies that may well not be required. I tried taking that line out of the simulator code and all the tests still passed. If anybody can think of a scenario in which it would make a difference, I would be very interested to hear it. In any event though, it should be no problem to reverse the direction of that one edge in these particular circumstances if it does turn out to be a problem. We had an approach with depends_on and needed_by columns in the Resource table. 
But dropped it when we figured out we had too many DB operations for Update. Yeah, I initially ran into this problem too - you have a bunch of nodes that are waiting on the current node, and now you have to go look them all up in the database to see what else they're waiting on in order to tell if they're ready to be triggered. It turns out the answer is to distribute the writes but centralise the reads. So at the start of the update, we read all of the Resources, obtain their dependencies and build one central graph[1]. We then make that graph available to each resource (either by passing it as a notification parameter, or storing it somewhere central in the DB that they will all have to read anyway, i.e. the Stack). But when we update a dependency we don't update the central graph, we update the individual Resource so there's no global lock required. [1
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 15/12/14 09:32, Anant Patil wrote: On 12-Dec-14 06:29, Zane Bitter wrote: On 11/12/14 01:14, Anant Patil wrote: On 04-Dec-14 10:49, Zane Bitter wrote: On 01/12/14 02:02, Anant Patil wrote: On GitHub:https://github.com/anantpatil/heat-convergence-poc I'm trying to review this code at the moment, and finding some stuff I don't understand: https://github.com/anantpatil/heat-convergence-poc/blob/master/heat/engine/stack.py#L911-L916 This appears to loop through all of the resources *prior* to kicking off any actual updates to check if the resource will change. This is impossible to do in general, since a resource may obtain a property value from an attribute of another resource and there is no way to know whether an update to said other resource would cause a change in the attribute value. In addition, no attempt to catch UpdateReplace is made. Although that looks like a simple fix, I'm now worried about the level to which this code has been tested. We were working on a new branch and, as we discussed on Skype, we have handled all these cases. Please have a look at our current branch: https://github.com/anantpatil/heat-convergence-poc/tree/graph-version When a new resource is taken for convergence, its children are loaded and the resource definition is re-parsed. The frozen resource definition will have all the get_attr references resolved. I'm also trying to wrap my head around how resources are cleaned up in dependency order. If I understand correctly, you store in the ResourceGraph table the dependencies between various resource names in the current template (presumably there could also be some left around from previous templates too?). For each resource name there may be a number of rows in the Resource table, each with an incrementing version. As far as I can tell though, there's nowhere that the dependency graph for _previous_ templates is persisted? So if the dependency order changes in the template we have no way of knowing the correct order to clean up in any more? 
(There's not even a mechanism to associate a resource version with a particular template, which might be one avenue by which to recover the dependencies.) I think this is an important case we need to be able to handle, so I added a scenario to my test framework to exercise it and discovered that my implementation was also buggy. Here's the fix: https://github.com/zaneb/heat-convergence-prototype/commit/786f367210ca0acf9eb22bea78fd9d51941b0e40 Thanks for pointing this out Zane. We too had a buggy implementation for handling inverted dependency. I had a hard look at our algorithm where we were continuously merging the edges from the new template into the edges from previous updates. It was an optimized way of traversing the graph in both forward and reverse order without missing any resources. But, when the dependencies are inverted, this wouldn't work. We have changed our algorithm. The changes in edges are noted down in the DB, and only the delta of edges from the previous template is calculated and kept. At any given point in time, the graph table has all the edges from the current template and the delta from previous templates. Each edge has a template ID associated with it. The thing is, the cleanup dependencies aren't really about the template. The real resources really depend on other real resources. You can't delete a Volume before its VolumeAttachment, not because it says so in the template but because it will fail if you try. The template can give us a rough guide in advance to what those dependencies will be, but if that's all we keep then we are discarding information. There may be multiple versions of a resource corresponding to one template version. Even worse, the actual dependencies of a resource change on a smaller time scale than an entire stack update (this is the reason the current implementation updates the template one resource at a time as we go). Absolutely! The edges from the template are kept only for reference purposes. 
When we have a resource in the new template, its template ID will also be set to the current template. At any point in time, a realized resource will be from the current template, even if it were found in previous templates. The template ID moves for a resource if it is found. In theory (disclaimer: I didn't implement this yet) it can change on an even smaller timescale than that. The existing plugins are something of a black box to us: if a failure occurs we don't necessarily know whether the real-world dependency is on the old or new version of another resource. Given that our Resource entries in the DB are in 1:1 correspondence with actual resources (we create a new one whenever we need to replace the underlying resource), I found it makes the most conceptual and practical sense to store the requirements in the resource itself, and update them at the time they actually change in the real world (bonus: introduces no new locking issues and no extra DB writes). I settled on this after a legitimate attempt at
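The 1:1 row-per-real-resource model with replaced_by/replaces links could be sketched like this (a hypothetical schema for illustration, not Heat's actual Resource table):

```python
# Sketch: an UpdateReplace inserts a NEW row linked to the old one
# rather than mutating it, so both real resources stay tracked (and
# correctly ordered for cleanup) until one of them is removed.
rows = {}
_next_id = [1]

def insert(name, template_id, replaces=None):
    rid = _next_id[0]
    _next_id[0] += 1
    rows[rid] = {'name': name, 'template_id': template_id,
                 'replaces': replaces, 'replaced_by': None}
    if replaces is not None:
        rows[replaces]['replaced_by'] = rid  # back-link on the old row
    return rid

old = insert('server', template_id=1)
new = insert('server', template_id=2, replaces=old)
```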
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 15/12/14 07:47, Murugan, Visnusaran wrote: Hi Zane, We have been going through this chain for quite some time now and we still feel a disconnect in our understanding. Yes, I thought last week that we were on the same page, but now it looks like we're drifting off again :( Can you put up an etherpad where we can understand your approach. Maybe you could put up an etherpad with your questions. Practically all of the relevant code is in Stack._create_or_update, Stack._dependencies and Converger.check_resource. That's 134 lines of code by my count. There's not a lot more I can usefully say about it without knowing which parts exactly you're stuck on, but I can definitely answer specific questions. For example: for storing resource dependencies, are you storing its (name, version) tuple or just its ID? I'm storing a tuple of its name and database ID. The data structure is resource.GraphKey. I was originally using the name for something, but I suspect I could probably drop it now and just store the database ID, but I haven't tried it yet. (Having the name in there definitely makes debugging more pleasant though ;) When I build the traversal graph each node is a tuple of the GraphKey and a boolean to indicate whether it corresponds to an update or a cleanup operation (both can appear for a single resource in the same graph). If I am correct, you are updating all resources on update regardless of whether they have changed, which will be inefficient if a stack contains a million resources. I'm calling update() on all resources regardless of change, but update() will only call handle_update() if something has changed (unless the plugin has overridden Resource._needs_update()). There's no way to know whether a resource needs to be updated before you're ready to update it, so I don't think of this as 'inefficient', just 'correct'. We have similar questions regarding other areas in your implementation, which we believe will be resolved once we understand the outline of your implementation.
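The GraphKey-plus-flag traversal nodes described above could be sketched roughly as follows (a simplified illustration, not the prototype's exact code):

```python
from collections import namedtuple

# A graph key identifies one physical resource: template name + database id.
GraphKey = namedtuple('GraphKey', ['name', 'key'])


def graph_nodes(resources):
    """Yield traversal nodes: (GraphKey, is_update). The boolean
    distinguishes the update phase (True) from cleanup (False); both
    nodes can appear for the same resource in one graph."""
    for res in resources:
        key = GraphKey(res['name'], res['id'])
        yield (key, True)    # update node
        yield (key, False)   # cleanup node


nodes = list(graph_nodes([{'name': 'server', 'id': 7}]))
assert (GraphKey('server', 7), True) in nodes
assert (GraphKey('server', 7), False) in nodes
```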
It is difficult to get a hold on your approach just by looking at code. Docstrings / Etherpad will help. About streams: yes, in a million-resource stack the data will be huge, but less than the template. No way, it's O(n^3) (cubed!) in the worst case to store streams for each resource. Also this stream is stored only in IN_PROGRESS resources. Now I'm really confused. Where does it come from if the resource doesn't get it until it's already in progress? And how will that information help it? The reason to have the entire dependency list is to reduce DB queries during a stack update. But we never need to know that. We only need to know what just happened and what to do next. When you have a singular dependency on each resource, similar to your implementation, then we will end up loading dependencies one at a time and altering almost every resource's dependencies regardless of their change. Regarding a 2-template approach for delete, it is not actually 2 different templates. It's just that we have a delete stream to be taken up post-update. That would be a regression from Heat's current behaviour, where we start cleaning up resources as soon as they have nothing depending on them. There's not even a reason to make it worse than what we already have, because it's actually a lot _easier_ to treat update and clean up as the same kind of operation and throw both into the same big graph. The dual implementations and all of the edge cases go away and you can just trust in the graph traversal to do the Right Thing in the most parallel way possible. (Any post operation will be handled as an update) This approach is True when Rollback==True We can always fall back to regular stream (non-delete stream) if Rollback=False I don't understand what you're saying here. In our view we would like to have only one basic operation and that is UPDATE. 1. CREATE will be an update where a realized graph == Empty 2.
UPDATE will be an update where realized graph == Full/Partial realized (possibly with a delete stream as post operation if Rollback==True) 3. DELETE will be just another update with an empty to_be_realized_graph. Yes, that goes without saying. In my implementation Stack._create_or_update() handles all three operations. It would be great if we can freeze a stable approach by mid-week as Christmas vacations are round the corner. :) :) -Original Message- From: Zane Bitter [mailto:zbit...@redhat.com] Sent: Saturday, December 13, 2014 5:43 AM To: openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On 12/12/14 05:29, Murugan, Visnusaran wrote: -Original Message- From: Zane Bitter [mailto:zbit...@redhat.com] Sent: Friday, December 12, 2014 6:37 AM To: openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On 11/12/14 08:26, Murugan, Visnusaran wrote: [Murugan, Visnusaran
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 15/12/14 10:15, Anant Patil wrote: On 13-Dec-14 05:42, Zane Bitter wrote: On 12/12/14 05:29, Murugan, Visnusaran wrote: -Original Message- From: Zane Bitter [mailto:zbit...@redhat.com] Sent: Friday, December 12, 2014 6:37 AM To: openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On 11/12/14 08:26, Murugan, Visnusaran wrote: [Murugan, Visnusaran] In case of rollback where we have to clean up earlier versions of resources, we could get the order from the old template. We'd prefer not to have a graph table. In theory you could get it by keeping old templates around. But that means keeping a lot of templates, and it will be hard to keep track of when you want to delete them. It also means that when starting an update you'll need to load every existing previous version of the template in order to calculate the dependencies. It also leaves the dependencies in an ambiguous state when a resource fails, and although that can be worked around it will be a giant pain to implement. Agree that looking to all templates for a delete is not good. But barring complexity, we feel we could achieve it by way of having an update and a delete stream for a stack update operation. I will elaborate in detail in the etherpad sometime tomorrow :) I agree that I'd prefer not to have a graph table. After trying a couple of different things I decided to store the dependencies in the Resource table, where we can read or write them virtually for free because it turns out that we are always reading or updating the Resource itself at exactly the same time anyway. Not sure how this will work in an update scenario when a resource does not change and its dependencies do. We'll always update the requirements, even when the properties don't change. Can you elaborate a bit on rollback. I didn't do anything special to handle rollback.
It's possible that we need to - obviously the difference in the UpdateReplace + rollback case is that the replaced resource is now the one we want to keep, and yet the replaced_by/replaces dependency will force the newer (replacement) resource to be checked for deletion first, which is an inversion of the usual order. This is where the version is so handy! For UpdateReplaced ones, there is *sometimes* an older version to go back to. This version could just be the template ID, as I mentioned in another e-mail. All resources are at the current template ID if they are found in the current template, even if there is no need to update them. Otherwise, they need to be cleaned up in the order given in the previous templates. I think the template ID is used as a version as far as I can see in Zane's PoC. If the resource template key doesn't match the current template key, the resource is deleted. The version is a misnomer here, but that field (template id) is used as though we had versions of resources. Correct. Because if we had wanted to keep it, we would have updated it to the new template version already. However, I tried to think of a scenario where that would cause problems and I couldn't come up with one. Provided we know the actual, real-world dependencies of each resource I don't think the ordering of those two checks matters. In fact, I currently can't think of a case where the dependency order between replacement and replaced resources matters at all. It matters in the current Heat implementation because resources are artificially segmented into the current and backup stacks, but with a holistic view of dependencies that may well not be required. I tried taking that line out of the simulator code and all the tests still passed. If anybody can think of a scenario in which it would make a difference, I would be very interested to hear it.
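The template-ID-as-version rule described above amounts to something like this (illustrative names, not the PoC's actual code):

```python
def needs_cleanup(resource, current_template_id):
    """A resource whose template key was not moved to the current
    template was not wanted by the new template, so it gets deleted."""
    return resource['template_id'] != current_template_id


current = 3
resources = [
    {'name': 'kept', 'template_id': 3},    # found in the current template
    {'name': 'stale', 'template_id': 2},   # left over from an older update
]
doomed = [r['name'] for r in resources if needs_cleanup(r, current)]
assert doomed == ['stale']
```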
In any event though, it should be no problem to reverse the direction of that one edge in these particular circumstances if it does turn out to be a problem. We had an approach with depends_on and needed_by columns in ResourceTable. But dropped it when we figured out we had too many DB operations for Update. Yeah, I initially ran into this problem too - you have a bunch of nodes that are waiting on the current node, and now you have to go look them all up in the database to see what else they're waiting on in order to tell if they're ready to be triggered. It turns out the answer is to distribute the writes but centralise the reads. So at the start of the update, we read all of the Resources, obtain their dependencies and build one central graph[1]. We then make that graph available to each resource (either by passing it as a notification parameter, or storing it somewhere central in the DB that they will all have to read anyway, i.e. the Stack). But when we update a dependency we don't update the central graph, we update the individual Resource so there's no global lock required. [1] https://github.com/zaneb/heat-convergence-prototype/blob/distributed-graph/converge/stack.py#L166
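The "distribute the writes, centralise the reads" pattern above can be sketched roughly like this (hypothetical names, not the prototype's code):

```python
def build_graph(resources):
    """Read all resources once at the start of the traversal and build
    one central dependency graph - the only global read."""
    return {r['id']: set(r['requires']) for r in resources}


def record_requirements(resource, new_requires):
    """Writes stay local: each resource updates only its own row, so no
    global lock is needed and the central graph is never rewritten."""
    resource['requires'] = set(new_requires)


resources = [
    {'id': 1, 'requires': []},
    {'id': 2, 'requires': [1]},
]
graph = build_graph(resources)
assert graph == {1: set(), 2: {1}}

record_requirements(resources[1], [1])  # local write; graph untouched
assert graph[2] == {1}
```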
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
-Original Message- From: Zane Bitter [mailto:zbit...@redhat.com] Sent: Friday, December 12, 2014 6:37 AM To: openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On 11/12/14 08:26, Murugan, Visnusaran wrote: [Murugan, Visnusaran] In case of rollback where we have to clean up earlier versions of resources, we could get the order from the old template. We'd prefer not to have a graph table. In theory you could get it by keeping old templates around. But that means keeping a lot of templates, and it will be hard to keep track of when you want to delete them. It also means that when starting an update you'll need to load every existing previous version of the template in order to calculate the dependencies. It also leaves the dependencies in an ambiguous state when a resource fails, and although that can be worked around it will be a giant pain to implement. Agree that looking to all templates for a delete is not good. But barring complexity, we feel we could achieve it by way of having an update and a delete stream for a stack update operation. I will elaborate in detail in the etherpad sometime tomorrow :) I agree that I'd prefer not to have a graph table. After trying a couple of different things I decided to store the dependencies in the Resource table, where we can read or write them virtually for free because it turns out that we are always reading or updating the Resource itself at exactly the same time anyway. Not sure how this will work in an update scenario when a resource does not change and its dependencies do. We'll always update the requirements, even when the properties don't change. Can you elaborate a bit on rollback. We had an approach with depends_on and needed_by columns in ResourceTable. But dropped it when we figured out we had too many DB operations for Update. Also taking care of deleting resources in order will be an issue. It works fine.
This implies that there will be different versions of a resource which will even complicate further. No it doesn't, other than the different versions we already have due to UpdateReplace. This approach reduces DB queries by waiting for completion notification on a topic. The drawback I see is that delete stack stream will be huge as it will have the entire graph. We can always dump such data in ResourceLock.data Json and pass a simple flag load_stream_from_db to converge RPC call as a workaround for delete operation. This seems to be essentially equivalent to my 'SyncPoint' proposal[1], with the key difference that the data is stored in-memory in a Heat engine rather than the database. I suspect it's probably a mistake to move it in-memory for similar reasons to the argument Clint made against synchronising the marking off of dependencies in-memory. The database can handle that and the problem of making the DB robust against failures of a single machine has already been solved by someone else. If we do it in-memory we are just creating a single point of failure for not much gain. (I guess you could argue it doesn't matter, since if any Heat engine dies during the traversal then we'll have to kick off another one anyway, but it does limit our options if that changes in the future.) [Murugan, Visnusaran] Resource completes, removes itself from resource_lock and notifies engine. Engine will acquire parent lock and initiate parent only if all its children are satisfied (no child entry in resource_lock). This will come in place of Aggregator. Yep, if you s/resource_lock/SyncPoint/ that's more or less exactly what I did. The three differences I can see are: 1) I think you are proposing to create all of the sync points at the start of the traversal, rather than on an as-needed basis. This is probably a good idea. I didn't consider it because of the way my prototype evolved, but there's now no reason I can see not to do this. 
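A rough sketch of how such DB-backed sync points might mark off dependencies, per the discussion above (illustrative names; the real proposal stores these rows in the database precisely so no single engine's memory is a point of failure):

```python
class SyncPoint:
    """Database-backed wait counter: a node is triggered only once every
    predecessor has marked itself off. Because the state lives in the DB
    rather than in one Heat engine's memory, an engine crash mid-traversal
    does not lose the bookkeeping."""

    def __init__(self, waiting_on):
        self.waiting_on = set(waiting_on)

    def satisfy(self, key):
        """Mark one predecessor done; return True when ready to trigger."""
        self.waiting_on.discard(key)
        return not self.waiting_on


# Created for every node at the start of the traversal:
point = SyncPoint(waiting_on={'volume', 'network'})
assert not point.satisfy('volume')   # still waiting on 'network'
assert point.satisfy('network')      # all satisfied; trigger the node
```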
If we could move the data to the Resource table itself then we could even get it for free from an efficiency point of view. +1. But we will need engine_id to be stored somewhere for recovery purposes (in an easily queried format). Yeah, so I'm starting to think you're right, maybe the/a Lock table is the right thing to use there. We could probably do it within the resource table using the same select-for-update to set the engine_id, but I agree that we might be starting to jam too much into that one table. Yeah. Unrelated values in the resource table. Upon resource completion we have to unset engine_id, as compared to dropping a row from resource lock. Both are good. Having engine_id in resource_table will reduce DB operations by half. We should go with just the resource table along with engine_id. Sync points are created as-needed. Single resource is enough to restart that entire stream. I
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
Hi zaneb, Etherpad updated. https://etherpad.openstack.org/p/execution-stream-and-aggregator-based-convergence -Original Message- From: Murugan, Visnusaran Sent: Friday, December 12, 2014 4:00 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown -Original Message- From: Zane Bitter [mailto:zbit...@redhat.com] Sent: Friday, December 12, 2014 6:37 AM To: openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On 11/12/14 08:26, Murugan, Visnusaran wrote: [Murugan, Visnusaran] In case of rollback where we have to clean up earlier versions of resources, we could get the order from the old template. We'd prefer not to have a graph table. In theory you could get it by keeping old templates around. But that means keeping a lot of templates, and it will be hard to keep track of when you want to delete them. It also means that when starting an update you'll need to load every existing previous version of the template in order to calculate the dependencies. It also leaves the dependencies in an ambiguous state when a resource fails, and although that can be worked around it will be a giant pain to implement. Agree that looking to all templates for a delete is not good. But barring complexity, we feel we could achieve it by way of having an update and a delete stream for a stack update operation. I will elaborate in detail in the etherpad sometime tomorrow :) I agree that I'd prefer not to have a graph table. After trying a couple of different things I decided to store the dependencies in the Resource table, where we can read or write them virtually for free because it turns out that we are always reading or updating the Resource itself at exactly the same time anyway. Not sure how this will work in an update scenario when a resource does not change and its dependencies do.
We'll always update the requirements, even when the properties don't change. Can you elaborate a bit on rollback. We had an approach with depends_on and needed_by columns in ResourceTable. But dropped it when we figured out we had too many DB operations for Update. Also taking care of deleting resources in order will be an issue. It works fine. This implies that there will be different versions of a resource which will even complicate further. No it doesn't, other than the different versions we already have due to UpdateReplace. This approach reduces DB queries by waiting for completion notification on a topic. The drawback I see is that delete stack stream will be huge as it will have the entire graph. We can always dump such data in ResourceLock.data Json and pass a simple flag load_stream_from_db to converge RPC call as a workaround for delete operation. This seems to be essentially equivalent to my 'SyncPoint' proposal[1], with the key difference that the data is stored in-memory in a Heat engine rather than the database. I suspect it's probably a mistake to move it in-memory for similar reasons to the argument Clint made against synchronising the marking off of dependencies in-memory. The database can handle that and the problem of making the DB robust against failures of a single machine has already been solved by someone else. If we do it in-memory we are just creating a single point of failure for not much gain. (I guess you could argue it doesn't matter, since if any Heat engine dies during the traversal then we'll have to kick off another one anyway, but it does limit our options if that changes in the future.) [Murugan, Visnusaran] Resource completes, removes itself from resource_lock and notifies engine. Engine will acquire parent lock and initiate parent only if all its children are satisfied (no child entry in resource_lock). This will come in place of Aggregator. Yep, if you s/resource_lock/SyncPoint/ that's more or less exactly what I did. 
The three differences I can see are: 1) I think you are proposing to create all of the sync points at the start of the traversal, rather than on an as-needed basis. This is probably a good idea. I didn't consider it because of the way my prototype evolved, but there's now no reason I can see not to do this. If we could move the data to the Resource table itself then we could even get it for free from an efficiency point of view. +1. But we will need engine_id to be stored somewhere for recovery purpose (easy to be queried format). Yeah, so I'm starting to think you're right, maybe the/a Lock table is the right thing to use there. We could probably do it within the resource table using the same select-for-update to set the engine_id, but I agree
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 12/12/14 05:29, Murugan, Visnusaran wrote: -Original Message- From: Zane Bitter [mailto:zbit...@redhat.com] Sent: Friday, December 12, 2014 6:37 AM To: openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On 11/12/14 08:26, Murugan, Visnusaran wrote: [Murugan, Visnusaran] In case of rollback where we have to clean up earlier versions of resources, we could get the order from the old template. We'd prefer not to have a graph table. In theory you could get it by keeping old templates around. But that means keeping a lot of templates, and it will be hard to keep track of when you want to delete them. It also means that when starting an update you'll need to load every existing previous version of the template in order to calculate the dependencies. It also leaves the dependencies in an ambiguous state when a resource fails, and although that can be worked around it will be a giant pain to implement. Agree that looking to all templates for a delete is not good. But barring complexity, we feel we could achieve it by way of having an update and a delete stream for a stack update operation. I will elaborate in detail in the etherpad sometime tomorrow :) I agree that I'd prefer not to have a graph table. After trying a couple of different things I decided to store the dependencies in the Resource table, where we can read or write them virtually for free because it turns out that we are always reading or updating the Resource itself at exactly the same time anyway. Not sure how this will work in an update scenario when a resource does not change and its dependencies do. We'll always update the requirements, even when the properties don't change. Can you elaborate a bit on rollback. I didn't do anything special to handle rollback.
It's possible that we need to - obviously the difference in the UpdateReplace + rollback case is that the replaced resource is now the one we want to keep, and yet the replaced_by/replaces dependency will force the newer (replacement) resource to be checked for deletion first, which is an inversion of the usual order. However, I tried to think of a scenario where that would cause problems and I couldn't come up with one. Provided we know the actual, real-world dependencies of each resource I don't think the ordering of those two checks matters. In fact, I currently can't think of a case where the dependency order between replacement and replaced resources matters at all. It matters in the current Heat implementation because resources are artificially segmented into the current and backup stacks, but with a holistic view of dependencies that may well not be required. I tried taking that line out of the simulator code and all the tests still passed. If anybody can think of a scenario in which it would make a difference, I would be very interested to hear it. In any event though, it should be no problem to reverse the direction of that one edge in these particular circumstances if it does turn out to be a problem. We had an approach with depends_on and needed_by columns in ResourceTable. But dropped it when we figured out we had too many DB operations for Update. Yeah, I initially ran into this problem too - you have a bunch of nodes that are waiting on the current node, and now you have to go look them all up in the database to see what else they're waiting on in order to tell if they're ready to be triggered. It turns out the answer is to distribute the writes but centralise the reads. So at the start of the update, we read all of the Resources, obtain their dependencies and build one central graph[1]. 
We then make that graph available to each resource (either by passing it as a notification parameter, or storing it somewhere central in the DB that they will all have to read anyway, i.e. the Stack). But when we update a dependency we don't update the central graph, we update the individual Resource so there's no global lock required. [1] https://github.com/zaneb/heat-convergence-prototype/blob/distributed-graph/converge/stack.py#L166-L168 Also taking care of deleting resources in order will be an issue. It works fine. This implies that there will be different versions of a resource which will even complicate further. No it doesn't, other than the different versions we already have due to UpdateReplace. This approach reduces DB queries by waiting for completion notification on a topic. The drawback I see is that delete stack stream will be huge as it will have the entire graph. We can always dump such data in ResourceLock.data Json and pass a simple flag load_stream_from_db to converge RPC call as a workaround for delete operation. This seems to be essentially equivalent to my 'SyncPoint' proposal[1], with the key difference that the data is stored in-memory in a Heat engine rather than the database. I suspect it's probably a mistake to move
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
-Original Message- From: Zane Bitter [mailto:zbit...@redhat.com] Sent: Wednesday, December 10, 2014 11:17 PM To: openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown You really need to get a real email client with quoting support ;) Apologies :) I screwed up my mail client's configuration. On 10/12/14 06:42, Murugan, Visnusaran wrote: Well, we still have to persist the dependencies of each version of a resource _somehow_, because otherwise we can't know how to clean them up in the correct order. But what I think you meant to say is that this approach doesn't require it to be persisted in a separate table where the rows are marked as traversed as we work through the graph. [Murugan, Visnusaran] In case of rollback where we have to clean up earlier versions of resources, we could get the order from the old template. We'd prefer not to have a graph table. In theory you could get it by keeping old templates around. But that means keeping a lot of templates, and it will be hard to keep track of when you want to delete them. It also means that when starting an update you'll need to load every existing previous version of the template in order to calculate the dependencies. It also leaves the dependencies in an ambiguous state when a resource fails, and although that can be worked around it will be a giant pain to implement. Agree that looking to all templates for a delete is not good. But barring complexity, we feel we could achieve it by way of having an update and a delete stream for a stack update operation. I will elaborate in detail in the etherpad sometime tomorrow :) I agree that I'd prefer not to have a graph table. After trying a couple of different things I decided to store the dependencies in the Resource table, where we can read or write them virtually for free because it turns out that we are always reading or updating the Resource itself at exactly the same time anyway.
Not sure how this will work in an update scenario when a resource does not change and its dependencies do. Also taking care of deleting resources in order will be an issue. This implies that there will be different versions of a resource which will even complicate further. This approach reduces DB queries by waiting for completion notification on a topic. The drawback I see is that delete stack stream will be huge as it will have the entire graph. We can always dump such data in ResourceLock.data Json and pass a simple flag load_stream_from_db to converge RPC call as a workaround for delete operation. This seems to be essentially equivalent to my 'SyncPoint' proposal[1], with the key difference that the data is stored in-memory in a Heat engine rather than the database. I suspect it's probably a mistake to move it in-memory for similar reasons to the argument Clint made against synchronising the marking off of dependencies in-memory. The database can handle that and the problem of making the DB robust against failures of a single machine has already been solved by someone else. If we do it in-memory we are just creating a single point of failure for not much gain. (I guess you could argue it doesn't matter, since if any Heat engine dies during the traversal then we'll have to kick off another one anyway, but it does limit our options if that changes in the future.) [Murugan, Visnusaran] Resource completes, removes itself from resource_lock and notifies engine. Engine will acquire parent lock and initiate parent only if all its children are satisfied (no child entry in resource_lock). This will come in place of Aggregator. Yep, if you s/resource_lock/SyncPoint/ that's more or less exactly what I did. The three differences I can see are: 1) I think you are proposing to create all of the sync points at the start of the traversal, rather than on an as-needed basis. This is probably a good idea. 
I didn't consider it because of the way my prototype evolved, but there's now no reason I can see not to do this. If we could move the data to the Resource table itself then we could even get it for free from an efficiency point of view. +1. But we will need engine_id to be stored somewhere for recovery purposes (in an easily queried format). Sync points are created as-needed. Single resource is enough to restart that entire stream. I think there is a disconnect in our understanding. I will detail it as well in the etherpad. 2) You're using a single list from which items are removed, rather than two lists (one static, and one to which items are added) that get compared. Assuming (1) then this is probably a good idea too. Yeah. We have a single list per active stream which works by removing complete/satisfied resources from it. 3) You're suggesting to notify the engine unconditionally and let the engine decide if the list is empty. That's probably not a good idea - not only does
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 11/12/14 01:14, Anant Patil wrote: On 04-Dec-14 10:49, Zane Bitter wrote: On 01/12/14 02:02, Anant Patil wrote: On GitHub:https://github.com/anantpatil/heat-convergence-poc I'm trying to review this code at the moment, and finding some stuff I don't understand: https://github.com/anantpatil/heat-convergence-poc/blob/master/heat/engine/stack.py#L911-L916 This appears to loop through all of the resources *prior* to kicking off any actual updates to check if the resource will change. This is impossible to do in general, since a resource may obtain a property value from an attribute of another resource and there is no way to know whether an update to said other resource would cause a change in the attribute value. In addition, no attempt to catch UpdateReplace is made. Although that looks like a simple fix, I'm now worried about the level to which this code has been tested. We were working on new branch and as we discussed on Skype, we have handled all these cases. Please have a look at our current branch: https://github.com/anantpatil/heat-convergence-poc/tree/graph-version When a new resource is taken for convergence, its children are loaded and the resource definition is re-parsed. The frozen resource definition will have all the get_attr resolved. I'm also trying to wrap my head around how resources are cleaned up in dependency order. If I understand correctly, you store in the ResourceGraph table the dependencies between various resource names in the current template (presumably there could also be some left around from previous templates too?). For each resource name there may be a number of rows in the Resource table, each with an incrementing version. As far as I can tell though, there's nowhere that the dependency graph for _previous_ templates is persisted? So if the dependency order changes in the template we have no way of knowing the correct order to clean up in any more? 
(There's not even a mechanism to associate a resource version with a particular template, which might be one avenue by which to recover the dependencies.) I think this is an important case we need to be able to handle, so I added a scenario to my test framework to exercise it and discovered that my implementation was also buggy. Here's the fix: https://github.com/zaneb/heat-convergence-prototype/commit/786f367210ca0acf9eb22bea78fd9d51941b0e40 Thanks for pointing this out Zane. We too had a buggy implementation for handling inverted dependencies. I had a hard look at our algorithm where we were continuously merging the edges from the new template into the edges from previous updates. It was an optimized way of traversing the graph in both forward and reverse order without missing any resources. But, when the dependencies are inverted, this wouldn't work. We have changed our algorithm. The changes in edges are noted down in the DB; only the delta of edges from the previous template is calculated and kept. At any given point of time, the graph table has all the edges from the current template and the delta from previous templates. Each edge has a template ID associated with it. The thing is, the cleanup dependencies aren't really about the template. The real resources really depend on other real resources. You can't delete a Volume before its VolumeAttachment, not because it says so in the template but because it will fail if you try. The template can give us a rough guide in advance to what those dependencies will be, but if that's all we keep then we are discarding information. There may be multiple versions of a resource corresponding to one template version. Even worse, the actual dependencies of a resource change on a smaller time scale than an entire stack update (this is the reason the current implementation updates the template one resource at a time as we go).
Given that our Resource entries in the DB are in 1:1 correspondence with actual resources (we create a new one whenever we need to replace the underlying resource), I found it makes the most conceptual and practical sense to store the requirements in the resource itself, and update them at the time they actually change in the real world (bonus: introduces no new locking issues and no extra DB writes). I settled on this after a legitimate attempt at trying other options, but they didn't work out: https://github.com/zaneb/heat-convergence-prototype/commit/a62958342e8583f74e2aca90f6239ad457ba984d For resource clean-up, we start from the first template (the template which was completed and on top of which updates were made, or an empty template otherwise) and move towards the current template in the order in which the updates were issued; for each template, the graph (the edges found for that template) is traversed in reverse order and resources are cleaned up. I'm pretty sure this is backwards - you'll need to clean up newer resources first because they may reference resources from older templates. Also if you have a stubborn old resource that won't
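Zane's point about storing the requirements in the resource row itself, and rewriting them in the same write that already touches the resource, can be illustrated with a minimal sketch. This is not Heat's actual schema or DB API - the table and column names here are assumptions for illustration only, using sqlite for brevity:

```python
# Toy sketch (assumed schema, NOT Heat's real tables): the resource's
# current requirements live in the resource row, so they are written by
# the same statement that records the resource's new state - no separate
# graph table, no extra query, no new locking.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE resource (
    name TEXT PRIMARY KEY,
    properties TEXT,
    requires TEXT)""")

def update_resource(conn, name, properties, requires):
    # One write updates both the resource's state and its dependencies
    # as they actually are at this moment in the real world.
    conn.execute(
        "INSERT OR REPLACE INTO resource (name, properties, requires) "
        "VALUES (?, ?, ?)",
        (name, json.dumps(properties), json.dumps(sorted(requires))))
    conn.commit()

def get_requires(conn, name):
    row = conn.execute(
        "SELECT requires FROM resource WHERE name = ?", (name,)).fetchone()
    return json.loads(row[0])

# A VolumeAttachment-style resource requiring a server and a volume:
update_resource(conn, "attachment", {"device": "/dev/vdb"},
                {"server", "volume"})
print(get_requires(conn, "attachment"))  # ['server', 'volume']
```

The design point is that the requirements column is only ever read or written while the resource row is being read or written anyway, so keeping them together costs nothing extra.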
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 11/12/14 08:26, Murugan, Visnusaran wrote: [Murugan, Visnusaran] In case of rollback where we have to cleanup earlier version of resources, we could get the order from old template. We'd prefer not to have a graph table. In theory you could get it by keeping old templates around. But that means keeping a lot of templates, and it will be hard to keep track of when you want to delete them. It also means that when starting an update you'll need to load every existing previous version of the template in order to calculate the dependencies. It also leaves the dependencies in an ambiguous state when a resource fails, and although that can be worked around it will be a giant pain to implement. Agree that looking at all templates for a delete is not good. But, barring the complexity, we feel we could achieve it by having an update stream and a delete stream for a stack update operation. I will elaborate in detail in the etherpad sometime tomorrow :) I agree that I'd prefer not to have a graph table. After trying a couple of different things I decided to store the dependencies in the Resource table, where we can read or write them virtually for free because it turns out that we are always reading or updating the Resource itself at exactly the same time anyway. Not sure how this will work in an update scenario when a resource does not change and its dependencies do. We'll always update the requirements, even when the properties don't change. Also taking care of deleting resources in order will be an issue. It works fine. This implies that there will be different versions of a resource, which will complicate things even further. No it doesn't, other than the different versions we already have due to UpdateReplace. This approach reduces DB queries by waiting for completion notification on a topic. The drawback I see is that the delete stack stream will be huge, as it will have the entire graph.
We can always dump such data in ResourceLock.data Json and pass a simple flag load_stream_from_db to converge RPC call as a workaround for delete operation. This seems to be essentially equivalent to my 'SyncPoint' proposal[1], with the key difference that the data is stored in-memory in a Heat engine rather than the database. I suspect it's probably a mistake to move it in-memory for similar reasons to the argument Clint made against synchronising the marking off of dependencies in-memory. The database can handle that and the problem of making the DB robust against failures of a single machine has already been solved by someone else. If we do it in-memory we are just creating a single point of failure for not much gain. (I guess you could argue it doesn't matter, since if any Heat engine dies during the traversal then we'll have to kick off another one anyway, but it does limit our options if that changes in the future.) [Murugan, Visnusaran] Resource completes, removes itself from resource_lock and notifies engine. Engine will acquire parent lock and initiate parent only if all its children are satisfied (no child entry in resource_lock). This will come in place of Aggregator. Yep, if you s/resource_lock/SyncPoint/ that's more or less exactly what I did. The three differences I can see are: 1) I think you are proposing to create all of the sync points at the start of the traversal, rather than on an as-needed basis. This is probably a good idea. I didn't consider it because of the way my prototype evolved, but there's now no reason I can see not to do this. If we could move the data to the Resource table itself then we could even get it for free from an efficiency point of view. +1. But we will need engine_id to be stored somewhere for recovery purposes (in an easily queried format). Yeah, so I'm starting to think you're right; maybe the/a Lock table is the right thing to use there.
We could probably do it within the resource table using the same select-for-update to set the engine_id, but I agree that we might be starting to jam too much into that one table. Sync points are created as-needed. A single resource is enough to restart that entire stream. I think there is a disconnect in our understanding. I will detail it as well in the etherpad. OK, that would be good. 2) You're using a single list from which items are removed, rather than two lists (one static, and one to which items are added) that get compared. Assuming (1) then this is probably a good idea too. Yeah. We have a single list per active stream, which works by removing completed/satisfied resources from it. I went to change this and then remembered why I did it this way: the sync point is also storing data about the resources that are triggering it. Part of this is the RefID and attributes, and we could replace that by storing that data in the Resource itself and querying it rather than having it passed in via the notification. But the other part is the ID/key of those resources, which we _need_ to know in order to update
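The sync point behaviour discussed in this exchange - a static set of requirements compared against an accumulating set of notifications, where each notification also carries the key/RefId/attributes of the triggering resource - might look like the following toy sketch. The class and field names are hypothetical, not the PoC's API:

```python
# Hypothetical sketch of a sync point: it both counts off requirements
# (static required set vs. accumulating received set) and records the
# ID/key, RefId and attributes of each resource that triggered it, which
# is the data Zane notes the sync point needs to hold.
class SyncPoint:
    def __init__(self, required):
        self.required = set(required)   # static list, known from the graph
        self.received = {}              # name -> (key, refid, attrs)

    def notify(self, name, key, refid, attrs):
        """Record one completed requirement; True when all have arrived."""
        self.received[name] = (key, refid, attrs)
        # Satisfied once the accumulated set covers the static set.
        return self.required <= set(self.received)

sp = SyncPoint(required={"server", "volume"})
assert not sp.notify("server", key=("server", 2), refid="srv-1", attrs={})
assert sp.notify("volume", key=("volume", 1), refid="vol-9", attrs={"size": 1})
```

In the in-database version being argued for here, `required` and `received` would be columns updated atomically rather than in-memory sets.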
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
-Original Message- From: Zane Bitter [mailto:zbit...@redhat.com] Sent: Tuesday, December 9, 2014 3:50 AM To: openstack-dev@lists.openstack.org Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On 08/12/14 07:00, Murugan, Visnusaran wrote: Hi Zane, Michael, Please have a look @ https://etherpad.openstack.org/p/execution-stream-and-aggregator-based-convergence Updated with a combined approach which does not require persisting graph and backup stack removal. Well, we still have to persist the dependencies of each version of a resource _somehow_, because otherwise we can't know how to clean them up in the correct order. But what I think you meant to say is that this approach doesn't require it to be persisted in a separate table where the rows are marked as traversed as we work through the graph. [Murugan, Visnusaran] In case of rollback where we have to cleanup earlier version of resources, we could get the order from old template. We'd prefer not to have a graph table. This approach reduces DB queries by waiting for completion notification on a topic. The drawback I see is that delete stack stream will be huge as it will have the entire graph. We can always dump such data in ResourceLock.data Json and pass a simple flag load_stream_from_db to converge RPC call as a workaround for delete operation. This seems to be essentially equivalent to my 'SyncPoint' proposal[1], with the key difference that the data is stored in-memory in a Heat engine rather than the database. I suspect it's probably a mistake to move it in-memory for similar reasons to the argument Clint made against synchronising the marking off of dependencies in-memory. The database can handle that and the problem of making the DB robust against failures of a single machine has already been solved by someone else. If we do it in-memory we are just creating a single point of failure for not much gain.
(I guess you could argue it doesn't matter, since if any Heat engine dies during the traversal then we'll have to kick off another one anyway, but it does limit our options if that changes in the future.) [Murugan, Visnusaran] Resource completes, removes itself from resource_lock and notifies engine. Engine will acquire parent lock and initiate parent only if all its children are satisfied (no child entry in resource_lock). This will come in place of Aggregator. It's not clear to me how the 'streams' differ in practical terms from just passing a serialisation of the Dependencies object, other than being incomprehensible to me ;). The current Dependencies implementation (1) is a very generic implementation of a DAG, (2) works and has plenty of unit tests, (3) has, with I think one exception, a pretty straightforward API, (4) has a very simple serialisation, returned by the edges() method, which can be passed back into the constructor to recreate it, and (5) has an API that is to some extent relied upon by resources, and so won't likely be removed outright in any event. Whatever code we need to handle dependencies ought to just build on this existing implementation. [Murugan, Visnusaran] Our thought was to reduce payload size (template/graph). Just planning for the worst-case scenario (a million-resource stack). We could always dump them in ResourceLock.data to be loaded by the Worker. I think the difference may be that the streams only include the *shortest* paths (there will often be more than one) to each resource. i.e.

    A --- B --- C
    ^           |
    |           |
    +-----------+

can just be written as:

    A --- B --- C

because there's only one order in which that can execute anyway. (If we're going to do this though, we should just add a method to the dependencies.Graph class to delete redundant edges, not create a whole new data structure.) There is a big potential advantage here in that it reduces the theoretical maximum number of edges in the graph from O(n^2) to O(n).
(Although in practice real templates are typically not likely to have such dense graphs.) There's a downside to this too though: say that A in the above diagram is replaced during an update. In that case not only B but also C will need to figure out what the latest version of A is. One option here is to pass that data along via B, but that will become very messy to implement in a non-trivial example. The other would be for C to go search in the database for resources with the same name as A and the current traversal_id marked as the latest. But that not only creates a concurrency problem we didn't have before (A could have been updated with a new traversal_id at some point after C had established that the current traversal was still valid but before it went looking for A), it also eliminates all of the performance gains from removing that edge in the first place. [1] https://github.com/zaneb/heat-convergence
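The "delete redundant edges" method suggested above amounts to a transitive reduction of the DAG: an edge can be dropped whenever its target is still reachable through some longer path. A small sketch (this is plain tuples, not the `dependencies.Graph` API):

```python
# Sketch of redundant-edge removal (transitive reduction of a DAG).
# Edges are (from, to) tuples; an edge u->v is redundant when v is
# reachable from u via some other path, so dropping it changes no
# execution ordering. Not the heat.engine.dependencies API.
def reduce_edges(edges):
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)

    def reachable(start, target, skip_edge):
        # Depth-first search that ignores the edge under test.
        stack, seen = [start], set()
        while stack:
            n = stack.pop()
            for m in succ.get(n, ()):
                if (n, m) == skip_edge:
                    continue
                if m == target:
                    return True
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return False

    return {(u, v) for (u, v) in edges
            if not reachable(u, v, skip_edge=(u, v))}

# A -> B -> C plus the redundant direct edge A -> C:
edges = {("A", "B"), ("B", "C"), ("A", "C")}
print(sorted(reduce_edges(edges)))  # [('A', 'B'), ('B', 'C')]
```

This also makes the O(n^2)-to-O(n) claim concrete: the shortcut edge carries no ordering information that the longer path doesn't already imply.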
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
You really need to get a real email client with quoting support ;) On 10/12/14 06:42, Murugan, Visnusaran wrote: Well, we still have to persist the dependencies of each version of a resource _somehow_, because otherwise we can't know how to clean them up in the correct order. But what I think you meant to say is that this approach doesn't require it to be persisted in a separate table where the rows are marked as traversed as we work through the graph. [Murugan, Visnusaran] In case of rollback where we have to cleanup earlier version of resources, we could get the order from old template. We'd prefer not to have a graph table. In theory you could get it by keeping old templates around. But that means keeping a lot of templates, and it will be hard to keep track of when you want to delete them. It also means that when starting an update you'll need to load every existing previous version of the template in order to calculate the dependencies. It also leaves the dependencies in an ambiguous state when a resource fails, and although that can be worked around it will be a giant pain to implement. I agree that I'd prefer not to have a graph table. After trying a couple of different things I decided to store the dependencies in the Resource table, where we can read or write them virtually for free because it turns out that we are always reading or updating the Resource itself at exactly the same time anyway. This approach reduces DB queries by waiting for completion notification on a topic. The drawback I see is that delete stack stream will be huge as it will have the entire graph. We can always dump such data in ResourceLock.data Json and pass a simple flag load_stream_from_db to converge RPC call as a workaround for delete operation. This seems to be essentially equivalent to my 'SyncPoint' proposal[1], with the key difference that the data is stored in-memory in a Heat engine rather than the database. 
I suspect it's probably a mistake to move it in-memory for similar reasons to the argument Clint made against synchronising the marking off of dependencies in-memory. The database can handle that and the problem of making the DB robust against failures of a single machine has already been solved by someone else. If we do it in-memory we are just creating a single point of failure for not much gain. (I guess you could argue it doesn't matter, since if any Heat engine dies during the traversal then we'll have to kick off another one anyway, but it does limit our options if that changes in the future.) [Murugan, Visnusaran] Resource completes, removes itself from resource_lock and notifies engine. Engine will acquire parent lock and initiate parent only if all its children are satisfied (no child entry in resource_lock). This will come in place of Aggregator. Yep, if you s/resource_lock/SyncPoint/ that's more or less exactly what I did. The three differences I can see are: 1) I think you are proposing to create all of the sync points at the start of the traversal, rather than on an as-needed basis. This is probably a good idea. I didn't consider it because of the way my prototype evolved, but there's now no reason I can see not to do this. If we could move the data to the Resource table itself then we could even get it for free from an efficiency point of view. 2) You're using a single list from which items are removed, rather than two lists (one static, and one to which items are added) that get compared. Assuming (1) then this is probably a good idea too. 3) You're suggesting to notify the engine unconditionally and let the engine decide if the list is empty. That's probably not a good idea - not only does it require extra reads, it introduces a race condition that you then have to solve (it can be solved, it's just more work). Since the update to remove a child from the list is atomic, it's best to just trigger the engine only if the list is now empty. 
It's not clear to me how the 'streams' differ in practical terms from just passing a serialisation of the Dependencies object, other than being incomprehensible to me ;). The current Dependencies implementation (1) is a very generic implementation of a DAG, (2) works and has plenty of unit tests, (3) has, with I think one exception, a pretty straightforward API, (4) has a very simple serialisation, returned by the edges() method, which can be passed back into the constructor to recreate it, and (5) has an API that is to some extent relied upon by resources, and so won't likely be removed outright in any event. Whatever code we need to handle dependencies ought to just build on this existing implementation. [Murugan, Visnusaran] Our thought was to reduce payload size (template/graph). Just planning for worst case scenario (million resource stack) We could always dump them in ResourceLock.data to be loaded by Worker. If there's a smaller representation of a graph than a list of edges then
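Point (3) above - removing a child from the list atomically and notifying the engine only when that update leaves the list empty - could look like this minimal sketch. The schema is assumed, and sqlite stands in for Heat's real DB layer purely for illustration:

```python
# Sketch (assumed schema) of point (3): the completion update removes
# the child and reports, from the same transaction, whether the waiting
# list is now empty - so the engine is triggered exactly once, with no
# separate read and no race between readers.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sync_point (id TEXT PRIMARY KEY, waiting TEXT)")
conn.execute("INSERT INTO sync_point VALUES ('parent', ?)",
             (json.dumps(["child_a", "child_b"]),))
conn.commit()

def mark_complete(conn, sync_id, child):
    """Return True iff this completion emptied the waiting list."""
    with conn:  # one transaction: read-modify-write
        row = conn.execute(
            "SELECT waiting FROM sync_point WHERE id = ?",
            (sync_id,)).fetchone()
        waiting = [c for c in json.loads(row[0]) if c != child]
        conn.execute("UPDATE sync_point SET waiting = ? WHERE id = ?",
                     (json.dumps(waiting), sync_id))
    return not waiting

assert mark_complete(conn, "parent", "child_a") is False  # keep waiting
assert mark_complete(conn, "parent", "child_b") is True   # trigger parent
```

A production version would rely on the database's own atomic update (e.g. select-for-update or a conditional UPDATE) rather than sqlite's single-writer behaviour, but the trigger-only-when-empty contract is the same.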
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 08/12/14 07:00, Murugan, Visnusaran wrote: Hi Zane Michael, Please have a look @ https://etherpad.openstack.org/p/execution-stream-and-aggregator-based-convergence Updated with a combined approach which does not require persisting graph and backup stack removal. Well, we still have to persist the dependencies of each version of a resource _somehow_, because otherwise we can't know how to clean them up in the correct order. But what I think you meant to say is that this approach doesn't require it to be persisted in a separate table where the rows are marked as traversed as we work through the graph. This approach reduces DB queries by waiting for completion notification on a topic. The drawback I see is that delete stack stream will be huge as it will have the entire graph. We can always dump such data in ResourceLock.data Json and pass a simple flag load_stream_from_db to converge RPC call as a workaround for delete operation. This seems to be essentially equivalent to my 'SyncPoint' proposal[1], with the key difference that the data is stored in-memory in a Heat engine rather than the database. I suspect it's probably a mistake to move it in-memory for similar reasons to the argument Clint made against synchronising the marking off of dependencies in-memory. The database can handle that and the problem of making the DB robust against failures of a single machine has already been solved by someone else. If we do it in-memory we are just creating a single point of failure for not much gain. (I guess you could argue it doesn't matter, since if any Heat engine dies during the traversal then we'll have to kick off another one anyway, but it does limit our options if that changes in the future.) It's not clear to me how the 'streams' differ in practical terms from just passing a serialisation of the Dependencies object, other than being incomprehensible to me ;). 
The current Dependencies implementation (1) is a very generic implementation of a DAG, (2) works and has plenty of unit tests, (3) has, with I think one exception, a pretty straightforward API, (4) has a very simple serialisation, returned by the edges() method, which can be passed back into the constructor to recreate it, and (5) has an API that is to some extent relied upon by resources, and so won't likely be removed outright in any event. Whatever code we need to handle dependencies ought to just build on this existing implementation. I think the difference may be that the streams only include the *shortest* paths (there will often be more than one) to each resource. i.e.

    A --- B --- C
    ^           |
    |           |
    +-----------+

can just be written as:

    A --- B --- C

because there's only one order in which that can execute anyway. (If we're going to do this though, we should just add a method to the dependencies.Graph class to delete redundant edges, not create a whole new data structure.) There is a big potential advantage here in that it reduces the theoretical maximum number of edges in the graph from O(n^2) to O(n). (Although in practice real templates are typically not likely to have such dense graphs.) There's a downside to this too though: say that A in the above diagram is replaced during an update. In that case not only B but also C will need to figure out what the latest version of A is. One option here is to pass that data along via B, but that will become very messy to implement in a non-trivial example. The other would be for C to go search in the database for resources with the same name as A and the current traversal_id marked as the latest. But that not only creates a concurrency problem we didn't have before (A could have been updated with a new traversal_id at some point after C had established that the current traversal was still valid but before it went looking for A), it also eliminates all of the performance gains from removing that edge in the first place.
[1] https://github.com/zaneb/heat-convergence-prototype/blob/distributed-graph/converge/sync_point.py To stop the current stack operation, we will use your traversal_id-based approach. +1 :) If you feel the Aggregator model creates more queues, then we might have to poll the DB to get resource status (which will impact performance adversely :) ). For the reasons given above I would vote for doing this in the DB. I agree there will be a performance penalty for doing so, because we'll be paying for robustness. Lock table: name (unique - Resource_id), stack_id, engine_id, data (Json to store stream dict) Based on our call on Thursday, I think you're taking the idea of the Lock table too literally. The point of referring to locks is that we can use the same concepts as the Lock table relies on to do atomic updates on a particular row of the database, and we can use those atomic updates to prevent race conditions
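The lock-table concept being referred to - not a literal lock, but a conditional atomic UPDATE that only one engine can win - can be sketched as follows. The schema matches the columns proposed above, but sqlite and the function names are illustrative assumptions:

```python
# Sketch of the Lock-table idea as an atomic compare-and-swap on a row:
# claim the row only if engine_id is currently NULL. Exactly one engine's
# UPDATE matches, so no race is possible. Schema/names are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE resource_lock (
    resource_id TEXT PRIMARY KEY,
    stack_id TEXT,
    engine_id TEXT)""")
conn.execute("INSERT INTO resource_lock VALUES ('res-1', 'stack-1', NULL)")
conn.commit()

def try_acquire(conn, resource_id, engine_id):
    cur = conn.execute(
        "UPDATE resource_lock SET engine_id = ? "
        "WHERE resource_id = ? AND engine_id IS NULL",
        (engine_id, resource_id))
    conn.commit()
    return cur.rowcount == 1   # True only for the engine that won

assert try_acquire(conn, "res-1", "engine-A") is True
assert try_acquire(conn, "res-1", "engine-B") is False  # already held
```

The same conditional-UPDATE trick is what would let the sync-point or resource rows themselves be claimed safely, which is why the thread keeps noting that the table itself matters less than the atomic update.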
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 01/12/14 02:02, Anant Patil wrote: On GitHub:https://github.com/anantpatil/heat-convergence-poc I'm trying to review this code at the moment, and finding some stuff I don't understand: https://github.com/anantpatil/heat-convergence-poc/blob/master/heat/engine/stack.py#L911-L916 This appears to loop through all of the resources *prior* to kicking off any actual updates to check if the resource will change. This is impossible to do in general, since a resource may obtain a property value from an attribute of another resource and there is no way to know whether an update to said other resource would cause a change in the attribute value. In addition, no attempt to catch UpdateReplace is made. Although that looks like a simple fix, I'm now worried about the level to which this code has been tested. I'm also trying to wrap my head around how resources are cleaned up in dependency order. If I understand correctly, you store in the ResourceGraph table the dependencies between various resource names in the current template (presumably there could also be some left around from previous templates too?). For each resource name there may be a number of rows in the Resource table, each with an incrementing version. As far as I can tell though, there's nowhere that the dependency graph for _previous_ templates is persisted? So if the dependency order changes in the template we have no way of knowing the correct order to clean up in any more? (There's not even a mechanism to associate a resource version with a particular template, which might be one avenue by which to recover the dependencies.) I think this is an important case we need to be able to handle, so I added a scenario to my test framework to exercise it and discovered that my implementation was also buggy. 
Here's the fix: https://github.com/zaneb/heat-convergence-prototype/commit/786f367210ca0acf9eb22bea78fd9d51941b0e40 It was difficult, for me personally, to completely understand Zane's PoC and how it would lay the foundation for the aforementioned design goals. It would be very helpful to have Zane's understanding here. I could understand that there are ideas like async message passing and notifying the parent which we also subscribe to. So I guess the thing to note is that there are essentially two parts to my PoC: 1) A simulation framework that takes what will be, in the final implementation, multiple tasks running in parallel in separate processes and talking to a database, and replaces it with an event loop that runs the tasks sequentially in a single process with an in-memory data store. I could have built a more realistic simulator using Celery or something, but I preferred this way as it offers deterministic tests. 2) A toy implementation of Heat on top of this framework. The files map roughly to Heat something like this:

    converge.engine       - heat.engine.service
    converge.stack        - heat.engine.stack
    converge.resource     - heat.engine.resource
    converge.template     - heat.engine.template
    converge.dependencies - actually is heat.engine.dependencies
    converge.sync_point   - no equivalent
    converge.converger    - no equivalent (this is the convergence worker)
    converge.reality      - represents the actual OpenStack services

For convenience, I just use the @asynchronous decorator to turn an ordinary method call into a simulated message. The concept is essentially as follows: At the start of a stack update (creates and deletes are also just updates) we create any new resources in the DB and calculate the dependency graph for the update from the data in the DB and template. This graph is the same one used by updates in Heat currently, so it contains both the forward and reverse (cleanup) dependencies. The stack update then kicks off checks of all the leaf nodes, passing the pre-calculated dependency graph.
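The simulation framework in part (1) can be sketched roughly as follows - a decorator that queues calls instead of sending messages, and an event loop that drains the queue one task at a time. The names here are illustrative, not the PoC's actual API:

```python
# Rough sketch of the deterministic simulator idea: @asynchronous turns a
# direct call into a queued "message", and run_all() plays the queue back
# sequentially in one process, so tests are fully deterministic.
import collections
import functools

_queue = collections.deque()

def asynchronous(func):
    @functools.wraps(func)
    def defer(*args, **kwargs):
        _queue.append((func, args, kwargs))  # queue instead of executing
    return defer

def run_all():
    while _queue:
        func, args, kwargs = _queue.popleft()
        func(*args, **kwargs)

log = []

@asynchronous
def check_resource(name):
    log.append(name)
    if name == "A":            # A triggers its dependent B
        check_resource("B")    # this call is itself deferred

check_resource("A")
assert log == []               # nothing ran yet; the call was only queued
run_all()
print(log)                     # ['A', 'B']
```

Because tasks never interleave, race conditions can't be observed - which is exactly the trade-off described: you have to imagine where the locks would go, in exchange for repeatable tests.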
Each resource check may result in a call to the create(), update() or delete() methods of a Resource plugin. The resource also reads any attributes that will be required from it. Once this is complete, it triggers any dependent resources that are ready, or updates a SyncPoint in the database if there are dependent resources that have multiple requirements. The message triggering the next resource will contain the dependency graph again, as well as the RefIds and required attributes of any resources it depends on. The new dependencies thus created are added to the resource itself in the database at the time it is checked, allowing it to record the changes caused by a requirement being unexpectedly replaced without needing a global lock on anything. When cleaning up resources, we also endeavour to remove any that are successfully deleted from the dependencies graph. Each traversal has a unique ID that is both stored in the stack and passed down through the resource check triggers. (At present this is the template ID, but it may make more sense to have a unique ID since old template
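The per-traversal ID described above gives workers a cheap staleness check: a trigger carries the traversal ID it was issued under, and is simply dropped if the stack has since started a newer traversal. A minimal sketch (data structures are assumptions, not the PoC's):

```python
# Sketch of the traversal-ID check: each trigger message carries the
# traversal it belongs to; a worker ignores the message when the stack
# has already moved on to a newer traversal.
current_traversal = {"stack-1": "t-2"}   # stand-in for the stack row

def handle_trigger(stack_id, traversal_id, resource):
    if current_traversal.get(stack_id) != traversal_id:
        return "ignored (stale traversal)"
    return "checking %s" % resource

assert handle_trigger("stack-1", "t-1", "server") == "ignored (stale traversal)"
assert handle_trigger("stack-1", "t-2", "server") == "checking server"
```

This is what makes update-during-update safe to start at any time: in-flight work from the superseded traversal dies out on its own.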
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On Tue, Dec 2, 2014 at 1:49 AM, Clint Byrum cl...@fewbar.com wrote: Excerpts from Anant Patil's message of 2014-11-30 23:02:29 -0800: On 27-Nov-14 18:03, Murugan, Visnusaran wrote: Hi Zane, At this stage our implementation (as mentioned in the wiki https://wiki.openstack.org/wiki/Heat/ConvergenceDesign) achieves your design goals. 1. In case of a parallel update, our implementation adjusts the graph according to the new template and waits for dispatched resource tasks to complete. 2. Reason for basing our PoC on Heat code: a. To solve contention when processing a parent resource by all dependent resources in parallel. b. To avoid porting issues from PoC to HeatBase. (just to be aware of potential issues asap) 3. Resource timeout would be helpful, but I guess it's resource specific and has to come from the template, with default values from plugins. 4. We see resource notification aggregation and processing the next level of resources without contention and with minimal DB usage as the problem area. We are working on the following approaches in *parallel.* a. Use a Queue per stack to serialize notifications. b. Get the parent ProcessLog (ResourceID, EngineID) and initiate convergence upon the first child notification. Subsequent children who fail to get the parent resource lock will directly send a message to the waiting parent task (topic=stack_id.parent_resource_id) Based on performance/feedback we can select either one or a mashed version. Advantages: 1. Failed Resource tasks can be re-initiated after a ProcessLog table lookup. 2. One worker == one resource. 3. Supports concurrent updates. 4. Delete == update with an empty stack. 5. Rollback == update to the previous known good/completed stack. Disadvantages: 1. Still holds the stackLock (WIP to remove with ProcessLog) Completely understand your concern about reviewing our code, since commits are numerous and there are changes of course in places. Our start commit is [c1b3eb22f7ab6ea60b095f88982247dd249139bf] though this might not help :) Your Thoughts. Happy Thanksgiving. Vishnu.
*From:*Angus Salkeld [mailto:asalk...@mirantis.com] *Sent:* Thursday, November 27, 2014 9:46 AM *To:* OpenStack Development Mailing List (not for usage questions) *Subject:* Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On Thu, Nov 27, 2014 at 12:20 PM, Zane Bitter zbit...@redhat.com mailto:zbit...@redhat.com wrote: A bunch of us have spent the last few weeks working independently on proof of concept designs for the convergence architecture. I think those efforts have now reached a sufficient level of maturity that we should start working together on synthesising them into a plan that everyone can forge ahead with. As a starting point I'm going to summarise my take on the three efforts; hopefully the authors of the other two will weigh in to give us their perspective. Zane's Proposal === https://github.com/zaneb/heat-convergence-prototype/tree/distributed-graph I implemented this as a simulator of the algorithm rather than using the Heat codebase itself in order to be able to iterate rapidly on the design, and indeed I have changed my mind many, many times in the process of implementing it. Its notable departure from a realistic simulation is that it runs only one operation at a time - essentially giving up the ability to detect race conditions in exchange for a completely deterministic test framework. You just have to imagine where the locks need to be. Incidentally, the test framework is designed so that it can easily be ported to the actual Heat code base as functional tests so that the same scenarios could be used without modification, allowing us to have confidence that the eventual implementation is a faithful replication of the simulation (which can be rapidly experimented on, adjusted and tested when we inevitably run into implementation issues). This is a complete implementation of Phase 1 (i.e. using existing resource plugins), including update-during-update, resource clean-up, replace on update and rollback; with tests. 
Some of the design goals which were successfully incorporated: - Minimise changes to Heat (it's essentially a distributed version of the existing algorithm), and in particular to the database - Work with the existing plugin API - Limit total DB access for Resource/Stack to O(n) in the number of resources - Limit overall DB access to O(m) in the number of edges - Limit lock contention
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
Yes, that's the synchronization block for which we use the stack lock. Currently, a thread spin waits to acquire the lock to enter this critical section. I don't really know how to do application level transaction. Is there an external library for that? AFAIK, we cannot switch from DB transaction to application level to do some computation to resolve next set of resources, submit to async queue, and again continue with the DB transaction. The transnational method looks attractive and clean to me but I am limited by my knowledge.Or do you mean to have DB transaction object made available via DB APIs and use them in the application? Please share your thoughts. To avoid locking the stack, we were thinking of designating a single engine with responsibility of processing all notifications for a stack. All the workers will notify on the stack topic, which only one engine listens to and then the notifications end up in a queue (local to the engine, per stack), from where they are taken up one-by-one to continue the stack operation. The convergence jobs are produced by this engine for the stack it is responsible for, and they might end-up in any of the engines. But the notifications for a stack are directed to one engine to avoid contention for lock. The convergence load is leveled and distributed, and stack lock is not needed. - Anant On Tue, Dec 2, 2014 at 1:49 AM, Clint Byrum cl...@fewbar.com wrote: Excerpts from Anant Patil's message of 2014-11-30 23:02:29 -0800: On 27-Nov-14 18:03, Murugan, Visnusaran wrote: Hi Zane, At this stage our implementation (as mentioned in wiki https://wiki.openstack.org/wiki/Heat/ConvergenceDesign) achieves your design goals. 1. In case of a parallel update, our implementation adjusts graph according to new template and waits for dispatched resource tasks to complete. 2. Reason for basing our PoC on Heat code: a. To solve contention processing parent resource by all dependent resources in parallel. b. 
To avoid porting issues from PoC to HeatBase (just to be aware of potential issues asap). 3. Resource timeouts would be helpful, but I guess they are resource-specific and have to come from the template, with default values from plugins. 4. We see resource notification aggregation and processing the next level of resources without contention and with minimal DB usage as the problem area. We are working on the following approaches in *parallel*: a. Use a queue per stack to serialize notifications. b. Get the parent ProcessLog (ResourceID, EngineID) and initiate convergence upon the first child notification. Subsequent children who fail to get the parent resource lock will directly send a message to the waiting parent task (topic=stack_id.parent_resource_id). Based on performance/feedback we can select either, or a mashed version. Advantages: 1. Failed resource tasks can be re-initiated after a ProcessLog table lookup. 2. One worker == one resource. 3. Supports concurrent updates. 4. Delete == update with empty stack. 5. Rollback == update to previous known good/completed stack. Disadvantages: 1. Still holds the stack lock (WIP to remove with ProcessLog). Completely understand your concern about reviewing our code, since commits are numerous and there are changes of course in places. Our start commit is [c1b3eb22f7ab6ea60b095f88982247dd249139bf], though this might not help ☺ Your thoughts? Happy Thanksgiving. Vishnu. *From:* Angus Salkeld [mailto:asalk...@mirantis.com] *Sent:* Thursday, November 27, 2014 9:46 AM *To:* OpenStack Development Mailing List (not for usage questions) *Subject:* Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On Thu, Nov 27, 2014 at 12:20 PM, Zane Bitter zbit...@redhat.com wrote: A bunch of us have spent the last few weeks working independently on proof of concept designs for the convergence architecture.
I think those efforts have now reached a sufficient level of maturity that we should start working together on synthesising them into a plan that everyone can forge ahead with. As a starting point I'm going to summarise my take on the three efforts; hopefully the authors of the other two will weigh in to give us their perspective. Zane's Proposal === https://github.com/zaneb/heat-convergence-prototype/tree/distributed-graph I implemented this as a simulator of the algorithm rather than using the Heat codebase itself in order to be able to iterate rapidly on the design, and indeed I have changed my mind many, many times in the process of implementing it. Its notable departure from a realistic
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
Excerpts from Anant Patil's message of 2014-12-02 10:37:31 -0800: Yes, that's the synchronization block for which we use the stack lock. Currently, a thread spin-waits to acquire the lock to enter this critical section. I don't really know how to do application-level transactions. Is there an external library for that? AFAIK, we cannot switch from a DB transaction to the application level to do some computation to resolve the next set of resources, submit to an async queue, and again continue with the DB transaction. The transactional method looks attractive and clean to me, but I am limited by my knowledge. Or do you mean to have the DB transaction object made available via DB APIs and use them in the application? Please share your thoughts. Every DB request has a context attached to the currently running greenthread which has the DB session object. So yes, you do begin the transaction in one db API call, and try to commit it in another after having attempted any application logic. The whole thing should always be in a try/except retry loop to handle deadlocks and conflicts introduced by multi-master sync replication like Galera. For non-transactional backends, they would have to use a spin lock like you're using now. To avoid locking the stack, we were thinking of designating a single engine with responsibility for processing all notifications for a stack. All the workers will notify on the stack topic, which only one engine listens to, and then the notifications end up in a queue (local to the engine, per stack), from where they are taken up one-by-one to continue the stack operation. The convergence jobs are produced by this engine for the stack it is responsible for, and they might end up in any of the engines. But the notifications for a stack are directed to one engine to avoid contention for the lock. The convergence load is leveled and distributed, and the stack lock is not needed. No, don't do that.
You already have that engine: it is the database engine. It is intended to be used for synchronization, and it will achieve a higher degree of concurrency in bigger stacks because it will only block two workers when they try to inspect or change the same rows.
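Clint's begin/attempt/commit/retry pattern might look roughly like this (a minimal sketch; `run_in_transaction` and the sqlite3 stand-in are illustrative, not Heat or oslo.db API — real code would catch the driver's specific deadlock/conflict exception):

```python
import sqlite3

def run_in_transaction(conn, fn, max_retries=5):
    """Begin a transaction, run application logic, try to commit; on a
    conflict (e.g. a Galera certification failure or a deadlock), roll
    back and retry the whole unit of work from the start."""
    for attempt in range(max_retries):
        try:
            with conn:  # BEGIN ... COMMIT, or ROLLBACK on exception
                return fn(conn)
        except sqlite3.OperationalError:
            if attempt == max_retries - 1:
                raise  # give up after max_retries attempts

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resource (id INTEGER PRIMARY KEY, state TEXT)")

def mark_complete(c):
    # Application logic runs inside the transaction; if the commit
    # conflicts, the whole function is simply re-run.
    c.execute("INSERT INTO resource (state) VALUES ('COMPLETE')")
    return c.execute("SELECT count(*) FROM resource").fetchone()[0]

print(run_in_transaction(conn, mark_complete))  # 1
```

The point is that the retry loop wraps the computation as well as the DB calls, so workers only ever block (or retry) when they actually touch the same rows.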
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
Excerpts from Anant Patil's message of 2014-11-30 23:02:29 -0800: On 27-Nov-14 18:03, Murugan, Visnusaran wrote: Hi Zane, At this stage our implementation (as mentioned in wiki https://wiki.openstack.org/wiki/Heat/ConvergenceDesign) achieves your design goals. 1. In case of a parallel update, our implementation adjusts the graph according to the new template and waits for dispatched resource tasks to complete. 2. Reasons for basing our PoC on Heat code: a. To solve contention in processing the parent resource by all dependent resources in parallel. b. To avoid porting issues from PoC to HeatBase (just to be aware of potential issues asap). 3. Resource timeouts would be helpful, but I guess they are resource-specific and have to come from the template, with default values from plugins. 4. We see resource notification aggregation and processing the next level of resources without contention and with minimal DB usage as the problem area. We are working on the following approaches in *parallel*: a. Use a queue per stack to serialize notifications. b. Get the parent ProcessLog (ResourceID, EngineID) and initiate convergence upon the first child notification. Subsequent children who fail to get the parent resource lock will directly send a message to the waiting parent task (topic=stack_id.parent_resource_id). Based on performance/feedback we can select either, or a mashed version. Advantages: 1. Failed resource tasks can be re-initiated after a ProcessLog table lookup. 2. One worker == one resource. 3. Supports concurrent updates. 4. Delete == update with empty stack. 5. Rollback == update to previous known good/completed stack. Disadvantages: 1. Still holds the stack lock (WIP to remove with ProcessLog). Completely understand your concern about reviewing our code, since commits are numerous and there are changes of course in places. Our start commit is [c1b3eb22f7ab6ea60b095f88982247dd249139bf], though this might not help ☺ Your thoughts? Happy Thanksgiving. Vishnu.
*From:* Angus Salkeld [mailto:asalk...@mirantis.com] *Sent:* Thursday, November 27, 2014 9:46 AM *To:* OpenStack Development Mailing List (not for usage questions) *Subject:* Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On Thu, Nov 27, 2014 at 12:20 PM, Zane Bitter zbit...@redhat.com wrote: A bunch of us have spent the last few weeks working independently on proof of concept designs for the convergence architecture. I think those efforts have now reached a sufficient level of maturity that we should start working together on synthesising them into a plan that everyone can forge ahead with. As a starting point I'm going to summarise my take on the three efforts; hopefully the authors of the other two will weigh in to give us their perspective. Zane's Proposal === https://github.com/zaneb/heat-convergence-prototype/tree/distributed-graph I implemented this as a simulator of the algorithm rather than using the Heat codebase itself in order to be able to iterate rapidly on the design, and indeed I have changed my mind many, many times in the process of implementing it. Its notable departure from a realistic simulation is that it runs only one operation at a time - essentially giving up the ability to detect race conditions in exchange for a completely deterministic test framework. You just have to imagine where the locks need to be. Incidentally, the test framework is designed so that it can easily be ported to the actual Heat code base as functional tests so that the same scenarios could be used without modification, allowing us to have confidence that the eventual implementation is a faithful replication of the simulation (which can be rapidly experimented on, adjusted and tested when we inevitably run into implementation issues). This is a complete implementation of Phase 1 (i.e. using existing resource plugins), including update-during-update, resource clean-up, replace on update and rollback; with tests.
Some of the design goals which were successfully incorporated: - Minimise changes to Heat (it's essentially a distributed version of the existing algorithm), and in particular to the database - Work with the existing plugin API - Limit total DB access for Resource/Stack to O(n) in the number of resources - Limit overall DB access to O(m) in the number of edges - Limit lock contention to only those operations actually contending (i.e. no global locks) - Each worker task deals with only one resource - Only read
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 27-Nov-14 18:03, Murugan, Visnusaran wrote: Hi Zane, At this stage our implementation (as mentioned in wiki https://wiki.openstack.org/wiki/Heat/ConvergenceDesign) achieves your design goals. 1. In case of a parallel update, our implementation adjusts the graph according to the new template and waits for dispatched resource tasks to complete. 2. Reasons for basing our PoC on Heat code: a. To solve contention in processing the parent resource by all dependent resources in parallel. b. To avoid porting issues from PoC to HeatBase (just to be aware of potential issues asap). 3. Resource timeouts would be helpful, but I guess they are resource-specific and have to come from the template, with default values from plugins. 4. We see resource notification aggregation and processing the next level of resources without contention and with minimal DB usage as the problem area. We are working on the following approaches in *parallel*: a. Use a queue per stack to serialize notifications. b. Get the parent ProcessLog (ResourceID, EngineID) and initiate convergence upon the first child notification. Subsequent children who fail to get the parent resource lock will directly send a message to the waiting parent task (topic=stack_id.parent_resource_id). Based on performance/feedback we can select either, or a mashed version. Advantages: 1. Failed resource tasks can be re-initiated after a ProcessLog table lookup. 2. One worker == one resource. 3. Supports concurrent updates. 4. Delete == update with empty stack. 5. Rollback == update to previous known good/completed stack. Disadvantages: 1. Still holds the stack lock (WIP to remove with ProcessLog). Completely understand your concern about reviewing our code, since commits are numerous and there are changes of course in places. Our start commit is [c1b3eb22f7ab6ea60b095f88982247dd249139bf], though this might not help ☺ Your thoughts? Happy Thanksgiving. Vishnu.
*From:* Angus Salkeld [mailto:asalk...@mirantis.com] *Sent:* Thursday, November 27, 2014 9:46 AM *To:* OpenStack Development Mailing List (not for usage questions) *Subject:* Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On Thu, Nov 27, 2014 at 12:20 PM, Zane Bitter zbit...@redhat.com wrote: A bunch of us have spent the last few weeks working independently on proof of concept designs for the convergence architecture. I think those efforts have now reached a sufficient level of maturity that we should start working together on synthesising them into a plan that everyone can forge ahead with. As a starting point I'm going to summarise my take on the three efforts; hopefully the authors of the other two will weigh in to give us their perspective. Zane's Proposal === https://github.com/zaneb/heat-convergence-prototype/tree/distributed-graph I implemented this as a simulator of the algorithm rather than using the Heat codebase itself in order to be able to iterate rapidly on the design, and indeed I have changed my mind many, many times in the process of implementing it. Its notable departure from a realistic simulation is that it runs only one operation at a time - essentially giving up the ability to detect race conditions in exchange for a completely deterministic test framework. You just have to imagine where the locks need to be. Incidentally, the test framework is designed so that it can easily be ported to the actual Heat code base as functional tests so that the same scenarios could be used without modification, allowing us to have confidence that the eventual implementation is a faithful replication of the simulation (which can be rapidly experimented on, adjusted and tested when we inevitably run into implementation issues). This is a complete implementation of Phase 1 (i.e. using existing resource plugins), including update-during-update, resource clean-up, replace on update and rollback; with tests.
Some of the design goals which were successfully incorporated: - Minimise changes to Heat (it's essentially a distributed version of the existing algorithm), and in particular to the database - Work with the existing plugin API - Limit total DB access for Resource/Stack to O(n) in the number of resources - Limit overall DB access to O(m) in the number of edges - Limit lock contention to only those operations actually contending (i.e. no global locks) - Each worker task deals with only one resource - Only read resource attributes once Open questions: - What do we do when we encounter a resource that is in progress from a previous update while doing a subsequent update? Obviously we don't
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On 11/27/2014 at 5:15 AM, Angus Salkeld wrote: On Thu, Nov 27, 2014 at 12:20 PM, Zane Bitter zbit...@redhat.com wrote: A bunch of us have spent the last few weeks working independently on proof of concept designs for the convergence architecture. I think those efforts have now reached a sufficient level of maturity that we should start working together on synthesising them into a plan that everyone can forge ahead with. As a starting point I'm going to summarise my take on the three efforts; hopefully the authors of the other two will weigh in to give us their perspective. Zane's Proposal === https://github.com/zaneb/heat-convergence-prototype/tree/distributed-graph I implemented this as a simulator of the algorithm rather than using the Heat codebase itself in order to be able to iterate rapidly on the design, and indeed I have changed my mind many, many times in the process of implementing it. Its notable departure from a realistic simulation is that it runs only one operation at a time - essentially giving up the ability to detect race conditions in exchange for a completely deterministic test framework. You just have to imagine where the locks need to be. Incidentally, the test framework is designed so that it can easily be ported to the actual Heat code base as functional tests so that the same scenarios could be used without modification, allowing us to have confidence that the eventual implementation is a faithful replication of the simulation (which can be rapidly experimented on, adjusted and tested when we inevitably run into implementation issues). This is a complete implementation of Phase 1 (i.e. using existing resource plugins), including update-during-update, resource clean-up, replace on update and rollback; with tests.
Some of the design goals which were successfully incorporated: - Minimise changes to Heat (it's essentially a distributed version of the existing algorithm), and in particular to the database - Work with the existing plugin API - Limit total DB access for Resource/Stack to O(n) in the number of resources - Limit overall DB access to O(m) in the number of edges - Limit lock contention to only those operations actually contending (i.e. no global locks) - Each worker task deals with only one resource - Only read resource attributes once Open questions: - What do we do when we encounter a resource that is in progress from a previous update while doing a subsequent update? Obviously we don't want to interrupt it, as it will likely be left in an unknown state. Making a replacement is one obvious answer, but in many cases there could be serious down-sides to that. How long should we wait before trying it? What if it's still in progress because the engine processing the resource already died? Also, how do we implement resource level timeouts in general? Michał's Proposal = https://github.com/inc0/heat-convergence-prototype/tree/iterative Note that a version modified by me to use the same test scenario format (but not the same scenarios) is here: https://github.com/zaneb/heat-convergence-prototype/tree/iterative-adapted This is based on my simulation framework after a fashion, but with everything implemented synchronously and a lot of handwaving about how the actual implementation could be distributed. The central premise is that at each step of the algorithm, the entire graph is examined for tasks that can be performed next, and those are then started. Once all are complete (it's synchronous, remember), the next step is run.
Keen observers will be asking how we know when it is time to run the next step in a distributed version of this algorithm, where it will be run and what to do about resources that are in an intermediate state at that time. All of these questions remain unanswered. Q1. When is it time to run the next step? There is really no next step or previous step; there is the step which can be run *right now*, which effectively means when the current step finishes, because then and only then are all requirements met and we can proceed. Q2. I can see 3 main processes there: * Converger (I'd assume that's the current engine): a process which parses the graph and schedules next steps; it will be run after a change in reality is detected. * Observer: a process which keeps reality_db and actual resources aligned. It's mostly for phase 2. This one
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
Hi Zane, At this stage our implementation (as mentioned in wiki https://wiki.openstack.org/wiki/Heat/ConvergenceDesign) achieves your design goals. 1. In case of a parallel update, our implementation adjusts the graph according to the new template and waits for dispatched resource tasks to complete. 2. Reasons for basing our PoC on Heat code: a. To solve contention in processing the parent resource by all dependent resources in parallel. b. To avoid porting issues from PoC to HeatBase (just to be aware of potential issues asap). 3. Resource timeouts would be helpful, but I guess they are resource-specific and have to come from the template, with default values from plugins. 4. We see resource notification aggregation and processing the next level of resources without contention and with minimal DB usage as the problem area. We are working on the following approaches in parallel: a. Use a queue per stack to serialize notifications. b. Get the parent ProcessLog (ResourceID, EngineID) and initiate convergence upon the first child notification. Subsequent children who fail to get the parent resource lock will directly send a message to the waiting parent task (topic=stack_id.parent_resource_id). Based on performance/feedback we can select either, or a mashed version. Advantages: 1. Failed resource tasks can be re-initiated after a ProcessLog table lookup. 2. One worker == one resource. 3. Supports concurrent updates. 4. Delete == update with empty stack. 5. Rollback == update to previous known good/completed stack. Disadvantages: 1. Still holds the stack lock (WIP to remove with ProcessLog). Completely understand your concern about reviewing our code, since commits are numerous and there are changes of course in places. Our start commit is [c1b3eb22f7ab6ea60b095f88982247dd249139bf], though this might not help ☺ Your thoughts? Happy Thanksgiving. Vishnu.
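The queue-per-stack serialization in (4a) can be sketched like this (illustrative names and in-process queues; the real PoC would dispatch over RPC topics, not local `queue.Queue` objects):

```python
import queue
from collections import defaultdict

class StackNotifications:
    """One FIFO queue per stack: workers on any engine enqueue resource
    notifications, but a single consumer drains each stack's queue one
    notification at a time, so stack-level bookkeeping needs no lock."""

    def __init__(self):
        self.queues = defaultdict(queue.Queue)

    def notify(self, stack_id, resource, status):
        # Any worker can post; ordering per stack is preserved by the queue.
        self.queues[stack_id].put((resource, status))

    def drain(self, stack_id):
        """Process all pending notifications for one stack, in order."""
        processed = []
        q = self.queues[stack_id]
        while not q.empty():
            processed.append(q.get())
        return processed

n = StackNotifications()
n.notify("stack-1", "server", "COMPLETE")
n.notify("stack-1", "volume", "COMPLETE")
print(n.drain("stack-1"))  # [('server', 'COMPLETE'), ('volume', 'COMPLETE')]
```

This is the design Clint argues against below: the single consumer removes lock contention, but at the cost of funnelling every notification for a stack through one engine.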
From: Angus Salkeld [mailto:asalk...@mirantis.com] Sent: Thursday, November 27, 2014 9:46 AM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown On Thu, Nov 27, 2014 at 12:20 PM, Zane Bitter zbit...@redhat.com wrote: A bunch of us have spent the last few weeks working independently on proof of concept designs for the convergence architecture. I think those efforts have now reached a sufficient level of maturity that we should start working together on synthesising them into a plan that everyone can forge ahead with. As a starting point I'm going to summarise my take on the three efforts; hopefully the authors of the other two will weigh in to give us their perspective. Zane's Proposal === https://github.com/zaneb/heat-convergence-prototype/tree/distributed-graph I implemented this as a simulator of the algorithm rather than using the Heat codebase itself in order to be able to iterate rapidly on the design, and indeed I have changed my mind many, many times in the process of implementing it. Its notable departure from a realistic simulation is that it runs only one operation at a time - essentially giving up the ability to detect race conditions in exchange for a completely deterministic test framework. You just have to imagine where the locks need to be. Incidentally, the test framework is designed so that it can easily be ported to the actual Heat code base as functional tests so that the same scenarios could be used without modification, allowing us to have confidence that the eventual implementation is a faithful replication of the simulation (which can be rapidly experimented on, adjusted and tested when we inevitably run into implementation issues). This is a complete implementation of Phase 1 (i.e. using existing resource plugins), including update-during-update, resource clean-up, replace on update and rollback; with tests.
Some of the design goals which were successfully incorporated: - Minimise changes to Heat (it's essentially a distributed version of the existing algorithm), and in particular to the database - Work with the existing plugin API - Limit total DB access for Resource/Stack to O(n) in the number of resources - Limit overall DB access to O(m) in the number of edges - Limit lock contention to only those operations actually contending (i.e. no global locks) - Each worker task deals with only one resource - Only read resource attributes once Open questions: - What do we do when we encounter a resource that is in progress from a previous update while doing a subsequent update? Obviously we don't want to interrupt it, as it will likely be left in an unknown state. Making a replacement is one obvious answer, but in many cases there could be serious down-sides to that. How long should we wait before trying it? What if it's still in progress because the engine processing the resource already died? Also, how do we implement resource level
[openstack-dev] [Heat] Convergence proof-of-concept showdown
A bunch of us have spent the last few weeks working independently on proof of concept designs for the convergence architecture. I think those efforts have now reached a sufficient level of maturity that we should start working together on synthesising them into a plan that everyone can forge ahead with. As a starting point I'm going to summarise my take on the three efforts; hopefully the authors of the other two will weigh in to give us their perspective. Zane's Proposal === https://github.com/zaneb/heat-convergence-prototype/tree/distributed-graph I implemented this as a simulator of the algorithm rather than using the Heat codebase itself in order to be able to iterate rapidly on the design, and indeed I have changed my mind many, many times in the process of implementing it. Its notable departure from a realistic simulation is that it runs only one operation at a time - essentially giving up the ability to detect race conditions in exchange for a completely deterministic test framework. You just have to imagine where the locks need to be. Incidentally, the test framework is designed so that it can easily be ported to the actual Heat code base as functional tests so that the same scenarios could be used without modification, allowing us to have confidence that the eventual implementation is a faithful replication of the simulation (which can be rapidly experimented on, adjusted and tested when we inevitably run into implementation issues). This is a complete implementation of Phase 1 (i.e. using existing resource plugins), including update-during-update, resource clean-up, replace on update and rollback; with tests. 
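The "one operation at a time" discipline Zane describes might be sketched as follows (hypothetical names; the actual prototype lives at the linked repo and differs in detail):

```python
from collections import deque

class Simulator:
    """Run queued tasks strictly one at a time. Tasks may schedule
    further tasks, but nothing ever runs concurrently, so every
    scenario is deterministic and race detection is out of scope
    by construction - you have to imagine where the locks go."""

    def __init__(self):
        self.pending = deque()
        self.log = []

    def call(self, func, *args):
        self.pending.append((func, args))

    def run(self):
        while self.pending:
            func, args = self.pending.popleft()
            func(self, *args)  # a task may call sim.call() to queue more

def create(sim, name):
    sim.log.append("create %s" % name)
    if name == "server":
        sim.call(create, "port")  # dependent work runs later, in order

sim = Simulator()
sim.call(create, "server")
sim.call(create, "volume")
sim.run()
print(sim.log)  # ['create server', 'create volume', 'create port']
```

Because the task order is fully determined by the queue, the same scenario file can later be replayed as a functional test against the real (concurrent) implementation.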
Some of the design goals which were successfully incorporated: - Minimise changes to Heat (it's essentially a distributed version of the existing algorithm), and in particular to the database - Work with the existing plugin API - Limit total DB access for Resource/Stack to O(n) in the number of resources - Limit overall DB access to O(m) in the number of edges - Limit lock contention to only those operations actually contending (i.e. no global locks) - Each worker task deals with only one resource - Only read resource attributes once Open questions: - What do we do when we encounter a resource that is in progress from a previous update while doing a subsequent update? Obviously we don't want to interrupt it, as it will likely be left in an unknown state. Making a replacement is one obvious answer, but in many cases there could be serious down-sides to that. How long should we wait before trying it? What if it's still in progress because the engine processing the resource already died? Michał's Proposal = https://github.com/inc0/heat-convergence-prototype/tree/iterative Note that a version modified by me to use the same test scenario format (but not the same scenarios) is here: https://github.com/zaneb/heat-convergence-prototype/tree/iterative-adapted This is based on my simulation framework after a fashion, but with everything implemented synchronously and a lot of handwaving about how the actual implementation could be distributed. The central premise is that at each step of the algorithm, the entire graph is examined for tasks that can be performed next, and those are then started. Once all are complete (it's synchronous, remember), the next step is run. Keen observers will be asking how we know when it is time to run the next step in a distributed version of this algorithm, where it will be run and what to do about resources that are in an intermediate state at that time. All of these questions remain unanswered. 
A non-exhaustive list of concerns I have: - Replace on update is not implemented yet - AFAIK rollback is not implemented yet - The simulation doesn't actually implement the proposed architecture - This approach is punishingly heavy on the database - O(n^2) or worse - A lot of phase 2 is mixed in with phase 1 here, making it difficult to evaluate which changes need to be made first and whether this approach works with existing plugins - The code is not really based on how Heat works at the moment, so there would be either a major redesign required or lots of radical changes in Heat or both I think there's a fair chance that given another 3-4 weeks to work on this, all of these issues and others could probably be resolved. The question for me at this point is not so much if but why. Michał believes that this approach will make Phase 2 easier to implement, which is a valid reason to consider it. However, I'm not aware of any particular issues that my approach would cause in implementing phase 2 (note that I have barely looked into it at all though). In fact, I very much want Phase 2 to be entirely encapsulated by the Resource class, so that the plugin type (legacy vs. convergence-enabled) is transparent to the rest of the system. Only in this way can we be sure that we'll be
Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
On Thu, Nov 27, 2014 at 12:20 PM, Zane Bitter zbit...@redhat.com wrote: A bunch of us have spent the last few weeks working independently on proof of concept designs for the convergence architecture. I think those efforts have now reached a sufficient level of maturity that we should start working together on synthesising them into a plan that everyone can forge ahead with. As a starting point I'm going to summarise my take on the three efforts; hopefully the authors of the other two will weigh in to give us their perspective. Zane's Proposal === https://github.com/zaneb/heat-convergence-prototype/tree/distributed-graph I implemented this as a simulator of the algorithm rather than using the Heat codebase itself in order to be able to iterate rapidly on the design, and indeed I have changed my mind many, many times in the process of implementing it. Its notable departure from a realistic simulation is that it runs only one operation at a time - essentially giving up the ability to detect race conditions in exchange for a completely deterministic test framework. You just have to imagine where the locks need to be. Incidentally, the test framework is designed so that it can easily be ported to the actual Heat code base as functional tests so that the same scenarios could be used without modification, allowing us to have confidence that the eventual implementation is a faithful replication of the simulation (which can be rapidly experimented on, adjusted and tested when we inevitably run into implementation issues). This is a complete implementation of Phase 1 (i.e. using existing resource plugins), including update-during-update, resource clean-up, replace on update and rollback; with tests. 
Some of the design goals which were successfully incorporated: - Minimise changes to Heat (it's essentially a distributed version of the existing algorithm), and in particular to the database - Work with the existing plugin API - Limit total DB access for Resource/Stack to O(n) in the number of resources - Limit overall DB access to O(m) in the number of edges - Limit lock contention to only those operations actually contending (i.e. no global locks) - Each worker task deals with only one resource - Only read resource attributes once Open questions: - What do we do when we encounter a resource that is in progress from a previous update while doing a subsequent update? Obviously we don't want to interrupt it, as it will likely be left in an unknown state. Making a replacement is one obvious answer, but in many cases there could be serious down-sides to that. How long should we wait before trying it? What if it's still in progress because the engine processing the resource already died? Also, how do we implement resource level timeouts in general? Michał's Proposal = https://github.com/inc0/heat-convergence-prototype/tree/iterative Note that a version modified by me to use the same test scenario format (but not the same scenarios) is here: https://github.com/zaneb/heat-convergence-prototype/tree/iterative-adapted This is based on my simulation framework after a fashion, but with everything implemented synchronously and a lot of handwaving about how the actual implementation could be distributed. The central premise is that at each step of the algorithm, the entire graph is examined for tasks that can be performed next, and those are then started. Once all are complete (it's synchronous, remember), the next step is run. Keen observers will be asking how we know when it is time to run the next step in a distributed version of this algorithm, where it will be run and what to do about resources that are in an intermediate state at that time. 
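The central premise - at each step, examine the entire graph for tasks whose requirements are met - could be sketched as below (an illustrative reduction; the linked prototype differs in detail, and this whole-graph scan per step is exactly what makes the approach heavy on the database):

```python
def converge(deps):
    """deps maps each resource to the set of resources it requires.
    Each step collects every resource whose requirements are already
    done, 'runs' that batch, then rescans the whole graph and repeats
    until everything is done."""
    done, steps = set(), []
    while len(done) < len(deps):
        ready = sorted(r for r, req in deps.items()
                       if r not in done and req <= done)
        if not ready:
            raise ValueError("dependency cycle")
        steps.append(ready)   # this batch can run in parallel
        done.update(ready)    # next scan sees these as complete
    return steps

# diamond: net -> {subnet, sec_group} -> server
deps = {"net": set(),
        "subnet": {"net"},
        "sec_group": {"net"},
        "server": {"subnet", "sec_group"}}
print(converge(deps))  # [['net'], ['sec_group', 'subnet'], ['server']]
```

In a distributed version, the open questions Zane raises all land on the rescan: who performs it, when a "step" is known to be finished, and what to count as done for resources still in flight.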
All of these questions remain unanswered. Yes, I was struggling to figure out how it could manage an IN_PROGRESS state, as it's stateless, so you end up treading on the other action's toes. Assuming we use the resource's state (IN_PROGRESS) you could get around that. Then you kick off a converge whenever an action completes (if there is nothing new to be done then do nothing). A non-exhaustive list of concerns I have: - Replace on update is not implemented yet - AFAIK rollback is not implemented yet - The simulation doesn't actually implement the proposed architecture - This approach is punishingly heavy on the database - O(n^2) or worse Yes, re-reading the state of all resources whenever we run a new converge is worrying, but I think Michal had some ideas to minimize this. - A lot of phase 2 is mixed in with phase 1 here, making it difficult to evaluate which changes need to be made first and whether this approach works with existing plugins - The code is not really based on how Heat works at the moment, so there would be either a major redesign required or lots of radical changes in Heat or both I think there's a fair chance that given another 3-4 weeks to work on