Re: [Openstack] New nova service proposal
I would think we have enough tracking information to support the goal of identifying failures. In any scenario, some of the failures will simply be unrecoverable.

Regarding the process crashing, who's to say the retry process wouldn't also crash? We could argue endlessly that the arbiter/watchdog processes will crash at each tier. As such, I think it's better to say we need a simpler mechanism for identifying failures and perhaps a best-effort retry.

Retrying can be scary, to say the least. You can't possibly handle all of the possible failure scenarios, and some of the ones you think you can handle might differ in subtle ways, such that retrying them only causes more issues.

I agree with Lamar that we could make things significantly more reliable, and I think that's where we should start. We may find that, after some stabilization work, the failure rate is acceptably low and a retry mechanism is no longer required.

On 8/29/11 11:24 AM, Kevin L. Mitchell kevin.mitch...@rackspace.com wrote:

> On Fri, 2011-08-26 at 23:10 +, Monsyne Dragon wrote:
>> First off, I think it would be better if whatever had the failure responded by sending a request somewhere (a cast) to say "Hey, this bombed. Retry it."
>
> What if the failure was due to the process crashing, so that it can't possibly send a request/cast off for retry?
>
> -- Kevin L. Mitchell kevin.mitch...@rackspace.com

This email may include confidential information. If you received it in error, please delete it.
___
Mailing list: https://launchpad.net/~openstack
Post to : openstack@lists.launchpad.net
Unsubscribe : https://launchpad.net/~openstack
More help : https://help.launchpad.net/ListHelp
Re: [Openstack] New nova service proposal
On Aug 26, 2011, at 6:10 PM, Monsyne Dragon wrote:

> First off, I think it would be better if whatever had the failure responded by sending a request somewhere (a cast) to say "Hey, this bombed. Retry it." I wouldn't try doing calls instead of casts, as some have suggested, for performance reasons. (and I could see deadlocking issues)

That was my initial idea, too, but while it seemed simpler and cleaner in a basic deployment, it starts to become problematic when you consider a nested zone scenario. One of the defining traits of a child zone is that it has *no knowledge* of its parent zone, so message passing back up the chain is limited to the API status codes. Since we don't block on the API call until the instance is created and verified as running properly, that avenue is not available for reporting errors that happen at any point downstream. That's why I concluded that the follow-up process must be running in the same zone which initiated the request.

I also considered making it part of the existing scheduler service, but wasn't sure how to add a time-delayed message to the scheduler queue for the follow-up. If that's possible, then there would not need to be a separate service; the scheduler can simply follow up itself.

-- Ed Leafe
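A minimal in-process sketch of the time-delayed follow-up idea mentioned above. It is not how nova's scheduler queue works; `FollowUpQueue`, `schedule`, and `pop_due` are hypothetical names, and a real deployment would presumably need the delay to live in the message queue or a database so that it survives a restart.

```python
import heapq
import itertools


class FollowUpQueue:
    """Holds follow-up requests ordered by the time they become due."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal due times

    def schedule(self, reservation_id, now, delay):
        """Ask for a follow-up on reservation_id `delay` seconds after `now`."""
        entry = (now + delay, next(self._counter), reservation_id)
        heapq.heappush(self._heap, entry)

    def pop_due(self, now):
        """Return, in due-time order, every reservation whose time has arrived."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, reservation_id = heapq.heappop(self._heap)
            due.append(reservation_id)
        return due
```

The caller passes in the current time rather than reading a clock, which keeps the sketch easy to test; whoever owns the queue would call `pop_due` periodically and check each returned reservation.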
Re: [Openstack] New nova service proposal
> I also considered making it part of the existing scheduler service, but wasn't sure how to add a time-delayed message to the scheduler queue for the follow-up. If that's possible, then there would not need to be a separate service; the scheduler can simply follow up itself.

Unless the scheduler dies. These sorts of long-running processes have to be controlled by an external state machine. It's a well-solved problem using business process modeling/workflow orchestration.

-S
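A rough sketch of what "controlled by an external state machine" could mean here: the workflow state lives in a store outside the worker, and only listed transitions are accepted. The state names, `TRANSITIONS`, and `WorkflowStore` are all assumptions made for illustration, and the dict stands in for a durable database.

```python
# Allowed transitions for a hypothetical instance-build workflow.
TRANSITIONS = {
    "requested": {"scheduled", "failed"},
    "scheduled": {"building", "failed"},
    "building": {"active", "failed"},
    "failed": {"requested"},  # a retry starts the workflow over
    "active": set(),
}


class WorkflowStore:
    """Keeps each request's state outside the process doing the work."""

    def __init__(self):
        self._states = {}  # stand-in for a durable store

    def start(self, request_id):
        self._states[request_id] = "requested"

    def advance(self, request_id, new_state):
        """Move a request to new_state, rejecting transitions not in the table."""
        current = self._states[request_id]
        if new_state not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current} -> {new_state}")
        self._states[request_id] = new_state

    def state(self, request_id):
        return self._states[request_id]
```

Because the state is recorded outside the scheduler, a scheduler that dies mid-build leaves the request visibly stuck in "scheduled" or "building" instead of losing track of it.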
Re: [Openstack] New nova service proposal
I was wondering if some of this could be solved by simply using rpc.call vs rpc.cast so that we get appropriate responses, even if they are exceptions.

- Chris

On Aug 26, 2011, at 2:23 PM, Brian Lamar wrote:

> Hey Ed,
>
> I absolutely agree that we need to be confident that all requests will be handled by the system eventually. However, I'm unsure of the need for a new service to be created to handle error cases. I'm not saying that we can solve every case through our current architecture, but with some subtle tweaks I think it can be made much more reliable.
>
> I feel like if we look at every place where errors *can* occur we can find solutions not involving an external service. Are there any particular bugs/situations that are happening which aren't listed as bugs in Launchpad?
>
> Long story short I'd rather not create an external service which attempts to clean up after poor exception handling / bad logic, but there might be some cases I'm not considering.
>
> Brian
>
> -Original Message-
> From: Ed Leafe e...@leafe.com
> Sent: Friday, August 26, 2011 3:22pm
> To: openstack@lists.launchpad.net
> Subject: [Openstack] New nova service proposal
>
> Sorry I haven't come up with a snazzy name for it yet, but what I have in mind is a new service that is essential for my employer (Rackspace), and might be important for other OpenStack deployments. This new service would be completely optional, of course - only those for whom it is relevant would run it.
>
> Let me start by stating the problem: when a customer requests that we create instances for them, nova casts those requests into the queue, where they are eventually acted upon. That usually works great, but in cases where the instance creation fails, we need to detect that failure and re-issue the create request with a different host. This is currently not possible with the asynchronous design of the compute-scheduler interactions.
>
> So what I envision is a service that scans a list of recent requests' reservation IDs, and follows up to see if the request was successful or not, and takes action if needed. The blueprint for this can be found at https://blueprints.launchpad.net/nova/+spec/instance-creation-assurance, with an Etherpad created for ongoing idea exchange at http://etherpad.openstack.org/instance-creation-assurance
>
> -- Ed Leafe
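A minimal sketch of the follow-up scan described in the quoted proposal, assuming each request is tracked as a small record. The `Reservation` fields, the status strings, and the function names are hypothetical and do not come from nova or the blueprint.

```python
from dataclasses import dataclass, field


@dataclass
class Reservation:
    reservation_id: str
    created_at: float
    status: str = "building"  # assumed values: "building", "active", "error"
    tried_hosts: list = field(default_factory=list)


def needs_follow_up(res, now, timeout):
    """True when the request failed outright or has been building too long."""
    if res.status == "error":
        return True
    return res.status == "building" and now - res.created_at > timeout


def scan(reservations, now, timeout):
    """Return the IDs of reservations that should be re-issued."""
    return [r.reservation_id for r in reservations
            if needs_follow_up(r, now, timeout)]


def pick_new_host(res, hosts):
    """Choose the first host that has not already been tried, or None."""
    for host in hosts:
        if host not in res.tried_hosts:
            return host
    return None
```

Recording the hosts already tried on the reservation is what lets the re-issued request go to a different host, which is the requirement stated in the proposal.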
Re: [Openstack] New nova service proposal
Hey Ed,

Great idea. I think there are a lot of ways we can go with this. I started working on a branch called tasks that did something like this. It essentially allowed you to put every action into a tasks database, and you could re-run tasks that didn't complete properly. It was based on the idea that every action should be composed of a bunch of idempotent chunks. I have since discovered that a lot of what I had done is very similar to the concept of a task in celery.

In general I think that actions should become first-class citizens. Action requests to the API should get back a UUID representing the task (or reservation, if you prefer), and we should use the task object to keep track of the status of the action. Right now the status of an action is stored in status fields on the objects that are acted on. This makes for difficult logic when tasks involve multiple objects (boot from volume, for example). If we have rich information in a task object, a recovery service could be written to retry or fix failing tasks.

Regardless of the approach, this seems like a great topic for the summit. So coming up with some proposals would be awesome.

Vish

On Aug 26, 2011, at 12:22 PM, Ed Leafe wrote:

> Sorry I haven't come up with a snazzy name for it yet, but what I have in mind is a new service that is essential for my employer (Rackspace), and might be important for other OpenStack deployments. This new service would be completely optional, of course - only those for whom it is relevant would run it.
>
> Let me start by stating the problem: when a customer requests that we create instances for them, nova casts those requests into the queue, where they are eventually acted upon. That usually works great, but in cases where the instance creation fails, we need to detect that failure and re-issue the create request with a different host. This is currently not possible with the asynchronous design of the compute-scheduler interactions.
>
> So what I envision is a service that scans a list of recent requests' reservation IDs, and follows up to see if the request was successful or not, and takes action if needed. The blueprint for this can be found at https://blueprints.launchpad.net/nova/+spec/instance-creation-assurance, with an Etherpad created for ongoing idea exchange at http://etherpad.openstack.org/instance-creation-assurance
>
> -- Ed Leafe
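A minimal sketch of the task-object idea above, assuming a task is an ordered list of named steps that are each safe to repeat. It is not the code from the tasks branch and not Celery's API; `Task`, `run`, and the `handlers` mapping are hypothetical.

```python
import uuid


class Task:
    """Tracks one action: its steps, which have completed, and its status."""

    def __init__(self, step_names):
        self.id = str(uuid.uuid4())  # what the API would hand back to the caller
        self.step_names = list(step_names)
        self.completed = []
        self.status = "pending"
        self.error = None


def run(task, handlers):
    """Run the task's remaining steps; safe to call again after a failure."""
    task.status = "running"
    for name in task.step_names:
        if name in task.completed:
            continue  # already done on an earlier attempt
        try:
            handlers[name]()
        except Exception as exc:
            task.status = "failed"
            task.error = f"{name}: {exc}"
            return task
        task.completed.append(name)
    task.status = "done"
    task.error = None
    return task
```

A recovery service in this picture would look for tasks whose status is "failed" and call `run` on them again; because completed steps are recorded on the task rather than on the objects acted on, a multi-object action such as boot from volume resumes at the step that failed.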
Re: [Openstack] New nova service proposal
On Aug 26, 2011, at 2:22 PM, Ed Leafe wrote:

> Sorry I haven't come up with a snazzy name for it yet, but what I have in mind is a new service that is essential for my employer (Rackspace), and might be important for other OpenStack deployments. This new service would be completely optional, of course - only those for whom it is relevant would run it.
>
> Let me start by stating the problem: when a customer requests that we create instances for them, nova casts those requests into the queue, where they are eventually acted upon. That usually works great, but in cases where the instance creation fails, we need to detect that failure and re-issue the create request with a different host. This is currently not possible with the asynchronous design of the compute-scheduler interactions.
>
> So what I envision is a service that scans a list of recent requests' reservation IDs, and follows up to see if the request was successful or not, and takes action if needed. The blueprint for this can be found at https://blueprints.launchpad.net/nova/+spec/instance-creation-assurance, with an Etherpad created for ongoing idea exchange at http://etherpad.openstack.org/instance-creation-assurance

Hmmm... having looked over this, I agree that we need to have a way to retry failed builds; however, I do not think that having another service essentially polling the builds to find failures is the right way to go.

First off, I think it would be better if whatever had the failure responded by sending a request somewhere (a cast) to say "Hey, this bombed. Retry it." I wouldn't try doing calls instead of casts, as some have suggested, for performance reasons (and I could see deadlocking issues).

If we step back and look at this, these requests/orders/whatever you call them amount to multi-step workflows. Even for building a single server you have things like "allocate this instance on a hypervisor," "assign IPs," and "attach these volumes," any of which could fail for some reason. And if they do fail, there may be steps needed to back out already-completed work.

The proper way, IMHO, for this to work is that a request generates a workorder with a set of tasks. This gets sent to something (the scheduler, probably) which looks at the first uncompleted task on the workorder, makes the decision on where to send it, and routes the whole workorder there. The service that gets it performs the task (i.e. executes the method), possibly attaching info (like the id of the newly created instance) to the workorder, and possibly pushing an 'undo' task to the top of a list of tasks to perform if things fail somewhere. Then the whole workorder gets sent back to the origin (again, the scheduler?). This looks at the next uncompleted task, and starts the cycle again. Repeat until done.

If there is a failure, the scheduler works through the 'undo' list on the workorder, and then makes whatever decisions are needed to redo the tasks elsewhere. The workorder contains the record of the failed attempt, so it doesn't, for example, try to send the server build back to the same hosts that just failed.

The workorder acts as an environment for the tasks, and could be passed to tasks (rpc methods) as an attribute of the context object.

Anyway, that is my notion. Flame away.

--
Monsyne M. Dragon
OpenStack/Nova
cell 210-441-0965
work x 5014190
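The workorder cycle described above can be sketched roughly as follows. This is a simplification made for illustration: every task in one attempt goes to the same host, and the scheduler loop, the performing service, and the undo handling all live in one function. `WorkOrder`, `run_workorder`, and the `perform`/`rollback` callables are hypothetical names.

```python
class WorkOrder:
    """Carries the task list, the undo list, and a record of failed hosts."""

    def __init__(self, tasks):
        self.tasks = list(tasks)   # task names, in the order they must run
        self.undo = []             # most recently completed task first
        self.info = {}             # data attached by tasks as they run
        self.failed_hosts = []     # hosts that must not be tried again


def run_workorder(order, perform, rollback, hosts):
    """Try each untried host in turn; undo completed tasks after a failure.

    perform(task, host, order) carries out one task and may raise.
    rollback(task, order) reverses one previously completed task.
    Returns True once every task has succeeded on some host.
    """
    for host in hosts:
        if host in order.failed_hosts:
            continue
        try:
            for task in order.tasks:
                perform(task, host, order)
                order.undo.insert(0, task)  # newest undo goes on top
            order.info["host"] = host
            return True
        except Exception:
            while order.undo:
                rollback(order.undo.pop(0), order)
            order.failed_hosts.append(host)
    return False
```

Undo entries are worked through newest first, so the most recent change is backed out before the ones it depended on, and the `failed_hosts` record is what keeps the retry from going back to a host that just failed.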
Re: [Openstack] New nova service proposal
On Aug 26, 2011, at 6:10 PM, Monsyne Dragon wrote:

> The proper way, IMHO, for this to work is that a request generates a workorder with a set of tasks. This gets sent to something (scheduler, probably) which looks at the first uncompleted task on the workorder, makes the decision on where to send it, and routes the whole workorder there. The service that gets it performs the task (i.e. executes the method), possibly attaching info (like id of newly created instance) to the workorder, and possibly pushing an 'undo' task to the top of a list of tasks to perform if things fail somewhere. Then the whole workorder gets sent back to the origin (again, scheduler?) This looks at the next uncompleted task, and starts the cycle again. Repeat until done.
>
> If there is a failure, the scheduler works through the 'undo' list on the workorder, and then makes whatever decisions are needed to redo the tasks elsewhere. The workorder contains the record of the failed attempt, so it doesn't, for example, try to send the server build back to the same hosts that just failed.
>
> The workorder acts as an environment for the tasks, and could be passed to tasks (rpc methods) as an attribute of the context object.

This sounds very similar to Google's App Engine Pipeline project (http://code.google.com/p/appengine-pipeline/).