Re: [Openstack] New nova service proposal

2011-08-29 Thread Matt Dietz
I would think we have enough tracking information to support the goal of
identifying failures. In any scenario, some of the failures will simply be
unrecoverable. 

Regarding the process crashing, who's to say the retry process wouldn't
crash too? We could argue endlessly that the arbiter/watchdog processes
will crash at each tier. As such, I think it's better to say we need a
simpler mechanism for identifying failures and perhaps a best-effort
retry. 

Retrying can be scary, to say the least. You can't possibly handle all of
the possible failure scenarios, and some of the ones you think you can
might be different in subtle ways such that retrying them only causes more
issues.

I agree with Lamar that we could make things significantly more reliable,
and I think that's where we should start. We may find that, after some
stabilization work, the failure rate is acceptably low and any retry
mechanism is no longer required.

On 8/29/11 11:24 AM, Kevin L. Mitchell kevin.mitch...@rackspace.com
wrote:

On Fri, 2011-08-26 at 23:10 +, Monsyne Dragon wrote:
 First off, I think it would be better if whatever had the failure
 responded by sending a request somewhere (a cast) to say "Hey, this
 bombed. Retry it."

What if the failure was due to the process crashing, so that it can't
possibly send a request/cast off for retry?
-- 
Kevin L. Mitchell kevin.mitch...@rackspace.com

This email may include confidential information. If you received it in
error, please delete it.
___
Mailing list: https://launchpad.net/~openstack
Post to : openstack@lists.launchpad.net
Unsubscribe : https://launchpad.net/~openstack
More help   : https://help.launchpad.net/ListHelp


Re: [Openstack] New nova service proposal

2011-08-29 Thread Ed Leafe
On Aug 26, 2011, at 6:10 PM, Monsyne Dragon wrote:

 First off, I think it would be better if whatever had the failure responded 
 by sending a request somewhere (a cast) to say "Hey, this bombed. Retry it."
 I wouldn't try doing calls instead of casts, as some have suggested, for 
 performance reasons (and I could see deadlocking issues).

That was my initial idea, too, but while it seemed simpler and cleaner 
in a basic deployment, it starts to become problematic when you consider a 
nested zone scenario. One of the defining traits of a child zone is that it has 
*no knowledge* of its parent zone, so message passing back up the chain is 
limited to the API status codes. Since we don't block on the API call until the 
instance is created and verified as running properly, that avenue is not 
available for reporting errors that happen at any point downstream. That's why 
I concluded that the follow-up process must be running in the same zone which 
initiated the request.

I also considered making it part of the existing scheduler service, but 
wasn't sure how to add a time-delayed message to the scheduler queue for the 
follow-up. If that's possible, then there would not need to be a separate 
service; the scheduler can simply follow up itself.
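
One way to picture that time-delayed follow-up is a small priority queue keyed by due time. This is only a rough sketch under stated assumptions, not nova code: `FollowUpQueue`, `follow_up`, the status strings, and the `get_status` / `reschedule` callables are all hypothetical names made up for illustration.

```python
import heapq
import itertools


class FollowUpQueue:
    """Hold reservation IDs until their follow-up check comes due (hypothetical)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal due times

    def schedule(self, reservation_id, now, delay):
        """Ask for a follow-up on reservation_id at time now + delay."""
        heapq.heappush(
            self._heap, (now + delay, next(self._counter), reservation_id))

    def pop_due(self, now):
        """Return every reservation ID whose follow-up time has passed."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, reservation_id = heapq.heappop(self._heap)
            due.append(reservation_id)
        return due


def follow_up(pending, now, delay, get_status, reschedule):
    """Check each due reservation and act on what is found."""
    for reservation_id in pending.pop_due(now):
        status = get_status(reservation_id)
        if status == "error":
            reschedule(reservation_id)          # re-issue the create request
        elif status == "building":
            pending.schedule(reservation_id, now, delay)  # look again later
        # any other status (e.g. "active") needs no further action
```

The caller passes the current time in explicitly, which keeps the sketch easy to test. A queue held in memory like this disappears if the process holding it dies, so a real version would need to persist it somewhere.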


-- Ed Leafe






Re: [Openstack] New nova service proposal

2011-08-29 Thread Sandy Walsh
I also considered making it part of the existing scheduler service, 
 but wasn't sure how to add a time-delayed message to the scheduler queue 
 for the follow-up. If that's possible, then there would not need to be a 
 separate service; the scheduler can simply follow up itself.

Unless the scheduler dies. 

These sorts of long-running processes have to be controlled by an external 
state machine. It's a well-solved problem using business process 
modeling/workflow orchestration.

-S 




Re: [Openstack] New nova service proposal

2011-08-26 Thread Chris Behrens
I was wondering if some of this could be solved by simply using rpc.call vs 
rpc.cast so that we get appropriate responses, even if they are exceptions.
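
To make the difference concrete, here is a toy in-process sketch. It is not nova's rpc module: `worker`, `cast`, `call`, and `create_instance` are hypothetical stand-ins built on the standard library's `queue` and `threading`, and the "bad-host" failure is contrived for the demonstration.

```python
import queue
import threading


def create_instance(host):
    """Hypothetical worker method; fails on one host for demonstration."""
    if host == "bad-host":
        raise RuntimeError("no capacity on %s" % host)
    return "instance-on-%s" % host


def worker(requests):
    """Consume (method, args, reply_queue) messages until a None sentinel."""
    while True:
        message = requests.get()
        if message is None:
            break
        method, args, reply = message
        try:
            result = (True, method(*args))
        except Exception as exc:
            result = (False, exc)
        if reply is not None:          # only calls carry a reply queue
            reply.put(result)


def cast(requests, method, *args):
    """Fire and forget: the caller never learns the outcome."""
    requests.put((method, args, None))


def call(requests, method, *args, timeout=5):
    """Block for the reply and re-raise the worker's exception, if any."""
    reply = queue.Queue()
    requests.put((method, args, reply))
    ok, value = reply.get(timeout=timeout)
    if not ok:
        raise value
    return value
```

With a worker thread running, `cast(requests, create_instance, "bad-host")` returns immediately and the failure goes unnoticed, while `call(requests, create_instance, "bad-host")` raises `RuntimeError` in the caller. The price is that the caller is blocked until the worker replies or the timeout expires.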

- Chris

On Aug 26, 2011, at 2:23 PM, Brian Lamar wrote:

 Hey Ed,
 
 I absolutely agree that we need to be confident that all requests will be 
 handled by the system eventually. However, I'm unsure of the need for a new 
 service to be created to handle error cases.
 
 I'm not saying that we can solve every case through our current architecture, 
 but with some subtle tweaks I think it can be made much more reliable.  I 
 feel like if we look at every place where errors *can* occur we can find 
 solutions not involving an external service. Are there any particular 
 bugs/situations that are happening which aren't listed as bugs in Launchpad?
 
 Long story short I'd rather not create an external service which attempts to 
 clean up after poor exception handling / bad logic, but there might be some 
 cases I'm not considering.
 
 Brian
 
 
 
 
 -Original Message-
 From: Ed Leafe e...@leafe.com
 Sent: Friday, August 26, 2011 3:22pm
 To: openstack@lists.launchpad.net
 Subject: [Openstack] New nova service proposal
 
   Sorry I haven't come up with a snazzy name for it yet, but what I have 
 in mind is a new service that is essential for my employer (Rackspace), and 
 might be important for other OpenStack deployments. This new service would be 
 completely optional, of course - only those for whom it is relevant would run 
 it.
 
   Let me start by stating the problem: when a customer requests that we 
 create instances for them, nova casts those requests into the queue, where 
 they are eventually acted upon. That usually works great, but in cases where 
 the instance creation fails, we need to detect that failure and re-issue the 
 create request with a different host. This is currently not possible with the 
 asynchronous design of the compute-scheduler interactions.
 
   So what I envision is a service that scans a list of recent requests' 
 reservation IDs, and follows up to see if the request was successful or not, 
 and takes action if needed. The blueprint for this can be found at 
 https://blueprints.launchpad.net/nova/+spec/instance-creation-assurance, with 
 an Etherpad created for ongoing idea exchange at 
 http://etherpad.openstack.org/instance-creation-assurance
 
 
 
 -- Ed Leafe
 
 
 
 


Re: [Openstack] New nova service proposal

2011-08-26 Thread Vishvananda Ishaya
Hey Ed,

Great idea. I think there are a lot of ways we can go with this.  I started 
working on a branch called tasks that did something like this.  It essentially 
allowed you to put every action into a tasks database and you could re-run 
tasks that didn't complete properly.  It was based on the idea that every 
action should be composed of a bunch of idempotent chunks.

I have since discovered that a lot of what I had done is very similar to the 
concept of a task in Celery.  In general I think that actions should become 
first-class citizens.  Action requests to the API should get back a UUID 
representing the task (or reservation if you prefer), and we should use the 
task object to keep track of the status of the action.  Right now the status of 
an action is stored in status fields on the objects that are acted on.  This 
makes for difficult logic when tasks involve multiple objects (boot from volume 
for example). If we have rich information in a task object, a recovery service 
could be written to retry or fix failing tasks.
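
A minimal sketch of what such a task object might look like, assuming nothing about the actual tasks branch or about Celery: the `Task` class, its field names, the status values, and the `run` helper are all hypothetical. Each step is assumed to be an idempotent chunk, so re-running a task after a failure only repeats the steps that did not complete.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative only."""
    action: str
    steps: list                      # ordered (name, callable) pairs
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: str = "pending"          # pending -> running -> done | failed
    completed: list = field(default_factory=list)
    error: str = ""


def run(task):
    """Run the steps not yet completed; safe to invoke again after a failure."""
    task.status = "running"
    for name, step in task.steps:
        if name in task.completed:
            continue                 # skip work finished on an earlier attempt
        try:
            step()
        except Exception as exc:
            task.status, task.error = "failed", str(exc)
            return task
        task.completed.append(name)
    task.status, task.error = "done", ""
    return task
```

A recovery service in this picture would simply look for tasks whose status is "failed" and call `run` on them again; the status lives on the task rather than on the instance or volume it touches.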

Regardless of the approach, this seems like a great topic for the summit.  So 
coming up with some proposals would be awesome.

Vish

On Aug 26, 2011, at 12:22 PM, Ed Leafe wrote:

   Sorry I haven't come up with a snazzy name for it yet, but what I have 
 in mind is a new service that is essential for my employer (Rackspace), and 
 might be important for other OpenStack deployments. This new service would be 
 completely optional, of course - only those for whom it is relevant would run 
 it.
 
   Let me start by stating the problem: when a customer requests that we 
 create instances for them, nova casts those requests into the queue, where 
 they are eventually acted upon. That usually works great, but in cases where 
 the instance creation fails, we need to detect that failure and re-issue the 
 create request with a different host. This is currently not possible with the 
 asynchronous design of the compute-scheduler interactions.
 
   So what I envision is a service that scans a list of recent requests' 
 reservation IDs, and follows up to see if the request was successful or not, 
 and takes action if needed. The blueprint for this can be found at 
 https://blueprints.launchpad.net/nova/+spec/instance-creation-assurance, with 
 an Etherpad created for ongoing idea exchange at 
 http://etherpad.openstack.org/instance-creation-assurance
 
 
 
 -- Ed Leafe
 
 
 
 


Re: [Openstack] New nova service proposal

2011-08-26 Thread Monsyne Dragon

On Aug 26, 2011, at 2:22 PM, Ed Leafe wrote:

   Sorry I haven't come up with a snazzy name for it yet, but what I have 
 in mind is a new service that is essential for my employer (Rackspace), and 
 might be important for other OpenStack deployments. This new service would be 
 completely optional, of course - only those for whom it is relevant would run 
 it.
 
   Let me start by stating the problem: when a customer requests that we 
 create instances for them, nova casts those requests into the queue, where 
 they are eventually acted upon. That usually works great, but in cases where 
 the instance creation fails, we need to detect that failure and re-issue the 
 create request with a different host. This is currently not possible with the 
 asynchronous design of the compute-scheduler interactions.
 
   So what I envision is a service that scans a list of recent requests' 
 reservation IDs, and follows up to see if the request was successful or not, 
 and takes action if needed. The blueprint for this can be found at 
 https://blueprints.launchpad.net/nova/+spec/instance-creation-assurance, with 
 an Etherpad created for ongoing idea exchange at 
 http://etherpad.openstack.org/instance-creation-assurance

Hmmm... having looked over this, I agree that we need a way to retry 
failed builds. However, I do not think that having another service essentially 
polling the builds to find failures is the right way to go. 

First off, I think it would be better if whatever had the failure responded by 
sending a request somewhere (a cast) to say "Hey, this bombed. Retry it."  I 
wouldn't try doing calls instead of casts, as some have suggested, for 
performance reasons (and I could see deadlocking issues). 

If we step back and look at this, these requests/orders/whatever you call them 
amount to multi-step workflows.  Even for building a single server you have 
things like "allocate this instance on a hypervisor", "assign IPs", "attach 
these volumes", any of which could fail for some reason.  And if they do 
fail, there may be steps needed to back out already-completed work. 

The proper way, IMHO, for this to work is that a request generates a workorder 
with a set of tasks.  
This gets sent to something (scheduler, probably) which looks at the first 
uncompleted task on the workorder, makes the decision on where to send it, and 
routes the whole workorder there.  
The service that gets it performs the task (i.e. executes the method), possibly 
attaching  info (like id of newly created instance) to the workorder, and 
possibly pushing an 'undo' task to the top of a list of tasks to perform if 
things fail somewhere.  
Then the whole workorder gets sent back to the origin (again, scheduler?) This 
looks at the next uncompleted task, and starts the cycle again.  
Repeat until done. 

If there is a failure, the scheduler works through the 'undo' list on the 
workorder, and then makes whatever decisions are needed to redo the tasks 
elsewhere.  The workorder contains the record of the failed attempt, so it 
doesn't, for example, try to send the server build back to the same hosts that 
just failed. 

The workorder acts as an environment for the tasks, and could be passed to 
tasks (rpc methods) as an attribute of the context object. 
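
As a rough sketch of that cycle (not working nova code), the workorder can be modelled as a list of do/undo pairs plus a record of failed hosts. Everything here is hypothetical: the `WorkOrder` class, its fields, and `process`, which collapses the scheduler-to-service round trips into one in-process loop for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class WorkOrder:
    """Hypothetical workorder; all names here are illustrative."""
    tasks: list                                   # ordered (do, undo) callable pairs
    done: int = 0                                 # index of first uncompleted task
    undo: list = field(default_factory=list)      # most recent undo first
    info: dict = field(default_factory=dict)      # data attached by tasks
    failed_hosts: list = field(default_factory=list)


def process(order, hosts):
    """Try hosts in turn; on failure, unwind the undo list and move on."""
    for host in hosts:
        if host in order.failed_hosts:
            continue                              # never retry a host that failed
        try:
            while order.done < len(order.tasks):
                do, undo = order.tasks[order.done]
                do(order, host)
                order.undo.insert(0, undo)        # push undo to the top of the list
                order.done += 1
            return host
        except Exception:
            while order.undo:
                order.undo.pop(0)(order, host)    # back out completed work
            order.done = 0
            order.info.clear()
            order.failed_hosts.append(host)
    return None                                   # every candidate host failed
```

Each task receives the workorder itself, so it can attach info such as a new instance id, and `failed_hosts` is what keeps the build from being sent back to a host that just failed.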


Anyway, that is my notion.  Flame away. 



--
Monsyne M. Dragon
OpenStack/Nova 
cell 210-441-0965
work x 5014190



Re: [Openstack] New nova service proposal

2011-08-26 Thread John Dickinson

On Aug 26, 2011, at 6:10 PM, Monsyne Dragon wrote:
 
 The proper way, IMHO, for this to work is that a request generates a 
 workorder with a set of tasks.  
 This gets sent to something (scheduler, probably) which looks at the first 
 uncompleted task on the workorder, makes the decision on where to send it, 
 and routes the whole workorder there.  
 The service that gets it performs the task (i.e. executes the method), 
 possibly attaching  info (like id of newly created instance) to the 
 workorder, and possibly pushing an 'undo' task to the top of a list of tasks 
 to perform if things fail somewhere.  
 Then the whole workorder gets sent back to the origin (again, scheduler?) 
 This looks at the next uncompleted task, and starts the cycle again.  
 Repeat until done. 
 
 If there is a failure, the scheduler works through the 'undo' list on the 
 workorder, and then makes whatever decisions are needed to redo the tasks 
 elsewhere.  The workorder contains the record of the failed attempt, so it 
 doesn't, for example, try to send the server build back to the same hosts 
 that just failed. 
 
 The workorder acts as an environment for the tasks, and could be passed to 
 tasks (rpc methods) as an attribute of the context object. 


This sounds very similar to Google's app engine pipeline project 
(http://code.google.com/p/appengine-pipeline/).