Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit

2011-12-07 Thread Mark Washenberger
Can you talk a little more about how you want to apply this failure 
notification? That is, what is the case where you are going to use the 
information that an operation failed? In my head I have an idea of getting code 
simplicity dividends from an "everything succeeds" approach to some of our 
operations. But it might not really apply to the case you're working on.

Sandy Walsh sandy.wa...@rackspace.com said:

 For orchestration (and now the scheduler improvements) we need to know when an
 operation fails ... and specifically, which resource was involved. In the
 majority of the cases it's an instance_uuid we're looking for, but it could be
 a security group id or a reservation id.
 
 With most of the compute.manager calls the resource id is the third parameter
 in the call (after self and context), but there are some oddities. And
 sometimes we need to know the additional parameters (like a migration id
 related to an instance uuid). So simply enforcing parameter order may be
 insufficient, and it is impossible to enforce programmatically anyway.
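
A minimal sketch of why relying on parameter position is brittle. The helper
name and both argument tuples are hypothetical, made up for illustration; they
are not taken from nova.

```python
def guess_resource_id(call_args):
    """Return the third positional argument, assumed to be the resource id."""
    return call_args[2] if len(call_args) > 2 else None


# Common shape: (self, context, instance_uuid, ...)
typical = ("<manager>", "<context>", "instance-uuid-1")

# Hypothetical oddity: a migration id comes before the instance uuid.
oddity = ("<manager>", "<context>", "migration-7", "instance-uuid-1")

print(guess_resource_id(typical))  # instance-uuid-1
print(guess_resource_id(oddity))   # migration-7, which is not the instance uuid
```

The positional rule silently returns the wrong id for the second call, and
nothing in the language stops a new method from using a different order.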
 
 A little background:
 
 In nova, exceptions are generally handled in the RPC or middleware layers as a
 logged event and life goes on. In an attempt to tie this into the notification
 system, a while ago I added stuff to the wrap_exception decorator. I'm sure
 you've seen this nightmare scattered around the code:
 @exception.wrap_exception(notifier=notifier, publisher_id=publisher_id())
 
 What started as a simple decorator now takes parameters and the code has
 become nasty.
 
 But it works ... no matter where the exception was generated, the notifier
 gets:
 *   compute.host_id
 *   method name
 *   and whatever arguments the method takes.
 
 So, we know what operation failed and the host it failed on, but someone needs
 to crack the argument nut to get the goodies. It's a fragile coupling from
 publisher to receiver.
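
A rough sketch of the implicit approach described above. It is not nova's
actual wrap_exception; it assumes the notifier is a plain callable taking
(publisher_id, event_type, payload), and the manager class, event name and
"compute.host1" id are invented for the example.

```python
import functools


def wrap_exception(notifier=None, publisher_id=None):
    """Parameterized decorator that reports any exception, then re-raises it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapped(self, context, *args, **kwargs):
            try:
                return func(self, context, *args, **kwargs)
            except Exception as exc:
                if notifier is not None:
                    payload = {
                        "method": func.__name__,
                        "exception": repr(exc),
                        # Raw arguments: the receiver must know what they mean.
                        "args": args,
                        "kwargs": kwargs,
                    }
                    notifier(publisher_id, func.__name__ + ".error", payload)
                raise
        return wrapped
    return decorator


events = []


def record(publisher_id, event_type, payload):
    """Stand-in notifier that just remembers what it was sent."""
    events.append((publisher_id, event_type, payload))


class ComputeManager:
    @wrap_exception(notifier=record, publisher_id="compute.host1")
    def run_instance(self, context, instance_uuid):
        raise RuntimeError("no valid disk")


try:
    ComputeManager().run_instance({}, "abc-123")
except RuntimeError:
    pass

print(events[0][2]["args"])  # ('abc-123',)
```

The payload carries the host, the method name and an anonymous tuple, so the
receiver still has to know that the first element is the instance uuid.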
 
 One, less fragile, alternative is to put a try/except block inside every
 top-level nova.compute.manager method and send meaningful exceptions right
 from the source. More fidelity, but messier code. Although "explicit is
 better than implicit" keeps ringing in my head.
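
A sketch of what the explicit alternative could look like. InstanceFailure,
the notify callable, the _spawn helper and the event name are all assumptions
for illustration, not existing nova code.

```python
class InstanceFailure(Exception):
    """Hypothetical exception that names the resource the failure relates to."""

    def __init__(self, operation, instance_uuid, reason):
        message = "%s failed for %s: %s" % (operation, instance_uuid, reason)
        super().__init__(message)
        self.operation = operation
        self.instance_uuid = instance_uuid
        self.reason = reason


class ComputeManager:
    def __init__(self, notify):
        self.notify = notify  # assumed callable: notify(event_type, payload)

    def _spawn(self, context, instance_uuid):
        raise RuntimeError("hypervisor unavailable")  # stand-in for real work

    def run_instance(self, context, instance_uuid):
        try:
            self._spawn(context, instance_uuid)
        except Exception as exc:
            # The method itself knows which argument is the resource id.
            self.notify("compute.run_instance.error",
                        {"instance_uuid": instance_uuid, "reason": str(exc)})
            raise InstanceFailure("run_instance", instance_uuid, str(exc)) from exc


sent = []
manager = ComputeManager(lambda event_type, payload: sent.append((event_type, payload)))
caught = None
try:
    manager.run_instance({}, "abc-123")
except InstanceFailure as failure:
    caught = failure
```

Every top-level method needs its own block like this, which is the "messier
code" cost, but the payload is named rather than positional.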
 
 Or, we make a general event parser that anyone can use ... but again, the link
 between the actual method and the parser is fragile. The developers have to
 remember to update both.
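
One way such a general event parser might be organized, as a sketch only: a
registry maps each method name to a small function that names the ids in the
raw arguments. The registry, the decorator and the argument orders shown are
assumptions, not nova's real signatures.

```python
# Registry: method name -> function that pulls named ids out of the raw args.
EXTRACTORS = {}


def extractor(method_name):
    """Register an extractor for the given method name."""
    def register(func):
        EXTRACTORS[method_name] = func
        return func
    return register


@extractor("run_instance")
def _run_instance(args, kwargs):
    return {"instance_uuid": args[0]}


@extractor("confirm_resize")  # assumed order: (migration_id, instance_uuid)
def _confirm_resize(args, kwargs):
    return {"migration_id": args[0], "instance_uuid": args[1]}


def parse_event(payload):
    """Turn a raw failure payload into named resource ids, if we know how."""
    extract = EXTRACTORS.get(payload["method"])
    if extract is None:
        return {}
    return extract(payload.get("args", ()), payload.get("kwargs", {}))


print(parse_event({"method": "confirm_resize", "args": ("m-7", "abc-123")}))
print(parse_event({"method": "brand_new_method", "args": ("x",)}))
```

The second call returns an empty dict: a method added without a matching
extractor is exactly the "remember to update both" problem.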
 
 Opinions?
 
 -S
 
 ___
 Mailing list: https://launchpad.net/~openstack
 Post to : openstack@lists.launchpad.net
 Unsubscribe : https://launchpad.net/~openstack
 More help   : https://help.launchpad.net/ListHelp
 





Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit

2011-12-07 Thread Sandy Walsh
Sure, the problem I'm immediately facing is reclaiming resources from the 
Capacity table when something fails (we claim them immediately in the 
scheduler when the host is selected, to lessen the latency).

The other situation is that Orchestration needs it for retries, rescheduling, 
rollbacks and cross-service timeouts.

I think it's needed core functionality. I like Fail-Fast for the same reasons, 
but it can get in the way.

-S




Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit

2011-12-07 Thread Mark Washenberger
Gotcha.

So the way this might work is, for example, when a run_instance fails on a 
compute node, it would publish a "run_instance for uuid=blah failed" event. 
There would be a subscriber associated with the scheduler listening for such 
events--when it receives one it would go check the capacity table and update it 
to reflect the failure. Does that sound about right?
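
A toy sketch of that flow, with an in-memory stand-in for the capacity table.
The class, the on_event callback and the event name are hypothetical; a real
subscriber would sit on the message queue and update the database.

```python
class CapacityTable:
    """In-memory stand-in for the Capacity table."""

    def __init__(self, free_slots):
        self.free_slots = dict(free_slots)  # host -> free slots
        self.claims = {}                    # instance_uuid -> host

    def claim(self, host, instance_uuid):
        if self.free_slots[host] <= 0:
            raise ValueError("no capacity on %s" % host)
        self.free_slots[host] -= 1
        self.claims[instance_uuid] = host

    def release(self, instance_uuid):
        host = self.claims.pop(instance_uuid, None)
        if host is not None:  # ignore duplicate or unknown failure events
            self.free_slots[host] += 1


def on_event(capacity, event_type, payload):
    """Subscriber callback: give capacity back when run_instance fails."""
    if event_type == "compute.run_instance.error":
        capacity.release(payload["instance_uuid"])


capacity = CapacityTable({"host1": 2})
capacity.claim("host1", "abc-123")  # scheduler claims up front
on_event(capacity, "compute.run_instance.error", {"instance_uuid": "abc-123"})
on_event(capacity, "compute.run_instance.error", {"instance_uuid": "abc-123"})  # redelivery
print(capacity.free_slots)  # {'host1': 2}
```

Keying the release on the claim record makes the handler safe against the
same failure event arriving twice.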



Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit

2011-12-07 Thread Sandy Walsh
Exactly! ... or it could be handled in the notifier itself.




Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit

2011-12-07 Thread Yun Mao
Hi Sandy,

I'm wondering if it is possible to change the scheduler's rpc cast to
rpc call. This way the exceptions should be magically propagated back
to the scheduler, right? Naturally the scheduler can find another node
to retry or decide to give up and report failure. If we need to
provision many instances, we can spawn a few green threads for that.
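
A minimal sketch of the retry loop this would allow, using a fake blocking
rpc_call in place of the real RPC layer. The function names, the HostFailed
exception and the BROKEN_HOSTS set are all invented for the example.

```python
class HostFailed(Exception):
    """Raised by the fake RPC layer when the remote operation fails."""


BROKEN_HOSTS = {"host1"}  # pretend host1 cannot run instances


def rpc_call(host, method, **kwargs):
    """Stand-in for a blocking RPC call: returns the result or raises."""
    if host in BROKEN_HOSTS:
        raise HostFailed("%s failed on %s" % (method, host))
    return "%s ok on %s" % (method, host)


def schedule_with_retry(candidate_hosts, instance_uuid):
    """Try each candidate in turn; a raised exception means move on."""
    errors = []
    for host in candidate_hosts:
        try:
            return rpc_call(host, "run_instance", instance_uuid=instance_uuid)
        except HostFailed as exc:
            errors.append(str(exc))
    raise HostFailed("all hosts failed: " + "; ".join(errors))


print(schedule_with_retry(["host1", "host2"], "abc-123"))  # run_instance ok on host2
```

Each call blocks until the compute node answers, so provisioning many
instances at once would need the loop to run in separate threads, as noted
above.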

Yun



Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit

2011-12-07 Thread Sandy Walsh
True ... this idea has come up before (and is still being kicked around). My 
biggest concern is what happens if that scheduler dies? We need a mechanism 
that can live outside of a single scheduler service. 

The more of these long-running processes we leave in a service the greater the 
impact when something fails. Shouldn't we let the queue provide the resiliency 
and not depend on the worker staying alive? Personally I'm not a fan of 
removing our synchronous nature.
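
A toy model of what "let the queue provide the resiliency" can mean: a message
stays owned by the queue until a worker acknowledges it, so a dead worker does
not take the event with it. AckQueue and the worker names are hypothetical;
this only imitates the acknowledgement behaviour message brokers typically
offer.

```python
from collections import deque


class AckQueue:
    """Toy queue: a message stays owned by the queue until it is acked."""

    def __init__(self):
        self.ready = deque()
        self.in_flight = {}  # worker name -> message being processed

    def publish(self, message):
        self.ready.append(message)

    def get(self, worker):
        message = self.ready.popleft()
        self.in_flight[worker] = message
        return message

    def ack(self, worker):
        del self.in_flight[worker]

    def worker_died(self, worker):
        # Unacked work goes back on the queue for another worker.
        message = self.in_flight.pop(worker, None)
        if message is not None:
            self.ready.appendleft(message)


queue = AckQueue()
queue.publish({"event": "run_instance.error", "instance_uuid": "abc-123"})
queue.get("scheduler-1")          # scheduler-1 picks the event up ...
queue.worker_died("scheduler-1")  # ... and dies before acking
recovered = queue.get("scheduler-2")
queue.ack("scheduler-2")
```

With a blocking call held inside one scheduler process, the equivalent state
would be lost when that process died.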




Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit

2011-12-07 Thread Sandy Walsh
*removing our Asynchronous nature.

(heh, such a key point to typo on)





Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit

2011-12-07 Thread Michael Pittaro
On Wed, Dec 7, 2011 at 7:26 AM, Sandy Walsh sandy.wa...@rackspace.com wrote:
 For orchestration (and now the scheduler improvements) we need to know when 
 an operation fails ... and specifically, which resource was involved. In the 
 majority of the cases it's an instance_uuid we're looking for, but it could 
 be a security group id or a reservation id.

 With most of the compute.manager calls the resource id is the third parameter 
 in the call (after self and context), but there are some oddities. And 
 sometimes we need to know the additional parameters (like a migration id 
 related to an instance uuid). So simply enforcing parameter orders may be 
 insufficient and impossible to enforce programmatically.

 A little background:

 In nova, exceptions are generally handled in the RPC or middleware layers as 
 a logged event and life goes on. In an attempt to tie this into the 
 notification system, a while ago I added stuff to the wrap_exception 
 decorator. I'm sure you've seen this nightmare scattered around the code:
 @exception.wrap_exception(notifier=notifier, publisher_id=publisher_id())

 What started as a simple decorator now takes parameters and the code has 
 become nasty.

 But it works ... no matter where the exception was generated, the notifier 
 gets:
 *   compute.host_id
 *   method name
 *   and whatever arguments the method takes.

 So, we know what operation failed and the host it failed on, but someone 
 needs to crack the argument nut to get the goodies. It's a fragile coupling 
 from publisher to receiver.

I'm just wondering if we can get the notification message down to
something more standardized, and avoid including the full argument
list. That is one way to reduce the coupling.

What is the minimum information we need to know when a failure occurs?
I think we have:
*   operation
*   host it failed on
*   instance_id
*   migration_id (maybe)
*   reservation_id (maybe)
*   security group id (maybe)

If we can avoid cracking open the remaining arguments, a list this
long might be manageable.
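
A sketch of such a fixed-shape message, using the fields listed above. The
class name and the to_payload helper are assumptions for illustration; the
three "maybe" ids are modeled as optional.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureNotification:
    operation: str
    host: str
    instance_id: str
    migration_id: Optional[str] = None
    reservation_id: Optional[str] = None
    security_group_id: Optional[str] = None

    def to_payload(self):
        """Drop unset optional ids so receivers only see what applies."""
        return {key: value for key, value in asdict(self).items()
                if value is not None}


note = FailureNotification("run_instance", "host1", "abc-123",
                           reservation_id="r-42")
print(note.to_payload())
```

Receivers read named keys instead of unpacking the method's arguments, so a
change to a method signature no longer changes the message.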


 One, less fragile, alternative is to put a try/except block inside every 
 top-level nova.compute.manager method and send meaningful exceptions right 
 from the source. More fidelity, but messier code. Although "explicit is 
 better than implicit" keeps ringing in my head.

I like explicit better than implicit, but I think we need to trigger
off any and all exceptions to make all of this reliable.

 Or, we make a general event parser that anyone can use ... but again, the 
 link between the actual method and the parser is fragile. The developers have 
 to remember to update both.

 Opinions?

 -S


