Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit
Can you talk a little more about how you want to apply this failure notification? That is, what is the case where you are going to use the information that an operation failed?

In my head I have an idea of getting code-simplicity dividends from an "everything succeeds" approach to some of our operations, but it might not really apply to the case you're working on.

Sandy Walsh sandy.wa...@rackspace.com said:

> For orchestration (and now the scheduler improvements) we need to know when an operation fails ... and specifically, which resource was involved. In the majority of cases it's an instance_uuid we're looking for, but it could be a security group id or a reservation id. With most of the compute.manager calls the resource id is the third parameter in the call (after self and context), but there are some oddities. And sometimes we need to know additional parameters (like a migration id related to an instance uuid). So simply enforcing parameter order may be insufficient, and impossible to enforce programmatically.
>
> A little background: in nova, exceptions are generally handled in the RPC or middleware layers as a logged event, and life goes on. In an attempt to tie this into the notification system, a while ago I added stuff to the wrap_exception decorator. I'm sure you've seen this nightmare scattered around the code:
>
>   @exception.wrap_exception(notifier=notifier, publisher_id=publisher_id())
>
> What started as a simple decorator now takes parameters, and the code has become nasty. But it works ... no matter where the exception was generated, the notifier gets:
>
> * compute.host_id
> * method name
> * whatever arguments the method takes
>
> So we know what operation failed and the host it failed on, but someone needs to crack the argument nut to get the goodies. It's a fragile coupling from publisher to receiver.
>
> One less fragile alternative is to put a try/except block inside every top-level nova.compute.manager method and send meaningful exceptions right from the source. More fidelity, but messier code. Although "explicit is better than implicit" keeps ringing in my head.
>
> Or we make a general event parser that anyone can use ... but again, the link between the actual method and the parser is fragile. The developers have to remember to update both.
>
> Opinions?
>
> -S

___
Mailing list: https://launchpad.net/~openstack
Post to     : openstack@lists.launchpad.net
Unsubscribe : https://launchpad.net/~openstack
More help   : https://help.launchpad.net/ListHelp
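[For readers following along, a minimal sketch of the decorator approach Sandy describes. The notifier plumbing here is a hypothetical stand-in, not nova's actual notification API; it only illustrates why consumers end up having to "crack the argument nut".]

```python
import functools

# Hypothetical stand-in for nova's notifier; the real API differs.
SENT_NOTIFICATIONS = []

def notify(publisher_id, event_type, priority, payload):
    SENT_NOTIFICATIONS.append((publisher_id, event_type, priority, payload))

def wrap_exception(publisher_id):
    """Re-raise any exception, but first publish an error notification
    carrying the method name and its raw arguments. The fragility is
    visible here: a consumer must know each method's argument order to
    dig the resource id back out of 'args'."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapped(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                notify(publisher_id, fn.__name__, 'ERROR',
                       {'args': args, 'kwargs': kwargs})
                raise  # the RPC/middleware layer still logs it
        return wrapped
    return decorator

@wrap_exception(publisher_id='compute.host1')
def run_instance(context, instance_uuid):
    # Simulated failure somewhere inside the compute manager.
    raise RuntimeError('hypervisor unavailable')
```

The notification tells you the method and host, but the instance_uuid is buried at some positional index in `payload['args']` — exactly the publisher-to-receiver coupling being complained about.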
Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit
Sure. The problem I'm immediately facing is reclaiming resources from the Capacity table when something fails (we claim them immediately in the scheduler, when the host is selected, to lessen the latency).

The other situation is that Orchestration needs it for retries, rescheduling, rollbacks and cross-service timeouts. I think it's needed core functionality.

I like Fail-Fast for the same reasons, but it can get in the way.

-S

From: Mark Washenberger [mark.washenber...@rackspace.com]
Sent: Wednesday, December 07, 2011 11:53 AM
To: openstack@lists.launchpad.net
Subject: Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit

> Can you talk a little more about how you want to apply this failure notification? That is, what is the case where you are going to use the information that an operation failed? In my head I have an idea of getting code-simplicity dividends from an "everything succeeds" approach to some of our operations, but it might not really apply to the case you're working on.
Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit
Gotcha. So the way this might work is, for example: when a run_instance fails on a compute node, it would publish a "run_instance for uuid=blah failed" event. There would be a subscriber associated with the scheduler listening for such events; when it receives one, it would go check the capacity table and update it to reflect the failure. Does that sound about right?

Sandy Walsh sandy.wa...@rackspace.com said:

> Sure. The problem I'm immediately facing is reclaiming resources from the Capacity table when something fails (we claim them immediately in the scheduler, when the host is selected, to lessen the latency). The other situation is that Orchestration needs it for retries, rescheduling, rollbacks and cross-service timeouts. I think it's needed core functionality. I like Fail-Fast for the same reasons, but it can get in the way.
>
> -S
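[A sketch of the subscriber flow just described. The event shape and the in-memory "capacity table" are illustrative assumptions, not nova code.]

```python
# Hypothetical capacity table: the scheduler claimed 2 slots on host1
# at scheduling time, before the instances actually built.
capacity_table = {'host1': {'claimed': 2}}

def on_notification(event):
    """Release a scheduler-made claim when a run_instance fails.

    Returns True if the event was handled, False if ignored."""
    if event.get('event_type') != 'compute.run_instance':
        return False
    if event.get('priority') != 'ERROR':
        return False
    host = event['payload']['host']
    # Give back the capacity claimed when this host was selected.
    capacity_table[host]['claimed'] -= 1
    return True
```

With a standardized payload (host and resource id in known fields), the subscriber never has to inspect the failed method's raw arguments.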
Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit
Exactly! ... or it could be handled in the notifier itself.

From: Mark Washenberger [mark.washenber...@rackspace.com]
Sent: Wednesday, December 07, 2011 12:36 PM
To: openstack@lists.launchpad.net
Subject: Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit

> Gotcha. So the way this might work is, for example: when a run_instance fails on a compute node, it would publish a "run_instance for uuid=blah failed" event. There would be a subscriber associated with the scheduler listening for such events; when it receives one, it would go check the capacity table and update it to reflect the failure. Does that sound about right?
Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit
Hi Sandy,

I'm wondering if it is possible to change the scheduler's rpc cast to rpc call. That way, exceptions should be magically propagated back to the scheduler, right? Naturally the scheduler can then find another node to retry, or decide to give up and report failure. If we need to provision many instances, we can spawn a few green threads for that.

Yun

On Wed, Dec 7, 2011 at 10:26 AM, Sandy Walsh sandy.wa...@rackspace.com wrote:

> For orchestration (and now the scheduler improvements) we need to know when an operation fails ... and specifically, which resource was involved. [...]
>
> Opinions?
>
> -S
Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit
True ... this idea has come up before (and is still being kicked around).

My biggest concern is: what happens if that scheduler dies? We need a mechanism that can live outside of a single scheduler service. The more of these long-running processes we leave in a service, the greater the impact when something fails. Shouldn't we let the queue provide the resiliency, and not depend on the worker staying alive?

Personally I'm not a fan of removing our synchronous nature.

From: Yun Mao [yun...@gmail.com]
Sent: Wednesday, December 07, 2011 1:03 PM
To: Sandy Walsh
Cc: openstack@lists.launchpad.net
Subject: Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit

> Hi Sandy, I'm wondering if it is possible to change the scheduler's rpc cast to rpc call. That way, exceptions should be magically propagated back to the scheduler, right? Naturally the scheduler can then find another node to retry, or decide to give up and report failure. If we need to provision many instances, we can spawn a few green threads for that.
>
> Yun
>
> On Wed, Dec 7, 2011 at 10:26 AM, Sandy Walsh sandy.wa...@rackspace.com wrote:
>
>> For orchestration (and now the scheduler improvements) we need to know when an operation fails ... and specifically, which resource was involved. [...]
>>
>> -S
Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit
*removing our Asynchronous nature. (heh, such a key point to typo on)

From: Sandy Walsh [sandy.wa...@rackspace.com]
Sent: Wednesday, December 07, 2011 1:55 PM
To: Yun Mao
Cc: openstack@lists.launchpad.net
Subject: Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit

> True ... this idea has come up before (and is still being kicked around). My biggest concern is: what happens if that scheduler dies? We need a mechanism that can live outside of a single scheduler service. The more of these long-running processes we leave in a service, the greater the impact when something fails. Shouldn't we let the queue provide the resiliency, and not depend on the worker staying alive? [...]
>
> Personally I'm not a fan of removing our synchronous nature.
Re: [Openstack] [Orchestration] Handling error events ... explicit vs. implicit
On Wed, Dec 7, 2011 at 7:26 AM, Sandy Walsh sandy.wa...@rackspace.com wrote:

> [...] So we know what operation failed and the host it failed on, but someone needs to crack the argument nut to get the goodies. It's a fragile coupling from publisher to receiver.

I'm just wondering if we can get the notification message down to something more standardized, and avoid including the full argument list. That is one way to reduce the coupling. What is the minimum information we need to know when a failure occurs? I think we have:

* operation
* host it failed on
* instance_id
* migration_id (maybe)
* reservation_id (maybe)
* security group id (maybe)

If we can avoid cracking open the remaining arguments, a list this long might be manageable.

> One less fragile alternative is to put a try/except block inside every top-level nova.compute.manager method and send meaningful exceptions right from the source. More fidelity, but messier code. Although "explicit is better than implicit" keeps ringing in my head.

I like explicit better than implicit, but I think we need to trigger off any and all exceptions to make all of this reliable.

> Or we make a general event parser that anyone can use ... but again, the link between the actual method and the parser is fragile. The developers have to remember to update both. Opinions?
>
> -S
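[A sketch of the minimal standardized payload suggested above, instead of shipping the whole argument list. Field names are illustrative, not an agreed nova format.]

```python
def build_failure_payload(operation, host, instance_id=None,
                          migration_id=None, reservation_id=None,
                          security_group_id=None):
    """Build a flat failure payload with named, well-known fields.

    Only the ids relevant to this operation are included, so a
    consumer reads payload['instance_id'] directly instead of
    guessing at a positional index in the raw argument list."""
    payload = {'operation': operation, 'host': host}
    optional = {'instance_id': instance_id,
                'migration_id': migration_id,
                'reservation_id': reservation_id,
                'security_group_id': security_group_id}
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

The producer side still has to populate these fields explicitly in each manager method (the explicit-over-implicit trade-off), but the consumer-side coupling to argument order disappears.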