Re: [openstack-dev] [Oslo] [Oslo.messaging] RPC failover handling in rabbitmq driver
On 07/28/2014 11:20 AM, Bogdan Dobrelya wrote:
> Hello. I'd like to bring your attention to a major RPC failover issue in
> impl_rabbit.py [0]. There are several *related* patches, and a number of
> concerns that should be considered as well: [...]
> I believe the issue [0] should be at least prioritized and assigned to
> some milestone.

There is a small update on this RabbitMQ RPC failover research: Stan Lagun submitted the patch [0] for the related bug [1]. Please don't hesitate to join the review process.

Basically, the idea of the patch is to address step 3 (rabbit dies and restarts) for *mirrored rabbit clusters*. Obviously, it changes nothing for the single-rabbit-host case, because we cannot fail over when we have no cluster.

I agree the issue is more general than just impl_rabbit, but at least we could start addressing it here. Speaking generally, it looks like RPC should be standardized more thoroughly, maybe as a new RFC, and it should provide rules for:
a) how to handle AMQP connection HA failovers at the RPC layer, both for drivers and applications, and for both the client and server side (speaking in RPC terms);
b) how to handle RPC retries in single-AMQP-host configurations as well as in HA ones.

That would also allow AMQP driver developers to borrow some logic from the application layer, if needed (and vice versa for application developers), without causing the havoc and sorrow we have now in oslo.messaging :-)

[0] https://review.openstack.org/110058
[1] https://bugs.launchpad.net/oslo.messaging/+bug/1349301

--
Best regards,
Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
[openstack-dev] [Oslo] [Oslo.messaging] RPC failover handling in rabbitmq driver
Hello. I'd like to bring your attention to a major RPC failover issue in impl_rabbit.py [0]. There are several *related* patches, and a number of concerns that should be considered as well:

- Passive exchanges fix [1] (though the problem looks much deeper than it seems).
- The first version of the fix [2], which makes the producer declare a queue and bind it to the exchange, just as the consumer does.
- Making all reply_* queues involved in RPC durable, in order to preserve them in RabbitMQ after failover (there could be a TTL for such queues as well).
- RPC throughput tuning patch [3].

I believe the issue [0] should be at least prioritized and assigned to some milestone.

[0] https://bugs.launchpad.net/oslo.messaging/+bug/1338732
[1] https://review.openstack.org/#/c/109373/
[2] https://github.com/noelbk/oslo.messaging/commit/960fc26ff050ca3073ad90eccbef1ca95712e82e
[3] https://review.openstack.org/#/c/109143/

--
Best regards,
Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando
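The reply-queue race behind patch [2] can be illustrated with a toy direct-exchange model (all names and classes here are illustrative, not actual impl_rabbit code): after a failover, non-durable queues and bindings are gone, so a reply published before the consumer re-declares its queue is silently dropped, whereas a producer-side declare-and-bind keeps it.

```python
# Toy broker model illustrating why patch [2] has the producer declare
# and bind the reply queue as well as the consumer. Sketch only -- this
# is not oslo.messaging code.

class ToyBroker:
    def __init__(self):
        self.bindings = {}  # routing_key -> queue (a list of messages)

    def declare_and_bind(self, routing_key):
        # Idempotent, like AMQP queue.declare followed by queue.bind.
        self.bindings.setdefault(routing_key, [])

    def publish(self, routing_key, msg):
        queue = self.bindings.get(routing_key)
        if queue is None:
            return False    # unroutable: the message is silently dropped
        queue.append(msg)
        return True

# After a failover the broker state is empty (non-durable queues lost).
broker = ToyBroker()

# Case 1: only the consumer declares the queue; the reply arrives before
# the consumer reconnects and is lost.
delivered = broker.publish("reply_abc123", "result")
assert delivered is False

# Case 2: the producer declares and binds before publishing, so the
# reply waits in the queue until the consumer reconnects.
broker.declare_and_bind("reply_abc123")
delivered = broker.publish("reply_abc123", "result")
assert delivered is True
assert broker.bindings["reply_abc123"] == ["result"]
```

Making the queues durable (with a TTL), as proposed above, attacks the same race from the other side: the queue and its binding survive the failover instead of being re-created around it.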
Re: [openstack-dev] [Oslo] [Oslo.messaging] RPC failover handling in rabbitmq driver
On 07/28/2014 09:20 AM, Bogdan Dobrelya wrote:
> Hello. I'd like to bring your attention to a major RPC failover issue in
> impl_rabbit.py [0]. There are several *related* patches, and a number of
> concerns that should be considered as well: [...]
> I believe the issue [0] should be at least prioritized and assigned to
> some milestone.

I think the real issue is the lack of clarity around what guarantees are made by the API. Is it the case that an RPC call should never fail (i.e. never time out) due to failover? Either way, the answer to this should be very clear. If failures may occur, then the calling code needs to handle that. If eliminating failures is part of the 'contract', then the library should have a clear strategy for ensuring (and testing) this.

Another possible scenario is that the connection is lost immediately after the request message is written to the socket (but before it is processed by the rabbit broker). In this case the issue is that the request is not confirmed, so the send can appear to complete before the request is 'safe'. In other words, requests are unreliable.

My own view is that if you want to avoid timeouts on failover, the best approach is to have oslo.messaging retry the entire request regardless of the point it had reached in the previous attempt. I.e. rather than trying to make delivery of responses reliable, assume that both requests and responses are unreliable, and re-issue the request immediately on failover. (The retry logic could even be made independent of any driver, if desired.)

This is perhaps a bigger change, but I think it is easier to get right, and it will also be more scalable and performant, since it doesn't require replication of every queue and every message.

[0] https://bugs.launchpad.net/oslo.messaging/+bug/1338732
[1] https://review.openstack.org/#/c/109373/
[2] https://github.com/noelbk/oslo.messaging/commit/960fc26ff050ca3073ad90eccbef1ca95712e82e
[3] https://review.openstack.org/#/c/109143/
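The retry-the-whole-request approach described above can be sketched independently of any driver. Everything in this sketch is hypothetical (the `ConnectionLost` exception and the call signature are assumptions, not an existing oslo.messaging API); it assumes the server side is idempotent or deduplicates repeated requests.

```python
import time

class ConnectionLost(Exception):
    """Hypothetical exception a driver raises when the broker connection drops."""

def call_with_retry(rpc_call, request, retries=3, backoff=0.5):
    """Re-issue the entire request on failover, rather than trying to make
    the in-flight request/reply pair reliable."""
    attempt = 0
    while True:
        try:
            return rpc_call(request)
        except ConnectionLost:
            attempt += 1
            if attempt > retries:
                raise
            time.sleep(backoff * attempt)  # simple linear backoff

# Example: a fake driver call that fails once (mid-failover), then succeeds.
state = {"calls": 0}

def flaky_call(request):
    state["calls"] += 1
    if state["calls"] == 1:
        raise ConnectionLost()
    return {"result": "ok", "request": request}

reply = call_with_retry(flaky_call, {"method": "ping"}, backoff=0.05)
assert reply["result"] == "ok"
assert state["calls"] == 2   # first attempt lost, second re-issued
```

Because the wrapper only needs "a callable that may raise on connection loss", it sits naturally above the driver layer, which is what makes the logic driver-independent.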
Re: [openstack-dev] [Oslo] [Oslo.messaging] RPC failover handling in rabbitmq driver
On Mon, 28 Jul 2014 10:58:02 +0100, Gordon Sim wrote:
> I think the real issue is the lack of clarity around what guarantees are
> made by the API.

Wholeheartedly agree! This lack of explicitness makes it very difficult to add new messaging backends (drivers) to oslo.messaging and expect the API to behave uniformly from the application's point of view. The end result is that oslo.messaging's API behavior is implicitly defined by the characteristics of the RPC backend (broker), rather than by oslo.messaging itself. In other words: we need to solve this problem in general, not just for the rabbit driver.

> My own view is that if you want to avoid timeouts on failover, the best
> approach is to have oslo.messaging retry the entire request regardless
> of the point it had reached in the previous attempt. I.e. rather than
> trying to make delivery of responses reliable, assume that both requests
> and responses are unreliable, and re-issue the request immediately on
> failover.

I like this suggestion. By assuming limited reliability from the underlying messaging system, we reduce oslo.messaging's reliance on features provided by any particular messaging implementation (driver/broker).

> (The retry logic could even be made independent of any driver, if desired.)

Exactly! Keeping all QoS-related code outside of the drivers would guarantee that the behavior of the API is _uniform_ across all drivers.

[0] https://bugs.launchpad.net/oslo.messaging/+bug/1338732
[1] https://review.openstack.org/#/c/109373/
[2] https://github.com/noelbk/oslo.messaging/commit/960fc26ff050ca3073ad90eccbef1ca95712e82e
[3] https://review.openstack.org/#/c/109143/

--
Ken Giusti (kgiu...@gmail.com)
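One consequence of the retry-on-failover approach discussed in this thread is that the server side may see the same request twice, so a uniform QoS layer implies at-least-once delivery and needs duplicate suppression. A hypothetical sketch follows (the message-id key, the cache bound, and all class names are assumptions for illustration, not oslo.messaging internals):

```python
from collections import OrderedDict

class DedupDispatcher:
    """Drop re-delivered requests by remembering recently seen message ids
    and replaying the cached reply. Sketch only; a real implementation
    would also bound the cache by a TTL."""

    def __init__(self, handler, max_seen=1024):
        self.handler = handler
        self.seen = OrderedDict()   # message_id -> cached reply
        self.max_seen = max_seen

    def dispatch(self, message_id, request):
        if message_id in self.seen:
            return self.seen[message_id]   # duplicate: replay cached reply
        reply = self.handler(request)
        self.seen[message_id] = reply
        if len(self.seen) > self.max_seen:
            self.seen.popitem(last=False)  # evict the oldest entry
        return reply

# Example: the same request is dispatched twice (re-issued after failover),
# but the application handler runs only once.
calls = []

def handler(request):
    calls.append(request)
    return {"echo": request}

d = DedupDispatcher(handler)
first = d.dispatch("msg-1", "ping")
second = d.dispatch("msg-1", "ping")
assert first == second
assert len(calls) == 1
```

Like the retry logic itself, this layer needs nothing from the broker beyond message delivery, so it too can live outside the individual drivers.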