Re: [openstack-dev] [Oslo] [Oslo.messaging] RPC failover handling in rabbitmq driver

2014-07-30 Thread Bogdan Dobrelya
On 07/28/2014 11:20 AM, Bogdan Dobrelya wrote:
 Hello.
 I'd like to bring your attention to a major RPC failover issue in
 impl_rabbit.py [0]. There are several *related* patches, and a number of
 concerns should be considered as well:
 - Passive exchanges fix [1] (looks like the problem is much deeper than
 it seems, though).
 - The first version of the fix [2], which makes the producer declare a
 queue and bind it to the exchange, just as the consumer does.
 - Making all RPC-involved reply_* queues durable in order to preserve
 them in RabbitMQ after a failover (there could be a TTL for such queues
 as well).
 - RPC throughput tuning patch [3]
 
 I believe the issue [0] should be at least prioritized and assigned to
 some milestone.
 
 [0] https://bugs.launchpad.net/oslo.messaging/+bug/1338732
 [1] https://review.openstack.org/#/c/109373/
 [2]
 https://github.com/noelbk/oslo.messaging/commit/960fc26ff050ca3073ad90eccbef1ca95712e82e
 [3] https://review.openstack.org/#/c/109143/
 

There is a small update for this RabbitMQ RPC failover research:
Stan Lagun submitted the patch [0] for the related bug [1].
Please don't hesitate to join the review process.

Basically, the idea of the patch is to address step 3
(rabbit dies and restarts) for *mirrored rabbit clusters*.
Obviously, it changes nothing for the single rabbit host case, because we
cannot fail over when we have no cluster.
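
For context, here is a minimal sketch of the redeclare-on-failover idea
using kombu (which impl_rabbit builds on); the broker URLs, queue name and
payload are illustrative, not taken from the patch itself:

    from kombu import Connection, Exchange, Queue

    # Two nodes of a mirrored RabbitMQ cluster; kombu fails over between
    # the semicolon-separated URLs when the active connection drops.
    conn = Connection('amqp://host1//;amqp://host2//',
                      failover_strategy='round-robin')

    exchange = Exchange('openstack', type='topic')
    reply_q = Queue('reply_1a2b3c', exchange, routing_key='reply_1a2b3c')

    producer = conn.Producer()
    # ensure() wraps publish() so that on a connection error it reconnects
    # (to the next URL) and retries on a fresh channel; declare=[...]
    # re-creates the reply queue that may have been lost in the failover.
    publish = conn.ensure(producer, producer.publish, max_retries=3)
    publish({'result': 42}, exchange=exchange, routing_key='reply_1a2b3c',
            declare=[reply_q])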

I agree the issue is more general than just impl_rabbit, but at least we
could start addressing it from here.

Speaking in general, it looks like RPC should be standardized more
thoroughly, maybe as a new RFC, and it should provide rules for:
  a) how to handle AMQP connection HA failovers at the RPC layer, both for
drivers and applications, and both for the client and the server side
(speaking in terms of RPC);
  b) how to handle RPC retries in single AMQP host configurations and
in HA as well (see the sketch below).
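
A rough sketch of the application-level half of rule b), assuming the call
is idempotent and using the modern oslo_messaging namespace (the helper
name is made up for illustration):

    import oslo_messaging

    def call_with_retry(client, ctxt, method, retries=3, **kwargs):
        # Re-issue an idempotent RPC call if it times out, e.g. because
        # the request or the reply was lost during an AMQP failover.
        # Only safe when the server tolerates duplicate invocations.
        for attempt in range(retries):
            try:
                return client.call(ctxt, method, **kwargs)
            except oslo_messaging.MessagingTimeout:
                if attempt == retries - 1:
                    raise
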
That would also allow AMQP driver developers to borrow some logic
from the app layer, if needed (and vice versa for app developers), without
causing the havoc and sorrow we have now in oslo.messaging :-)

[0] https://review.openstack.org/110058
[1] https://bugs.launchpad.net/oslo.messaging/+bug/1349301

-- 
Best regards,
Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando



[openstack-dev] [Oslo] [Oslo.messaging] RPC failover handling in rabbitmq driver

2014-07-28 Thread Bogdan Dobrelya
Hello.
I'd like to bring your attention to a major RPC failover issue in
impl_rabbit.py [0]. There are several *related* patches, and a number of
concerns should be considered as well:
- Passive exchanges fix [1] (looks like the problem is much deeper than
it seems, though).
- The first version of the fix [2], which makes the producer declare a
queue and bind it to the exchange, just as the consumer does.
- Making all RPC-involved reply_* queues durable in order to preserve
them in RabbitMQ after a failover (there could be a TTL for such queues
as well; see the sketch after this list).
- RPC throughput tuning patch [3]
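
As a minimal sketch of the durable reply queue idea above, using the kombu
API that impl_rabbit builds on (the queue name and TTL values here are
illustrative, not taken from the actual patches):

    from kombu import Exchange, Queue

    reply_q = Queue(
        'reply_1a2b3c',                  # per-call reply queue
        Exchange('openstack', type='topic', durable=True),
        routing_key='reply_1a2b3c',
        durable=True,      # the queue survives a broker restart/failover
        expires=600,       # x-expires: broker drops the queue once unused
        message_ttl=60,    # x-message-ttl: stale replies are discarded
    )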

I believe the issue [0] should be at least prioritized and assigned to
some milestone.

[0] https://bugs.launchpad.net/oslo.messaging/+bug/1338732
[1] https://review.openstack.org/#/c/109373/
[2]
https://github.com/noelbk/oslo.messaging/commit/960fc26ff050ca3073ad90eccbef1ca95712e82e
[3] https://review.openstack.org/#/c/109143/

-- 
Best regards,
Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando



Re: [openstack-dev] [Oslo] [Oslo.messaging] RPC failover handling in rabbitmq driver

2014-07-28 Thread Gordon Sim

On 07/28/2014 09:20 AM, Bogdan Dobrelya wrote:

Hello.
I'd like to bring your attention to a major RPC failover issue in
impl_rabbit.py [0]. There are several *related* patches, and a number of
concerns should be considered as well:
- Passive exchanges fix [1] (looks like the problem is much deeper than
it seems, though).
- The first version of the fix [2], which makes the producer declare a
queue and bind it to the exchange, just as the consumer does.
- Making all RPC-involved reply_* queues durable in order to preserve
them in RabbitMQ after a failover (there could be a TTL for such queues
as well).
- RPC throughput tuning patch [3]

I believe the issue [0] should be at least prioritized and assigned to
some milestone.


I think the real issue is the lack of clarity around what guarantees are 
made by the API.


Is it the case that an RPC call should never fail (i.e. never time out) 
due to failover? Either way, the answer to this should be very clear.


If failures may occur, then the calling code needs to handle that. If 
eliminating failures is part of the 'contract' then the library should 
have a clear strategy for ensuring (and testing) this.


Another possible scenario is that the connection is lost immediately 
after writing the request message to the socket (but before it is 
processed by the rabbit broker). In this case the issue is that the 
request is not confirmed, so it can complete before it is 'safe'. In 
other words, requests are unreliable.
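
That window can be detected with RabbitMQ publisher confirms. Here is a
sketch using the pika client for brevity (not the kombu library that
impl_rabbit uses); the names are illustrative, and it assumes pika's
blocking API, where a confirmed publish raises on failure:

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters('rabbit-host'))
    ch = conn.channel()
    ch.confirm_delivery()  # enable publisher confirms on this channel

    try:
        # With confirms on, basic_publish blocks until the broker acks;
        # a nack or an unroutable mandatory message raises instead of
        # the send silently "completing" before the request is safe.
        ch.basic_publish(exchange='openstack', routing_key='topic.server',
                         body=b'{"method": "ping"}', mandatory=True)
    except (pika.exceptions.UnroutableError, pika.exceptions.NackError):
        pass  # the broker did not accept the request: retry or fail fast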


My own view is that if you want to avoid time outs on failover, the best 
approach is to have oslo.messaging retry the entire request regardless 
of the point it had reached in the previous attempt. I.e. rather than 
trying to make delivery of responses reliable, assume that both requests 
and responses are unreliable and re-issue the request immediately on 
failover. (The retry logic could even be made independent of any driver 
if desired).
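
A rough sketch of that retry loop; send_and_wait and ConnectionLost are
hypothetical stand-ins for driver internals, not oslo.messaging names:

    import time

    class ConnectionLost(Exception):
        """Hypothetical: the broker connection dropped mid-call."""

    def call(request, timeout, send_and_wait):
        # Treat both request and reply as unreliable: on failover,
        # re-issue the whole request immediately rather than relying
        # on replication of every queue and message.
        deadline = time.time() + timeout
        while True:
            remaining = deadline - time.time()
            if remaining <= 0:
                raise TimeoutError('RPC call timed out')
            try:
                # publish the request, then block for the matching reply
                return send_and_wait(request, remaining)
            except ConnectionLost:
                # reconnected underneath; re-sending with the same msg_id
                # lets a late duplicate reply be matched and discarded
                continue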


This is perhaps a bigger change, but I think it is easier to get
right, and it will also be more scalable and performant since it doesn't
require replication of every queue and every message.

[0] https://bugs.launchpad.net/oslo.messaging/+bug/1338732
[1] https://review.openstack.org/#/c/109373/
[2]
https://github.com/noelbk/oslo.messaging/commit/960fc26ff050ca3073ad90eccbef1ca95712e82e
[3] https://review.openstack.org/#/c/109143/


Re: [openstack-dev] [Oslo] [Oslo.messaging] RPC failover handling in rabbitmq driver

2014-07-28 Thread Ken Giusti
On Mon, 28 Jul 2014 10:58:02 +0100, Gordon Sim wrote:
 On 07/28/2014 09:20 AM, Bogdan Dobrelya wrote:
  Hello.
  I'd like to bring your attention to a major RPC failover issue in
  impl_rabbit.py [0]. There are several *related* patches, and a number of
  concerns should be considered as well:
  - Passive exchanges fix [1] (looks like the problem is much deeper than
  it seems, though).
  - The first version of the fix [2], which makes the producer declare a
  queue and bind it to the exchange, just as the consumer does.
  - Making all RPC-involved reply_* queues durable in order to preserve
  them in RabbitMQ after a failover (there could be a TTL for such queues
  as well).
  - RPC throughput tuning patch [3]
 
  I believe the issue [0] should be at least prioritized and assigned to
  some milestone.

 I think the real issue is the lack of clarity around what guarantees are
 made by the API.


Wholeheartedly agree!  This lack of explicitness makes it very
difficult to add new messaging backends (drivers) to oslo.messaging
and expect the API to function uniformly from the application's point
of view.  The end result is that oslo.messaging's API behavior is
somewhat implicitly defined by the characteristics of the RPC backend
(broker), rather than by oslo.messaging itself.

In other words: we need to solve this problem in general, not just for
the rabbit driver.

 Is it the case that an RPC call should never fail (i.e. never time out)
 due to failover? Either way, the answer to this should be very clear.

 If failures may occur, then the calling code needs to handle that. If
 eliminating failures is part of the 'contract' then the library should
 have a clear strategy for ensuring (and testing) this.

 Another possible scenario is that the connection is lost immediately
 after writing the request message to the socket (but before it is
 processed by the rabbit broker). In this case the issue is that the
 request is not confirmed, so it can complete before it is 'safe'. In
 other words requests are unreliable.

 My own view is that if you want to avoid time outs on failover, the best
 approach is to have oslo.messaging retry the entire request regardless
 of the point it had reached in the previous attempt. I.e. rather than
 trying to make delivery of responses reliable, assume that both requests
 and responses are unreliable and re-issue the request immediately on
 failover.

I like this suggestion. By assuming limited reliability from the
underlying messaging system, we reduce oslo.messaging's reliance on
features provided by any particular messaging implementation
(driver/broker).

 (The retry logic could even be made independent of any driver
 if desired).

Exactly!  Having all QoS-related code outside of the drivers would
guarantee that the behavior of the API is _uniform_ across all
drivers.
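
A sketch of that layering (all names are illustrative, not oslo.messaging
internals): the retry/QoS policy composes with whatever driver sits
underneath, so every backend inherits identical semantics:

    class TransportError(Exception):
        """Hypothetical common error each driver maps its failures to."""

    class ReliableRPC:
        # QoS policy lives above the driver boundary; a driver only has
        # to send once and signal failure uniformly via TransportError.
        def __init__(self, driver, max_retries=3):
            self.driver = driver
            self.max_retries = max_retries

        def call(self, target, message, timeout):
            last_exc = None
            for _ in range(self.max_retries):
                try:
                    return self.driver.send(target, message, timeout=timeout)
                except TransportError as exc:
                    last_exc = exc  # retry against the reconnected broker
            raise last_exc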


 This is perhaps a bigger change, but I think it is easier to get
 right, and it will also be more scalable and performant since it doesn't
 require replication of every queue and every message.

  [0] https://bugs.launchpad.net/oslo.messaging/+bug/1338732
  [1] https://review.openstack.org/#/c/109373/
  [2]
  https://github.com/noelbk/oslo.messaging/commit/960fc26ff050ca3073ad90eccbef1ca95712e82e
  [3] https://review.openstack.org/#/c/109143/

-- 
Ken Giusti  (kgiu...@gmail.com)
