Re: [openstack-dev] [Nova] RPC Communication Errors Might Lead to a Bad State

2016-04-16 Thread Shoham Peller

An RPC-message TTL mechanism sounds like a good solution.

I've opened a Launchpad bug so we can move the discussion there and see
if we can come up with more ideas to solve this:
https://bugs.launchpad.net/nova/+bug/1571175

Also, please see this previous bug on the same issue:
https://bugs.launchpad.net/nova/+bug/1276214

In that case the bug concerned a migration request, and the fix was to
catch the RPC timeout exception. That works because the migrate RPC request
is issued as a "call" rather than a "cast", so the caller waits for
nova-conductor to reply before treating the message as delivered.
We should either move all RPC operations to "call" or, if we want the REST
request to return as fast as possible, move the migrate RPC request to
"cast" as well.
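
To make the "call" versus "cast" difference concrete, here is a minimal in-process sketch. It is a toy built on Python's standard library, not the oslo.messaging API; ToyRpcClient, RpcTimeout, serve_one and the method names passed to it are hypothetical names chosen for illustration.

```python
import queue
import threading


class RpcTimeout(Exception):
    """Raised when a blocking call gets no reply in time (hypothetical name)."""


class ToyRpcClient:
    """Toy in-process stand-in for an RPC client; not the oslo.messaging API."""

    def __init__(self, server_queue):
        self._server_queue = server_queue

    def cast(self, method, **kwargs):
        # Fire-and-forget: enqueue and return. The caller never learns
        # whether anyone consumed the message.
        self._server_queue.put((method, kwargs, None))

    def call(self, method, timeout, **kwargs):
        # Blocking: enqueue, then wait for a reply on a private queue.
        reply_queue = queue.Queue(maxsize=1)
        self._server_queue.put((method, kwargs, reply_queue))
        try:
            return reply_queue.get(timeout=timeout)
        except queue.Empty:
            raise RpcTimeout(method) from None


def serve_one(server_queue):
    """Consume a single message and reply if the sender asked for one."""
    method, kwargs, reply_queue = server_queue.get()
    if reply_queue is not None:
        reply_queue.put({"method": method, "ok": True})


def demo():
    # Server is up: call gets a reply.
    live_queue = queue.Queue()
    client = ToyRpcClient(live_queue)
    worker = threading.Thread(target=serve_one, args=(live_queue,))
    worker.start()
    reply = client.call("migrate_server", timeout=1.0, instance="vm-1")
    worker.join()

    # Nobody is listening: cast returns silently, call raises.
    dead_client = ToyRpcClient(queue.Queue())
    dead_client.cast("shelve_instance", instance="vm-1")  # no error, no signal
    try:
        dead_client.call("migrate_server", timeout=0.05, instance="vm-1")
        timed_out = False
    except RpcTimeout:
        timed_out = True
    return reply, timed_out
```

In this toy, only the "call" path gives the sender something to catch when the receiver is absent; the "cast" path returns normally either way.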

I've added all this information to the Launchpad bug.

Thanks,
Shoham
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Nova] RPC Communication Errors Might Lead to a Bad State

2016-04-14 Thread Dan Smith

>> I have wanted to make a change for a while that involves a TTL on
>> messages, along with a deadline record so that we can know when to retry
>> or revert things that were in flight. This requires a lot of machinery
>> to accomplish, and is probably interwoven with the task concept we've
>> had on the back burner for a while. The complexity of moving nova to
>> this sort of scheme means that nobody has picked it up as of yet, but
>> it's certainly in the minds of many of us as something we need to do
>> before too long.
> 
> Are you still thinking of this kind of mechanism deployment?
> We need any kind of RPC handling mechanism at the end of the day.

I'm not sure what you're saying exactly. The above would be something we
integrate with our RPC calls to signal to us when they may have been
dropped or failed. It wouldn't replace the mechanism or need for RPC at
a fundamental level.

--Dan


Re: [openstack-dev] [Nova] RPC Communication Errors Might Lead to a Bad State

2016-04-13 Thread Shinobu Kinjo

Hi,

Just asking out of curiosity (inline).

On Thu, Apr 14, 2016 at 12:34 AM, Dan Smith wrote:
>>   * nova-api should receive an acknowledgement from nova-compute. It is
>> unclear to me why today it uses a non-reply mechanism - probably to
>> free the worker as fast as it can.
>
> Yes, wherever possible, we want the API to return immediately and let
> the action complete later. Making a wholesale change to blocking calls
> from the API to any other service is not a good idea, IMHO.
>
>>   * Change the task_state mechanism to prevent this kind of a stuck
>> state to stay in the DB. nova-compute can be the one that writes the
>> task_state to the DB, but this is not enough of course, but maybe
>> there's another way?
>
> The task_state being set in the API is our way of limiting/locking the
> operation so that if the request is queued for a long time, a user
> doesn't reissue the command a bunch of times and add load to the API
> and/or jam up the queue with a thousand requests to do the same
> operation just because it's taking a while.
>
>>   * nova-api could start a timer for the action to complete. If the
>> shelving operation hasn't completed in X seconds, it will clean it
>> by itself and rollback\try-again.
>
> I have wanted to make a change for a while that involves a TTL on
> messages, along with a deadline record so that we can know when to retry
> or revert things that were in flight. This requires a lot of machinery
> to accomplish, and is probably interwoven with the task concept we've
> had on the back burner for a while. The complexity of moving nova to
> this sort of scheme means that nobody has picked it up as of yet, but
> it's certainly in the minds of many of us as something we need to do
> before too long.

Are you still thinking of this kind of mechanism deployment?
We need any kind of RPC handling mechanism at the end of the day.

>
> --Dan
>



-- 
Email:
shin...@linux.com
GitHub:
shinobu-x
Blog:
Life with Distributed Computational System based on OpenSource


Re: [openstack-dev] [Nova] RPC Communication Errors Might Lead to a Bad State

2016-04-13 Thread Dan Smith

>   * nova-api should receive an acknowledgement from nova-compute. It is
> unclear to me why today it uses a non-reply mechanism - probably to
> free the worker as fast as it can.

Yes, wherever possible, we want the API to return immediately and let
the action complete later. Making a wholesale change to blocking calls
from the API to any other service is not a good idea, IMHO.

>   * Change the task_state mechanism to prevent this kind of a stuck
> state to stay in the DB. nova-compute can be the one that writes the
> task_state to the DB, but this is not enough of course, but maybe
> there's another way?

The task_state being set in the API is our way of limiting/locking the
operation so that if the request is queued for a long time, a user
doesn't reissue the command a bunch of times and add load to the API
and/or jam up the queue with a thousand requests to do the same
operation just because it's taking a while.
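
As a rough illustration of task_state acting as a lock, the sketch below rejects a repeated request while a task is already in flight. ToyInstance, InstanceBusy and shelve_requests are hypothetical names for an in-memory model; this is not nova's Instance object or its DB layer.

```python
import threading


class InstanceBusy(Exception):
    """Raised when another task already holds the instance (hypothetical)."""


class ToyInstance:
    """In-memory stand-in for an instance record; not nova's Instance object."""

    def __init__(self, uuid):
        self.uuid = uuid
        self.task_state = None
        self._lock = threading.Lock()

    def begin_task(self, new_state):
        # Compare-and-set: only move from "no task" to new_state.
        with self._lock:
            if self.task_state is not None:
                raise InstanceBusy(
                    f"{self.uuid} is already in task_state {self.task_state!r}")
            self.task_state = new_state

    def end_task(self):
        # Normally done by the service that finishes the operation.
        with self._lock:
            self.task_state = None


def shelve_requests(instance, attempts):
    """Issue the same request several times; count accepted vs rejected."""
    accepted = rejected = 0
    for _ in range(attempts):
        try:
            instance.begin_task("shelving")
            accepted += 1  # this is where the message would be cast
        except InstanceBusy:
            rejected += 1
    return accepted, rejected
```

With this guard, five identical requests put one message on the queue instead of five. The flip side is the problem this thread is about: if end_task never runs, every later request is rejected.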

>   * nova-api could start a timer for the action to complete. If the
> shelving operation hasn't completed in X seconds, it will clean it
> by itself and rollback\try-again.

I have wanted to make a change for a while that involves a TTL on
messages, along with a deadline record so that we can know when to retry
or revert things that were in flight. This requires a lot of machinery
to accomplish, and is probably interwoven with the task concept we've
had on the back burner for a while. The complexity of moving nova to
this sort of scheme means that nobody has picked it up as of yet, but
it's certainly in the minds of many of us as something we need to do
before too long.
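
A minimal sketch of what a TTL plus a deadline record could look like, assuming an in-memory store and a periodic sweep. DeadlineRecord, DeadlineTracker and is_stale are hypothetical names for illustration only; nothing like this exists in nova as described here, and a real version would need persistence and the task machinery mentioned above.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class DeadlineRecord:
    """What was sent, for which instance, and when it is considered lost."""
    instance_uuid: str
    task_state: str
    expires_at: float


class DeadlineTracker:
    """Sender-side bookkeeping for in-flight operations (in-memory toy)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._records = {}

    def sent(self, instance_uuid, task_state, ttl_seconds):
        # Record the deadline at the moment the message is handed to RPC.
        record = DeadlineRecord(
            instance_uuid, task_state, self._clock() + ttl_seconds)
        self._records[instance_uuid] = record
        return record

    def finished(self, instance_uuid):
        # The operation completed (or failed cleanly); stop tracking it.
        self._records.pop(instance_uuid, None)

    def overdue(self):
        # Candidates for retry or revert: deadline passed, never finished.
        now = self._clock()
        return [r for r in self._records.values() if r.expires_at <= now]


def is_stale(message_expires_at, now):
    """Receiver-side TTL check: skip messages that outlived their deadline."""
    return now >= message_expires_at
```

The receiver-side check matters as much as the sweep: without it, a message that is delivered late could be acted on after the sender has already reverted or retried the operation.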

--Dan


[openstack-dev] [Nova] RPC Communication Errors Might Lead to a Bad State

2016-04-13 Thread Shoham Peller

Hi all,

In some cases, a communication failure between the nova services can leave
the system in a bad state.

For example, when "shelving" a VM, nova-api sets the VM's task_state to
"shelving" and sends an RPC to nova-compute, which shelves the VM and resets
its task_state in the DB.
But if, for some reason, nova-compute never gets the message (e.g. the RPC
service was down, there's a bug in the RPC service, nova-compute was down,
or there was a temporary network malfunction), the VM is stuck in
"shelving", and the user can't perform any operation on it.
The same applies to several other scenarios in the system that involve
communication between different services.

From nova-api's point of view, all it does is send a message through RPC;
it neither checks that the message was received nor waits for a reply or an
acknowledgement from the receiver.

Of course, a user can work around this by running "reset-state" on the VM
and retrying the action, but this is error-prone and doesn't scale.

Possible solutions might be:

   - nova-api should receive an acknowledgement from nova-compute. It is
   unclear to me why it uses a non-reply mechanism today - probably to free
   the worker as fast as it can.
   - Change the task_state mechanism so that this kind of stuck state cannot
   persist in the DB. nova-compute could be the one that writes the
   task_state to the DB; that alone is not enough, of course, but maybe
   there's another way?
   - nova-api could start a timer for the action to complete. If the
   shelving operation hasn't completed in X seconds, it cleans up by itself
   and either rolls back or tries again.

What do you think about the problem and the solutions?

Thanks,
Shoham Peller
