Re: [openstack-dev] [Nova] RPC Communication Errors Might Lead to a Bad State

2016-04-16 Thread Shoham Peller
An RPC-message TTL mechanism sounds like a good solution.

I've opened a launchpad bug, so we can move the discussion there, and see
if we can think of more ideas to solve this:
https://bugs.launchpad.net/nova/+bug/1571175

Also, please see this previous bug on the same issue:
https://bugs.launchpad.net/nova/+bug/1276214

In that bug, the problem was with a migration request. The solution was to
catch the RPC timeout exception. That is possible because the migrate RPC
request is issued using a "call" and not a "cast", so the caller waits for
nova-conductor to acknowledge the message before considering it delivered.
We should either consider moving all RPC operations to "call" or, if we
want to return from the REST request as fast as possible, move the
migrate RPC request to "cast" as well.
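
To make the "call" versus "cast" difference concrete, here is a minimal
in-memory sketch. It is not nova or oslo.messaging code: ToyRpcClient,
RpcTimeout and serve are hypothetical names, and a queue with no consumer
stands in for a dropped message.

```python
import queue
import threading


class RpcTimeout(Exception):
    """Hypothetical stand-in for an RPC timeout error."""


class ToyRpcClient:
    """Toy in-memory client (an assumption for illustration only)."""

    def __init__(self, requests):
        self.requests = requests

    def cast(self, method):
        # Fire and forget: enqueue and return at once.
        # A lost message goes unnoticed by the caller.
        self.requests.put((method, None))

    def call(self, method, timeout):
        # Blocking: wait for a reply, raise if none arrives in time.
        replies = queue.Queue()
        self.requests.put((method, replies))
        try:
            return replies.get(timeout=timeout)
        except queue.Empty:
            raise RpcTimeout(method)


def serve(requests):
    """Toy server loop: answers each request until it sees None."""
    while True:
        item = requests.get()
        if item is None:
            return
        method, replies = item
        if replies is not None:
            replies.put(method + " done")


# Healthy path: a worker consumes the queue, so call() gets its reply.
healthy = queue.Queue()
worker = threading.Thread(target=serve, args=(healthy,))
worker.start()
call_result = ToyRpcClient(healthy).call("migrate", timeout=1.0)
healthy.put(None)
worker.join()

# Failure path: nobody consumes this queue, simulating a dropped message.
dead = queue.Queue()
lost = ToyRpcClient(dead)
lost.cast("shelve")  # returns normally, the caller learns nothing
try:
    lost.call("migrate", timeout=0.1)
    call_timed_out = False
except RpcTimeout:
    call_timed_out = True  # only the blocking call can detect the loss
```

The trade-off is visible here: the cast returns immediately but cannot
report the loss, while the call reports it at the cost of blocking the
caller for up to the timeout.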

I've added all this information to the launchpad bug.

Thanks,
Shoham
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] RPC Communication Errors Might Lead to a Bad State

2016-04-14 Thread Dan Smith
>> I have wanted to make a change for a while that involves a TTL on
>> messages, along with a deadline record so that we can know when to retry
>> or revert things that were in flight. This requires a lot of machinery
>> to accomplish, and is probably interwoven with the task concept we've
>> had on the back burner for a while. The complexity of moving nova to
>> this sort of scheme means that nobody has picked it up as of yet, but
>> it's certainly in the minds of many of us as something we need to do
>> before too long.
> 
> Are you still thinking of this kind of mechanism deployment?
> We need any kind of RPC handling mechanism at the end of the day.

I'm not sure what you're saying exactly. The above would be something we
integrate with our RPC calls to signal to us when they may have been
dropped or failed. It wouldn't replace the mechanism or need for RPC at
a fundamental level.

--Dan



Re: [openstack-dev] [Nova] RPC Communication Errors Might Lead to a Bad State

2016-04-13 Thread Shinobu Kinjo
Hi,

Just asking out of curiosity (inline).

On Thu, Apr 14, 2016 at 12:34 AM, Dan Smith wrote:
>>   * nova-api should receive an acknowledgement from nova-compute. It is
>> unclear to me why today it uses a non-reply mechanism - probably to
>> free the worker as fast as it can.
>
> Yes, wherever possible, we want the API to return immediately and let
> the action complete later. Making a wholesale change to blocking calls
> from the API to any other service is not a good idea, IMHO.
>
>>   * Change the task_state mechanism to prevent this kind of stuck
>> state from staying in the DB. nova-compute could be the one that writes
>> the task_state to the DB. That is not enough on its own, of course, but
>> maybe there's another way?
>
> The task_state being set in the API is our way of limiting/locking the
> operation so that if the request is queued for a long time, a user
> doesn't reissue the command a bunch of times and add load to the API
> and/or jam up the queue with a thousand requests to do the same
> operation just because it's taking a while.
>
>>   * nova-api could start a timer for the action to complete. If the
>> shelving operation hasn't completed in X seconds, it will clean up
>> by itself and roll back or try again.
>
> I have wanted to make a change for a while that involves a TTL on
> messages, along with a deadline record so that we can know when to retry
> or revert things that were in flight. This requires a lot of machinery
> to accomplish, and is probably interwoven with the task concept we've
> had on the back burner for a while. The complexity of moving nova to
> this sort of scheme means that nobody has picked it up as of yet, but
> it's certainly in the minds of many of us as something we need to do
> before too long.

Are you still thinking of this kind of mechanism deployment?
We need any kind of RPC handling mechanism at the end of the day.

>
> --Dan
>



-- 
Email:
shin...@linux.com
GitHub:
shinobu-x
Blog:
Life with Distributed Computational System based on OpenSource



Re: [openstack-dev] [Nova] RPC Communication Errors Might Lead to a Bad State

2016-04-13 Thread Dan Smith
>   * nova-api should receive an acknowledgement from nova-compute. It is
> unclear to me why today it uses a non-reply mechanism - probably to
> free the worker as fast as it can.

Yes, wherever possible, we want the API to return immediately and let
the action complete later. Making a wholesale change to blocking calls
from the API to any other service is not a good idea, IMHO.

>   * Change the task_state mechanism to prevent this kind of stuck
> state from staying in the DB. nova-compute could be the one that writes
> the task_state to the DB. That is not enough on its own, of course, but
> maybe there's another way?

The task_state being set in the API is our way of limiting/locking the
operation so that if the request is queued for a long time, a user
doesn't reissue the command a bunch of times and add load to the API
and/or jam up the queue with a thousand requests to do the same
operation just because it's taking a while.
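
As a rough illustration of that locking role, here is a minimal sketch.
It is not nova code: the dict-based instance, start_task and InstanceBusy
are hypothetical names used only for this example.

```python
class InstanceBusy(Exception):
    """Hypothetical error for a request that hits an in-progress task."""


def start_task(instance, new_task_state):
    # Compare-and-set: only an idle instance (task_state is None)
    # accepts new work. Anything else is rejected up front.
    if instance["task_state"] is not None:
        raise InstanceBusy(instance["task_state"])
    instance["task_state"] = new_task_state


instance = {"uuid": "hypothetical-uuid", "task_state": None}

# First request takes the "lock" by setting task_state.
start_task(instance, "shelving")

# A repeated request is refused instead of being queued again.
try:
    start_task(instance, "shelving")
    second_accepted = True
except InstanceBusy:
    second_accepted = False
```

The same sketch shows the downside raised in this thread: if the message
that should trigger the work is lost, nothing ever resets task_state, and
every later request is refused.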

>   * nova-api could start a timer for the action to complete. If the
> shelving operation hasn't completed in X seconds, it will clean up
> by itself and roll back or try again.

I have wanted to make a change for a while that involves a TTL on
messages, along with a deadline record so that we can know when to retry
or revert things that were in flight. This requires a lot of machinery
to accomplish, and is probably interwoven with the task concept we've
had on the back burner for a while. The complexity of moving nova to
this sort of scheme means that nobody has picked it up as of yet, but
it's certainly in the minds of many of us as something we need to do
before too long.
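
A very rough sketch of the idea, under my own assumptions and not an
existing nova design: the sender stamps each message with an expiry time
and keeps a deadline record, the consumer ignores stale messages, and a
periodic check reverts the task_state once the deadline has passed. All
names here (Deadline, send_with_ttl, is_fresh, revert_if_expired) are
hypothetical, and time is passed in explicitly to keep the example
deterministic.

```python
from dataclasses import dataclass


@dataclass
class Deadline:
    """Hypothetical record written when a request is sent."""
    instance_uuid: str
    task_state: str
    expires_at: float


def send_with_ttl(instance, task_state, ttl, now):
    # Sender side: mark the instance busy, stamp the message with an
    # expiry time, and keep a matching deadline record.
    instance["task_state"] = task_state
    message = {"method": task_state, "expires_at": now + ttl}
    return message, Deadline(instance["uuid"], task_state, now + ttl)


def is_fresh(message, now):
    # Consumer side: a message past its TTL must not be acted on,
    # because the sender may already have reverted the operation.
    return now <= message["expires_at"]


def revert_if_expired(instance, deadline, now):
    # Periodic check: if the deadline passed and the instance is still
    # in the same task_state, assume the message was lost and revert.
    if now > deadline.expires_at and instance["task_state"] == deadline.task_state:
        instance["task_state"] = None
        return True
    return False


instance = {"uuid": "hypothetical-uuid", "task_state": None}
message, deadline = send_with_ttl(instance, "shelving", ttl=60.0, now=100.0)

reverted_early = revert_if_expired(instance, deadline, now=130.0)
state_before_deadline = instance["task_state"]
reverted_late = revert_if_expired(instance, deadline, now=161.0)
```

The TTL and the deadline share one expiry time on purpose: once the
sender reverts, any copy of the message that shows up later is already
stale and gets dropped by the consumer.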

--Dan
