Re: [openstack-dev] [Nova] RPC Communication Errors Might Lead to a Bad State
An RPC-message TTL mechanism sounds like a good solution.

I've opened a launchpad bug, so we can move the discussion there and see
if we can think of more ideas to solve this:
https://bugs.launchpad.net/nova/+bug/1571175

Also, please see this previous bug on the same issue:
https://bugs.launchpad.net/nova/+bug/1276214
There, the bug was in a migration request, and the solution was to catch
the RPC-Timeout exception. That was possible because the migrate RPC
request is issued using a "call" and not a "cast", so it waits for
nova-conductor to acknowledge the message before considering it delivered.

We should either consider moving all RPC operations to "call", or, if we
want to return from the REST request as fast as possible, move the migrate
RPC request to "cast" as well.

I've added all this information to the launchpad bug.

Thanks,
Shoham

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
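To make the "call" versus "cast" distinction above concrete, here is a minimal sketch of a blocking request that reverts task_state when no acknowledgement arrives in time. It is not nova's code: RpcTimeout, rpc_call, api_migrate and the reply queue are hypothetical stand-ins chosen for illustration, and the in-process queue only simulates the reply channel of a real RPC library.

```python
import queue


class RpcTimeout(Exception):
    """Hypothetical stand-in for an RPC library's timeout error."""


def rpc_call(reply_queue, timeout):
    """Blocking 'call': wait for a reply, raise if none arrives in time."""
    try:
        return reply_queue.get(timeout=timeout)
    except queue.Empty:
        raise RpcTimeout("no reply within %.2fs" % timeout)


def api_migrate(instance, reply_queue, timeout=0.05):
    """API side: mark the instance busy, then wait for an acknowledgement."""
    instance["task_state"] = "migrating"
    try:
        return rpc_call(reply_queue, timeout)
    except RpcTimeout:
        # Nobody acknowledged the request, so undo the state change
        # instead of leaving the instance stuck as "migrating".
        instance["task_state"] = None
        raise
```

The cost of this approach is the one raised in the thread: the API worker is blocked for up to the full timeout on every request, which is why a "cast" returns faster but cannot detect the lost message.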
Re: [openstack-dev] [Nova] RPC Communication Errors Might Lead to a Bad State
>> I have wanted to make a change for a while that involves a TTL on
>> messages, along with a deadline record so that we can know when to retry
>> or revert things that were in flight. This requires a lot of machinery
>> to accomplish, and is probably interwoven with the task concept we've
>> had on the back burner for a while. The complexity of moving nova to
>> this sort of scheme means that nobody has picked it up as of yet, but
>> it's certainly in the minds of many of us as something we need to do
>> before too long.
>
> Are you still thinking of this kind of mechanism deployment?
> We need any kind of RPC handling mechanism at the end of the day.

I'm not sure what you're saying exactly. The above would be something we
integrate with our RPC calls to signal to us when they may have been
dropped or failed. It wouldn't replace the mechanism or need for RPC at a
fundamental level.

--Dan
Re: [openstack-dev] [Nova] RPC Communication Errors Might Lead to a Bad State
Hi,

Just asking out of curiosity (inline).

On Thu, Apr 14, 2016 at 12:34 AM, Dan Smith wrote:
>> * nova-api should receive an acknowledgement from nova-compute. It is
>> unclear to me why today it uses a non-reply mechanism - probably to
>> free the worker as fast as it can.
>
> Yes, wherever possible, we want the API to return immediately and let
> the action complete later. Making a wholesale change to blocking calls
> from the API to any other service is not a good idea, IMHO.
>
>> * Change the task_state mechanism to prevent this kind of a stuck
>> state to stay in the DB. nova-compute can be the one that writes the
>> task_state to the DB, but this is not enough of course, but maybe
>> there's another way?
>
> The task_state being set in the API is our way of limiting/locking the
> operation so that if the request is queued for a long time, a user
> doesn't reissue the command a bunch of times and add load to the API
> and/or jam up the queue with a thousand requests to do the same
> operation just because it's taking a while.
>
>> * nova-api could start a timer for the action to complete. If the
>> shelving operation hasn't completed in X seconds, it will clean it
>> by itself and rollback\try-again.
>
> I have wanted to make a change for a while that involves a TTL on
> messages, along with a deadline record so that we can know when to retry
> or revert things that were in flight. This requires a lot of machinery
> to accomplish, and is probably interwoven with the task concept we've
> had on the back burner for a while. The complexity of moving nova to
> this sort of scheme means that nobody has picked it up as of yet, but
> it's certainly in the minds of many of us as something we need to do
> before too long.

Are you still thinking of this kind of mechanism deployment?
We need any kind of RPC handling mechanism at the end of the day.
>
> --Dan

--
Email: shin...@linux.com
GitHub: shinobu-x
Blog: Life with Distributed Computational System based on OpenSource
Re: [openstack-dev] [Nova] RPC Communication Errors Might Lead to a Bad State
> * nova-api should receive an acknowledgement from nova-compute. It is
> unclear to me why today it uses a non-reply mechanism - probably to
> free the worker as fast as it can.

Yes, wherever possible, we want the API to return immediately and let
the action complete later. Making a wholesale change to blocking calls
from the API to any other service is not a good idea, IMHO.

> * Change the task_state mechanism to prevent this kind of a stuck
> state to stay in the DB. nova-compute can be the one that writes the
> task_state to the DB, but this is not enough of course, but maybe
> there's another way?

The task_state being set in the API is our way of limiting/locking the
operation so that if the request is queued for a long time, a user
doesn't reissue the command a bunch of times and add load to the API
and/or jam up the queue with a thousand requests to do the same
operation just because it's taking a while.

> * nova-api could start a timer for the action to complete. If the
> shelving operation hasn't completed in X seconds, it will clean it
> by itself and rollback\try-again.

I have wanted to make a change for a while that involves a TTL on
messages, along with a deadline record so that we can know when to retry
or revert things that were in flight. This requires a lot of machinery
to accomplish, and is probably interwoven with the task concept we've
had on the back burner for a while. The complexity of moving nova to
this sort of scheme means that nobody has picked it up as of yet, but
it's certainly in the minds of many of us as something we need to do
before too long.

--Dan
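One possible reading of the "TTL on messages, along with a deadline record" idea, combined with task_state acting as a lock, is sketched below. This is only an illustration under stated assumptions, not a design anyone in the thread proposed in detail: Message, DeadlineStore and consume are hypothetical names, the store is an in-memory dict rather than the nova DB, and time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Message:
    action: str
    instance_id: str
    expires_at: float  # TTL expressed as an absolute deadline


class DeadlineStore:
    """Hypothetical record of in-flight operations and their deadlines."""

    def __init__(self):
        self.task_state = {}  # instance_id -> task_state or None
        self.deadlines = {}   # instance_id -> deadline of in-flight action

    def begin(self, instance_id, action, now, ttl):
        # task_state doubles as a lock: a re-issued request is rejected
        # instead of being queued a second time.
        if self.task_state.get(instance_id) is not None:
            raise RuntimeError("operation already in flight")
        self.task_state[instance_id] = action
        self.deadlines[instance_id] = now + ttl
        return Message(action, instance_id, expires_at=now + ttl)

    def complete(self, instance_id):
        self.task_state[instance_id] = None
        self.deadlines.pop(instance_id, None)

    def revert_expired(self, now):
        """Revert every in-flight action whose deadline has passed."""
        expired = [i for i, d in self.deadlines.items() if now >= d]
        for instance_id in expired:
            self.complete(instance_id)
        return expired


def consume(store, msg, now):
    """Receiver side: refuse to act on a message whose TTL has passed."""
    if now >= msg.expires_at:
        return False
    store.complete(msg.instance_id)  # the real work would happen here
    return True
```

The sweeper and the receiver use the same cutoff, so a message that the sweeper has already reverted is never acted on later. A real system would also need clocks that agree across services, which this sketch does not address.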
[openstack-dev] [Nova] RPC Communication Errors Might Lead to a Bad State
Hi all,

There are some cases where a communication failure between the different
nova services might cause a bad state in the system. For example, when
"shelving" a VM, nova-api sets the VM's task_state to "shelving" and sends
an RPC to nova-compute, which shelves the VM and resets its task_state in
the DB.

But if, for some reason, nova-compute didn't get the message (e.g. the RPC
service was down, there's a bug in the RPC service, nova-compute was down,
or there was a temporary network malfunction), the VM is now stuck as
"shelving", and the user can't perform any operation on the stuck VM.

This example applies to a couple of scenarios in the system that involve
communication between different services.

From nova-api's point of view, all it does is send a message through RPC;
it neither checks that the message was received, nor waits for a reply or
an acknowledgement from the receiver.

Of course, to solve this, a user can "reset-state" on a VM and try to run
the action again, but this is error-prone and doesn't scale.

Possible solutions might be:
- nova-api should receive an acknowledgement from nova-compute. It is
  unclear to me why today it uses a non-reply mechanism - probably to
  free the worker as fast as it can.
- Change the task_state mechanism to prevent this kind of stuck state
  from staying in the DB. nova-compute could be the one that writes the
  task_state to the DB; this is not enough, of course, but maybe there's
  another way?
- nova-api could start a timer for the action to complete. If the
  shelving operation hasn't completed in X seconds, it will clean it up
  by itself and roll back or try again.

What do you think about the problem and the solutions?

Thanks,
Shoham Peller
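The failure mode and the timer-based proposal described above can be sketched in a few lines. This is a toy simulation, not nova code: the instances dict stands in for the DB, the deliver flag simulates whether the RPC message reached nova-compute, and cast_shelve, api_shelve and expire_stuck_tasks are hypothetical names.

```python
import time

# Hypothetical in-memory stand-in for the instances table.
instances = {"vm-1": {"task_state": None, "task_started_at": None}}


def cast_shelve(instance_id, deliver):
    """Fire-and-forget: returns immediately and never learns whether the
    message arrived. `deliver` simulates whether it was delivered."""
    if deliver:
        # nova-compute side: shelve the VM, then clear task_state.
        instances[instance_id]["task_state"] = None
        instances[instance_id]["task_started_at"] = None


def api_shelve(instance_id, deliver=True, now=None):
    """API side: set task_state, send the message, return right away."""
    inst = instances[instance_id]
    if inst["task_state"] is not None:
        raise RuntimeError("instance is busy: %s" % inst["task_state"])
    inst["task_state"] = "shelving"
    inst["task_started_at"] = time.monotonic() if now is None else now
    cast_shelve(instance_id, deliver)


def expire_stuck_tasks(timeout, now=None):
    """Timer proposal: clear task_state for actions older than `timeout`."""
    now = time.monotonic() if now is None else now
    expired = []
    for instance_id, inst in instances.items():
        started = inst["task_started_at"]
        if inst["task_state"] is not None and now - started > timeout:
            inst["task_state"] = None
            inst["task_started_at"] = None
            expired.append(instance_id)
    return expired
```

Calling api_shelve("vm-1", deliver=False) leaves the instance in "shelving" and every later request is rejected until expire_stuck_tasks runs past the timeout. Note that blindly clearing the state is only safe if the lost message can never be processed afterwards; this sketch does not handle a message that is merely delayed.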