Public bug reported:

A lot of my notes are in https://review.openstack.org/#/c/591657/ where I was testing a down cell on a devstack deployment.

To simulate a down cell, I changed the database_connection value for the cell1 cell to be an invalid IP (192.0.0.1) and then restarted [email protected]. With the default configs in devstack, the service was hanging trying to respond to a simple GET / request to list versions.

It looks like the problem is that each nova.compute.api.API object that gets created for each route handler (for each API worker, of which I had 2) tries to get the minimum nova-compute service version across all cells:

https://github.com/openstack/nova/blob/0bed18ffbb46c4f2d0ec87e64a39188c165398eb/nova/compute/api.py#L261

https://github.com/openstack/nova/blob/0bed18ffbb46c4f2d0ec87e64a39188c165398eb/nova/compute/rpcapi.py#L373

https://github.com/openstack/nova/blob/0bed18ffbb46c4f2d0ec87e64a39188c165398eb/nova/compute/rpcapi.py#L395

This is a snip of the API log while waiting for the GET / response:

http://paste.openstack.org/show/744983/

As a result I got this unhelpful client side error:

http://paste.openstack.org/show/744984/

I know that's where the failure was because I was also getting this:

Feb 13 00:09:57 downcell [email protected][14623]: DEBUG nova.compute.rpcapi [None req-53ebccae-d210-4b14-af5c-02775f3d36e8 None None] Not caching compute RPC version_cap, because min service_version is 0. Please ensure a nova-compute service has been started. Defaulting to current version. {{(pid=14625) _determine_version_cap /opt/stack/nova/nova/compute/rpcapi.py:410}}

The minimum nova-compute service version isn't getting cached in nova-api if running under uwsgi anyway, for which I reported bug 1815692.

The way I worked around the issue was by setting [upgrade_levels]/compute=rocky, but that's probably not something we want to rely on: we should be able to set it to 'auto' and have the code calculate it for us, except that 'auto' can currently hang the API workers.
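To illustrate why the lookup repeats for every route handler, here is a minimal sketch of the behaviour the debug message above describes. It is a simplified model under stated assumptions, not nova's actual code: the names determine_version_cap, FakeLookup and simulate, and the version numbers 30 and 35, are all hypothetical.

```python
# Simplified model only; this is NOT nova's implementation.
from typing import Callable, Optional

# Hypothetical process-wide cache for the computed version cap.
_cached_version_cap: Optional[int] = None


def determine_version_cap(get_min_version: Callable[[], int], current: int) -> int:
    """Return the RPC version cap, caching only non-zero lookup results."""
    global _cached_version_cap
    if _cached_version_cap is not None:
        return _cached_version_cap
    min_version = get_min_version()  # stands in for the cross-cell query
    if min_version == 0:
        # Mirrors the logged behaviour: nothing is cached and the
        # current version is used, so the next caller queries again.
        return current
    _cached_version_cap = min_version
    return min_version


class FakeLookup:
    """Stand-in for the cross-cell query; counts how often it runs."""

    def __init__(self, result: int) -> None:
        self.result = result
        self.calls = 0

    def __call__(self) -> int:
        self.calls += 1
        return self.result


def simulate(handlers: int, min_version: int, current: int = 35) -> int:
    """Model one API object per route handler; return the lookup count."""
    global _cached_version_cap
    _cached_version_cap = None  # fresh worker process
    lookup = FakeLookup(min_version)
    for _ in range(handlers):
        determine_version_cap(lookup, current)
    return lookup.calls


if __name__ == "__main__":
    print(simulate(31, min_version=0))   # 31: a zero result is never cached
    print(simulate(31, min_version=30))  # 1: cached after the first lookup
```

In this model, a lookup that keeps returning 0 is paid once per API object, which is the pattern that makes a slow or unreachable cell so expensive.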
Also note the default database max_attempts and retry_interval are both 10, which means that each API object created that hits this takes 100 seconds to time out, per route handler per API worker. I count 31 route handlers that create an API object, so that's by default 3100 seconds, or about 52 minutes, per worker on startup.

** Affects: nova
     Importance: Medium
         Status: Confirmed

** Tags: api cells performance

https://bugs.launchpad.net/bugs/1815697

Title:
  [upgrade_levels]compute=auto grinds the API response times when a cell is down
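The startup delay estimated in the description can be checked with a few lines of arithmetic. This is only a sketch: the function name startup_delay_seconds is hypothetical, and the inputs (10 attempts, 10 second interval, 31 route handlers) are the figures quoted in the report.

```python
def startup_delay_seconds(max_attempts: int, retry_interval: int, handlers: int) -> int:
    """Worst-case time spent retrying the database, per API worker."""
    # Each API object waits max_attempts * retry_interval seconds,
    # and one API object is created per route handler.
    return max_attempts * retry_interval * handlers


if __name__ == "__main__":
    per_object = startup_delay_seconds(10, 10, handlers=1)
    per_worker = startup_delay_seconds(10, 10, handlers=31)
    print(per_object)                  # 100 seconds per API object
    print(per_worker)                  # 3100 seconds per worker
    print(f"{per_worker / 60:.1f}")    # 51.7 minutes per worker
```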

