Public bug reported:

Using today's master, there is a big performance degradation in GET
/allocation_candidates when there is a large number of resource providers
(in my tests 1000, each with the same inventory as described in [1]).
The request takes 17s when querying all three resource classes with:

http://127.0.0.1:8081/allocation_candidates?resources=VCPU:1,MEMORY_MB:256,DISK_GB:10
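For reference, a request like the one above can be built and timed with a short stdlib-only sketch. The host and port match the URL above; `candidates_url` and `timed` are names invented here for illustration, not placement or nova APIs:

```python
import time
import urllib.parse

def candidates_url(base, resources):
    """Build a GET /allocation_candidates URL from a {resource_class: amount} dict."""
    value = ",".join(f"{rc}:{amount}" for rc, amount in resources.items())
    return base + "/allocation_candidates?" + urllib.parse.urlencode(
        {"resources": value})

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds) -- crude timing of the
    kind you would wrap around the HTTP GET against a running placement
    service."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Example: time building the URL itself; in practice the interesting call to
# time is the HTTP GET against the service.
url, elapsed = timed(candidates_url, "http://127.0.0.1:8081",
                     {"VCPU": 1, "MEMORY_MB": 256, "DISK_GB": 10})
```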
Using a limit does not make any difference; the cost is in generating the
original data. I did some advanced LOG.debug-based benchmarking to
determine three places where things are a problem, and maybe even fixed
the worst one. See the diff below.

The two main culprits are ResourceProvider.get_by_uuid calls looping over
the full set of providers. These can be replaced either by using data we
already have from earlier queries, or by changing the code to make single
queries instead. In the diff I've already changed one of them (the second
chunk) to use the data that _build_provider_summaries is already getting
(functional tests still pass with this change). The third chunk is because
we have a big loop, but I suspect there is some duplication that can be
avoided. I have not investigated that closely (yet).

-=-=-
diff --git a/nova/api/openstack/placement/objects/resource_provider.py b/nova/api/openstack/placement/objects/resource_provider.py
index 851f9719e4..e6c894b8fe 100644
--- a/nova/api/openstack/placement/objects/resource_provider.py
+++ b/nova/api/openstack/placement/objects/resource_provider.py
@@ -3233,6 +3233,8 @@ def _build_provider_summaries(context, usages, prov_traits):
         if not summary:
             summary = ProviderSummary(
                 context,
+                # This is _expensive_ when there are a large number of rps.
+                # Building the objects differently may be better.
                 resource_provider=ResourceProvider.get_by_uuid(context,
                     uuid=rp_uuid),
                 resources=[],
@@ -3519,8 +3521,7 @@ def _alloc_candidates_multiple_providers(ctx, requested_resources,
             rp_uuid = rp_summary.resource_provider.uuid
             tree_dict[root_id][rc_id].append(
                 AllocationRequestResource(
-                    ctx, resource_provider=ResourceProvider.get_by_uuid(ctx,
-                        rp_uuid),
+                    ctx, resource_provider=rp_summary.resource_provider,
                     resource_class=_RC_CACHE.string_from_id(rc_id),
                     amount=requested_resources[rc_id]))
@@ -3535,6 +3536,8 @@ def _alloc_candidates_multiple_providers(ctx, requested_resources,
     alloc_prov_ids = []

     # Let's look into each tree
+    # With many resource providers this takes a long time, but each trip
+    # through the loop is not too bad.
     for root_id, alloc_dict in tree_dict.items():
         # Get request_groups, which is a list of lists of
         # AllocationRequestResource(ARR) per requested resource class(rc).
-=-=-

[1] https://github.com/cdent/placeload/blob/master/placeload/__init__.py#L23

** Affects: nova
   Importance: High
       Status: Confirmed

** Tags: placement

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1786055

Title:
  performance degradation in placement with large number of resource
  providers

Status in OpenStack Compute (nova):
  Confirmed
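The shape of the second-chunk fix — reusing provider objects already fetched while building the summaries, instead of calling get_by_uuid once per provider — can be sketched with toy stand-ins. FakeDB, the dict-based providers, and the function names here are illustrative only, not the real placement objects:

```python
import uuid

class FakeDB:
    """Stand-in for the placement DB layer; counts per-UUID lookups."""
    def __init__(self, providers):
        self.providers = {p["uuid"]: p for p in providers}
        self.get_by_uuid_calls = 0

    def get_by_uuid(self, rp_uuid):
        self.get_by_uuid_calls += 1
        return self.providers[rp_uuid]

def build_requests_slow(db, rp_uuids):
    # One lookup per provider: the O(N) pattern the diff removes.
    return [db.get_by_uuid(u) for u in rp_uuids]

def build_requests_fast(summaries):
    # Reuse provider objects the summaries already hold: zero extra lookups.
    return [s["resource_provider"] for s in summaries]

providers = [{"uuid": str(uuid.uuid4())} for _ in range(1000)]
db = FakeDB(providers)

# Slow path: 1000 individual lookups.
slow = build_requests_slow(db, [p["uuid"] for p in providers])

# Fast path: the summaries (built earlier) already carry the objects.
summaries = [{"resource_provider": p} for p in providers]
fast = build_requests_fast(summaries)
```

With 1000 providers the slow path performs 1000 round trips to the DB layer while the fast path performs none, which is the essence of why the second chunk of the diff helps.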
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1786055/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp