Public bug reported:

NOTE: This may be a PostgreSQL-only problem; I'm not sure.

When doing some further load-testing experiments against placement, my
resource provider creation script, which uses asyncio, was able to
trigger several 500 errors from the placement service of the following
form:

```
cdent-a01:~/src/placeload(master) $ docker logs zen_murdock |grep 'req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2'
2018-08-12 16:03:30.698 9 DEBUG nova.api.openstack.placement.requestlog [req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2 admin admin - - -] Starting request: 172.17.0.1 "PUT /resource_providers/13b09bc9-164f-4d03-8a61-5e78c05a73ad/aggregates" __call__ /usr/lib/python3.6/site-packages/nova/api/openstack/placement/requestlog.py:38
2018-08-12 16:03:30.903 9 ERROR nova.api.openstack.placement.fault_wrap [req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2 admin admin - - -] Placement API unexpected error: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (psycopg2.IntegrityError) duplicate key value violates unique constraint "uniq_placement_aggregates0uuid"
2018-08-12 16:03:30.914 9 INFO nova.api.openstack.placement.requestlog [req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2 admin admin - - -] 172.17.0.1 "PUT /resource_providers/13b09bc9-164f-4d03-8a61-5e78c05a73ad/aggregates" status: 500 len: 997 microversion: 1.29
```

"DETAIL:  Key (uuid)=(14a5c8a3-5a99-4e8f-88be-00d85fcb1c17) already
exists."


This is because the code at 
https://github.com/openstack/nova/blob/a29ace1d48b5473b9e7b5decdf3d5d19f3d262f3/nova/api/openstack/placement/objects/resource_provider.py#L519-L529
is not trapping the right error when two concurrent requests each decide 
they need to create the same new aggregate: one INSERT wins, and the 
other hits the unique constraint.
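
For illustration, here is a minimal sketch of the create-or-reread
pattern that would tolerate the race. It assumes oslo.db's exception
filters translate the psycopg2.IntegrityError into DBDuplicateEntry
(which is exactly the open question below); the table and function
names are made up, not the actual placement code:

```
import sqlalchemy as sa
from oslo_db import exception as db_exc

META = sa.MetaData()
# Illustrative stand-in for the placement_aggregates table.
AGG_TBL = sa.Table(
    'placement_aggregates', META,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('uuid', sa.String(36), unique=True),
)


def ensure_aggregate(conn, agg_uuid):
    """Return the id for agg_uuid, creating the row if it doesn't exist.

    If a concurrent request wins the race to INSERT the same uuid, the
    unique constraint violation surfaces as DBDuplicateEntry; the row
    now exists, so re-read it instead of letting a 500 escape.
    """
    sel = sa.select([AGG_TBL.c.id]).where(AGG_TBL.c.uuid == agg_uuid)
    row = conn.execute(sel).fetchone()
    if row:
        return row[0]
    try:
        res = conn.execute(AGG_TBL.insert().values(uuid=agg_uuid))
        return res.inserted_primary_key[0]
    except db_exc.DBDuplicateEntry:
        # Lost the race: another request created the row between our
        # SELECT and INSERT, so read back the winner's id.
        return conn.execute(sel).fetchone()[0]
```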

It's not clear to me whether this is because oslo_db is not transforming
the PostgreSQL error properly, or whether the generic exception caught
there is the wrong one and we've never noticed before because we don't
normally hit the concurrency situation hard enough.
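
One way to probe the transformation half of that question is to force a
duplicate insert through an engine created by oslo.db (which installs
its error filters) and see what actually gets raised. A rough sketch,
using SQLite for convenience even though the real question is the
psycopg2 path:

```
import sqlalchemy as sa
from oslo_db import exception as db_exc
from oslo_db.sqlalchemy import engines

# oslo.db installs its error-translation filters on engines it creates.
engine = engines.create_engine('sqlite://')
meta = sa.MetaData()
tbl = sa.Table('t', meta,
               sa.Column('id', sa.Integer, primary_key=True),
               sa.Column('uuid', sa.String(36), unique=True))
meta.create_all(engine)

with engine.connect() as conn:
    conn.execute(tbl.insert().values(uuid='abc'))
    try:
        conn.execute(tbl.insert().values(uuid='abc'))
    except db_exc.DBDuplicateEntry:
        print('translated to DBDuplicateEntry')
    except db_exc.DBError as exc:
        print('translated, but only to %s' % type(exc).__name__)
```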

** Affects: nova
     Importance: Medium
         Status: New


** Tags: db placement

https://bugs.launchpad.net/bugs/1786703

Title:
  Placement duplicate aggregate uuid handling during concurrent
  aggregate create insufficiently robust

Status in OpenStack Compute (nova):
  New
