Public bug reported:

This came up in the -qa channel when trying to figure out why a neutron test failed, and there is a big fat DBDeadlock in the q-svc logs:

http://logs.openstack.org/18/220218/5/check/gate-tempest-dsvm-neutron-dvr/3899ebf/logs/screen-q-svc.txt.gz?level=ERROR#_2015-09-11_17_22_42_284

We find that this shows up a ton in a 7 day check/gate run:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiX2dldF9kbnNfbmFtZXNfZm9yX3BvcnRcIiBBTkQgbWVzc2FnZTpcIkRCRGVhZGxvY2tcIiBBTkQgbWVzc2FnZTpcImlwYXZhaWxhYmlsaXR5cmFuZ2VzXCIgQU5EIHRhZ3M6XCJzY3JlZW4tcS1zdmMudHh0XCIgQU5EIChidWlsZF9xdWV1ZTpcImNoZWNrXCIgT1IgYnVpbGRfcXVldWU6XCJnYXRlXCIpIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDQxOTk2Mjk2ODcxfQ==

498 hits in 7 days, check and gate. The interesting thing is that 85% of those are in successful runs. For example, this was a successful run where the DBDeadlock shows up:

http://logs.openstack.org/20/195820/11/gate/gate-tempest-dsvm-neutron-full/35f6716/logs/screen-q-svc.txt.gz?level=TRACE

This is a serviceability / QA issue for anyone trying to deploy neutron at scale. When things go bad, how is an operator supposed to cut through the noise in the logs to determine what's actually a real failure and what can be ignored?

If these DBDeadlocks are just getting retried with a retry decorator, there should be a way to only trace when we fail and raise up the DBDeadlock error; we shouldn't be logging each time. For example, if we DBDeadlock and retry and then it's OK, don't trace that first DB error. If we retry like 5 times and eventually punt, then trace the error.

** Affects: neutron
     Importance: Critical
     Assignee: Kevin Benton (kevinbenton)
         Status: Confirmed

** Tags: db

** Changed in: neutron
       Status: New => Confirmed

** Tags added: db

-- 
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
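The logging behaviour asked for in the description (stay quiet on a deadlock that a retry recovers from, trace only once retries are exhausted) could look roughly like the sketch below. This is an illustration only, not neutron's actual retry code: `retry_on_deadlock`, `max_retries`, `delay`, and `flaky_update` are hypothetical names, and the local `DBDeadlock` class is a stand-in for whatever deadlock exception the real DB layer raises.

```python
import functools
import logging
import time

LOG = logging.getLogger(__name__)


class DBDeadlock(Exception):
    """Stand-in for the real deadlock exception (assumption for this sketch)."""


def retry_on_deadlock(max_retries=5, delay=0.0):
    """Retry on DBDeadlock; log a traceback only when retries run out."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except DBDeadlock:
                    if attempt == max_retries:
                        # Final failure: this is the one worth tracing.
                        LOG.exception("DBDeadlock in %s after %d attempts",
                                      func.__name__, max_retries)
                        raise
                    # Recoverable failure: keep it out of ERROR/TRACE output.
                    LOG.debug("DBDeadlock in %s, retrying (attempt %d of %d)",
                              func.__name__, attempt, max_retries)
                    time.sleep(delay)
        return wrapper
    return decorator


# Hypothetical usage: deadlocks twice, then succeeds on the third attempt.
calls = {"count": 0}


@retry_on_deadlock(max_retries=5)
def flaky_update():
    calls["count"] += 1
    if calls["count"] < 3:
        raise DBDeadlock()
    return "ok"
```

With this shape, a call that succeeds after a couple of deadlocks leaves only DEBUG lines behind, so filtering the log at ERROR or TRACE shows just the operations that actually gave up.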
https://bugs.launchpad.net/bugs/1494886

Title:
  Neutron DBDeadlocks a ridiculous amount in successful CI runs

Status in neutron:
  Confirmed

