I have a fix in the works for Noble. I'm letting it soak overnight and I'll propose it shortly.
** Description changed:

+ SRU Template
+
+ [Impact]
+ Watcher's decision-engine accumulates idle SQLAlchemy connections over time
+ and eventually exhausts its connection pool (size 2 + 50 overflow), causing
+ the service to report FAILED in `openstack optimize service list`. In a
+ production Sunbeam 2024.1 deployment this typically takes multiple days to
+ manifest. Once the pool is exhausted, all background jobs in the decision
+ engine fail with:
+
+     sqlalchemy.exc.TimeoutError: QueuePool limit of size 2 overflow 50
+     reached, connection timed out, timeout 30.00
+
+ The only workarounds available to operators today are killing the sleeping
+ MySQL connections out from under watcher and restarting the watcher pods.
+ Once the pool is exhausted, Watcher cannot reconcile audits, schedule
+ action plans, or run any continuous audit workload until someone
+ intervenes manually.
+
+ The update contains the following package updates:
+
+     * watcher 2:12.0.0-0ubuntu1.3 (noble / cloud-archive:caracal)
+
+ [Test Case]
+ The following SRU process was followed:
+ https://documentation.ubuntu.com/sru/en/latest/reference/exception-OpenStack-Updates
+
+ In order to avoid regression of existing consumers, the OpenStack team will
+ run their continuous integration tests against the packages that are in
+ -proposed. A successful run of all available tests will be required before
+ the proposed packages can be let into -updates.
+
+ The OpenStack team will be in charge of attaching the output summary of
+ the executed tests. The OpenStack team members will not mark
+ ‘verification-done’ until this has happened.
+
+ ------------------------------------------------------------------------
+ Check 1 -- No QueuePool TimeoutError in decision-engine logs
+ ------------------------------------------------------------------------
+
+ Original symptom from the bug report:
+
+     [watcher-decision-engine] ERROR apscheduler.executors.default
+     sqlalchemy.exc.TimeoutError: QueuePool limit of size 2 overflow 50
+     reached, connection timed out, timeout 30.00
+
+ Run:
+
+     sudo k8s kubectl logs -n openstack watcher-0 -c watcher-decision-engine \
+         --since=60m | grep -iEc "queuepool|TimeoutError"
+
+ Pass criterion: 0 (zero matches in the last 60 minutes of logs).
+
+ ------------------------------------------------------------------------
+ Check 2 -- Sleeping MySQL connections are not accumulating
+ ------------------------------------------------------------------------
+
+ First discover the watcher DB credentials (auto-generated, unique per
+ deployment):
+
+     sudo k8s kubectl exec -n openstack watcher-0 -c watcher-decision-engine -- \
+         grep ^connection /etc/watcher/watcher.conf
+
+ The output line has the form:
+
+     connection = mysql+pymysql://<USER>:<PASS>@watcher-mysql-router-service...:6446/watcher_api
+
+ Extract <USER> and <PASS> from the URL. Then run the same query as
+ the bug report:
+
+     sudo k8s kubectl exec -n openstack watcher-mysql-router-0 \
+         -c mysql-router -- \
+         mysql -u <USER> -p<PASS> \
+         -h watcher-mysql-router-service.openstack.svc.cluster.local \
+         -P 6446 watcher_api \
+         -e "SELECT count(*), state FROM information_schema.processlist
+             GROUP BY state;"
+
+ Run the query twice, ideally allowing watcher to run overnight between
+ samples.
+
+ Pass criterion: the count for the empty-state row (sleeping connections)
+ stays bounded (under ~15) across both samples and does not trend upward.
+
+ On the broken package, this count grows by ~4 per minute
+ and exceeds the pool ceiling of 52 within ~15 minutes after a pod
+ restart, at which point Check 3 fails.
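
The two manual steps in Check 2 (pulling <USER> and <PASS> out of the connection line, and comparing the sleeping-connection counts against the ~15 bound) could be scripted roughly as follows. This is only a sketch: the function names parse_db_credentials and sleeping_within_bound, the DB_USER and DB_PASS variables, and the sample values are hypothetical and not part of the SRU procedure. It assumes the user name contains no ':' and the password contains no '@', and it does not decode URL-encoded characters.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and sample values below are hypothetical.

# Split a watcher.conf line of the form
#   connection = mysql+pymysql://<USER>:<PASS>@<HOST>:<PORT>/<DB>
# into DB_USER and DB_PASS.
# Assumes no ':' in the user name and no '@' in the password.
parse_db_credentials() {
    local line="$1"
    local url="${line#*://}"        # drop everything up to and including "://"
    local userinfo="${url%%@*}"     # keep "<USER>:<PASS>"
    DB_USER="${userinfo%%:*}"       # text before the first ':'
    DB_PASS="${userinfo#*:}"        # text after the first ':'
}

# Succeed only if every sampled sleeping-connection count is below the limit.
# Usage: sleeping_within_bound LIMIT COUNT [COUNT...]
# Whether the counts "trend upward" is still left to the tester's judgment.
sleeping_within_bound() {
    local limit="$1"
    shift
    local count
    for count in "$@"; do
        [ "$count" -lt "$limit" ] || return 1
    done
    return 0
}

# Made-up sample line; the real one comes from the grep command in Check 2.
parse_db_credentials 'connection = mysql+pymysql://watcher_user:s3cret@db.example:6446/watcher_api'
printf 'user=%s\n' "$DB_USER"

# Made-up counts from two samples, checked against the ~15 bound.
if sleeping_within_bound 15 6 7; then
    echo "sleeping connections bounded"
fi
```

The password is deliberately not printed. DB_USER and DB_PASS can then be substituted for <USER> and <PASS> in the mysql command above.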
+
+ ------------------------------------------------------------------------
+ Check 3 -- watcher-decision-engine reports ACTIVE
+ ------------------------------------------------------------------------
+
+ Run:
+     openstack optimize service list
+
+ Pass criterion: every watcher-decision-engine row shows Status = ACTIVE.
+
+
+ -------------------------------------
+ Original Bug Report Content Below
+ -------------------------------------
  In the Newton release a background job scheduler was added to the
  Decision Engine:
  https://github.com/openstack/watcher/commit/06c6c4691b103bf0b3fd3304a1a45fb22aedad50

  To facilitate this, the apscheduler library was introduced as a
  dependency of watcher. apscheduler has a lot of capability but does not
  officially support eventlet. Since its introduction to watcher it has
  mostly worked, partly by accident.

  Over the years, as oslo, apscheduler and eventlet have evolved and
  adapted to newer Python releases, watcher has continued to use
  apscheduler even though that is not technically supported.

  With the move to Python 3.12 it became apparent that the background jobs
  executed on the apscheduler BackgroundScheduler instances were accessing
  shared global state from a non-monkey-patched native thread.

  That results in greenthreads sometimes calling into objects that are
  using un-monkey-patched code. For example, oslo.db uses time.sleep to
  yield execution. When that oslo.db function is first imported from a
  non-patched thread and is later invoked in the main thread, it will
  block.

  This can be seen in the exception "RuntimeError: do not call blocking
  functions from the mainloop" here:
  https://paste.opendev.org/show/bGPgfURx1cZYOsgmtDyw/

  This has been reproduced in CI as part of moving the CI jobs to Ubuntu
  24.04 and Python 3.12:
  https://review.opendev.org/c/openstack/watcher/+/932963/comments/f54005d7_b0f831bb

  To address this issue we need to ensure that the background thread used
  to schedule background tasks is properly monkey patched.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2086710

Title:
  watcher's use of apscheduler is incompatible with python 3.12 and
  eventlet

To manage notifications about this bug go to:
https://bugs.launchpad.net/watcher/+bug/2086710/+subscriptions

-- 
ubuntu-bugs mailing list
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
