** Description changed:

- We are having an issue with our production MAAS - The web UI is available normally, we can start to deploy, but the result is a failure - systems get stuck during `Loading ephemeral` step:
+ = How to determine you are seeing this problem =
+ Does your MAAS server seem to get "hung up", where deployments suddenly start failing w/ lots of connection timeouts to the MAAS server?

- ```
- Tue, 15 Dec. 2020 23:08:57 Node - Powered off 'akis'.
- Tue, 15 Dec. 2020 23:05:25 Marking node failed - Node operation 'Deploying' timed out after 30 minutes.
- Tue, 15 Dec. 2020 22:35:31 Loading ephemeral
- Tue, 15 Dec. 2020 22:34:35 Performing PXE boot
- Tue, 15 Dec. 2020 22:31:35 Powering node on
- Tue, 15 Dec. 2020 22:31:35 Node - Started deploying 'akis'.
- Tue, 15 Dec. 2020 22:31:35 Deploying
- Tue, 15 Dec. 2020 22:31:09 Node - Acquired 'akis'.
- ```
+ Get a list of pids of your regiond processes:
+ $ ps -ef | grep regiond

- It's the 3rd time we are seeing this behavior, which is fixed after a
- restart.
+ Run strace on each one to see if one is stuck in a connect() or recv() call:
+ $ sudo strace -p $pid
+ recv(...

- MAAS version: 2.8.2 (8577-g.a3e674063)
+ (normally you should see a lot of epoll_ctl() calls go by if not hung)
+
+ If one is hung, use lsof to see what it is connected to:
+ sudo lsof -i -a -p $pid
+
+ If you see an open connection to your images server, then this may be
+ your problem.
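
The manual checks in the new description could be scripted roughly as follows. This is only a sketch, not part of MAAS: the function names `classify_trace` and `check_regiond` are hypothetical, the 5-second sampling window is an arbitrary choice, and it assumes `pgrep`, `timeout`, `strace`, `lsof`, and `sudo` are available. It also assumes that a blocked recv() may show up in strace as `recvfrom(` on some architectures.

```shell
#!/bin/sh
# Sketch only: automates the ps / strace / lsof steps from the description.

# Hypothetical helper: read strace output on stdin and report whether the
# last syscall seen is a connect() or recv() that never returned.
classify_trace() {
    # Drop strace's own "attached"/"detached" messages, keep the last syscall.
    last=$(grep -v '^strace: ' | tail -n 1)
    case "$last" in
        # A completed syscall ends with " = <return value>".
        *' = '*) echo active ;;
        # An unfinished connect/recv is the symptom described above.
        'connect('*|'recv('*|'recvfrom('*) echo possibly-hung ;;
        *) echo active ;;
    esac
}

# Hypothetical driver: sample each regiond process for a few seconds.
check_regiond() {
    for pid in $(pgrep -f regiond); do
        # timeout stops strace after 5 seconds (arbitrary sampling window).
        trace=$(sudo timeout 5 strace -p "$pid" 2>&1)
        state=$(printf '%s\n' "$trace" | classify_trace)
        echo "pid $pid: $state"
        if [ "$state" = possibly-hung ]; then
            # Show which network connections the suspect process holds.
            sudo lsof -i -a -p "$pid"
        fi
    done
}
```

Calling `check_regiond` on the MAAS server prints one line per regiond pid and, for any pid flagged `possibly-hung`, the lsof listing to inspect for a connection to the images server. A process flagged this way is only a candidate; the strace output should still be checked by hand.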
-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1908452

Title:
  MAAS stops working and deployment fails after `Loading ephemeral` step