** Description changed:

- We are having an issue with our production MAAS.
- The web UI is available normally and we can start a deployment, but it ends in failure, with systems stuck at the `Loading ephemeral` step:
+ = How to determine you are seeing this problem =
+ Does your MAAS server seem to get "hung up", where deployments suddenly start failing with lots of connection timeouts to the MAAS server?
  
- ```
- Tue, 15 Dec. 2020 23:08:57    Node - Powered off 'akis'.
- Tue, 15 Dec. 2020 23:05:25    Marking node failed - Node operation 'Deploying' timed out after 30 minutes.
- Tue, 15 Dec. 2020 22:35:31    Loading ephemeral
- Tue, 15 Dec. 2020 22:34:35    Performing PXE boot
- Tue, 15 Dec. 2020 22:31:35    Powering node on
- Tue, 15 Dec. 2020 22:31:35    Node - Started deploying 'akis'.
- Tue, 15 Dec. 2020 22:31:35    Deploying
- Tue, 15 Dec. 2020 22:31:09    Node - Acquired 'akis'.
- ```
+ Get a list of pids of your regiond processes:
+ $ ps -ef | grep regiond
  
- It's the 3rd time we are seeing this behavior, which is fixed after a
- restart.
+ Run strace on each one to see if one is stuck in a connect() or recv() call:
+ $ sudo strace -p $pid
+ recv(...
  
- MAAS version: 2.8.2 (8577-g.a3e674063)
+ (normally you should see a lot of epoll_ctl() calls go by if not hung)
+ 
+ If one is hung, use lsof to see what it is connected to:
+ $ sudo lsof -i -a -p $pid
+ 
+ If you see an open connection to your images server, then this may be
+ your problem.
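
The manual checks above could be strung together roughly as follows. This is only a sketch under some assumptions: the function names `looks_hung` and `check_regiond` are made up for illustration, the 5-second trace window is arbitrary, and `pgrep -f regiond` stands in for `ps -ef | grep regiond` (it may also match unrelated processes whose command line contains "regiond"). The hang test is a heuristic: it only checks whether the last syscall strace saw was a connect() or recv().

```shell
#!/bin/sh
# Sketch of the manual steps above; function names are hypothetical.

# Succeeds if the strace output on stdin ends in a connect() or recv()
# call, ignoring strace's own "strace: ..." status lines.
looks_hung() {
    grep -v '^strace:' | tail -n 1 | grep -Eq '^(connect|recv)\('
}

# For each regiond pid, trace it briefly and, if it looks hung,
# list its open network connections.
check_regiond() {
    for pid in $(pgrep -f regiond); do
        # strace writes its trace to stderr; stop tracing after 5 seconds
        if sudo timeout 5 strace -p "$pid" 2>&1 | looks_hung; then
            echo "pid $pid may be hung; open connections:"
            sudo lsof -i -a -p "$pid"
        fi
    done
}
```

Calling `check_regiond` on the MAAS server prints the lsof output only for processes that appear stuck; a healthy process that happens to be inside a short recv() when the trace stops can show up as a false positive, so it is worth re-running before drawing conclusions.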

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1908452

Title:
  MAAS stops working and deployment fails after `Loading ephemeral` step

To manage notifications about this bug go to:
https://bugs.launchpad.net/maas/+bug/1908452/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
