[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken

2019-07-10 Thread Sam Lee
I'm not sure why a "broken" Upstream DNS helps reproduce this bug, but I was not able to reproduce it when the Upstream DNS was working. -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1710278 Title: [2.3a1] named stuck on reload, DNS broken

[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken

2019-07-10 Thread Sam Lee
repro.py attempts to trigger DNS queries during DNS reloads. It does so by first deploying all 50 machines. Then, one by one (not all at once!), it releases a machine, waits, redeploys it, and moves to the next machine. At some point one machine will be releasing (Reloads) while others are starting to deploy (Deploys)…
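The actual repro.py is attached to the bug; as a rough illustration of the sequencing described above (the script itself may differ), the release/wait/deploy loop could be structured like this:

```python
import time

def repro_schedule(machines):
    """Yield (action, machine) steps: release one machine, wait for the
    DNS reload it triggers, redeploy it, then move on -- never all at once."""
    for m in machines:
        yield ("release", m)
        yield ("wait", m)    # give named time to start its reload
        yield ("deploy", m)  # this deploy overlaps the next machine's release window

def run(machines, do_action=print, delay=0.0):
    # do_action would shell out to the MAAS CLI in a real run; it is
    # injectable here so the schedule can be exercised without MAAS.
    for action, machine in repro_schedule(machines):
        if action == "wait":
            time.sleep(delay)
        else:
            do_action(action, machine)
```

The point of the staggering is that each machine's release-time reload lands while neighbouring machines are mid-deploy and querying DNS.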

[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken

2019-07-10 Thread Sam Lee
repro.py attached ** Attachment added: "repro.py" https://bugs.launchpad.net/ubuntu/+source/bind9/+bug/1710278/+attachment/5276146/+files/repro.py

[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken

2019-07-10 Thread Sam Lee
OK - I was able to reproduce again, and this time with MAAS 2.6. Here are the steps. PREP WORK: 1) Have 50 machines in Ready state with one interface enabled and configured as 'Autoassign' to the Default VLAN PXE subnet (auto-assign so that every deploy/release causes MAAS to reload DNS). 2) Clear out any DNS…

[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken

2019-07-09 Thread Sam Lee
Hi Mark, still seeing it with 18.04 and 2.6. The sweet spot seems to be when MAAS is receiving lots of DNS requests while simultaneously doing DNS reloads (as you alluded to in this case). I'm attempting to set up a simplified repro scenario which will basically do this: 1) enlist 50+ new machines…
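A load generator for that "lots of DNS requests mid-reload" scenario needs to fire many lookups at the MAAS DNS while deploys churn. As a minimal sketch (the hostname and transaction-ID default below are illustrative, not from the bug), a raw A-record query can be built by hand and sent to port 53 over UDP:

```python
import struct

def build_dns_query(name: str, txid: int = 0x1234) -> bytes:
    """Build a minimal DNS A-record query (the UDP payload for port 53)."""
    # Header: txid, flags (only RD set), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte.
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split(".")) + b"\x00"
    # QTYPE=A (1), QCLASS=IN (1).
    return header + qname + struct.pack("!HH", 1, 1)
```

Firing these in a tight loop from a few threads with `socket.sendto(build_dns_query("node01.maas"), (maas_ip, 53))` should approximate the query pressure described, assuming some queries land exactly during a reload.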

[Bug 1710278] Re: [2.3a1] named stuck on reload, DNS broken

2019-06-26 Thread Sam Lee
Mark, do you have any updated repro steps? I'm seeing this failure with MAAS v2.5.3. I suspect that when v2.5 moved the DNS logic from the region to the rack controller, some of the mitigation logic was lost, and thus this bug manifests more frequently. When I compare our v2.5.3 install with our v2.4.2…

[Bug 1446822] Re: maas erase disk cannot be canceled

2019-01-29 Thread Sam Lee
Same here: it takes hours to erase the drives on our servers, and even after allowing the server to finish erasing them, MAAS is still showing the `Disk Erasing` state. We also cannot `Abort` or `Mark Fixed`, as it errors with ``` Error: Node failed to be marked broken, because of the following error: mark-brok…

[Bug 1677668] Re: no GARPs during ephemeral boot

2017-05-24 Thread Sam Lee
In our case, we don't need a GARP on every boot, only during the MaaS Deploy stage, where the MaaS ephemeral boot image is trying to communicate with the MaaS region controller (in a different VLAN). The irony is that even if there were a way to add our own GARP instructions to the cloud-init config, the region controller…

[Bug 1677668] Re: no GARPs during ephemeral boot

2017-04-04 Thread Sam Lee
Hi Chris, yes, you are correct; I've attached an updated pic. While I don't disagree that the PXE/DHCP client should be sending GARPs, shouldn't any OS that binds to an IP send a GARP as part of its TCP stack initialization? That is, shouldn't the ephemeral boot image itself send a GARP (independent…
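For reference, a gratuitous ARP is just an ARP request whose sender and target protocol addresses are both the host's own IP, broadcast on the local segment so neighbours refresh their caches. A minimal sketch of building such a frame (sending it would additionally need a raw AF_PACKET socket and root on Linux, not shown here):

```python
import socket
import struct

def build_garp(mac: str, ip: str) -> bytes:
    """Build a gratuitous ARP request frame (sender IP == target IP)."""
    hw = bytes.fromhex(mac.replace(":", ""))
    addr = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, our source MAC, ARP ethertype.
    eth = b"\xff" * 6 + hw + struct.pack("!H", 0x0806)
    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += hw + addr              # sender hardware + protocol address
    arp += b"\x00" * 6 + addr     # target MAC unknown; target IP == sender IP
    return eth + arp
```

An OS (or ephemeral image) announcing its freshly bound IP would broadcast exactly such a frame, which is what the comment above argues should happen during boot.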

[Bug 1677668] Re: no GARPs during ephemeral boot

2017-03-31 Thread Sam Lee
attached pic ** Attachment added: "ascii-art.png" https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1677668/+attachment/4851597/+files/ascii-art.png

[Bug 1677668] Re: no GARPs during ephemeral boot

2017-03-31 Thread Sam Lee
[Garbled ASCII diagram: a router with an ARP cache (entries expire after 4 hours) holding 10.1.1.11 -> 22:22 and 10.1.2.100 -> 33:33; a cleanly rendered copy was re-sent as the "ascii-art.png" attachment.]

[Bug 1677668] Re: no GARPs during ephemeral boot

2017-03-31 Thread Sam Lee
I forgot to mention, the TFTP conversation is happening between the Region Controller (DHCP/TFTP) and the Machine which both live on the same subnet, so the router's ARP Cache is not a factor.

[Bug 1677668] Re: no GARPs during ephemeral boot

2017-03-31 Thread Sam Lee
yikes! that did not format well...and I can't edit my own comment. Let me try again...

[Bug 1677668] Re: no GARPs during ephemeral boot

2017-03-31 Thread Sam Lee
I forgot to mention, Region and Rack Controllers are in separate VLANs. So the TFTP conversation is happening between the RACK Controller (DHCP/TFTP) and the Machine which both live on the same subnet, so the router's ARP Cache is not a factor.

[Bug 1677668] Re: no GARPs during ephemeral boot

2017-03-31 Thread Sam Lee
Hi Chris, some new clarifications are in order. Please disregard the "ARP Inspection" claim; that feature wasn't even enabled. Here's a very simplified drawing of the setup: …

[Bug 1677668] [NEW] no GARPs during ephemeral boot

2017-03-30 Thread Sam Lee
Public bug reported: Deploys time out with an error on the console that says, "Can not apply stage final, no datasource found! Likely bad things to come!" How to duplicate (MAAS Version 2.1.3+bzr5573-0ubuntu1, 16.04.1): 1) Rack Controller and Region Controller in different VLANs 2) Use Cisco ASA…

[Bug 1677668] Re: no GARPs during ephemeral boot

2017-03-30 Thread Sam Lee
Forgot to mention that we didn't want to "Static assign" IPs in MaaS. We prefer using "Auto assign", but observed that MaaS will sometimes reuse a previously used IP from a different MaaS machine. Using "Static assign", however, we can reliably work around the issue (or, in this ticket's case, force a failure).