Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
and fwiw, I'm not saying this is *the* solution for a problem like this one where there is IO starvation. But it is definitely a step forward. -- Andres Rodriguez (RoAkSoAx) Ubuntu Server Developer MSc. Telecom & Networking Systems Engineer -- You received this bug notification because you

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
That's what we have done to test the difference. So for the greater audience, this patch was tested on a 4-core NUC with an SSD, deploying 6 VMs at the same time as 4 other nodes were PXE booting from MAAS. Before the fix we saw: 1. the client would make multiple requests for the same file. 2. maas would run

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
BTW to be clear here I'm saying I don't think the path forward on improving this issue is thinking about how MAAS works and throwing out patches that might improve performance here and there. The path forward is to instrument MAAS on a system with slow i/o and to figure out exactly where it's

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
dm-delay looks very interesting along those lines. https://www.enodev.fr/posts/emulate-a-slow-block-device-with-dm-delay.html https://www.kernel.org/doc/Documentation/device-mapper/delay.txt On Tue, Feb 6, 2018 at 5:06 PM, Jason Hobbs wrote: > On Tue, Feb 6, 2018 at
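
For readers following the dm-delay suggestion: a minimal sketch, assuming the single-device table format described in the kernel's delay.txt (start sector, length in sectors, the target name `delay`, the device, an offset, and a delay in milliseconds). The helper name, the device path, and the size below are hypothetical and only illustrate how such a table line is put together; the resulting line would be what gets piped to `dmsetup create` as in the kernel document.

```python
def delay_table(device: str, size_sectors: int, delay_ms: int, offset: int = 0) -> str:
    """Build a one-line device-mapper table for the 'delay' target.

    Layout: <start> <length> delay <device> <offset> <delay_ms>
    (hypothetical helper; size_sectors would normally come from
    `blockdev --getsz <device>`).
    """
    return f"0 {size_sectors} delay {device} {offset} {delay_ms}"


if __name__ == "__main__":
    # Hypothetical 1 GiB device (2097152 sectors of 512 bytes), 500 ms delay.
    print(delay_table("/dev/sdb", 2097152, 500))
```

Running MAAS on top of a device delayed this way is one possible way to reproduce slow I/O on demand, which is the kind of instrumented test the message above asks for.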

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
On Tue, Feb 6, 2018 at 4:50 PM, Andres Rodriguez wrote: > I don't have logs anymore as I have since rebuilt my environment, but I can > confirm seeing improvements on a maas server running with high IO (note it > was a single region/rack). > > see inline: > > > On Tue,

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
I don't have logs anymore as I have since rebuilt my environment, but I can confirm seeing improvements on a maas server running with high IO (note it was a single region/rack). see inline: On Tue, Feb 6, 2018 at 5:17 PM, Jason Hobbs wrote: > Andres, it was a single

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
Andres, it was a single test in both cases, and in both cases there was almost no delay from MAAS. It's not significant enough to call it positive results. Since neither of you answered yes, I'll assume the answer was no to my question of whether there was anything in my logs or data that showed

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Blake Rouse
Andres did the testing of the changes and has logs to prove the improvement. On Tue, Feb 6, 2018 at 4:43 PM, Jason Hobbs wrote: > Blake, that's great. Do you have before and after numbers showing the > improvement this change made? > > Do you have any data or logs

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
@Jason, I'm comparing pb in #79 vs pb in #90 #79 (non-patched): https://paste.ubuntu.com/26530737/ #90 (patched with lru_cache): https://paste.ubuntu.com/26531873/ Examples I see in #79: 14:02:ec:42:38:dc # makes 9 requests. on line 160+ 14:02:ec:42:28:70 # 8 requests on line 72

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
Blake, that's great. Do you have before and after numbers showing the improvement this change made? Do you have any data or logs that led you to believe this was the culprit in the slow responses I saw on my cluster? On Tue, Feb 6, 2018 at 3:12 PM, Blake Rouse wrote:

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Blake Rouse
Actually, caching does make a difference. That method is not just caching the reading of a file; it caches the searching of the file based on the purpose, the reading of that file from disk (sure, it can be in the kernel cache), and the parsing of the template by tempita. All of that is redundant work that is
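
To make the point about what gets cached concrete, here is a minimal sketch, assuming `functools.lru_cache` (the thread names `lru_cache` for the patched run) wrapped around the whole lookup-read-parse step. It is not the MAAS patch: `TEMPLATES`, `get_template`, `render`, and the purposes are hypothetical, an in-memory dict stands in for the template files on disk, and `string.Template` stands in for tempita.

```python
from functools import lru_cache
from string import Template

# Hypothetical stand-in for template files on disk, keyed by boot purpose.
TEMPLATES = {
    "xinstall": "linux ${kernel} ${args}",
    "default": "linux ${kernel}",
}

# Counts how many times the expensive path (search + read + parse) runs.
LOADS = {"count": 0}


@lru_cache(maxsize=None)
def get_template(purpose: str) -> Template:
    """Find, read, and parse the template for a purpose; cached per purpose."""
    LOADS["count"] += 1
    source = TEMPLATES.get(purpose, TEMPLATES["default"])  # the "search" step
    return Template(source)  # the "parse" step (tempita in the thread)


def render(purpose: str, **params: str) -> str:
    """Render a config; only the first call per purpose pays the load cost."""
    return get_template(purpose).substitute(**params)


if __name__ == "__main__":
    for _ in range(3):
        print(render("xinstall", kernel="boot-kernel", args="ro"))
    print("template loads:", LOADS["count"])
```

The page cache can only help with the disk read; the search and the parse are repeated on every request unless the parsed object itself is kept, which is what the decorator does here.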

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
The patch from #84 is adding a cache for reading the template file on the rack controller. I don't understand why this change is being made. This file will almost certainly be in the page cache anyhow as these systems have a lot of free ram. Usually it's best to just let the page cache do its

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
Anyhow, I tested with the patch from #84 as requested, here are the results: http://paste.ubuntu.com/26531873/ We're still seeing some retries with it, same as before. But, I think the test is of limited value. It didn't make things worse but we don't have any evidence from the test that it

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
Ok - and what about the region controller losing contact with the rack controller log messages? What is that about? On Tue, Feb 6, 2018 at 11:37 AM, Andres Rodriguez wrote: > fwiw, the deadlock issue is regiond trying to determine which process > should send updates to

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Launchpad Bug Tracker
This bug was fixed in the package grub2 - 2.02-2ubuntu6 --- grub2 (2.02-2ubuntu6) bionic; urgency=medium [ Steve Langasek ] * debian/patches/bufio_sensible_block_sizes.patch: Don't use arbitrary file sizes as block sizes in bufio: this avoids potentially seeking back in

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
@Jason, Are these tests with archive grub or patched grub? On Tue, Feb 6, 2018 at 11:39 AM, Jason Hobbs wrote: > Andres, > > I ran the test with VMs limited to 9 of 20 cores (cut the core limit > in half for VMs). The first time range from this dump is with the >

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
fwiw, the deadlock issue is regiond trying to determine which process should send updates to which racks for *dhcp* changes, so this is not at all related to the RPC boot requests for pxe. On Tue, Feb 6, 2018 at 11:43 AM, Jason Hobbs wrote: > Can you please comment

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
> > > > > Yes, it is not an unknown machine, but that doesn't change the fact that > > this is working as designed. If the client didn't get a response for the > > request it makes, and the client decides to move on and makes a different > > request, then it is working as designed. Again, the bug

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
The deadlock is not expected behavior. Due to the isolation level, the number of workers (e.g. 12 workers/3 regions) and the fact that there could be IO starvation, it's surfacing this issue. That said, changes to improve this and prevent the deadlocks are not backportable to 2.3 and are targeted

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
On Tue, Feb 6, 2018 at 10:40 AM, Andres Rodriguez wrote: > On Tue, Feb 6, 2018 at 11:24 AM, Jason Hobbs > wrote: > >> On Mon, Feb 5, 2018 at 4:07 PM, Andres Rodriguez >> wrote: >> > I think there's a misunderstanding

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
Can you please comment on the deadlock detected error from the db log posted in #36 http://paste.ubuntu.com/26530761/ That is not expected behavior, is it? Also the fact that MAAS thinks it's losing rack/region connections seems like it could be related to this behavior. -- You received this

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
On Tue, Feb 6, 2018 at 11:24 AM, Jason Hobbs wrote: > On Mon, Feb 5, 2018 at 4:07 PM, Andres Rodriguez > wrote: > > I think there's a misunderstanding on how the network boot process > happens: > > Let's look at pxe linux first. Pxe linux does

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
Andres, I ran the test with VMs limited to 9 of 20 cores (cut the core limit in half for VMs). The first time range from this dump is with the cores at their normal limit (18). As you can see, the behavior didn't change much from one set to the other. Both sets had instances where grub started

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
On Mon, Feb 5, 2018 at 4:07 PM, Andres Rodriguez wrote: > I think there's a misunderstanding on how the network boot process happens: > Let's look at pxe linux first. Pxe linux does this: > > 1. tries UUID first # if no answer, it moves on > 2. Tries mac # if no answer,

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Andres Rodriguez
> > > > That being said, because CPU load doesn't show high we are making the > > *assumption* that it is not impacting MAAS, but again, this is an > > assumption. Making the requested change for having at least 4 CPUs > (ideally > > 6) would allow us to determining what are the effects and see

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Jason Hobbs
+1 Mike. I agree it's a bug, but there isn't real evidence that it's what causes the long delay. On Mon, Feb 5, 2018 at 7:12 PM, Mike Pontillo wrote: > Ah, I see what you mean there; I used the following filter in Wireshark: > > udp.dstport == 25305 or

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Mike Pontillo
Ah, I see what you mean there; I used the following filter in Wireshark: udp.dstport == 25305 or udp.srcport == 25305 This is not the behavior I saw if the TFTP request is answered in a timely manner, so I suspect that the long delay between the initial request and the answer is causing the
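
The filter above can be mirrored outside Wireshark. A minimal sketch, assuming packets have already been reduced to records with `sport` and `dport` fields: the record layout, the helper names, and the sample packets are hypothetical and are not taken from the pcap. Port 25305 is the client port quoted in the filter, and 69 is the well-known TFTP port.

```python
CLIENT_PORT = 25305  # client-side UDP port from the Wireshark filter


def matches_filter(pkt: dict, port: int = CLIENT_PORT) -> bool:
    """Equivalent of: udp.dstport == 25305 or udp.srcport == 25305."""
    return pkt["dport"] == port or pkt["sport"] == port


def split_by_direction(packets: list, port: int = CLIENT_PORT) -> tuple:
    """Separate packets the client sent from that port from replies sent to it."""
    sent = [p for p in packets if p["sport"] == port]
    received = [p for p in packets if p["dport"] == port]
    return sent, received


# Hypothetical records: two requests from the client port, one unrelated
# request from another port, and one reply addressed to the client port.
SAMPLE = [
    {"no": 1, "sport": 25305, "dport": 69},
    {"no": 2, "sport": 25305, "dport": 69},
    {"no": 3, "sport": 40000, "dport": 69},
    {"no": 4, "sport": 50123, "dport": 25305},
]

if __name__ == "__main__":
    sent, received = split_by_direction(SAMPLE)
    print(len(sent), "sent from client port;", len(received), "replies to it")
```

Matching a late reply to its request by the client port is the same reasoning used elsewhere in the thread to tie packet 90573 to the earlier grub.cfg- requests.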

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Steve Langasek
On Tue, Feb 06, 2018 at 12:11:21AM -, Mike Pontillo wrote: > Steve, can you be more specific about which packet capture showed the > "stacked OACK" behavior? This was the first packet capture that Jason posted, in comment #30. The udp retransmits shown in packets 6262-6268 each receive an

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Jason Hobbs
On Mon, Feb 5, 2018 at 3:45 PM, Andres Rodriguez wrote: > @Jason, > > On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs > wrote: > >> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez >> wrote: >> > No new data was provided

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Jason Hobbs
@Mike, you can see the stacked response behavior in https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5046952/+files/spearow-fall-back-to-default-amd64.pcap You can tell packet 90573 is a response to the requests for grub.cfg- because its destination port (25305) is the src port the

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Mike Pontillo
Steve, can you be more specific about which packet capture showed the "stacked OACK" behavior? I looked at a packet capture Andres pointed me to, and don't see the "stacked OACKs" you describe. Each TFTP transaction (per RFC 1350) is indicated by the (source port, dest port) tuple, and I see that

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Steve Langasek
On Mon, Feb 05, 2018 at 09:27:15PM -, Andres Rodriguez wrote: > MAAS already has a mechanism to collapse retries into the initial request. Are we certain that this is working correctly? If so, why are packet captures showing that MAAS is sending stacked tftp OACK responses, 1:1 for the

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Andres Rodriguez
@Jason, The pcap exactly shows the behavior I was hoping to see, which is grub tries to get X config first, and since it didn't get a response, it moves on and tries to get Y config. On Mon, Feb 5, 2018 at 4:45 PM, Jason Hobbs wrote: > On Mon, Feb 5, 2018 at 3:27 PM,

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Andres Rodriguez
I think there's a misunderstanding on how the network boot process happens: Let's look at pxe linux first. Pxe linux does this: 1. tries UUID first # if no answer, it moves on 2. Tries mac # if no answer, it moves on 3. tries full IP address # if no answer, it moves on 4. tries partial IP address
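
The lookup order described above can be sketched as follows. This is a minimal illustration based on the PXELINUX documentation's naming scheme as I understand it (UUID, then the MAC prefixed with the ARP hardware type `01-`, then the IPv4 address in uppercase hex shortened one digit at a time, then `default`); the final `default` step is not in the truncated message above and is an assumption from that documentation. The function name is hypothetical, and the sample UUID, MAC, and IP are the documentation's own examples, not values from this bug.

```python
import ipaddress


def pxelinux_config_candidates(uuid: str, mac: str, ip: str) -> list:
    """Return config file names in the order a PXELINUX client would try them."""
    names = [uuid.lower()]                              # 1. client UUID
    names.append("01-" + mac.lower().replace(":", "-"))  # 2. MAC (ARP type 01 = Ethernet)
    hex_ip = f"{int(ipaddress.IPv4Address(ip)):08X}"     # 3. full IP as 8 hex digits
    names.extend(hex_ip[:n] for n in range(8, 0, -1))    # 4. progressively shorter prefixes
    names.append("default")                              # 5. generic fallback (assumption)
    return names


if __name__ == "__main__":
    for name in pxelinux_config_candidates(
        "b8945908-d6a6-41a9-611d-74a6ab80b83d", "88:99:AA:BB:CC:DD", "192.168.2.91"
    ):
        print(name)
```

Each name is a separate TFTP request, so a server that answers the early, machine-specific names slowly leaves the client to time out and fall through to a more generic one.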

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Jason Hobbs
On Mon, Feb 5, 2018 at 3:27 PM, Andres Rodriguez wrote: > @Jason, > > > On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs > wrote: > >> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez >> wrote: >> > No new data was

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Steve Langasek
On Mon, Feb 05, 2018 at 08:40:56PM -, Jason Hobbs wrote: > @Steve - I don't think it helps with the problem of MAAS taking a long > time to respond to the grub.cfg request. However, it may help with the > part of this bug where grub is hitting an error and asking for keyboard > input.

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Andres Rodriguez
@Jason, On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs wrote: > On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez > wrote: > > No new data was provided to mark this New in MAAS: > > > > 1. Changes to the storage seem to have improved things > >

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Jason Hobbs
On Mon, Feb 5, 2018 at 3:27 PM, Andres Rodriguez wrote: > @Steve, > > MAAS already has a mechanism to collapse retries into the initial request. > In this case, it is the rack that grabs the requests and makes a request to > the region. If retries come within the time

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Andres Rodriguez
@Jason, On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs wrote: > On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez > wrote: > > No new data was provided to mark this New in MAAS: > > > > 1. Changes to the storage seem to have improved things > >

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Jason Hobbs
The packetdump (comment #35) of MAAS not responding to grub's request for the mac specific grub.cfg before grub times out, and then responding immediately to the generic-amd64 grub cfg, clearly shows a race condition in MAAS. MAAS's design of dynamically generating the interface specific grub

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Andres Rodriguez
@Steve, MAAS already has a mechanism to collapse retries into the initial request. In this case, it is the rack that grabs the requests and makes a request to the region. If retries come within the time that the rack is waiting for a response from the region, these request get "ignored" and the
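
The general idea of collapsing retries into one in-flight request can be sketched like this. It is a generic coalescing pattern written with `asyncio`, not MAAS's rack controller code; the class name, the key shape, and the simulated region lookup are all hypothetical. One difference from the description above: there retries are said to be "ignored", whereas in this sketch every caller simply awaits the same single result.

```python
import asyncio


class RequestCollapser:
    """Share one backend call among all requests for the same key
    that arrive while that call is still in flight."""

    def __init__(self, fetch):
        self._fetch = fetch      # coroutine function: key -> result
        self._in_flight = {}     # key -> asyncio.Task

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First request for this key: start the real lookup.
            task = asyncio.ensure_future(self._fetch(key))
            self._in_flight[key] = task
            # Forget the key once the lookup finishes, so later requests refetch.
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # Retries arriving meanwhile just wait on the same task.
        return await task


async def demo():
    calls = []

    async def slow_region_lookup(key):
        # Stand-in for the rack asking the region for a boot config.
        calls.append(key)
        await asyncio.sleep(0.05)
        return f"config for {key[1]}"

    collapser = RequestCollapser(slow_region_lookup)
    key = ("10.0.0.5", "grub.cfg-example")  # hypothetical (client, file) key
    results = await asyncio.gather(*(collapser.get(key) for _ in range(5)))
    return len(calls), results


if __name__ == "__main__":
    n_calls, results = asyncio.run(demo())
    print(n_calls, "backend call(s) for", len(results), "requests")
```

Whether the real mechanism behaves this way under load is exactly what the stacked OACK question elsewhere in the thread is probing; the sketch only shows the intended effect.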

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Jason Hobbs
@Steve - I don't think it helps with the problem of MAAS taking a long time to respond to the grub.cfg request. However, it may help with the part of this bug where grub is hitting an error and asking for keyboard input. https://imgur.com/a/as8Sx Maybe that should be a separate bug? It seems

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Jason Hobbs
On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez wrote: > No new data was provided to mark this New in MAAS: > > 1. Changes to the storage seem to have improved things Yes, it has. That doesn't change whether or not there is a bug in MAAS. Can you please address the

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Steve Langasek
Jason's feedback was that, after making the changes to the storage configuration of his environment, deploying the test grubx64.efi doesn't have any effect on the MAAS server's response time to tftp requests. So at this point it's not at all clear that the grub change, while correct, helps with

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Mathieu Trudel-Lapierre
** Changed in: grub2 (Ubuntu) Status: Triaged => In Progress ** Changed in: grub2 (Ubuntu) Importance: Undecided => Critical -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1743249 Title:

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Andres Rodriguez
No new data was provided to mark this New in MAAS: 1. Changes to the storage seem to have improved things 2. No tests have been run with fixed grub that have caused boot failures. 3. AFAIK, the VM config has not changed to use less CPU to compare results and whether this config change causes the

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Chris Gregan
** Changed in: maas Status: Incomplete => New -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1743249 Title: Failed Deployment after timeout trying to retrieve grub cfg To manage

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-02 Thread Jason Hobbs
Here is part of a packet capture on my environment: http://paste.ubuntu.com/26509374/ From the other tftp server on the deploy: http://paste.ubuntu.com/26509386/ The whole pcap is prohibitively large because it's for multiple hosts. You can see from this that grub is only reading the file once

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-02 Thread Andres Rodriguez
I've tested and I can confirm it made just 1 request instead of 4. I think now we need to test it in Jason's environment to see the differences. -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1743249

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-02 Thread Steve Langasek
Note that the source file is grubnetx64.efi, it should be installed as grubx64.efi in the tftp server directory. -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1743249 Title: Failed Deployment after

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-02 Thread Andres Rodriguez
** Attachment added: "tcpdump.pcap" https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1743249/+attachment/5047711/+files/tcpdump.pcap -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1743249

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-02 Thread Steve Langasek
Attached is an (unsigned) test grubnetx64.efi, built from xenial grub2 plus my patch. Please deploy this in the maas tftp environment where you are experiencing the timeouts, and give feedback on whether it helps with the primary symptom. ** Attachment added: "grubnetx64.efi"

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-02 Thread Ubuntu Foundations Team Bug Bot
** Tags added: patch -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1743249 Title: Failed Deployment after timeout trying to retrieve grub cfg To manage notifications about this bug go to:

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Steve Langasek
Here is a possible fix for grub's repeated requests of the config file. ** Patch added: "bufio_sensible_block_sizes.patch" https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5047245/+files/bufio_sensible_block_sizes.patch ** Changed in: grub2 (Ubuntu) Status: New => Triaged **

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Jason Hobbs
here is the complete output of top from comment #48 ** Attachment added: "top.txt.gz" https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5047072/+files/top.txt.gz -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Jason Hobbs
I also collected iotop output from the same run: http://paste.ubuntu.com/26502363/ The storage setup on these nodes is writethrough bcache with a 400 GB nvme in front of a 1TB spinning disk. Since it's writethrough, writes have to make it to the spinning disk before being counted as sync'd. The

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Jason Hobbs
I collected top output from a run (this run did not exhibit this failure): http://paste.ubuntu.com/26502311/ The highest the load average ever gets is 11.85, and it's usually around 3-4. This is a 20 thread system, so it doesn't look like CPU contention is the problem. -- You received this
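
The reasoning behind "it doesn't look like CPU contention" is a per-thread normalization of the load average. A small sketch using the figures quoted above (peak 11.85, typical 3 to 4, 20 hardware threads); the function name is hypothetical.

```python
def load_per_thread(load_avg: float, threads: int) -> float:
    """Load average divided by the number of hardware threads."""
    return load_avg / threads


if __name__ == "__main__":
    threads = 20
    for label, load in (("peak", 11.85), ("typical low", 3.0), ("typical high", 4.0)):
        print(f"{label}: {load_per_thread(load, threads):.4f} per thread")
```

Values well below 1.0 per thread suggest tasks are not queueing for CPU. On Linux the load average also counts tasks in uninterruptible (usually I/O) sleep, so it is a coarse signal and does not by itself say where the time goes.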

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Andres Rodriguez
@Steve, On Thu, Feb 1, 2018 at 1:49 PM, Steve Langasek wrote: > On Thu, Feb 01, 2018 at 06:15:31PM -, Andres Rodriguez wrote: > > @Jason, > > > Packet 90573 doesn't seem to me as an indication of what you are > > describing. What I see is this: > > > 1. grub

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Andres Rodriguez
@Jason, Did you expand the "production environment" section?

                                       Memory (MB)  CPU (GHz)  Disk (GB)
Region controller (minus PostgreSQL)   2048         2.0        5
PostgreSQL                             2048         2.0        20
Rack controller                        2048         2.0        20
Ubuntu Server (including logs)         512          0.5        20

-- You

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Andres Rodriguez
Oh I see what you mean, yeah ignore the GHz section, that's wrong. -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1743249 Title: Failed Deployment after timeout trying to retrieve grub cfg To

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Jason Hobbs
FYI those minimum requirements don't mention anything about core/thread count. -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1743249 Title: Failed Deployment after timeout trying to retrieve grub

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Andres Rodriguez
@Jason, I would give MAAS at least 6 CPUs: 2 for Region, 2 for Postgres, 2 for Rack. I would even recommend 4 for region instead of just 2, as MAAS runs 4 region processes. So that would be a total of 8. [2]: https://docs.ubuntu.com/maas/2.3/en/#minimum-requirements -- You received this bug

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Steve Langasek
On Thu, Feb 01, 2018 at 06:15:31PM -, Andres Rodriguez wrote: > @Jason, > Packet 90573 doesn't seem to me as an indication of what you are > describing. What I see is this: > 1. grub makes ~30 requests for PXE config on grub.cfg-, after which it > gives up because it didn't receive a

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Jason Hobbs
Andres, You can tell packet 90573 is a response to the requests for grub.cfg- because its destination port (25305) is the src port the request for grub.cfg- was coming from (packets 2 through 38). We're running another test now to collect load information. -- You received this bug notification

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Andres Rodriguez
@Jason, Packet 90573 doesn't seem to me to be an indication of what you are describing. What I see is this: 1. grub makes ~30 requests for PXE config on grub.cfg-, after which it gives up because it didn't receive a response. 2. grub moves on and requests grub.cfg-default-amd64, and it receives a

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Jason Hobbs
** Changed in: maas Status: Incomplete => New -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1743249 Title: Failed Deployment after timeout trying to retrieve grub cfg To manage

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Jason Hobbs
In the pcap from comment #35, MAAS eventually does respond to the interface specific grub request, 61 seconds after the request, after it's already sent the grub.cfg-default-amd64, kernel, and initrd. You can see the responses to the interface specific grub.cfg requests coming back starting at

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Jason Hobbs
Attaching a pcap from a failure case. In this case, grub tried for 30 seconds to retrieve the interface specific grub.cfg, but never got a response from MAAS. It then gave up and got the amd64-default one instead, which caused the machine to try to enlist and then power off, leading to a failed

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-01-31 Thread Steve Langasek
Regarding grub requesting the same file 4 times, a surprising finding: I'm able to reproduce this with files of a certain length. By chance my grub.cfg was 1 byte shorter than the one maas serves (269 bytes instead of 270), and I saw multiple requests for this file. To reproduce this in a VM

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-01-31 Thread Andres Rodriguez
** Also affects: grub2 (Ubuntu) Importance: Undecided Status: New -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1743249 Title: Failed Deployment after timeout trying to retrieve grub cfg