and fwiw, I'm not saying this is *the* solution for a problem like this one
where there is IO starvation. But it is definitely a step forward.
--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer
That's what we have done to test the difference. So for the wider
audience: this patch was tested on a 4-core NUC with an SSD, deploying 6 VMs
while another 4 nodes were PXE booting from MAAS.
Before the fix we saw:
1. client would do multiple requests for the same file.
2. maas would run
BTW to be clear here I'm saying I don't think the path forward on
improving this issue is thinking about how MAAS works and throwing out
patches that might improve performance here and there. The path
forward is to instrument MAAS on a system with slow i/o and to figure
out exactly where it's gett
dm-delay looks very interesting along those lines.
https://www.enodev.fr/posts/emulate-a-slow-block-device-with-dm-
delay.html
https://www.kernel.org/doc/Documentation/device-mapper/delay.txt
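As a rough sketch of how dm-delay could be used here (the device path, size, and delay value are illustrative; creating the mapping needs root, so the dmsetup line is left commented):

```shell
# dm-delay table format: <start_sector> <num_sectors> delay <device> <offset> <delay_ms>
DEV=/dev/sdb1        # hypothetical backing device for the test
DELAY_MS=200         # add 200ms of latency to every I/O
SECTORS=2097152      # in practice: SECTORS=$(blockdev --getsz "$DEV")
TABLE="0 $SECTORS delay $DEV 0 $DELAY_MS"
echo "$TABLE"
# echo "$TABLE" | sudo dmsetup create slowdisk   # exposes /dev/mapper/slowdisk
```

Mounting the resulting /dev/mapper/slowdisk under MAAS's tftp root (or the
postgres data directory) would simulate the slow-i/o condition without
needing special hardware.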
On Tue, Feb 6, 2018 at 5:06 PM, Jason Hobbs wrote:
> On Tue, Feb 6, 2018 at 4:50 PM, Andres Rodriguez
>
On Tue, Feb 6, 2018 at 4:50 PM, Andres Rodriguez
wrote:
> I don't have logs anymore as I have since rebuilt my environment, but I can
> confirm seeing improvements on a maas server running with high IO (note it
> was a single region/rack).
>
> see inline:
>
>
> On Tue, Feb 6, 2018 at 5:17 PM, Jaso
I don't have logs anymore as I have since rebuilt my environment, but I can
confirm seeing improvements on a maas server running with high IO (note it
was a single region/rack).
see inline:
On Tue, Feb 6, 2018 at 5:17 PM, Jason Hobbs
wrote:
> Andres, it was a single test in both cases, and in
Andres, it was a single test in both cases, and in both cases there was
almost no delay from MAAS. That's not significant enough to call the
results positive.
Since neither of you answered yes, I'll assume the answer was no to my
question of whether there was anything in my logs or data that showed
Andres did the testing of the changes and has logs to prove the
improvement.
On Tue, Feb 6, 2018 at 4:43 PM, Jason Hobbs
wrote:
> Blake, that's great. Do you have before and after numbers showing the
> improvement this change made?
>
> Do you have any data or logs that led you to believe this
@Jason,
I'm comparing the paste in #79 vs the paste in #90
#79 (non-patched): https://paste.ubuntu.com/26530737/
#90 (patched with lru_cache): https://paste.ubuntu.com/26531873/
Examples I see in #79:
14:02:ec:42:38:dc # makes 9 requests. on line 160+
14:02:ec:42:28:70 # 8 requests on line 72
14:02:ec:41:d7:
Blake, that's great. Do you have before and after numbers showing the
improvement this change made?
Do you have any data or logs that led you to believe this was the
culprit in the slow responses I saw on my cluster?
On Tue, Feb 6, 2018 at 3:12 PM, Blake Rouse wrote:
> Actually caching does mak
Actually caching does make a difference. That method is not just caching
the reading of a file, it caches the searching of the file based on the
purpose, the reading of that file from disk (sure can be in kernel
cache), the parsing of the template by tempita.
All of that is redundant work that is b
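A minimal sketch of that kind of memoization with functools.lru_cache (the template store, parse, and get_template names are illustrative stand-ins, not MAAS's actual code):

```python
from functools import lru_cache

# Stand-in for template files on disk, keyed by purpose.
TEMPLATES = {"config": "ip={{ip}} mac={{mac}}"}

def parse(text):
    # Stand-in for tempita parsing: returns a render callable.
    def render(**ctx):
        out = text
        for key, value in ctx.items():
            out = out.replace("{{%s}}" % key, str(value))
        return out
    return render

@lru_cache(maxsize=None)
def get_template(purpose):
    # Search, read, and parse happen at most once per purpose;
    # later calls return the cached render callable.
    return parse(TEMPLATES[purpose])

first = get_template("config")(ip="10.0.0.1", mac="aa:bb")
second = get_template("config")(ip="10.0.0.2", mac="cc:dd")
print(first)                           # ip=10.0.0.1 mac=aa:bb
print(get_template.cache_info().hits)  # 1 -- the second lookup was cached
```

The point is that the cache key is the purpose, so the whole search+read+parse
pipeline runs once, not once per tftp request.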
The patch from #84 is adding a cache for reading the template file on
the rack controller. I don't understand why this change is being made.
This file will almost certainly be in the page cache anyhow as these
systems have a lot of free ram. Usually it's best to just let the page
cache do its th
Anyhow, I tested with the patch from #84 as requested, here are the
results: http://paste.ubuntu.com/26531873/
We're still seeing some retries with it, same as before.
But, I think the test is of limited value. It didn't make things worse
but we don't have any evidence from the test that it made
Ok - and what about the region controller losing contact with the rack
controller log messages? What is that about?
On Tue, Feb 6, 2018 at 11:37 AM, Andres Rodriguez
wrote:
> fwiw, the deadlocks issues is regiond trying to determine which process
> should send updates to which racks for *dhcp* ch
This bug was fixed in the package grub2 - 2.02-2ubuntu6
---
grub2 (2.02-2ubuntu6) bionic; urgency=medium
[ Steve Langasek ]
* debian/patches/bufio_sensible_block_sizes.patch: Don't use arbitrary file
sizes as block sizes in bufio: this avoids potentially seeking back in
th
@Jason,
Are these tests with archive grub or patched grub?
On Tue, Feb 6, 2018 at 11:39 AM, Jason Hobbs
wrote:
> Andres,
>
> I ran the test with VMs limited to 9 of 20 cores (cut the core limit
> in half for VMs). The first time range from this dump is with the
> cores at their normal limit (1
fwiw, the deadlock issue is regiond trying to determine which process
should send updates to which racks for *dhcp* changes, so this is not at
all related to the RPC boot requests for pxe.
On Tue, Feb 6, 2018 at 11:43 AM, Jason Hobbs
wrote:
> Can you please comment on the deadlock detected err
>
> >
> > Yes, it is not an unknown machine, but that doesn't change the fact that
> > this is working as designed. If the client didn't get a response for the
> > request it makes, and the client decides to move on and makes a different
> > request, then it is working as designed. Again, the bug h
The deadlock is not expected behavior.
Due to the isolation level, the number of workers (e.g. 12
workers / 3 regions), and the fact that there could be IO starvation, it's
surfacing this issue. That said, changes to improve this and prevent the
deadlocks are not backportable to 2.3 and are targeted f
On Tue, Feb 6, 2018 at 10:40 AM, Andres Rodriguez
wrote:
> On Tue, Feb 6, 2018 at 11:24 AM, Jason Hobbs
> wrote:
>
>> On Mon, Feb 5, 2018 at 4:07 PM, Andres Rodriguez
>> wrote:
>> > I think there's a misunderstanding on how the network boot process
>> happens:
>> > Let's look at pxe linux first.
Can you please comment on the deadlock detected error from the db log
posted in #36?
http://paste.ubuntu.com/26530761/
That is not expected behavior, is it? Also, the fact that MAAS thinks it's
losing rack/region connections seems like it could be related to this
behavior.
On Tue, Feb 6, 2018 at 11:24 AM, Jason Hobbs
wrote:
> On Mon, Feb 5, 2018 at 4:07 PM, Andres Rodriguez
> wrote:
> > I think there's a misunderstanding on how the network boot process
> happens:
> > Let's look at pxe linux first. Pxe linux does this:
> >
> > 1. tries UUID first # if no answer, it
Andres,
I ran the test with VMs limited to 9 of 20 cores (cut the core limit
in half for VMs). The first time range from this dump is with the
cores at their normal limit (18).
As you can see, the behavior didn't change much from one set to the
other. Both sets had instances where grub started
On Mon, Feb 5, 2018 at 4:07 PM, Andres Rodriguez
wrote:
> I think there's a misunderstanding on how the network boot process happens:
> Let's look at pxe linux first. Pxe linux does this:
>
> 1. tries UUID first # if no answer, it moves on
> 2. Tries mac # if no answer, it moves on
> 3. tries full
>
>
> > That being said, because CPU load doesn't show high we are making the
> > *assumption* that it is not impacting MAAS, but again, this is an
> > assumption. Making the requested change for having at least 4 CPUs
> (ideally
> > 6) would allow us to determining what are the effects and see whe
+1 Mike. I agree it's a bug, but there isn't real evidence that
it's what causes the long delay.
On Mon, Feb 5, 2018 at 7:12 PM, Mike Pontillo
wrote:
> Ah, I see what you mean there; I used the following filter in Wireshark:
>
> udp.dstport == 25305 or udp.srcport == 25305
>
> This is not
Ah, I see what you mean there; I used the following filter in Wireshark:
udp.dstport == 25305 or udp.srcport == 25305
This is not the behavior I saw if the TFTP request is answered in a
timely manner, so I suspect that the long delay between the initial
request and the answer is causing the t
On Tue, Feb 06, 2018 at 12:11:21AM -, Mike Pontillo wrote:
> Steve, can you be more specific about which packet capture showed the
> "stacked OACK" behavior?
This was the first packet capture that Jason posted, in comment #30. The
udp retransmits shown in packets 6262-6268 each receive an ans
On Mon, Feb 5, 2018 at 3:45 PM, Andres Rodriguez
wrote:
> @Jason,
>
> On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs
> wrote:
>
>> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez
>> wrote:
>> > No new data was provided to mark this New in MAAS:
>> >
>> > 1. Changes to the storage seem to have imp
@Mike, you can see the stacked response behavior in
https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5046952/+files/spearow-fall-back-to-default-amd64.pcap
You can tell packet 90573 is a response to the requests for
grub.cfg- because its destination port (25305) is the src port
the request
Steve, can you be more specific about which packet capture showed the
"stacked OACK" behavior?
I looked at a packet capture Andres pointed me to, and don't see the
"stacked OACKs" you describe. Each TFTP transaction (per RFC 1350) is
indicated by the (source port, dest port) tuple, and I see that
On Mon, Feb 05, 2018 at 09:27:15PM -, Andres Rodriguez wrote:
> MAAS already has a mechanism to collapse retries into the initial request.
Are we certain that this is working correctly? If so, why are packet
captures showing that MAAS is sending stacked tftp OACK responses, 1:1 for
the duplic
@Jason,
The pcap exactly shows the behavior I was hoping to see, which is grub
tries to get X config first, and since it didn't get a response, it moves
on and tries to get Y config.
On Mon, Feb 5, 2018 at 4:45 PM, Jason Hobbs
wrote:
> On Mon, Feb 5, 2018 at 3:27 PM, Andres Rodriguez
> wrote:
I think there's a misunderstanding on how the network boot process happens:
Let's look at pxe linux first. Pxe linux does this:
1. tries UUID first # if no answer, it moves on
2. Tries mac # if no answer, it moves on
3. tries full IP address # if no answer, it moves on
4. tries partial IP address
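That fallback order can be sketched as the list of candidate filenames pxelinux works through (simplified; e.g. the real client prefixes the MAC with its ARP hardware type, hence the "01-"):

```python
def pxelinux_config_candidates(uuid, mac, ip_hex):
    # 1. client UUID; 2. "01-" + MAC with dashes; 3. full IP as upper-case
    # hex, then progressively shorter prefixes; 4. finally "default".
    names = [uuid, "01-" + mac.lower().replace(":", "-")]
    for length in range(len(ip_hex), 0, -1):
        names.append(ip_hex[:length])
    names.append("default")
    return names

# 192.168.2.91 -> C0A8025B
print(pxelinux_config_candidates(
    "b8945908-d6a6-41a9-611d-74a6ab80b83d", "88:99:AA:BB:CC:DD", "C0A8025B"))
```

Each name is only tried because the previous one got no answer, which is why
a slow server turns one boot into a long chain of timed-out requests.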
On Mon, Feb 5, 2018 at 3:27 PM, Andres Rodriguez
wrote:
> @Jason,
>
>
> On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs
> wrote:
>
>> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez
>> wrote:
>> > No new data was provided to mark this New in MAAS:
>> >
>> > 1. Changes to the storage seem to have i
On Mon, Feb 05, 2018 at 08:40:56PM -, Jason Hobbs wrote:
> @Steve - I don't think it helps with the problem of MAAS taking a long
> time to respond to the grub.cfg request. However, it may help with the
> part of this bug where grub is hitting an error and asking for keyboard
> input. https:/
On Mon, Feb 5, 2018 at 3:27 PM, Andres Rodriguez
wrote:
> @Steve,
>
> MAAS already has a mechanism to collapse retries into the initial request.
> In this case, it is the rack that grabs the requests and makes a request to
> the region. If retries come within the time that the rack is waiting for
@Jason,
On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs
wrote:
> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez
> wrote:
> > No new data was provided to mark this New in MAAS:
> >
> > 1. Changes to the storage seem to have improved things
>
> Yes, it has. That doesn't change whether or not ther
@Jason,
On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs
wrote:
> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez
> wrote:
> > No new data was provided to mark this New in MAAS:
> >
> > 1. Changes to the storage seem to have improved things
>
> Yes, it has. That doesn't change whether or not the
The packetdump (comment #35) of MAAS not responding to grub's request
for the mac specific grub.cfg before grub times out, and then responding
immediately to the generic-amd64 grub cfg, clearly shows a race
condition in MAAS.
MAAS's design of dynamically generating the interface specific grub
conf
@Steve,
MAAS already has a mechanism to collapse retries into the initial request.
In this case, it is the rack that grabs the requests and makes a request to
the region. If retries come within the time that the rack is waiting for a
response from the region, these requests get "ignored" and the Ra
@Steve - I don't think it helps with the problem of MAAS taking a long
time to respond to the grub.cfg request. However, it may help with the
part of this bug where grub is hitting an error and asking for keyboard
input. https://imgur.com/a/as8Sx
Maybe that should be a separate bug? It seems li
On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez
wrote:
> No new data was provided to mark this New in MAAS:
>
> 1. Changes to the storage seem to have improved things
Yes, it has. That doesn't change whether or not there is a bug in
MAAS. Can you please address the critical log errors that I
Jason's feedback was that, after making the changes to the storage
configuration of his environment, deploying the test grubx64.efi doesn't
have any effect on the MAAS server's response time to tftp requests. So
at this point it's not at all clear that the grub change, while correct,
helps with th
** Changed in: grub2 (Ubuntu)
Status: Triaged => In Progress
** Changed in: grub2 (Ubuntu)
Importance: Undecided => Critical
No new data was provided to mark this New in MAAS:
1. Changes to the storage seem to have improved things
2. No tests have been run with fixed grub that have caused boot failures.
3. AFAIK, the VM config has not changed to use less CPU to compare results and
whether this config change causes the
** Changed in: maas
Status: Incomplete => New
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1743249
Title:
Failed Deployment after timeout trying to retrieve grub cfg
Here is part of a packet capture on my environment:
http://paste.ubuntu.com/26509374/
From the other tftp server on the deploy:
http://paste.ubuntu.com/26509386/
The whole pcap is prohibitively large because it's for multiple hosts.
You can see from this that grub is only reading the file once
I've tested and I can confirm it made just 1 request instead of 4. I
think now we need to test it in Jason's environment to see the
differences.
Note that the source file is grubnetx64.efi, it should be installed as
grubx64.efi in the tftp server directory.
** Attachment added: "tcpdump.pcap"
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1743249/+attachment/5047711/+files/tcpdump.pcap
Attached is an (unsigned) test grubnetx64.efi, built from xenial grub2
plus my patch. Please deploy this in the maas tftp environment where
you are experiencing the timeouts, and give feedback on whether it helps
with the primary symptom.
** Attachment added: "grubnetx64.efi"
https://bugs.lau
** Tags added: patch
Here is a possible fix for grub's repeated requests of the config file.
** Patch added: "bufio_sensible_block_sizes.patch"
https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5047245/+files/bufio_sensible_block_sizes.patch
** Changed in: grub2 (Ubuntu)
Status: New => Triaged
** C
here is the complete output of top from comment #48
** Attachment added: "top.txt.gz"
https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5047072/+files/top.txt.gz
I also collected iotop output from the same run:
http://paste.ubuntu.com/26502363/
The storage setup on these nodes is writethrough bcache with a 400 GB
nvme in front of a 1TB spinning disk. Since it's writethrough, writes
have to make it to the spinning disk before being counted as sync'd.
The
I collected top output from a run (this run did not exhibit this
failure):
http://paste.ubuntu.com/26502311/
The highest the load average ever gets is 11.85, and it's usually around
3-4. This is a 20 thread system, so it doesn't look like CPU contention
is the problem.
@Steve,
On Thu, Feb 1, 2018 at 1:49 PM, Steve Langasek wrote:
> On Thu, Feb 01, 2018 at 06:15:31PM -, Andres Rodriguez wrote:
> > @Jason,
>
> > Packet 90573 doesn't seem to me as an indication of what you are
> > describing. What I see is this:
>
> > 1. grub makes ~30 requests for PXE confi
@Jason,
Did you expand the "production environment" section?
                                       Memory (MB)  CPU (GHz)  Disk (GB)
Region controller (minus PostgreSQL)   2048         2.0        5
PostgreSQL                             2048         2.0        20
Rack controller                        2048         2.0        20
Ubuntu Server (including logs)         512          0.5        20
Oh I see what you mean, yeah ignore the GHz section, that's wrong.
FYI those minimum requirements don't mention anything about core/thread
count.
@Jason,
I would give MAAS at least 6 CPUs.
2 for Region
2 for Postgres
2 for Rack.
I would even recommend 4 for region instead of just 2, as MAAS runs 4
region processes. So that would be a total of 8.
[2]: https://docs.ubuntu.com/maas/2.3/en/#minimum-requirements
On Thu, Feb 01, 2018 at 06:15:31PM -, Andres Rodriguez wrote:
> @Jason,
> Packet 90573 doesn't seem to me as an indication of what you are
> describing. What I see is this:
> 1. grub makes ~30 requests for PXE config on grub.cfg-, after which it
> gives up because it didn't receive a respons
Andres,
You can tell packet 90573 is a response to the requests for
grub.cfg- because its destination port (25305) is the src port the
request for grub.cfg- was coming from (packets 2 through 38).
We're running another test now to collect load information.
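The port matching described here is mechanical once the capture is decoded; a sketch with made-up sample packets (not the real pcap data):

```python
# Per RFC 1350, the server replies from a fresh port *to* the client's
# source port, so a response belongs to whichever request's src port
# equals the response's dst port.
requests = [
    # (packet_no, src_port, dst_port)
    (2, 25305, 69),        # RRQ for grub.cfg-<mac>
    (38, 25305, 69),       # retransmit of the same RRQ
]
responses = [(90573, 3671, 25305)]  # late OACK from the server

for pkt, _src, dst in responses:
    matched = [p for p, s, _d in requests if s == dst]
    print("packet %d answers request packets %s" % (pkt, matched))
```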
@Jason,
Packet 90573 doesn't seem to me as an indication of what you are
describing. What I see is this:
1. grub makes ~30 requests for PXE config on grub.cfg-, after which it
gives up because it didn't receive a response.
2. grub moves on and requests grub.cfg-default-amd64, and it receives a
** Changed in: maas
Status: Incomplete => New
In the pcap from comment #35, MAAS eventually does respond to the
interface specific grub request, 61 seconds after the request, after
it's already sent the grub.cfg-default-amd64, kernel, and initrd. You
can see the responses to the interface specific grub.cfg requests coming
back starting at pack
Attaching a pcap from a failure case. In this case, grub tried for 30
seconds to retrieve the interface specific grub.cfg, but never got a
response from MAAS. It then gave up and got the amd64-default one
instead, which caused the machine to try to enlist and then power off,
leading to a failed d
Regarding grub requesting the same file 4 times, a surprising finding:
I'm able to reproduce this with files of a certain length. By chance my
grub.cfg was 1 byte shorter than the one maas serves (269 bytes instead
of 270), and I saw multiple requests for this file.
To reproduce this in a VM usin
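To build test files of an exact size for that reproduction, something like this would do (the 269/270 byte sizes are the ones mentioned above; the padding scheme is arbitrary):

```python
import os

def write_padded_cfg(path, size, base="set default=0\n"):
    # Pad a grub.cfg-style file with comment filler so that its on-disk
    # size is exactly `size` bytes.
    data = base.encode()
    if len(data) < size:
        data += b"#" * (size - len(data) - 1) + b"\n"
    with open(path, "wb") as fh:
        fh.write(data[:size])
    return os.path.getsize(path)

print(write_padded_cfg("grub.cfg-test", 269))  # -> 269
```

Generating one file per candidate size makes it easy to bisect exactly which
lengths trigger grub's repeated requests.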
** Also affects: grub2 (Ubuntu)
Importance: Undecided
Status: New