Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
And FWIW, I'm not saying this is *the* solution for a problem like this one, where there is IO starvation. But it is definitely a step forward.

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1743249

Title:
  Failed Deployment after timeout trying to retrieve grub cfg

To manage notifications about this bug go to:
https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
That's what we have done to test the difference.

So, for the greater audience: this patch was tested on a 4-core NUC with an SSD, deploying 6 VMs at the same time as 4 other nodes were PXE booting from MAAS.

Before the fix we saw:

1. The client would make multiple requests for the same file.
2. MAAS would run up to 3 DB requests for the node object used to render the config.
3. We inspected why we had 3 DB requests for the same config.

From this behavior, we determined that the rack queries the region, obtains the object, and then takes a while to generate the config and return it to the client. Before the rack returns it, the client makes another request, and that causes another DB query. This confirmed that the collapsing works as expected, but only while the region/rack communication is still in progress: once the rack has already received a response, it treats a new request as a new DB query.

With the fix we saw:

1. The client would make multiple requests for the same file.
2. MAAS would always perform 1 DB request for the node object to render the config.

With this, we were able to identify that the rack was taking too long to answer the client, so if a new request came in, it was treated as a new request that was served by the region. With the changes the rack responds faster, hence MAAS collapses multiple requests and responds in a timely fashion, before the client can cause another request to the DB.

So the fix does improve things for sure, and we believe this is one of the reasons why this happened while there was IO starvation. That said, it is not the only thing to improve: there are other sections that need improvement and, as I said earlier, those involve improving the DB as well.
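The collapsing behavior described above can be sketched roughly as follows. This is only an illustration, not MAAS code: MAAS is built on Twisted, while this sketch uses asyncio, and `RequestCollapser`, `slow_fetch` and `demo` are hypothetical names. It shows why requests that arrive while the first one is still in flight share a single DB query, while a retry that arrives after the response has gone out triggers a new one.

```python
import asyncio


class RequestCollapser:
    """Share one in-flight computation among concurrent callers of the same key."""

    def __init__(self, fetch):
        self._fetch = fetch      # coroutine function: key -> result
        self._in_flight = {}     # key -> asyncio task that is still running

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key: start the real work.
            task = asyncio.ensure_future(self._fetch(key))
            self._in_flight[key] = task
            # Once the work is done, forget it so later requests start afresh.
            task.add_done_callback(lambda _t, k=key: self._in_flight.pop(k, None))
        return await task


async def demo():
    db_queries = 0

    async def slow_fetch(mac):
        # Stands in for the DB query plus config rendering (hypothetical).
        nonlocal db_queries
        db_queries += 1
        await asyncio.sleep(0.05)
        return "config for " + mac

    collapser = RequestCollapser(slow_fetch)
    mac = "14:02:ec:42:38:dc"
    # Three retries arrive while the first request is still being served.
    results = await asyncio.gather(*(collapser.get(mac) for _ in range(3)))
    queries_while_collapsed = db_queries
    # A retry that arrives after the response was sent causes a new query.
    await collapser.get(mac)
    return results, queries_while_collapsed, db_queries
```

Running `asyncio.run(demo())` issues one query for the three overlapping requests and a second one for the late retry, which matches the pattern we saw: the faster the response goes out, the fewer retries there are to turn into fresh queries.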
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
BTW to be clear here I'm saying I don't think the path forward on improving this issue is thinking about how MAAS works and throwing out patches that might improve performance here and there. The path forward is to instrument MAAS on a system with slow i/o and to figure out exactly where it's getting hung up.

Jason
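The kind of instrumentation suggested above could start from something as small as a per-stage timer. This is a minimal sketch, not anything that exists in MAAS: `timed`, `timings`, `slowest_stages` and the stage names in `handle_request` are all hypothetical, and the `time.sleep` calls merely stand in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# stage name -> list of elapsed times in seconds
timings = defaultdict(list)


@contextmanager
def timed(stage):
    """Record how long the wrapped block takes under the given stage name."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[stage].append(time.monotonic() - start)


def slowest_stages():
    """Return (stage, total_seconds) pairs, slowest first."""
    totals = {stage: sum(samples) for stage, samples in timings.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def handle_request():
    # Hypothetical stages of serving one boot-config request.
    with timed("db_query"):
        time.sleep(0.05)
    with timed("render_template"):
        time.sleep(0.001)
```

Wrapping each suspected stage this way on a machine with deliberately slow I/O, and then looking at `slowest_stages()`, would show where the time actually goes rather than where we guess it goes.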
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
dm-delay looks very interesting along those lines.

https://www.enodev.fr/posts/emulate-a-slow-block-device-with-dm-delay.html

https://www.kernel.org/doc/Documentation/device-mapper/delay.txt
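For reference, the dm-delay setup described in the links above boils down to a one-line device-mapper table. The sketch below only builds that table line and the matching `dmsetup create` arguments; it does not touch any device. The helper names, the device path and the sector count are placeholders. It assumes the three-argument form of the `delay` target (`<device> <offset> <delay>`, with the delay in milliseconds); actually running the command needs root and a spare block device, and the sector count would come from something like `blockdev --getsz`.

```python
def dm_delay_table(device, sectors, delay_ms, offset=0):
    """Build a device-mapper table line for the 'delay' target.

    Uses the three-argument form: <device> <offset> <delay>, where the
    offset is in sectors and the delay is in milliseconds.
    """
    if sectors <= 0 or delay_ms < 0 or offset < 0:
        raise ValueError("sectors must be positive; delay and offset non-negative")
    return f"0 {sectors} delay {device} {offset} {delay_ms}"


def dmsetup_command(name, table):
    """Return the argv that would create the mapped device /dev/mapper/<name>."""
    return ["dmsetup", "create", name, "--table", table]
```

For example, `dm_delay_table("/dev/sdb", 2097152, 500)` gives `0 2097152 delay /dev/sdb 0 500`, i.e. a 1 GiB device where each I/O is held back for 500 ms.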
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
On Tue, Feb 6, 2018 at 4:50 PM, Andres Rodriguez wrote:
> To be fair, your logs do not provide anything concrete to determine what's
> the culprit of the issue on the MAAS side. It provides a lot of clues, and
> we have since then determined that those issues were a result of IO
> starvation (from the VMs writing to disk). As such, the only way we can
> *really* see if the patch brings any significant performance improvements
> is to run tests in the environment where you were seeing the issues in the
> first place.

I didn't think my logs provided anything concrete! That's because the logging built into MAAS is not sufficient to do so.

I can't break that environment to test anymore - we got it working thanks to you guys' help and it's a production environment that needs to keep running other tests.

It might be possible to recreate this on another MAAS server, using 'stress' or a similar tool to cause disk contention.

Jason
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
I don't have logs anymore as I have since rebuilt my environment, but I can confirm seeing improvements on a MAAS server running with high IO (note it was a single region/rack).

See inline:

On Tue, Feb 6, 2018 at 5:17 PM, Jason Hobbs wrote:
> Andres, it was a single test in both cases, and in both cases there was
> almost no delay from MAAS. It's not significant enough to call it
> positive results.

Comment #93 shows there are /some/ improvements when comparing those two samples only, but as I have already said, we need data over time in both scenarios to properly compare and determine whether the changes make any material performance improvement under the current conditions of the samples (both samples were taken with a fixed IO starvation on the environment).

> Since neither of you answered yes, I'll assume the answer was no to my
> question of whether there was anything in my logs or data that showed
> reading the template from disk on the rack controller was the culprit,
> and that this fix just represents a guess at what might be causing the
> delay.

To be fair, your logs do not provide anything concrete to determine what the culprit of the issue is on the MAAS side. They provide a lot of clues, and we have since determined that those issues were a result of IO starvation (from the VMs writing to disk). As such, the only way we can *really* see if the patch brings any significant performance improvement is to run tests in the environment where you were seeing the issues in the first place.

So, if you are willing to test whether these changes make any material difference, I would unfix your environment and do two runs (one without the fix, and one with the fix). That's the only way we can really compare and be certain in *your* environment.

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Andres, it was a single test in both cases, and in both cases there was almost no delay from MAAS. It's not significant enough to call it positive results.

Since neither of you answered yes, I'll assume the answer was no to my question of whether there was anything in my logs or data that showed reading the template from disk on the rack controller was the culprit, and that this fix just represents a guess at what might be causing the delay.
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Andres did the testing of the changes and has logs to prove the improvement.

On Tue, Feb 6, 2018 at 4:43 PM, Jason Hobbs wrote:
> Blake, that's great. Do you have before and after numbers showing the
> improvement this change made?
>
> Do you have any data or logs that led you to believe this was the
> culprit in the slow responses I saw on my cluster?
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
@Jason,

I'm comparing pb in #79 vs pb in #90.

#79 (non-patched): https://paste.ubuntu.com/26530737/
#90 (patched with lru_cache): https://paste.ubuntu.com/26531873/

Examples I see in #79:

14:02:ec:42:38:dc # makes 9 requests, on line 160+
14:02:ec:42:28:70 # 8 requests, on line 72
14:02:ec:41:d7:38 # 7, on line 92

In #90 I see:

14:02:ec:41:d7:44 # makes 6 requests, on line 7
14:02:ec:42:38:dc # makes 5 requests, on line 19

So it is interesting to see that in #79 more machines make more requests than those in #90. Obviously we need more data over time to really tell the difference between the two scenarios, but based on the current logs, it does /apparently/ show an improvement.
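To gather that data over time without counting by hand, the per-machine request counts could be pulled out of the logs with a few lines of Python. This is a rough sketch: `requests_per_mac` is a hypothetical helper, and it makes no assumption about the log format beyond each request line containing the machine's MAC address.

```python
import re
from collections import Counter

# Six colon-separated hex pairs, e.g. 14:02:ec:42:38:dc
MAC_RE = re.compile(r"\b(?:[0-9a-f]{2}:){5}[0-9a-f]{2}\b", re.IGNORECASE)


def requests_per_mac(log_lines):
    """Count how many log lines mention each MAC address."""
    counts = Counter()
    for line in log_lines:
        match = MAC_RE.search(line)
        if match:
            counts[match.group(0).lower()] += 1
    return counts
```

Feeding it the lines of each paste and comparing `requests_per_mac(...).most_common()` for the patched and non-patched runs would give the same kind of numbers as above, but for every machine and every run.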
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Blake, that's great. Do you have before and after numbers showing the improvement this change made?

Do you have any data or logs that led you to believe this was the culprit in the slow responses I saw on my cluster?

On Tue, Feb 6, 2018 at 3:12 PM, Blake Rouse wrote:
> Actually caching does make a difference. That method is not just caching
> the reading of a file, it caches the searching of the file based on the
> purpose, the reading of that file from disk (sure, it can be in the kernel
> cache), and the parsing of the template by tempita.
>
> All of that is redundant work that is being done on every single request.
> Searching the filesystem and reading the file are all syscalls,
> even if the data comes from the kernel cache. Since MAAS is async based, that
> means that coroutine will be placed on hold while we wait for the result
> to be loaded from the kernel into the memory of the process. That gives
> other coroutines time to do other things, which means that coroutine
> doesn't get to execute until others are done or blocked by their own
> async request.
>
> Caching this information can greatly improve that by not requiring the
> coroutine to be pushed back into the eventloop while it is waiting for
> data from the kernel. Without this change, when the data comes back it
> still has to be processed by tempita, which will take time and block the
> eventloop from completing other work.
>
> So it's not simply that we should use the kernel to cache reads from the
> disk; there is a lot more involved here. We have noticed improvements
> with this change on systems that are being run with a large number of VMs
> because of the reduction of IO.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1743249
>
> Title:
>   Failed Deployment after timeout trying to retrieve grub cfg
>
> Status in MAAS:
>   New
> Status in grub2 package in Ubuntu:
>   Fix Released
>
> Bug description:
>   A node failed to deploy after it failed to retrieve a grub.cfg from
>   MAAS due to a timeout. In the logs, it's clear that the server tried
>   to retrieve the grub cfg many times, over about 30 seconds:
>
>   http://paste.ubuntu.com/26387256/
>
>   We see the same thing for other hosts around the same time:
>
>   http://paste.ubuntu.com/26387262/
>
>   It seems like MAAS is taking way too long to respond to these
>   requests.
>
>   This is very similar to bug 1724677, which was happening
>   pre-meltdown/spectre. The only difference is we don't see "[critical] TFTP
>   back-end failed" in the logs anymore.
>
>   I connected to the console on this system and it had errors about
>   timing out retrieving the grub-cfg, then it had an error message along
>   the lines of "error not an ip" and then "double free". After I
>   connected but before I could get a screenshot the system rebooted and
>   was directed by maas to power off, which it did successfully after
>   booting to linux.
>
>   Full logs are available here:
>   https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
>
>   This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Actually, caching does make a difference. That method is not just caching the reading of a file; it caches the search for the file based on the purpose, the reading of that file from disk (sure, that can come from the kernel cache), and the parsing of the template by tempita.

All of that is redundant work that is being done on every single request. Searching the filesystem and reading the file are all syscalls, even if the data comes from the kernel cache. Since MAAS is async based, that means the coroutine is placed on hold while we wait for the result to be loaded from the kernel into the memory of the process. That gives other coroutines time to do other things, which means that coroutine doesn't get to execute until the others are done or blocked by their own async requests.

Caching this information can greatly improve that by not requiring the coroutine to be pushed back into the event loop while it is waiting for data from the kernel. Without this change, when the data comes back it still has to be processed by tempita, which takes time and blocks the event loop from completing other work.

So it's not simply that we should use the kernel to cache reads from the disk; there is a lot more involved here. We have noticed improvements with this change on systems that are being run with a large number of VMs, because of the reduction in IO.
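To make the argument concrete, here is a minimal sketch of the pattern being described: memoizing the search, read and parse steps together with `functools.lru_cache`. It is not the MAAS patch. The `get_template` and `render` helpers, the file names and the temporary directory are hypothetical, and the standard library's `string.Template` stands in for tempita so the example stays self-contained.

```python
import functools
import string
import tempfile
from pathlib import Path

# Hypothetical template directory, created only for this demonstration.
TEMPLATE_DIR = Path(tempfile.mkdtemp())
(TEMPLATE_DIR / "config.install.template").write_text("kernel $kernel\n")
(TEMPLATE_DIR / "config.template").write_text("default $kernel\n")


@functools.lru_cache(maxsize=None)
def get_template(purpose):
    """Search for, read and parse the template for `purpose`, once per purpose."""
    # Search: prefer a purpose-specific file, fall back to the generic one.
    for name in (f"config.{purpose}.template", "config.template"):
        path = TEMPLATE_DIR / name
        if path.exists():
            # Read and parse: both only happen on a cache miss.
            return string.Template(path.read_text())
    raise FileNotFoundError(f"no template for purpose {purpose!r}")


def render(purpose, **params):
    """Render the cached template; no filesystem access after the first call."""
    return get_template(purpose).substitute(**params)


first = render("install", kernel="vmlinuz")
second = render("install", kernel="vmlinuz")
print(get_template.cache_info())
```

After the first call for a given purpose, later requests skip the filesystem lookup, the read and the parse entirely, which is the redundant per-request work described above. The trade-off of such a cache is that a template edited on disk is not picked up until the cache is cleared or the process restarts.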
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
The patch from #84 adds a cache for reading the template file on the rack controller. I don't understand why this change is being made. This file will almost certainly be in the page cache anyhow, as these systems have a lot of free RAM. Usually it's best to just let the page cache do its thing and not try to re-implement it in userspace, unless you really know what you're doing.

I haven't seen any logs that indicate that there was a bottleneck reading the template file. Do you have some data along those lines?
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Anyhow, I tested with the patch from #84 as requested, here are the results:

http://paste.ubuntu.com/26531873/

We're still seeing some retries with it, same as before. But, I think the test is of limited value. It didn't make things worse but we don't have any evidence from the test that it made things better. We're not seeing big delays on a regular basis anymore after changing the storage configuration to reduce contention. With or without this patch, we don't see any big delays.
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Ok - and what about the log messages about the region controller losing contact with the rack controller? What is that about?

On Tue, Feb 6, 2018 at 11:37 AM, Andres Rodriguez wrote:
> fwiw, the deadlocks issues is regiond trying to determine which process
> should send updates to which racks for *dhcp* changes, so this is not at
> all related to the RPC boot requests for pxe.
>
> On Tue, Feb 6, 2018 at 11:43 AM, Jason Hobbs wrote:
>
>> Can you please comment on the deadlock detected error from the db log in
>> posted in #36
>>
>> http://paste.ubuntu.com/26530761/
>>
>> That is not expected behavior is it? Also the fact that MAAS thinks its
>> losing rack/region connections seems like it could be related to this
>> behavior.
>>
>> Launchpad-Notification-Type: bug
>> Launchpad-Bug: product=maas; milestone=2.4.x; status=New;
>> importance=Undecided; assignee=None;
>> Launchpad-Bug: distribution=ubuntu; sourcepackage=grub2; component=main;
>> status=In Progress; importance=Medium; assignee=mathieu...@gmail.com;
>> Launchpad-Bug-Tags: cdo-qa cdo-qa-blocker foundations-engine patch
>> Launchpad-Bug-Information-Type: Public
>> Launchpad-Bug-Private: no
>> Launchpad-Bug-Security-Vulnerability: no
>> Launchpad-Bug-Commenters: andreserl blake-rouse cgregan jason-hobbs
>> mpontillo vorlon
>> Launchpad-Bug-Reporter: Jason Hobbs (jason-hobbs)
>> Launchpad-Bug-Modifier: Jason Hobbs (jason-hobbs)
>> Launchpad-Message-Rationale: Subscriber (MAAS)
>> Launchpad-Message-For: andreserl
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
This bug was fixed in the package grub2 - 2.02-2ubuntu6

---
grub2 (2.02-2ubuntu6) bionic; urgency=medium

  [ Steve Langasek ]
  * debian/patches/bufio_sensible_block_sizes.patch: Don't use arbitrary file
    sizes as block sizes in bufio: this avoids potentially seeking back in
    the files unnecessarily, which may require re-opening files that cannot be
    seeked into, such as via TFTP. (LP: #1743249)

 -- Mathieu Trudel-Lapierre  Mon, 05 Feb 2018 11:58:09 -0500

** Changed in: grub2 (Ubuntu)
       Status: In Progress => Fix Released
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
@Jason,

Are these tests with archive grub or patched grub?

On Tue, Feb 6, 2018 at 11:39 AM, Jason Hobbs wrote:
> Andres,
>
> I ran the test with VMs limited to 9 of 20 cores (cut the core limit
> in half for VMs). The first time range from this dump is with the
> cores at their normal limit (18).
>
> As you can see, the behavior didn't change much from one set to the
> other. Both sets had instances where grub started doing retries,
> although in neither case did it take very long.
>
> http://paste.ubuntu.com/26530737/
>
> So it seems that changing the CPU limits for the VMs doesn't change
> the results drastically, which lines up with the data showing CPU
> utilization never gets over 50%.
>
> Jason
>
> On Mon, Feb 5, 2018 at 10:19 PM, Andres Rodriguez wrote:
> >> > That being said, because CPU load doesn't show high we are making the
> >> > *assumption* that it is not impacting MAAS, but again, this is an
> >> > assumption. Making the requested change for having at least 4 CPUs
> >> > (ideally 6) would allow us to determining what are the effects and see
> >> > whether there's any difference on behavior and would help identify
> >> > what other issues.
> >> >
> >> > Without having the comparison then we are making it more difficult to
> >> > isolate the problem.
> >>
> >> To improve performance the typical pattern is 1) identify the
> >> bottleneck 2) eliminate that as the bottleneck 3) repeat.
> >>
> >> We have not identified CPU as a bottleneck. The top data says it is
> >> not!
> >
> > Jason,
> >
> > That doesn't change the fact that we are requesting tests to be run with
> > different CPU configuration for VM's, so we can make a *comparison* and
> > see if there is any material difference or none at all with the current
> > conditions. While I agree with you that the data /seems/ to show that
> > there is not issue with CPU, that doesn't change the fact that we don't
> > have any data to compare with, as there could still be an impact even if
> > it is minimum.
> >
> > Without the data, we cannot certainly assert that there's no issue caused
> > by CPU usage because we don't have a reference or point of comparison. So
> > while all fingers seem to be pointing to storage, I strongly believe it
> > is worth gathering the data now and fully discard.
> >
> > If this is something that your environment is unable to do, I would
> > appreciate that you clarify that instead of asserting that there's no
> > performance impact in MAAS due to CPU usage, when we don't really know
> > for sure (e.g. we don't know if MAAS behaves differently with less CPU
> > usage in the current conditions, and that's data worth gathering to be
> > able to better support you in the future).
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
fwiw, the deadlock issue is regiond trying to determine which process should send updates to which racks for *dhcp* changes, so this is not at all related to the RPC boot requests for PXE.

On Tue, Feb 6, 2018 at 11:43 AM, Jason Hobbs wrote:
> Can you please comment on the deadlock detected error from the db log in
> posted in #36
>
> http://paste.ubuntu.com/26530761/
>
> That is not expected behavior is it? Also the fact that MAAS thinks its
> losing rack/region connections seems like it could be related to this
> behavior.

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
> > Yes, it is not an unknown machine, but that doesn't change the fact that
> > this is working as designed. If the client didn't get a response for the
> > request it makes, and the client decides to move on and makes a different
> > request, then it is working as designed. Again, the bug here is not on the
> > clients behavior, the bug here is on the fact that the response is not
> > being done in a timely manner.
>
> Yes, agreed 100%. It's not a client bug, it's a server bug.
>
> >> > So this is *not* a race condition in MAAS. This is working as designed
> >> > and is expected. The problem here is that MAAS takes too long to answer
> >> > the initial request, which causes grub to timeout and move on to request
> >> > a different config file.
> >>
> >> Yes, because there is a race condition in the design - the MAC
> >> specific file has to be generated before grub times out. It could
> >> instead be generated before the node ever starts booting, allowing it
> >> to be served just as fast as the -default-amd64 file is, eliminating
> >> that race condition.
> >
> > It is not a race condition. It is doing exactly what it was told to do. It
> > request X thing, didn't get a response, then it requested Y thing, and got
> > a response. The fact that there's no response when X happens on a /timely/
> > manner is not a race, its a bug on the server side. So, if the machine were
> > to not be known to MAAS, it would work as expected. But since it is known
> > and the response doesn't come on a timely manner for grub, it moves on.
> > This is the same behavior pxe, uboot and other network bootloaders follow.
>
> Right - it's a bug on the server side! That's what I've been saying.
>
> > And yes, you could argue that the config could be generated before the node
> > starts booting, but what you are not considering is that the node can boot
> > from any rack controller really and that would require maas to send the
> > same file to all rack controllers in the same vlan the machine is booting
> > from and write files onto the disk dynamically, which in fact, can impact
> > performance even more. The fact the config is generated on the fly is
> > because it is generated for the specific rack controller where the machine
> > is booting from and that's the intended design.
>
> I never suggested the files had to be written to disk, but yes, they
> would need to be sent to each rack controller that it could boot from.
>
> I know it's the intended design, but it has a race condition built in
> that could be eliminated with another design. That's all I'm saying.
>
> It sounds like you agree and you point out there would be trade offs,
> and that's fine.

Actually, we don't believe this is a good change. In fact, this will cause booting issues and overall performance issues.

We already know of two areas where this can be improved. One is non-backportable to 2.3, the other one is this:

https://paste.ubuntu.com/26530972/

Is there any chance you can test that patch, or do you want me to put a patched package somewhere?

> Jason

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
The deadlock is not expected behavior. It is surfacing because of the isolation level, the number of workers (e.g. 12 workers/3 regions), and the fact that there could be IO starvation. That said, the changes to improve this and prevent the deadlocks are not backportable to 2.3 and are targeted for 2.4.

On Tue, Feb 6, 2018 at 11:43 AM, Jason Hobbs wrote:
> Can you please comment on the deadlock detected error from the db log in
> posted in #36
>
> http://paste.ubuntu.com/26530761/
>
> That is not expected behavior is it? Also the fact that MAAS thinks its
> losing rack/region connections seems like it could be related to this
> behavior.

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
On Tue, Feb 6, 2018 at 10:40 AM, Andres Rodriguez wrote:
> On Tue, Feb 6, 2018 at 11:24 AM, Jason Hobbs wrote:
>
>> On Mon, Feb 5, 2018 at 4:07 PM, Andres Rodriguez wrote:
>> > I think there's a misunderstanding on how the network boot process
>> > happens:
>> > Let's look at pxe linux first. Pxe linux does this:
>> >
>> > 1. tries UUID first # if no answer, it moves on
>> > 2. Tries mac # if no answer, it moves on
>> > 3. tries full IP address # if no answer, it moves on
>> > 4. tries partial IP address # if no answer, it moves on
>> > 5. does 4
>> > 6. does 4
>> > [...]
>> > 7. boots default.
>> >
>> > This can be seen in here:
>> >
>> > /mybootdir/pxelinux.cfg/b8945908-d6a6-41a9-611d-74a6ab80b83d
>> > /mybootdir/pxelinux.cfg/01-88-99-aa-bb-cc-dd
>> > /mybootdir/pxelinux.cfg/C0A8025B
>> > /mybootdir/pxelinux.cfg/C0A8025
>> > /mybootdir/pxelinux.cfg/C0A802
>> > /mybootdir/pxelinux.cfg/C0A80
>> > /mybootdir/pxelinux.cfg/C0A8
>> > /mybootdir/pxelinux.cfg/C0A
>> > /mybootdir/pxelinux.cfg/C0
>> > /mybootdir/pxelinux.cfg/C
>> > /mybootdir/pxelinux.cfg/default
>> >
>> > That said, in the case of grub, this behavior is similar. You have
>> > described this behavior in comment #16. So what is it that's happening:
>> >
>> > 1. grub is trying grub.cfg- address multiple times, but since it
>> > doesn't get a response, it gives up.
>> > 2. Once it gives up, grub.cfg-default-amd64 is tried instead.
>> >
>> > That said, the requests are handled completely different. The -
>> > requests actually accesses the *node* object in the database by
>> > searching it with the mac address where the request is made. With this
>> > node object, we generate the config file.
>> >
>> > In comparison, the -default-amd64 does *not* access the node object. It
>> > just access two config settings and the db query is *much* cheaper.
>> > Also, we have to keep in mind that after grub has done many retries,
>> > this returns rather fast in comparison because it is not only cheaper,
>> > but at that point MAAS may be with way less load of queued DB requests.
>> > Either way, grub giving up means that it wont expect for the initial
>> > request, but it will expect a new response for the new file it asked
>> > for.
>> >
>> > That said, this is working *exactly* as expected, because this
>> > effectively tells grub "if config for your MAC address was not
>> > returned, you can safely assume you are an unknown machine to MAAS",
>> > hence grub requests a different config file to start the enlistment
>> > process.
>>
>> Except it's not an unknown machine, and MAAS treating it like one is
>> bad behavior and a bug.
>>
>> This is not "working exactly as expected". "Working exactly as
>> expected" would be my machine being deployed when I asked for it to
>> be.
>
> Yes, it is not an unknown machine, but that doesn't change the fact that
> this is working as designed. If the client didn't get a response for the
> request it makes, and the client decides to move on and makes a different
> request, then it is working as designed. Again, the bug here is not on the
> clients behavior, the bug here is on the fact that the response is not
> being done in a timely manner.

Yes, agreed 100%. It's not a client bug, it's a server bug.

>> > So this is *not* a race condition in MAAS. This is working as designed
>> > and is expected. The problem here is that MAAS takes too long to answer
>> > the initial request, which causes grub to timeout and move on to request
>> > a different config file.
>>
>> Yes, because there is a race condition in the design - the MAC
>> specific file has to be generated before grub times out. It could
>> instead be generated before the node ever starts booting, allowing it
>> to be served just as fast as the -default-amd64 file is, eliminating
>> that race condition.
>
> It is not a race condition. It is doing exactly what it was told to do. It
> request X thing, didn't get a response, then it requested Y thing, and got
> a response. The fact that there's no response when X happens on a /timely/
> manner is not a race, its a bug on the server side. So, if the machine were
> to not be known to MAAS, it would work as expected. But since it is known
> and the response doesn't come on a timely manner for grub, it moves on.
> This is the same behavior pxe, uboot and other network bootloaders follow.

Right - it's a bug on the server side! That's what I've been saying.

> And yes, you could argue that the config could be generated before the node
> starts booting, but what you are not considering is that the node can boot
> from any rack controller really and that would require maas to send the
> same file to all rack controllers in the same vlan the machine is booting
> from and write files onto the disk dynamically, which in fact, can impact
> performance even more. The fact the config is generated on the fly is
> because it
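The pxelinux lookup order quoted in this thread (UUID, then MAC, then the IP address in hex with one digit dropped per attempt, then default) can be reproduced with a short sketch. The `pxelinux_config_names` helper is hypothetical and written only to illustrate the sequence. The example IP 192.168.2.91 is inferred from the hex name C0A8025B in the listing, and the "01-" prefix is the ARP hardware type for Ethernet.

```python
import ipaddress


def pxelinux_config_names(uuid, mac, ip):
    """Return the config file names pxelinux tries, in order, for one client."""
    names = [uuid.lower()]
    # MAC-based name: ARP hardware type (01 = Ethernet), then dash-separated MAC.
    names.append("01-" + mac.lower().replace(":", "-"))
    # IP-based names: 8 uppercase hex digits, then one digit fewer each time.
    hex_ip = f"{int(ipaddress.IPv4Address(ip)):08X}"
    for length in range(8, 0, -1):
        names.append(hex_ip[:length])
    # Last resort when nothing more specific was answered.
    names.append("default")
    return names


candidates = pxelinux_config_names(
    "b8945908-d6a6-41a9-611d-74a6ab80b83d",
    "88:99:aa:bb:cc:dd",
    "192.168.2.91",
)
for name in candidates:
    print("/mybootdir/pxelinux.cfg/" + name)
```

Each name is only tried after the previous request went unanswered, which is why a slow answer to a specific file looks the same to the client as a missing file.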
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Can you please comment on the deadlock detected error from the db log posted in #36?

http://paste.ubuntu.com/26530761/

That is not expected behavior, is it? Also, the fact that MAAS thinks it's losing rack/region connections seems like it could be related to this behavior.
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
On Tue, Feb 6, 2018 at 11:24 AM, Jason Hobbs wrote:

> On Mon, Feb 5, 2018 at 4:07 PM, Andres Rodriguez wrote:
> > I think there's a misunderstanding on how the network boot process happens:
> > Let's look at pxe linux first. Pxe linux does this:
> >
> > 1. tries UUID first # if no answer, it moves on
> > 2. Tries mac # if no answer, it moves on
> > 3. tries full IP address # if no answer, it moves on
> > 4. tries partial IP address # if no answer, it moves on
> > 5. does 4
> > 6. does 4
> > [...]
> > 7. boots default.
> >
> > This can be seen in here:
> >
> > /mybootdir/pxelinux.cfg/b8945908-d6a6-41a9-611d-74a6ab80b83d
> > /mybootdir/pxelinux.cfg/01-88-99-aa-bb-cc-dd
> > /mybootdir/pxelinux.cfg/C0A8025B
> > /mybootdir/pxelinux.cfg/C0A8025
> > /mybootdir/pxelinux.cfg/C0A802
> > /mybootdir/pxelinux.cfg/C0A80
> > /mybootdir/pxelinux.cfg/C0A8
> > /mybootdir/pxelinux.cfg/C0A
> > /mybootdir/pxelinux.cfg/C0
> > /mybootdir/pxelinux.cfg/C
> > /mybootdir/pxelinux.cfg/default
> >
> > That said, in the case of grub, this behavior is similar. You have
> > described this behavior in comment #16. So what is it that's happening:
> >
> > 1. grub is trying grub.cfg- address multiple times, but since it
> > doesn't get a response, it gives up.
> > 2. Once it gives up, grub.cfg-default-amd64 is tried instead.
> >
> > That said, the requests are handled completely different. The -
> > requests actually accesses the *node* object in the database by searching
> > it with the mac address where the request is made. With this node object,
> > we generate the config file.
> >
> > In comparison, the -default-amd64 does *not* access the node object. It
> > just access two config settings and the db query is *much* cheaper. Also,
> > we have to keep in mind that after grub has done many retries, this returns
> > rather fast in comparison because it is not only cheaper, but at that point
> > MAAS may be with way less load of queued DB requests. Either way, grub
> > giving up means that it won't wait for the initial request, but it will
> > expect a new response for the new file it asked for.
> >
> > That said, this is working *exactly* as expected, because this effectively
> > tells grub "if config for your MAC address was not returned, you can safely
> > assume you are an unknown machine to MAAS", hence grub requests a different
> > config file to start the enlistment process.
>
> Except it's not an unknown machine, and MAAS treating it like one is
> bad behavior and a bug.
> This is not "working exactly as expected". "Working exactly as
> expected" would be my machine being deployed when I asked for it to
> be.

Yes, it is not an unknown machine, but that doesn't change the fact that this is working as designed. If the client doesn't get a response to the request it makes, decides to move on, and makes a different request, then it is working as designed. Again, the bug here is not in the client's behavior; the bug is that the response is not sent in a timely manner.

> > So this is *not* a race condition in MAAS. This is working as designed and
> > is expected. The problem here is that MAAS takes too long to answer the
> > initial request, which causes grub to timeout and move on to request a
> > different config file.
>
> Yes, because there is a race condition in the design - the MAC
> specific file has to be generated before grub times out. It could
> instead be generated before the node ever starts booting, allowing it
> to be served just as fast as the -default-amd64 file is, eliminating
> that race condition.

It is not a race condition. It is doing exactly what it was told to do. It requested X, didn't get a response, then requested Y, and got a response. The fact that there's no response to X in a /timely/ manner is not a race, it's a bug on the server side. So, if the machine were not known to MAAS, it would work as expected. But since it is known and the response doesn't come in a timely manner for grub, it moves on. This is the same behavior pxe, uboot and other network bootloaders follow.

And yes, you could argue that the config could be generated before the node starts booting, but what you are not considering is that the node can boot from any rack controller, really. That would require MAAS to send the same file to all rack controllers in the same vlan the machine is booting from and to write files onto the disk dynamically, which in fact can impact performance even more. The config is generated on the fly because it is generated for the specific rack controller the machine is booting from, and that's the intended design.

> Jason
>
> > On Mon, Feb 5, 2018 at 4:30 PM, Jason Hobbs wrote:
> >
> >> The packetdump (comment #35) of MAAS not responding to grub's request
> >> for the mac specific grub.cfg before grub times
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Andres,

I ran the test with VMs limited to 9 of 20 cores (cut the core limit in half for VMs). The first time range from this dump is with the cores at their normal limit (18). As you can see, the behavior didn't change much from one set to the other. Both sets had instances where grub started doing retries, although in neither case did it take very long.

http://paste.ubuntu.com/26530737/

So it seems that changing the CPU limits for the VMs doesn't change the results drastically, which lines up with the data showing CPU utilization never gets over 50%.

Jason

On Mon, Feb 5, 2018 at 10:19 PM, Andres Rodriguez wrote:
>> > That being said, because CPU load doesn't show high we are making the
>> > *assumption* that it is not impacting MAAS, but again, this is an
>> > assumption. Making the requested change for having at least 4 CPUs (ideally
>> > 6) would allow us to determining what are the effects and see whether
>> > there's any difference on behavior and would help identify what other
>> > issues.
>> >
>> > Without having the comparison then we are making it more difficult to
>> > isolate the problem.
>>
>> To improve performance the typical pattern is 1) identify the
>> bottleneck 2) eliminate that as the bottleneck 3) repeat.
>>
>> We have not identified CPU as a bottleneck. The top data says it is
>> not!
>
> Jason,
>
> That doesn't change the fact that we are requesting tests to be run with
> different CPU configuration for VM's, so we can make a *comparison* and see
> if there is any material difference or none at all with the current
> conditions. While I agree with you that the data /seems/ to show that there
> is not issue with CPU, that doesn't change the fact that we don't have any
> data to compare with, as there could still be an impact even if it is
> minimum.
>
> Without the data, we cannot certainly assert that there's no issue caused
> by CPU usage because we don't have a reference or point of comparison. So
> while all fingers seem to be pointing to storage, I strongly believe it is
> worth gathering the data now and fully discard.
>
> If this is something that your environment is unable to do, I would
> appreciate that you clarify that instead of asserting that there's no
> performance impact in MAAS due to CPU usage, when we don't really know for
> sure (e.g. we don't know if MAAS behaves differently with less CPU usage in
> the current conditions, and that's data worth gathering to be able to
> better support you in the future).
>
> --
> Andres Rodriguez (RoAkSoAx)
> Ubuntu Server Developer
> MSc. Telecom & Networking
> Systems Engineer
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1743249
>
> Title:
> Failed Deployment after timeout trying to retrieve grub cfg
>
> Status in MAAS:
> New
> Status in grub2 package in Ubuntu:
> In Progress
>
> Bug description:
> A node failed to deploy after it failed to retrieve a grub.cfg from
> MAAS due to a timeout. In the logs, it's clear that the server tried
> to retrieve the grub cfg many times, over about 30 seconds:
>
> http://paste.ubuntu.com/26387256/
>
> We see the same thing for other hosts around the same time:
>
> http://paste.ubuntu.com/26387262/
>
> It seems like MAAS is taking way too long to respond to these
> requests.
>
> This is very similar to bug 1724677, which was happening
> pre-meltdown/spectre. The only difference is we don't see "[critical] TFTP
> back-end failed" in the logs anymore.
>
> I connected to the console on this system and it had errors about
> timing out retrieving the grub-cfg, then it had an error message along
> the lines of "error not an ip" and then "double free". After I
> connected but before I could get a screenshot the system rebooted and
> was directed by maas to power off, which it did successfully after
> booting to linux.
>
> Full logs are available here:
> https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
>
> This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
On Mon, Feb 5, 2018 at 4:07 PM, Andres Rodriguez wrote:
> I think there's a misunderstanding on how the network boot process happens:
> Let's look at pxe linux first. Pxe linux does this:
>
> 1. tries UUID first # if no answer, it moves on
> 2. Tries mac # if no answer, it moves on
> 3. tries full IP address # if no answer, it moves on
> 4. tries partial IP address # if no answer, it moves on
> 5. does 4
> 6. does 4
> [...]
> 7. boots default.
>
> This can be seen in here:
>
> /mybootdir/pxelinux.cfg/b8945908-d6a6-41a9-611d-74a6ab80b83d
> /mybootdir/pxelinux.cfg/01-88-99-aa-bb-cc-dd
> /mybootdir/pxelinux.cfg/C0A8025B
> /mybootdir/pxelinux.cfg/C0A8025
> /mybootdir/pxelinux.cfg/C0A802
> /mybootdir/pxelinux.cfg/C0A80
> /mybootdir/pxelinux.cfg/C0A8
> /mybootdir/pxelinux.cfg/C0A
> /mybootdir/pxelinux.cfg/C0
> /mybootdir/pxelinux.cfg/C
> /mybootdir/pxelinux.cfg/default
>
> That said, in the case of grub, this behavior is similar. You have
> described this behavior in comment #16. So what is it that's happening:
>
> 1. grub is trying grub.cfg- address multiple times, but since it
> doesn't get a response, it gives it.
> 2. Once it gives up, grub.cfg-default-amd64 is tried instead.
>
> That said, the requests are handled completely different. The -
> requests actually accesses the *node* object in the database by searching
> it with the mac address where the request is made. With this node object,
> we generate the config file.
>
> In comparison, the -default-amd64 does *not* access the node object. It
> just access two config settings and the db query is *much* cheaper. Also,
> we have to keep in mind that after grub has done many retries, this returns
> rather fast in comparison because it is not only cheaper, but at that point
> MAAS may be with way less load of queued DB requests. Either way, grub
> giving up means that it wont expect for the initial request, but it will
> expect a new response for the new file it asked for.
>
> That said, this is working *exactly* as expected, because this effectively
> tells grub "if config for your MAC address was not returned, you can safely
> assume you are an unknown machine to MAAS", hence grub requests a different
> config file to start the enlistment process.

Except it's not an unknown machine, and MAAS treating it like one is bad behavior and a bug.

This is not "working exactly as expected". "Working exactly as expected" would be my machine being deployed when I asked for it to be.

> So this is *not* a race condition in MAAS. This is working as designed and
> is expected. The problem here is that MAAS takes too long to answer the
> initial request, which causes grub to timeout and move on to request a
> different config file.

Yes, because there is a race condition in the design - the MAC specific file has to be generated before grub times out. It could instead be generated before the node ever starts booting, allowing it to be served just as fast as the -default-amd64 file is, eliminating that race condition.

Jason

> On Mon, Feb 5, 2018 at 4:30 PM, Jason Hobbs wrote:
>
>> The packetdump (comment #35) of MAAS not responding to grub's request
>> for the mac specific grub.cfg before grub times out, and then responding
>> immediately to the generic-amd64 grub cfg, clearly shows a race
>> condition in MAAS.
>>
>> MAAS's design of dynamically generating the interface specific grub
>> config only after it receives the tftp request for it is susceptible to
>> a race condition where grub times out before MAAS can respond.
>>
>> That design is not the only possible design. All the information
>> required for the interface specific grub.cfg is available before the
>> machine ever powers on, and could be made available on the rack
>> controllers at that time too.
>>
>> Doing so would eliminate that race condition, or at least reduce the
>> opportunity greatly, as we see MAAS has no problems immediately
>> responding and serving files that it doesn't need to dynamically
>> generate at request time.
>>
>> There is still some question around what in the environment is
>> contributing to MAAS not responding faster, and what MAAS is doing while
>> it takes 60+ seconds to respond to the request, but that doesn't change
>> the fact that the current MAAS design is racy (and that's a bug).
>>
>> Whatever we change in the environment to reduce the likelihood of
>> hitting this issue there doesn't solve the underlying race condition in
>> MAAS, and leaves open the possibility of hitting the issue other places
>> too.
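For readers following the lookup order quoted above, here is a minimal Python sketch that rebuilds the list of pxelinux.cfg paths from the thread. The function name `pxelinux_config_candidates` is hypothetical, and the IP address 192.168.2.91 is only what the hex name C0A8025B decodes to; neither comes from PXELINUX or MAAS code. It illustrates the fallback sequence only, not how any bootloader is implemented.

```python
import ipaddress


def pxelinux_config_candidates(uuid, mac, ip, boot_dir="/mybootdir/pxelinux.cfg"):
    """Return config paths in the order the thread describes them being tried.

    Hypothetical helper for illustration; not part of PXELINUX or MAAS.
    """
    names = [uuid]  # 1. client UUID
    # 2. "01-" (Ethernet hardware type) plus the MAC, lowercase, dash-separated
    names.append("01-" + mac.lower().replace(":", "-"))
    # 3. the IPv4 address as 8 uppercase hex digits, then
    # 4. the same string with one trailing hex digit removed each time
    hex_ip = "%08X" % int(ipaddress.IPv4Address(ip))
    names.extend(hex_ip[:length] for length in range(8, 0, -1))
    names.append("default")  # last resort
    return [boot_dir + "/" + name for name in names]


if __name__ == "__main__":
    for path in pxelinux_config_candidates(
        "b8945908-d6a6-41a9-611d-74a6ab80b83d",
        "88:99:AA:BB:CC:DD",
        "192.168.2.91",
    ):
        print(path)
```

Each entry is only requested if the previous one went unanswered, which is why a slow answer for an early entry looks to the client the same as "file not present".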
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
> > That being said, because CPU load doesn't show high we are making the
> > *assumption* that it is not impacting MAAS, but again, this is an
> > assumption. Making the requested change for having at least 4 CPUs (ideally
> > 6) would allow us to determining what are the effects and see whether
> > there's any difference on behavior and would help identify what other
> > issues.
> >
> > Without having the comparison then we are making it more difficult to
> > isolate the problem.
>
> To improve performance the typical pattern is 1) identify the
> bottleneck 2) eliminate that as the bottleneck 3) repeat.
>
> We have not identified CPU as a bottleneck. The top data says it is
> not!

Jason,

That doesn't change the fact that we are requesting tests to be run with a different CPU configuration for the VMs, so we can make a *comparison* and see if there is any material difference, or none at all, with the current conditions. While I agree with you that the data /seems/ to show that there is no issue with CPU, that doesn't change the fact that we don't have any data to compare with, as there could still be an impact even if it is minimal.

Without the data, we cannot assert with certainty that there's no issue caused by CPU usage, because we don't have a reference or point of comparison. So while all fingers seem to be pointing to storage, I strongly believe it is worth gathering the data now so we can fully discard CPU as a factor.

If this is something that your environment is unable to do, I would appreciate that you clarify that instead of asserting that there's no performance impact in MAAS due to CPU usage, when we don't really know for sure (e.g. we don't know if MAAS behaves differently with less CPU usage in the current conditions, and that's data worth gathering to be able to better support you in the future).

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
+1 Mike. I agree it's a bug, but there isn't real evidence that it's what causes the long delay.

On Mon, Feb 5, 2018 at 7:12 PM, Mike Pontillo wrote:
> Ah, I see what you mean there; I used the following filter in Wireshark:
>
> udp.dstport == 25305 or udp.srcport == 25305
>
> This is not the behavior I saw if the TFTP request is answered in a
> timely manner, so I suspect that the long delay between the initial
> request and the answer is causing the timeouts to occur in the TFTP
> code, which causes this separate "stacked OACK" bug.
>
> By the time the "stacked OACKs" are sent, it's been over one minute, and
> the client isn't listening for a reply any more. So yes, "stacked OACKs"
> are a real bug, but right now I think they're just distracting from the
> root cause (the long delay).
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Ah, I see what you mean there; I used the following filter in Wireshark:

udp.dstport == 25305 or udp.srcport == 25305

This is not the behavior I saw if the TFTP request is answered in a timely manner, so I suspect that the long delay between the initial request and the answer is causing the timeouts to occur in the TFTP code, which causes this separate "stacked OACK" bug.

By the time the "stacked OACKs" are sent, it's been over one minute, and the client isn't listening for a reply any more. So yes, "stacked OACKs" are a real bug, but right now I think they're just distracting from the root cause (the long delay).
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
On Tue, Feb 06, 2018 at 12:11:21AM -, Mike Pontillo wrote:
> Steve, can you be more specific about which packet capture showed the
> "stacked OACK" behavior?

This was the first packet capture that Jason posted, in comment #30. The udp retransmits shown in packets 6262-6268 each receive an answering packet in 6270-6271,6273-6277, in addition to 6269 as an answer to 6261.

For whatever reason, wireshark here does not decipher these duplicate OACK packets as OACK, but an examination of the raw packets shows that's clearly what they are.

> I looked at a packet capture Andres pointed me to, and don't see the
> "stacked OACKs" you describe. Each TFTP transaction (per RFC 1350) is
> indicated by the (source port, dest port) tuple, and I see that MAAS
> correctly OACKs each individual transaction (per RFC 2347) - not the
> retry packets within the same transaction.

Packets 6269-6271,6273-6277 are all answers to the same port on the client. They don't have the same source port, because MAAS has allocated a separate source port for each of these. It's not acking a separate individual transaction, it's MAAS /creating/ a separate transaction (with the allocation of a separate source port) for each one.

RFC2347 does not speak to this; the discussion of the port negotiation is in RFC1350 §4:

    In order to create a connection, each end of the connection chooses a
    TID for itself, to be used for the duration of that connection. The
    TID's chosen for a connection should be randomly chosen, so that the
    probability that the same number is chosen twice in immediate
    succession is very low. Every packet has associated with it the two
    TID's of the ends of the connection, the source TID and the
    destination TID. These TID's are handed to the supporting UDP (or
    other datagram protocol) as the source and destination ports. A
    requesting host chooses its source TID as described above, and sends
    its initial request to the known TID 69 decimal (105 octal) on the
    serving host. The response to the request, under normal operation,
    uses a TID chosen by the server as its source TID and the TID chosen
    for the previous message by the requestor as its destination TID. The
    two chosen TID's are then used for the remainder of the transfer.

MAAS responds to 8 udp retransmits on srcport=25305, dstport=69 by sending 8 independent OACK packets back to dstport=25305, each from a different source port.

Since Andres confirms that these duplicate acks still only result in one database query, this may be a negligible bug if the only impact is duplicate small udp packets. OTOH, depending on how MAAS implements this, it could also result in port exhaustion on the server if unanswered OACKs are allowed to linger.
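The "one client TID answered from many server TIDs" pattern described above can be checked mechanically once the UDP port pairs have been exported from a capture. The following Python sketch is only an illustration under stated assumptions: the function name `server_tids_per_client_tid` is hypothetical, the input is assumed to be (source port, destination port) pairs for a single client/server host pair, and the server port numbers in the usage lines are invented, since the thread does not list them.

```python
from collections import defaultdict

TFTP_PORT = 69  # well-known TID for initial requests (RFC 1350)


def server_tids_per_client_tid(packets):
    """Map each client TID to the set of server source ports that answered it.

    `packets` is an iterable of (src_port, dst_port) tuples in capture order,
    assumed to cover one client host and one server host only.
    """
    client_tids = set()
    replies = defaultdict(set)
    for src_port, dst_port in packets:
        if dst_port == TFTP_PORT:
            # Initial request or a retransmit of it: src_port is the client TID.
            client_tids.add(src_port)
        elif dst_port in client_tids:
            # Packet addressed to a known client TID: src_port is a server TID.
            replies[dst_port].add(src_port)
    return dict(replies)


if __name__ == "__main__":
    # Eight retransmits from client port 25305, each answered from a
    # different (made-up) server port, as in the behavior described above.
    capture = [(25305, 69)] * 8 + [(port, 25305) for port in range(40000, 40008)]
    for client_tid, server_tids in server_tids_per_client_tid(capture).items():
        print(client_tid, "answered from", len(server_tids), "server ports")
```

Under RFC 1350 §4 one would expect a single server TID per client TID for the life of a transfer, so any client port mapped to more than one server port is a candidate for the stacked-response behavior.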
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
On Mon, Feb 5, 2018 at 3:45 PM, Andres Rodriguez wrote:
> @Jason,
>
> On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs wrote:
>
>> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez wrote:
>> > No new data was provided to mark this New in MAAS:
>> >
>> > 1. Changes to the storage seem to have improved things
>>
>> Yes, it has. That doesn't change whether or not there is a bug in
>> MAAS. Can you please address the critical log errors that I mentioned
>> in comment #36? This seems like enough to establish something is
>> going wrong in MAAS.
>
> The tftp issue shows no evidence this is causing any booting failures. We
> have seen this issue before and confirmed that it doesn't cause boot
> issues. See [1]. If you want to try it, it is available in
> ppa:maas/proposed.
>
> [1].https://bugs.launchpad.net/maas/+bug/1376483

It's been "Fixed" multiple times before, in your link above and also in bug 1724677, but we still see them, very suspiciously around the time of failures. I'm not convinced these are actually understood. Do you have a specific commit or some idea of what change addresses these in 2.4?

> As far as the postgresql logs with "maas@maasdb ERROR: could not serialize
> access due to concurrent update" that is *not* a bug in MAAS or an issue.
> That's perfectly normal messages with the isolation level the MAAS DB is
> running with. This basically means something else is trying to update the
> db while something else is updating it, and MAAS already handles this by
> doing retries.

That is just one type of db error in the log. There are many more. Here's one that says there was a deadlock detected. That's not normal, OK behavior, is it?

http://paste.ubuntu.com/26527181/

>> > 2. No tests have been run with fixed grub that have caused boot failures.
>>
>> The comments from #56 were testing with the fixed grub - sorry if that
>> wasn't clear.
>>
>> > 3. AFAIK, the VM config has not changed to use less CPU to compare
>> > results and whether this config change causes the bugs in question.
>>
>> The CPU load data from comments #48 and #50 shows that CPU load is not
>> the problem. The max load average was under 12 on a 20 thread system.
>> That means there was lots of free CPU time, and that this workload is
>> not CPU bound.
>
> CPU load is not CPU utilization. We know that at the time there's 6 other
> VM's with 150%+ CPU usage are writing to the disk because they are being
> deployed and/or configured (e.g. software installation). Correct me if
> wrong, but this can cause the prioritization of whatever is writing to disk
> over anything else, like the MAAS processes access for resources.

The 150%+ number you are seeing is that process using all of 1 core (hyperthread) and 50% of another (it's a multithreaded process). This does not mean the process is using 150% of the entire CPU capacity.

We don't just have load average - we also have a breakdown of CPU utilization from top, every 5 seconds:

%Cpu(s): 23.5 us, 6.3 sy, 0.0 ni, 62.9 id, 7.2 wa, 0.0 hi, 0.1 si, 0.0 st

The top man page has more to say about this line, but have a look at the 'id' number. It's the % of cpu time spent in the idle process (nothing to do) in the sample period (5 seconds in the above logs). The lowest that number ever goes in the logs I posted is 52%, meaning over any 5 second period, we never use more than half of the available CPU capacity.

> That being said, because CPU load doesn't show high we are making the
> *assumption* that it is not impacting MAAS, but again, this is an
> assumption. Making the requested change for having at least 4 CPUs (ideally
> 6) would allow us to determining what are the effects and see whether
> there's any difference on behavior and would help identify what other
> issues.
>
> Without having the comparison then we are making it more difficult to
> isolate the problem.

To improve performance the typical pattern is 1) identify the bottleneck 2) eliminate that as the bottleneck 3) repeat.

We have not identified CPU as a bottleneck. The top data says it is not!

In the absence of data showing the CPU as being the bottleneck, reducing CPU usage doesn't help identify the performance blocker, because it may just move the bottleneck. For example, it may cause the processes that are doing disk I/O to not get scheduled to run as much, which may then reduce the amount of disk I/O they can do, which may alleviate the issue, but not because MAAS was CPU starved before and now isn't.

Better to reduce the storage contention in the first place, if the data shows that storage contention is the bottleneck. In this case we had data from iotop that indicated storage contention as the bottleneck, and reducing it seems to have alleviated the problem, as we haven't hit the failure since then. We're going to take more steps to alleviate storage contention even more soon, by making sure
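The arithmetic behind the CPU argument in this message can be reproduced from the quoted top line. The Python sketch below is an illustration only: the helper name `parse_top_cpu_line` is hypothetical, the sample line is the one quoted in the message, and the 20-thread figure is taken from the "20 thread system" mentioned there.

```python
import re


def parse_top_cpu_line(line):
    """Extract the per-state percentages from a top '%Cpu(s):' summary line."""
    fields = {}
    for value, key in re.findall(r"([\d.]+)\s+(us|sy|ni|id|wa|hi|si|st)\b", line):
        fields[key] = float(value)
    return fields


if __name__ == "__main__":
    sample = "%Cpu(s): 23.5 us, 6.3 sy, 0.0 ni, 62.9 id, 7.2 wa, 0.0 hi, 0.1 si, 0.0 st"
    fields = parse_top_cpu_line(sample)

    # Everything that is not idle, including time spent waiting on I/O ('wa').
    non_idle = 100.0 - fields["id"]
    print("non-idle share of total CPU capacity: %.1f%%" % non_idle)

    # A process shown at 150% uses 1.5 hardware threads; on a 20-thread
    # machine that is 150 / 20 percent of the total capacity.
    threads = 20
    print("150%% of one thread is %.1f%% of the machine" % (150.0 / threads))
```

Note that the 'wa' column is reported separately from 'id', so a host can be mostly idle on CPU while still spending noticeable time waiting on storage, which is the distinction the message draws.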
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
@Mike, you can see the stacked response behavior in

https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5046952/+files/spearow-fall-back-to-default-amd64.pcap

You can tell packet 90573 is a response to the requests for grub.cfg- because its destination port (25305) is the src port the request for grub.cfg- was coming from (packets 2 through 38).

On Mon, Feb 5, 2018 at 6:11 PM, Mike Pontillo wrote:
> Steve, can you be more specific about which packet capture showed the
> "stacked OACK" behavior?
>
> I looked at a packet capture Andres pointed me to, and don't see the
> "stacked OACKs" you describe. Each TFTP transaction (per RFC 1350) is
> indicated by the (source port, dest port) tuple, and I see that MAAS
> correctly OACKs each individual transaction (per RFC 2347) - not the
> retry packets within the same transaction. Subsequently (in the same
> second, after the client ACKs the data packet) it re-requests the same
> file (which is the bug in grub that I understand is fixed), and then the
> client starts a new transaction and MAAS correctly issues another OACK.
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Steve, can you be more specific about which packet capture showed the "stacked OACK" behavior?

I looked at a packet capture Andres pointed me to, and don't see the "stacked OACKs" you describe. Each TFTP transaction (per RFC 1350) is indicated by the (source port, dest port) tuple, and I see that MAAS correctly OACKs each individual transaction (per RFC 2347) - not the retry packets within the same transaction. Subsequently (in the same second, after the client ACKs the data packet) it re-requests the same file (which is the bug in grub that I understand is fixed), and then the client starts a new transaction and MAAS correctly issues another OACK.
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
On Mon, Feb 05, 2018 at 09:27:15PM -, Andres Rodriguez wrote:
> MAAS already has a mechanism to collapse retries into the initial request.

Are we certain that this is working correctly? If so, why are packet captures showing that MAAS is sending stacked tftp OACK responses, 1:1 for the duplicate incoming requests?

It's clear to me that MAAS's handling at the wire level is incorrect - 10 retries of the same tftp request should result in a single OACK, not 10 of them (unless MAAS receives a retry *after* it has sent its OACK). I don't know if that also means that MAAS is inefficiently translating these into database requests on the backend. It had been suggested in this bug log and on IRC that MAAS *was* sending duplicate db requests for each of these packets; OTOH the timing of the stacked responses shows no latency in between them that would imply additional db round-trips. I think someone needs to directly inspect the behavior of a running MAAS server in this scenario to be sure.

> In this case, it is the rack that grabs the requests and makes a request to
> the region. If retries come within the time that the rack is waiting for a
> response from the region, these request get "ignored" and the Rack will
> only answer the first request.

That is absolutely contradicted by the packet captures. The rack does not ignore the additional requests, it answers *ALL* of the requests. It's only the *client* that consolidates the duplicate responses from MAAS. (And then, because of a grub bug higher up the stack, re-requests the same file that it has already received.)
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
@Jason,

The pcap shows exactly the behavior I was hoping to see: grub tries to get X config first and, since it doesn't get a response, it moves on and tries to get Y config.

On Mon, Feb 5, 2018 at 4:45 PM, Jason Hobbs wrote:
> On Mon, Feb 5, 2018 at 3:27 PM, Andres Rodriguez wrote:
> > @Steve,
> >
> > MAAS already has a mechanism to collapse retries into the initial request.
> > In this case, it is the rack that grabs the requests and makes a request to
> > the region. If retries come within the time that the rack is waiting for a
> > response from the region, these requests get "ignored" and the Rack will
> > only answer the first request. This is what the logs show after testing
> > with fixed grub, where grub makes multiple requests and MAAS answers
> > seconds after those requests, but only answers once. This is because the
> > requests were collapsed on the maas side.
> >
> > If, however, the retries come in after the region has answered the rack,
> > then these requests will be served.
>
> This is not true. MAAS is responding to every single request grub
> makes for the file - the tcpdump logs show it. And these are not
> "read 4 times" requests - they are retries because grub didn't get a
> response.
>
> This pcap shows MAAS responding to every request for grub.cfg-:
> https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5046952/+files/spearow-fall-back-to-default-amd64.pcap
>
> Jason
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
I think there's a misunderstanding about how the network boot process happens. Let's look at pxelinux first. pxelinux does this:

1. Tries the UUID first; if there is no answer, it moves on.
2. Tries the MAC address; if there is no answer, it moves on.
3. Tries the full IP address; if there is no answer, it moves on.
4. Tries a partial IP address; if there is no answer, it moves on.
5. Repeats step 4 with a shorter prefix.
6. Repeats step 4 again [...]
7. Boots default.

This can be seen here:

/mybootdir/pxelinux.cfg/b8945908-d6a6-41a9-611d-74a6ab80b83d
/mybootdir/pxelinux.cfg/01-88-99-aa-bb-cc-dd
/mybootdir/pxelinux.cfg/C0A8025B
/mybootdir/pxelinux.cfg/C0A8025
/mybootdir/pxelinux.cfg/C0A802
/mybootdir/pxelinux.cfg/C0A80
/mybootdir/pxelinux.cfg/C0A8
/mybootdir/pxelinux.cfg/C0A
/mybootdir/pxelinux.cfg/C0
/mybootdir/pxelinux.cfg/C
/mybootdir/pxelinux.cfg/default

That said, in the case of grub, the behavior is similar. You have described this behavior in comment #16. So what is happening is:

1. grub tries the MAC-specific grub.cfg- file multiple times, but since it doesn't get a response, it gives up.
2. Once it gives up, grub.cfg-default-amd64 is tried instead.

That said, the two requests are handled completely differently. The MAC-specific request actually accesses the *node* object in the database, looking it up by the MAC address the request came from. With this node object, we generate the config file. In comparison, the -default-amd64 request does *not* access the node object. It just accesses two config settings, and the db query is *much* cheaper.

Also, we have to keep in mind that after grub has done many retries, this second request returns rather fast in comparison, not only because it is cheaper, but because at that point MAAS may have far fewer queued DB requests. Either way, grub giving up means that it won't wait for a response to the initial request, but it will expect a response for the new file it asked for.

That said, this is working *exactly* as expected, because this effectively tells grub "if the config for your MAC address was not returned, you can safely assume you are an unknown machine to MAAS", hence grub requests a different config file to start the enlistment process.

So this is *not* a race condition in MAAS. This is working as designed and is expected. The problem here is that MAAS takes too long to answer the initial request, which causes grub to time out and move on to request a different config file.

On Mon, Feb 5, 2018 at 4:30 PM, Jason Hobbs wrote:
> The packetdump (comment #35) of MAAS not responding to grub's request
> for the mac specific grub.cfg before grub times out, and then responding
> immediately to the generic-amd64 grub cfg, clearly shows a race
> condition in MAAS.
>
> MAAS's design of dynamically generating the interface specific grub
> config only after it receives the tftp request for it is susceptible to
> a race condition where grub times out before MAAS can respond.
>
> That design is not the only possible design. All the information
> required for the interface specific grub.cfg is available before the
> machine ever powers on, and could be made available on the rack
> controllers at that time too.
>
> Doing so would eliminate that race condition, or at least reduce the
> opportunity greatly, as we see MAAS has no problems immediately
> responding and serving files that it doesn't need to dynamically
> generate at request time.
>
> There is still some question around what in the environment is
> contributing to MAAS not responding faster, and what MAAS is doing while
> it takes 60+ seconds to respond to the request, but that doesn't change
> the fact that the current MAAS design is racy (and that's a bug).
>
> Whatever we change in the environment to reduce the likelihood of
> hitting this issue there doesn't solve the underlying race condition in
> MAAS, and leaves open the possibility of hitting the issue other places
> too.

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer
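The pxelinux lookup order described in this message can be reproduced with a short sketch. The function name and the `bootdir` default are only illustrative; the UUID, MAC and IP values in the usage note are the ones behind the example listing (192.168.2.91 is C0A8025B in hex, and the `01-` prefix is the ARP hardware type for Ethernet).

```python
import ipaddress


def pxelinux_candidates(uuid, mac, ip, bootdir="/mybootdir/pxelinux.cfg"):
    """Return the config paths pxelinux would try, in order."""
    names = [uuid.lower()]
    # MAC address, dash-separated, prefixed with the ARP hardware type (01 = Ethernet).
    names.append("01-" + mac.lower().replace(":", "-"))
    # IPv4 address as 8 upper-case hex digits, then one digit dropped per attempt.
    ip_hex = f"{int(ipaddress.IPv4Address(ip)):08X}"
    names.extend(ip_hex[:n] for n in range(8, 0, -1))
    names.append("default")
    return [f"{bootdir}/{name}" for name in names]
```

Calling `pxelinux_candidates("b8945908-d6a6-41a9-611d-74a6ab80b83d", "88:99:aa:bb:cc:dd", "192.168.2.91")` yields the same eleven paths as the listing, from the UUID file down to `default`.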
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
On Mon, Feb 5, 2018 at 3:27 PM, Andres Rodriguez wrote:
> @Jason,
>
> On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs wrote:
>> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez wrote:
>> > No new data was provided to mark this New in MAAS:
>> >
>> > 1. Changes to the storage seem to have improved things
>>
>> Yes, it has. That doesn't change whether or not there is a bug in
>> MAAS. Can you please address the critical log errors that I mentioned
>> in comment #36? This seems like enough to establish something is
>> going wrong in MAAS.
>
> The bugs you have raised in #36 have already been fixed.

Where?
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
On Mon, Feb 05, 2018 at 08:40:56PM -, Jason Hobbs wrote:
> @Steve - I don't think it helps with the problem of MAAS taking a long
> time to respond to the grub.cfg request. However, it may help with the
> part of this bug where grub is hitting an error and asking for keyboard
> input. https://imgur.com/a/as8Sx

> Maybe that should be a separate bug? It seems like grub should never
> ask for user keyboard input on a server.

Perhaps that bug is fixed as a side effect of the grub change. But what do you think the correct behavior should be when grub cannot find the file that it needs in order to boot? Should grub enter a boot loop, retrying endlessly? Should it try to halt the system? Why is either of these options more correct than putting the machine to a console prompt?
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
@Jason,

On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs wrote:
> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez wrote:
> > No new data was provided to mark this New in MAAS:
> >
> > 1. Changes to the storage seem to have improved things
>
> Yes, it has. That doesn't change whether or not there is a bug in
> MAAS. Can you please address the critical log errors that I mentioned
> in comment #36? This seems like enough to establish something is
> going wrong in MAAS.

There is no evidence that the tftp issue is causing any boot failures. We have seen this issue before and confirmed that it doesn't cause boot issues. See [1]. If you want to try it, it is available in ppa:maas/proposed.

[1] https://bugs.launchpad.net/maas/+bug/1376483

As for the postgresql logs with "maas@maasdb ERROR: could not serialize access due to concurrent update", that is *not* a bug in MAAS or an issue. Those are perfectly normal messages for the isolation level the MAAS DB is running with. It basically means one transaction tried to update the db while another was updating it, and MAAS already handles this by retrying.

> > 2. No tests have been run with fixed grub that have caused boot
> > failures.
>
> The comments from #56 were testing with the fixed grub - sorry if that
> wasn't clear.
>
> > 3. AFAIK, the VM config has not changed to use less CPU to compare
> > results and whether this config change causes the bugs in question.
>
> The CPU load data from comments #48 and #50 shows that CPU load is not
> the problem. The max load average was under 12 on a 20 thread system.
> That means there was lots of free CPU time, and that this workload is
> not CPU bound.

CPU load is not CPU utilization. We know that at the time there were 6 other VMs with 150%+ CPU usage writing to the disk because they were being deployed and/or configured (e.g. software installation). Correct me if I'm wrong, but this can cause whatever is writing to disk to be prioritized over anything else, such as the MAAS processes' access to resources.

That being said, because CPU load doesn't show as high we are making the *assumption* that it is not impacting MAAS, but again, this is an assumption.

Making the requested change to have at least 4 CPUs (ideally 6) would allow us to determine what the effects are, see whether there's any difference in behavior, and help identify any other issues. Without that comparison we are making it more difficult to isolate the problem.

> Jason
>
> ** Changed in: maas
>    Status: Incomplete => New

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer
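To illustrate the "handles this by retrying" point about serialization errors, here is a minimal sketch of the general retry pattern. It is not MAAS's code: `SerializationFailure` is a hypothetical stand-in for whatever error the database driver raises, and `run_with_retries` and its parameters are made up for the example.

```python
import random
import time


class SerializationFailure(Exception):
    """Hypothetical stand-in for the driver's 'could not serialize access' error."""


def run_with_retries(txn, attempts=5, base_delay=0.01, sleep=time.sleep):
    """Run txn(), retrying with a small randomized delay on serialization failures."""
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == attempts:
                # Out of attempts: let the caller see the failure.
                raise
            # Back off a little so the competing transaction can finish first.
            sleep(base_delay * attempt * random.random())
```

With this pattern the error still shows up in the database log each time a transaction loses the conflict, even though the caller eventually gets a result, which is consistent with the log lines being noisy but harmless.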
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
On Mon, Feb 5, 2018 at 3:27 PM, Andres Rodriguez wrote:
> @Steve,
>
> MAAS already has a mechanism to collapse retries into the initial request.
> In this case, it is the rack that grabs the requests and makes a request to
> the region. If retries come within the time that the rack is waiting for a
> response from the region, these requests get "ignored" and the Rack will
> only answer the first request. This is what the logs show after testing
> with fixed grub, where grub makes multiple requests and MAAS answers
> seconds after those requests, but only answers once. This is because the
> requests were collapsed on the maas side.
>
> If, however, the retries come in after the region has answered the rack,
> then these requests will be served.

This is not true. MAAS is responding to every single request grub makes for the file - the tcpdump logs show it. And these are not "read 4 times" requests - they are retries because grub didn't get a response.

This pcap shows MAAS responding to every request for grub.cfg-:
https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5046952/+files/spearow-fall-back-to-default-amd64.pcap

Jason

> Status in MAAS:
>   New
> Status in grub2 package in Ubuntu:
>   In Progress
>
> Bug description:
>   A node failed to deploy after it failed to retrieve a grub.cfg from
>   MAAS due to a timeout. In the logs, it's clear that the server tried
>   to retrieve the grub cfg many times, over about 30 seconds:
>
>   http://paste.ubuntu.com/26387256/
>
>   We see the same thing for other hosts around the same time:
>
>   http://paste.ubuntu.com/26387262/
>
>   It seems like MAAS is taking way too long to respond to these
>   requests.
>
>   This is very similar to bug 1724677, which was happening pre-
>   meltdown/spectre. The only difference is we don't see "[critical] TFTP
>   back-end failed" in the logs anymore.
>
>   I connected to the console on this system and it had errors about
>   timing out retrieving the grub-cfg, then it had an error message along
>   the lines of "error not an ip" and then "double free". After I
>   connected but before I could get a screenshot the system rebooted and
>   was directed by maas to power off, which it did successfully after
>   booting to linux.
>
>   Full logs are available here:
>   https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
>
>   This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
@Jason,

On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs wrote:
> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez wrote:
> > No new data was provided to mark this New in MAAS:
> >
> > 1. Changes to the storage seem to have improved things
>
> Yes, it has. That doesn't change whether or not there is a bug in
> MAAS. Can you please address the critical log errors that I mentioned
> in comment #36? This seems like enough to establish something is
> going wrong in MAAS.

The bugs you have raised in #36 have already been fixed.

> > 2. No tests have been run with fixed grub that have caused boot
> > failures.
>
> The comments from #56 were testing with the fixed grub - sorry if that
> wasn't clear.
>
> > 3. AFAIK, the VM config has not changed to use less CPU to compare
> > results and whether this config change causes the bugs in question.
>
> The CPU load data from comments #48 and #50 shows that CPU load is not
> the problem. The max load average was under 12 on a 20 thread system.
> That means there was lots of free CPU time, and that this workload is
> not CPU bound.
>
> Jason
>
> ** Changed in: maas
>    Status: Incomplete => New

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
The packetdump (comment #35) of MAAS not responding to grub's request for the mac specific grub.cfg before grub times out, and then responding immediately to the generic-amd64 grub cfg, clearly shows a race condition in MAAS.

MAAS's design of dynamically generating the interface specific grub config only after it receives the tftp request for it is susceptible to a race condition where grub times out before MAAS can respond.

That design is not the only possible design. All the information required for the interface specific grub.cfg is available before the machine ever powers on, and could be made available on the rack controllers at that time too.

Doing so would eliminate that race condition, or at least reduce the opportunity greatly, as we see MAAS has no problems immediately responding and serving files that it doesn't need to dynamically generate at request time.

There is still some question around what in the environment is contributing to MAAS not responding faster, and what MAAS is doing while it takes 60+ seconds to respond to the request, but that doesn't change the fact that the current MAAS design is racy (and that's a bug).

Whatever we change in the environment to reduce the likelihood of hitting this issue there doesn't solve the underlying race condition in MAAS, and leaves open the possibility of hitting the issue other places too.
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
@Steve,

MAAS already has a mechanism to collapse retries into the initial request. In this case, it is the rack that grabs the requests and makes a request to the region. If retries come within the time that the rack is waiting for a response from the region, these requests get "ignored" and the rack will only answer the first request. This is what the logs show after testing with fixed grub, where grub makes multiple requests and MAAS answers seconds after those requests, but only answers once. This is because the requests were collapsed on the MAAS side.

If, however, the retries come in after the region has answered the rack, then these requests will be served.

On Mon, Feb 5, 2018 at 2:34 PM, Steve Langasek wrote:
> Jason's feedback was that, after making the changes to the storage
> configuration of his environment, deploying the test grubx64.efi doesn't
> have any effect on the MAAS server's response time to tftp requests. So
> at this point it's not at all clear that the grub change, while correct,
> helps with this high-level symptom.
>
> It has also been suggested that each udp retry is generating a separate
> database query from MAAS. That is absolutely a MAAS bug if true, and
> not something that can or should be fixed in GRUB.
>
> ** Changed in: grub2 (Ubuntu)
>    Importance: Critical => Medium

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer
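The collapsing behavior described in this message can be sketched roughly as follows. This is an illustration of the general technique only, not the MAAS rack implementation: `Collapser`, `slow_region_lookup` and the `grub.cfg-example` key are hypothetical names, and the region lookup is simulated with a short sleep.

```python
import asyncio


class Collapser:
    """Share one in-flight backend call among identical concurrent requests."""

    def __init__(self, fetch):
        self._fetch = fetch      # coroutine function: key -> result
        self._inflight = {}      # key -> task for the lookup in progress

    async def get(self, key):
        task = self._inflight.get(key)
        if task is None:
            # First request for this key: start the backend lookup.
            task = asyncio.create_task(self._fetch(key))
            self._inflight[key] = task
            # Forget the task once it is answered; later requests start anew.
            task.add_done_callback(lambda _t: self._inflight.pop(key, None))
        # Retries arriving while the lookup is pending wait on the same task.
        return await task


async def demo():
    calls = []

    async def slow_region_lookup(key):
        calls.append(key)
        await asyncio.sleep(0.05)   # stands in for the rack-to-region round trip
        return f"config for {key}"

    collapser = Collapser(slow_region_lookup)
    # Three retries while the first lookup is still pending: one backend call.
    early = await asyncio.gather(*(collapser.get("grub.cfg-example") for _ in range(3)))
    # A retry after the answer has come back is served as a new request.
    late = await collapser.get("grub.cfg-example")
    return calls, early, late
```

Running `asyncio.run(demo())` shows the two cases from the message: retries that arrive while the lookup is pending share a single backend call, while a retry that arrives after the answer triggers a second one.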
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
@Steve - I don't think it helps with the problem of MAAS taking a long time to respond to the grub.cfg request. However, it may help with the part of this bug where grub is hitting an error and asking for keyboard input. https://imgur.com/a/as8Sx

Maybe that should be a separate bug? It seems like grub should never ask for user keyboard input on a server.
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez wrote:
> No new data was provided to mark this New in MAAS:
>
> 1. Changes to the storage seem to have improved things

Yes, it has. That doesn't change whether or not there is a bug in MAAS. Can you please address the critical log errors that I mentioned in comment #36? This seems like enough to establish something is going wrong in MAAS.

> 2. No tests have been run with fixed grub that have caused boot failures.

The comments from #56 were testing with the fixed grub - sorry if that wasn't clear.

> 3. AFAIK, the VM config has not changed to use less CPU to compare results and whether this config change causes the bugs in question.

The CPU load data from comments #48 and #50 shows that CPU load is not the problem. The max load average was under 12 on a 20 thread system. That means there was lots of free CPU time, and that this workload is not CPU bound.

Jason

** Changed in: maas
   Status: Incomplete => New
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Jason's feedback was that, after making the changes to the storage configuration of his environment, deploying the test grubx64.efi doesn't have any effect on the MAAS server's response time to tftp requests. So at this point it's not at all clear that the grub change, while correct, helps with this high-level symptom.

It has also been suggested that each udp retry is generating a separate database query from MAAS. That is absolutely a MAAS bug if true, and not something that can or should be fixed in GRUB.

** Changed in: grub2 (Ubuntu)
   Importance: Critical => Medium
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
** Changed in: grub2 (Ubuntu)
       Status: Triaged => In Progress

** Changed in: grub2 (Ubuntu)
   Importance: Undecided => Critical
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
No new data was provided to mark this New in MAAS:

1. Changes to the storage seem to have improved things.
2. No tests have been run with fixed grub that have caused boot failures.
3. AFAIK, the VM config has not changed to use less CPU to compare results and whether this config change causes the bugs in question.

As such, marking this Incomplete until we can verify whether 1, 2 and 3 make any difference, or whether we continue to see the issues with those changes in place.

** Changed in: maas
       Status: New => Incomplete
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
** Changed in: maas
       Status: Incomplete => New
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Here is part of a packet capture on my environment: http://paste.ubuntu.com/26509374/

From the other tftp server on the deploy: http://paste.ubuntu.com/26509386/

The whole pcap is prohibitively large because it's for multiple hosts.

You can see from this that grub is only reading the file once now, so that grub bug has been fixed. You can also see that MAAS is still taking a while to respond to the request sometimes - 6 seconds in this capture.

There were no failures on this run, but we don't have failures every time, so that doesn't prove anything.
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
I've tested and I can confirm it made just 1 request instead of 4. I think now we need to test it in Jason's environment to see the differences.
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Note that the source file is grubnetx64.efi; it should be installed as grubx64.efi in the tftp server directory.
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
** Attachment added: "tcpdump.pcap"
   https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1743249/+attachment/5047711/+files/tcpdump.pcap
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Attached is an (unsigned) test grubnetx64.efi, built from xenial grub2 plus my patch. Please deploy this in the maas tftp environment where you are experiencing the timeouts, and give feedback on whether it helps with the primary symptom.

** Attachment added: "grubnetx64.efi"
   https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5047679/+files/grubnetx64.efi
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
** Tags added: patch
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Here is a possible fix for grub's repeated requests of the config file.

** Patch added: "bufio_sensible_block_sizes.patch"
   https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5047245/+files/bufio_sensible_block_sizes.patch

** Changed in: grub2 (Ubuntu)
       Status: New => Triaged

** Changed in: grub2 (Ubuntu)
     Assignee: (unassigned) => Mathieu Trudel-Lapierre (cyphermox)
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Here is the complete output of top from comment #48.

** Attachment added: "top.txt.gz"
   https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5047072/+files/top.txt.gz
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
I also collected iotop output from the same run: http://paste.ubuntu.com/26502363/

The storage setup on these nodes is writethrough bcache with a 400 GB NVMe in front of a 1 TB spinning disk. Since it's writethrough, writes have to make it to the spinning disk before being counted as sync'd. The write numbers look high for random i/o on a spinning disk.

It seems possible that the slow MAAS performance is due to postgresql waiting for writes to disk to complete, and MAAS threads blocking on that, so that servicing DB reads is blocked on the commits completing first.

The VMs running on the machine are using this same bcache setup for their storage pool. It looks like most of the disk write traffic is coming from the VMs.

Based on this data we'll make two changes to our setup which I think should help alleviate this problem:
- move the VMs' storage to a separate disk.
- change the storage setup to use writeback bcache.

** Attachment added: "iotop.txt.gz"
   https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5047065/+files/iotop.txt.gz
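For anyone wanting to confirm which mode a bcache device is in before and after a change like this: the kernel exposes it in sysfs as a space-separated list with the active mode in square brackets. Below is a minimal Python sketch of reading it. The device name bcache0 is an assumption, not something taken from this environment.

```python
from pathlib import Path


def active_cache_mode(sysfs_text):
    """Return the bracketed (active) mode from a bcache cache_mode string."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active mode found in %r" % sysfs_text)


def read_cache_mode(device="bcache0"):
    # "bcache0" is an assumed device name; adjust it to match the host.
    path = Path("/sys/block") / device / "bcache" / "cache_mode"
    return active_cache_mode(path.read_text())
```

Switching modes is done by writing the mode name (for example writeback) to the same sysfs file as root.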
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
I collected top output from a run (this run did not exhibit this failure): http://paste.ubuntu.com/26502311/

The highest the load average ever gets is 11.85, and it's usually around 3-4. This is a 20 thread system, so it doesn't look like CPU contention is the problem.
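The reasoning above amounts to dividing the load average by the number of hardware threads. A small Python sketch of that arithmetic follows, using the figures quoted in the message; the 3.5 value is just the midpoint of the "around 3-4" range and is an assumption.

```python
def load_per_thread(load_average, hardware_threads):
    """Load average normalised by thread count; below 1.0 means runnable tasks fit."""
    return load_average / hardware_threads


peak = load_per_thread(11.85, 20)     # highest load average seen in the run
typical = load_per_thread(3.5, 20)    # assumed midpoint of the usual 3-4 range
print("peak: %.2f, typical: %.2f" % (peak, typical))
```

On Linux the load average also counts tasks in uninterruptible sleep (typically waiting on disk), so a value under the thread count is, if anything, an upper bound on CPU demand.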
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
@Steve,

On Thu, Feb 1, 2018 at 1:49 PM, Steve Langasek wrote:
> On Thu, Feb 01, 2018 at 06:15:31PM -, Andres Rodriguez wrote:
> > @Jason,
> >
> > Packet 90573 doesn't seem to me as an indication of what you are describing. What I see is this:
> >
> > 1. grub makes ~30 requests for PXE config on grub.cfg-, after which it gives up because it didn't receive a response.
> > 2. grub moves on and requests grub.cfg-default-amd64, and it receives a response from MAAS.
> >
> > Now, the difference between the above, is that 1 does *database* lookups, while 2 does not. In other words, 1 causes a request to obtain the 'node' object based on the MAC to provide, and if grub is making 30+ requests, then this can definitely flood the db with requests.
>
> Then as I've said on IRC, this is a bug in maas, because 30 udp retries should not generate 30 requests to the database.
>
> GRUB is *not* wrong to retransmit its udp packets when it doesn't get a response. If each of these increases the load in MAAS, then MAAS should be fixed.
>
> The case where GRUB retrieves the same file multiple times is a GRUB bug, but I don't see any evidence linking this GRUB bug to the timeout and fallback problem in Jason's latest trace.

I agree with you if we are only considering this 1 system. Let's not forget that we have other systems booting at around the same time, each of which may be making at least 4 requests (for those grub systems) that may or may not be answered immediately after each request. But if requests are being served at the same time that more requests come in, I do see how making multiple requests can indeed be causing the degraded performance.

Especially now that we've learned that we have multiple VMs on the same host, all consuming 18 CPUs on a 20 CPU system, while MAAS alone runs 5 processes, for each of which we typically recommend a dedicated CPU.

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
@Jason,

Did you expand the "production environment" section?

                                      Memory (MB)  CPU (GHz)  Disk (GB)
Region controller (minus PostgreSQL)         2048        2.0          5
PostgreSQL                                   2048        2.0         20
Rack controller                              2048        2.0         20
Ubuntu Server (including logs)                512        0.5         20
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Oh I see what you mean, yeah ignore the GHz section, that's wrong.
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
FYI those minimum requirements don't mention anything about core/thread count.
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
@Jason,

I would give MAAS at least 6 CPUs:

2 for Region
2 for Postgres
2 for Rack

I would even recommend 4 for region instead of just 2, as MAAS runs 4 region processes. So that would be a total of 8.

[2]: https://docs.ubuntu.com/maas/2.3/en/#minimum-requirements
Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
On Thu, Feb 01, 2018 at 06:15:31PM -, Andres Rodriguez wrote:
> @Jason,
>
> Packet 90573 doesn't seem to me as an indication of what you are describing. What I see is this:
>
> 1. grub makes ~30 requests for PXE config on grub.cfg-, after which it gives up because it didn't receive a response.
> 2. grub moves on and requests grub.cfg-default-amd64, and it receives a response from MAAS.
>
> Now, the difference between the above, is that 1 does *database* lookups, while 2 does not. In other words, 1 causes a request to obtain the 'node' object based on the MAC to provide, and if grub is making 30+ requests, then this can definitely flood the db with requests.

Then as I've said on IRC, this is a bug in maas, because 30 udp retries should not generate 30 requests to the database.

GRUB is *not* wrong to retransmit its udp packets when it doesn't get a response. If each of these increases the load in MAAS, then MAAS should be fixed.

The case where GRUB retrieves the same file multiple times is a GRUB bug, but I don't see any evidence linking this GRUB bug to the timeout and fallback problem in Jason's latest trace.
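To illustrate what "30 udp retries should not generate 30 requests to the database" could look like on the server side, here is a minimal Python sketch of coalescing concurrent requests for the same key onto one in-flight lookup. This is not MAAS code: the class, the key layout and the slow_db_lookup stand-in are all hypothetical, and asyncio is used only to keep the example self-contained.

```python
import asyncio


class CoalescingLookup:
    """Share one in-flight lookup among concurrent requests with the same key."""

    def __init__(self, lookup):
        self._lookup = lookup      # slow coroutine, e.g. a database query
        self._in_flight = {}       # key -> asyncio.Task

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First request for this key: start the real lookup.
            task = asyncio.ensure_future(self._lookup(key))
            self._in_flight[key] = task
            # Forget the task once it finishes so later requests start fresh.
            task.add_done_callback(lambda _t, k=key: self._in_flight.pop(k, None))
        # Retries arriving meanwhile wait on the same task.
        return await task


async def demo(retries=30):
    calls = []

    async def slow_db_lookup(key):
        # Hypothetical stand-in for the per-node database query.
        calls.append(key)
        await asyncio.sleep(0.05)
        return "config for %s" % key[1]

    coalescer = CoalescingLookup(slow_db_lookup)
    key = ("client-a", "grub.cfg-example")   # made-up client id and file name
    results = await asyncio.gather(*(coalescer.get(key) for _ in range(retries)))
    return len(calls), results
```

Running asyncio.run(demo(30)) issues 30 concurrent requests for the same key, and only the first one reaches the lookup. Requests that arrive after the lookup has completed start a new one, so a short-lived cache of the result would be a separate design decision.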
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Andres,

You can tell packet 90573 is a response to the requests for grub.cfg- because its destination port (25305) is the src port the request for grub.cfg- was coming from (packets 2 through 38).

We're running another test now to collect load information.
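The matching rule described here (a TFTP reply is addressed to the UDP port the client sent its request from) can be sketched in Python as below. The packet numbers 2 through 38 and 90573 and port 25305 come from the message; the server-side ports and the second client flow are made up for illustration.

```python
from collections import namedtuple

Packet = namedtuple("Packet", "number src_port dst_port")

TFTP_PORT = 69  # well-known port that TFTP requests are sent to


def match_replies(packets):
    """Map each requesting client port to the packet numbers of replies sent to it."""
    client_ports = {p.src_port for p in packets if p.dst_port == TFTP_PORT}
    replies = {port: [] for port in client_ports}
    for p in packets:
        if p.dst_port in replies:
            replies[p.dst_port].append(p.number)
    return replies


# Retransmitted requests from client port 25305, as described above.
packets = [Packet(n, 25305, TFTP_PORT) for n in range(2, 39)]
# Hypothetical second flow, answered promptly from an assumed server port.
packets += [Packet(39, 25306, TFTP_PORT), Packet(40, 40001, 25306)]
# The late reply, addressed to the original requester's port.
packets.append(Packet(90573, 40002, 25305))

print(match_replies(packets))
```

The server's source port is deliberately ignored, since a TFTP server answers from a freshly chosen port rather than from port 69.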
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
@Jason,

Packet 90573 doesn't seem to me to be an indication of what you are describing. What I see is this:

1. grub makes ~30 requests for PXE config on grub.cfg-, after which it gives up because it didn't receive a response.
2. grub moves on and requests grub.cfg-default-amd64, and it receives a response from MAAS.

Now, the difference between the above is that 1 does *database* lookups, while 2 does not. In other words, 1 causes a request to obtain the 'node' object based on the MAC to provide, and if grub is making 30+ requests, then this can definitely flood the db with requests.

That said, based on my understanding of how your environment is configured, you have 3 other VMs in the system PXE booting from MAAS, plus other machines booting at the same time. Each VM is assigned 8 CPUs on a system that has 20 CPUs; in other words, the VMs alone overcommit the CPU. Combined with the other machines PXE booting off MAAS at the same time and the performance implications of the recent kernel, it does seem to me that all of these could be contending with MAAS for resources, when we already know postgresql is running with degraded performance due to the newer kernels.

That said, did you disable the spectre features and reboot your machine? Did you test this by NOT running VMs on the same system as MAAS, or at least reducing the number of cores each VM has access to (since there are 3 VMs with 8 cores each, that means 24 cores on a 20 core system)?

Also, do you have any CPU load data from the time of failure?

** Changed in: maas
       Status: New => Incomplete
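The overcommit claim is plain arithmetic, sketched here in Python. The VM and host figures are the ones quoted in the message; the reserved figure of 6 is the minimum CPU count suggested for MAAS elsewhere in this thread, and treating it as a hard reservation is an assumption.

```python
def overcommit_ratio(vm_count, vcpus_per_vm, host_cpus, reserved_cpus=0):
    """Ratio of CPUs asked for (VMs plus any reserved) to CPUs the host has."""
    return (vm_count * vcpus_per_vm + reserved_cpus) / host_cpus


vms_only = overcommit_ratio(3, 8, 20)                    # 24 vCPUs on 20 CPUs
with_maas = overcommit_ratio(3, 8, 20, reserved_cpus=6)  # plus 6 assumed for MAAS
print("VMs only: %.2f, with MAAS: %.2f" % (vms_only, with_maas))
```

A ratio above 1.0 only says that more CPUs are assigned than exist; whether that causes contention depends on how busy the VMs actually are.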
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
** Changed in: maas
       Status: Incomplete => New
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
In the pcap from comment #35, MAAS eventually does respond to the interface specific grub request, 61 seconds after the request, after it's already sent the grub.cfg-default-amd64, kernel, and initrd.

You can see the responses to the interface specific grub.cfg requests coming back starting at packet 90573.

While Steve's findings in #33/34 seem to indicate a grub bug, this seems like a MAAS problem occurring before that grub bug even has a chance to take effect.

I'm attaching MAAS logs from this same test run.

From the maas logs, the requests start at 01:02:49 in logs-2018-02-01-01.04.49/10.244.40.30/var/log/maas/rackd.log

There are some "critical" tftp errors logged in the same file not long afterwards: http://paste.ubuntu.com/26501394/

There are errors in postgresql's log around the same time too:
(logs-2018-02-01-01.04.49/10.244.40.30/var/log/postgresql$ vim postgresql-9.5-ha.log)
http://paste.ubuntu.com/26501399/

** Attachment added: "infra-logs.tar"
   https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5046970/+files/infra-logs.tar
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Attaching a pcap from a failure case. In this case, grub tried for 30 seconds to retrieve the interface specific grub.cfg, but never got a response from MAAS. It then gave up and got the amd64-default one instead, which caused the machine to try to enlist and then power off, leading to a failed deployment.

** Attachment added: "spearow-fall-back-to-default-amd64.pcap"
   https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5046952/+files/spearow-fall-back-to-default-amd64.pcap
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Regarding grub requesting the same file 4 times, a surprising finding: I'm able to reproduce this with files of a certain length. By chance my grub.cfg was 1 byte shorter than the one maas serves (269 bytes instead of 270), and I saw multiple requests for this file.

To reproduce this in a VM using UEFI:
- set up dhcp to point to bootx64.efi
- set up tftp with bootx64.efi and grubx64.efi but not grub/grub.cfg
- create files of varying sizes and access them using 'source (pxe)/config-file-on-server'

A simple file consisting of nothing but newlines is sufficient.

confirmed "good" file lengths: 1,2,3,4,266,268,270
confirmed "bad" file lengths: 267,269,271,584,595,627

No pattern established yet.
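The third step above (files of varying sizes containing nothing but newlines) can be scripted. Below is a minimal Python sketch; the config- file name prefix and the output directory are assumptions, and the files still have to end up in whatever directory the tftp server is serving.

```python
import tempfile
from pathlib import Path

# Lengths reported above as not triggering / triggering repeated requests.
GOOD = [1, 2, 3, 4, 266, 268, 270]
BAD = [267, 269, 271, 584, 595, 627]


def write_newline_files(directory, lengths, prefix="config-"):
    """Create one file per length, each containing only newline bytes."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for n in lengths:
        path = directory / ("%s%d" % (prefix, n))   # e.g. config-269
        path.write_bytes(b"\n" * n)
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Writes into a scratch directory; copy the files to the tftp root afterwards.
    out = tempfile.mkdtemp(prefix="tftp-test-")
    for p in write_newline_files(out, GOOD + BAD):
        print(p, p.stat().st_size)
```

Each file can then be fetched from the grub prompt with the 'source (pxe)/...' command given above while capturing traffic, to see how many read requests grub sends for it.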
[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
** Also affects: grub2 (Ubuntu)
   Importance: Undecided
       Status: New