Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
and fwiw, I'm not saying this is *the* solution for a problem like this one
where there is IO starvation. But it is definitely a step forward.


-- 
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1743249

Title:
  Failed Deployment after timeout trying to retrieve grub cfg

To manage notifications about this bug go to:
https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
That's what we have done to test the difference. So, for the wider
audience: this patch was tested on a 4-core NUC with an SSD, deploying 6 VMs
while 4 other nodes were PXE booting from MAAS.

Before the fix we saw:
1. The client would make multiple requests for the same file.
2. MAAS would run up to 3 DB requests for the node object used to render
the config.
3. We inspected why there were 3 DB requests for the same config.

From this behavior, we determined that the rack queries the region,
obtains the object, and then takes a while to generate the config and
return it to the client. Before the rack returns the config, the client
makes another request, and that causes another DB query. This confirmed
that request collapsing works as expected, but only in the region/rack
communication: once the rack has already received a response, it treats
a new request from the client as a new DB query.

With the fix we saw:
1. The client would make multiple requests for the same file.
2. MAAS would always perform 1 DB request for the node object to render
the config.

With this, we were able to identify that the rack was taking too long to
answer the client, so that if a new request came in, it was treated as a
new request to be served by the region. With the changes, the rack
responds faster: MAAS collapses the multiple requests and responds in a
timely fashion, before the client can cause it to make another request to
the DB.
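
To illustrate the collapsing behavior described above, here is a minimal
sketch. It is not the MAAS code: it uses asyncio rather than Twisted, and
RequestCollapser and fake_db_query are hypothetical names made up for the
example. It only shows the general idea that identical requests arriving
while a query is in flight share that query, while a request arriving after
the answer was delivered starts a new one.

```python
import asyncio


class RequestCollapser:
    """Share one in-flight backend call among identical concurrent requests."""

    def __init__(self, fetch):
        self._fetch = fetch      # coroutine function: key -> result
        self._in_flight = {}     # key -> asyncio.Task that is still running

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First request for this key: start the backend call.
            task = asyncio.ensure_future(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task once it finishes; later requests start over.
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        return await task


async def demo():
    db_queries = []

    async def fake_db_query(key):
        # Hypothetical stand-in for the region's DB query for a node.
        db_queries.append(key)
        await asyncio.sleep(0.01)
        return "config for " + key

    collapser = RequestCollapser(fake_db_query)
    # Three retries arrive while the first query is still running.
    await asyncio.gather(*(collapser.get("node-1") for _ in range(3)))
    after_burst = len(db_queries)
    # A retry that arrives after the answer was delivered is a new query.
    await collapser.get("node-1")
    return after_burst, len(db_queries)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the burst of three requests costs one query and the late
retry costs a second one, which is the pattern we saw: the slower the
answer reaches the client, the more likely a retry lands outside the
collapsing window.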

So the fix does improve things for sure, and we believe this is one of the
reasons why the problem happened while there was IO starvation. That said,
it is not the only thing to improve: other sections need improvement too
and, as I said earlier, those involve improving the DB as well.

On Tue, Feb 6, 2018 at 6:10 PM, Jason Hobbs 
wrote:

> BTW to be clear here I'm saying I don't think the path forward on
> improving this issue is thinking about how MAAS works and throwing out
> patches that might improve performance here and there.  The path
> forward is to instrument MAAS on a system with slow i/o and to figure
> out exactly where it's getting hung up.
>
> Jason

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
BTW to be clear here I'm saying I don't think the path forward on
improving this issue is thinking about how MAAS works and throwing out
patches that might improve performance here and there.  The path
forward is to instrument MAAS on a system with slow i/o and to figure
out exactly where it's getting hung up.
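
As a rough idea of what such instrumentation could look like, here is a
minimal sketch. StageTimer and the stage names are hypothetical and are not
part of MAAS or its logging; the point is only to record how long each step
of serving a request takes so the slow step can be named instead of guessed.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Collect wall-clock durations per named stage of a request."""

    def __init__(self):
        self.durations = defaultdict(list)   # stage name -> list of seconds

    @contextmanager
    def stage(self, name):
        start = time.monotonic()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised.
            self.durations[name].append(time.monotonic() - start)

    def slowest(self):
        """Return the stage with the largest total time."""
        totals = {name: sum(values) for name, values in self.durations.items()}
        return max(totals, key=totals.get)


def demo():
    timer = StageTimer()
    with timer.stage("db_query"):            # hypothetical stage names
        time.sleep(0.05)
    with timer.stage("render_template"):
        pass
    return timer
```

Wrapping each step of a request this way on a box with slow I/O would give
per-stage numbers to compare before and after any patch.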

Jason


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
dm-delay looks very interesting along those lines.

https://www.enodev.fr/posts/emulate-a-slow-block-device-with-dm-delay.html

https://www.kernel.org/doc/Documentation/device-mapper/delay.txt
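
For reference, the kernel document above describes a table line of the form
"0 <size in sectors> delay <device> <offset> <delay in ms>". A small sketch
that builds that line and the matching dmsetup invocation is below. The
device path, the mapping name and the helper names are assumptions for
illustration only; nothing here has been run against the MAAS test systems,
and creating the mapping needs root.

```python
def delay_table(device, size_sectors, delay_ms, offset=0):
    """Build a device-mapper 'delay' table line for the whole device."""
    return "0 %d delay %s %d %d" % (size_sectors, device, offset, delay_ms)


def dmsetup_create_command(name, table):
    """Return the argv that would create the delayed mapping (needs root)."""
    return ["dmsetup", "create", name, "--table", table]


if __name__ == "__main__":
    # Hypothetical device and size; get the real size in sectors with
    # `blockdev --getsz <device>`.
    table = delay_table("/dev/sdb", 2097152, 500)
    print(" ".join(dmsetup_create_command("delayed", table)))
```

The resulting /dev/mapper/delayed device could then back the MAAS database
or image storage on a test machine to emulate the slow disk.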


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
On Tue, Feb 6, 2018 at 4:50 PM, Andres Rodriguez
 wrote:
> I don't have logs anymore as I have since rebuilt my environment, but I can
> confirm seeing improvements on a maas server running with high IO (note it
> was a single region/rack).
>
> see inline:
>
>
> On Tue, Feb 6, 2018 at 5:17 PM, Jason Hobbs 
> wrote:
>
>> Andres, it was a single test in both cases, and in both cases there was
>> almost no delay from MAAS.  It's not significant enough to call it
>> positive results.
>>
>>
> Comment #93 shows there are /some/ improvements when comparing those two
> samples only, but as I have already said, we need data over time in both
> scenarios to properly compare and determine whether the changes do make any
> material performance improvements with the current conditions of the
> samples (both samples are with a fixed IO starvation on the environment).
>
>
>> Since neither of you answered yes, I'll assume the answer was no to my
>> question of whether there was anything in my logs or data that showed
>> reading the template from disk on the rack controller was the culprit,
>> and that this fix just represents a guess at what might be causing the
>> delay.
>>
>
> To be fair, your logs do not provide anything concrete to determine what's
> the culprit of the issue on the MAAS side. They provide a lot of clues, and
> we have since determined that those issues were a result of IO
> starvation (from the VMs writing to disk). As such, the only way we can
> *really* see if the patch brings any significant performance improvements
> is to run tests in the environment where you were seeing the issues in the
> first place.

I didn't think my logs provided anything concrete!  That's because the
logging built into MAAS is not sufficient to do so.

I can't break that environment to test anymore - we got it working
thanks to you guys' help, and it's a production environment that needs
to keep running other tests.

It might be possible to recreate this on another MAAS server, using
'stress' or a similar tool to cause disk contention.
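
If 'stress' isn't handy, something along these lines could stand in for it.
This is only a sketch of a synced-write load generator with made-up function
names and parameters; it is not what was used on the original cluster, and
the numbers would need tuning to actually starve a disk.

```python
import os
import tempfile
import threading
import time


def write_and_sync(path, chunk, stop_event):
    """Rewrite a file in a loop, forcing every write to disk with fsync."""
    count = 0
    with open(path, "wb") as handle:
        while True:
            handle.seek(0)
            handle.write(chunk)
            handle.flush()
            os.fsync(handle.fileno())   # bypass the page cache's write-back delay
            count += 1
            if stop_event.is_set():
                break
    return count


def generate_disk_load(directory, workers=4, seconds=1.0, chunk_size=1 << 20):
    """Run several writers in parallel and return the writes each completed."""
    stop = threading.Event()
    counts = [0] * workers

    def run(index):
        path = os.path.join(directory, "load-%d.bin" % index)
        counts[index] = write_and_sync(path, os.urandom(chunk_size), stop)

    threads = [threading.Thread(target=run, args=(i,)) for i in range(workers)]
    for thread in threads:
        thread.start()
    time.sleep(seconds)
    stop.set()
    for thread in threads:
        thread.join()
    return counts


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(generate_disk_load(scratch, workers=2, seconds=0.5))
```

The scratch directory would have to live on the same disk MAAS uses for the
contention to matter.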

Jason

> As such, if you are willing to test if these make any material difference,
> I would unfix your environment and do two runs (one without the fix, and
> one with the fix). That's the only way we can really compare and be certain
> in *your* environment.

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
I don't have logs anymore as I have since rebuilt my environment, but I can
confirm seeing improvements on a MAAS server running with high IO (note it
was a single region/rack).

See inline:


On Tue, Feb 6, 2018 at 5:17 PM, Jason Hobbs 
wrote:

> Andres, it was a single test in both cases, and in both cases there was
> almost no delay from MAAS.  It's not significant enough to call it
> positive results.
>
>
Comment #93 shows there are /some/ improvements when comparing those two
samples only but, as I have already said, we need data over time in both
scenarios to properly compare and determine whether the changes make any
material performance improvement under the current conditions of the
samples (both samples were taken with fixed IO starvation in the environment).


> Since neither of you answered yes, I'll assume the answer was no to my
> question of whether there was anything in my logs or data that showed
> reading the template from disk on the rack controller was the culprit,
> and that this fix just represents a guess at what might be causing the
> delay.
>

To be fair, your logs do not provide anything concrete to determine the
culprit of the issue on the MAAS side. They provide a lot of clues, and
we have since determined that those issues were a result of IO
starvation (from the VMs writing to disk). As such, the only way we can
*really* see if the patch brings any significant performance improvement
is to run tests in the environment where you were seeing the issues in the
first place.

So, if you are willing to test whether these changes make any material
difference, I would unfix your environment and do two runs (one without the
fix, and one with the fix). That's the only way we can really compare and be
certain in *your* environment.



-- 
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer


[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
Andres, it was a single test in both cases, and in both cases there was
almost no delay from MAAS.  It's not significant enough to call it
positive results.

Since neither of you answered yes, I'll assume the answer was no to my
question of whether there was anything in my logs or data that showed
reading the template from disk on the rack controller was the culprit,
and that this fix just represents a guess at what might be causing the
delay.


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Blake Rouse
Andres did the testing of the changes and has logs to prove the
improvement.


On Tue, Feb 6, 2018 at 4:43 PM, Jason Hobbs 
wrote:

> Blake, that's great.  Do you have before and after numbers showing the
> improvement this change made?
>
> Do you have any data or logs that led you to believe this was the
> culprit in the slow responses I saw on my cluster?
>
> On Tue, Feb 6, 2018 at 3:12 PM, Blake Rouse 
> wrote:
> > Actually caching does make a difference. That method is not just caching
> > the reading of a file: it caches the search for the file based on the
> > purpose, the reading of that file from disk (which, sure, can be in the
> > kernel cache), and the parsing of the template by tempita.
> >
> > All of that is redundant work that is being done on every single request.
> > Searching the filesystem and reading the file are all syscalls, even if
> > the data comes from the kernel cache. Since MAAS is async-based, that
> > means the coroutine will be placed on hold while we wait for the result
> > to be loaded from the kernel into the memory of the process. That gives
> > other coroutines time to do other things, which means that coroutine
> > doesn't get to execute until the others are done or blocked by their own
> > async request.
> >
> > Caching this information can greatly improve that by not requiring the
> > coroutine to be pushed back into the event loop while it is waiting for
> > data from the kernel. Without this change, when the data comes back it
> > still has to be processed by tempita, which takes time and blocks the
> > event loop from completing other work.
> >
> > So it's not simply that we should use the kernel to cache reads from the
> > disk; there is a lot more involved here. We have noticed improvements
> > with this change on systems that are being run with a large number of VMs
> > because of the reduction in IO.
> >
> > --
> > You received this bug notification because you are subscribed to the bug
> > report.
> > https://bugs.launchpad.net/bugs/1743249
> >
> > Title:
> >   Failed Deployment after timeout trying to retrieve grub cfg
> >
> > Status in MAAS:
> >   New
> > Status in grub2 package in Ubuntu:
> >   Fix Released
> >
> > Bug description:
> >   A node failed to deploy after it failed to retrieve a grub.cfg from
> >   MAAS due to a timeout.  In the logs, it's clear that the server tried
> >   to retrieve the grub cfg many times, over about 30 seconds:
> >
> >   http://paste.ubuntu.com/26387256/
> >
> >   We see the same thing for other hosts around the same time:
> >
> >   http://paste.ubuntu.com/26387262/
> >
> >   It seems like MAAS is taking way too long to respond to these
> >   requests.
> >
> >   This is very similar to bug 1724677, which was happening
> >   pre-Meltdown/Spectre. The only difference is we don't see "[critical] TFTP
> >   back-end failed" in the logs anymore.
> >
> >   I connected to the console on this system and it had errors about
> >   timing out retrieving the grub-cfg, then it had an error message along
> >   the lines of "error not an ip" and then "double free".  After I
> >   connected but before I could get a screenshot the system rebooted and
> >   was directed by maas to power off, which it did successfully after
> >   booting to linux.
> >
> >   Full logs are available here:
> >   https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
> >
> >   This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
> >
> > To manage notifications about this bug go to:
> > https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
@Jason,

I'm comparing the pastebin in #79 vs. the pastebin in #90.

#79 (non-patched): https://paste.ubuntu.com/26530737/
#90 (patched with lru_cache): https://paste.ubuntu.com/26531873/

Examples I see in #79:

14:02:ec:42:38:dc # makes 9 requests on line 160+
14:02:ec:42:28:70 # makes 8 requests on line 72
14:02:ec:41:d7:38 # makes 7 requests on line 92

In #90 I see:

14:02:ec:41:d7:44 # makes 6 requests on line 7
14:02:ec:42:38:dc # makes 5 requests on line 19


So it is interesting to see that the machines in #79 make more requests than
those in #90. Obviously we need more data over time to really tell the
difference between the two scenarios, but based on the current logs, it does
/apparently/ show an improvement.
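
To make this kind of comparison repeatable over more samples, a small script
can do the counting. The sketch below assumes only that each request line in
the log contains the client's MAC address somewhere; the sample lines in it
are invented for illustration and do not reproduce the actual format of the
pastes above.

```python
import re
from collections import Counter

# Six colon-separated hex octets, e.g. 14:02:ec:42:38:dc.
MAC_RE = re.compile(r"\b(?:[0-9a-f]{2}:){5}[0-9a-f]{2}\b", re.IGNORECASE)


def requests_per_mac(lines):
    """Count how many log lines mention each MAC address."""
    counts = Counter()
    for line in lines:
        match = MAC_RE.search(line)
        if match:
            counts[match.group(0).lower()] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log lines, not taken from the real pastes.
    sample = [
        "request for grub.cfg-14:02:ec:42:38:dc",
        "request for grub.cfg-14:02:ec:42:38:dc",
        "request for grub.cfg-14:02:EC:41:D7:44",
        "unrelated line without an address",
    ]
    for mac, count in requests_per_mac(sample).most_common():
        print(mac, count)
```

Running this over a patched and a non-patched log from several deployments
would give per-machine retry counts that can be compared directly.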


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
Blake, that's great.  Do you have before and after numbers showing the
improvement this change made?

Do you have any data or logs that led you to believe this was the
culprit in the slow responses I saw on my cluster?

On Tue, Feb 6, 2018 at 3:12 PM, Blake Rouse  wrote:
> Actually caching does make a difference. That method is not just caching
> the reading of a file: it caches the search for the file based on the
> purpose, the reading of that file from disk (which, sure, can be in the
> kernel cache), and the parsing of the template by tempita.
>
> All of that is redundant work that is being done on every single request.
> Searching the filesystem and reading the file are all syscalls, even if
> the data comes from the kernel cache. Since MAAS is async-based, that
> means the coroutine will be placed on hold while we wait for the result
> to be loaded from the kernel into the memory of the process. That gives
> other coroutines time to do other things, which means that coroutine
> doesn't get to execute until the others are done or blocked by their own
> async request.
>
> Caching this information can greatly improve that by not requiring the
> coroutine to be pushed back into the event loop while it is waiting for
> data from the kernel. Without this change, when the data comes back it
> still has to be processed by tempita, which takes time and blocks the
> event loop from completing other work.
>
> So it's not simply that we should use the kernel to cache reads from the
> disk; there is a lot more involved here. We have noticed improvements
> with this change on systems that are being run with a large number of VMs
> because of the reduction in IO.

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Blake Rouse
Actually, caching does make a difference. That method does not just cache
the reading of a file; it caches the search for the file based on the
purpose, the reading of that file from disk (which, sure, can be in the
kernel cache), and the parsing of the template by tempita.

All of that is redundant work that is being done on every single request.
Searching the filesystem and reading the file are all syscalls, even if
the data comes from the kernel cache. Since MAAS is async-based, that
means the coroutine is put on hold while we wait for the result to be
loaded from the kernel into the memory of the process. That gives other
coroutines time to do other things, which means that coroutine doesn't
get to execute until the others are done or blocked by their own async
request.

Caching this information can greatly improve that, because the coroutine
no longer has to be pushed back into the event loop while it waits for
data from the kernel. Without this change, when the data comes back it
still has to be processed by tempita, which takes time and blocks the
event loop from completing other work.

So it's not simply that we should use the kernel to cache reads from
disk; there is a lot more involved here. We have noticed improvements
with this change on systems that are being run with a large number of VMs
because of the reduction in IO.
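
For readers who have not looked at the patch, the idea can be sketched
roughly as follows. This is not the MAAS code: the function names and
template file names are hypothetical, and string.Template from the standard
library stands in for tempita so the example runs on its own. The point is
only that the search, the read, and the parse all sit behind one
functools.lru_cache call, so they happen once per (directory, purpose) pair
instead of once per request.

```python
import functools
import string
from pathlib import Path


@functools.lru_cache(maxsize=None)
def get_template(template_dir, purpose):
    """Find, read, and parse the template for `purpose` once per process."""
    directory = Path(template_dir)
    # Search: prefer a purpose-specific template, then a generic fallback.
    # (These file names are illustrative, not the real MAAS layout.)
    for name in (f"config.{purpose}.template", "config.template"):
        path = directory / name
        if path.is_file():
            # The read and the parse only run on a cache miss.
            return string.Template(path.read_text())
    raise FileNotFoundError(f"no template for purpose {purpose!r}")


def render_config(template_dir, purpose, **params):
    """Render a per-request config from the cached, already parsed template."""
    return get_template(template_dir, purpose).substitute(**params)
```

Only the final substitution runs per request; it is cheap and touches no
files. One trade-off of this approach is that a template edited on disk is
not picked up until the process restarts or get_template.cache_clear() is
called.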

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
The patch from #84 is adding a cache for reading the template file on
the rack controller.  I don't understand why this change is being made.

This file will almost certainly be in the page cache anyhow, as these
systems have a lot of free RAM.  Usually it's best to just let the page
cache do its thing and not try to re-implement it in userspace, unless
you really know what you're doing.

I haven't seen any logs that indicate that there was a bottleneck
reading the template file. Do you have some data along those lines?

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
Anyhow, I tested with the patch from #84 as requested. Here are the
results: http://paste.ubuntu.com/26531873/

We're still seeing some retries with it, same as before.

But I think the test is of limited value.  It didn't make things worse,
but we don't have any evidence from the test that it made things better.

We're not seeing big delays on a regular basis anymore after changing
the storage configuration to reduce contention.  With or without this
patch, we don't see any big delays.

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
Ok - and what about the log messages saying the region controller is
losing contact with the rack controller? What is that about?

On Tue, Feb 6, 2018 at 11:37 AM, Andres Rodriguez
 wrote:
> fwiw, the deadlock issue is regiond trying to determine which process
> should send updates to which racks for *dhcp* changes, so this is not at
> all related to the RPC boot requests for pxe.
>
> On Tue, Feb 6, 2018 at 11:43 AM, Jason Hobbs 
> wrote:
>
>> Can you please comment on the deadlock detected error from the db log
>> posted in #36?
>>
>> http://paste.ubuntu.com/26530761/
>>
>> That is not expected behavior, is it?  Also, the fact that MAAS thinks it's
>> losing rack/region connections seems like it could be related to this
>> behavior.

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Launchpad Bug Tracker
This bug was fixed in the package grub2 - 2.02-2ubuntu6

---
grub2 (2.02-2ubuntu6) bionic; urgency=medium

  [ Steve Langasek ]
  * debian/patches/bufio_sensible_block_sizes.patch: Don't use arbitrary file
    sizes as block sizes in bufio: this avoids potentially seeking back in
    the files unnecessarily, which may require re-opening files that cannot be
    seeked into, such as via TFTP. (LP: #1743249)

 -- Mathieu Trudel-Lapierre   Mon, 05 Feb 2018 11:58:09 -0500

** Changed in: grub2 (Ubuntu)
   Status: In Progress => Fix Released

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
@Jason,

Are these tests with archive grub or patched grub?

On Tue, Feb 6, 2018 at 11:39 AM, Jason Hobbs 
wrote:

> Andres,
>
> I ran the test with VMs limited to 9 of 20 cores (cut the core limit
> in half for VMs).  The first time range from this dump is with the
> cores at their normal limit (18).
>
> As you can see, the behavior didn't change much from one set to the
> other.  Both sets had instances where grub started doing retries,
> although in neither case did it take very long.
>
> http://paste.ubuntu.com/26530737/
>
> So it seems that changing the CPU limits for the VMs doesn't change
> the results drastically, which lines up with the data showing CPU
> utilization never gets over 50%.
>
> Jason
>
>
> On Mon, Feb 5, 2018 at 10:19 PM, Andres Rodriguez
>  wrote:
> >>
> >>
> >> > That being said, because CPU load doesn't show as high, we are making
> >> > the *assumption* that it is not impacting MAAS, but again, this is an
> >> > assumption. Making the requested change to have at least 4 CPUs
> >> > (ideally 6) would allow us to determine what the effects are, see
> >> > whether there's any difference in behavior, and help identify any
> >> > other issues.
> >> >
> >> > Without the comparison, we are making it more difficult to isolate
> >> > the problem.
> >>
> >> To improve performance the typical pattern is 1) identify the
> >> bottleneck 2) eliminate that as the bottleneck 3) repeat.
> >>
> >> We have not identified CPU as a bottleneck.  The top data says it is
> >> not!
> >>
> >
> > Jason,
> >
> > That doesn't change the fact that we are requesting tests to be run with
> > a different CPU configuration for the VMs, so we can make a *comparison*
> > and see if there is any material difference, or none at all, with the
> > current conditions. While I agree with you that the data /seems/ to show
> > that there is no issue with CPU, that doesn't change the fact that we
> > don't have any data to compare with, as there could still be an impact
> > even if it is minimal.
> >
> > Without the data, we cannot assert with certainty that there's no issue
> > caused by CPU usage, because we don't have a reference or point of
> > comparison. So while all fingers seem to be pointing to storage, I
> > strongly believe it is worth gathering the data now and fully discarding
> > CPU as a cause.
> >
> > If this is something that your environment is unable to do, I would
> > appreciate it if you clarified that instead of asserting that there's no
> > performance impact in MAAS due to CPU usage, when we don't really know
> > for sure (e.g. we don't know if MAAS behaves differently with less CPU
> > usage in the current conditions, and that's data worth gathering to be
> > able to better support you in the future).

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
fwiw, the deadlock issue is regiond trying to determine which process
should send updates to which racks for *dhcp* changes, so this is not at
all related to the RPC boot requests for pxe.

On Tue, Feb 6, 2018 at 11:43 AM, Jason Hobbs 
wrote:

> Can you please comment on the deadlock detected error from the db log
> posted in #36?
>
> http://paste.ubuntu.com/26530761/
>
> That is not expected behavior, is it?  Also, the fact that MAAS thinks it's
> losing rack/region connections seems like it could be related to this
> behavior.


-- 
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
>
> >
> > Yes, it is not an unknown machine, but that doesn't change the fact that
> > this is working as designed. If the client didn't get a response for the
> > request it makes, and the client decides to move on and makes a different
> > request, then it is working as designed. Again, the bug here is not in
> > the client's behavior; the bug here is the fact that the response is not
> > being sent in a timely manner.
>
> Yes, agreed 100%.  It's not a client bug, it's a server bug.
>
> >
> >>
> >> > So this is *not* a race condition in MAAS. This is working as designed
> >> > and is expected. The problem here is that MAAS takes too long to
> >> > answer the initial request, which causes grub to time out and move on
> >> > to request a different config file.
> >>
> >> Yes, because there is a race condition in the design - the MAC
> >> specific file has to be generated before grub times out.  It could
> >> instead be generated before the node ever starts booting, allowing it
> >> to be served just as fast as the -default-amd64 file is, eliminating
> >> that race condition.
> >>
> >
> > It is not a race condition. It is doing exactly what it was told to do.
> > It requested X, didn't get a response, then requested Y, and got a
> > response. The fact that there's no response to X in a /timely/ manner
> > is not a race, it's a bug on the server side. So, if the machine were
> > not known to MAAS, it would work as expected. But since it is known and
> > the response doesn't come in a timely manner for grub, it moves on.
> > This is the same behavior pxe, uboot and other network bootloaders
> > follow.
>
> Right - it's a bug on the server side!  That's what I've been saying.
>
> > And yes, you could argue that the config could be generated before the
> > node starts booting, but what you are not considering is that the node
> > can really boot from any rack controller, and that would require maas to
> > send the same file to all rack controllers in the same vlan the machine
> > is booting from and write files onto the disk dynamically, which in
> > fact can impact performance even more. The config is generated on the
> > fly because it is generated for the specific rack controller the
> > machine is booting from, and that's the intended design.
>
> I never suggested the files had to be written to disk, but yes, they
> would need to be sent to each rack controller that it could boot from.
>
> I know it's the intended design, but it has a race condition built in
> that could be eliminated with another design.  That's all I'm saying.
>
> It sounds like you agree and you point out there would be trade-offs,
> and that's fine.
>

Actually, we don't believe this is a good change. In fact, this will cause
booting issues and overall performance issues.

We already know of two areas where this can be improved. One is
non-backportable to 2.3; the other one is this:

https://paste.ubuntu.com/26530972/

Is there any chance you can test that patch, or do you want me to put a
patched package somewhere?

>
> Jason


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
The deadlock is not expected behavior.

Due to the isolation level, the number of workers (e.g. 12
workers/3 regions) and the fact that there could be IO starvation, this
issue is surfacing. That said, changes to improve this and prevent the
deadlocks are not backportable to 2.3 and are targeted for 2.4.

On Tue, Feb 6, 2018 at 11:43 AM, Jason Hobbs 
wrote:

> Can you please comment on the deadlock detected error from the db log
> posted in #36?
>
> http://paste.ubuntu.com/26530761/
>
> That is not expected behavior, is it?  Also, the fact that MAAS thinks it's
> losing rack/region connections seems like it could be related to this
> behavior.


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
On Tue, Feb 6, 2018 at 10:40 AM, Andres Rodriguez
 wrote:
> On Tue, Feb 6, 2018 at 11:24 AM, Jason Hobbs 
> wrote:
>
>> On Mon, Feb 5, 2018 at 4:07 PM, Andres Rodriguez
>>  wrote:
>> > I think there's a misunderstanding about how the network boot process
>> > happens. Let's look at pxelinux first. pxelinux does this:
>> >
>> > 1. tries UUID first # if no answer, it moves on
>> > 2. tries MAC # if no answer, it moves on
>> > 3. tries full IP address # if no answer, it moves on
>> > 4. tries partial IP address # if no answer, it moves on
>> > 5. does 4
>> > 6. does 4
>> > [...]
>> > 7. boots default.
>> >
>> > This can be seen here:
>> >
>> > /mybootdir/pxelinux.cfg/b8945908-d6a6-41a9-611d-74a6ab80b83d
>> > /mybootdir/pxelinux.cfg/01-88-99-aa-bb-cc-dd
>> > /mybootdir/pxelinux.cfg/C0A8025B
>> > /mybootdir/pxelinux.cfg/C0A8025
>> > /mybootdir/pxelinux.cfg/C0A802
>> > /mybootdir/pxelinux.cfg/C0A80
>> > /mybootdir/pxelinux.cfg/C0A8
>> > /mybootdir/pxelinux.cfg/C0A
>> > /mybootdir/pxelinux.cfg/C0
>> > /mybootdir/pxelinux.cfg/C
>> > /mybootdir/pxelinux.cfg/default
>> >
>> >
>> > That said, in the case of grub, this behavior is similar. You have
>> > described this behavior in comment #16. So what is it that's happening:
>> >
>> > 1. grub is trying grub.cfg- address multiple times, but since it
>> > doesn't get a response, it gives up.
>> > 2. Once it gives up, grub.cfg-default-amd64 is tried instead.
>> >
>> > That said, the requests are handled completely differently. The -
>> > requests actually access the *node* object in the database by
>> > searching for it with the MAC address the request is made from. With
>> > this node object, we generate the config file.
>> >
>> > In comparison, the -default-amd64 request does *not* access the node
>> > object. It just accesses two config settings and the db query is *much*
>> > cheaper. Also, we have to keep in mind that after grub has done many
>> > retries, this returns rather fast in comparison, because it is not only
>> > cheaper, but at that point MAAS may have a much smaller load of queued
>> > DB requests. Either way, grub giving up means that it won't expect a
>> > response to the initial request, but it will expect a response for the
>> > new file it asked for.
>> >
>> > That said, this is working *exactly* as expected, because this
>> > effectively tells grub "if the config for your MAC address was not
>> > returned, you can safely assume you are an unknown machine to MAAS",
>> > hence grub requests a different config file to start the enlistment
>> > process.
>>
>> Except it's not an unknown machine, and MAAS treating it like one is
>> bad behavior and a bug.
>
>
>> This is not "working exactly as expected".  "Working exactly as
>> expected" would be my machine being deployed when I asked for it to
>> be.
>>
>
> Yes, it is not an unknown machine, but that doesn't change the fact that
> this is working as designed. If the client didn't get a response for the
> request it makes, and the client decides to move on and makes a different
> request, then it is working as designed. Again, the bug here is not in the
> client's behavior; the bug here is the fact that the response is not
> being sent in a timely manner.

Yes, agreed 100%.  It's not a client bug, it's a server bug.

>
>>
>> > So this is *not* a race condition in MAAS. This is working as designed
>> > and is expected. The problem here is that MAAS takes too long to answer
>> > the initial request, which causes grub to time out and move on to
>> > request a different config file.
>>
>> Yes, because there is a race condition in the design - the MAC
>> specific file has to be generated before grub times out.  It could
>> instead be generated before the node ever starts booting, allowing it
>> to be served just as fast as the -default-amd64 file is, eliminating
>> that race condition.
>>
>
> It is not a race condition. It is doing exactly what it was told to do. It
> requested X, didn't get a response, then requested Y, and got a response.
> The fact that there's no response to X in a /timely/ manner is not a race,
> it's a bug on the server side. So, if the machine were not known to MAAS,
> it would work as expected. But since it is known and the response doesn't
> come in a timely manner for grub, it moves on. This is the same behavior
> pxe, uboot and other network bootloaders follow.

Right - it's a bug on the server side!  That's what I've been saying.

> And yes, you could argue that the config could be generated before the node
> starts booting, but what you are not considering is that the node can boot
> from any rack controller really and that would require maas to send the
> same file to all rack controllers in the same vlan the machine is booting
> from and write files onto the disk dynamically, which in fact, can impact
> performance even more. The fact the config is generated on the fly is
> because it 

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
Can you please comment on the "deadlock detected" error from the db log
posted in #36?

http://paste.ubuntu.com/26530761/

That is not expected behavior, is it?  Also, the fact that MAAS thinks it's
losing rack/region connections seems like it could be related to this
behavior.


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Andres Rodriguez
On Tue, Feb 6, 2018 at 11:24 AM, Jason Hobbs 
wrote:

> On Mon, Feb 5, 2018 at 4:07 PM, Andres Rodriguez
>  wrote:
> > I think there's a misunderstanding on how the network boot process
> happens:
> > Let's look at pxe linux first. Pxe linux does this:
> >
> > 1. tries UUID first # if no answer, it moves on
> > 2. Tries mac # if no answer, it moves on
> > 3. tries full IP address # if no answer, it moves on
> > 4. tries partial IP address # if no answer, it moves on
> > 5. does 4
> > 6. does 4
> > [...]
> > 7. boots default.
> >
> > This can be seen in here:
> >
> > /mybootdir/pxelinux.cfg/b8945908-d6a6-41a9-611d-74a6ab80b83d
> > /mybootdir/pxelinux.cfg/01-88-99-aa-bb-cc-dd
> > /mybootdir/pxelinux.cfg/C0A8025B
> > /mybootdir/pxelinux.cfg/C0A8025
> > /mybootdir/pxelinux.cfg/C0A802
> > /mybootdir/pxelinux.cfg/C0A80
> > /mybootdir/pxelinux.cfg/C0A8
> > /mybootdir/pxelinux.cfg/C0A
> > /mybootdir/pxelinux.cfg/C0
> > /mybootdir/pxelinux.cfg/C
> > /mybootdir/pxelinux.cfg/default
> >
> >
> > That said, in the case of grub, this behavior is similar. You have
> > described this behavior in comment #16. So what is it that's happening:
> >
> > 1. grub is trying grub.cfg- address multiple times, but since it
> > doesn't get a response, it gives up.
> > 2. Once it gives up, grub.cfg-default-amd64 is tried instead.
> >
> > That said, the requests are handled completely differently. The -
> > requests actually access the *node* object in the database by
> > searching for it with the MAC address the request is made from. With
> > this node object, we generate the config file.
> >
> > In comparison, the -default-amd64 request does *not* access the node
> > object. It just accesses two config settings and the db query is *much*
> > cheaper. Also, we have to keep in mind that after grub has done many
> > retries, this returns rather fast in comparison because it is not only
> > cheaper, but at that point MAAS may have far fewer queued DB requests.
> > Either way, grub giving up means that it won't wait for the initial
> > response, but it will expect a response for the new file it asked for.
> >
> > That said, this is working *exactly* as expected, because this
> > effectively tells grub "if config for your MAC address was not returned,
> > you can safely assume you are an unknown machine to MAAS", hence grub
> > requests a different config file to start the enlistment process.
>
> Except it's not an unknown machine, and MAAS treating it like one is
> bad behavior and a bug.


> This is not "working exactly as expected".  "Working exactly as
> expected" would be my machine being deployed when I asked for it to
> be.
>

Yes, it is not an unknown machine, but that doesn't change the fact that
this is working as designed. If the client didn't get a response for the
request it makes, and the client decides to move on and makes a different
request, then it is working as designed. Again, the bug here is not in the
client's behavior; the bug is that the response is not being sent in a
timely manner.


>
> > So this is *not* a race condition in MAAS. This is working as designed
> and
> > is expected. The problem here is that MAAS takes too long to answer the
> > initial request, which causes grub to timeout and move on to request a
> > different config file.
>
> Yes, because there is a race condition in the design - the MAC
> specific file has to be generated before grub times out.  It could
> instead be generated before the node ever starts booting, allowing it
> to be served just as fast as the -default-amd64 file is, eliminating
> that race condition.
>

It is not a race condition. It is doing exactly what it was told to do. It
requested X, didn't get a response, then requested Y, and got a response.
The fact that there's no response to X in a /timely/ manner is not a race,
it's a bug on the server side. So, if the machine were not known to MAAS,
it would work as expected. But since it is known and the response doesn't
come in a timely manner for grub, it moves on. This is the same behavior
pxe, uboot and other network bootloaders follow.
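
For reference, the PXELINUX lookup order quoted earlier in this message can
be sketched in Python. This is only an illustration of the naming scheme,
not PXELINUX or MAAS code; the function name and the `prefix` parameter are
made up for the example.

```python
import ipaddress


def pxelinux_candidates(uuid, mac, ip, prefix="pxelinux.cfg"):
    """Return the config paths a PXELINUX client tries, in order.

    Hypothetical helper for illustration only.
    """
    names = [uuid.lower()]
    # "01" is the ARP hardware type for Ethernet, followed by the MAC
    # address in lowercase with dashes.
    names.append("01-" + mac.lower().replace(":", "-"))
    # The IPv4 address in uppercase hex, then the same string with one
    # hex digit dropped from the end each time.
    ip_hex = "%08X" % int(ipaddress.IPv4Address(ip))
    names.extend(ip_hex[:n] for n in range(8, 0, -1))
    # Last resort when nothing more specific was answered.
    names.append("default")
    return [f"{prefix}/{name}" for name in names]


for path in pxelinux_candidates(
    "b8945908-d6a6-41a9-611d-74a6ab80b83d",
    "88:99:AA:BB:CC:DD",
    "192.168.2.91",
):
    print(path)
```

Each entry is only tried after the previous one went unanswered, which is
why a slow answer for a specific file looks the same to the client as no
file at all.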

And yes, you could argue that the config could be generated before the node
starts booting, but what you are not considering is that the node can boot
from any rack controller, so that would require MAAS to send the same file
to all rack controllers in the VLAN the machine is booting from and write
files to disk dynamically, which in fact can hurt performance even more.
The config is generated on the fly because it is generated for the specific
rack controller the machine is booting from, and that's the intended
design.

>
> Jason
>
> > On Mon, Feb 5, 2018 at 4:30 PM, Jason Hobbs 
> > wrote:
> >
> >> The packetdump (comment #35) of MAAS not responding to grub's request
> >> for the mac specific grub.cfg before grub times 

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
Andres,

I ran the test with VMs limited to 9 of 20 cores (cut the core limit
in half for VMs).  The first time range from this dump is with the
cores at their normal limit (18).

As you can see, the behavior didn't change much from one set to the
other.  Both sets had instances where grub started doing retries,
although in neither case did it take very long.

http://paste.ubuntu.com/26530737/

So it seems that changing the CPU limits for the VMs doesn't change
the results drastically, which lines up with the data showing CPU
utilization never gets over 50%.

Jason


On Mon, Feb 5, 2018 at 10:19 PM, Andres Rodriguez
 wrote:
>>
>>
>> > That being said, because CPU load doesn't show high we are making the
>> > *assumption* that it is not impacting MAAS, but again, this is an
>> > assumption. Making the requested change for having at least 4 CPUs
>> (ideally
>> > 6) would allow us to determining what are the effects and see whether
>> > there's any difference on behavior and would help identify what other
>> > issues.
>> >
>> > Without having the comparison then we are making it more difficult to
>> > isolate the problem.
>>
>> To improve performance the typical pattern is 1) identify the
>> bottleneck 2) eliminate that as the bottleneck 3) repeat.
>>
>> We have not identified CPU as a bottleneck.  The top data says it is
>> not!
>>
>
> Jason,
>
> That doesn't change the fact that we are requesting tests to be run with
> different CPU configuration for VM's, so we can make a *comparison* and see
> if there is any material difference or none at all with the current
> conditions. While I agree with you that the data /seems/ to show that there
> is not issue with CPU, that doesn't change the fact that we don't have any
> data to compare with, as there could still be an impact even if it is
> minimum.
>
> Without the data, we cannot assert with certainty that there's no issue
> caused by CPU usage because we don't have a reference or point of
> comparison. So while all fingers seem to be pointing to storage, I strongly
> believe it is worth gathering the data now so we can fully discard CPU as
> a factor.
>
> If this is something that your environment is unable to do, I would
> appreciate that you clarify that instead of asserting that there's no
> performance impact in MAAS due to CPU usage, when we don't really know for
> sure (e.g. we don't know if MAAS behaves differently with less CPU usage in
> the current conditions, and that's data worth gathering to be able to
> better support you in the future).
>
> --
> Andres Rodriguez (RoAkSoAx)
> Ubuntu Server Developer
> MSc. Telecom & Networking
> Systems Engineer
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1743249
>
> Title:
>   Failed Deployment after timeout trying to retrieve grub cfg
>
> Status in MAAS:
>   New
> Status in grub2 package in Ubuntu:
>   In Progress
>
> Bug description:
>   A node failed to deploy after it failed to retrieve a grub.cfg from
>   MAAS due to a timeout.  In the logs, it's clear that the server tried
>   to retrieve the grub cfg many times, over about 30 seconds:
>
>   http://paste.ubuntu.com/26387256/
>
>   We see the same thing for other hosts around the same time:
>
>   http://paste.ubuntu.com/26387262/
>
>   It seems like MAAS is taking way too long to respond to these
>   requests.
>
>   This is very similar to bug 1724677, which was happening
>   pre-Meltdown/Spectre. The only difference is we don't see "[critical] TFTP
>   back-end failed" in the logs anymore.
>
>   I connected to the console on this system and it had errors about
>   timing out retrieving the grub-cfg, then it had an error message along
>   the lines of "error not an ip" and then "double free".  After I
>   connected but before I could get a screenshot the system rebooted and
>   was directed by maas to power off, which it did successfully after
>   booting to linux.
>
>   Full logs are available here:
>   https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
>
>   This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-06 Thread Jason Hobbs
On Mon, Feb 5, 2018 at 4:07 PM, Andres Rodriguez
 wrote:
> I think there's a misunderstanding on how the network boot process happens:
> Let's look at pxe linux first. Pxe linux does this:
>
> 1. tries UUID first # if no answer, it moves on
> 2. Tries mac # if no answer, it moves on
> 3. tries full IP address # if no answer, it moves on
> 4. tries partial IP address # if no answer, it moves on
> 5. does 4
> 6. does 4
> [...]
> 7. boots default.
>
> This can be seen in here:
>
> /mybootdir/pxelinux.cfg/b8945908-d6a6-41a9-611d-74a6ab80b83d
> /mybootdir/pxelinux.cfg/01-88-99-aa-bb-cc-dd
> /mybootdir/pxelinux.cfg/C0A8025B
> /mybootdir/pxelinux.cfg/C0A8025
> /mybootdir/pxelinux.cfg/C0A802
> /mybootdir/pxelinux.cfg/C0A80
> /mybootdir/pxelinux.cfg/C0A8
> /mybootdir/pxelinux.cfg/C0A
> /mybootdir/pxelinux.cfg/C0
> /mybootdir/pxelinux.cfg/C
> /mybootdir/pxelinux.cfg/default
>
>
> That said, in the case of grub, this behavior is similar. You have
> described this behavior in comment #16. So what is it that's happening:
>
> 1. grub is trying grub.cfg- address multiple times, but since it
> doesn't get a response, it gives up.
> 2. Once it gives up, grub.cfg-default-amd64 is tried instead.
>
> That said, the requests are handled completely differently. The -
> requests actually access the *node* object in the database by searching
> for it with the MAC address the request is made from. With this node
> object, we generate the config file.
>
> In comparison, the -default-amd64 request does *not* access the node
> object. It just accesses two config settings and the db query is *much*
> cheaper. Also, we have to keep in mind that after grub has done many
> retries, this returns rather fast in comparison because it is not only
> cheaper, but at that point MAAS may have far fewer queued DB requests.
> Either way, grub giving up means that it won't wait for the initial
> response, but it will expect a response for the new file it asked for.
>
> That said, this is working *exactly* as expected, because this effectively
> tells grub "if config for your MAC address was not returned, you can safely
> assume you are an unknown machine to MAAS", hence grub requests a different
> config file to start the enlistment process.

Except it's not an unknown machine, and MAAS treating it like one is
bad behavior and a bug.

This is not "working exactly as expected".  "Working exactly as
expected" would be my machine being deployed when I asked for it to
be.

> So this is *not* a race condition in MAAS. This is working as designed and
> is expected. The problem here is that MAAS takes too long to answer the
> initial request, which causes grub to timeout and move on to request a
> different config file.

Yes, because there is a race condition in the design - the MAC
specific file has to be generated before grub times out.  It could
instead be generated before the node ever starts booting, allowing it
to be served just as fast as the -default-amd64 file is, eliminating
that race condition.
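
A toy model of the timing dependency described above, in Python. It is a
sketch under stated assumptions, not MAAS or grub code: the function name,
its parameters and the returned labels are hypothetical, and the 30 and 60
second figures are only the rough numbers mentioned elsewhere in this bug.

```python
def config_served(render_seconds, client_timeout_seconds, pregenerated=False):
    """Return which config the client ends up booting with.

    Hypothetical model: the server needs `render_seconds` to build the
    MAC-specific config on request, unless it was generated ahead of time.
    The client falls back to the default config once its retries run out.
    """
    response_delay = 0.0 if pregenerated else render_seconds
    if response_delay < client_timeout_seconds:
        return "mac-specific"
    return "default"


# Rendering on request, slower than the client is willing to wait.
print(config_served(render_seconds=60, client_timeout_seconds=30))
# Same slow render, but done before the node powers on.
print(config_served(render_seconds=60, client_timeout_seconds=30,
                    pregenerated=True))
```

In this model the outcome depends only on whether the response beats the
client's timeout, which is the dependency being argued about; it says
nothing about what pre-generation would cost on the rack controllers.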

Jason

> On Mon, Feb 5, 2018 at 4:30 PM, Jason Hobbs 
> wrote:
>
>> The packetdump (comment #35) of MAAS not responding to grub's request
>> for the mac specific grub.cfg before grub times out, and then responding
>> immediately to the generic-amd64 grub cfg, clearly shows a race
>> condition in MAAS.
>>
>> MAAS's design of dynamically generating the interface specific grub
>> config only after it receives the tftp request for it is susceptible to
>> a race condition where grub times out before MAAS can respond.
>>
>> That design is not the only possible design.  All the information
>> required for the interface specific grub.cfg is available before the
>> machine ever powers on, and could be made available on the rack
>> controllers at that time too.
>>
>> Doing so would eliminate that race condition, or at least reduce the
>> opportunity greatly, as we see MAAS has no problems immediately
>> responding and serving files that it doesn't need to dynamically
>> generate at request time.
>>
>> There is still some question around what in the environment is
>> contributing to MAAS not responding faster, and what MAAS is doing while
>> it takes 60+ seconds to respond to the request, but that doesn't change
>> the fact that the current MAAS design is racy (and that's a bug).
>>
>> Whatever we change in the environment to reduce the likelihood of
>> hitting this issue there doesn't solve the underlying race condition in
>> MAAS, and leaves open the possibility of hitting the issue other places
>> too.
>>

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Andres Rodriguez
>
>
> > That being said, because CPU load doesn't show high we are making the
> > *assumption* that it is not impacting MAAS, but again, this is an
> > assumption. Making the requested change for having at least 4 CPUs
> (ideally
> > 6) would allow us to determining what are the effects and see whether
> > there's any difference on behavior and would help identify what other
> > issues.
> >
> > Without having the comparison then we are making it more difficult to
> > isolate the problem.
>
> To improve performance the typical pattern is 1) identify the
> bottleneck 2) eliminate that as the bottleneck 3) repeat.
>
> We have not identified CPU as a bottleneck.  The top data says it is
> not!
>

Jason,

That doesn't change the fact that we are requesting tests to be run with
a different CPU configuration for the VMs, so we can make a *comparison* and
see if there is any material difference or none at all with the current
conditions. While I agree with you that the data /seems/ to show that there
is no issue with CPU, that doesn't change the fact that we don't have any
data to compare with, as there could still be an impact even if it is
minimal.

Without the data, we cannot assert with certainty that there's no issue
caused by CPU usage because we don't have a reference or point of
comparison. So while all fingers seem to be pointing to storage, I strongly
believe it is worth gathering the data now so we can fully discard CPU as
a factor.

If this is something that your environment is unable to do, I would
appreciate that you clarify that instead of asserting that there's no
performance impact in MAAS due to CPU usage, when we don't really know for
sure (e.g. we don't know if MAAS behaves differently with less CPU usage in
the current conditions, and that's data worth gathering to be able to
better support you in the future).

-- 
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Jason Hobbs
+1 Mike.  I agree it's a bug, but there isn't real evidence that
it's what causes the long delay.

On Mon, Feb 5, 2018 at 7:12 PM, Mike Pontillo
 wrote:
> Ah, I see what you mean there; I used the following filter in Wireshark:
>
> udp.dstport == 25305 or udp.srcport == 25305
>
> This is not the behavior I saw if the TFTP request is answered in a
> timely manner, so I suspect that the long delay between the initial
> request and the answer is causing the timeouts to occur in the TFTP
> code, which causes this separate "stacked OACK" bug.
>
> By the time the "stacked OACKs" are sent, it's been over one minute, and
> the client isn't listening for a reply any more. So yes, "stacked OACKs"
> are a real bug, but right now I think they're just distracting from the
> root cause (the long delay).
>


[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Mike Pontillo
Ah, I see what you mean there; I used the following filter in Wireshark:

udp.dstport == 25305 or udp.srcport == 25305

This is not the behavior I saw if the TFTP request is answered in a
timely manner, so I suspect that the long delay between the initial
request and the answer is causing the timeouts to occur in the TFTP
code, which causes this separate "stacked OACK" bug.

By the time the "stacked OACKs" are sent, it's been over one minute, and
the client isn't listening for a reply any more. So yes, "stacked OACKs"
are a real bug, but right now I think they're just distracting from the
root cause (the long delay).


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Steve Langasek
On Tue, Feb 06, 2018 at 12:11:21AM -, Mike Pontillo wrote:
> Steve, can you be more specific about which packet capture showed the
> "stacked OACK" behavior?

This was the first packet capture that Jason posted, in comment #30.  The
udp retransmits shown in packets 6262-6268 each receive an answering packet
in 6270-6271,6273-6277, in addition to 6269 as an answer to 6261.  For
whatever reason, wireshark here does not decipher these duplicate OACK
packets as OACK, but an examination of the raw packets shows that's clearly
what they are.

> I looked at a packet capture Andres pointed me to, and don't see the
> "stacked OACKs" you describe. Each TFTP transaction (per RFC 1350) is
> indicated by the (source port, dest port) tuple, and I see that MAAS
> correctly OACKs each individual transaction (per RFC 2347) - not the
> retry packets within the same transaction.

Packets 6269-6271,6273-6277 are all answers to the same port on the client.
They don't have the same source port, because MAAS has allocated a separate
source port for each of these.  It's not acking a separate individual
transaction, it's MAAS /creating/ a separate transaction (with the
allocation of a separate source port) for each one.

RFC2347 does not speak to this; the discussion of the port negotiation is in
RFC1350 §4:

   In order to create a connection, each end of the connection chooses a
   TID for itself, to be used for the duration of that connection.  The
   TID's chosen for a connection should be randomly chosen, so that the
   probability that the same number is chosen twice in immediate
   succession is very low.  Every packet has associated with it the two
   TID's of the ends of the connection, the source TID and the
   destination TID.  These TID's are handed to the supporting UDP (or
   other datagram protocol) as the source and destination ports.  A
   requesting host chooses its source TID as described above, and sends
   its initial request to the known TID 69 decimal (105 octal) on the
   serving host.  The response to the request, under normal operation,
   uses a TID chosen by the server as its source TID and the TID chosen
   for the previous message by the requestor as its destination TID.
   The two chosen TID's are then used for the remainder of the transfer.

MAAS responds to 8 udp retransmits on srcport=25305, dstport=69 by sending 8
independent OACK packets back to dstport=25305 each from a different source
port.

Since Andres confirms that these duplicate acks still only result in one
database query, this may be a negligible bug if the only impact is duplicate
small udp packets.  OTOH, depending on how MAAS implements this, it could
also result in port exhaustion on the server if unanswered OACKs are allowed
to linger.
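
To make the distinction concrete, here is a minimal Python sketch of a
server that treats retransmitted read requests from the same client TID as
one transfer instead of opening a new one per packet. It is not how MAAS or
its TFTP library is implemented; the class, its methods and the port
numbers are hypothetical. It also hands out server ports sequentially to
keep the example deterministic, whereas RFC 1350 asks for randomly chosen
TIDs.

```python
import itertools


class TftpSessions:
    """Track one transfer per (client address, client port) pair."""

    def __init__(self, first_port=50000):
        # Sequential server-side TIDs, for a reproducible example only.
        self._ports = itertools.count(first_port)
        self._by_client = {}

    def handle_rrq(self, client_ip, client_port, filename):
        """Return (server_port, is_new) for an incoming read request."""
        key = (client_ip, client_port)
        session = self._by_client.get(key)
        if session is not None and session["filename"] == filename:
            # Retransmit of a request we already know: reuse its TID.
            return session["server_port"], False
        server_port = next(self._ports)
        self._by_client[key] = {
            "filename": filename,
            "server_port": server_port,
        }
        return server_port, True


sessions = TftpSessions()
# Eight copies of the same request from client port 25305.
replies = [sessions.handle_rrq("10.0.0.5", 25305, "grub.cfg")
           for _ in range(8)]
print(len({port for port, _ in replies}))
```

With this bookkeeping the eight retransmits map to a single server port, so
at most one OACK stream exists per client TID and no extra ports are held
open for requests the client has already abandoned.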


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Jason Hobbs
On Mon, Feb 5, 2018 at 3:45 PM, Andres Rodriguez
 wrote:
> @Jason,
>
> On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs 
> wrote:
>
>> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez
>>  wrote:
>> > No new data was provided to mark this New in MAAS:
>> >
>> > 1. Changes to the storage seem to have improved things
>>
>> Yes, it has.  That doesn't change whether or not there is a bug in
>> MAAS.  Can you please address the critical log errors that I mentioned
>> in comment #36?  This seems like enough to establish something is
>> going wrong in MAAS.
>>
>
> The tftp issue shows no evidence this is causing any booting failures. We
> have seen this issue before and confirmed that it doesn't cause boot
> issues. See [1]. If you want to try it, it is available in
> ppa:maas/proposed.
>
> [1].https://bugs.launchpad.net/maas/+bug/1376483

It's been "Fixed" multiple times before, in your link above and also
in bug 1724677, but we still see them, very suspiciously around the
time of failures.  I'm not convinced these are actually understood.
Do you have a specific commit, or some idea of which change addresses
these in 2.4?

> As far as the postgresql logs with "maas@maasdb ERROR: could not serialize
> access due to concurrent update", that is *not* a bug in MAAS or an issue.
> Those are perfectly normal messages with the isolation level the MAAS DB is
> running with. This basically means something is trying to update the db
> while something else is updating it, and MAAS already handles this by
> doing retries.

That is just one type of db error in the log.  There are many more.
Here's one that says there was a deadlock detected. That's not normal,
OK behavior, is it?

http://paste.ubuntu.com/26527181/
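
For readers following along, the two error classes under discussion map to
different PostgreSQL SQLSTATE codes: 40001 (serialization_failure) and
40P01 (deadlock_detected). The Python sketch below shows the general shape
of a retry loop keyed on those codes. It is an assumption-laden
illustration, not MAAS's retry logic: `DbError`, `run_with_retries` and the
attempt count are hypothetical, and whether MAAS retries on deadlocks at
all is exactly the open question here.

```python
# SQLSTATE codes PostgreSQL uses for errors that are usually safe to retry.
RETRYABLE_SQLSTATES = {
    "40001",  # serialization_failure
    "40P01",  # deadlock_detected
}


class DbError(Exception):
    """Hypothetical stand-in for a database driver error."""

    def __init__(self, sqlstate):
        super().__init__(sqlstate)
        self.sqlstate = sqlstate


def run_with_retries(transaction, attempts=3):
    """Call `transaction` again when it fails with a retryable SQLSTATE."""
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except DbError as error:
            out_of_attempts = attempt == attempts
            if error.sqlstate not in RETRYABLE_SQLSTATES or out_of_attempts:
                # Not retryable, or we already tried often enough.
                raise
```

Every retry re-runs the whole transaction, so under heavy contention a loop
like this adds latency to the request that triggered it even when it
eventually succeeds.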

>> > 2. No tests have been run with fixed grub that have caused boot
>> failures.
>>
>> The comments from #56 were testing with the fixed grub - sorry if that
>> wasn't clear.
>>
>> > 3. AFAIK, the VM config has not changed to use less CPU to compare
>> results and whether this config change causes the bugs in question.
>>
>> The CPU load data from comments #48 and #50 shows that CPU load is not
>> the problem.  The max load average was under 12 on a 20 thread system.
>> That means there was lots of free CPU time, and that this workload is
>> not CPU bound.
>>
>
> CPU load is not CPU utilization. We know that at the time there are 6 other
> VMs with 150%+ CPU usage writing to the disk because they are being
> deployed and/or configured (e.g. software installation).  Correct me if I'm
> wrong, but this can cause whatever is writing to disk to be prioritized
> over anything else, such as the MAAS processes' access to resources.

The 150%+ number you are seeing is that process using all of 1 core
(hyperthread) and 50% of another (it's a multithreaded process).  This
does not mean the process is using 150% of the entire CPU capacity.

We don't just have load average - we also have a breakdown of CPU
utilization from top, every 5 seconds:

%Cpu(s): 23.5 us,  6.3 sy,  0.0 ni, 62.9 id,  7.2 wa,  0.0 hi,  0.1 si,  0.0 st

The top man page has more to say about this line, but have a look at
the 'id' number. It's the % of cpu time spent in the idle process
(nothing to do) in the sample period (5 seconds in the above logs).

The lowest that number ever goes in the logs I posted is 52%, meaning
over any 5 second period, we never use more than half of the available
CPU capacity.
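
The arithmetic can be checked with a few lines of Python. This is a small
sketch that only parses the summary line pasted above; the function names
are made up for the example, and the 20-thread figure is the one given for
this host earlier in the thread.

```python
import re


def parse_cpu_line(line):
    """Parse a top '%Cpu(s):' summary line into {field: percent}."""
    pairs = re.findall(r"([\d.]+)\s+([a-z]{2})", line)
    return {name: float(value) for value, name in pairs}


def share_of_machine(process_percent, threads):
    """Convert top's per-process %CPU (100 = one thread) to whole-host %."""
    return process_percent / threads


line = ("%Cpu(s): 23.5 us,  6.3 sy,  0.0 ni, 62.9 id,"
        "  7.2 wa,  0.0 hi,  0.1 si,  0.0 st")
fields = parse_cpu_line(line)

# Everything that is not idle time counts as busy (about 37.1 here).
busy = 100.0 - fields["id"]
print(round(busy, 1))

# A process shown at 150% on a 20-thread host.
print(share_of_machine(150, 20))
```

Note that the busy figure includes the `wa` (I/O wait) share, so part of
the non-idle time in that sample is the CPU waiting on storage rather than
doing work.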

> That being said, because CPU load doesn't show high we are making the
> *assumption* that it is not impacting MAAS, but again, this is an
> assumption. Making the requested change to have at least 4 CPUs (ideally
> 6) would allow us to determine what the effects are, see whether there's
> any difference in behavior, and help identify other issues.
>
> Without having the comparison then we are making it more difficult to
> isolate the problem.

To improve performance the typical pattern is 1) identify the
bottleneck 2) eliminate that as the bottleneck 3) repeat.

We have not identified CPU as a bottleneck.  The top data says it is
not!

In the absence of data showing the CPU as being the bottleneck,
reducing CPU usage doesn't help identify the performance blocker,
because it may just move the bottleneck.  For example, it may cause
the processes that are doing disk I/O to not get scheduled to run as
much, which may then reduce the amount of disk I/O they can do, which
may alleviate the issue, but not because MAAS was CPU starved before
and now isn't.  Better to reduce the storage contention in the first
place, if the data shows that storage contention is the bottleneck.

In this case we had data from iotop that indicated storage contention
as the bottleneck, and reducing it seems to have alleviated the
problem, as we haven't hit the failure since then. We're going to take
more steps to alleviate storage contention even more soon, by making
sure 

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Jason Hobbs
@Mike, you can see the stacked response behavior in
https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5046952/+files/spearow-fall-back-to-default-amd64.pcap

You can tell packet 90573 is a response to the requests for
grub.cfg- because its destination port (25305) is the src port
the request for grub.cfg- was coming from (packets 2 through 38).

On Mon, Feb 5, 2018 at 6:11 PM, Mike Pontillo
 wrote:
> Steve, can you be more specific about which packet capture showed the
> "stacked OACK" behavior?
>
> I looked at a packet capture Andres pointed me to, and don't see the
> "stacked OACKs" you describe. Each TFTP transaction (per RFC 1350) is
> indicated by the (source port, dest port) tuple, and I see that MAAS
> correctly OACKs each individual transaction (per RFC 2347) - not the
> retry packets within the same transaction. Subsequently (in the same
> second, after the client ACKs the data packet) it re-requests the same
> file (which is the bug in grub that I understand is fixed), and then the
> client starts a new transaction and MAAS correctly issues another OACK.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1743249
>
> Title:
>   Failed Deployment after timeout trying to retrieve grub cfg
>
> Status in MAAS:
>   New
> Status in grub2 package in Ubuntu:
>   In Progress
>
> Bug description:
>   A node failed to deploy after it failed to retrieve a grub.cfg from
>   MAAS due to a timeout.  In the logs, it's clear that the server tried
>   to retrieve the grub cfg many times, over about 30 seconds:
>
>   http://paste.ubuntu.com/26387256/
>
>   We see the same thing for other hosts around the same time:
>
>   http://paste.ubuntu.com/26387262/
>
>   It seems like MAAS is taking way too long to respond to these
>   requests.
>
>   This is very similar to bug 1724677, which was happening pre-
>   meltdown/spectre. The only difference is we don't see "[critical] TFTP
>   back-end failed" in the logs anymore.
>
>   I connected to the console on this system and it had errors about
>   timing out retrieving the grub-cfg, then it had an error message along
>   the lines of "error not an ip" and then "double free".  After I
>   connected but before I could get a screenshot the system rebooted and
>   was directed by maas to power off, which it did successfully after
>   booting to linux.
>
>   Full logs are available here:
>   https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-
>   ed277a020e7c/cpe_cloud_395/infra-logs.tar
>
>   This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions


[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Mike Pontillo
Steve, can you be more specific about which packet capture showed the
"stacked OACK" behavior?

I looked at a packet capture Andres pointed me to, and don't see the
"stacked OACKs" you describe. Each TFTP transaction (per RFC 1350) is
indicated by the (source port, dest port) tuple, and I see that MAAS
correctly OACKs each individual transaction (per RFC 2347) - not the
retry packets within the same transaction. Subsequently (in the same
second, after the client ACKs the data packet) it re-requests the same
file (which is the bug in grub that I understand is fixed), and then the
client starts a new transaction and MAAS correctly issues another OACK.
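For anyone who wants to check a capture the same way, here is a minimal Python sketch of the bookkeeping described above. It assumes the packets have already been reduced to (source port, destination port, opcode) records; the `Packet` type, the `summarize` helper and the port numbers in the usage note are illustrative only and are not taken from MAAS or from any capture tool. It keys on the client's port, since that is what ties a server reply back to the request, and counts read requests and OACKs per client port.

```python
from collections import defaultdict
from typing import NamedTuple

TFTP_PORT = 69      # well-known port that initial requests are sent to
RRQ, OACK = 1, 6    # opcodes from RFC 1350 (RRQ) and RFC 2347 (OACK)


class Packet(NamedTuple):
    """Hypothetical, already-decoded view of one TFTP packet."""
    src_port: int
    dst_port: int
    opcode: int


def summarize(packets):
    """Count RRQs sent and OACKs received, keyed by the client's port."""
    stats = defaultdict(lambda: {"rrq": 0, "oack": 0})
    for pkt in packets:
        if pkt.opcode == RRQ and pkt.dst_port == TFTP_PORT:
            # Retries of the same request reuse the client's source port.
            stats[pkt.src_port]["rrq"] += 1
        elif pkt.opcode == OACK:
            # The server answers to the port the request came from.
            stats[pkt.dst_port]["oack"] += 1
    return dict(stats)
```

A client port that shows several RRQs and the same number of OACKs would match the "stacked" reading; one OACK per client port would match the per-transaction reading.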


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Steve Langasek
On Mon, Feb 05, 2018 at 09:27:15PM -, Andres Rodriguez wrote:
> MAAS already has a mechanism to collapse retries into the initial request.

Are we certain that this is working correctly?  If so, why are packet
captures showing that MAAS is sending stacked tftp OACK responses, 1:1 for
the duplicate incoming requests?

It's clear to me that MAAS's handling at the wire level is incorrect - 10
retries of the same tftp request should result in a single OACK, not 10 of
them (unless MAAS receives a retry *after* it has sent its OACK).  I don't
know if that also means that MAAS is inefficiently translating these into
database requests on the backend.  It had been suggested in this bug log and
on IRC that MAAS *was* sending duplicate db requests for each of these
packets; OTOH the timing of the stacked responses shows no latency in
between them that would imply additional db round-trips.

I think someone needs to directly inspect the behavior of a running MAAS
server in this scenario to be sure.

> In this case, it is the rack that grabs the requests and makes a request to
> the region. If retries come within the time that the rack is waiting for a
> response from the region, these requests get "ignored" and the Rack will
> only answer the first request.

That is absolutely contradicted by the packet captures.  The rack does not
ignore the additional requests, it answers *ALL* of the requests.  It's only
the *client* that consolidates the duplicate responses from MAAS.  (And
then, because of a grub bug higher up the stack, re-requests the same file
that it has already received.)


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Andres Rodriguez
@Jason,

The pcap shows exactly the behavior I was hoping to see, which is that grub
tries to get X config first and, since it doesn't get a response, moves
on and tries to get Y config.

On Mon, Feb 5, 2018 at 4:45 PM, Jason Hobbs 
wrote:

> On Mon, Feb 5, 2018 at 3:27 PM, Andres Rodriguez
>  wrote:
> > @Steve,
> >
> > MAAS already has a mechanism to collapse retries into the initial
> > request. In this case, it is the rack that grabs the requests and makes
> > a request to the region. If retries come within the time that the rack
> > is waiting for a response from the region, these requests get "ignored"
> > and the Rack will only answer the first request. This is what the logs
> > show after testing with fixed grub, where grub makes multiple requests
> > and MAAS answers seconds after those requests, but only answers once.
> > This is because the requests were collapsed on the maas side.
> >
> > If, however, the retries come in after the region has answered the rack,
> > then these requests will be served.
>
> This is not true.  MAAS is responding to every single request grub
> makes for the file - the tcpdump logs show it.   And these are not
> "read 4 times" requests - they are retries because grub didn't get a
> response.
>
> This pcap shows MAAS responding to every request for grub.cfg-:
> https://bugs.launchpad.net/maas/+bug/1743249/+attachment/
> 5046952/+files/spearow-fall-back-to-default-amd64.pcap
>
> Jason
>
> >
> > On Mon, Feb 5, 2018 at 2:34 PM, Steve Langasek <
> steve.langa...@canonical.com
> >> wrote:
> >
> >> Jason's feedback was that, after making the changes to the storage
> >> configuration of his environment, deploying the test grubx64.efi doesn't
> >> have any effect on the MAAS server's response time to tftp requests.  So
> >> at this point it's not at all clear that the grub change, while correct,
> >> helps with this high-level symptom.
> >>
> >> It has also been suggested that each udp retry is generating a separate
> >> database query from MAAS.  That is absolutely a MAAS bug if true, and
> >> not something that can or should be fixed in GRUB.
> >>
> >> ** Changed in: grub2 (Ubuntu)
> >>Importance: Critical => Medium
> >>
> >> --
> >> You received this bug notification because you are subscribed to MAAS.
> >> https://bugs.launchpad.net/bugs/1743249
> >>
> >> Title:
> >>   Failed Deployment after timeout trying to retrieve grub cfg
> >>
> >> To manage notifications about this bug go to:
> >> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
> >>
> >> Launchpad-Notification-Type: bug
> >> Launchpad-Bug: product=maas; milestone=2.4.x; status=Incomplete;
> >> importance=Undecided; assignee=None;
> >> Launchpad-Bug: distribution=ubuntu; sourcepackage=grub2; component=main;
> >> status=In Progress; importance=Medium; assignee=mathieu...@gmail.com;
> >> Launchpad-Bug-Tags: cdo-qa cdo-qa-blocker foundations-engine patch
> >> Launchpad-Bug-Information-Type: Public
> >> Launchpad-Bug-Private: no
> >> Launchpad-Bug-Security-Vulnerability: no
> >> Launchpad-Bug-Commenters: andreserl blake-rouse cgregan jason-hobbs
> vorlon
> >> Launchpad-Bug-Reporter: Jason Hobbs (jason-hobbs)
> >> Launchpad-Bug-Modifier: Steve Langasek (vorlon)
> >> Launchpad-Message-Rationale: Subscriber (MAAS)
> >> Launchpad-Message-For: andreserl
> >>
> >
> >
> > --
> > Andres Rodriguez (RoAkSoAx)
> > Ubuntu Server Developer
> > MSc. Telecom & Networking
> > Systems Engineer
> >

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Andres Rodriguez
I think there's a misunderstanding about how the network boot process
happens. Let's look at PXELINUX first. PXELINUX does this:

1. Tries the UUID first # if no answer, it moves on
2. Tries the MAC address # if no answer, it moves on
3. Tries the full IP address # if no answer, it moves on
4. Tries a partial IP address # if no answer, it moves on
5. Repeats step 4 with a shorter partial address
6. Repeats step 4 again
[...]
7. Boots default.

This can be seen in here:

/mybootdir/pxelinux.cfg/b8945908-d6a6-41a9-611d-74a6ab80b83d
/mybootdir/pxelinux.cfg/01-88-99-aa-bb-cc-dd
/mybootdir/pxelinux.cfg/C0A8025B
/mybootdir/pxelinux.cfg/C0A8025
/mybootdir/pxelinux.cfg/C0A802
/mybootdir/pxelinux.cfg/C0A80
/mybootdir/pxelinux.cfg/C0A8
/mybootdir/pxelinux.cfg/C0A
/mybootdir/pxelinux.cfg/C0
/mybootdir/pxelinux.cfg/C
/mybootdir/pxelinux.cfg/default
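As a rough illustration of the search order listed above, here is a short Python sketch that generates the same sequence of candidate file names. The function name and its parameters are made up for this example; the "01-" prefix on the MAC entry is the ARP hardware type for Ethernet, and the IP entries are the address in uppercase hex with one digit dropped each time.

```python
import ipaddress


def pxelinux_config_candidates(uuid, mac, ip,
                               prefix="/mybootdir/pxelinux.cfg"):
    """Return the config paths a PXELINUX client tries, in order."""
    names = [uuid.lower()]
    # MAC entry: ARP type 01 (Ethernet), then the address with dashes.
    names.append("01-" + mac.lower().replace(":", "-"))
    # IP entries: 8 uppercase hex digits, then progressively shorter.
    hex_ip = "%08X" % int(ipaddress.IPv4Address(ip))
    names.extend(hex_ip[:length] for length in range(8, 0, -1))
    names.append("default")
    return [prefix + "/" + name for name in names]
```

Calling it with the UUID, MAC 88:99:AA:BB:CC:DD and IP 192.168.2.91 from the listing above reproduces the eleven paths shown there.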


That said, in the case of grub, the behavior is similar. You have
described this behavior in comment #16. So what is happening is:

1. grub is trying grub.cfg- address multiple times, but since it
doesn't get a response, it gives up.
2. Once it gives up, grub.cfg-default-amd64 is tried instead.

That said, the requests are handled completely differently. The -
request actually accesses the *node* object in the database by searching
for it with the MAC address the request is made from. With this node
object, we generate the config file.

In comparison, the -default-amd64 request does *not* access the node
object. It just accesses two config settings, and the db query is *much*
cheaper. Also, we have to keep in mind that after grub has done many
retries, this returns rather fast in comparison because it is not only
cheaper, but at that point MAAS may have far fewer queued DB requests.
Either way, grub giving up means that it won't wait for a response to the
initial request, but it will expect a response for the new file it asked
for.

That said, this is working *exactly* as expected, because this effectively
tells grub "if config for your MAC address was not returned, you can safely
assume you are an unknown machine to MAAS", hence grub requests a different
config file to start the enlistment process.

So this is *not* a race condition in MAAS. This is working as designed and
is expected. The problem here is that MAAS takes too long to answer the
initial request, which causes grub to timeout and move on to request a
different config file.

On Mon, Feb 5, 2018 at 4:30 PM, Jason Hobbs 
wrote:

> The packetdump (comment #35) of MAAS not responding to grub's request
> for the mac specific grub.cfg before grub times out, and then responding
> immediately to the generic-amd64 grub cfg, clearly shows a race
> condition in MAAS.
>
> MAAS's design of dynamically generating the interface specific grub
> config only after it receives the tftp request for it is susceptible to
> a race condition where grub times out before MAAS can respond.
>
> That design is not the only possible design.  All the information
> required for the interface specific grub.cfg is available before the
> machine ever powers on, and could be made available on the rack
> controllers at that time too.
>
> Doing so would eliminate that race condition, or at least reduce the
> opportunity greatly, as we see MAAS has no problems immediately
> responding and serving files that it doesn't need to dynamically
> generate at request time.
>
> There is still some question around what in the environment is
> contributing to MAAS not responding faster, and what MAAS is doing while
> it takes 60+ seconds to respond to the request, but that doesn't change
> the fact that the current MAAS design is racy (and that's a bug).
>
> Whatever we change in the environment to reduce the likelihood of
> hitting this issue there doesn't solve the underlying race condition in
> MAAS, and leaves open the possibility of hitting the issue other places
> too.
>
> --
> You received this bug notification because you are subscribed to MAAS.
> https://bugs.launchpad.net/bugs/1743249
>
> Title:
>   Failed Deployment after timeout trying to retrieve grub cfg
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
>
> Launchpad-Notification-Type: bug
> Launchpad-Bug: product=maas; milestone=2.4.x; status=New;
> importance=Undecided; assignee=None;
> Launchpad-Bug: distribution=ubuntu; sourcepackage=grub2; component=main;
> status=In Progress; importance=Medium; assignee=mathieu...@gmail.com;
> Launchpad-Bug-Tags: cdo-qa cdo-qa-blocker foundations-engine patch
> Launchpad-Bug-Information-Type: Public
> Launchpad-Bug-Private: no
> Launchpad-Bug-Security-Vulnerability: no
> Launchpad-Bug-Commenters: andreserl blake-rouse cgregan jason-hobbs vorlon
> Launchpad-Bug-Reporter: Jason Hobbs (jason-hobbs)
> Launchpad-Bug-Modifier: Jason Hobbs (jason-hobbs)
> Launchpad-Message-Rationale: Subscriber (MAAS)
> Launchpad-Message-For: andreserl
>


-- 
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Jason Hobbs
On Mon, Feb 5, 2018 at 3:27 PM, Andres Rodriguez
 wrote:
> @Jason,
>
>
> On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs 
> wrote:
>
>> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez
>>  wrote:
>> > No new data was provided to mark this New in MAAS:
>> >
>> > 1. Changes to the storage seem to have improved things
>>
>> Yes, it has.  That doesn't change whether or not there is a bug in
>> MAAS.  Can you please address the critical log errors that I mentioned
>> in comment #36?  This seems like enough to establish something is
>> going wrong in MAAS.
>>
>>
> The bugs you have raised in #36 have already been fixed.

Where?


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Steve Langasek
On Mon, Feb 05, 2018 at 08:40:56PM -, Jason Hobbs wrote:
> @Steve - I don't think it helps with the problem of MAAS taking a long
> time to respond to the grub.cfg request.  However, it may help with the
> part of this bug where grub is hitting an error and asking for keyboard
> input.  https://imgur.com/a/as8Sx

> Maybe that should be a separate bug?  It seems like grub should never
> ask for user keyboard input on a server.

Perhaps that bug is fixed as a side effect of the grub change.

But what do you think the correct behavior should be when grub cannot find
the file that it needs in order to boot?  Should grub enter a boot loop,
retrying endlessly?  Should it try to halt the system?  Why is either of
these options more correct than putting the machine to a console prompt?


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Andres Rodriguez
@Jason,

On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs 
wrote:

> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez
>  wrote:
> > No new data was provided to mark this New in MAAS:
> >
> > 1. Changes to the storage seem to have improved things
>
> Yes, it has.  That doesn't change whether or not there is a bug in
> MAAS.  Can you please address the critical log errors that I mentioned
> in comment #36?  This seems like enough to establish something is
> going wrong in MAAS.
>

There is no evidence that the tftp issue is causing any boot failures. We
have seen this issue before and confirmed that it doesn't cause boot
issues. See [1]. If you want to try it, it is available in
ppa:maas/proposed.

[1]. https://bugs.launchpad.net/maas/+bug/1376483

As far as the postgresql logs with "maas@maasdb ERROR: could not serialize
access due to concurrent update" go, that is *not* a bug in MAAS or an
issue. Those are perfectly normal messages with the isolation level the
MAAS DB is running with. It basically means one transaction is trying to
update the db while another is updating it, and MAAS already handles this
by doing retries.
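To illustrate what "handles this by doing retries" generally looks like, here is a minimal Python sketch of the pattern. It is not the MAAS code: `SerializationFailure` is a stand-in for whatever exception the database driver raises for PostgreSQL's serialization failure (SQLSTATE 40001), and `run_with_retries` and its parameters are hypothetical names chosen for this example.

```python
import random
import time


class SerializationFailure(Exception):
    """Stand-in for the driver's SQLSTATE 40001 error (assumption)."""


def run_with_retries(transaction, attempts=5, base_delay=0.01):
    """Run transaction(), retrying when it loses a serialization race."""
    for attempt in range(attempts):
        try:
            return transaction()
        except SerializationFailure:
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller see the failure
            # Randomized exponential backoff so that competing
            # transactions do not collide again in lockstep.
            time.sleep(base_delay * (2 ** attempt) * random.random())
```

With a wrapper like this the error still shows up in the postgresql log each time a transaction loses the race, which is why the message on its own does not indicate a failure.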


> > 2. No tests have been run with fixed grub that have caused boot
> failures.
>
> The comments from #56 were testing with the fixed grub - sorry if that
> wasn't clear.
>
> > 3. AFAIK, the VM config has not changed to use less CPU to compare
> results and whether this config change causes the bugs in question.
>
> The CPU load data from comments #48 and #50 shows that CPU load is not
> the problem.  The max load average was under 12 on a 20 thread system.
> That means there was lots of free CPU time, and that this workload is
> not CPU bound.
>

CPU load is not CPU utilization. We know that at the time there are 6
other VMs with 150%+ CPU usage writing to the disk because they are being
deployed and/or configured (e.g. software installation).  Correct me if
I'm wrong, but this can cause whatever is writing to disk to be
prioritized over anything else, such as the MAAS processes' access to
resources.

That being said, because CPU load doesn't look high we are making the
*assumption* that it is not impacting MAAS, but again, this is an
assumption. Making the requested change to have at least 4 CPUs (ideally
6) would allow us to determine what the effects are, see whether there's
any difference in behavior, and help identify any other issues.

Without that comparison we are making it more difficult to isolate the
problem.

>
> Jason
>
>
> ** Changed in: maas
>Status: Incomplete => New
>


-- 
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Jason Hobbs
On Mon, Feb 5, 2018 at 3:27 PM, Andres Rodriguez
 wrote:
> @Steve,
>
> MAAS already has a mechanism to collapse retries into the initial request.
> In this case, it is the rack that grabs the requests and makes a request to
> the region. If retries come within the time that the rack is waiting for a
> response from the region, these request get "ignored" and the Rack will
> only answer the first request. This is what the logs show after testing
> with fixed grub, where grub makes multiple requests and MAAS answers
> seconds after those requests, but only answers once. This is because the
> requests were collapsed on the maas side.
>
> If, however, the retries come in after the region has answered the rack,
> then these requests will be served.

This is not true.  MAAS is responding to every single request grub
makes for the file - the tcpdump logs show it.   And these are not
"read 4 times" requests - they are retries because grub didn't get a
response.

This pcap shows MAAS responding to every request for grub.cfg-:
https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5046952/+files/spearow-fall-back-to-default-amd64.pcap

Jason

>
> On Mon, Feb 5, 2018 at 2:34 PM, Steve Langasek > wrote:
>
>> Jason's feedback was that, after making the changes to the storage
>> configuration of his environment, deploying the test grubx64.efi doesn't
>> have any effect on the MAAS server's response time to tftp requests.  So
>> at this point it's not at all clear that the grub change, while correct,
>> helps with this high-level symptom.
>>
>> It has also been suggested that each udp retry is generating a separate
>> database query from MAAS.  That is absolutely a MAAS bug if true, and
>> not something that can or should be fixed in GRUB.
>>
>> ** Changed in: grub2 (Ubuntu)
>>Importance: Critical => Medium
>>
>
>
> --
> Andres Rodriguez (RoAkSoAx)
> Ubuntu Server Developer
> MSc. Telecom & Networking
> Systems Engineer
>


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Andres Rodriguez
@Jason,


On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs 
wrote:

> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez
>  wrote:
> > No new data was provided to mark this New in MAAS:
> >
> > 1. Changes to the storage seem to have improved things
>
> Yes, it has.  That doesn't change whether or not there is a bug in
> MAAS.  Can you please address the critical log errors that I mentioned
> in comment #36?  This seems like enough to establish something is
> going wrong in MAAS.
>
>
The bugs you have raised in #36 have already been fixed.


> > 2. No tests have been run with fixed grub that have caused boot
> failures.
>
> The comments from #56 were testing with the fixed grub - sorry if that
> wasn't clear.
>
> > 3. AFAIK, the VM config has not changed to use less CPU to compare
> results and whether this config change causes the bugs in question.
>
> The CPU load data from comments #48 and #50 shows that CPU load is not
> the problem.  The max load average was under 12 on a 20 thread system.
> That means there was lots of free CPU time, and that this workload is
> not CPU bound.
>
> Jason
>
>
> ** Changed in: maas
>Status: Incomplete => New
>


-- 
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer


[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Jason Hobbs
The packetdump (comment #35) of MAAS not responding to grub's request
for the mac specific grub.cfg before grub times out, and then responding
immediately to the generic-amd64 grub cfg, clearly shows a race
condition in MAAS.

MAAS's design of dynamically generating the interface specific grub
config only after it receives the tftp request for it is susceptible to
a race condition where grub times out before MAAS can respond.

That design is not the only possible design.  All the information
required for the interface specific grub.cfg is available before the
machine ever powers on, and could be made available on the rack
controllers at that time too.

Doing so would eliminate that race condition, or at least reduce the
opportunity greatly, as we see MAAS has no problems immediately
responding and serving files that it doesn't need to dynamically
generate at request time.

There is still some question around what in the environment is
contributing to MAAS not responding faster, and what MAAS is doing while
it takes 60+ seconds to respond to the request, but that doesn't change
the fact that the current MAAS design is racy (and that's a bug).

Whatever we change in the environment to reduce the likelihood of
hitting this issue there doesn't solve the underlying race condition in
MAAS, and leaves open the possibility of hitting the issue other places
too.
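To make the alternative design concrete, here is a small Python sketch of the idea: render the interface-specific config ahead of time and only do a lookup when the tftp request arrives. This is an illustration of the proposal, not how MAAS works today; the `PrerenderedConfigs` class, its method names and the `render` callback are all hypothetical.

```python
class PrerenderedConfigs:
    """Hypothetical store of configs rendered before a machine boots."""

    def __init__(self, render):
        self._render = render   # callable: mac -> config text (may be slow)
        self._configs = {}      # mac -> already-rendered config text

    def prepare(self, mac):
        """Do the expensive rendering up front, e.g. before power-on."""
        key = mac.lower()
        self._configs[key] = self._render(key)

    def serve(self, mac):
        """Request-time path: a dictionary lookup, no rendering."""
        return self._configs.get(mac.lower())
```

In this sketch `serve` returns None for a machine that was never prepared, which would correspond to falling through to the default config.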


Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Andres Rodriguez
@Steve,

MAAS already has a mechanism to collapse retries into the initial request.
In this case, it is the rack that receives the requests and makes a request
to the region. If retries arrive while the rack is waiting for a response
from the region, these requests get "ignored" and the rack will only answer
the first request. This is what the logs show after testing with the fixed
grub, where grub makes multiple requests and MAAS answers seconds after
those requests, but only answers once. This is because the requests were
collapsed on the MAAS side.

If, however, the retries come in after the region has answered the rack,
then these requests will be served.
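
The collapsing behaviour described above can be sketched roughly as follows. This is an illustration only, written with Python's asyncio rather than taken from MAAS; `Coalescer`, `slow_fetch` and the key `"mac-a"` are made-up names, and the sleep stands in for the rack waiting on the region.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class Coalescer:
    """Share one in-flight fetch among all requests for the same key."""

    def __init__(self, fetch: Callable[[str], Awaitable[str]]) -> None:
        self._fetch = fetch
        self._in_flight: Dict[str, "asyncio.Future[str]"] = {}

    async def get(self, key: str) -> str:
        pending = self._in_flight.get(key)
        if pending is None:
            # First request for this key: start the fetch and remember it.
            pending = asyncio.ensure_future(self._fetch(key))
            self._in_flight[key] = pending
            # Forget the entry once it completes, so later requests fetch again.
            pending.add_done_callback(lambda _: self._in_flight.pop(key, None))
        return await pending


async def demo() -> int:
    calls = 0

    async def slow_fetch(key: str) -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.05)  # stands in for the rack -> region round trip
        return f"config for {key}"

    coalescer = Coalescer(slow_fetch)
    # Three retries arrive while the first fetch is still pending: one fetch.
    await asyncio.gather(*(coalescer.get("mac-a") for _ in range(3)))
    # A retry after the answer came back is treated as new: a second fetch.
    await coalescer.get("mac-a")
    return calls


fetch_count = asyncio.run(demo())
```

The second fetch in the demo corresponds to the case in the last paragraph: once the answer has come back, nothing remembers it, so a late retry is served as a fresh request.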


On Mon, Feb 5, 2018 at 2:34 PM, Steve Langasek  wrote:

> Jason's feedback was that, after making the changes to the storage
> configuration of his environment, deploying the test grubx64.efi doesn't
> have any effect on the MAAS server's response time to tftp requests.  So
> at this point it's not at all clear that the grub change, while correct,
> helps with this high-level symptom.
>
> It has also been suggested that each udp retry is generating a separate
> database query from MAAS.  That is absolutely a MAAS bug if true, and
> not something that can or should be fixed in GRUB.
>
> ** Changed in: grub2 (Ubuntu)
>Importance: Critical => Medium


-- 
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Jason Hobbs
@Steve - I don't think it helps with the problem of MAAS taking a long
time to respond to the grub.cfg request.  However, it may help with the
part of this bug where grub is hitting an error and asking for keyboard
input.  https://imgur.com/a/as8Sx

Maybe that should be a separate bug?  It seems like grub should never
ask for user keyboard input on a server.

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Jason Hobbs
On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez
 wrote:
> No new data was provided to mark this New in MAAS:
>
> 1. Changes to the storage seem to have improved things

Yes, it has.  That doesn't change whether or not there is a bug in
MAAS.  Can you please address the critical log errors that I mentioned
in comment #36?  This seems like enough to establish something is
going wrong in MAAS.

> 2. No tests have been run with fixed grub that have caused boot
> failures.

The comments from #56 were testing with the fixed grub - sorry if that
wasn't clear.

> 3. AFAIK, the VM config has not changed to use less CPU to compare
> results and whether this config change causes the bugs in question.

The CPU load data from comments #48 and #50 shows that CPU load is not
the problem.  The max load average was under 12 on a 20 thread system.
That means there was lots of free CPU time, and that this workload is
not CPU bound.
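
As a quick check of that arithmetic, a small sketch (the helper name `load_per_thread` is made up for this example):

```python
def load_per_thread(load_average: float, hardware_threads: int) -> float:
    """Return runnable tasks per hardware thread; below 1.0 suggests idle CPU time."""
    if hardware_threads <= 0:
        raise ValueError("hardware_threads must be positive")
    return load_average / hardware_threads


# Upper bound from the top data: load average under 12 on a 20 thread system.
peak = load_per_thread(12.0, 20)
```

Note that on Linux the load average also counts tasks in uninterruptible sleep (typically waiting on I/O), so a value well under 1.0 per thread argues against CPU contention but says nothing about storage.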

Jason


** Changed in: maas
   Status: Incomplete => New

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Steve Langasek
Jason's feedback was that, after making the changes to the storage
configuration of his environment, deploying the test grubx64.efi doesn't
have any effect on the MAAS server's response time to tftp requests.  So
at this point it's not at all clear that the grub change, while correct,
helps with this high-level symptom.

It has also been suggested that each udp retry is generating a separate
database query from MAAS.  That is absolutely a MAAS bug if true, and
not something that can or should be fixed in GRUB.

** Changed in: grub2 (Ubuntu)
   Importance: Critical => Medium

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Mathieu Trudel-Lapierre
** Changed in: grub2 (Ubuntu)
   Status: Triaged => In Progress

** Changed in: grub2 (Ubuntu)
   Importance: Undecided => Critical

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Andres Rodriguez
No new data was provided to mark this New in MAAS:

1. Changes to the storage seem to have improved things
2. No tests have been run with fixed grub that have caused boot failures.
3. AFAIK, the VM config has not changed to use less CPU to compare results and 
whether this config change causes the bugs in question.

As such, marking this incomplete until we can verify that 1, 2 and 3
don't make any difference, or that we continue to see the issues with
those changes in place.

** Changed in: maas
   Status: New => Incomplete

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-05 Thread Chris Gregan
** Changed in: maas
   Status: Incomplete => New

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-02 Thread Jason Hobbs
Here is part of a packet capture on my environment:
http://paste.ubuntu.com/26509374/

From the other tftp server on the deploy:
http://paste.ubuntu.com/26509386/

The whole pcap is prohibitively large because it's for multiple hosts.

You can see from this that grub is only reading the file once now, so
that grub bug has been fixed.

You can also see that MAAS is still taking a while to respond to the
request sometimes - 6 seconds in this capture.

There were no failures on this run, but we don't have failures every
time, so that doesn't prove anything.

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-02 Thread Andres Rodriguez
I've tested and I can confirm it made just 1 request instead of 4. I
think now we need to test it in Jason's environment to see the
differences.

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-02 Thread Steve Langasek
Note that the source file is grubnetx64.efi, it should be installed as
grubx64.efi in the tftp server directory.

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-02 Thread Andres Rodriguez
** Attachment added: "tcpdump.pcap"
   
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1743249/+attachment/5047711/+files/tcpdump.pcap

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-02 Thread Steve Langasek
Attached is an (unsigned) test grubnetx64.efi, built from xenial grub2
plus my patch.  Please deploy this in the maas tftp environment where
you are experiencing the timeouts, and give feedback on whether it helps
with the primary symptom.

** Attachment added: "grubnetx64.efi"
   
https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5047679/+files/grubnetx64.efi

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-02 Thread Ubuntu Foundations Team Bug Bot
** Tags added: patch

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Steve Langasek
Here is a possible fix for grub's repeated requests of the config file.

** Patch added: "bufio_sensible_block_sizes.patch"
   
https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5047245/+files/bufio_sensible_block_sizes.patch

** Changed in: grub2 (Ubuntu)
   Status: New => Triaged

** Changed in: grub2 (Ubuntu)
 Assignee: (unassigned) => Mathieu Trudel-Lapierre (cyphermox)

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Jason Hobbs
Here is the complete output of top from comment #48:

** Attachment added: "top.txt.gz"
   
https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5047072/+files/top.txt.gz

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Jason Hobbs
I also collected iotop output from the same run:
http://paste.ubuntu.com/26502363/

The storage setup on these nodes is writethrough bcache with a 400 GB
nvme in front of a 1TB spinning disk.  Since it's writethrough, writes
have to make it to the spinning disk before being counted as sync'd.

The write numbers look high for random i/o on a spinning disk.  It seems
possible that the slow MAAS performance is due to postgresql waiting for
writes to disk to complete, and MAAS threads blocking on that, so that
servicing DB reads is blocked on the commits completing first.

The VMs running on the machine are using this same bcache setup for
their storage pool.  It looks like most of the disk write traffic is
coming from the VMs.

Based on this data we'll make two changes to our setup, which I think should
help alleviate this problem:
- move the VMs' storage to a separate disk.
- change the storage setup to use writeback bcache.
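
To confirm which cache mode a bcache device is actually using before and after such a change, something like the following sketch could be used. It relies on the bcache sysfs file `cache_mode`, which lists the available modes and marks the active one in square brackets. The device name `bcache0` is an assumption, and the helper names are made up for this example.

```python
from pathlib import Path


def active_cache_mode(sysfs_text: str) -> str:
    """Return the bracketed entry from a bcache cache_mode line."""
    for word in sysfs_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no active mode marked in {sysfs_text!r}")


def read_cache_mode(device: str = "bcache0") -> str:
    # Assumed device name; adjust to the bcache device backing the storage pool.
    path = Path("/sys/block") / device / "bcache" / "cache_mode"
    return active_cache_mode(path.read_text())


# Parsing a sample line, so the example runs on machines without bcache.
mode = active_cache_mode("[writethrough] writeback writearound none")
```

Switching modes is done by writing the mode name (for example `writeback`) to that same file as root. Writeback trades durability guarantees on cache device failure for lower write latency, so it is worth weighing before applying it to a database host.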

** Attachment added: "iotop.txt.gz"
   
https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5047065/+files/iotop.txt.gz

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Jason Hobbs
I collected top output from a run (this run did not exhibit this
failure):

http://paste.ubuntu.com/26502311/

The highest the load average ever gets is 11.85, and it's usually around
3-4.  This is a 20 thread system, so it doesn't look like CPU contention
is the problem.

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Andres Rodriguez
@Steve,


On Thu, Feb 1, 2018 at 1:49 PM, Steve Langasek  wrote:

> On Thu, Feb 01, 2018 at 06:15:31PM -, Andres Rodriguez wrote:
> > @Jason,
>
> > Packet 90573 doesn't seem to me as an indication of what you are
> > describing. What I see is this:
>
> > 1. grub makes ~30 requests for PXE config on grub.cfg-, after which
> > it gives up because it didn't receive a response.
> > 2. grub moves on and requests grub.cfg-default-amd64, and it receives a
> > response from MAAS.
>
> > Now, the difference between the above, is that 1 does *database*
> > lookups, while 2 does not. In other words, 1 causes a request to obtain
> > the 'node' object based on the MAC to provide, and if grub is making 30+
> > requests, then this can definitely flood the db with requests.
>
> Then as I've said on IRC, this is a bug in maas, because 30 udp retries
> should not generate 30 requests to the database.
>
> GRUB is *not* wrong to retransmit its udp packets when it doesn't get a
> response.  If each of these increases the load in MAAS, then MAAS should be
> fixed.


> The case where GRUB retrieves the same file multiple times is a GRUB bug,
> but I don't see any evidence linking this GRUB bug to the timeout and
> fallback problem in Jason's latest trace.


I agree with you if we are only considering this 1 system.

Let's not forget that we have other systems booting at around the same
time, each of which may be making at least 4 requests (for those grub
systems) that may or may not be answered immediately after each request.
But if requests are being served at the same time that more requests come
in, I do see how making multiple requests can indeed be causing the
degraded performance.

Especially now that we've learned that there are multiple VMs on the same
host, together consuming 18 CPUs on a 20 CPU system, while MAAS alone runs
5 processes, for each of which we typically recommend a dedicated CPU.




-- 
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Andres Rodriguez
@Jason,

Did you expand the "production environment" section?

                                      Memory (MB)  CPU (GHz)  Disk (GB)
Region controller (minus PostgreSQL)  2048         2.0        5
PostgreSQL                            2048         2.0        20
Rack controller                       2048         2.0        20
Ubuntu Server (including logs)        512          0.5        20

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Andres Rodriguez
Oh I see what you mean, yeah ignore the GHz section, that's wrong.

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Jason Hobbs
FYI those minimum requirements don't mention anything about core/thread
count.

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Andres Rodriguez
@Jason,

I would give MAAS at least 6 CPU's.

2 for Region
2 for Postgres
2 for Rack.

I would even recommend 4 for region instead of just 2, as MAAS runs 4
region processes. So that would be a total of 8.

[2]: https://docs.ubuntu.com/maas/2.3/en/#minimum-requirements

Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Steve Langasek
On Thu, Feb 01, 2018 at 06:15:31PM -, Andres Rodriguez wrote:
> @Jason,

> Packet 90573 doesn't seem to me as an indication of what you are
> describing. What I see is this:

> 1. grub makes ~30 requests for PXE config on grub.cfg-, after which it 
> gives up because it didn't receive a response.
> 2. grub moves on and requests grub.cfg-default-amd64, and it receives a 
> response from MAAS.

> Now, the difference between the above, is that 1 does *database*
> lookups, while 2 does not. In other words, 1 causes a request to obtain
> the 'node' object based on the MAC to provide, and if grub is making 30+
> requests, then this can definitely flood the db with requests.

Then as I've said on IRC, this is a bug in maas, because 30 udp retries
should not generate 30 requests to the database.

GRUB is *not* wrong to retransmit its udp packets when it doesn't get a
response.  If each of these increases the load in MAAS, then MAAS should be
fixed.

The case where GRUB retrieves the same file multiple times is a GRUB bug,
but I don't see any evidence linking this GRUB bug to the timeout and
fallback problem in Jason's latest trace.

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Jason Hobbs
Andres,

You can tell packet 90573 is a response to the requests for
grub.cfg- because its destination port (25305) is the src port the
request for grub.cfg- was coming from (packets 2 through 38).
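
The port matching described above works because TFTP identifies a transfer by the port pair: the server answers to the source port the client's read request came from. A rough sketch of doing that match over capture records follows. The `Packet` type and `first_reply` helper are made up for this example. The packet numbers, port 25305 and the 61 second delay come from this thread; the timestamps and the server's source port are placeholders, not values read from the pcap.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Packet:
    number: int      # packet number in the capture
    time: float      # seconds since start of capture
    src_port: int
    dst_port: int


def first_reply(requests: Iterable[Packet], capture: Iterable[Packet]) -> Optional[Packet]:
    """Find the first packet sent to the port the requests came from."""
    client_ports = {p.src_port for p in requests}
    for packet in capture:
        if packet.dst_port in client_ports:
            return packet
    return None


# Placeholder records: requests from port 25305 to the TFTP port (69),
# then a late reply addressed back to port 25305.
requests = [Packet(2, 0.0, 25305, 69), Packet(38, 30.0, 25305, 69)]
capture = requests + [Packet(90573, 61.0, 40000, 25305)]

reply = first_reply(requests, capture)
delay = reply.time - requests[0].time if reply else None
```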

We're running another test now to collect load information.

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Andres Rodriguez
@Jason,

Packet 90573 doesn't seem to me as an indication of what you are
describing. What I see is this:

1. grub makes ~30 requests for PXE config on grub.cfg-, after which it 
gives up because it didn't receive a response.
2. grub moves on and requests grub.cfg-default-amd64, and it receives a 
response from MAAS.

Now, the difference between the above, is that 1 does *database*
lookups, while 2 does not. In other words, 1 causes a request to obtain
the 'node' object based on the MAC to provide, and if grub is making 30+
requests, then this can definitely flood the db with requests.

That said, based on my understanding of how your environment is
configured, you have 3 other VMs on the same system PXE booting from MAAS,
plus other machines PXE booting at the same time. Each VM is assigned 8
CPUs on a system that has 20 CPUs, which means the VMs alone overcommit
the CPU. Combined with the performance implications of the recent kernel,
it does seem to me that all of these things could be contending with MAAS
for resources, when we already know postgresql is running with degraded
performance due to the newer kernels.

That said, did you disable the spectre features and reboot your machine?
Did you test this by NOT running VMs on the same system as MAAS, or at least
by reducing the number of cores each VM has access to (since there are 3 VMs
with 8 cores each, that means 24 cores on a 20 core system)?
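
The overcommit figure can be spelled out with a small sketch (the helper name `overcommit_ratio` is made up for this example):

```python
from typing import Sequence


def overcommit_ratio(vcpus_per_vm: Sequence[int], host_cpus: int) -> float:
    """Ratio of vCPUs assigned to guests over CPUs on the host; above 1.0 is overcommitted."""
    if host_cpus <= 0:
        raise ValueError("host_cpus must be positive")
    return sum(vcpus_per_vm) / host_cpus


# 3 VMs with 8 vCPUs each on a 20 CPU host.
ratio = overcommit_ratio([8, 8, 8], 20)
```

This only measures what is assigned, not what is in use, and it does not count the CPUs MAAS and postgresql need for themselves. Whether the guests actually keep those CPUs busy has to come from measured load.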

Also, do you have any CPU load at the time of failure?

** Changed in: maas
   Status: New => Incomplete

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Jason Hobbs
** Changed in: maas
   Status: Incomplete => New

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Jason Hobbs
In the pcap from comment #35, MAAS eventually does respond to the
interface specific grub request, 61 seconds after the request, after
it's already sent the grub.cfg-default-amd64, kernel, and initrd. You
can see the responses to the interface specific grub.cfg requests coming
back starting at packet 90573.

While Steve's findings in #33/34 seem to indicate a grub bug, this seems
like a MAAS problem occurring before that grub bug even has a chance to
take effect.  I'm attaching MAAS logs from this same test run.

From the maas logs, the requests start at 01:02:49
logs-2018-02-01-01.04.49/10.244.40.30/var/log/maas/rackd.log

There are some "critical" tftp errors logged in the same file not long 
afterwards:
http://paste.ubuntu.com/26501394/

There are errors in postgresql's log around the same time too:
(logs-2018-02-01-01.04.49/10.244.40.30/var/log/postgresql$ vim 
postgresql-9.5-ha.log)

http://paste.ubuntu.com/26501399/

** Attachment added: "infra-logs.tar"
   
https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5046970/+files/infra-logs.tar

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-02-01 Thread Jason Hobbs
Attaching a pcap from a failure case.  In this case, grub tried for 30
seconds to retrieve the interface specific grub.cfg, but never got a
response from MAAS.  It then gave up and got the amd64-default one
instead, which caused the machine to try to enlist and then power off,
leading to a failed deployment.

** Attachment added: "spearow-fall-back-to-default-amd64.pcap"
   
https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5046952/+files/spearow-fall-back-to-default-amd64.pcap

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-01-31 Thread Steve Langasek
Regarding grub requesting the same file 4 times, a surprising finding:
I'm able to reproduce this with files of a certain length.  By chance my
grub.cfg was 1 byte shorter than the one maas serves (269 bytes instead
of 270), and I saw multiple requests for this file.

To reproduce this in a VM using UEFI:
- set up dhcp to point to bootx64.efi
- set up tftp with bootx64.efi and grubx64.efi but not grub/grub.cfg
- create files of varying sizes and access them using 'source 
(pxe)/config-file-on-server'

A simple file consisting of nothing but newlines is sufficient.

confirmed "good" file lengths: 1,2,3,4,266,268,270
confirmed "bad" file lengths: 267,269,271,584,595,627

No pattern established yet.
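In case it helps others reproduce this, a small sketch that generates the newline-only test files at the lengths listed above. The file naming and the output directory are assumptions for illustration; the files still need to be copied into whatever directory your TFTP server serves.

```python
import os
import tempfile

# Lengths reported above (in bytes).
GOOD_LENGTHS = [1, 2, 3, 4, 266, 268, 270]
BAD_LENGTHS = [267, 269, 271, 584, 595, 627]


def write_newline_files(directory, lengths, prefix="test-"):
    """Create one file per length, each containing only newline bytes.

    Returns a dict mapping length -> path. The "test-<n>.cfg" naming is
    arbitrary and only used here for illustration.
    """
    paths = {}
    for n in lengths:
        path = os.path.join(directory, f"{prefix}{n}.cfg")
        with open(path, "wb") as f:
            f.write(b"\n" * n)
        paths[n] = path
    return paths


if __name__ == "__main__":
    out = tempfile.mkdtemp(prefix="tftp-size-test-")
    paths = write_newline_files(out, GOOD_LENGTHS + BAD_LENGTHS)
    for n, path in sorted(paths.items()):
        print(n, path)
```

Each file can then be fetched from the grub prompt with 'source (pxe)/<name>' as in the steps above, while watching the TFTP server for repeated requests.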

[Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

2018-01-31 Thread Andres Rodriguez
** Also affects: grub2 (Ubuntu)
   Importance: Undecided
   Status: New
