Re: [openstack-dev] [TripleO/heat] openstack debug command

2015-12-01 Thread Steve Baker

On 02/12/15 03:18, Lennart Regebro wrote:

On Tue, Dec 1, 2015 at 3:39 AM, Steve Baker  wrote:

I mean _here_

https://review.openstack.org/#/c/251587/

OK, that's great! If you want any help implementing it, I can try.



Hey Lennart, help is always appreciated.

I can elaborate on the implementation approach for ``openstack overcloud 
failed list`` and you can take a crack at that if you like while I work 
on the other two commands.


I think before we start on the commands proper I will need to implement
some yaml printing utility functions, so let's coordinate on that.
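
To give an idea of the kind of utility I mean, here is a rough sketch only.
The function name format_yaml is made up for illustration, it handles just
nested dicts and scalar values, and it does no quoting of special characters,
so it is not a replacement for a real YAML library. The one thing it shows is
printing multi-line strings (such as stderr output) as literal blocks so they
stay readable:

```python
def format_yaml(data, indent=0):
    """Render a nested dict as YAML-style text (illustrative helper only).

    Multi-line strings are emitted as literal blocks ("|") so that
    captured command output keeps its line breaks.
    """
    pad = "  " * indent
    lines = []
    for key, value in data.items():
        if isinstance(value, dict) and value:
            # Non-empty mapping: print the key, then recurse one level deeper.
            lines.append(f"{pad}{key}:")
            lines.append(format_yaml(value, indent + 1))
        elif isinstance(value, str) and "\n" in value:
            # Multi-line string: use a literal block scalar.
            lines.append(f"{pad}{key}: |")
            lines.extend(f"{pad}  {line}" for line in value.splitlines())
        else:
            # Scalars and empty dicts go on a single line.
            lines.append(f"{pad}{key}: {value}")
    return "\n".join(lines)
```

Calling print(format_yaml(tree)) on a nested dict of failed resources would
then give an indented tree with one block per failure.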


cheers

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO/heat] openstack debug command

2015-12-01 Thread Lennart Regebro
On Tue, Dec 1, 2015 at 3:39 AM, Steve Baker  wrote:
> I mean _here_
>
> https://review.openstack.org/#/c/251587/

OK, that's great! If you want any help implementing it, I can try.

//Lennart



Re: [openstack-dev] [TripleO/heat] openstack debug command

2015-11-30 Thread Steve Baker

On 01/12/15 15:39, Steve Baker wrote:

On 01/12/15 10:28, Steven Hardy wrote:

On Tue, Dec 01, 2015 at 08:47:20AM +1300, Steve Baker wrote:

On 30/11/15 23:21, Steven Hardy wrote:

On Mon, Nov 30, 2015 at 10:03:29AM +0100, Lennart Regebro wrote:

I'm tasked to implement a command that shows error messages when a
deployment has failed. I have a vague memory of having seen scripts
that do something like this, if that exists, can somebody point me in
the right direction?

I wrote a super simple script and put it in a blog post a while back:

http://hardysteven.blogspot.co.uk/2015/05/tripleo-heat-templates-part-3-cluster.html 



All it does is find the failed SoftwareDeployment resources, then do heat
deployment-show on the resource, so you can see the stderr associated with
the failure.

Having tripleoclient do that by default would be useful.


Any opinions on what that should do, specifically? Traverse failed
resources to find error messages, I assume. Anything else?

Yeah, but I think for this to be useful, we need to go a bit deeper than
just showing the resource error - there are a number of typical failure
modes, and I end up repeating the same steps to debug every time.

1. SoftwareDeployment failed (mentioned above).  Every time, you need to
see the name of the SoftwareDeployment which failed, figure out if it
failed on one or all of the servers, then look at the stderr for clues.

2. A server failed to build (OS::Nova::Server resource is FAILED), here we
need to check both nova and ironic, looking first to see if ironic has the
node(s) in the wrong state for scheduling (e.g. nova gave us a no valid
host error), and then if they are OK in ironic, do nova show on the failed
host to see the reason nova gives us for it failing to go ACTIVE.

3. A stack timeout happened.  IIRC when this happens, we currently fail
with an obscure keystone related backtrace due to the token expiring.  We
should instead catch this error and show the heat stack status_reason,
which should say clearly the stack timed out.

If we could just make these three cases really clear and easy to debug, I
think things would be much better (IME the above are a high proportion of
all failures), but I'm sure folks can come up with other ideas to add to
the list.

I'm actually drafting a spec which includes a command which does this. I
hope to submit it soon, but here is the current state of that command's
description:

Diagnosing resources in a FAILED state
--------------------------------------

One command will be implemented:
- openstack overcloud failed list

This will print a yaml tree showing the hierarchy of nested stacks until it
gets to the actual failed resource, then it will show information regarding
the failure. For most resource types this information will be the
status_reason, but for software-deployment resources the deploy_stdout,
deploy_stderr and deploy_status_code will be printed.

In addition to this stand-alone command, this output will also be printed
when an ``openstack overcloud deploy`` or ``openstack overcloud update``
command results in a stack in a FAILED state.

This sounds great!

The spec is here.

I mean _here_

https://review.openstack.org/#/c/251587/



Re: [openstack-dev] [TripleO/heat] openstack debug command

2015-11-30 Thread Steve Baker

On 01/12/15 10:28, Steven Hardy wrote:

On Tue, Dec 01, 2015 at 08:47:20AM +1300, Steve Baker wrote:

On 30/11/15 23:21, Steven Hardy wrote:

On Mon, Nov 30, 2015 at 10:03:29AM +0100, Lennart Regebro wrote:

I'm tasked to implement a command that shows error messages when a
deployment has failed. I have a vague memory of having seen scripts
that do something like this, if that exists, can somebody point me in
the right direction?

I wrote a super simple script and put it in a blog post a while back:

http://hardysteven.blogspot.co.uk/2015/05/tripleo-heat-templates-part-3-cluster.html

All it does is find the failed SoftwareDeployment resources, then do heat
deployment-show on the resource, so you can see the stderr associated with
the failure.

Having tripleoclient do that by default would be useful.


Any opinions on what that should do, specifically? Traverse failed
resources to find error messages, I assume. Anything else?

Yeah, but I think for this to be useful, we need to go a bit deeper than
just showing the resource error - there are a number of typical failure
modes, and I end up repeating the same steps to debug every time.

1. SoftwareDeployment failed (mentioned above).  Every time, you need to
see the name of the SoftwareDeployment which failed, figure out if it
failed on one or all of the servers, then look at the stderr for clues.

2. A server failed to build (OS::Nova::Server resource is FAILED), here we
need to check both nova and ironic, looking first to see if ironic has the
node(s) in the wrong state for scheduling (e.g. nova gave us a no valid
host error), and then if they are OK in ironic, do nova show on the failed
host to see the reason nova gives us for it failing to go ACTIVE.

3. A stack timeout happened.  IIRC when this happens, we currently fail
with an obscure keystone related backtrace due to the token expiring.  We
should instead catch this error and show the heat stack status_reason,
which should say clearly the stack timed out.

If we could just make these three cases really clear and easy to debug, I
think things would be much better (IME the above are a high proportion of
all failures), but I'm sure folks can come up with other ideas to add to
the list.


I'm actually drafting a spec which includes a command which does this. I
hope to submit it soon, but here is the current state of that command's
description:

Diagnosing resources in a FAILED state
--------------------------------------

One command will be implemented:
- openstack overcloud failed list

This will print a yaml tree showing the hierarchy of nested stacks until it
gets to the actual failed resource, then it will show information regarding
the failure. For most resource types this information will be the
status_reason, but for software-deployment resources the deploy_stdout,
deploy_stderr and deploy_status_code will be printed.

In addition to this stand-alone command, this output will also be printed
when an ``openstack overcloud deploy`` or ``openstack overcloud update``
command results in a stack in a FAILED state.

This sounds great!

The spec is here.

Another piece of low-hanging-fruit in the meantime is we should actually
print the stack_status_reason on failure:

https://github.com/openstack/python-tripleoclient/blob/master/tripleoclient/v1/overcloud_deploy.py#L280

The DeploymentError raised could include the stack_status_reason vs the
unqualified "Heat Stack create failed".

I guess your event listing partially overlaps with this, as you can now
derive the stack_status_reason from the last event, but it would still be good
to loudly output it so folks can see more quickly when things such as
timeouts happen that are clearly displayed in the top-level stack status.


Yes, this would be a trivially implemented quick win.



Re: [openstack-dev] [TripleO/heat] openstack debug command

2015-11-30 Thread Steven Hardy
On Tue, Dec 01, 2015 at 08:47:20AM +1300, Steve Baker wrote:
> On 30/11/15 23:21, Steven Hardy wrote:
> >On Mon, Nov 30, 2015 at 10:03:29AM +0100, Lennart Regebro wrote:
> >>I'm tasked to implement a command that shows error messages when a
> >>deployment has failed. I have a vague memory of having seen scripts
> >>that do something like this, if that exists, can somebody point me in
> >>the right direction?
> >I wrote a super simple script and put it in a blog post a while back:
> >
> >http://hardysteven.blogspot.co.uk/2015/05/tripleo-heat-templates-part-3-cluster.html
> >
> >All it does is find the failed SoftwareDeployment resources, then do heat
> >deployment-show on the resource, so you can see the stderr associated with
> >the failure.
> >
> >Having tripleoclient do that by default would be useful.
> >
> >>Any opinions on what that should do, specifically? Traverse failed
> >>resources to find error messages, I assume. Anything else?
> >Yeah, but I think for this to be useful, we need to go a bit deeper than
> >just showing the resource error - there are a number of typical failure
> >modes, and I end up repeating the same steps to debug every time.
> >
> >1. SoftwareDeployment failed (mentioned above).  Every time, you need to
> >see the name of the SoftwareDeployment which failed, figure out if it
> >failed on one or all of the servers, then look at the stderr for clues.
> >
> >2. A server failed to build (OS::Nova::Server resource is FAILED), here we
> >need to check both nova and ironic, looking first to see if ironic has the
> >node(s) in the wrong state for scheduling (e.g. nova gave us a no valid
> >host error), and then if they are OK in ironic, do nova show on the failed
> >host to see the reason nova gives us for it failing to go ACTIVE.
> >
> >3. A stack timeout happened.  IIRC when this happens, we currently fail
> >with an obscure keystone related backtrace due to the token expiring.  We
> >should instead catch this error and show the heat stack status_reason,
> >which should say clearly the stack timed out.
> >
> >If we could just make these three cases really clear and easy to debug, I
> >think things would be much better (IME the above are a high proportion of
> >all failures), but I'm sure folks can come up with other ideas to add to
> >the list.
> >
> I'm actually drafting a spec which includes a command which does this. I
> hope to submit it soon, but here is the current state of that command's
> description:
> 
> Diagnosing resources in a FAILED state
> --------------------------------------
> 
> One command will be implemented:
> - openstack overcloud failed list
> 
> This will print a yaml tree showing the hierarchy of nested stacks until it
> gets to the actual failed resource, then it will show information regarding
> the failure. For most resource types this information will be the
> status_reason, but for software-deployment resources the deploy_stdout,
> deploy_stderr and deploy_status_code will be printed.
> 
> In addition to this stand-alone command, this output will also be printed
> when an ``openstack overcloud deploy`` or ``openstack overcloud update``
> command results in a stack in a FAILED state.

This sounds great!

Another piece of low-hanging-fruit in the meantime is we should actually
print the stack_status_reason on failure:

https://github.com/openstack/python-tripleoclient/blob/master/tripleoclient/v1/overcloud_deploy.py#L280

The DeploymentError raised could include the stack_status_reason vs the
unqualified "Heat Stack create failed".

I guess your event listing partially overlaps with this, as you can now
derive the stack_status_reason from the last event, but it would still be good
to loudly output it so folks can see more quickly when things such as
timeouts happen that are clearly displayed in the top-level stack status.
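
Something along these lines is what I have in mind. This is a sketch only,
not the current overcloud_deploy.py code: DeploymentError here is a local
stand-in class, raise_if_failed is a made-up helper name, and the stack is
modelled as a plain dict carrying the stack_status and stack_status_reason
fields that heat reports:

```python
class DeploymentError(Exception):
    """Local stand-in for the client's exception class (sketch only)."""


def raise_if_failed(stack):
    """Raise DeploymentError carrying the reason if the stack has failed.

    ``stack`` is assumed to be a dict with "stack_status" and
    "stack_status_reason" keys; a real client would read these from
    the stack object returned by heat.
    """
    status = stack.get("stack_status", "")
    if status.endswith("_FAILED"):
        reason = stack.get("stack_status_reason") or "no reason recorded"
        # Put the reason in the message so it is the first thing users see.
        raise DeploymentError(f"Heat Stack {status}: {reason}")
```

With that, a timeout would surface in the error message itself rather than
needing a separate look at the stack.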

Steve



Re: [openstack-dev] [TripleO/heat] openstack debug command

2015-11-30 Thread Steve Baker

On 30/11/15 23:21, Steven Hardy wrote:

On Mon, Nov 30, 2015 at 10:03:29AM +0100, Lennart Regebro wrote:

I'm tasked to implement a command that shows error messages when a
deployment has failed. I have a vague memory of having seen scripts
that do something like this, if that exists, can somebody point me in
the right direction?

I wrote a super simple script and put it in a blog post a while back:

http://hardysteven.blogspot.co.uk/2015/05/tripleo-heat-templates-part-3-cluster.html

All it does is find the failed SoftwareDeployment resources, then do heat
deployment-show on the resource, so you can see the stderr associated with
the failure.

Having tripleoclient do that by default would be useful.


Any opinions on what that should do, specifically? Traverse failed
resources to find error messages, I assume. Anything else?

Yeah, but I think for this to be useful, we need to go a bit deeper than
just showing the resource error - there are a number of typical failure
modes, and I end up repeating the same steps to debug every time.

1. SoftwareDeployment failed (mentioned above).  Every time, you need to
see the name of the SoftwareDeployment which failed, figure out if it
failed on one or all of the servers, then look at the stderr for clues.

2. A server failed to build (OS::Nova::Server resource is FAILED), here we
need to check both nova and ironic, looking first to see if ironic has the
node(s) in the wrong state for scheduling (e.g. nova gave us a no valid
host error), and then if they are OK in ironic, do nova show on the failed
host to see the reason nova gives us for it failing to go ACTIVE.

3. A stack timeout happened.  IIRC when this happens, we currently fail
with an obscure keystone related backtrace due to the token expiring.  We
should instead catch this error and show the heat stack status_reason,
which should say clearly the stack timed out.

If we could just make these three cases really clear and easy to debug, I
think things would be much better (IME the above are a high proportion of
all failures), but I'm sure folks can come up with other ideas to add to
the list.

I'm actually drafting a spec which includes a command which does this. I 
hope to submit it soon, but here is the current state of that command's 
description:


Diagnosing resources in a FAILED state
--------------------------------------

One command will be implemented:
- openstack overcloud failed list

This will print a yaml tree showing the hierarchy of nested stacks until it
gets to the actual failed resource, then it will show information regarding
the failure. For most resource types this information will be the
status_reason, but for software-deployment resources the deploy_stdout,
deploy_stderr and deploy_status_code will be printed.

In addition to this stand-alone command, this output will also be printed
when an ``openstack overcloud deploy`` or ``openstack overcloud update`` command
results in a stack in a FAILED state.
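
As a rough illustration of the traversal described above (not the spec's
implementation), here is a sketch that works on plain dicts rather than on
heat API objects. The dict layout ("resources", "status", "nested", "type")
and the name failed_tree are assumptions made for the example; a real command
would fetch this data from heat:

```python
# Resource types whose failures carry deployment output rather than
# just a status_reason.
SOFTWARE_DEPLOYMENT_TYPES = {
    "OS::Heat::SoftwareDeployment",
    "OS::Heat::StructuredDeployment",
}


def failed_tree(stack):
    """Return a nested dict holding only the FAILED branches of ``stack``."""
    tree = {}
    for name, res in stack.get("resources", {}).items():
        if not res.get("status", "").endswith("FAILED"):
            continue
        nested = res.get("nested")
        if nested is not None:
            # Descend into the nested stack to reach the real failure.
            subtree = failed_tree(nested)
            if subtree:
                tree[name] = subtree
                continue
        if res.get("type") in SOFTWARE_DEPLOYMENT_TYPES:
            tree[name] = {
                key: res.get(key)
                for key in ("deploy_stdout", "deploy_stderr",
                            "deploy_status_code")
            }
        else:
            tree[name] = {"status_reason": res.get("status_reason")}
    return tree
```

The resulting dict mirrors the stack hierarchy, so rendering it as yaml is a
separate, purely presentational step.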




Re: [openstack-dev] [TripleO/heat] openstack debug command

2015-11-30 Thread Steven Hardy
On Mon, Nov 30, 2015 at 10:03:29AM +0100, Lennart Regebro wrote:
> I'm tasked to implement a command that shows error messages when a
> deployment has failed. I have a vague memory of having seen scripts
> that do something like this, if that exists, can somebody point me in
> the right direction?

I wrote a super simple script and put it in a blog post a while back:

http://hardysteven.blogspot.co.uk/2015/05/tripleo-heat-templates-part-3-cluster.html

All it does is find the failed SoftwareDeployment resources, then do heat
deployment-show on the resource, so you can see the stderr associated with
the failure.

Having tripleoclient do that by default would be useful.

> Any opinions on what that should do, specifically? Traverse failed
> resources to find error messages, I assume. Anything else?

Yeah, but I think for this to be useful, we need to go a bit deeper than
just showing the resource error - there are a number of typical failure
modes, and I end up repeating the same steps to debug every time.

1. SoftwareDeployment failed (mentioned above).  Every time, you need to
see the name of the SoftwareDeployment which failed, figure out if it
failed on one or all of the servers, then look at the stderr for clues.

2. A server failed to build (OS::Nova::Server resource is FAILED), here we
need to check both nova and ironic, looking first to see if ironic has the
node(s) in the wrong state for scheduling (e.g. nova gave us a no valid
host error), and then if they are OK in ironic, do nova show on the failed
host to see the reason nova gives us for it failing to go ACTIVE.

3. A stack timeout happened.  IIRC when this happens, we currently fail
with an obscure keystone related backtrace due to the token expiring.  We
should instead catch this error and show the heat stack status_reason,
which should say clearly the stack timed out.

If we could just make these three cases really clear and easy to debug, I
think things would be much better (IME the above are a high proportion of
all failures), but I'm sure folks can come up with other ideas to add to
the list.
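
To make the three cases concrete, here is a very rough sketch of how a
failure could be sorted into one of them. Everything in it is an assumption
for illustration: the failure is a plain dict, the names classify and STEPS
are made up, and matching "timed out" in the status_reason is only a guess
at how a timeout would be recognised:

```python
DEPLOYMENT_TYPES = {
    "OS::Heat::SoftwareDeployment",
    "OS::Heat::StructuredDeployment",
}

# Debug steps per failure mode, paraphrasing the list above.
STEPS = {
    "deployment": ["find which deployment failed and on which servers",
                   "look at the deployment stderr"],
    "server": ["check the node state in ironic",
               "do nova show on the failed host"],
    "timeout": ["show the heat stack status_reason"],
    "other": ["show the resource status_reason"],
}


def classify(failure):
    """Map a failure record to one of the failure modes listed above."""
    rtype = failure.get("resource_type", "")
    reason = (failure.get("status_reason") or "").lower()
    if rtype in DEPLOYMENT_TYPES:
        return "deployment"
    if rtype == "OS::Nova::Server":
        return "server"
    if "timed out" in reason:  # assumed wording of a timeout reason
        return "timeout"
    return "other"
```

STEPS[classify(failure)] then gives the checks to run for that failure.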

Steve



[openstack-dev] [TripleO/heat] openstack debug command

2015-11-30 Thread Lennart Regebro
I'm tasked to implement a command that shows error messages when a
deployment has failed. I have a vague memory of having seen scripts
that do something like this, if that exists, can somebody point me in
the right direction?

Any opinions on what that should do, specifically? Traverse failed
resources to find error messages, I assume. Anything else?

//Lennart
