Re: [openstack-dev] Can we get some sanity in the Neutron logs please?

2015-11-11 Thread Rossella Sblendido

Hello Matt,


On 11/10/2015 07:33 PM, Matt Riedemann wrote:

Let me qualify by saying I'm not a Neutron person.

We know that gate-tempest-dsvm-neutron-full is failing hard as of the
last 24 hours [1].

An error that's been showing up in tempest runs with neutron a lot is:

"AssertionError: 0 == 0 : No IPv4 addresses found in: []"

So checking logstash [2] it's hitting a lot. It's only recent because
that failure message is new to Tempest in the last day or so, but it has
a lot of hits, so whatever it is, it's failing a lot.

So the next step is usually digging into service logs looking for
errors. I check the q-svc logs first. Not many errors but a bazillion
warnings for things not found (networks and devices). [3]

For example:

2015-11-10 17:13:02.542 WARNING neutron.plugins.ml2.rpc
[req-15a73753-1512-4689-9404-9658a0cd0c09 None None] Device
aaa525be-14eb-44a5-beb0-ed722896be93 requested by agent
ovs-agent-devstack-trusty-rax-iad-5785199 not found in database

2015-11-10 17:14:17.754 WARNING neutron.api.rpc.handlers.dhcp_rpc
[req-3d7e9848-6151-4780-907f-43f11a2a8545 None None] Network
b07ad9b2-e63e-4459-879d-3721074704e5 could not be found, it might have
been deleted concurrently.

Are several hundred of these warnings useful to an operator trying to
debug a problem? The point of the CI gate testing is to try and simulate
a production cloud environment. When something goes wrong, you check the
logs. With the amount of warning/error level logging that is in the
neutron logs, finding a real problem is like looking for a needle in a
haystack. Since everything is async, 404s are expected when racing to
delete a resource and they should be handled gracefully.

Anyway, the server log isn't useful so I go digging in the agent logs
and stacktraces there are aplenty. [4]

Particularly this:

"Exception: Port tapcea51630-e1 is not ready, resync needed"

That's due to a new change landing in the last 24 hours [5]. But the
trace shows up over 16K times since it landed [6].

Checking the code, it's basically a loop processing events and when it
hits an event it can't handle, it punts (breaking the loop so you don't
process the other events after it - which is a bug), and the code that
eventually handles it is just catching all Exception and tracing them
out assuming they are really bad.


As you noticed in the review [1], there was a dependent patch that was 
solving this. It was a big and pretty complex change; I tried to split 
it into a few patches, but I should have split it in a better way.




At this point, as a non-neutron person, i.e. not well versed in the
operations of neutron or how to debug it in great detail, I assume
something is bad here but I don't really know - and the logs are so full
of noise that I can't distinguish real failures.

I don't mean to pick on this particular change, but it's a good example
of a recent thing.

I'd like to know if this is all a known issue or WIP-type stuff. I've
complained about excessively noisy neutron logs in channel before and
I'm usually told that they are either necessary (for whatever reason) or
that rather than complain about the verbosity, I should fix the race
that is causing it - which is not likely to happen since I don't have
the async rpc happy nature of everything in neutron in my head to debug
it (I doubt many do).


Yes, in this case it's WIP; the logs were not meant to stay there. They 
were actually cleaned up by the dependent patch.





Anyway, this is a plea for sanity in the logs. There are logging
guidelines for openstack [7]. Let's please abide by them. Let's keep
operators in mind when we're looking at logs and be proactive about
making them useful (which includes more granular error handling and less
global try/except Exception: LOG.exception constructs).


We are aware of the guidelines, and when reviewing code in Neutron we try 
to enforce them. We all want better logs :) so please keep giving 
feedback. Thanks for raising this point.


cheers,

Rossella




[1] http://tinyurl.com/ne3ex4v
[2]
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22AssertionError:%200%20==%200%20:%20No%20IPv4%20addresses%20found%20in:%20%5B%5D%5C%22%20AND%20tags:%5C%22console%5C%22

[3]
http://logs.openstack.org/85/239885/2/gate/gate-tempest-dsvm-neutron-full/602d864/logs/screen-q-svc.txt.gz?level=TRACE

[4]
http://logs.openstack.org/85/239885/2/gate/gate-tempest-dsvm-neutron-full/602d864/logs/screen-q-agt.txt.gz?level=TRACE

[5] https://review.openstack.org/#/c/164880/
[6]
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22Exception:%20Port%5C%22%20AND%20message:%5C%22is%20not%20ready,%20resync%20needed%5C%22%20AND%20tags:%5C%22screen-q-agt.txt%5C%22

[7]
http://specs.openstack.org/openstack/openstack-specs/specs/log-guidelines.html





Re: [openstack-dev] Can we get some sanity in the Neutron logs please?

2015-11-10 Thread Matt Riedemann



On 11/10/2015 12:51 PM, Armando M. wrote:



On 10 November 2015 at 10:33, Matt Riedemann wrote:

Let me qualify by saying I'm not a Neutron person.

We know that gate-tempest-dsvm-neutron-full is failing hard as of
the last 24 hours [1].

An error that's been showing up in tempest runs with neutron a lot is:

"AssertionError: 0 == 0 : No IPv4 addresses found in: []"

So checking logstash [2] it's hitting a lot. It's only recent
because that failure message is new to Tempest in the last day or
so, but it has a lot of hits, so whatever it is, it's failing a lot.

So the next step is usually digging into service logs looking for
errors. I check the q-svc logs first. Not many errors but a
bazillion warnings for things not found (networks and devices). [3]

For example:

2015-11-10 17:13:02.542 WARNING neutron.plugins.ml2.rpc
[req-15a73753-1512-4689-9404-9658a0cd0c09 None None] Device
aaa525be-14eb-44a5-beb0-ed722896be93 requested by agent
ovs-agent-devstack-trusty-rax-iad-5785199 not found in database

2015-11-10 17:14:17.754 WARNING neutron.api.rpc.handlers.dhcp_rpc
[req-3d7e9848-6151-4780-907f-43f11a2a8545 None None] Network
b07ad9b2-e63e-4459-879d-3721074704e5 could not be found, it might
have been deleted concurrently.

Are several hundred of these warnings useful to an operator trying
to debug a problem? The point of the CI gate testing is to try and
simulate a production cloud environment. When something goes wrong,
you check the logs. With the amount of warning/error level logging
that is in the neutron logs, finding a real problem is like looking
for a needle in a haystack. Since everything is async, 404s are
expected when racing to delete a resource and they should be handled
gracefully.

Anyway, the server log isn't useful so I go digging in the agent
logs and stacktraces there are aplenty. [4]

Particularly this:

"Exception: Port tapcea51630-e1 is not ready, resync needed"

That's due to a new change landing in the last 24 hours [5]. But the
trace shows up over 16K times since it landed [6].

Checking the code, it's basically a loop processing events and when
it hits an event it can't handle, it punts (breaking the loop so you
don't process the other events after it - which is a bug), and the
code that eventually handles it is just catching all Exception and
tracing them out assuming they are really bad.

At this point, as a non-neutron person, i.e. not well versed in the
operations of neutron or how to debug it in great detail, I assume
something is bad here but I don't really know - and the logs are so
full of noise that I can't distinguish real failures.

I don't mean to pick on this particular change, but it's a good
example of a recent thing.

I'd like to know if this is all a known issue or WIP-type stuff. I've
complained about excessively noisy neutron logs in channel before
and I'm usually told that they are either necessary (for whatever
reason) or that rather than complain about the verbosity, I should
fix the race that is causing it - which is not likely to happen
since I don't have the async rpc happy nature of everything in
neutron in my head to debug it (I doubt many do).


I am not sure that's a fair statement: we usually point out that just
lowering log levels does not really solve the underlying issue
(whichever it may be), and that comment really should apply to any
project, not just Neutron. That said, we have had examples where we took
your input and drove the right fix ourselves.

We have a 'logging' tag for Neutron bugs that we use to identify these
types of cleanups. We need your attention to detail to alert us to
issues like these; we'll take care of the right fixes. Currently, the
queue is pretty dry. If you can top it up, that'll be great. Going off
on a log cleanup rampage doesn't seem like the best course of action;
I'd rather knock issues out one by one as they come, like the one you
just mentioned.

[1] https://bugs.launchpad.net/neutron/+bugs?field.tag=logging


Tagged:

https://bugs.launchpad.net/nova/+bug/1514935

Although it must not be an official tag since it didn't auto-complete 
for me, that should be fixed.





Anyway, this is a plea for sanity in the logs. There are logging
guidelines for openstack [7]. Let's please abide by them. Let's keep
operators in mind when we're looking at logs and be proactive about
making them useful (which includes more granular error handling and
less global try/except Exception: LOG.exception constructs).

Your point is duly noted. We have documented this, and we are being more
meticulous during reviews.


[1] http://tinyurl.com/ne3ex4v
[2]


Re: [openstack-dev] Can we get some sanity in the Neutron logs please?

2015-11-10 Thread Matt Riedemann



On 11/10/2015 1:10 PM, Matt Riedemann wrote:



On 11/10/2015 12:51 PM, Armando M. wrote:



On 10 November 2015 at 10:33, Matt Riedemann wrote:

Let me qualify by saying I'm not a Neutron person.

We know that gate-tempest-dsvm-neutron-full is failing hard as of
the last 24 hours [1].

An error that's been showing up in tempest runs with neutron a lot
is:

"AssertionError: 0 == 0 : No IPv4 addresses found in: []"

So checking logstash [2] it's hitting a lot. It's only recent
because that failure message is new to Tempest in the last day or
so, but it has a lot of hits, so whatever it is, it's failing a lot.

So the next step is usually digging into service logs looking for
errors. I check the q-svc logs first. Not many errors but a
bazillion warnings for things not found (networks and devices). [3]

For example:

2015-11-10 17:13:02.542 WARNING neutron.plugins.ml2.rpc
[req-15a73753-1512-4689-9404-9658a0cd0c09 None None] Device
aaa525be-14eb-44a5-beb0-ed722896be93 requested by agent
ovs-agent-devstack-trusty-rax-iad-5785199 not found in database

2015-11-10 17:14:17.754 WARNING neutron.api.rpc.handlers.dhcp_rpc
[req-3d7e9848-6151-4780-907f-43f11a2a8545 None None] Network
b07ad9b2-e63e-4459-879d-3721074704e5 could not be found, it might
have been deleted concurrently.

Are several hundred of these warnings useful to an operator trying
to debug a problem? The point of the CI gate testing is to try and
simulate a production cloud environment. When something goes wrong,
you check the logs. With the amount of warning/error level logging
that is in the neutron logs, finding a real problem is like looking
for a needle in a haystack. Since everything is async, 404s are
expected when racing to delete a resource and they should be handled
gracefully.

Anyway, the server log isn't useful so I go digging in the agent
logs and stacktraces there are aplenty. [4]

Particularly this:

"Exception: Port tapcea51630-e1 is not ready, resync needed"

That's due to a new change landing in the last 24 hours [5]. But the
trace shows up over 16K times since it landed [6].

Checking the code, it's basically a loop processing events and when
it hits an event it can't handle, it punts (breaking the loop so you
don't process the other events after it - which is a bug), and the
code that eventually handles it is just catching all Exception and
tracing them out assuming they are really bad.

At this point, as a non-neutron person, i.e. not well versed in the
operations of neutron or how to debug it in great detail, I assume
something is bad here but I don't really know - and the logs are so
full of noise that I can't distinguish real failures.

I don't mean to pick on this particular change, but it's a good
example of a recent thing.

I'd like to know if this is all a known issue or WIP-type stuff. I've
complained about excessively noisy neutron logs in channel before
and I'm usually told that they are either necessary (for whatever
reason) or that rather than complain about the verbosity, I should
fix the race that is causing it - which is not likely to happen
since I don't have the async rpc happy nature of everything in
neutron in my head to debug it (I doubt many do).


I am not sure that's a fair statement: we usually point out that just
lowering log levels does not really solve the underlying issue
(whichever it may be), and that comment really should apply to any
project, not just Neutron. That said, we have had examples where we took
your input and drove the right fix ourselves.

We have a 'logging' tag for Neutron bugs that we use to identify these
types of cleanups. We need your attention to detail to alert us to
issues like these; we'll take care of the right fixes. Currently, the
queue is pretty dry. If you can top it up, that'll be great. Going off
on a log cleanup rampage doesn't seem like the best course of action;
I'd rather knock issues out one by one as they come, like the one you
just mentioned.

[1] https://bugs.launchpad.net/neutron/+bugs?field.tag=logging


Tagged:

https://bugs.launchpad.net/nova/+bug/1514935

Although it must not be an official tag since it didn't auto-complete
for me, that should be fixed.


Crap, never mind that, my bad - I had created the bug against nova (force 
of habit) rather than neutron. Confirmed that 'logging' is a neutron tag 
in LP.







Anyway, this is a plea for sanity in the logs. There are logging
guidelines for openstack [7]. Let's please abide by them. Let's keep
operators in mind when we're looking at logs and be proactive about
making them useful (which includes more granular error handling and
less global try/except Exception: LOG.exception constructs).

Your point is duly noted. We have 

[openstack-dev] Can we get some sanity in the Neutron logs please?

2015-11-10 Thread Matt Riedemann

Let me qualify by saying I'm not a Neutron person.

We know that gate-tempest-dsvm-neutron-full is failing hard as of the 
last 24 hours [1].


An error that's been showing up in tempest runs with neutron a lot is:

"AssertionError: 0 == 0 : No IPv4 addresses found in: []"

So checking logstash [2] it's hitting a lot. It's only recent because 
that failure message is new to Tempest in the last day or so, but it has 
a lot of hits, so whatever it is, it's failing a lot.


So the next step is usually digging into service logs looking for 
errors. I check the q-svc logs first. Not many errors but a bazillion 
warnings for things not found (networks and devices). [3]


For example:

2015-11-10 17:13:02.542 WARNING neutron.plugins.ml2.rpc 
[req-15a73753-1512-4689-9404-9658a0cd0c09 None None] Device 
aaa525be-14eb-44a5-beb0-ed722896be93 requested by agent 
ovs-agent-devstack-trusty-rax-iad-5785199 not found in database


2015-11-10 17:14:17.754 WARNING neutron.api.rpc.handlers.dhcp_rpc 
[req-3d7e9848-6151-4780-907f-43f11a2a8545 None None] Network 
b07ad9b2-e63e-4459-879d-3721074704e5 could not be found, it might have 
been deleted concurrently.


Are several hundred of these warnings useful to an operator trying to 
debug a problem? The point of the CI gate testing is to try and simulate 
a production cloud environment. When something goes wrong, you check the 
logs. With the amount of warning/error level logging that is in the 
neutron logs, finding a real problem is like looking for a needle in a 
haystack. Since everything is async, 404s are expected when racing to 
delete a resource and they should be handled gracefully.
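
To make "handled gracefully" concrete, here is a rough Python sketch of
that kind of handling. It is not actual Neutron code; the plugin call and
the exception class are invented for illustration:

import logging

LOG = logging.getLogger(__name__)


class NetworkNotFound(Exception):
    """Stand-in for a 'resource not found' error raised by the plugin."""


def get_network_for_dhcp(plugin, network_id):
    try:
        return plugin.get_network(network_id)  # hypothetical plugin call
    except NetworkNotFound:
        # Expected race: the network was deleted while the agent was still
        # asking about it. Log quietly and move on; this is not an
        # operator-actionable event.
        LOG.debug("Network %s not found; it was likely deleted "
                  "concurrently.", network_id)
        return None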


Anyway, the server log isn't useful so I go digging in the agent logs 
and stacktraces there are aplenty. [4]


Particularly this:

"Exception: Port tapcea51630-e1 is not ready, resync needed"

That's due to a new change landing in the last 24 hours [5]. But the 
trace shows up over 16K times since it landed [6].


Checking the code, it's basically a loop processing events and when it 
hits an event it can't handle, it punts (breaking the loop so you don't 
process the other events after it - which is a bug), and the code that 
eventually handles it is just catching all Exception and tracing them 
out assuming they are really bad.
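
For illustration, this is roughly the shape of the pattern described above
and the two fixes it implies. It is only a sketch, not the actual agent
code; the event and handler names are invented:

import logging

LOG = logging.getLogger(__name__)


class PortNotReady(Exception):
    """Stand-in for the 'port is not ready, resync needed' condition."""


def process_events_current(events, handle):
    for event in events:
        try:
            handle(event)
        except Exception:
            # Anti-pattern: every failure, expected or not, becomes a full
            # ERROR traceback, and the remaining events are never processed.
            LOG.exception("Error while processing %s", event)
            break


def process_events_better(events, handle):
    needs_resync = False
    for event in events:
        try:
            handle(event)
        except PortNotReady as exc:
            # Known, recoverable condition: note it at DEBUG and keep going
            # so the events queued after this one are still processed.
            LOG.debug("Port not ready, will resync: %s", exc)
            needs_resync = True
        except Exception:
            # Only genuinely unexpected failures deserve a traceback.
            LOG.exception("Unexpected error while processing %s", event)
            needs_resync = True
    return needs_resync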


At this point, as a non-neutron person, i.e. not well versed in the 
operations of neutron or how to debug it in great detail, I assume 
something is bad here but I don't really know - and the logs are so full 
of noise that I can't distinguish real failures.


I don't mean to pick on this particular change, but it's a good example 
of a recent thing.


I'd like to know if this is all a known issue or WIP-type stuff. I've 
complained about excessively noisy neutron logs in channel before and 
I'm usually told that they are either necessary (for whatever reason) or 
that rather than complain about the verbosity, I should fix the race 
that is causing it - which is not likely to happen since I don't have 
the async rpc happy nature of everything in neutron in my head to debug 
it (I doubt many do).


Anyway, this is a plea for sanity in the logs. There are logging 
guidelines for openstack [7]. Let's please abide by them. Let's keep 
operators in mind when we're looking at logs and be proactive about 
making them useful (which includes more granular error handling and less 
global try/except Exception: LOG.exception constructs).
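
By "more granular error handling" I mean keeping the try block narrow,
around the single call that can fail and the specific exception it can
raise, instead of blanketing a whole code path. A minimal sketch, with
invented names rather than real Neutron APIs:

import logging

LOG = logging.getLogger(__name__)


class DeviceNotFound(Exception):
    """Stand-in for a specific, anticipated failure."""


def update_device_coarse(plugin, device_id):
    # The pattern to avoid: any failure anywhere in the path gets the same
    # catch-all treatment and a full traceback.
    try:
        device = plugin.get_device(device_id)
        plugin.wire(device)
        plugin.notify(device)
    except Exception:
        LOG.exception("Failed to update device %s", device_id)


def update_device_granular(plugin, device_id):
    try:
        device = plugin.get_device(device_id)
    except DeviceNotFound:
        # Anticipated outcome of an async delete; keep it out of the
        # warning/error stream.
        LOG.debug("Device %s is gone, skipping update.", device_id)
        return
    # Let genuinely unexpected failures propagate to a single, well-placed
    # handler further up the stack instead of swallowing them here.
    plugin.wire(device)
    plugin.notify(device)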


[1] http://tinyurl.com/ne3ex4v
[2] 
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22AssertionError:%200%20==%200%20:%20No%20IPv4%20addresses%20found%20in:%20%5B%5D%5C%22%20AND%20tags:%5C%22console%5C%22
[3] 
http://logs.openstack.org/85/239885/2/gate/gate-tempest-dsvm-neutron-full/602d864/logs/screen-q-svc.txt.gz?level=TRACE
[4] 
http://logs.openstack.org/85/239885/2/gate/gate-tempest-dsvm-neutron-full/602d864/logs/screen-q-agt.txt.gz?level=TRACE

[5] https://review.openstack.org/#/c/164880/
[6] 
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22Exception:%20Port%5C%22%20AND%20message:%5C%22is%20not%20ready,%20resync%20needed%5C%22%20AND%20tags:%5C%22screen-q-agt.txt%5C%22
[7] 
http://specs.openstack.org/openstack/openstack-specs/specs/log-guidelines.html


--

Thanks,

Matt Riedemann


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Can we get some sanity in the Neutron logs please?

2015-11-10 Thread Armando M.
On 10 November 2015 at 10:33, Matt Riedemann wrote:

> Let me qualify by saying I'm not a Neutron person.
>
> We know that gate-tempest-dsvm-neutron-full is failing hard as of the last
> 24 hours [1].
>
> An error that's been showing up in tempest runs with neutron a lot is:
>
> "AssertionError: 0 == 0 : No IPv4 addresses found in: []"
>
> So checking logstash [2] it's hitting a lot. It's only recent because that
> failure message is new to Tempest in the last day or so, but it has a lot
> of hits, so whatever it is, it's failing a lot.
>
> So the next step is usually digging into service logs looking for errors.
> I check the q-svc logs first. Not many errors but a bazillion warnings for
> things not found (networks and devices). [3]
>
> For example:
>
> 2015-11-10 17:13:02.542 WARNING neutron.plugins.ml2.rpc
> [req-15a73753-1512-4689-9404-9658a0cd0c09 None None] Device
> aaa525be-14eb-44a5-beb0-ed722896be93 requested by agent
> ovs-agent-devstack-trusty-rax-iad-5785199 not found in database
>
> 2015-11-10 17:14:17.754 WARNING neutron.api.rpc.handlers.dhcp_rpc
> [req-3d7e9848-6151-4780-907f-43f11a2a8545 None None] Network
> b07ad9b2-e63e-4459-879d-3721074704e5 could not be found, it might have been
> deleted concurrently.
>
> Are several hundred of these warnings useful to an operator trying to
> debug a problem? The point of the CI gate testing is to try and simulate a
> production cloud environment. When something goes wrong, you check the
> logs. With the amount of warning/error level logging that is in the neutron
> logs, finding a real problem is like looking for a needle in a haystack.
> Since everything is async, 404s are expected when racing to delete a
> resource and they should be handled gracefully.
>
> Anyway, the server log isn't useful so I go digging in the agent logs and
> stacktraces there are aplenty. [4]
>
> Particularly this:
>
> "Exception: Port tapcea51630-e1 is not ready, resync needed"
>
> That's due to a new change landing in the last 24 hours [5]. But the trace
> shows up over 16K times since it landed [6].
>
> Checking the code, it's basically a loop processing events and when it
> hits an event it can't handle, it punts (breaking the loop so you don't
> process the other events after it - which is a bug), and the code that
> eventually handles it is just catching all Exception and tracing them out
> assuming they are really bad.
>
> At this point, as a non-neutron person, i.e. not well versed in the
> operations of neutron or how to debug it in great detail, I assume
> something is bad here but I don't really know - and the logs are so full of
> noise that I can't distinguish real failures.
>
> I don't mean to pick on this particular change, but it's a good example of
> a recent thing.
>
> I'd like to know if this is all a known issue or WIP-type stuff. I've
> complained about excessively noisy neutron logs in channel before and I'm
> usually told that they are either necessary (for whatever reason) or that
> rather than complain about the verbosity, I should fix the race that is
> causing it - which is not likely to happen since I don't have the async rpc
> happy nature of everything in neutron in my head to debug it (I doubt many
> do).
>

I am not sure that's a fair statement: we usually point out that just
lowering log levels does not really solve the underlying issue (whichever
it may be), and that comment really should apply to any project, not just
Neutron. That said, we have had examples where we took your input and drove
the right fix ourselves.

We have a 'logging' tag for Neutron bugs that we use to identify these types
of cleanups. We need your attention to detail to alert us to issues like
these; we'll take care of the right fixes. Currently, the queue is pretty
dry. If you can top it up, that'll be great. Going off on a log cleanup
rampage doesn't seem like the best course of action; I'd rather knock
issues out one by one as they come, like the one you just mentioned.

[1] https://bugs.launchpad.net/neutron/+bugs?field.tag=logging


>
> Anyway, this is a plea for sanity in the logs. There are logging
> guidelines for openstack [7]. Let's please abide by them. Let's keep
> operators in mind when we're looking at logs and be proactive about making
> them useful (which includes more granular error handling and less global
> try/except Exception: LOG.exception constructs).
>

Your point is duly noted. We have documented this, and we are being more
meticulous during reviews.


> [1] http://tinyurl.com/ne3ex4v
> [2]
> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22AssertionError:%200%20==%200%20:%20No%20IPv4%20addresses%20found%20in:%20%5B%5D%5C%22%20AND%20tags:%5C%22console%5C%22
> [3]
> http://logs.openstack.org/85/239885/2/gate/gate-tempest-dsvm-neutron-full/602d864/logs/screen-q-svc.txt.gz?level=TRACE
> [4]
> 

Re: [openstack-dev] Can we get some sanity in the Neutron logs please?

2015-11-10 Thread Armando M.
On 10 November 2015 at 11:12, Matt Riedemann wrote:

>
>
> On 11/10/2015 1:10 PM, Matt Riedemann wrote:
>
>>
>>
>> On 11/10/2015 12:51 PM, Armando M. wrote:
>>
>>>
>>>
>>> On 10 November 2015 at 10:33, Matt Riedemann wrote:
>>>
>>> Let me qualify by saying I'm not a Neutron person.
>>>
>>> We know that gate-tempest-dsvm-neutron-full is failing hard as of
>>> the last 24 hours [1].
>>>
>>> An error that's been showing up in tempest runs with neutron a lot
>>> is:
>>>
>>> "AssertionError: 0 == 0 : No IPv4 addresses found in: []"
>>>
>>> So checking logstash [2] it's hitting a lot. It's only recent
>>> because that failure message is new to Tempest in the last day or
>>> so, but it has a lot of hits, so whatever it is, it's failing a lot.
>>>
>>> So the next step is usually digging into service logs looking for
>>> errors. I check the q-svc logs first. Not many errors but a
>>> bazillion warnings for things not found (networks and devices). [3]
>>>
>>> For example:
>>>
>>> 2015-11-10 17:13:02.542 WARNING neutron.plugins.ml2.rpc
>>> [req-15a73753-1512-4689-9404-9658a0cd0c09 None None] Device
>>> aaa525be-14eb-44a5-beb0-ed722896be93 requested by agent
>>> ovs-agent-devstack-trusty-rax-iad-5785199 not found in database
>>>
>>> 2015-11-10 17:14:17.754 WARNING neutron.api.rpc.handlers.dhcp_rpc
>>> [req-3d7e9848-6151-4780-907f-43f11a2a8545 None None] Network
>>> b07ad9b2-e63e-4459-879d-3721074704e5 could not be found, it might
>>> have been deleted concurrently.
>>>
>>> Are several hundred of these warnings useful to an operator trying
>>> to debug a problem? The point of the CI gate testing is to try and
>>> simulate a production cloud environment. When something goes wrong,
>>> you check the logs. With the amount of warning/error level logging
>>> that is in the neutron logs, finding a real problem is like looking
>>> for a needle in a haystack. Since everything is async, 404s are
>>> expected when racing to delete a resource and they should be handled
>>> gracefully.
>>>
>>> Anyway, the server log isn't useful so I go digging in the agent
>>> logs and stacktraces there are aplenty. [4]
>>>
>>> Particularly this:
>>>
>>> "Exception: Port tapcea51630-e1 is not ready, resync needed"
>>>
>>> That's due to a new change landing in the last 24 hours [5]. But the
>>> trace shows up over 16K times since it landed [6].
>>>
>>> Checking the code, it's basically a loop processing events and when
>>> it hits an event it can't handle, it punts (breaking the loop so you
>>> don't process the other events after it - which is a bug), and the
>>> code that eventually handles it is just catching all Exception and
>>> tracing them out assuming they are really bad.
>>>
>>> At this point, as a non-neutron person, i.e. not well versed in the
>>> operations of neutron or how to debug it in great detail, I assume
>>> something is bad here but I don't really know - and the logs are so
>>> full of noise that I can't distinguish real failures.
>>>
>>> I don't mean to pick on this particular change, but it's a good
>>> example of a recent thing.
>>>
>>> I'd like to know if this is all a known issue or WIP-type stuff. I've
>>> complained about excessively noisy neutron logs in channel before
>>> and I'm usually told that they are either necessary (for whatever
>>> reason) or that rather than complain about the verbosity, I should
>>> fix the race that is causing it - which is not likely to happen
>>> since I don't have the async rpc happy nature of everything in
>>> neutron in my head to debug it (I doubt many do).
>>>
>>>
>>> I am not sure that's a fair statement: we usually point out that just
>>> lowering log levels does not really solve the underlying issue
>>> (whichever it may be), and that comment really should apply to any
>>> project, not just Neutron. That said, we have had examples where we took
>>> your input and drove the right fix ourselves.
>>>
>>> We have a 'logging' tag for Neutron bugs that we use to identify these
>>> types of cleanups. We need your attention to detail to alert us to
>>> issues like these; we'll take care of the right fixes. Currently, the
>>> queue is pretty dry. If you can top it up, that'll be great. Going off
>>> on a log cleanup rampage doesn't seem like the best course of action;
>>> I'd rather knock issues out one by one as they come, like the one you
>>> just mentioned.
>>>
>>> [1] https://bugs.launchpad.net/neutron/+bugs?field.tag=logging
>>>
>>
>> Tagged:
>>
>> https://bugs.launchpad.net/nova/+bug/1514935
>>
>> Although it must not be an official tag since it didn't auto-complete
>> for me, that should be fixed.
>>
>
> Crap, never mind that, my bad - I had created the bug against nova (force
> of habit) 

Re: [openstack-dev] Can we get some sanity in the Neutron logs please?

2015-11-10 Thread Armando M.
On 10 November 2015 at 11:10, Matt Riedemann wrote:

>
>
> On 11/10/2015 12:51 PM, Armando M. wrote:
>
>>
>>
>> On 10 November 2015 at 10:33, Matt Riedemann wrote:
>>
>> Let me qualify by saying I'm not a Neutron person.
>>
>> We know that gate-tempest-dsvm-neutron-full is failing hard as of
>> the last 24 hours [1].
>>
>> An error that's been showing up in tempest runs with neutron a lot is:
>>
>> "AssertionError: 0 == 0 : No IPv4 addresses found in: []"
>>
>> So checking logstash [2] it's hitting a lot. It's only recent
>> because that failure message is new to Tempest in the last day or
>> so, but it has a lot of hits, so whatever it is, it's failing a lot.
>>
>> So the next step is usually digging into service logs looking for
>> errors. I check the q-svc logs first. Not many errors but a
>> bazillion warnings for things not found (networks and devices). [3]
>>
>> For example:
>>
>> 2015-11-10 17:13:02.542 WARNING neutron.plugins.ml2.rpc
>> [req-15a73753-1512-4689-9404-9658a0cd0c09 None None] Device
>> aaa525be-14eb-44a5-beb0-ed722896be93 requested by agent
>> ovs-agent-devstack-trusty-rax-iad-5785199 not found in database
>>
>> 2015-11-10 17:14:17.754 WARNING neutron.api.rpc.handlers.dhcp_rpc
>> [req-3d7e9848-6151-4780-907f-43f11a2a8545 None None] Network
>> b07ad9b2-e63e-4459-879d-3721074704e5 could not be found, it might
>> have been deleted concurrently.
>>
>> Are several hundred of these warnings useful to an operator trying
>> to debug a problem? The point of the CI gate testing is to try and
>> simulate a production cloud environment. When something goes wrong,
>> you check the logs. With the amount of warning/error level logging
>> that is in the neutron logs, finding a real problem is like looking
>> for a needle in a haystack. Since everything is async, 404s are
>> expected when racing to delete a resource and they should be handled
>> gracefully.
>>
>> Anyway, the server log isn't useful so I go digging in the agent
>> logs and stacktraces there are aplenty. [4]
>>
>> Particularly this:
>>
>> "Exception: Port tapcea51630-e1 is not ready, resync needed"
>>
>> That's due to a new change landing in the last 24 hours [5]. But the
>> trace shows up over 16K times since it landed [6].
>>
>> Checking the code, it's basically a loop processing events and when
>> it hits an event it can't handle, it punts (breaking the loop so you
>> don't process the other events after it - which is a bug), and the
>> code that eventually handles it is just catching all Exception and
>> tracing them out assuming they are really bad.
>>
>> At this point, as a non-neutron person, i.e. not well versed in the
>> operations of neutron or how to debug it in great detail, I assume
>> something is bad here but I don't really know - and the logs are so
>> full of noise that I can't distinguish real failures.
>>
>> I don't mean to pick on this particular change, but it's a good
>> example of a recent thing.
>>
>> I'd like to know if this is all a known issue or WIP-type stuff. I've
>> complained about excessively noisy neutron logs in channel before
>> and I'm usually told that they are either necessary (for whatever
>> reason) or that rather than complain about the verbosity, I should
>> fix the race that is causing it - which is not likely to happen
>> since I don't have the async rpc happy nature of everything in
>> neutron in my head to debug it (I doubt many do).
>>
>>
>> I am not sure that's a fair statement: we usually point out that just
>> lowering log levels does not really solve the underlying issue
>> (whichever it may be), and that comment really should apply to any
>> project, not just Neutron. That said, we have had examples where we took
>> your input and drove the right fix ourselves.
>>
>> We have a 'logging' tag for Neutron bugs that we use to identify these
>> types of cleanups. We need your attention to detail to alert us to
>> issues like these; we'll take care of the right fixes. Currently, the
>> queue is pretty dry. If you can top it up, that'll be great. Going off
>> on a log cleanup rampage doesn't seem like the best course of action;
>> I'd rather knock issues out one by one as they come, like the one you
>> just mentioned.
>>
>> [1] https://bugs.launchpad.net/neutron/+bugs?field.tag=logging
>>
>
> Tagged:
>
> https://bugs.launchpad.net/nova/+bug/1514935
>
> Although it must not be an official tag since it didn't auto-complete for
> me, that should be fixed.
>

It is... that's perhaps because you tried to tag it whilst the bug was
targeting nova? BTW, thanks!


>
>
>>
>> Anyway, this is a plea for sanity in the logs. There are logging
>> guidelines for openstack [7]. Let's please abide by them. Let's