Re: [openstack-dev] [ceilometer] [qa] Ceilometer ERRORS in normal runs
On Sun, Oct 20, 2013 at 7:38 AM, Sean Dague s...@dague.net wrote:

> Dave Kranz has been building a system so that we can ensure that during a Tempest run services don't spew ERRORs in the logs. Eventually, we're going to gate on this, because there is nothing that Tempest does to the system that should cause any OpenStack service to ERROR or stack trace (Errors should actually be exceptional events indicating that something is wrong with the system, not regular events).

So I have to disagree with the approach being taken here, particularly in the case of Cinder and the negative tests that are in place. When I read this last week I assumed you actually meant that Exceptions were exceptional and nothing in Tempest should cause Exceptions. It turns out you apparently did mean Errors. I completely disagree here: Errors happen; some are recovered, some are expected by the tests, etc. Having a policy, and especially a gate, that says NO ERROR MESSAGE in logs makes absolutely no sense to me. Something like NO TRACE/EXCEPTION MESSAGE in logs I can agree with, but this makes no sense to me.

By the way, here's a perfect example: https://bugs.launchpad.net/cinder/+bug/1243485

As long as we have Tempest tests that do things like show a non-existent volume, you're going to get an Error message, and quite frankly I think that you should.

> Ceilometer is currently one of the largest offenders in dumping ERRORs in the gate - http://logs.openstack.org/68/52768/1/check/check-tempest-devstack-vm-full/76f83a4/console.html#_2013-10-19_14_51_51_271 (that item isn't in our whitelist yet, so you'll see a lot of it at the end of every run) and http://logs.openstack.org/68/52768/1/check/check-tempest-devstack-vm-full/76f83a4/logs/screen-ceilometer-collector.txt.gz?level=TRACE for full details.
>
> This seems like something is wrong in the integration, and it would be really helpful if we could get ceilometer eyes on this one to put ceilo into a non-erroring state.
>
> -Sean
>
> --
> Sean Dague
> http://dague.net

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [ceilometer] [qa] Ceilometer ERRORS in normal runs
On 10/23/2013 10:40 AM, John Griffith wrote:

> On Sun, Oct 20, 2013 at 7:38 AM, Sean Dague s...@dague.net wrote:
>
>> Dave Kranz has been building a system so that we can ensure that during a Tempest run services don't spew ERRORs in the logs. Eventually, we're going to gate on this, because there is nothing that Tempest does to the system that should cause any OpenStack service to ERROR or stack trace (Errors should actually be exceptional events indicating that something is wrong with the system, not regular events).
>
> So I have to disagree with the approach being taken here, particularly in the case of Cinder and the negative tests that are in place. When I read this last week I assumed you actually meant that Exceptions were exceptional and nothing in Tempest should cause Exceptions. It turns out you apparently did mean Errors. I completely disagree here: Errors happen; some are recovered, some are expected by the tests, etc. Having a policy, and especially a gate, that says NO ERROR MESSAGE in logs makes absolutely no sense to me. Something like NO TRACE/EXCEPTION MESSAGE in logs I can agree with, but this makes no sense to me.
>
> By the way, here's a perfect example: https://bugs.launchpad.net/cinder/+bug/1243485
>
> As long as we have Tempest tests that do things like show a non-existent volume, you're going to get an Error message, and quite frankly I think that you should.

Ok, I guess that's where we probably need to clarify what Not Found is. Because Not Found to me seems like it should be a request at INFO level, not ERROR. ERROR from an admin perspective should really be something that would be suitable for sending an alert to an administrator for them to come and fix the cloud.

TRACE is actually a lower level of severity in our log systems than ERROR is.

-Sean

--
Sean Dague
http://dague.net
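To make the distinction in this message concrete, here is a minimal sketch using Python's standard logging module. It is not taken from any OpenStack service: the logger name, the AlertHandler class, and both messages are made up for illustration. It assumes the policy described above, where a Not Found response is logged at INFO and only ERROR records are routed to whatever alerts an administrator.

```python
import logging


class AlertHandler(logging.Handler):
    """Hypothetical stand-in for an operator paging system."""

    def __init__(self):
        # Only records at ERROR or above are passed to emit().
        super().__init__(level=logging.ERROR)
        self.alerts = []

    def emit(self, record):
        self.alerts.append(record.getMessage())


log = logging.getLogger("volume.api.sketch")  # hypothetical logger name
log.setLevel(logging.INFO)
log.propagate = False  # keep the sketch's records away from the root logger
alert_handler = AlertHandler()
log.addHandler(alert_handler)

# A client asked for something that does not exist: routine, so INFO.
log.info("volume %s not found, returning 404", "vol-123")

# The service itself is unhealthy: this is what should alert an operator.
log.error("volume backend unreachable: %s", "connection timed out")

print(alert_handler.alerts)
```

Only the second message ends up in the alert list. Whether a given Not Found belongs at INFO is exactly what the thread goes on to debate.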
Re: [openstack-dev] [ceilometer] [qa] Ceilometer ERRORS in normal runs
On Wed, Oct 23, 2013 at 8:47 AM, Sean Dague s...@dague.net wrote:

> On 10/23/2013 10:40 AM, John Griffith wrote:
>
>> On Sun, Oct 20, 2013 at 7:38 AM, Sean Dague s...@dague.net wrote:
>>
>>> Dave Kranz has been building a system so that we can ensure that during a Tempest run services don't spew ERRORs in the logs. Eventually, we're going to gate on this, because there is nothing that Tempest does to the system that should cause any OpenStack service to ERROR or stack trace (Errors should actually be exceptional events indicating that something is wrong with the system, not regular events).
>>
>> So I have to disagree with the approach being taken here, particularly in the case of Cinder and the negative tests that are in place. When I read this last week I assumed you actually meant that Exceptions were exceptional and nothing in Tempest should cause Exceptions. It turns out you apparently did mean Errors. I completely disagree here: Errors happen; some are recovered, some are expected by the tests, etc. Having a policy, and especially a gate, that says NO ERROR MESSAGE in logs makes absolutely no sense to me. Something like NO TRACE/EXCEPTION MESSAGE in logs I can agree with, but this makes no sense to me.
>>
>> By the way, here's a perfect example: https://bugs.launchpad.net/cinder/+bug/1243485
>>
>> As long as we have Tempest tests that do things like show a non-existent volume, you're going to get an Error message, and quite frankly I think that you should.
>
> Ok, I guess that's where we probably need to clarify what Not Found is. Because Not Found to me seems like it should be a request at INFO level, not ERROR. ERROR from an admin perspective should really be something that would be suitable for sending an alert to an administrator for them to come and fix the cloud.
>
> TRACE is actually a lower level of severity in our log systems than ERROR is.

Sorry, by Trace I was referring to unhandled stack/exception trace messages in the logs.

> -Sean
>
> --
> Sean Dague
> http://dague.net
Re: [openstack-dev] [ceilometer] [qa] Ceilometer ERRORS in normal runs
John Griffith wrote:

> On Wed, Oct 23, 2013 at 8:47 AM, Sean Dague s...@dague.net wrote:
>
>> On 10/23/2013 10:40 AM, John Griffith wrote:
>>
>>> On Sun, Oct 20, 2013 at 7:38 AM, Sean Dague s...@dague.net wrote:
>>>
>>>> Dave Kranz has been building a system so that we can ensure that during a Tempest run services don't spew ERRORs in the logs. Eventually, we're going to gate on this, because there is nothing that Tempest does to the system that should cause any OpenStack service to ERROR or stack trace (Errors should actually be exceptional events indicating that something is wrong with the system, not regular events).
>>>
>>> So I have to disagree with the approach being taken here, particularly in the case of Cinder and the negative tests that are in place. When I read this last week I assumed you actually meant that Exceptions were exceptional and nothing in Tempest should cause Exceptions. It turns out you apparently did mean Errors. I completely disagree here: Errors happen; some are recovered, some are expected by the tests, etc. Having a policy, and especially a gate, that says NO ERROR MESSAGE in logs makes absolutely no sense to me. Something like NO TRACE/EXCEPTION MESSAGE in logs I can agree with, but this makes no sense to me.
>>>
>>> By the way, here's a perfect example: https://bugs.launchpad.net/cinder/+bug/1243485
>>>
>>> As long as we have Tempest tests that do things like show a non-existent volume, you're going to get an Error message, and quite frankly I think that you should.
>>
>> Ok, I guess that's where we probably need to clarify what Not Found is. Because Not Found to me seems like it should be a request at INFO level, not ERROR. ERROR from an admin perspective should really be something that would be suitable for sending an alert to an administrator for them to come and fix the cloud.

From my perspective as someone who has done Ops in the past, a Volume Not Found can be either info or an error. It all depends on the context.

That said, we need to be able to test ERROR conditions and ensure that they report properly as ERROR, else the poor Ops folks will always be on the spot for not knowing that there is a problem. A volume that has gone missing is a problem. Ops would like an immediate report. They would trigger on the ERROR statement in the log. On the other hand, if someone/thing fat-fingers an input and requests something that has never existed, then that's just info.

We need to be able to test for correctness of errors and process logs with errors in them as part of the test verification. Perhaps a switch in the test that indicates the log needs post-processing, or a way to redirect the log during a specific error test, or some such?

The question is, how do we keep test system logs clean of ERRORs and still test system logs for intentionally triggered ERRORs?

--Rocky

>> TRACE is actually a lower level of severity in our log systems than ERROR is.
>
> Sorry, by Trace I was referring to unhandled stack/exception trace messages in the logs.
>
>> -Sean
>>
>> --
>> Sean Dague
>> http://dague.net
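One way to picture the "redirect the log during a specific error test" idea from this message is the sketch below. It is only an illustration in plain Python logging, not how Tempest or any OpenStack service works: expected_errors, _Collector, and the logger name are hypothetical. Inside the block, ERROR records go to a private collector that the test can assert on; outside it, they reach the regular log as usual.

```python
import contextlib
import logging


class _Collector(logging.Handler):
    """Keeps ERROR-and-above records in a list (hypothetical helper)."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.records = []

    def emit(self, record):
        self.records.append(record)


@contextlib.contextmanager
def expected_errors(logger):
    """Divert the logger's output to a collector for the duration of a test."""
    collector = _Collector()
    saved = list(logger.handlers)
    for handler in saved:
        logger.removeHandler(handler)
    logger.addHandler(collector)
    try:
        yield collector.records
    finally:
        # Restore the regular handlers even if the test body raised.
        logger.removeHandler(collector)
        for handler in saved:
            logger.addHandler(handler)


service_log = logging.getLogger("service.sketch")  # hypothetical name
service_log.propagate = False
main_sink = _Collector()  # stands in for the regular service log
service_log.addHandler(main_sink)

with expected_errors(service_log) as seen:
    # Intentionally triggered error: verified by the test, kept out of the log.
    service_log.error("volume %s has gone missing", "vol-123")

# Not expected by any test: lands in the regular log.
service_log.error("unexpected failure")

print([r.getMessage() for r in seen])
print([r.getMessage() for r in main_sink.records])
```

This only works when the test and the service share a process, which is not the case for a Tempest run against devstack; there, the equivalent would have to be some form of log post-processing.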
Re: [openstack-dev] [ceilometer] [qa] Ceilometer ERRORS in normal runs
On 10/23/2013 05:08 PM, Rochelle.Grober wrote:

> John Griffith wrote:
>
>> On Wed, Oct 23, 2013 at 8:47 AM, Sean Dague s...@dague.net wrote:
>>
>>> On 10/23/2013 10:40 AM, John Griffith wrote:
>>>
>>>> On Sun, Oct 20, 2013 at 7:38 AM, Sean Dague s...@dague.net wrote:
>>>>
>>>>> Dave Kranz has been building a system so that we can ensure that during a Tempest run services don't spew ERRORs in the logs. Eventually, we're going to gate on this, because there is nothing that Tempest does to the system that should cause any OpenStack service to ERROR or stack trace (Errors should actually be exceptional events indicating that something is wrong with the system, not regular events).
>>>>
>>>> So I have to disagree with the approach being taken here, particularly in the case of Cinder and the negative tests that are in place. When I read this last week I assumed you actually meant that Exceptions were exceptional and nothing in Tempest should cause Exceptions. It turns out you apparently did mean Errors. I completely disagree here: Errors happen; some are recovered, some are expected by the tests, etc. Having a policy, and especially a gate, that says NO ERROR MESSAGE in logs makes absolutely no sense to me. Something like NO TRACE/EXCEPTION MESSAGE in logs I can agree with, but this makes no sense to me.
>>>>
>>>> By the way, here's a perfect example: https://bugs.launchpad.net/cinder/+bug/1243485
>>>>
>>>> As long as we have Tempest tests that do things like show a non-existent volume, you're going to get an Error message, and quite frankly I think that you should.
>>>
>>> Ok, I guess that's where we probably need to clarify what Not Found is. Because Not Found to me seems like it should be a request at INFO level, not ERROR. ERROR from an admin perspective should really be something that would be suitable for sending an alert to an administrator for them to come and fix the cloud.
>
> From my perspective as someone who has done Ops in the past, a Volume Not Found can be either info or an error. It all depends on the context.
>
> That said, we need to be able to test ERROR conditions and ensure that they report properly as ERROR, else the poor Ops folks will always be on the spot for not knowing that there is a problem. A volume that has gone missing is a problem. Ops would like an immediate report. They would trigger on the ERROR statement in the log. On the other hand, if someone/thing fat-fingers an input and requests something that has never existed, then that's just info.

It is not just a case of fat fingers. Some of the delete APIs are asynchronous, and the only way to know that a delete finished is to check if the object still exists. Tempest would do such checks to manage resource usage even if there were no negative tests. The logs are not full of ERRORs because almost all of our APIs, including nova, do not log an ERROR when returning 404.

I think John's point is that it can be hard or impossible to tell if an object is not found because it truly no longer exists (or never existed), or if there is something wrong with the system and the object really exists but is not being found. But I would argue that even if this is true we cannot alert the operator every time some user checks to see if an object is still there. So there has to be something that gets put in the log which says there is a problem with the system, whether a bug, running out of disk, or something else. The appearance of that thing in the log is what an alert should be triggered on, and what should fail a gate job. That is pretty close to what ERROR is being used for now.

> We need to be able to test for correctness of errors and process logs with errors in them as part of the test verification. Perhaps a switch in the test that indicates the log needs post-processing, or a way to redirect the log during a specific error test, or some such?
>
> The question is, how do we keep test system logs clean of ERRORs and still test system logs for intentionally triggered ERRORs?
>
> --Rocky

We might be able to do that in our test framework, but it would not help operators. IMO the least of evils here by far is to log events associated with an API call that returns 4xx in a way that is distinguishable from how we log when we detect a system failure of some sort.

-David
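The closing suggestion in this message, logging 4xx responses in a way that is distinguishable from system failures, could be sketched as a simple mapping from status code to log level. The function name and the exact policy below are assumptions made for illustration; the thread does not settle on specific levels, and no OpenStack service is claimed to implement it this way.

```python
import logging


def level_for_status(status):
    """Pick a log level from an HTTP status code (illustrative policy only)."""
    if 400 <= status < 500:
        return logging.INFO   # the request was the problem, not the service
    if status >= 500:
        return logging.ERROR  # the service failed; worth an alert
    return logging.DEBUG      # normal traffic


for status in (200, 404, 500):
    print(status, logging.getLevelName(level_for_status(status)))
```

Under this policy an alert (or a gate check) keyed on ERROR would ignore a poll for a deleted object that returns 404, while still firing on a 500.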
Re: [openstack-dev] [ceilometer] [qa] Ceilometer ERRORS in normal runs
Hi,

On Mon, Oct 21, 2013 at 02:27:44PM +, Neal, Phil wrote:

> https://blueprints.launchpad.net/tempest/+spec/basic-tempest-integration-for-ceilometer

Some work has been done under another blueprint: https://blueprints.launchpad.net/tempest/+spec/add-basic-ceilometer-tests

Most of this code was written a while ago and needs to be rebased. Some of it has also exposed bugs in ceilometer. The bugs discovered in the gate already have fixes in gerrit, which should be merged soon.

> -Original Message-
> From: Sean Dague [mailto:s...@dague.net]
> Sent: Sunday, October 20, 2013 7:39 AM
> To: OpenStack Development Mailing List
>
> Ceilometer is currently one of the largest offenders in dumping ERRORs in the gate - http://logs.openstack.org/68/52768/1/check/check-tempest-devstack-vm-full/76f83a4/console.html#_2013-10-19_14_51_51_271 (that item isn't in our whitelist yet, so you'll see a lot of it at the end of every run) and http://logs.openstack.org/68/52768/1/check/check-tempest-devstack-vm-full/76f83a4/logs/screen-ceilometer-collector.txt.gz?level=TRACE for full details

I plan to take a look at this this week.

Regards,

--
Mehdi Abaakouk
mail: sil...@sileht.net
irc: sileht
Re: [openstack-dev] [ceilometer] [qa] Ceilometer ERRORS in normal runs
On 10/21/2013 10:27 AM, Neal, Phil wrote:

> Sean, we currently have a BP out there to investigate basic tempest integration and I think this might fall under the same umbrella. Unfortunately I've not been able to free up my development time for it, but I've assigned it out to someone who can take a look and report back.
>
> https://blueprints.launchpad.net/tempest/+spec/basic-tempest-integration-for-ceilometer

This is kind of worse than tempest integration issues. As far as I can tell, ceilometer via devstack is basically non-functional. And sort of worse than non-functional, it's spewing errors, a lot.

This really ought to be a top ceilometer item to address; otherwise we should probably turn off ceilometer in devstack by default, because it's really not working at the moment.

-Sean

--
Sean Dague
http://dague.net
Re: [openstack-dev] [ceilometer] [qa] Ceilometer ERRORS in normal runs
On 10/22/2013 10:19 AM, Sean Dague wrote:

> On 10/21/2013 10:27 AM, Neal, Phil wrote:
>
>> Sean, we currently have a BP out there to investigate basic tempest integration and I think this might fall under the same umbrella. Unfortunately I've not been able to free up my development time for it, but I've assigned it out to someone who can take a look and report back.
>>
>> https://blueprints.launchpad.net/tempest/+spec/basic-tempest-integration-for-ceilometer
>
> This is kind of worse than tempest integration issues. As far as I can tell, ceilometer via devstack is basically non-functional. And sort of worse than non-functional, it's spewing errors, a lot.
>
> This really ought to be a top ceilometer item to address; otherwise we should probably turn off ceilometer in devstack by default, because it's really not working at the moment.
>
> -Sean

Here are the two errors showing up persistently that are not whitelisted. Such log errors are now being shown in the console log right after the tempest tests finish.

https://bugs.launchpad.net/ceilometer/+bug/1243251

2013-10-21 21:11:00.229 | 2013-10-21 21:05:20.046 5624 ERROR ceilometer.collector.dispatcher.database [-] Failed to record metering data: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30

https://bugs.launchpad.net/ceilometer/+bug/1243249

2013-10-21 20:22:27.600 | Log File: ceilometer-alarm-evaluator
2013-10-21 20:22:27.600 | 2013-10-21 20:14:33.038 22760 ERROR ceilometer.alarm.service [-] alarm evaluation cycle failed

See also https://bugs.launchpad.net/ceilometer/+bug/1237671

-David
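For anyone who wants to pick such lines apart programmatically, the sketch below parses the two formats quoted in this message. The field layout is inferred only from these sample lines (an optional console-log timestamp and " | " prefix, then timestamp, process id, level, logger name, a bracketed context, and the message); it is an assumption and may not hold for other services or log configurations.

```python
import re

# Layout assumed from the samples in this thread:
# "[<console ts> | ]<ts> <pid> <LEVEL> <logger> [<context>] <message>"
LINE = re.compile(
    r"^(?:\d{4}-\d{2}-\d{2} [\d:.]+ \| )?"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<logger>\S+) "
    r"\[(?P<context>[^\]]*)\] (?P<message>.*)$"
)


def parse(line):
    """Return the fields of a service log line, or None if it does not match."""
    match = LINE.match(line)
    return match.groupdict() if match else None


record = parse(
    "2013-10-21 21:11:00.229 | 2013-10-21 21:05:20.046 5624 ERROR "
    "ceilometer.collector.dispatcher.database [-] Failed to record metering "
    "data: QueuePool limit of size 5 overflow 10 reached, connection timed "
    "out, timeout 30"
)
print(record["level"], record["logger"])

# The "Log File:" banner is console output, not a service log record.
print(parse("2013-10-21 20:22:27.600 | Log File: ceilometer-alarm-evaluator"))
```

The first message also names its own cause: the database connection pool (size 5, overflow 10) was exhausted and a request waited 30 seconds for a connection before giving up.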
Re: [openstack-dev] [ceilometer] [qa] Ceilometer ERRORS in normal runs
Hi guys,

I can share my experience with devstack+ceilometer. There is certainly a problem with MongoDB, because Ceilometer requires a newer MongoDB than devstack provides. But I didn't experience problems with SQL.

And just a quick question about testing: are there any plans to test Ceilometer with different DB backends in devstack? Or do you suggest that it is not devstack's responsibility?

Thanks,
Nadya

On Tue, Oct 22, 2013 at 6:55 PM, David Kranz dkr...@redhat.com wrote:

> On 10/22/2013 10:19 AM, Sean Dague wrote:
>
>> On 10/21/2013 10:27 AM, Neal, Phil wrote:
>>
>>> Sean, we currently have a BP out there to investigate basic tempest integration and I think this might fall under the same umbrella. Unfortunately I've not been able to free up my development time for it, but I've assigned it out to someone who can take a look and report back.
>>>
>>> https://blueprints.launchpad.net/tempest/+spec/basic-tempest-integration-for-ceilometer
>>
>> This is kind of worse than tempest integration issues. As far as I can tell, ceilometer via devstack is basically non-functional. And sort of worse than non-functional, it's spewing errors, a lot.
>>
>> This really ought to be a top ceilometer item to address; otherwise we should probably turn off ceilometer in devstack by default, because it's really not working at the moment.
>>
>> -Sean
>
> Here are the two errors showing up persistently that are not whitelisted. Such log errors are now being shown in the console log right after the tempest tests finish.
>
> https://bugs.launchpad.net/ceilometer/+bug/1243251
>
> 2013-10-21 21:11:00.229 | 2013-10-21 21:05:20.046 5624 ERROR ceilometer.collector.dispatcher.database [-] Failed to record metering data: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30
>
> https://bugs.launchpad.net/ceilometer/+bug/1243249
>
> 2013-10-21 20:22:27.600 | Log File: ceilometer-alarm-evaluator
> 2013-10-21 20:22:27.600 | 2013-10-21 20:14:33.038 22760 ERROR ceilometer.alarm.service [-] alarm evaluation cycle failed
>
> See also https://bugs.launchpad.net/ceilometer/+bug/1237671
>
> -David
Re: [openstack-dev] [ceilometer] [qa] Ceilometer ERRORS in normal runs
Sean, we currently have a BP out there to investigate basic tempest integration and I think this might fall under the same umbrella. Unfortunately I've not been able to free up my development time for it, but I've assigned it out to someone who can take a look and report back.

https://blueprints.launchpad.net/tempest/+spec/basic-tempest-integration-for-ceilometer

- Phil

-Original Message-
From: Sean Dague [mailto:s...@dague.net]
Sent: Sunday, October 20, 2013 7:39 AM
To: OpenStack Development Mailing List
Subject: [openstack-dev] [ceilometer] [qa] Ceilometer ERRORS in normal runs

Dave Kranz has been building a system so that we can ensure that during a Tempest run services don't spew ERRORs in the logs. Eventually, we're going to gate on this, because there is nothing that Tempest does to the system that should cause any OpenStack service to ERROR or stack trace (Errors should actually be exceptional events indicating that something is wrong with the system, not regular events).

Ceilometer is currently one of the largest offenders in dumping ERRORs in the gate - http://logs.openstack.org/68/52768/1/check/check-tempest-devstack-vm-full/76f83a4/console.html#_2013-10-19_14_51_51_271 (that item isn't in our whitelist yet, so you'll see a lot of it at the end of every run) and http://logs.openstack.org/68/52768/1/check/check-tempest-devstack-vm-full/76f83a4/logs/screen-ceilometer-collector.txt.gz?level=TRACE for full details.

This seems like something is wrong in the integration, and it would be really helpful if we could get ceilometer eyes on this one to put ceilo into a non-erroring state.

-Sean

--
Sean Dague
http://dague.net
[openstack-dev] [ceilometer] [qa] Ceilometer ERRORS in normal runs
Dave Kranz has been building a system so that we can ensure that during a Tempest run services don't spew ERRORs in the logs. Eventually, we're going to gate on this, because there is nothing that Tempest does to the system that should cause any OpenStack service to ERROR or stack trace (Errors should actually be exceptional events indicating that something is wrong with the system, not regular events).

Ceilometer is currently one of the largest offenders in dumping ERRORs in the gate - http://logs.openstack.org/68/52768/1/check/check-tempest-devstack-vm-full/76f83a4/console.html#_2013-10-19_14_51_51_271 (that item isn't in our whitelist yet, so you'll see a lot of it at the end of every run) and http://logs.openstack.org/68/52768/1/check/check-tempest-devstack-vm-full/76f83a4/logs/screen-ceilometer-collector.txt.gz?level=TRACE for full details.

This seems like something is wrong in the integration, and it would be really helpful if we could get ceilometer eyes on this one to put ceilo into a non-erroring state.

-Sean

--
Sean Dague
http://dague.net
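The actual log checker is not shown anywhere in this thread, so the following is only a rough sketch of the idea described above: scan service log lines, ignore anything that matches a whitelist of known issues, and report the remaining ERROR lines so a gate job could fail on them. The function name, the whitelist pattern, and two of the three sample lines are invented for illustration.

```python
import re

# Hypothetical whitelist of known, already-tracked errors.
WHITELIST = [re.compile(r"example\.known_issue")]


def unexpected_errors(lines, whitelist=WHITELIST):
    """Return the ERROR-level lines that match no whitelist pattern."""
    hits = []
    for line in lines:
        if " ERROR " not in line:
            continue  # not logged at ERROR level
        if any(pattern.search(line) for pattern in whitelist):
            continue  # known issue, does not fail the run
        hits.append(line)
    return hits


sample = [
    # Quoted elsewhere in this thread:
    "2013-10-21 21:05:20.046 5624 ERROR "
    "ceilometer.collector.dispatcher.database [-] Failed to record metering "
    "data: QueuePool limit of size 5 overflow 10 reached, connection timed "
    "out, timeout 30",
    # Invented lines, for illustration only:
    "2013-10-21 21:05:21.000 1234 ERROR example.known_issue [-] tracked elsewhere",
    "2013-10-21 21:05:22.000 1234 INFO example.api [-] volume not found",
]

failures = unexpected_errors(sample)
for line in failures:
    print(line)
```

Only the ceilometer line is reported here. A gate job built on this idea would fail whenever the returned list is non-empty.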