Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command
Hi,

Re-architecting the schema might fix most of the performance issues with resource-list. We also need to do some work to improve the performance of meter-list. Will Gordon's blueprint cover both aspects?

https://blueprints.launchpad.net/ceilometer/+spec/big-data-sql

Sampath

-----Original Message-----
From: Neal, Phil [mailto:phil.n...@hp.com]
Sent: Wednesday, March 19, 2014 12:17 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command

[...]

Gordon has introduced a blueprint https://blueprints.launchpad.net/ceilometer/+spec/big-data-sql with some fixes for individual queries, but +1 to the point of looking at re-architecting the schema as an approach to fixing performance.

We've also seen some gains here at HP using batch writes, but we have temporarily tabled that work in favor of getting a better-performing schema in place.

- Phil
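To make the "re-architect the schema" idea concrete, here is a minimal sketch of what a first-class resource table could look like, so that resource-list becomes a plain indexed SELECT instead of an aggregation over the huge sample table. This is purely hypothetical -- the table and column names are invented, and it is not the design from the big-data-sql blueprint:

    # Hypothetical sketch -- not the actual big-data-sql blueprint design.
    # Maintain one Resource row per resource, updated on sample ingest, so
    # "resource-list" reads a small indexed table instead of scanning samples.
    from sqlalchemy import Column, DateTime, String, Text
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class Resource(Base):
        __tablename__ = 'resource'
        resource_id = Column(String(255), primary_key=True)
        project_id = Column(String(255), index=True)
        user_id = Column(String(255), index=True)
        source_id = Column(String(255))
        last_sample_at = Column(DateTime, index=True)  # enables cheap "since" filters
        metadata_json = Column(Text)

    def record_sample(session, sample):
        """Upsert the resource row as a side effect of storing a sample."""
        res = session.get(Resource, sample['resource_id'])
        if res is None:
            res = Resource(resource_id=sample['resource_id'])
            session.add(res)
        res.project_id = sample['project_id']
        res.user_id = sample['user_id']
        res.last_sample_at = sample['timestamp']

The trade-off is an extra upsert per sample on the ingest path, in exchange for resource queries that never touch the sample table at all.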
Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command
-----Original Message-----
From: Tim Bell [mailto:tim.b...@cern.ch]
Sent: Monday, March 17, 2014 2:04 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command

At CERN, we've had similar issues when enabling telemetry. Our resource-list times out after 10 minutes, when the HA proxies assume there is no answer coming back. Keystone instances per cell have helped the situation a little, so we can collect the data, but there was a significant increase in load on the API endpoints.

I feel that some reference production-scale validation would be beneficial as part of TC approval to leave incubation, in case there are issues such as this to be addressed.

Tim

-----Original Message-----
From: Jay Pipes [mailto:jaypi...@gmail.com]
Sent: 17 March 2014 20:25
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command

[...]

I agree with removing the test from the gate in the short term. Medium to long term, the root causes of the problem (that GET /resources has no support for pagination on the query, there is no default for limiting results based on a since timestamp, and that the underlying database schema is non-optimal) should be addressed.

Best,
-jay

Gordon has introduced a blueprint https://blueprints.launchpad.net/ceilometer/+spec/big-data-sql with some fixes for individual queries, but +1 to the point of looking at re-architecting the schema as an approach to fixing performance.

We've also seen some gains here at HP using batch writes, but we have temporarily tabled that work in favor of getting a better-performing schema in place.

- Phil
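For reference, a minimal sketch of the batch-write idea (hypothetical code, not HP's actual patch): buffer incoming samples and flush them with one multi-row INSERT instead of issuing one INSERT per sample.

    # Hypothetical sketch of batched sample writes, assuming an existing
    # "sample" table; SQLAlchemy turns a list of parameter dicts into an
    # executemany-style multi-row INSERT.
    from sqlalchemy import MetaData, Table

    class BatchedSampleWriter:
        def __init__(self, engine, batch_size=500):
            self.engine = engine
            self.batch_size = batch_size
            self.buffer = []
            self.sample = Table('sample', MetaData(), autoload_with=engine)

        def add(self, sample_dict):
            self.buffer.append(sample_dict)
            if len(self.buffer) >= self.batch_size:
                self.flush()

        def flush(self):
            if not self.buffer:
                return
            with self.engine.begin() as conn:  # one transaction per batch
                conn.execute(self.sample.insert(), self.buffer)
            del self.buffer[:]

The usual caveat applies: samples sitting in the buffer are lost if the collector dies before a flush, so a periodic timer flush (and a modest batch size) keeps that window small.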
Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command
hi Matt,

> test_ceilometer_resource_list, which just calls ceilometer resource-list
> from the CLI once, is taking >= 2 min to respond. For example:
> http://logs.openstack.org/68/80168/3/gate/gate-tempest-dsvm-postgres-full/07ab7f5/logs/tempest.txt.gz#_2014-03-17_17_08_25_003
> (where it takes 3 min)

thanks for bringing this up... we're tracking this here: https://bugs.launchpad.net/ceilometer/+bug/1264434

i've put a patch out that partially fixes the issue -- from bad to average... but i guess i should make the fix a bit more aggressive to bring the performance in line with the 'seconds' expectation.

cheers,
gordon chung
openstack, ibm software standards

Matthew Treinish mtrein...@kortar.org wrote on 17/03/2014 02:55:40 PM:

> From: Matthew Treinish mtrein...@kortar.org
> To: openstack-dev@lists.openstack.org
> Date: 17/03/2014 02:57 PM
> Subject: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command
>
> Hi everyone,
>
> So a little while ago we noticed that in all the gate runs one of the
> ceilometer cli tests is consistently in the list of slowest tests (and
> often the slowest). This was a bit surprising given the nature of the cli
> tests; we expect them to execute very quickly.
>
> test_ceilometer_resource_list, which just calls ceilometer resource-list
> from the CLI once, is taking >= 2 min to respond. For example:
> http://logs.openstack.org/68/80168/3/gate/gate-tempest-dsvm-postgres-full/07ab7f5/logs/tempest.txt.gz#_2014-03-17_17_08_25_003
> (where it takes 3 min)
>
> The cli tests are supposed to be quick read-only sanity checks of the cli
> functionality and really shouldn't ever be on the list of slowest tests
> for a gate run.
>
> I think there was possibly a performance regression recently in
> ceilometer, because from what I can tell this test used to normally take
> ~60 sec. (which honestly is probably too slow for a cli test too) but it
> is currently much slower than that. From logstash it seems there are
> still some cases when the resource list takes as long to execute as it
> used to, but the majority of runs take a long time: http://goo.gl/smJPB9
>
> In the short term I've pushed out a patch that will remove this test from
> gate runs: https://review.openstack.org/#/c/81036
>
> But, I thought it would be good to bring this up on the ML to try and
> figure out what changed or why this is so slow.
>
> Thanks,
>
> -Matt Treinish
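For illustration, here is a rough sketch of what a default limit plus marker-based pagination for get_resources() might look like. This is hypothetical code, not gordon's actual patch for bug 1264434, and it assumes a Resource ORM model like the one sketched earlier in the thread:

    # Hypothetical guard: never let GET /resources return an unbounded
    # result set; clients page through with a marker instead.
    DEFAULT_LIMIT = 100
    MAX_LIMIT = 1000

    def get_resources(session, marker=None, limit=None):
        limit = min(limit or DEFAULT_LIMIT, MAX_LIMIT)
        q = session.query(Resource).order_by(Resource.resource_id)
        if marker is not None:
            # marker = resource_id of the last row the client already has
            q = q.filter(Resource.resource_id > marker)
        return q.limit(limit).all()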
Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command
On Mon, Mar 17, 2014 at 11:55 AM, Matthew Treinish mtrein...@kortar.org wrote:

> Hi everyone,
>
> [...]
>
> I think there was possibly a performance regression recently in
> ceilometer, because from what I can tell this test used to normally take
> ~60 sec. (which honestly is probably too slow for a cli test too) but it
> is currently much slower than that.

Sounds like we should add another round of sanity checking to the CLI tests: make sure all commands return within x seconds. As a first pass we can say x=60 and then crank it down in the future.

> [...]
>
> In the short term I've pushed out a patch that will remove this test from
> gate runs: https://review.openstack.org/#/c/81036
>
> But, I thought it would be good to bring this up on the ML to try and
> figure out what changed or why this is so slow.
>
> Thanks,
>
> -Matt Treinish
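A rough sketch of what this timing guard could look like in a CLI sanity test (hypothetical helper; tempest's real CLI test base class is structured differently):

    # Hypothetical timing guard for read-only CLI sanity checks.
    import subprocess
    import time
    import unittest

    CLI_DEADLINE = 60  # seconds; start generous (x=60), crank down later

    class CliTimingTest(unittest.TestCase):
        def _timed_cli(self, *cmd):
            start = time.monotonic()
            # timeout= is only a hard safety net; the policy lives in the
            # assertion below, so a slow run fails instead of hanging
            proc = subprocess.run(cmd, capture_output=True, timeout=600)
            elapsed = time.monotonic() - start
            self.assertEqual(proc.returncode, 0)
            self.assertLess(elapsed, CLI_DEADLINE,
                            '%s took %.1fs (limit %ds)'
                            % (' '.join(cmd), elapsed, CLI_DEADLINE))
            return proc.stdout

        def test_resource_list_returns_quickly(self):
            self._timed_cli('ceilometer', 'resource-list')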
Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command
On Mon, 2014-03-17 at 14:55 -0400, Matthew Treinish wrote:

> Hi everyone,
>
> So a little while ago we noticed that in all the gate runs one of the
> ceilometer cli tests is consistently in the list of slowest tests (and
> often the slowest). This was a bit surprising given the nature of the cli
> tests; we expect them to execute very quickly.
>
> test_ceilometer_resource_list, which just calls ceilometer resource-list
> from the CLI once, is taking >= 2 min to respond. For example:
> http://logs.openstack.org/68/80168/3/gate/gate-tempest-dsvm-postgres-full/07ab7f5/logs/tempest.txt.gz#_2014-03-17_17_08_25_003
> (where it takes 3 min)

Yep. At AT&T, we had to disable calls to GET /resources without any filters on it. The call would return hundreds of thousands of records, all being JSON-ified at the Ceilometer API endpoint, and the result would take minutes to return. There was no default limit on the query, which meant every single record in the database was returned, and on even a semi-busy system that meant horrendous performance.

Besides the problem that the SQLAlchemy driver doesn't yet support pagination [1], the main problem with the get_resources() call is that the underlying database schema for the Sample model is wacky, and forces the use of a dependent subquery in the WHERE clause [2], which completely kills the performance of the query to get resources.

[1] https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L436
[2] https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L503

> The cli tests are supposed to be quick read-only sanity checks of the cli
> functionality and really shouldn't ever be on the list of slowest tests
> for a gate run.

Oh, the test is read-only all right. ;) It's just that it's reading hundreds of thousands of records.

> I think there was possibly a performance regression recently in
> ceilometer, because from what I can tell this test used to normally take
> ~60 sec. (which honestly is probably too slow for a cli test too) but it
> is currently much slower than that. From logstash it seems there are
> still some cases when the resource list takes as long to execute as it
> used to, but the majority of runs take a long time: http://goo.gl/smJPB9
>
> In the short term I've pushed out a patch that will remove this test from
> gate runs: https://review.openstack.org/#/c/81036
>
> But, I thought it would be good to bring this up on the ML to try and
> figure out what changed or why this is so slow.

I agree with removing the test from the gate in the short term. Medium to long term, the root causes of the problem (that GET /resources has no support for pagination on the query, there is no default for limiting results based on a since timestamp, and that the underlying database schema is non-optimal) should be addressed.

Best,
-jay
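To make the "dependent subquery" point concrete: the problematic shape is roughly a correlated subquery in the WHERE clause, which the database may re-evaluate once per row. A derived table with GROUP BY computes the per-resource maximum once and joins against it. Illustrative SQLAlchemy sketch only -- not Ceilometer's exact query:

    # Slow shape (correlated subquery, re-evaluated per row):
    #   SELECT * FROM sample s
    #    WHERE s.id = (SELECT MAX(id) FROM sample s2
    #                   WHERE s2.resource_id = s.resource_id)
    from sqlalchemy import func, select

    def latest_sample_per_resource(sample):  # `sample` is a Table object
        # Faster shape: aggregate once, then join against the derived table.
        latest = (
            select(sample.c.resource_id,
                   func.max(sample.c.id).label('max_id'))
            .group_by(sample.c.resource_id)
            .subquery()
        )
        return select(sample).join(latest, sample.c.id == latest.c.max_id)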
Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command
On Mon, Mar 17, 2014 at 12:25 PM, Sean Dague s...@dague.net wrote:

> On 03/17/2014 03:22 PM, Joe Gordon wrote:
>> [...]
>>
>> Sounds like we should add another round of sanity checking to the CLI
>> tests: make sure all commands return within x seconds. As a first pass
>> we can say x=60 and then crank it down in the future.
>
> So, the last thing I want to do is trigger a race here by us artificially
> timing out on tests. However, I do think cli tests should be returning in
> < 2s, otherwise they are not simple read-only tests.

Agreed, I said 60 just as a starting point.

> -Sean
>
> --
> Sean Dague
> Samsung Research America
> s...@dague.net / sean.da...@samsung.com
> http://dague.net