Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command

2014-03-20 Thread Sampath Priyankara
Hi,

  Re-architecting the schema might fix most of the performance issues of
  resource_list.
  We also need to do some work to improve the performance of meter-list.
  Will Gordon's blueprint cover both aspects?
  https://blueprints.launchpad.net/ceilometer/+spec/big-data-sql
  
Sampath

-Original Message-
From: Neal, Phil [mailto:phil.n...@hp.com] 
Sent: Wednesday, March 19, 2014 12:17 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list
CLI command


 -Original Message-
 From: Tim Bell [mailto:tim.b...@cern.ch]
 Sent: Monday, March 17, 2014 2:04 PM
 To: OpenStack Development Mailing List (not for usage questions)
 Subject: Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer 
 resource_list CLI command
 
 
 At CERN, we've had similar issues when enabling telemetry. Our 
 resource-list times out after 10 minutes when the proxies for HA 
 assume there is no answer coming back. Keystone instances per cell 
 have helped the situation a little so we can collect the data but 
 there was a significant increase in load on the API endpoints.
 
 I feel that some reference for production scale validation would be 
 beneficial as part of TC approval to leave incubation in case there 
 are issues such as this to be addressed.
 
 Tim
 
  -Original Message-
  From: Jay Pipes [mailto:jaypi...@gmail.com]
  Sent: 17 March 2014 20:25
  To: openstack-dev@lists.openstack.org
  Subject: Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer
 resource_list CLI command
 
 ...
 
  Yep. At AT&T, we had to disable calls to GET /resources without any
  filters on it. The call would return hundreds of thousands of records,
  all being JSON-ified at the Ceilometer API endpoint, and the result
  would take minutes to return. There was no default limit on the query,
  which meant every single record in the database was returned, and on
  even a semi-busy system, that meant horrendous performance.
 
  Besides the problem that the SQLAlchemy driver doesn't yet support
  pagination [1], the main problem with the get_resources() call is that
  the underlying database schema for the Sample model is wacky, and forces
  the use of a dependent subquery in the WHERE clause [2] which completely
  kills performance of the query to get resources.
 
  [1] https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L436
  [2] https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L503
 
   The cli tests are supposed to be quick read-only sanity checks of 
   the cli functionality and really shouldn't ever be on the list of 
   slowest tests for a gate run.
 
  Oh, the test is readonly all-right. ;) It's just that it's reading 
  hundreds of
 thousands of records.
 
   I think there was possibly a performance regression recently in
   ceilometer because from what I can tell this test used to normally take
   ~60 sec. (which honestly is probably too slow for a cli test too) but
   it is currently much slower than that.
  
   From logstash it seems there are still some cases when the 
   resource list takes as long to execute as it used to, but the 
   majority of runs take a
 long time:
   http://goo.gl/smJPB9
  
   In the short term I've pushed out a patch that will remove this 
   test from gate
   runs: https://review.openstack.org/#/c/81036 But, I thought it 
   would be good to bring this up on the ML to try and figure out 
   what changed or why this is so slow.
 
  I agree with removing the test from the gate in the short term. 
  Medium to
 long term, the root causes of the problem (that GET
  /resources has no support for pagination on the query, there is no 
  default
 for limiting results based on a since timestamp, and that
  the underlying database schema is non-optimal) should be addressed.

Gordon has introduced a blueprint
https://blueprints.launchpad.net/ceilometer/+spec/big-data-sql with some
fixes for individual queries, but +1 to the point of looking at
re-architecting the schema as an approach to fixing performance. We've also
seen some gains here at HP using batch writes, but have temporarily tabled
that work in favor of getting a better-performing schema in place.
- Phil

 
  Best,
  -jay
 
 

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command

2014-03-18 Thread Neal, Phil

 -Original Message-
 From: Tim Bell [mailto:tim.b...@cern.ch]
 Sent: Monday, March 17, 2014 2:04 PM
 To: OpenStack Development Mailing List (not for usage questions)
 Subject: Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer
 resource_list CLI command
 
 
 At CERN, we've had similar issues when enabling telemetry. Our resource-list
 times out after 10 minutes when the proxies for HA assume there is no
 answer coming back. Keystone instances per cell have helped the situation a
 little so we can collect the data but there was a significant increase in 
 load on
 the API endpoints.
 
 I feel that some reference for production scale validation would be beneficial
 as part of TC approval to leave incubation in case there are issues such as 
 this
 to be addressed.
 
 Tim
 
  -Original Message-
  From: Jay Pipes [mailto:jaypi...@gmail.com]
  Sent: 17 March 2014 20:25
  To: openstack-dev@lists.openstack.org
  Subject: Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer
 resource_list CLI command
 
 ...
 
  Yep. At AT&T, we had to disable calls to GET /resources without any
  filters on it. The call would return hundreds of thousands of records,
  all being JSON-ified at the Ceilometer API endpoint, and the result
  would take minutes to return. There was no default limit on the query,
  which meant every single record in the database was returned, and on
  even a semi-busy system, that meant horrendous performance.
 
  Besides the problem that the SQLAlchemy driver doesn't yet support
  pagination [1], the main problem with the get_resources() call is that
  the underlying database schema for the Sample model is wacky, and forces
  the use of a dependent subquery in the WHERE clause [2] which completely
  kills performance of the query to get resources.
 
  [1] https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L436
  [2] https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L503
 
   The cli tests are supposed to be quick read-only sanity checks of the
   cli functionality and really shouldn't ever be on the list of slowest
   tests for a gate run.
 
  Oh, the test is readonly all-right. ;) It's just that it's reading hundreds 
  of
 thousands of records.
 
   I think there was possibly a performance regression recently in
   ceilometer because from what I can tell this test used to normally take
   ~60 sec. (which honestly is probably too slow for a cli test too) but
   it is currently much slower than that.
  
   From logstash it seems there are still some cases when the resource
   list takes as long to execute as it used to, but the majority of runs 
   take a
 long time:
   http://goo.gl/smJPB9
  
   In the short term I've pushed out a patch that will remove this test
   from gate
   runs: https://review.openstack.org/#/c/81036 But, I thought it would
   be good to bring this up on the ML to try and figure out what changed
   or why this is so slow.
 
  I agree with removing the test from the gate in the short term. Medium to
 long term, the root causes of the problem (that GET
  /resources has no support for pagination on the query, there is no default
 for limiting results based on a since timestamp, and that
  the underlying database schema is non-optimal) should be addressed.

Gordon has introduced a blueprint
https://blueprints.launchpad.net/ceilometer/+spec/big-data-sql with some
fixes for individual queries, but +1 to the point of looking at
re-architecting the schema as an approach to fixing performance. We've also
seen some gains here at HP using batch writes, but have temporarily tabled
that work in favor of getting a better-performing schema in place.
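
For what it's worth, a minimal sketch of the batch-write idea (illustration
only, assuming SQLAlchemy 1.4+; the Sample model, its columns, and the
writer class below are invented for the example, not our actual code):

    # Sketch only: buffer incoming samples and flush them in one bulk
    # INSERT instead of one INSERT per sample. Model and columns invented.
    from sqlalchemy import Column, DateTime, Float, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class Sample(Base):
        __tablename__ = 'sample'
        id = Column(String, primary_key=True)
        resource_id = Column(String)
        counter_volume = Column(Float)
        timestamp = Column(DateTime)

    class BatchedSampleWriter(object):
        """Collect sample dicts and write them in batches of batch_size."""

        def __init__(self, session_factory, batch_size=500):
            self._session_factory = session_factory
            self._batch_size = batch_size
            self._buffer = []

        def record(self, sample_dict):
            self._buffer.append(sample_dict)
            if len(self._buffer) >= self._batch_size:
                self.flush()

        def flush(self):
            if not self._buffer:
                return
            session = self._session_factory()
            # executemany-style insert: one round trip for the whole batch
            session.execute(Sample.__table__.insert(), self._buffer)
            session.commit()
            self._buffer = []

    engine = create_engine('sqlite://')
    Base.metadata.create_all(engine)
    writer = BatchedSampleWriter(sessionmaker(bind=engine), batch_size=2)
    writer.record({'id': 's1', 'resource_id': 'r1', 'counter_volume': 1.0})
    writer.record({'id': 's2', 'resource_id': 'r1', 'counter_volume': 2.0})  # triggers a flush
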
- Phil

 
  Best,
  -jay
 
 

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command

2014-03-17 Thread Gordon Chung
hi Matt,

 test_ceilometer_resource_list which just calls ceilometer resource_list
 from the CLI once is taking >=2 min to respond. For example:
 http://logs.openstack.org/68/80168/3/gate/gate-tempest-dsvm-postgres-full/07ab7f5/logs/tempest.txt.gz#_2014-03-17_17_08_25_003
 (where it takes > 3min)

thanks for bringing this up... we're tracking this here: 
https://bugs.launchpad.net/ceilometer/+bug/1264434

i've put a patch out that partially fixes the issue. from bad to 
average... but i guess i should make the fix a bit more aggressive to 
bring the performance in line with the 'seconds' expectation.

cheers,
gordon chung
openstack, ibm software standards

Matthew Treinish mtrein...@kortar.org wrote on 17/03/2014 02:55:40 PM:

 From: Matthew Treinish mtrein...@kortar.org
 To: openstack-dev@lists.openstack.org
 Date: 17/03/2014 02:57 PM
 Subject: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer 
 resource_list CLI command
 
 Hi everyone,
 
 So a little while ago we noticed that in all the gate runs one of the
 ceilometer cli tests is consistently in the list of slowest tests. (and
 often the slowest) This was a bit surprising given the nature of the cli
 tests we expect them to execute very quickly.
 
 test_ceilometer_resource_list which just calls ceilometer resource_list
 from the CLI once is taking >=2 min to respond. For example:
 http://logs.openstack.org/68/80168/3/gate/gate-tempest-dsvm-postgres-full/07ab7f5/logs/tempest.txt.gz#_2014-03-17_17_08_25_003
 (where it takes > 3min)
 
 The cli tests are supposed to be quick read-only sanity checks of the
 cli functionality and really shouldn't ever be on the list of slowest
 tests for a gate run. I think there was possibly a performance regression
 recently in ceilometer because from what I can tell this test used to
 normally take ~60 sec. (which honestly is probably too slow for a cli
 test too) but it is currently much slower than that.
 
 From logstash it seems there are still some cases when the resource list 
takes
 as long to execute as it used to, but the majority of runs take a long 
time:
 http://goo.gl/smJPB9
 
 In the short term I've pushed out a patch that will remove this test from
 gate runs: https://review.openstack.org/#/c/81036 But, I thought it would
 be good to bring this up on the ML to try and figure out what changed or
 why this is so slow.
 
 Thanks,
 
 -Matt Treinish
 
 
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command

2014-03-17 Thread Joe Gordon
On Mon, Mar 17, 2014 at 11:55 AM, Matthew Treinish mtrein...@kortar.org wrote:

 Hi everyone,

 So a little while ago we noticed that in all the gate runs one of the
 ceilometer
 cli tests is consistently in the list of slowest tests. (and often the
 slowest)
 This was a bit surprising given the nature of the cli tests we expect them
 to
 execute very quickly.

  test_ceilometer_resource_list which just calls ceilometer resource_list
  from the CLI once is taking >=2 min to respond. For example:

  http://logs.openstack.org/68/80168/3/gate/gate-tempest-dsvm-postgres-full/07ab7f5/logs/tempest.txt.gz#_2014-03-17_17_08_25_003
  (where it takes > 3min)

 The cli tests are supposed to be quick read-only sanity checks of the cli
 functionality and really shouldn't ever be on the list of slowest tests
 for a
  gate run. I think there was possibly a performance regression recently in
  ceilometer because from what I can tell this test used to normally take
  ~60 sec. (which honestly is probably too slow for a cli test too) but it
  is currently much slower than that.


Sounds like we should add another round of sanity checking to the CLI
tests: make sure all commands return within x seconds. As a first pass we
can say x=60 and then crank it down in the future.
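
A rough sketch of what that check could look like (just the idea, not the
actual tempest cli test harness; the 60-second limit is only the first-pass
value above):

    # Per-command timing sanity check; illustrative, not the real harness.
    import subprocess
    import time
    import unittest

    class CLITimingTest(unittest.TestCase):
        MAX_SECONDS = 60  # first pass; crank down later

        def _timed_cli(self, *args):
            start = time.time()
            output = subprocess.check_output(args)
            elapsed = time.time() - start
            self.assertLess(elapsed, self.MAX_SECONDS,
                            '%s took %.1fs (limit %ds)'
                            % (' '.join(args), elapsed, self.MAX_SECONDS))
            return output

        def test_ceilometer_resource_list(self):
            self._timed_cli('ceilometer', 'resource-list')

    if __name__ == '__main__':
        unittest.main()
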



 From logstash it seems there are still some cases when the resource list
 takes
 as long to execute as it used to, but the majority of runs take a long
 time:
 http://goo.gl/smJPB9

 In the short term I've pushed out a patch that will remove this test from
 gate
 runs: https://review.openstack.org/#/c/81036 But, I thought it would be
 good to
 bring this up on the ML to try and figure out what changed or why this is
 so
 slow.

 Thanks,

 -Matt Treinish


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command

2014-03-17 Thread Jay Pipes
On Mon, 2014-03-17 at 14:55 -0400, Matthew Treinish wrote:
 Hi everyone,
 
 So a little while ago we noticed that in all the gate runs one of the 
 ceilometer
 cli tests is consistently in the list of slowest tests. (and often the 
 slowest)
 This was a bit surprising given the nature of the cli tests we expect them to
 execute very quickly.
 
  test_ceilometer_resource_list which just calls ceilometer resource_list
  from the CLI once is taking >=2 min to respond. For example:
  http://logs.openstack.org/68/80168/3/gate/gate-tempest-dsvm-postgres-full/07ab7f5/logs/tempest.txt.gz#_2014-03-17_17_08_25_003
  (where it takes > 3min)

Yep. At AT&T, we had to disable calls to GET /resources without any
filters on it. The call would return hundreds of thousands of records,
all being JSON-ified at the Ceilometer API endpoint, and the result
would take minutes to return. There was no default limit on the query,
which meant every single record in the database was returned, and on
even a semi-busy system, that meant horrendous performance.

Besides the problem that the SQLAlchemy driver doesn't yet support
pagination [1], the main problem with the get_resources() call is that the
underlying database schema for the Sample model is wacky, and forces
the use of a dependent subquery in the WHERE clause [2] which completely
kills performance of the query to get resources.

[1]
https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L436
[2]
https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L503
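
To make the dependent-subquery point concrete, here is a generic sketch
(an invented 'sample' table, assuming SQLAlchemy 1.4+; this is not the
actual Ceilometer query) showing the correlated-subquery shape next to a
grouped-join rewrite that computes the aggregate once:

    # Illustrative only: invented 'sample' table, not Ceilometer's schema.
    from sqlalchemy import Column, DateTime, MetaData, String, Table, func, select

    metadata = MetaData()
    sample = Table('sample', metadata,
                   Column('resource_id', String),
                   Column('timestamp', DateTime))
    s = sample.alias('s')
    s2 = sample.alias('s2')

    # Dependent (correlated) subquery in the WHERE clause: the inner SELECT
    # is logically re-evaluated for every outer row, which scales terribly.
    correlated = select(s.c.resource_id).where(
        s.c.timestamp == select(func.max(s2.c.timestamp))
                         .where(s2.c.resource_id == s.c.resource_id)
                         .scalar_subquery())

    # Same answer via one grouped subquery joined back to the table: the
    # per-resource aggregate is computed once instead of once per row.
    latest = (select(s2.c.resource_id,
                     func.max(s2.c.timestamp).label('max_ts'))
              .group_by(s2.c.resource_id)
              .subquery())
    joined = select(s.c.resource_id).select_from(
        s.join(latest, (s.c.resource_id == latest.c.resource_id) &
                       (s.c.timestamp == latest.c.max_ts)))

    print(correlated)  # shows the correlated-subquery SQL
    print(joined)      # shows the grouped-join SQL
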

 The cli tests are supposed to be quick read-only sanity checks of the cli
 functionality and really shouldn't ever be on the list of slowest tests for a
 gate run.

Oh, the test is readonly all-right. ;) It's just that it's reading
hundreds of thousands of records.

  I think there was possibly a performance regression recently in
  ceilometer because from what I can tell this test used to normally take
  ~60 sec. (which honestly is probably too slow for a cli test too) but it
  is currently much slower than that.
 
 From logstash it seems there are still some cases when the resource list takes
 as long to execute as it used to, but the majority of runs take a long time:
 http://goo.gl/smJPB9
 
 In the short term I've pushed out a patch that will remove this test from gate
 runs: https://review.openstack.org/#/c/81036 But, I thought it would be good 
 to
 bring this up on the ML to try and figure out what changed or why this is so
 slow.

I agree with removing the test from the gate in the short term. Medium
to long term, the root causes of the problem (that GET /resources has no
support for pagination on the query, there is no default for limiting
results based on a since timestamp, and that the underlying database
schema is non-optimal) should be addressed.
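
As a rough sketch of the first two fixes (a default result limit, a default
since cut-off, plus simple keyset pagination), written against an
illustrative Resource model and parameter names rather than the real
Ceilometer storage driver:

    # Sketch only: illustrative model, not the actual Ceilometer driver
    # (assumes SQLAlchemy 1.4+). A default limit and a default "since"
    # window stop an unfiltered GET /resources from pulling every record.
    import datetime
    from sqlalchemy import Column, DateTime, String
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class Resource(Base):
        __tablename__ = 'resource'
        id = Column(String, primary_key=True)
        last_sample_timestamp = Column(DateTime)

    DEFAULT_LIMIT = 100
    DEFAULT_WINDOW = datetime.timedelta(days=1)

    def get_resources(session, limit=None, since=None, marker=None):
        if since is None:
            since = datetime.datetime.utcnow() - DEFAULT_WINDOW
        query = (session.query(Resource)
                 .filter(Resource.last_sample_timestamp >= since)
                 .order_by(Resource.id))
        if marker is not None:  # keyset pagination: id of the last row seen
            query = query.filter(Resource.id > marker)
        return query.limit(limit or DEFAULT_LIMIT).all()
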

Best,
-jay


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command

2014-03-17 Thread Joe Gordon
On Mon, Mar 17, 2014 at 12:25 PM, Sean Dague s...@dague.net wrote:

 On 03/17/2014 03:22 PM, Joe Gordon wrote:
 
 
 
   On Mon, Mar 17, 2014 at 11:55 AM, Matthew Treinish mtrein...@kortar.org wrote:
 
  Hi everyone,
 
  So a little while ago we noticed that in all the gate runs one of
  the ceilometer
  cli tests is consistently in the list of slowest tests. (and often
  the slowest)
  This was a bit surprising given the nature of the cli tests we
  expect them to
  execute very quickly.
 
   test_ceilometer_resource_list which just calls ceilometer resource_list
   from the CLI once is taking >=2 min to respond. For example:
   http://logs.openstack.org/68/80168/3/gate/gate-tempest-dsvm-postgres-full/07ab7f5/logs/tempest.txt.gz#_2014-03-17_17_08_25_003
   (where it takes > 3min)
 
  The cli tests are supposed to be quick read-only sanity checks of
  the cli
  functionality and really shouldn't ever be on the list of slowest
  tests for a
   gate run. I think there was possibly a performance regression recently
   in ceilometer because from what I can tell this test used to normally
   take ~60 sec. (which honestly is probably too slow for a cli test too)
   but it is currently much slower than that.
 
 
   Sounds like we should add another round of sanity checking to the CLI
   tests: make sure all commands return within x seconds. As a first pass
   we can say x=60 and then crank it down in the future.

  So, the last thing I want to do is trigger a race here by us
  artificially timing out on tests. However I do think cli tests should be
  returning in < 2s, otherwise they are not simple read-only tests.


Agreed, I said 60 just as a starting point.



 -Sean

 --
 Sean Dague
 Samsung Research America
 s...@dague.net / sean.da...@samsung.com
 http://dague.net




___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev