Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command
Hi,

Re-architecting the schema might fix most of the performance issues with resource-list. We also need to do some work to improve the performance of meter-list. Will Gordon's blueprint cover both aspects?

https://blueprints.launchpad.net/ceilometer/+spec/big-data-sql

Sampath

-----Original Message-----
From: Neal, Phil [mailto:phil.n...@hp.com]
Sent: Wednesday, March 19, 2014 12:17 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command

[...]

Gordon has introduced a blueprint https://blueprints.launchpad.net/ceilometer/+spec/big-data-sql with some fixes for individual queries, but +1 to the point of looking at re-architecting the schema as an approach to fixing performance.

We've also seen some gains here at HP using batch writes, but we have temporarily tabled that work in favor of getting a better-performing schema in place.

- Phil
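To make the "re-architect the schema" idea concrete, here is a minimal sketch of what a first-class resource table could look like, so that resource-list becomes a plain indexed SELECT instead of an aggregation over the huge sample table. This is purely hypothetical -- the table and column names are invented, and it is not the design from the big-data-sql blueprint:

    # Hypothetical sketch -- not the actual big-data-sql blueprint design.
    # Maintain one Resource row per resource, updated on sample ingest, so
    # "resource-list" reads a small indexed table instead of scanning samples.
    from sqlalchemy import Column, DateTime, String, Text
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class Resource(Base):
        __tablename__ = 'resource'
        resource_id = Column(String(255), primary_key=True)
        project_id = Column(String(255), index=True)
        user_id = Column(String(255), index=True)
        source_id = Column(String(255))
        last_sample_at = Column(DateTime, index=True)  # enables cheap "since" filters
        metadata_json = Column(Text)

    def record_sample(session, sample):
        """Upsert the resource row as a side effect of storing a sample."""
        res = session.get(Resource, sample['resource_id'])
        if res is None:
            res = Resource(resource_id=sample['resource_id'])
            session.add(res)
        res.project_id = sample['project_id']
        res.user_id = sample['user_id']
        res.last_sample_at = sample['timestamp']

The trade-off is an extra upsert per sample on the ingest path, in exchange for resource queries that never touch the sample table at all.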
Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command
-----Original Message-----
From: Tim Bell [mailto:tim.b...@cern.ch]
Sent: Monday, March 17, 2014 2:04 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command

At CERN, we've had similar issues when enabling telemetry. Our resource-list times out after 10 minutes, when the HA proxies assume there is no answer coming back. Keystone instances per cell have helped the situation a little, so we can collect the data, but there was a significant increase in load on the API endpoints.

I feel that some reference production-scale validation would be beneficial as part of TC approval to leave incubation, in case there are issues such as this to be addressed.

Tim

-----Original Message-----
From: Jay Pipes [mailto:jaypi...@gmail.com]
Sent: 17 March 2014 20:25
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command

[...]

I agree with removing the test from the gate in the short term. Medium to long term, the root causes of the problem (that GET /resources has no support for pagination on the query, there is no default for limiting results based on a since timestamp, and that the underlying database schema is non-optimal) should be addressed.

Best,
-jay

Gordon has introduced a blueprint https://blueprints.launchpad.net/ceilometer/+spec/big-data-sql with some fixes for individual queries, but +1 to the point of looking at re-architecting the schema as an approach to fixing performance.

We've also seen some gains here at HP using batch writes, but we have temporarily tabled that work in favor of getting a better-performing schema in place.

- Phil
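For reference, a minimal sketch of the batch-write idea (hypothetical code, not HP's actual patch): buffer incoming samples and flush them with one multi-row INSERT instead of issuing one INSERT per sample.

    # Hypothetical sketch of batched sample writes, assuming an existing
    # "sample" table; SQLAlchemy turns a list of parameter dicts into an
    # executemany-style multi-row INSERT.
    from sqlalchemy import MetaData, Table

    class BatchedSampleWriter:
        def __init__(self, engine, batch_size=500):
            self.engine = engine
            self.batch_size = batch_size
            self.buffer = []
            self.sample = Table('sample', MetaData(), autoload_with=engine)

        def add(self, sample_dict):
            self.buffer.append(sample_dict)
            if len(self.buffer) >= self.batch_size:
                self.flush()

        def flush(self):
            if not self.buffer:
                return
            with self.engine.begin() as conn:  # one transaction per batch
                conn.execute(self.sample.insert(), self.buffer)
            del self.buffer[:]

The usual caveat applies: samples sitting in the buffer are lost if the collector dies before a flush, so a periodic timer flush (and a modest batch size) keeps that window small.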
Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command
hi Matt,

> test_ceilometer_resource_list, which just calls ceilometer resource-list
> from the CLI once, is taking >= 2 min to respond. For example:
> http://logs.openstack.org/68/80168/3/gate/gate-tempest-dsvm-postgres-full/07ab7f5/logs/tempest.txt.gz#_2014-03-17_17_08_25_003
> (where it takes 3 min)

thanks for bringing this up... we're tracking this here: https://bugs.launchpad.net/ceilometer/+bug/1264434

i've put a patch out that partially fixes the issue -- from bad to average... but i guess i should make the fix a bit more aggressive to bring the performance in line with the 'seconds' expectation.

cheers,
gordon chung
openstack, ibm software standards

Matthew Treinish mtrein...@kortar.org wrote on 17/03/2014 02:55:40 PM:

> From: Matthew Treinish mtrein...@kortar.org
> To: openstack-dev@lists.openstack.org
> Date: 17/03/2014 02:57 PM
> Subject: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command
>
> Hi everyone,
>
> So a little while ago we noticed that in all the gate runs one of the
> ceilometer cli tests is consistently in the list of slowest tests (and
> often the slowest). This was a bit surprising given the nature of the cli
> tests; we expect them to execute very quickly.
>
> test_ceilometer_resource_list, which just calls ceilometer resource-list
> from the CLI once, is taking >= 2 min to respond. For example:
> http://logs.openstack.org/68/80168/3/gate/gate-tempest-dsvm-postgres-full/07ab7f5/logs/tempest.txt.gz#_2014-03-17_17_08_25_003
> (where it takes 3 min)
>
> The cli tests are supposed to be quick read-only sanity checks of the cli
> functionality and really shouldn't ever be on the list of slowest tests
> for a gate run.
>
> I think there was possibly a performance regression recently in
> ceilometer, because from what I can tell this test used to normally take
> ~60 sec. (which honestly is probably too slow for a cli test too) but it
> is currently much slower than that. From logstash it seems there are
> still some cases when the resource list takes as long to execute as it
> used to, but the majority of runs take a long time: http://goo.gl/smJPB9
>
> In the short term I've pushed out a patch that will remove this test from
> gate runs: https://review.openstack.org/#/c/81036
>
> But, I thought it would be good to bring this up on the ML to try and
> figure out what changed or why this is so slow.
>
> Thanks,
>
> -Matt Treinish
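For illustration, here is a rough sketch of what a default limit plus marker-based pagination for get_resources() might look like. This is hypothetical code, not gordon's actual patch for bug 1264434, and it assumes a Resource ORM model like the one sketched earlier in the thread:

    # Hypothetical guard: never let GET /resources return an unbounded
    # result set; clients page through with a marker instead.
    DEFAULT_LIMIT = 100
    MAX_LIMIT = 1000

    def get_resources(session, marker=None, limit=None):
        limit = min(limit or DEFAULT_LIMIT, MAX_LIMIT)
        q = session.query(Resource).order_by(Resource.resource_id)
        if marker is not None:
            # marker = resource_id of the last row the client already has
            q = q.filter(Resource.resource_id > marker)
        return q.limit(limit).all()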
Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command
On Mon, Mar 17, 2014 at 11:55 AM, Matthew Treinish mtrein...@kortar.org wrote:

> Hi everyone,
>
> [...]
>
> I think there was possibly a performance regression recently in
> ceilometer, because from what I can tell this test used to normally take
> ~60 sec. (which honestly is probably too slow for a cli test too) but it
> is currently much slower than that.

Sounds like we should add another round of sanity checking to the CLI tests: make sure all commands return within x seconds. As a first pass we can say x=60 and then crank it down in the future.

> [...]
>
> In the short term I've pushed out a patch that will remove this test from
> gate runs: https://review.openstack.org/#/c/81036
>
> But, I thought it would be good to bring this up on the ML to try and
> figure out what changed or why this is so slow.
>
> Thanks,
>
> -Matt Treinish
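A rough sketch of what this timing guard could look like in a CLI sanity test (hypothetical helper; tempest's real CLI test base class is structured differently):

    # Hypothetical timing guard for read-only CLI sanity checks.
    import subprocess
    import time
    import unittest

    CLI_DEADLINE = 60  # seconds; start generous (x=60), crank down later

    class CliTimingTest(unittest.TestCase):
        def _timed_cli(self, *cmd):
            start = time.monotonic()
            # timeout= is only a hard safety net; the policy lives in the
            # assertion below, so a slow run fails instead of hanging
            proc = subprocess.run(cmd, capture_output=True, timeout=600)
            elapsed = time.monotonic() - start
            self.assertEqual(proc.returncode, 0)
            self.assertLess(elapsed, CLI_DEADLINE,
                            '%s took %.1fs (limit %ds)'
                            % (' '.join(cmd), elapsed, CLI_DEADLINE))
            return proc.stdout

        def test_resource_list_returns_quickly(self):
            self._timed_cli('ceilometer', 'resource-list')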
Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command
On Mon, 2014-03-17 at 14:55 -0400, Matthew Treinish wrote:

> Hi everyone,
>
> So a little while ago we noticed that in all the gate runs one of the
> ceilometer cli tests is consistently in the list of slowest tests (and
> often the slowest). This was a bit surprising given the nature of the cli
> tests; we expect them to execute very quickly.
>
> test_ceilometer_resource_list, which just calls ceilometer resource-list
> from the CLI once, is taking >= 2 min to respond. For example:
> http://logs.openstack.org/68/80168/3/gate/gate-tempest-dsvm-postgres-full/07ab7f5/logs/tempest.txt.gz#_2014-03-17_17_08_25_003
> (where it takes 3 min)

Yep. At AT&T, we had to disable calls to GET /resources without any filters on it. The call would return hundreds of thousands of records, all being JSON-ified at the Ceilometer API endpoint, and the result would take minutes to return. There was no default limit on the query, which meant every single record in the database was returned, and on even a semi-busy system that meant horrendous performance.

Besides the problem that the SQLAlchemy driver doesn't yet support pagination [1], the main problem with the get_resources() call is that the underlying database schema for the Sample model is wacky, and forces the use of a dependent subquery in the WHERE clause [2], which completely kills the performance of the query to get resources.

[1] https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L436
[2] https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L503

> The cli tests are supposed to be quick read-only sanity checks of the cli
> functionality and really shouldn't ever be on the list of slowest tests
> for a gate run.

Oh, the test is read-only all right. ;) It's just that it's reading hundreds of thousands of records.

> I think there was possibly a performance regression recently in
> ceilometer, because from what I can tell this test used to normally take
> ~60 sec. (which honestly is probably too slow for a cli test too) but it
> is currently much slower than that. From logstash it seems there are
> still some cases when the resource list takes as long to execute as it
> used to, but the majority of runs take a long time: http://goo.gl/smJPB9
>
> In the short term I've pushed out a patch that will remove this test from
> gate runs: https://review.openstack.org/#/c/81036
>
> But, I thought it would be good to bring this up on the ML to try and
> figure out what changed or why this is so slow.

I agree with removing the test from the gate in the short term. Medium to long term, the root causes of the problem (that GET /resources has no support for pagination on the query, there is no default for limiting results based on a since timestamp, and that the underlying database schema is non-optimal) should be addressed.

Best,
-jay
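To make the "dependent subquery" point concrete: the problematic shape is roughly a correlated subquery in the WHERE clause, which the database may re-evaluate once per row. A derived table with GROUP BY computes the per-resource maximum once and joins against it. Illustrative SQLAlchemy sketch only -- not Ceilometer's exact query:

    # Slow shape (correlated subquery, re-evaluated per row):
    #   SELECT * FROM sample s
    #    WHERE s.id = (SELECT MAX(id) FROM sample s2
    #                   WHERE s2.resource_id = s.resource_id)
    from sqlalchemy import func, select

    def latest_sample_per_resource(sample):  # `sample` is a Table object
        # Faster shape: aggregate once, then join against the derived table.
        latest = (
            select(sample.c.resource_id,
                   func.max(sample.c.id).label('max_id'))
            .group_by(sample.c.resource_id)
            .subquery()
        )
        return select(sample).join(latest, sample.c.id == latest.c.max_id)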
Re: [openstack-dev] [Ceilometer] [QA] Slow Ceilometer resource_list CLI command
On Mon, Mar 17, 2014 at 12:25 PM, Sean Dague s...@dague.net wrote:

> On 03/17/2014 03:22 PM, Joe Gordon wrote:
>> [...]
>>
>> Sounds like we should add another round of sanity checking to the CLI
>> tests: make sure all commands return within x seconds. As a first pass
>> we can say x=60 and then crank it down in the future.
>
> So, the last thing I want to do is trigger a race here by us artificially
> timing out on tests. However, I do think cli tests should be returning in
> < 2s, otherwise they are not simple read-only tests.

Agreed, I said 60 just as a starting point.

> -Sean
>
> --
> Sean Dague
> Samsung Research America
> s...@dague.net / sean.da...@samsung.com
> http://dague.net