Re: [openstack-dev] [Nova] A multi-cell instance-list performance test

2018-08-19 Thread Alex Xu
2018-08-17 2:44 GMT+08:00 Dan Smith :

> >  yes, the DB queries were done serially; after some investigation, it seems
> >  that we are unable to perform eventlet.monkey_patch in uWSGI mode, so
> >  Yikun made this fix:
> >
> >  https://review.openstack.org/#/c/592285/
>
> Cool, good catch :)
>
> >
> >  After making this change, we tested again, and we got the following data:
> >
> >                        total     collect   sort     view
> >  before monkey_patch   13.5745   11.7012   1.1511   0.5966
> >  after monkey_patch    12.8367   10.5471   1.5642   0.6041
> >
> >  The performance improved a little, and from the log we can see:
>
> Since these all took ~1s when done in series, but now take ~10s in
> parallel, I think you must be hitting some performance bottleneck in
> either case, which is why the overall time barely changes. Some ideas:
>
> 1. In the real world, I think you really need to have 10x database
>servers or at least a DB server with plenty of cores loading from a
>very fast (or separate) disk in order to really ensure you're getting
>full parallelism of the DB work. However, because these queries all
>took ~1s in your serialized case, I expect this is not your problem.
>
> 2. What does the network look like between the api machine and the DB?
>
> 3. What do the memory and CPU usage of the api process look like while
>this is happening?
>
> Related to #3, even though we issue the requests to the DB in parallel,
> we still process the result of those calls in series in a single python
> thread on the API. That means all the work of reading the data from the
> socket, constructing the SQLA objects, turning those into nova objects,
> etc, all happens serially. It could be that the DB query is really a
> small part of the overall time and our serialized python handling of the
> result is the slow part. If you see the api process pegging a single
> core at 100% for ten seconds, I think that's likely what is happening.
>

I remember I did a test on SQLAlchemy once: the SQLAlchemy object construction
was much slower than fetching the data from the remote server.
Maybe you can try profiling it, to figure out how much time is spent on the
wire and how much is spent constructing the objects:
http://docs.sqlalchemy.org/en/latest/faq/performance.html
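
For reference, a rough way to split those two costs apart, following the
SQLAlchemy performance FAQ above. This is only a sketch, not Nova code; the
model, connection URL, and imports are made up:

import time

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

from mymodels import Instance  # hypothetical mapped class for the instances table

engine = create_engine("mysql+pymysql://user:pass@db-host/nova_cell1")

# 1) Plain-row fetch via Core: roughly the wire + driver time, no ORM objects.
start = time.monotonic()
with engine.connect() as conn:
    rows = conn.execute(Instance.__table__.select().limit(1000)).fetchall()
print("core fetch: %.3fs (%d rows)" % (time.monotonic() - start, len(rows)))

# 2) The same query via the ORM: adds identity-map bookkeeping and object construction.
start = time.monotonic()
session = Session(bind=engine)
objs = session.query(Instance).limit(1000).all()
session.close()
print("orm fetch:  %.3fs (%d objects)" % (time.monotonic() - start, len(objs)))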


>
> >  so, now the queries are in parallel, but the whole thing still seems
> >  serial.
>
> In your table, you show the time for "1 cell, 1000 instances" as ~3s and
> "10 cells, 1000 instances" as 10s. The problem with comparing those
> directly is that in the latter, you're actually pulling 10,000 records
> over the network, into memory, processing them, and then just returning
> the first 1000 from the sort. A closer comparison would be the "10
> cells, 100 instances" with "1 cell, 1000 instances". In both of those
> cases, you pull 1000 instances total from the db, into memory, and
> return 1000 from the sort. In that case, the multi-cell situation is
> faster (~2.3s vs. ~3.1s). You could also compare the "10 cells, 1000
> instances" case to "1 cell, 10,000 instances" just to confirm at the
> larger scale that it's better or at least the same.
>
> We _have_ to pull $limit instances from each cell, in case (according to
> the sort key) the first $limit instances are all in one cell. We _could_
> try to batch the results from each cell to avoid loading so many that we
> don't need, but we punted this as an optimization to be done later. I'm
> not sure it's really worth the complexity at this point, but it's
> something we could investigate.
>
> --Dan
>


Re: [openstack-dev] [Nova] A multi-cell instance-list performance test

2018-08-17 Thread Dan Smith
> We have tried out the patch:
> https://review.openstack.org/#/c/592698/
> we also applied https://review.openstack.org/#/c/592285/
>
> it turns out that we are able to halve the overall time consumption. We
> did try different sort keys and directions, and the results are similar; we
> didn't try out paging yet:

Excellent! Let's continue discussion of the batching approach in that
review. There are some other things to try.

Thanks!

--Dan



Re: [openstack-dev] [Nova] A multi-cell instance-list performance test

2018-08-16 Thread Zhenyu Zheng
Hi,

Thanks a lot for the reply. For your question #2, we did tests with two
kinds of deployments: 1. there is only one DB server holding all 10 cells (and
cell0), and it is on the same server as the API; 2. we moved 5 of the cell DBs
to another machine on the same rack to see whether it matters, and it turns
out there is no big difference.

For question #3, we did a test with limit = 1000 and 10 cells:
as the charts below show, the CPU load from both the API process and the MySQL
queries is high during the first 3 seconds, but starting from the 4th second
only the API process occupies the CPU, and the memory consumption is low
compared to the CPU consumption. This was tested with the patch fix posted in
the previous mail.

[Attached charts: API-process and MySQL CPU and memory usage during the test]

BR,

Kevin



Re: [openstack-dev] [Nova] A multi-cell instance-list performance test

2018-08-16 Thread Dan Smith
>  yes, the DB queries were done serially; after some investigation, it seems that we
>  are unable to perform eventlet.monkey_patch in uWSGI mode, so
>  Yikun made this fix:
>
>  https://review.openstack.org/#/c/592285/

Cool, good catch :)

>
>  After making this change, we tested again, and we got the following data:
>
>                        total     collect   sort     view
>  before monkey_patch   13.5745   11.7012   1.1511   0.5966
>  after monkey_patch    12.8367   10.5471   1.5642   0.6041
>
>  The performance improved a little, and from the log we can see:

Since these all took ~1s when done in series, but now take ~10s in
parallel, I think you must be hitting some performance bottleneck in
either case, which is why the overall time barely changes. Some ideas:

1. In the real world, I think you really need to have 10x database
   servers or at least a DB server with plenty of cores loading from a
   very fast (or separate) disk in order to really ensure you're getting
   full parallelism of the DB work. However, because these queries all
   took ~1s in your serialized case, I expect this is not your problem.

2. What does the network look like between the api machine and the DB?

3. What do the memory and CPU usage of the api process look like while
   this is happening?

Related to #3, even though we issue the requests to the DB in parallel,
we still process the result of those calls in series in a single python
thread on the API. That means all the work of reading the data from the
socket, constructing the SQLA objects, turning those into nova objects,
etc, all happens serially. It could be that the DB query is really a
small part of the overall time and our serialized python handling of the
result is the slow part. If you see the api process pegging a single
core at 100% for ten seconds, I think that's likely what is happening.
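
To illustrate the pattern (this is only a sketch, not the actual nova-api
code; the two helpers are made up): the greenthreads can overlap the time they
spend waiting on the database, but the CPU-bound work of turning rows into
objects still runs one greenthread at a time.

import eventlet
eventlet.monkey_patch()


def query_cell(cell_name):
    # fetch_rows_from_db / make_object are hypothetical helpers standing in
    # for the SQLAlchemy query and the row -> nova-object conversion.
    rows = fetch_rows_from_db(cell_name)       # I/O-bound: overlaps across cells
    return [make_object(row) for row in rows]  # CPU-bound: one greenthread at a time


pool = eventlet.GreenPool()
threads = {cell: pool.spawn(query_cell, cell) for cell in ["cell1", "cell2", "cell3"]}
results = {cell: gt.wait() for cell, gt in threads.items()}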

>  so, now the queries are in parallel, but the whole thing still seems
>  serial.

In your table, you show the time for "1 cell, 1000 instances" as ~3s and
"10 cells, 1000 instances" as 10s. The problem with comparing those
directly is that in the latter, you're actually pulling 10,000 records
over the network, into memory, processing them, and then just returning
the first 1000 from the sort. A closer comparison would be the "10
cells, 100 instances" with "1 cell, 1000 instances". In both of those
cases, you pull 1000 instances total from the db, into memory, and
return 1000 from the sort. In that case, the multi-cell situation is
faster (~2.3s vs. ~3.1s). You could also compare the "10 cells, 1000
instances" case to "1 cell, 10,000 instances" just to confirm at the
larger scale that it's better or at least the same.

We _have_ to pull $limit instances from each cell, in case (according to
the sort key) the first $limit instances are all in one cell. We _could_
try to batch the results from each cell to avoid loading so many that we
don't need, but we punted this as an optimization to be done later. I'm
not sure it's really worth the complexity at this point, but it's
something we could investigate.
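
A rough sketch of that pull-$limit-per-cell-then-merge behaviour (not Nova's
actual implementation; query_cell() is a made-up helper returning rows, e.g.
dicts, already sorted by the sort key within its cell):

import heapq
from itertools import islice
from operator import itemgetter


def list_across_cells(cells, limit, sort_key="created_at"):
    # Pull up to `limit` rows from every cell, since in the worst case the
    # first `limit` results globally could all live in a single cell.
    per_cell = [query_cell(cell, limit=limit, sort_key=sort_key) for cell in cells]
    # Each per-cell list is already sorted, so a k-way merge gives the global
    # order; anything beyond the first `limit` results was fetched for nothing,
    # which is the waste a batching approach would try to avoid.
    merged = heapq.merge(*per_cell, key=itemgetter(sort_key))
    return list(islice(merged, limit))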

--Dan



Re: [openstack-dev] [Nova] A multi-cell instance-list performance test

2018-08-16 Thread Yikun Jiang
Some more information:
*1. How did we record the time when listing?*
You can see all our changes in:
http://paste.openstack.org/show/728162/

Total cost:  L26
Construct view: L43
Data gather per cell cost:   L152
Data gather all cells cost: L174
Merge Sort cost: L198
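
The exact instrumentation is in the paste above; the idea is just a stopwatch
around each phase, roughly like this (the helper names here are illustrative,
not the real code):

import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink):
    # Record the wall-clock duration of the wrapped block under `label`.
    start = time.monotonic()
    try:
        yield
    finally:
        sink[label] = time.monotonic() - start


timings = {}
with timed("total", timings):
    with timed("data_gather", timings):
        records = gather_from_all_cells()      # hypothetical helper
    with timed("merge_sort", timings):
        records = merge_sort_records(records)  # hypothetical helper
    with timed("construct_view", timings):
        view = build_view(records)             # hypothetical helper
print(timings)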

*2. Why was it not parallel in the first result?*
The root reason the data gathering in the first table is not parallel is that
we don't enable eventlet.monkey_patch (in particular, the time flag is not
True) under the uWSGI mode.

As a result, oslo_db's thread yield [1] doesn't work, and all DB data-gathering
threads are blocked until they have fetched all of their data from the DB.

So the gathering process effectively runs in serial, and we fixed that in [2].

But even after fix [2], there is not as much improvement as we expected; it
looks like the threads still influence each other, so we need your ideas. : )

[1]
https://github.com/openstack/oslo.db/blob/256ebc3/oslo_db/sqlalchemy/engines.py#L51
[2] https://review.openstack.org/#/c/592285/
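
In other words, the fix is just to make sure eventlet patches the standard
library (including time) before the rest of nova-api is imported when running
under uWSGI, roughly along these lines (a simplified illustration, not the
exact content of [2]; the commented import is only an example):

import eventlet

# eventlet.monkey_patch() with no arguments patches time as well; without the
# time patch, oslo.db's time.sleep(0) yield point [1] is the unpatched sleep
# and never switches greenthreads.
eventlet.monkey_patch()

# Only then import and build the WSGI application, e.g.:
# from nova.api.openstack import wsgi_app
# application = wsgi_app.init_application('osapi_compute')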

Regards,
Yikun

Jiang Yikun(Kero)
Mail: yikunk...@gmail.com


Zhenyu Zheng  wrote on Thu, Aug 16, 2018 at 3:54 PM:

> Hi, Nova
>
> As the Cells v2 architecture is getting mature, and CERN has used it and it
> seems to work well, *Huawei* is also considering using it in our Public
> Cloud deployments.
> As we still have concerns about the performance of multi-cell listing,
> *Yikun Jiang* and I have recently done a performance test for ``instance
> list`` across a multi-cell deployment, and we would like to share our test
> results and findings.
>
> First, I want to describe our testing environment. As we (Yikun and I) are
> doing this as a proof of concept (to show the ratio between the time spent
> querying data from the DB, sorting, etc.), we are doing it on our own
> machine. The machine has 16 CPUs and 80 GB of RAM, but it is old, so the
> disk might be slow. So we will not judge the absolute time consumption
> itself, but rather the overall logic and the ratios between the different
> steps. We are doing this with a devstack deployment on this single machine.
>
> Then I would like to share our test plan: we set up 10 cells (cell1~cell10)
> and generated 10,000 instance records in each of those cells (considering 20
> instances per host, that would be about 500 hosts, which seems a good size
> for a cell); cell0 is kept empty, as the number of errored instances should
> be very small and it doesn't really matter.
> We tested the time consumption for listing instances across 1, 2, 5, and 10
> cells (cell0 is always queried, so it is actually 2, 3, 6 and 11 cells) with
> limits of 100, 200, 500 and 1000, as the default maximum limit is 1000. In
> order to get more general results, we tested the listing with the default
> sort key and direction, sorted by instance uuid, and sorted by uuid & name.
> This is what we got (the time unit is seconds):
>
>  (columns per sort order: Total Cost, Data Gather Cost, Merge Sort Cost, Construct View)
>
>  Cells Limit | Default sort                      | uuid sort                         | uuid+name sort
>              | Total    Gather   Sort    View    | Total    Gather   Sort    View    | Total    Gather   Sort    View
>  10    100   | 2.3313   2.1306   0.1145  0.0672  | 2.3693   2.1343   0.1148  0.1016  | 2.3284   2.1264   0.1145  0.0679
>        200   | 3.5979   3.2137   0.2287  0.1265  | 3.5316   3.1509   0.2265  0.1255  | 3.481    3.054    0.2697  0.1284
>        500   | 7.1952   6.2597   0.5704  0.3029  | 7.5057   6.4761   0.6263  0.341   | 7.4885   6.4623   0.6239  0.3404
>        1000  | 13.5745  11.7012  1.1511  0.5966  | 13.8408  11.9007  1.2268  0.5939  | 13.8813  11.913   1.2301  0.6187
>  5     100   | 1.3142   1.1003   0.1163  0.0706  | 1.2458   1.0498   0.1163  0.0665  | 1.2528   1.0579   0.1161  0.066
>        200   | 2.0151   1.6063   0.2645  0.1255  | 1.9866   1.5386   0.2668  0.1615  | 2.0352   1.6246   0.2646  0.1262
>        500   | 4.2109   3.1358   0.7033  0.3343  | 4.1605   3.0893   0.6951  0.3384  | 4.1972   3.2461   0.6104  0.3028
>        1000  | 7.841    5.8881   1.2027  0.6802  | 7.7135   5.9121   1.1363  0.5969  | 7.8377   5.9385   1.1936  0.6376
>  2     100   | 0.6736   0.4727   0.1113  0.0822  | 0.605    0.4192   0.1105  0.0656  | 0.688    0.4613   0.1126  0.0682
>        200   | 1.1226   0.7229   0.2577  0.1255  | 1.0268   0.6671   0.2255  0.1254  | 1.2805   0.8171   0.      0.1258
>        500   | 2.2358   1.3506   0.5595  0.3026  | 2.3307   1.2748   0.6581  0.3362  | 2.741    1.6023   0.633   0.3365
>        1000  | 4.2079   2.3367   1.2053  0.5986  | 4.2384   2.4071   1.2017  0.633   | 4.3437   2.4136   1.217   0.6394
>  1     100   | 0.4857