Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-03-04 Thread Sylvain Bauza

Both of you, thanks for your insights. Greatly appreciated.

Le 04/03/2016 09:25, Cheng, Yingxin a écrit :

Hi,

First of all, many delayed thanks to Jay Pipes for his benchmarking framework; I learnt 
a lot from it :)

Other comments inline.

On Friday, March 4, 2016 8:42 AM Jay Pipes wrote:

Hi again, Yingxin, sorry for the delayed response... been traveling.
Comments inline :)

On 03/01/2016 12:34 AM, Cheng, Yingxin wrote:

Hi,

I have simulated the distributed resource management with the incremental
update model based on Jay's benchmarking framework:
https://github.com/cyx1231st/placement-bench/tree/shared-state-
demonstration. The complete result lies at
http://paste.openstack.org/show/488677/. It was run on a VM with 4 cores and
4GB RAM, and the mysql service uses the default settings except for
"innodb_buffer_pool_size", which is set to "2G". The number of simulated compute
nodes is set to "300".

A few things.

First, in order to make any predictions or statements about a potential
implementation's scaling characteristics, you need to run the benchmarks with
increasing levels of compute nodes. The results you show give us only a single
dimension of scaling (300 compute nodes). What you want to do is run the
benchmarks at 100, 200, 400, 800 and 1600 compute node scales. You don't
need to run *all* of the different permutations of placement/partition/workers
scenarios, of course. I'd suggest just running the none partition strategy and 
the
pack placement strategy at 8 worker processes. Those results will give you (and
us!) the data points that will indicate the scaling behaviour of the 
shared-state-
scheduler implementation proposal as the number of compute nodes in the
deployment increases. The "none" partitioning strategy represents the reality of
the existing scheduler implementation, which does not shard the deployment
into partitions but retrieves all compute nodes for the entire deployment on
every request to the scheduler's
select_destinations() method.

Hmm... good suggestion. I don't want to run all the benchmarks either; it 
makes me wait for a whole day and produces too much data to evaluate.

300 is the maximum number of nodes I can test in my environment; beyond that the db 
refuses to work because of connection limits, since all those nodes are asking for 
connections. Should I emulate "conductors" to limit the db connections, build up a 
thread pool to connect, or edit the db configuration? I'm wondering whether I should 
write a new tool to run tests in a more realistic environment.


Yeah, I think you need to simulate the same conditions as a real 
production environment, where the compute nodes are not actually writing 
to the DB directly, but going through conductors instead.
Having the same model (i.e. AMQP casts to the conductor and the conductor 
worker writing to the DB) would give us more accurate figures 
and would potentially help you scale your lab.


Again, we have a fake oslo.messaging driver for this kind of purpose; I 
suggest you take a look at the Nova in-tree functional tests to see 
how we set that up.



Secondly, and I'm not sure if you intended this, the code in your
compute_node.py file in the placement-bench project is not thread-safe.
In other words, your code assumes that only a single process on each compute
node could ever actually run the database transaction that inserts allocation
records at any time.

[a]
So a single thread in each node is already good enough to let the 
"shared-state" scheduler make 1000+ decisions per second. And because 
those claims are made in a distributed fashion on the nodes, they are actually written 
to the db by 300 nodes in parallel. AFAIK, the compute node is single-threaded; it actually 
uses greenthreads instead of real threads.


True, but keep in mind that we have a very weird design where the 
compute manager actually initializes 1 or more ResourceTrackers 
(depending on the number of "nodes" attached to the compute-manager 
service) which means that you have potentially synchronized sections 
running concurrently when trying to update the stats.


I would appreciate it if you could amend your branch to add a synchronized 
section there: 
https://github.com/cyx1231st/placement-bench/blob/shared-state-demonstration/compute_node.py#L91 
That would really simulate how the RT works. Adding the possibility 
of a 1:N relationship between node(s) and service would also 
bring the simulator closer to the real situation.
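
For reference, here is a minimal sketch of what such a synchronized section could 
look like. The class, method and attribute names are illustrative, not the actual 
placement-bench code; Nova's real ResourceTracker serializes claims behind a 
module-level semaphore in a similar way.

from oslo_concurrency import lockutils

COMPUTE_RESOURCE_SEMAPHORE = 'compute_resources'


class SimulatedComputeNode(object):
    def __init__(self, node_id, inventory):
        self.node_id = node_id
        self.inventory = inventory      # e.g. {'vcpu': 16, 'ram_mb': 32768}
        self.allocations = []

    @lockutils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
    def claim(self, request):
        # Serialized claim, mimicking how the RT guards its in-memory view.
        for resource, amount in request.items():
            if self.inventory[resource] < amount:
                return False            # not enough room, scheduler must retry
        for resource, amount in request.items():
            self.inventory[resource] -= amount
        self.allocations.append(request)
        return True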

If you want more than a single process on the compute
node to be able to handle claims of resources, you will need to modify that code
to use a compare-and-update strategy, checking a "generation" attribute on the
inventory record to ensure that another process on the compute node hasn't
simultaneously updated the allocations information for that compute node.

I still don't think the compare-and-update strategy should be forced onto "compute-local" 
resources even if the compute service is changed to use multiple processes. The scheduler 

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-03-04 Thread Cheng, Yingxin
Hi,

First of all, many delayed thanks to Jay Pipes for his benchmarking framework; I learnt 
a lot from it :)

Other comments inline.

On Friday, March 4, 2016 8:42 AM Jay Pipes wrote:
> Hi again, Yingxin, sorry for the delayed response... been traveling.
> Comments inline :)
> 
> On 03/01/2016 12:34 AM, Cheng, Yingxin wrote:
> > Hi,
> >
> > I have simulated the distributed resource management with the incremental
> update model based on Jay's benchmarking framework:
> https://github.com/cyx1231st/placement-bench/tree/shared-state-
> demonstration. The complete result lies at
> http://paste.openstack.org/show/488677/. It's ran by a VM with 4 cores and
> 4GB RAM, and the mysql service is using the default settings with the
> "innodb_buffer_pool_size" setting to "2G". The number of simulated compute
> nodes are set to "300".
> 
> A few things.
> 
> First, in order to make any predictions or statements about a potential
> implementation's scaling characteristics, you need to run the benchmarks with
> increasing levels of compute nodes. The results you show give us only a single
> dimension of scaling (300 compute nodes). What you want to do is run the
> benchmarks at 100, 200, 400, 800 and 1600 compute node scales. You don't
> need to run *all* of the different permutations of placement/partition/workers
> scenarios, of course. I'd suggest just running the none partition strategy 
> and the
> pack placement strategy at 8 worker processes. Those results will give you 
> (and
> us!) the data points that will indicate the scaling behaviour of the 
> shared-state-
> scheduler implementation proposal as the number of compute nodes in the
> deployment increases. The "none" partitioning strategy represents the reality 
> of
> the existing scheduler implementation, which does not shard the deployment
> into partitions but retrieves all compute nodes for the entire deployment on
> every request to the scheduler's
> select_destinations() method.

Hmm... good suggestion. I don't want to run all the benchmarks either; it 
makes me wait for a whole day and produces too much data to evaluate.

300 is the maximum number of nodes I can test in my environment; beyond that the db 
refuses to work because of connection limits, since all those nodes are asking for 
connections. Should I emulate "conductors" to limit the db connections, build up 
a thread pool to connect, or edit the db configuration? I'm wondering whether I 
should write a new tool to run tests in a more realistic environment.

> Secondly, and I'm not sure if you intended this, the code in your
> compute_node.py file in the placement-bench project is not thread-safe.
> In other words, your code assumes that only a single process on each compute
> node could ever actually run the database transaction that inserts allocation
> records at any time.

[a]
So a single thread in each node is already good enough to let the 
"shared-state" scheduler make 1000+ decisions per second. And because 
those claims are made in a distributed fashion on the nodes, they are actually written 
to the db by 300 nodes in parallel. AFAIK, the compute node is single-threaded; it 
actually uses greenthreads instead of real threads.

> If you want more than a single process on the compute
> node to be able to handle claims of resources, you will need to modify that 
> code
> to use a compare-and-update strategy, checking a "generation" attribute on the
> inventory record to ensure that another process on the compute node hasn't
> simultaneously updated the allocations information for that compute node.

I still don't think the compare-and-update strategy should be forced onto 
"compute-local" resources even if the compute service is changed to use 
multiple processes. Scheduler decisions about those "compute-local" resources 
can be checked and confirmed against the accurate in-memory view of local resources 
in the resource tracker, which is far faster than db operations. The 
subsequent inventory insertion can then be concurrent, without locks.

The db is only responsible for using the "compare-and-update" strategy to claim 
shared resources, persisting the confirmed scheduler decision and its consumption 
into inventories, and then telling the compute service that it's OK to start the 
long job of spawning the VM.

> Third, you have your scheduler workers consuming messages off the request
> queue using get_nowait(), while you left the original placement scheduler 
> using
> the blocking get() call. :) Probably best to compare apples to apples and have
> them both using the blocking get() call.

Sorry, I don't agree with this. Consuming messages and getting requests are 
two entirely different things. I've tried adding a timer around the "get()" 
method; there is actually no blocking, because the requests are already prepared 
and put into the queue. Note that there is a "None" sentinel for every scheduler at the 
end of the request queue, so the emulated scheduler stops getting more requests 
immediately once there are none left. There is no wait at all.
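
To illustrate the sentinel pattern being described, here is a minimal sketch 
(the names are illustrative, not the benchmark's actual code): every worker gets 
one None sentinel at the end of the queue, so a blocking get() returns 
immediately once the pre-filled requests are exhausted.

import queue
import threading

NUM_WORKERS = 8
requests = queue.Queue()

for i in range(1000):           # requests are prepared up front
    requests.put({'request_id': i})
for _ in range(NUM_WORKERS):    # one sentinel per worker
    requests.put(None)


def worker():
    while True:
        item = requests.get()   # never blocks: the queue is already populated
        if item is None:
            break               # sentinel -> stop immediately
        # ... schedule the request ...


threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()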


Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-03-03 Thread Jay Pipes
Hi again, Yingxin, sorry for the delayed response... been traveling. 
Comments inline :)


On 03/01/2016 12:34 AM, Cheng, Yingxin wrote:

Hi,

I have simulated the distributed resource management with the incremental update model based on Jay's 
benchmarking framework: https://github.com/cyx1231st/placement-bench/tree/shared-state-demonstration. The 
complete result lies at http://paste.openstack.org/show/488677/. It was run on a VM with 4 cores and 4GB RAM, 
and the mysql service uses the default settings except for "innodb_buffer_pool_size", which is set to 
"2G". The number of simulated compute nodes is set to "300".


A few things.

First, in order to make any predictions or statements about a potential 
implementation's scaling characteristics, you need to run the benchmarks 
with increasing levels of compute nodes. The results you show give us 
only a single dimension of scaling (300 compute nodes). What you want to 
do is run the benchmarks at 100, 200, 400, 800 and 1600 compute node 
scales. You don't need to run *all* of the different permutations of 
placement/partition/workers scenarios, of course. I'd suggest just 
running the none partition strategy and the pack placement strategy at 8 
worker processes. Those results will give you (and us!) the data points 
that will indicate the scaling behaviour of the shared-state-scheduler 
implementation proposal as the number of compute nodes in the deployment 
increases. The "none" partitioning strategy represents the reality of 
the existing scheduler implementation, which does not shard the 
deployment into partitions but retrieves all compute nodes for the 
entire deployment on every request to the scheduler's 
select_destinations() method.


Secondly, and I'm not sure if you intended this, the code in your 
compute_node.py file in the placement-bench project is not thread-safe. 
In other words, your code assumes that only a single process on each 
compute node could ever actually run the database transaction that 
inserts allocation records at any time. If you want more than a single 
process on the compute node to be able to handle claims of resources, 
you will need to modify that code to use a compare-and-update strategy, 
checking a "generation" attribute on the inventory record to ensure that 
another process on the compute node hasn't simultaneously updated the 
allocations information for that compute node.
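
For reference, a minimal sketch of the compare-and-update pattern described here, 
with the "generation" column acting as an optimistic lock. The table and column 
names are illustrative, not the actual schema: the UPDATE only succeeds if nobody 
else bumped the generation since we read it, otherwise we re-read and retry.

import sqlite3

def claim(conn, provider_id, vcpus_requested, max_retries=5):
    for _ in range(max_retries):
        row = conn.execute(
            "SELECT generation, vcpus_used, vcpus_total"
            " FROM inventories WHERE provider_id = ?",
            (provider_id,)).fetchone()
        generation, used, total = row
        if used + vcpus_requested > total:
            return False                       # no capacity, claim fails
        cur = conn.execute(
            "UPDATE inventories"
            " SET vcpus_used = ?, generation = generation + 1"
            " WHERE provider_id = ? AND generation = ?",
            (used + vcpus_requested, provider_id, generation))
        conn.commit()
        if cur.rowcount == 1:
            return True                        # nobody raced with us
        # rowcount == 0: another writer bumped the generation, so retry
    return False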


Third, you have your scheduler workers consuming messages off the 
request queue using get_nowait(), while you left the original placement 
scheduler using the blocking get() call. :) Probably best to compare 
apples to apples and have them both using the blocking get() call.




First, the conclusions from the result of the eventually consistent scheduler state 
simulation (i.e. rows where "do claim in compute?" = Yes):
#accuracy
1. The final decision accuracy is 100%: no resource usage exceeds the real 
capacity, as verified by examining the db records at the end of each run.


Again, with your simulation, this assumes only a single thread will ever 
attempt a claim on each compute node at any given time.



2. The schedule decision accuracy is 100% if there is only one scheduler: all successful 
scheduler decisions also succeed on the compute nodes, so no retries are recorded, i.e. "Count of 
requests processed" = "Placement query count". See 
http://paste.openstack.org/show/488696/


Yep, no disagreement here :)


3. The schedule decision accuracy is 100% if "Partition strategy" is set to 
"modulo", regardless of how many scheduler processes there are. See 
http://paste.openstack.org/show/488697/
#racing


Yep, modulo partitioning eliminates the race conditions when the number 
of partitions == the number of worker processes. However, this isn't 
representative of the existing scheduler system which processes every 
compute node in the deployment on every call to select_destinations().
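
To make the partitioning point concrete, here is a small sketch of the "modulo" 
strategy (function names are illustrative): each scheduler worker only ever 
considers the compute nodes in its own partition, so two workers can never pick 
the same node and race on its resources.

def partition_for(node_id, num_partitions):
    return node_id % num_partitions


def candidate_nodes(all_nodes, worker_index, num_workers):
    # With num_partitions == num_workers, the node sets are disjoint.
    return [n for n in all_nodes
            if partition_for(n, num_workers) == worker_index]


# e.g. 300 nodes, 8 workers: worker 3 only sees nodes 3, 11, 19, ...
nodes = list(range(300))
print(candidate_nodes(nodes, 3, 8)[:5])   # [3, 11, 19, 27, 35]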


What happens in the shared-state-scheduler approach when you want to 
scale the scheduler process out with more scheduler processes handling 
more load? What about having two scheduler processes handling the 
scheduling to the same partition (i.e. highly-available scheduling)? 
Both of these situations will introduce contention into the scheduling 
process and introduce races that will manifest themselves on the compute 
nodes instead of in the scheduler processes themselves where the total 
deadlock and retry time can be limited.



4. No racing happens if there is only one scheduler process or the "Partition 
strategy" is set to "modulo", as explained by 2 and 3.


Yes, no disagreement.


5. The multiple-schedulers racing rate is extremely low using the "spread" or "random" placement strategy used by the legacy filter 
scheduler: the rate is 3.0% using the "spread" strategy and 0.15% using the "random" strategy; note that 8 workers are 
processing about 12000 requests within 20 seconds. The result is even better than the resource-provider scheduler (rows where "do claim in 

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-03-02 Thread Sylvain Bauza



Le 02/03/2016 09:15, Cheng, Yingxin a écrit :

On Tuesday, March 1, 2016 7:29 PM, John Garbutt  
wrote

On 1 March 2016 at 08:34, Cheng, Yingxin  wrote:

Hi,

I have simulated the distributed resource management with the incremental
update model based on Jay's benchmarking framework:
https://github.com/cyx1231st/placement-bench/tree/shared-state-
demonstration. The complete result lies at
http://paste.openstack.org/show/488677/. It was run on a VM with 4 cores and
4GB RAM, and the mysql service uses the default settings except for
"innodb_buffer_pool_size", which is set to "2G". The number of simulated compute
nodes is set to "300".

[...]

Second, here's what I've found in the centralized db claim design (i.e. rows that
"do claim in compute?" = No):
1. The speed of legacy python filtering is not slow (see rows where
"Filter strategy" = python): "Placement total query time" records the
cost of all query time, including fetching states from the db and filtering
using python. The actual cost of python filtering is
(Placement_total_query_time - Placement_total_db_query_time), and
that's only about 1/4 of the total cost, or even less. It also means python
in-memory filtering is much faster than db filtering in this
experiment. See http://paste.openstack.org/show/488710/
2. The speed of the `db filter strategy` and the legacy `python filter
strategy` are in the same order of magnitude, so not a huge
improvement. See the comparison of the column "Placement total query
time". Note that the extra cost of the `python filter strategy` mainly
comes from "Placement total db query time" (i.e. fetching states from
the db). See http://paste.openstack.org/show/488709/

I think it might be time to run this in a nova-scheduler like
environment: eventlet threads responding to rabbit, using pymysql backend, etc.
Note we should get quite a bit of concurrency within a single nova-scheduler
process with the db approach.

I suspect clouds that are largely full of pets, pack/fill first, with a smaller
percentage of cattle on top, will benefit the most, as that initial DB filter 
will
bring back a small list of hosts.


Third, my major concern with the "centralized db claim" design is that it puts too much
scheduling work into the centralized db, and it cannot be scaled simply by adding
conductors and schedulers.

1. The filtering work is mostly done inside the db by executing complex SQL. If
the filtering logic becomes much more complex (currently only CPU and RAM are
accounted for in the experiment), the db overhead will be considerable.

So, to clarify, only resources we have claims for in the DB will be filtered in 
the
DB. All other filters will still occur in python.

The good news is that if that turns out to be the wrong trade-off, it's easy to
revert to doing all the filtering in python, with zero impact on the DB
schema.

Another point is that the db filtering will recalculate every resource to get 
its free value from inventories and allocations each time a schedule 
request comes in. This overhead is unnecessary if the scheduler can accept 
incremental updates to adjust its cache of free resources.
It also means there must be a mechanism based on strict version control of 
the scheduler caches to make sure those updates are handled correctly.


2. The races between centralized claims are resolved by rolling back transactions
and by checking the generations (see the design of the "compare and update"
strategy in https://review.openstack.org/#/c/283253/); this also causes additional
overhead on the db.

It's worth noting that this pattern is designed to work well with a Galera DB cluster,
including one that has writes going to all the nodes.

I know; my point is that the "distributed resource management" with resource 
trackers doesn't need db locks or db rollbacks (or their additional overhead) for 
compute-local resources, regardless of the type of database.
  


Well, the transactional "compare-and-update" doesn't need to be done on 
the scheduler side, but it will still be needed if we leave the compute 
nodes updating their own resources.
To be clear, the fact that the scheduler doesn't need it doesn't mean the 
DB models shouldn't have some "compare-and-update" strategy. But I got 
your point :-)




3. The db overhead of the filtering operation can be reduced by moving
it to the schedulers; that will be 38 times faster and can be executed
in parallel by the schedulers, according to the column "Placement avg query
time". See http://paste.openstack.org/show/488715/
4. The "compare and update" overhead can be partially relaxed by using
distributed resource claims in resource trackers. There is no need to roll back
transactions when updating inventories of compute-local resources in order to be
accurate. This is confirmed by checking the db records at the end of each run of
the eventually consistent scheduler state design.

5. If a) all the filtering operations are done inside schedulers,
 b) schedulers do not need to refresh 

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-03-02 Thread Cheng, Yingxin
On Tuesday, March 1, 2016 7:29 PM, John Garbutt  
wrote
> On 1 March 2016 at 08:34, Cheng, Yingxin  wrote:
> > Hi,
> >
> > I have simulated the distributed resource management with the incremental
> update model based on Jay's benchmarking framework:
> https://github.com/cyx1231st/placement-bench/tree/shared-state-
> demonstration. The complete result lies at
> http://paste.openstack.org/show/488677/. It's ran by a VM with 4 cores and
> 4GB RAM, and the mysql service is using the default settings with the
> "innodb_buffer_pool_size" setting to "2G". The number of simulated compute
> nodes are set to "300".
> >
> > [...]
> >
> > Second, here's what I've found in the centralized db claim design(i.e. rows 
> > that
> "do claim in compute?" = No):
> > 1. The speed of legacy python filtering is not slow(see rows that
> > "Filter strategy" = python): "Placement total query time" records the
> > cost of all query time including fetching states from db and filtering
> > using python. The actual cost of python filtering is
> > (Placement_total_query_time - Placement_total_db_query_time), and
> > that's only about 1/4 of total cost or even less. It also means python
> > in-memory filtering is much faster than db filtering in this
> > experiment. See http://paste.openstack.org/show/488710/
> > 2. The speed of `db filter strategy` and the legacy `python filter
> > strategy` are in the same order of magnitude, not a very huge
> > improvement. See the comparison of column "Placement total query
> > time". Note that the extra cost of `python filter strategy` mainly
> > comes from "Placement total db query time"(i.e. fetching states from
> > db). See http://paste.openstack.org/show/488709/
> 
> I think it might be time to run this in a nova-scheduler like
> environment: eventlet threads responding to rabbit, using pymysql backend, 
> etc.
> Note we should get quite a bit of concurrency within a single nova-scheduler
> process with the db approach.
> 
> I suspect clouds that are largely full of pets, pack/fill first, with a 
> smaller
> percentage of cattle on top, will benefit the most, as that initial DB filter 
> will
> bring back a small list of hosts.
> 
> > Third, my major concern of "centralized db claim" design is: Putting too 
> > much
> scheduling works into the centralized db, and it is not scalable by simply 
> adding
> conductors and schedulers.
> > 1. The filtering works are majorly done inside db by executing complex 
> > sqls. If
> the filtering logic is much more complex(currently only CPU and RAM are
> accounted in the experiment), the db overhead will be considerable.
> 
> So, to clarify, only resources we have claims for in the DB will be filtered 
> in the
> DB. All other filters will still occur in python.
> 
> The good news, is that if that turns out to be the wrong trade off, its easy 
> to
> revert back to doing all the filtering in python, with zero impact on the DB
> schema.

Another point is that the db filtering will recalculate every resource to get 
its free value from inventories and allocations each time a schedule 
request comes in. This overhead is unnecessary if the scheduler can accept 
incremental updates to adjust its cache of free resources.
It also means there must be a mechanism based on strict version control of 
the scheduler caches to make sure those updates are handled correctly.
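
To make the incremental-update idea concrete, here is a minimal sketch (all names 
are illustrative, not the prototype's actual code): the scheduler keeps a per-node 
view of free resources and applies versioned deltas from the resource trackers 
instead of re-reading inventories from the db on every request.

class HostStateCache(object):
    def __init__(self):
        self.free = {}      # node_id -> {'vcpu': ..., 'ram_mb': ...}
        self.version = {}   # node_id -> last applied update version

    def apply_update(self, node_id, version, delta):
        # Strict version control: only apply the next expected update;
        # otherwise the cache is out of sync and a full refresh is needed.
        expected = self.version.get(node_id, 0) + 1
        if version != expected:
            return False    # caller should request a full host state sync
        node_free = self.free.setdefault(node_id, {})
        for resource, amount in delta.items():
            # delta is negative for consumption, positive for freed resources
            node_free[resource] = node_free.get(resource, 0) + amount
        self.version[node_id] = version
        return True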

> > 2. The racing of centralized claims are resolved by rolling back 
> > transactions
> and by checking the generations(see the design of "compare and update"
> strategy in https://review.openstack.org/#/c/283253/), it also causes 
> additional
> overhead to db.
> 
> Its worth noting this pattern is designed to work well with a Galera DB 
> cluster,
> including one that has writes going to all the nodes.

I know; my point is that the "distributed resource management" with resource 
trackers doesn't need db locks or db rollbacks (or their additional overhead) for 
compute-local resources, regardless of the type of 
database.
 
> > 3. The db overhead of filtering operation can be relaxed by moving
> > them to schedulers, that will be 38 times faster and can be executed
> > in parallel by schedulers according to the column "Placement avg query
> > time". See http://paste.openstack.org/show/488715/
> > 4. The "compare and update" overhead can be partially relaxed by using
> distributed resource claims in resource trackers. There is no need to roll 
> back
> transactions in updating inventories of compute local resources in order to be
> accurate. It is confirmed by checking the db records at the end of each run of
> eventually consistent scheduler state design.
> > 5. If a) all the filtering operations are done inside schedulers,
> > b) schedulers do not need to refresh caches from db because of
> incremental updates,
> > c) it is no need to do "compare and update" to compute-local
> resources(i.e. none-shared resources),

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-03-01 Thread John Garbutt
On 1 March 2016 at 08:34, Cheng, Yingxin  wrote:
> Hi,
>
> I have simulated the distributed resource management with the incremental 
> update model based on Jay's benchmarking framework: 
> https://github.com/cyx1231st/placement-bench/tree/shared-state-demonstration. 
> The complete result lies at http://paste.openstack.org/show/488677/. It's ran 
> by a VM with 4 cores and 4GB RAM, and the mysql service is using the default 
> settings with the "innodb_buffer_pool_size" setting to "2G". The number of 
> simulated compute nodes are set to "300".
>
> [...]
>
> Second, here's what I've found in the centralized db claim design(i.e. rows 
> that "do claim in compute?" = No):
> 1. The speed of legacy python filtering is not slow(see rows that "Filter 
> strategy" = python): "Placement total query time" records the cost of all 
> query time including fetching states from db and filtering using python. The 
> actual cost of python filtering is (Placement_total_query_time - 
> Placement_total_db_query_time), and that's only about 1/4 of total cost or 
> even less. It also means python in-memory filtering is much faster than db 
> filtering in this experiment. See http://paste.openstack.org/show/488710/
> 2. The speed of `db filter strategy` and the legacy `python filter strategy` 
> are in the same order of magnitude, not a very huge improvement. See the 
> comparison of column "Placement total query time". Note that the extra cost 
> of `python filter strategy` mainly comes from "Placement total db query 
> time"(i.e. fetching states from db). See 
> http://paste.openstack.org/show/488709/

I think it might be time to run this in a nova-scheduler like
environment: eventlet threads responding to rabbit, using pymysql
backend, etc. Note we should get quite a bit of concurrency within a
single nova-scheduler process with the db approach.

I suspect clouds that are largely full of pets, pack/fill first, with
a smaller percentage of cattle on top, will benefit the most, as that
initial DB filter will bring back a small list of hosts.

> Third, my major concern of "centralized db claim" design is: Putting too much 
> scheduling works into the centralized db, and it is not scalable by simply 
> adding conductors and schedulers.
> 1. The filtering works are majorly done inside db by executing complex sqls. 
> If the filtering logic is much more complex(currently only CPU and RAM are 
> accounted in the experiment), the db overhead will be considerable.

So, to clarify, only resources we have claims for in the DB will be
filtered in the DB. All other filters will still occur in python.

The good news is that if that turns out to be the wrong trade-off,
it's easy to revert to doing all the filtering in python, with
zero impact on the DB schema.

> 2. The racing of centralized claims are resolved by rolling back transactions 
> and by checking the generations(see the design of "compare and update" 
> strategy in https://review.openstack.org/#/c/283253/), it also causes 
> additional overhead to db.

It's worth noting that this pattern is designed to work well with a Galera
DB cluster, including one that has writes going to all the nodes.

> 3. The db overhead of filtering operation can be relaxed by moving them to 
> schedulers, that will be 38 times faster and can be executed in parallel by 
> schedulers according to the column "Placement avg query time". See 
> http://paste.openstack.org/show/488715/
> 4. The "compare and update" overhead can be partially relaxed by using 
> distributed resource claims in resource trackers. There is no need to roll 
> back transactions in updating inventories of compute local resources in order 
> to be accurate. It is confirmed by checking the db records at the end of each 
> run of eventually consistent scheduler state design.
> 5. If a) all the filtering operations are done inside schedulers,
> b) schedulers do not need to refresh caches from db because of 
> incremental updates,
> c) it is no need to do "compare and update" to compute-local 
> resources(i.e. none-shared resources),
>  then here is the performance comparison using 1 scheduler instances: 
> http://paste.openstack.org/show/488717/

The other side of the coin here is sharding.

For example, we could have a dedicated DB cluster for just the
scheduler data (need to add code to support that, but should be
possible now, I believe).

Consider if you have three types of hosts, that map directly to
specific flavors. You can shard your scheduler and DB clusters into
those groups (i.e. compute node inventory lives only in one of the
shards). When the request comes in you just route appropriate build
requests to each of the scheduler clusters.
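
A rough sketch of that routing idea (the shard names, mapping and function are 
hypothetical, just to illustrate the flow): each shard has its own scheduler/DB 
cluster holding only that shard's compute node inventory, and incoming build 
requests are routed by flavor.

FLAVOR_SHARDS = {
    'compute-optimized': 'shard-a',
    'memory-optimized': 'shard-b',
    'general-purpose': 'shard-c',
}


def route_build_request(request):
    shard = FLAVOR_SHARDS[request['flavor']]
    # Send the request to the scheduler cluster that owns this shard's
    # inventory; no other shard ever sees these hosts.
    return shard


print(route_build_request({'flavor': 'memory-optimized'}))   # shard-b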

If you have a large enough deployment, you can shard your hosts across
several DB clusters, and use a modulo or random sharding strategy to
pick which cluster the request lands on. There are issues around
ensuring you do capacity planning that 

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-03-01 Thread Cheng, Yingxin
Hi,

I have simulated the distributed resource management with the incremental 
update model based on Jay's benchmarking framework: 
https://github.com/cyx1231st/placement-bench/tree/shared-state-demonstration. 
The complete result lies at http://paste.openstack.org/show/488677/. It was run 
on a VM with 4 cores and 4GB RAM, and the mysql service uses the default 
settings except for "innodb_buffer_pool_size", which is set to "2G". The number of 
simulated compute nodes is set to "300".


First, the conclusions from the result of the eventually consistent scheduler 
state simulation (i.e. rows where "do claim in compute?" = Yes):
#accuracy
1. The final decision accuracy is 100%: no resource usage exceeds the real 
capacity, as verified by examining the db records at the end of each run.
2. The schedule decision accuracy is 100% if there is only one scheduler: all 
successful scheduler decisions also succeed on the compute nodes, so no retries 
are recorded, i.e. "Count of requests processed" = "Placement query count". 
See http://paste.openstack.org/show/488696/
3. The schedule decision accuracy is 100% if "Partition strategy" is set to 
"modulo", regardless of how many scheduler processes there are. See 
http://paste.openstack.org/show/488697/
#racing
4. No racing happens if there is only one scheduler process or the 
"Partition strategy" is set to "modulo", as explained by 2 and 3.
5. The multiple-schedulers racing rate is extremely low using the "spread" or 
"random" placement strategy used by the legacy filter scheduler: the rate is 3.0% 
using the "spread" strategy and 0.15% using the "random" strategy; note that 
8 workers are processing about 12000 requests within 20 seconds. The result is 
even better than the resource-provider scheduler (rows where "do claim in compute?" = 
No), which shows 82.9% using the "spread" strategy and 2.52% using the "random" 
strategy for 12000 requests within 70-190 seconds. See 
http://paste.openstack.org/show/488699/. Note that the retry rate is calculated as 
(Placement query count - Count of requests processed) / Count of requests 
processed * 100%.
#overwhelming messages
6. The total count of messages is only affected by the number of schedulers 
and the number of schedule queries, NOT by the number of compute nodes. See 
http://paste.openstack.org/show/488701/
7. The number of messages per successful query is (number_of_schedulers + 2); its 
growth pattern is linear and only affected by the number of scheduler processes. 
And there is only 1 message if the query failed. This is not a huge number, plus 
there are no additional messages needed to access the db during scheduling.
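
To make the retry-rate formula in point 5 and the message relation in point 7 
easy to check, here is a tiny helper. The 12360 figure below is derived from the 
3.0% rate quoted above, not a separately measured number.

def retry_rate(placement_query_count, requests_processed):
    return ((placement_query_count - requests_processed)
            / requests_processed * 100.0)


# e.g. 12000 requests processed with 12360 placement queries issued
# gives a 3.0% retry rate, matching the "spread" strategy figure above.
print(retry_rate(12360, 12000))            # 3.0


def messages_per_successful_query(number_of_schedulers):
    # Restates the relationship from point 7; the mail reports linear
    # growth in the number of scheduler processes only.
    return number_of_schedulers + 2


print(messages_per_successful_query(8))    # 10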


Second, here's what I've found in the centralized db claim design (i.e. rows 
where "do claim in compute?" = No):
1. The speed of legacy python filtering is not slow (see rows where "Filter 
strategy" = python): "Placement total query time" records the cost of all query 
time, including fetching states from the db and filtering using python. The actual 
cost of python filtering is (Placement_total_query_time - 
Placement_total_db_query_time), and that's only about 1/4 of the total cost or even 
less. It also means python in-memory filtering is much faster than db filtering 
in this experiment. See http://paste.openstack.org/show/488710/ 
2. The speed of the `db filter strategy` and the legacy `python filter strategy` 
are in the same order of magnitude, so not a huge improvement. See the 
comparison of the column "Placement total query time". Note that the extra cost of 
the `python filter strategy` mainly comes from "Placement total db query time" (i.e. 
fetching states from the db). See http://paste.openstack.org/show/488709/


Third, my major concern with the "centralized db claim" design is that it puts too much 
scheduling work into the centralized db, and it cannot be scaled simply by 
adding conductors and schedulers.
1. The filtering work is mostly done inside the db by executing complex SQL. If 
the filtering logic becomes much more complex (currently only CPU and RAM are 
accounted for in the experiment), the db overhead will be considerable.
2. The races between centralized claims are resolved by rolling back transactions 
and by checking the generations (see the design of the "compare and update" strategy 
in https://review.openstack.org/#/c/283253/); this also causes additional 
overhead on the db.
3. The db overhead of the filtering operation can be reduced by moving it to 
the schedulers; that will be 38 times faster and can be executed in parallel by 
the schedulers, according to the column "Placement avg query time". See 
http://paste.openstack.org/show/488715/
4. The "compare and update" overhead can be partially relaxed by using 
distributed resource claims in resource trackers. There is no need to roll back 
transactions when updating inventories of compute-local resources in order to be 
accurate. This is confirmed by checking the db records at the end of each run of 
the eventually consistent scheduler state design.
5. If a) all the filtering operations are done inside schedulers,
b) schedulers do not need to 

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-24 Thread Ed Leafe
On 02/24/2016 05:56 AM, Chris Dent wrote:

>> I'd actually also be interested if this has a potential to reduce the
>> demand on the message bus. I've been investigating this for a while,
>> and I
>> found that RabbitMQ will happily consume 5 high end CPU cores on a
>> single box
>> just to serve the needs of 1000 idle compute nodes.
> 
> What do we need to do, as a community, to start treating and
> thinking about this problem as a bug rather than something we have
> to deal with or work around? "Fear of messaging" is putting a big
> limitation on our possible solutions to a fair few problems (notably
> scheduling).

Some chatter is necessary to assure system health, but yeah, this should
be considered a bug. I'm sure that if a RabbitMQ expert were able to
observe this, the response would be "you're doing it wrong", and we
might fix this.

-- Ed Leafe




Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-24 Thread Cheng, Yingxin
Sorry for the late reply.

On 22 February 2016 at 18:45, John Garbutt wrote:
> On 21 February 2016 at 13:51, Cheng, Yingxin  wrote:
> > On 19 February 2016 at 5:58, John Garbutt wrote:
> >> On 17 February 2016 at 17:52, Clint Byrum  wrote:
> >> > Excerpts from Cheng, Yingxin's message of 2016-02-14 21:21:28 -0800:
> >> Long term, I see a world where there are multiple scheduler Nova is
> >> able to use, depending on the deployment scenario.
> >
> > Technically, what I've implemented is a new type of scheduler host
> > manager `shared_state_manager.SharedHostManager`[1] with the ability
> > to synchronize host states directly from resource trackers.
> 
> Thats fine. You just get to re-use more code.
> 
> Maybe I should say multiple scheduling strategies, or something like that.
> 
> >> So a big question for me is, does the new scheduler interface work if
> >> you look at slotting in your prototype scheduler?
> >>
> >> Specifically I am thinking about this interface:
> >> https://github.com/openstack/nova/blob/master/nova/scheduler/client/_
> >> _init__.py
> 
> I am still curious if this interface is OK for your needs?
> 

The added interfaces on the scheduler side are:
https://review.openstack.org/#/c/280047/2/nova/scheduler/client/__init__.py 
1. I can remove "notify_schedulers" because the same message can be sent 
through "send_commit" instead.
2. The "send_commit" interface is required because there should be a way to 
send state updates from a compute node to a specific scheduler.

The added/changed interfaces on the compute side are:
https://review.openstack.org/#/c/280047/2/nova/compute/rpcapi.py 
1. The "report_host_state" interface is required. When a scheduler comes up, it 
must ask the compute node for the latest host state. It is also required when the 
scheduler detects that its host state is out of sync and should ask the compute 
node for a synced state (this is rare, caused by network issues or bugs).
2. The new parameter "claim" should be added to the interface 
"build_and_run_instance" because the compute node should reply whether a scheduler 
claim is successful. The scheduler can thus track its claims and be updated by 
successful claims from other schedulers immediately. The compute node can thus 
decide whether a scheduler decision was made by a "shared-state" scheduler; 
that's the *tricky* part in supporting both types of schedulers.
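
For readers who haven't opened the reviews, a rough sketch of the shape of these 
interfaces (the parameter names and docstrings are guesses for illustration only; 
the authoritative signatures are in the linked patches):

class SchedulerClientInterface(object):
    def send_commit(self, context, scheduler_id, commit):
        # Push a state update (or claim result) from a compute node to one
        # specific scheduler, replacing the separate notify_schedulers call.
        raise NotImplementedError


class ComputeRPCInterface(object):
    def report_host_state(self, context, scheduler_id):
        # Return the latest host state, used when a scheduler starts up or
        # detects that its cached view of this node is out of sync.
        raise NotImplementedError

    def build_and_run_instance(self, context, instance, host, claim=None):
        # The new `claim` argument lets the compute node reply whether a
        # shared-state scheduler's claim succeeded, so the scheduler can
        # track its claims; claim=None keeps the legacy behaviour.
        raise NotImplementedError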

> Making this work across both types of scheduler might be tricky, but I think 
> it is
> worthwhile.
> 
> >> > This mostly agrees with recent tests I've been doing simulating
> >> > 1000 compute nodes with the fake virt driver.
> >>
> >> Overall this agrees with what I saw in production before moving us to
> >> the caching scheduler driver.
> >>
> >> I would love a nova functional test that does that test. It will help
> >> us compare these different schedulers and find the strengths and 
> >> weaknesses.
> >
> > I'm also working on implementing the functional tests of nova
> > scheduler, there is a patch showing my latest progress:
> > https://review.openstack.org/#/c/281825/
> >
> > IMO scheduler functional tests are not good at testing real
> > performance of different schedulers, because all of the services are
> > running as green threads instead of real processes. I think the better
> > way to analysis the real performance and the strengths and weaknesses
> > is to start services in different processes with fake virt driver(i.e.
> > Clint Byrum's work) or Jay Pipe's work in emulating different designs.
> 
> Having an option to run multiple process seems OK, if its needed.
> Although starting with a greenlet version that works in the gate seems best.
> 
> Lets try a few things, and see what predicts the results in real environments.

Sure.

> >> I am really interested how your prototype and the caching scheduler 
> >> compare?
> >> It looks like single node scheduler will perform in a very similar
> >> way, but multiple schedulers are less likely to race each other,
> >> although there are quite a few races?
> >
> > I think the major weakness of caching scheduler comes from its host
> > state update model, i.e. updating host states from db every `
> > CONF.scheduler_driver_task_period`
> > seconds.
> 
> The trade off is that consecutive scheduler decisions don't race each other, 
> at all.
> Say you have a burst of 1000 instance builds and you want to avoid build 
> failures
> (but accept sub optimal placement, and you are using fill first), thats a 
> very good
> trade off.
> 
> Consider a burst of 1000 deletes, it may take you 60 seconds to notice they 
> are
> all deleted and you have lots more free space, but that doesn't cause build
> failures like excessive races for the same resources will, at least under the 
> usual
> conditions where you are not yet totally full (i.e. non-HPC use cases).
> 
> I was shocked how well the caching_scheduler works in practice. I assumed it
> would be terrible, but when we tried it, it worked well.

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-24 Thread Chris Dent

On Tue, 23 Feb 2016, Clint Byrum wrote:


I'd actually also be interested if this has a potential to reduce the
demand on the message bus. I've been investigating this for a while, and I
found that RabbitMQ will happily consume 5 high end CPU cores on a single box
just to serve the needs of 1000 idle compute nodes.


What do we need to do, as a community, to start treating and
thinking about this problem as a bug rather than something we have
to deal with or work around? "Fear of messaging" is putting a big
limitation on our possible solutions to a fair few problems (notably
scheduling).

--
Chris Dent   http://anticdent.org/
freenode: cdent tw: @anticdent


Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-24 Thread Sylvain Bauza



Le 24/02/2016 01:10, Jay Pipes a écrit :

On 02/22/2016 04:23 AM, Sylvain Bauza wrote:

I won't argue against performance here. You made a very nice PoC for
testing scaling DB writes within a single python process and I trust
your findings. While I would be naturally preferring some shared-nothing
approach that can horizontally scale, one could mention that we can do
the same with Galera clusters.


a) My benchmarks aren't single process comparisons. They are 
multi-process benchmarks.


Sorry, I was unclear. I meant that I trust you on benchmarking the 
read/write performance of the models you proposed. The 'single process' 
comment was meant to say that you provide a python script for that, but sure, 
you can run it as a multi-process call.
Again, no doubt about your findings. TBH, I literally had no time to 
resurrect my testbed for that, but I'll play with your scripts as soon 
as I have time; it's one of my priorities.





b) The approach I've taken is indeed shared-nothing. The scheduler 
processes do not share any data whatsoever.


c) Galera isn't horizontally scalable. Never was, never will be. That 
isn't its strong-suit. Galera is best for having a 
synchronously-replicated database cluster that is incredibly easy to 
manage and administer, but it isn't a performance panacea. Its focus 
is on availability, not performance :)




Fair and valid point. I mentioned Galera as an example of how 
operators can try to horizontally scale the DBMS if they need to, versus 
the existing model of scaling the number of conductors if we keep 
resource ownership on the compute nodes.



That said, most of the operators run a controller/compute situation
where all the services but the compute node are hosted on 1:N hosts.
Implementing the resource-providers-scheduler BP (and only that one)
will dramatically increase the number of writes we do on the scheduler
process (ie. on the "controller" - quoting because there is no notion of
a "controller" in Nova, it's just a deployment choice).


Yup, no doubt about it. It won't increase the *total* number of writes 
the system makes, just the concentration of those writes into the 
scheduler processes. You are trading increased writes in the scheduler 
for the challenges inherent in keeping a large distributed cache 
system valid and fresh (which itself introduces a different kind of 
writes).




No, it will increase the number of writes, because you're planning to 
write every time a request comes in, right?
If so, it scales per request, compared to the existing model where 
computes write their data every 60 secs, so it scales per compute.


Since the number of requests is at least one order of magnitude higher 
than the number of computes, my gut feeling is that it'll write far more to the DB.




That's a big game changer for operators who are currently capping their
capacity by adding more conductors. It would require them to do some DB
modifications to be able to scale their capacity. I'm not against that,
I just say it's a big thing that we need to consider and properly
communicate if agreed.


Agreed completely. I will say, however, that on a 1600 compute node 
simulation (~60K variably-sized instances), an untuned stock MySQL 5.6 
database with 128MB InnoDB buffer pool size barely breaks a sweat on 
my local machine.




Nice, that makes me very comfortable with your approach from a 
performance PoV :-)

Great work btw.


> It can be alleviated by changing to a stand-alone high

 performance database.


It doesn't need to be high-performance at all. In my benchmarks, a
small-sized stock MySQL database server is able to fulfill thousands
of placement queries and claim transactions per minute using
completely isolated non-shared, non-caching scheduler processes.

> And the cache refreshing is designed to be

replaced by to direct SQL queries according to resource-provider
scheduler spec [2].


Yes, this is correct.

> The performance bottleneck of shared-state scheduler

may come from the overwhelming update messages, it can also be
alleviated by changing to stand-alone distributed message queue and by
using the “MessagePipe” to merge messages.


In terms of the number of messages used in each design, I see the
following relationship:

resource-providers < legacy < shared-state-scheduler

would you agree with that?


True. But that's manageable by adding more conductors, right? IMHO,
Nova performance is bound by the number of conductors you run, and I like
that, because it is easy to increase capacity that way.
Also, the payload could be far smaller than the existing one: instead of
sending a full update for a single compute_node entry, it would only
send the diff (plus some full syncs periodically). We would then mitigate
the increase in messages by making sure we're sending less per message.


No message sent is better than sending any message, regardless of 
whether that message contains an incremental update or a full object.


Well, it's a tautology :-)
I'm just 

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-24 Thread Cheng, Yingxin
Very sorry for the delay; it is hard for me to reply to all the concerns 
raised, since most of you have years more experience. I've tried hard to show 
that there is a direction that solves the issues of the existing filter scheduler in 
multiple areas, including performance, decision accuracy, multiple-scheduler 
support and race conditions. I'll also support any other solution if it can 
solve the same issues elegantly.


@Jay Pipes:
I feel that the scheduler team will agree with a design that can fulfill 
thousands of placement queries with thousands of nodes. But for a scheduler that 
will be split out to be used in wider areas, it's not simple to predict that 
requirement, so I don't agree with the statement "It doesn't need to be 
high-performance at all". No system exists without a performance bottleneck, 
including the resource-provider scheduler and the shared-state scheduler. I was 
trying to point out where the potential bottleneck is in each design and how to 
improve it if the worst happens, quote:
"The performance bottleneck of resource-provider and legacy scheduler is from 
the centralized db (REMOVED: and scheduler cache refreshing). It can be 
alleviated by changing to a stand-alone high performance database. And the 
cache refreshing is designed to be replaced by direct SQL queries according 
to resource-provider scheduler spec. The performance bottleneck of shared-state 
scheduler may come from the overwhelming update messages, it can also be 
alleviated by changing to stand-alone distributed message queue and by using 
the "MessagePipe" to merge messages."

I'm not saying that there is a bottleneck in the resource-provider scheduler for 
fulfilling the current design goal. The ability of the resource-provider scheduler is 
already proven by the nice modeling tool implemented by Jay; I trust it. But I 
care more about the actual limit of each design and how easily each can be 
extended to raise that limit. That's why I turned to making efforts on the 
scheduler functional test framework (https://review.openstack.org/#/c/281825/). 
I ultimately want to test scheduler functionality using greenthreads in the gate, 
and test the performance and placement of each design using real processes. And 
I hope the scheduler can stay open to both centralized and distributed designs.

I've updated my understanding of the three designs: 
https://docs.google.com/document/d/1iNUkmnfEH3bJHDSenmwE4A1Sgm3vnqa6oZJWtTumAkw/edit?usp=sharing
 
The "cache updates" arrows are changed to "resource updates" in the 
resource-provider scheduler, because I think resource updates from the virt driver 
still need to be written to the central database. I hope it's right this time.


@Sylvain Bauza:
As the first step towards the shared-state scheduler, the required changes are kept 
to a minimum. There are no db modifications needed, so there are no rolling-upgrade 
issues around data migration. The new scheduler can decide not to make decisions for 
old compute nodes, or it can refresh host states from the db and use the legacy way 
to make decisions until all the compute nodes are upgraded.


I have to admit that my prototype still lacks work to deal with overwhelming 
messages. This design works best using a distributed message queue. Also, if 
we try to start multiple scheduler processes/workers on a single host, there 
is a lot more to be done to reduce update messages between compute nodes and 
scheduler workers. But I see the potential of distributed resource 
management/scheduling and would like to keep making efforts in this direction.

If we agree that final decision accuracy is guaranteed in both 
designs, we should care more about the final decision throughput of each. 
Theoretically the shared-state design is better because the final consumptions are 
made in a distributed fashion, but there are difficulties in reaching that limit. 
The centralized design, however, can approach its theoretical performance more 
easily because of the lightweight implementation inside the scheduler and the 
powerful underlying database.



Regards,
-Yingxin


> -Original Message-
> From: Jay Pipes [mailto:jaypi...@gmail.com]
> Sent: Wednesday, February 24, 2016 8:11 AM
> To: Sylvain Bauza <sba...@redhat.com>; OpenStack Development Mailing List
> (not for usage questions) <openstack-dev@lists.openstack.org>; Cheng, Yingxin
> <yingxin.ch...@intel.com>
> Subject: Re: [openstack-dev] [nova] A prototype implementation towards the
> "shared state scheduler"
> 
> On 02/22/2016 04:23 AM, Sylvain Bauza wrote:
> > I won't argue against performance here. You made a very nice PoC for
> > testing scaling DB writes within a single python process and I trust
> > your findings. While I would be naturally preferring some
> > shared-nothing approach that can horizontally scale, one could mention
> > that we can do the same with Galera clusters.
> 
> a

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-23 Thread Clint Byrum
Excerpts from Jay Pipes's message of 2016-02-23 16:10:46 -0800:
> On 02/22/2016 04:23 AM, Sylvain Bauza wrote:
> > I won't argue against performance here. You made a very nice PoC for
> > testing scaling DB writes within a single python process and I trust
> > your findings. While I would be naturally preferring some shared-nothing
> > approach that can horizontally scale, one could mention that we can do
> > the same with Galera clusters.
> 
> a) My benchmarks aren't single process comparisons. They are 
> multi-process benchmarks.
> 
> b) The approach I've taken is indeed shared-nothing. The scheduler 
> processes do not share any data whatsoever.
> 

I think this is a matter of perspective. What I read from Sylvain's
message was that the approach you've taken shares state in a database,
and shares access to all compute nodes.

I also read into Sylvain's comments that what he was referring to was
a system where the compute nodes divide up the resources and never share
anything at all.

> c) Galera isn't horizontally scalable. Never was, never will be. That 
> isn't its strong-suit. Galera is best for having a 
> synchronously-replicated database cluster that is incredibly easy to 
> manage and administer but it isn't a performance panacea. It's focus is 
> on availability not performance :)
> 

I also think this is a matter of perspective. Galera is actually
fantastically horizontally scalable in any situation where you have a
very high ratio of reads to writes with a need for consistent reads.

However, for OpenStack's needs, we are typically pretty low on that ratio.

> > That said, most of the operators run a controller/compute situation
> > where all the services but the compute node are hosted on 1:N hosts.
> > Implementing the resource-providers-scheduler BP (and only that one)
> > will dramatically increase the number of writes we do on the scheduler
> > process (ie. on the "controller" - quoting because there is no notion of
> > a "controller" in Nova, it's just a deployment choice).
> 
> Yup, no doubt about it. It won't increase the *total* number of writes 
> the system makes, just the concentration of those writes into the 
> scheduler processes. You are trading increased writes in the scheduler 
> for the challenges inherent in keeping a large distributed cache system 
> valid and fresh (which itself introduces a different kind of writes).
> 

Funny enough, I think of Galera as a large distributed cache that is
always kept valid and fresh. The challenges of doing this for a _busy_
cache are not unique to Galera.

> > That's a big game changer for operators who are currently capping their
> > capacity by adding more conductors. It would require them to do some DB
> > modifications to be able to scale their capacity. I'm not against that,
> > I just say it's a big thing that we need to consider and properly
> > communicate if agreed.
> 
> Agreed completely. I will say, however, that on a 1600 compute node 
> simulation (~60K variably-sized instances), an untuned stock MySQL 5.6 
> database with 128MB InnoDB buffer pool size barely breaks a sweat on my 
> local machine.
> 

That agrees with what I've seen as well. We're talking about tables of
integers for the most part, so your least expensive SSD's can keep up
with this load for many many thousands of computes.

I'd actually also be interested if this has a potential to reduce the
demand on the message bus. I've been investigating this for a while, and I
found that RabbitMQ will happily consume 5 high end CPU cores on a single box
just to serve the needs of 1000 idle compute nodes. I am sorry that I
haven't read enough of the details in your proposal, but doesn't this
mean there'd be quite a bit less load on the MQ if the only time
messages are happening is for direct RPC dispatches and error reporting?

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-23 Thread Jay Pipes

On 02/22/2016 04:23 AM, Sylvain Bauza wrote:

I won't argue against performance here. You made a very nice PoC for
testing scaling DB writes within a single python process and I trust
your findings. While I would be naturally preferring some shared-nothing
approach that can horizontally scale, one could mention that we can do
the same with Galera clusters.


a) My benchmarks aren't single process comparisons. They are 
multi-process benchmarks.


b) The approach I've taken is indeed shared-nothing. The scheduler 
processes do not share any data whatsoever.


c) Galera isn't horizontally scalable. Never was, never will be. That 
isn't its strong-suit. Galera is best for having a 
synchronously-replicated database cluster that is incredibly easy to 
manage and administer but it isn't a performance panacea. Its focus is 
on availability not performance :)



That said, most of the operators run a controller/compute situation
where all the services but the compute node are hosted on 1:N hosts.
Implementing the resource-providers-scheduler BP (and only that one)
will dramatically increase the number of writes we do on the scheduler
process (ie. on the "controller" - quoting because there is no notion of
a "controller" in Nova, it's just a deployment choice).


Yup, no doubt about it. It won't increase the *total* number of writes 
the system makes, just the concentration of those writes into the 
scheduler processes. You are trading increased writes in the scheduler 
for the challenges inherent in keeping a large distributed cache system 
valid and fresh (which itself introduces a different kind of writes).



That's a big game changer for operators who are currently capping their
capacity by adding more conductors. It would require them to do some DB
modifications to be able to scale their capacity. I'm not against that,
I just say it's a big thing that we need to consider and properly
communicate if agreed.


Agreed completely. I will say, however, that on a 1600 compute node 
simulation (~60K variably-sized instances), an untuned stock MySQL 5.6 
database with 128MB InnoDB buffer pool size barely breaks a sweat on my 
local machine.



> It can be alleviated by changing to a stand-alone high performance
> database.


It doesn't need to be high-performance at all. In my benchmarks, a
small-sized stock MySQL database server is able to fulfill thousands
of placement queries and claim transactions per minute using
completely isolated non-shared, non-caching scheduler processes.

> And the cache refreshing is designed to be replaced by direct SQL
> queries according to the resource-provider scheduler spec [2].


Yes, this is correct.

> The performance bottleneck of the shared-state scheduler may come from
> the overwhelming update messages; it can also be alleviated by changing
> to a stand-alone distributed message queue and by using the
> “MessagePipe” to merge messages.


In terms of the number of messages used in each design, I see the
following relationship:

resource-providers < legacy < shared-state-scheduler

would you agree with that?


True. But that's manageable by adding more conductors, right? IMHO,
Nova performance is bound by the number of conductors you run, and I like
that - because it makes it easy to increase capacity.
Also, the payload could be far smaller than the existing one: instead of
sending a full update for a single compute_node entry, it would only
send the diff (+ some full syncs periodically). We would then mitigate
the increase in messages by making sure we're sending less per message.
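
To make the diff idea concrete, here is a minimal sketch (hypothetical field
names, not the prototype's actual message format) of an incremental host-state
update that carries only what changed, plus a sequence number so periodic full
syncs can resynchronize:

    def make_update(prev, curr, seq):
        """Return only the fields that changed, tagged with a sequence number."""
        diff = {k: v for k, v in curr.items() if prev.get(k) != v}
        return {'seq': seq, 'diff': diff} if diff else None

    prev = {'free_ram_mb': 4096, 'free_disk_gb': 80, 'vcpus_used': 2}
    curr = {'free_ram_mb': 2048, 'free_disk_gb': 80, 'vcpus_used': 3}
    print(make_update(prev, curr, seq=17))
    # -> {'seq': 17, 'diff': {'free_ram_mb': 2048, 'vcpus_used': 3}}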


No message sent is better than sending any message, regardless of 
whether that message contains an incremental update or a full object.



The resource-providers proposal actually uses no update messages at
all (except in the abnormal case of a compute node failing to start
the resources that had previously been claimed by the scheduler). All
updates are done in a single database transaction when the claim is made.
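
For illustration only, a hedged sketch of what a claim made entirely inside one
scheduler-side database transaction could look like (hypothetical table and
column names, plain SQLAlchemy, not the actual resource-providers code):

    from sqlalchemy import create_engine, text

    engine = create_engine("sqlite:///:memory:")  # stand-in for the MySQL server

    def claim(conn, node_id, ram_mb):
        # The UPDATE only matches if enough capacity is still free, so two
        # schedulers racing for the same capacity cannot both succeed.
        result = conn.execute(
            text("UPDATE inventories SET ram_used = ram_used + :ram "
                 "WHERE node_id = :node AND ram_used + :ram <= ram_total"),
            {"ram": ram_mb, "node": node_id})
        return result.rowcount == 1

    with engine.begin() as conn:  # one transaction: check, claim, record
        conn.execute(text("CREATE TABLE inventories "
                          "(node_id INT, ram_total INT, ram_used INT)"))
        conn.execute(text("INSERT INTO inventories VALUES (1, 8192, 0)"))
        if claim(conn, node_id=1, ram_mb=2048):
            print("claimed 2048 MB on node 1 inside the same transaction")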


See, I don't think that a compute node unable to start a request is an
'abnormal case'. There are many reasons why a request can't be honored
by the compute node:
  - for example, the scheduler doesn't own all the compute resources and
thus can miss some information: for example, say that you want to pin a
specific pCPU and this pCPU is already assigned. The scheduler doesn't
know *which* pCPUs are free, it only knows *how many* are free.
That atomic transaction (pick a free pCPU and assign it to the instance)
is made on the compute manager, not at the exact same time we're
decreasing resource usage for pCPUs (because that would be done in the
scheduler process).


See my response to Chris Friesen about the above.


  - some "reserved" RAM or disk could be underestimated and
consequently, spawning a VM could be either taking fare more time than
planned (which would mean that it would be a suboptimal placement) or it
would fail which would issue a reschedule.


Again, the above is an abnormal case.





Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-23 Thread Jay Pipes

On 02/23/2016 06:03 PM, Chris Friesen wrote:

On 02/21/2016 01:56 PM, Jay Pipes wrote:

Yingxin, sorry for the delay in responding to this thread. My comments
inline.

On 02/17/2016 12:45 AM, Cheng, Yingxin wrote:

To better illustrate the differences between shared-state,
resource-provider and legacy scheduler, I’ve drawn 3 simplified pictures
[1] emphasizing the location of resource view, the location of claim
and resource consumption, and the resource update/refresh pattern in
three kinds of schedulers. Hoping I’m correct in the “resource-provider
scheduler” part.



2) Claims of resource amounts are done in a database transaction
atomically
within each scheduler process. Therefore there are no "cache updates"
arrows
going back from compute nodes to the resource-provider DB. The only
time a
compute node would communicate with the resource-provider DB (and thus
the
scheduler at all) would be in the case of a *failed* attempt to
initialize
already-claimed resources.


Can you point me to the BP/spec that talks about this?  Where in the
code would we update the DB to reflect newly-freed resources?


I should have been more clear, sorry. I am referring only to the process 
of claiming resources and the messages involved in cache updates for 
those claims. I'm not referring to freeing resources (i.e. an instance 
termination). In those cases, there would still need to be a message 
sent to inform the scheduler that the resources had been freed. Nothing 
would change in that regard.


For information, the blueprint where we are discussing moving claims to 
the scheduler (and away from the compute nodes) is here:


https://review.openstack.org/#/c/271823/

I'm in the process of splitting the above blueprint into two. One will 
be for the proposed moving of the filters from the scheduler Python 
process to instead by filters on the database query for compute nodes. 
Another blueprint will be for the "move the claims to the scheduler" stuff.


Best,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-23 Thread Jay Pipes

On 02/23/2016 06:12 PM, Chris Friesen wrote:

On 02/22/2016 03:23 AM, Sylvain Bauza wrote:

See, I don't think that a compute node unable to start a request is an
'abnormal
case'. There are many reasons why a request can't be honored by the
compute node :
  - for example, the scheduler doesn't own all the compute resources
and thus
can miss some information : for example, say that you want to pin a
specific
pCPU and this pCPU is already assigned. The scheduler doesn't know
*which* pCPUs
are free, it only knows *how many* are free
That atomic transaction (pick a free pCPU and assign it to the
instance) is made
on the compute manager not at the exact same time we're decreasing
resource
usage for pCPUs (because it would be done in the scheduler process).



I'm pretty sure that the existing NUMATopologyFilter is aware of which
pCPUs/Hugepages are free on each host NUMA node.


It's aware of which pCPUs and hugepages are free on each host NUMA node 
at the time of scheduling, but it doesn't actually *claim* those 
resources in the scheduler. This means that by the time the launch 
request gets to the host, another request for the same NUMA topology may 
have consumed the NUMA cell topology.
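
As a toy illustration of that window (nothing Nova-specific, just the shape of
the race): both requests decide on the same pCPU because neither claims it at
decision time.

    host_view = {'free_pcpus': {3, 4}}

    def decide(view):
        # scheduler-side decision: looks at the free set but claims nothing
        return min(view['free_pcpus'])

    first = decide(host_view)    # -> 3
    second = decide(host_view)   # -> 3 again: same answer, future conflict

    def claim_on_compute(view, pcpu):
        # compute-side claim, where the conflict finally surfaces
        if pcpu not in view['free_pcpus']:
            raise RuntimeError("pCPU %d already taken, reschedule" % pcpu)
        view['free_pcpus'].remove(pcpu)

    claim_on_compute(host_view, first)        # succeeds
    try:
        claim_on_compute(host_view, second)   # fails: second request must retry
    except RuntimeError as exc:
        print(exc)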


I think that's what Sylvain is referring to above.

I'd like to point out, though, that the placement of a requested NUMA 
cell topology onto an available host NUMA cell or cells *is the claim* 
of those NUMA resources. And it is the claim -- i.e. the placement of 
the requested instance NUMA topology onto the host topology -- that I 
wish to make in the scheduler.


So, Sylvain, I am indeed talking about only the 'abnormal' cases.

Best,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-23 Thread Chris Friesen

On 02/22/2016 03:23 AM, Sylvain Bauza wrote:


See, I don't think that a compute node unable to start a request is an 'abnormal
case'. There are many reasons why a request can't be honored by the compute 
node :
  - for example, the scheduler doesn't own all the compute resources and thus
can miss some information : for example, say that you want to pin a specific
pCPU and this pCPU is already assigned. The scheduler doesn't know *which* pCPUs
are free, it only knows *how many* are free.
That atomic transaction (pick a free pCPU and assign it to the instance) is made
on the compute manager not at the exact same time we're decreasing resource
usage for pCPUs (because it would be done in the scheduler process).



I'm pretty sure that the existing NUMATopologyFilter is aware of which 
pCPUs/Hugepages are free on each host NUMA node.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-23 Thread Chris Friesen

On 02/21/2016 01:56 PM, Jay Pipes wrote:

Yingxin, sorry for the delay in responding to this thread. My comments inline.

On 02/17/2016 12:45 AM, Cheng, Yingxin wrote:

To better illustrate the differences between shared-state,
resource-provider and legacy scheduler, I’ve drawn 3 simplified pictures
[1] emphasizing the location of resource view, the location of claim
and resource consumption, and the resource update/refresh pattern in
three kinds of schedulers. Hoping I’m correct in the “resource-provider
scheduler” part.



2) Claims of resource amounts are done in a database transaction atomically
within each scheduler process. Therefore there are no "cache updates" arrows
going back from compute nodes to the resource-provider DB. The only time a
compute node would communicate with the resource-provider DB (and thus the
scheduler at all) would be in the case of a *failed* attempt to initialize
already-claimed resources.


Can you point me to the BP/spec that talks about this?  Where in the code would 
we update the DB to reflect newly-freed resources?



Thanks,
Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-22 Thread John Garbutt
On 21 February 2016 at 13:51, Cheng, Yingxin  wrote:
> On 19 February 2016 at 5:58, John Garbutt wrote:
>> On 17 February 2016 at 17:52, Clint Byrum  wrote:
>> > Excerpts from Cheng, Yingxin's message of 2016-02-14 21:21:28 -0800:
>> Long term, I see a world where there are multiple schedulers Nova is able
>> to use, depending on the deployment scenario.
>
> Technically, what I've implemented is a new type of scheduler host manager
> `shared_state_manager.SharedHostManager`[1] with the ability to synchronize 
> host
> states directly from resource trackers.

Thats fine. You just get to re-use more code.

Maybe I should say multiple scheduling strategies, or something like that.

>> So a big question for me is, does the new scheduler interface work if you 
>> look at
>> slotting in your prototype scheduler?
>>
>> Specifically I am thinking about this interface:
>> https://github.com/openstack/nova/blob/master/nova/scheduler/client/__init_
>> _.py

I am still curious if this interface is OK for your needs?

Making this work across both types of scheduler might be tricky, but I
think it is worthwhile.

>> > This mostly agrees with recent tests I've been doing simulating 1000
>> > compute nodes with the fake virt driver.
>>
>> Overall this agrees with what I saw in production before moving us to the
>> caching scheduler driver.
>>
>> I would love a nova functional test that does that test. It will help us 
>> compare
>> these different schedulers and find the strengths and weaknesses.
>
> I'm also working on implementing the functional tests of nova scheduler, there
> is a patch showing my latest progress: 
> https://review.openstack.org/#/c/281825/
>
> IMO scheduler functional tests are not good at testing the real performance of
> different schedulers, because all of the services are running as green threads
> instead of real processes. I think the better way to analyze the real
> performance and the strengths and weaknesses is to start services in different
> processes with the fake virt driver (i.e. Clint Byrum's work) or Jay Pipes's
> work in emulating different designs.

Having an option to run multiple processes seems OK, if it's needed.
Although starting with a greenlet version that works in the gate seems best.

Lets try a few things, and see what predicts the results in real environments.

>> I am really interested how your prototype and the caching scheduler compare?
>> It looks like single node scheduler will perform in a very similar way, but 
>> multiple
>> schedulers are less likely to race each other, although there are quite a few
>> races?
>
> I think the major weakness of the caching scheduler comes from its host
> state update model, i.e. updating host states from the db every
> `CONF.scheduler_driver_task_period` seconds.
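
For readers less familiar with that model, a minimal sketch of the refresh
pattern (hypothetical code, not the CachingScheduler itself; the interval
stands in for CONF.scheduler_driver_task_period):

    import threading
    import time

    class TinyCachingManager:
        """Hedged sketch of the caching model, not the CachingScheduler code."""

        def __init__(self, fetch_from_db, interval=60):
            self._fetch = fetch_from_db
            self.host_states = fetch_from_db()          # initial fill from the DB
            t = threading.Thread(target=self._refresh, args=(interval,), daemon=True)
            t.start()

        def _refresh(self, interval):
            while True:                                 # periodic full reload;
                time.sleep(interval)                    # decisions in between use
                self.host_states = self._fetch()        # data up to `interval` old

        def select_host(self, ram_needed):
            for host, free_ram in self.host_states.items():
                if free_ram >= ram_needed:
                    return host                         # no DB access on this path
            return None

    mgr = TinyCachingManager(lambda: {'node1': 4096, 'node2': 1024}, interval=60)
    print(mgr.select_host(ram_needed=2048))             # -> 'node1'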

The trade-off is that consecutive scheduler decisions don't race each
other, at all. Say you have a burst of 1000 instance builds and you
want to avoid build failures (but accept sub-optimal placement, and
you are using fill-first); that's a very good trade-off.

Consider a burst of 1000 deletes: it may take you 60 seconds to notice
they are all deleted and you have lots more free space, but that
doesn't cause build failures the way excessive races for the same
resources will, at least under the usual conditions where you are not
yet totally full (i.e. non-HPC use cases).

I was shocked how well the caching_scheduler works in practice. I
assumed it would be terrible, but when we tried it, it worked well.
It's a million miles from perfect, but handy for many deployment
scenarios.

Thanks,
johnthetubaguy

PS
If you need a 1000 node test cluster to play with, its worth applying
to use this one:
http://go.rackspace.com/developercloud
I am happy to recommend these efforts gets some time with that hardware.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-22 Thread Sylvain Bauza
 so we need 
to support old compute nodes.



As can be seen in the illustrations [1], the main compatibility issue
between shared-state and resource-provider scheduler is caused by the
different location of claim/consumption and the assumed consistent
resource view. IMO unless the claims are allowed to happen in both
places(resource tracker and resource-provider db), it seems difficult to
make shared-state and resource-provider scheduler work together.


Yes, I don't see the two approaches being particularly compatible for 
the reason you state above.


That said, what we've discussed is having a totally new scheduler 
RESTful API that would do the claims in the scheduler 
(claim_resources()) and leave the existing select_destinations() call 
as-is to allow some deprecation and fallback if everything goes 
terribly, terribly wrong.
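
Sketching what such a two-call surface might look like (hypothetical
signatures, not the actual scheduler client interface; the real shape is being
worked out in the blueprint):

    import abc

    class SchedulerAPI(abc.ABC):
        @abc.abstractmethod
        def select_destinations(self, context, request_spec):
            """Existing call: return candidate hosts, claim nothing."""

        @abc.abstractmethod
        def claim_resources(self, context, request_spec):
            """Proposed call: pick hosts *and* atomically claim the requested
            resources against them, returning the claims that were made."""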


Best,
-jay


[1]
https://docs.google.com/document/d/1iNUkmnfEH3bJHDSenmwE4A1Sgm3vnqa6oZJWtTumAkw/edit?usp=sharing 



[2] https://review.openstack.org/#/c/271823/

Regards,

-Yingxin

*From:*Sylvain Bauza [mailto:sba...@redhat.com]
*Sent:* Monday, February 15, 2016 9:48 PM
*To:* OpenStack Development Mailing List (not for usage questions)
<openstack-dev@lists.openstack.org>
*Subject:* Re: [openstack-dev] [nova] A prototype implementation towards
the "shared state scheduler"

Le 15/02/2016 10:48, Cheng, Yingxin a écrit :

Thanks Sylvain,

1. The below ideas will be extended to a spec ASAP.


Nice, looking forward to it then :-)

2. Thanks for providing concerns I’ve not thought it yet, they will
be in the spec soon.

3. Let me copy my thoughts from another thread about the integration
with resource-provider:

The idea is about “Only compute node knows its own final
compute-node resource view” or “The accurate resource view only
exists at the place where it is actually consumed.” I.e., The
incremental updates can only come from the actual “consumption”
action, no matter where it is(e.g. compute node, storage service,
network service, etc.). Borrow the terms from resource-provider,
compute nodes can maintain its accurate version of
“compute-node-inventory” cache, and can send incremental updates
because it actually consumes compute resources, furthermore, storage
service can also maintain an accurate version of “storage-inventory”
cache and send incremental updates if it also consumes storage
resources. If there are central services in charge of consuming all
the resources, the accurate cache and updates must come from them.


That is one of the things I'd like to see in your spec, and how you
could interact with the new model.
Thanks,
-Sylvain



Regards,

-Yingxin

*From:*Sylvain Bauza [mailto:sba...@redhat.com]
*Sent:* Monday, February 15, 2016 5:28 PM
*To:* OpenStack Development Mailing List (not for usage questions)
    <openstack-dev@lists.openstack.org>
<mailto:openstack-dev@lists.openstack.org>
*Subject:* Re: [openstack-dev] [nova] A prototype implementation
towards the "shared state scheduler"

Le 15/02/2016 06:21, Cheng, Yingxin a écrit :

Hi,

I’ve uploaded a prototype
https://review.openstack.org/#/c/280047/ to testify its design
goals in accuracy, performance, reliability and compatibility
improvements. It will also be an Austin Summit Session if
elected:
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316


I want to gather opinions about this idea:

1. Is this feature possible to be accepted in the Newton 
release?



Such feature requires a spec file to be written
http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged

Ideally, I'd like to see your below ideas written in that spec file
so it would be the best way to discuss on the design.




2. Suggestions to improve its design and compatibility.


I don't want to go into details here (that's rather the goal of the
spec for that), but my biggest concerns would be when reviewing the
spec :
  - how this can meet the OpenStack mission statement (ie.
ubiquitous solution that would be easy to install and massively
scalable)
  - how this can be integrated with the existing (filters, weighers)
to provide a clean and simple path for operators to upgrade
  - how this can be supporting rolling upgrades (old computes
sending updates to new scheduler)
  - how can we test it
  - can we have the feature optional for operators




3. Possibilities to integrate with resource-provider bp series:
I know resource-provider is the major direction of Nova
scheduler, and there will be fundamental changes in the future,
especially according to the bp
https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
However, this prototype pro

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-21 Thread Chris Dent

On Sun, 21 Feb 2016, Jay Pipes wrote:

I don't see how the shared-state scheduler is getting the most accurate 
resource view. It is only in extreme circumstances that the resource-provider 
scheduler's view of the resources in a system (all of which is stored without 
caching in the database) would differ from the "actual" inventory on a 
compute node.


I'm pretty sure this ¶ is central to the whole discussion. It's a
question of where the final truth lies and what that positioning allows
and forbids. In resource-providers, the truth, or at least the truth
that is acted upon, is in the database. In shared-state, the scheduler
mirrors the resources. People have biases about that sort of stuff.

Generalizing quite a bit:

All that mirroring costs quite a bit in communication terms and can go
funky if the communication goes awry. But it does mean that the compute
nodes are authoritative about themselves and have the possibility of
using/claiming/placing resources that are not under control of the
scheduler (or even nova in general).

Centralizing things in the DB cuts way back on messaging and appears to
provide both a computationally and conceptually efficient way of
calculating placement, but it does so at the cost of the compute nodes
having less flexibility about managing their own resources, unless we want
the failure mode you describe elsewhere to be more common than you
implied.

I heard somewhere, but this may be wrong or out of date, that one of the
constraints with compute-nodes is that it should be possible to spawn
VMs on them that are not managed by nova. If, in the full-blown
version of the resource-provider-based scheduler, we are not sending
resource usage updates on compute-node state changes to the
scheduler db and only on failure, the retry rate goes up in a
heterogeneous environment. That could well be fine, a price you pay,
but I wonder if it is a concern?

I could get into some noodling here about the artifact world versus
the real world, but that's probably belaboring the point. I'm not
trying to diss or support either approach, just flesh out some of
the gaps in at least my understanding.


b) Simplicity

Goes to the above point about debuggability, but I've always tried to follow 
the mantra that the best software design is not when you've added the last 
piece to it, but rather when you've removed the last piece from it and still 
have a functioning and performant system. Having a scheduler that can tackle 
the process of tracking resources, deciding on placement, and claiming those 
resources instead of playing an intricate dance of keeping state caches valid 
will, IMHO, lead to a better scheduler.


I think it is moving in the right direction. Removing the dance of
keeping state caches valid will be a big improvement.

Better still would be removing the duplication and persistence of
information that already exists on the compute nodes. That would be
really cool, but doesn't yet seem possible with the way we do messaging
nor with the way we track shared resources (resource-pools ought to
help).

--
Chris Dent   (╯°□°)╯︵┻━┻http://anticdent.org/
freenode: cdent tw: @anticdent
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-21 Thread Jay Pipes
e) would differ from the 
"actual" inventory on a compute node.



4. Design goal difference:

The fundamental design goal of the two new schedulers is different. Copy
my views from [2], I think it is the choice between “the loose
distributed consistency with retries” and “the strict centralized
consistency with locks”.


There are a couple other things that I believe we should be documenting, 
considering and measuring with regards to scheduler designs:


a) Debuggability

The ability of a system to be debugged and for requests to that system 
to be diagnosed is a critical component to the benefit of a particular 
system design. I'm hoping that by removing a lot of the moving parts 
from the legacy filter scheduler design (removing the caching, removing 
the Python-side filtering and weighing, removing the interval between 
which placement decisions can conflict, removing the cost and frequency 
of retry operations) that the resource-provider scheduler design will 
become simpler for operators to use.


b) Simplicity

Goes to the above point about debuggability, but I've always tried to 
follow the mantra that the best software design is not when you've added 
the last piece to it, but rather when you've removed the last piece from 
it and still have a functioning and performant system. Having a 
scheduler that can tackle the process of tracking resources, deciding on 
placement, and claiming those resources instead of playing an intricate 
dance of keeping state caches valid will, IMHO, lead to a better scheduler.



As can be seen in the illustrations [1], the main compatibility issue
between shared-state and resource-provider scheduler is caused by the
different location of claim/consumption and the assumed consistent
resource view. IMO unless the claims are allowed to happen in both
places(resource tracker and resource-provider db), it seems difficult to
make shared-state and resource-provider scheduler work together.


Yes, I don't see the two approaches being particularly compatible for 
the reason you state above.


That said, what we've discussed is having a totally new scheduler 
RESTful API that would do the claims in the scheduler 
(claim_resources()) and leave the existing select_destinations() call 
as-is to allow some deprecation and fallback if everything goes 
terribly, terribly wrong.


Best,
-jay


[1]
https://docs.google.com/document/d/1iNUkmnfEH3bJHDSenmwE4A1Sgm3vnqa6oZJWtTumAkw/edit?usp=sharing

[2] https://review.openstack.org/#/c/271823/

Regards,

-Yingxin

*From:*Sylvain Bauza [mailto:sba...@redhat.com]
*Sent:* Monday, February 15, 2016 9:48 PM
*To:* OpenStack Development Mailing List (not for usage questions)
<openstack-dev@lists.openstack.org>
*Subject:* Re: [openstack-dev] [nova] A prototype implementation towards
the "shared state scheduler"

Le 15/02/2016 10:48, Cheng, Yingxin a écrit :

Thanks Sylvain,

1. The below ideas will be extended to a spec ASAP.


Nice, looking forward to it then :-)

2. Thanks for providing concerns I’ve not thought it yet, they will
be in the spec soon.

3. Let me copy my thoughts from another thread about the integration
with resource-provider:

The idea is about “Only compute node knows its own final
compute-node resource view” or “The accurate resource view only
exists at the place where it is actually consumed.” I.e., The
incremental updates can only come from the actual “consumption”
action, no matter where it is(e.g. compute node, storage service,
network service, etc.). Borrow the terms from resource-provider,
compute nodes can maintain its accurate version of
“compute-node-inventory” cache, and can send incremental updates
because it actually consumes compute resources, furthermore, storage
service can also maintain an accurate version of “storage-inventory”
cache and send incremental updates if it also consumes storage
resources. If there are central services in charge of consuming all
the resources, the accurate cache and updates must come from them.


That is one of the things I'd like to see in your spec, and how you
could interact with the new model.
Thanks,
-Sylvain



Regards,

-Yingxin

*From:*Sylvain Bauza [mailto:sba...@redhat.com]
*Sent:* Monday, February 15, 2016 5:28 PM
*To:* OpenStack Development Mailing List (not for usage questions)
<openstack-dev@lists.openstack.org>
    <mailto:openstack-dev@lists.openstack.org>
*Subject:* Re: [openstack-dev] [nova] A prototype implementation
towards the "shared state scheduler"

Le 15/02/2016 06:21, Cheng, Yingxin a écrit :

Hi,

I’ve uploaded a prototype
https://review.openstack.org/#/c/280047/ to testify its design
goals in accuracy, performance, reliability and compatibility
improvements. It will also be an Austin Summit Session if
elected:

https://www.openstack.org/summ

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-21 Thread Cheng, Yingxin
On 19 February 2016 at 5:58, John Garbutt wrote:
> On 17 February 2016 at 17:52, Clint Byrum  wrote:
> > Excerpts from Cheng, Yingxin's message of 2016-02-14 21:21:28 -0800:
> >> Hi,
> >>
> >> I've uploaded a prototype https://review.openstack.org/#/c/280047/ to
> >> testify its design goals in accuracy, performance, reliability and
> >> compatibility improvements. It will also be an Austin Summit Session
> >> if elected:
> >> https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presen
> >> tation/7316
> 
> Long term, I see a world where there are multiple schedulers Nova is able to
> use, depending on the deployment scenario.
> 
> We have tried to stop any more schedulers going in-tree (like the solver
> scheduler) while we get the interface between the nova-scheduler and the rest
> of Nova straightened out, to make that much easier.

Technically, what I've implemented is a new type of scheduler host manager,
`shared_state_manager.SharedHostManager`[1], with the ability to synchronize host
states directly from resource trackers. The filter scheduler driver can choose to
load this manager from stevedore[2], and thus get a different update model for its
internal caches. This new manager is highly compatible with the current scheduler
architecture, because a filter scheduler using HostManager can even run alongside
schedulers that have loaded SharedHostManager at the same time (tested).

So why not have this in-tree to give operators more options in choosing host
managers? I'm also of the opinion that the caching scheduler is not exactly a new
kind of scheduler driver; it only has a different behavior in updating host states,
and it should be implemented as a new kind of host manager instead.

What concerns me is that the resource-provider scheduler is going to change the
architecture of the filter scheduler in Jay Pipes's bp[3]. There will be no host
manager, and not even host state caches, in the future. So what I've done to keep
compatibility will become an incompatibility in the future.

[1] https://review.openstack.org/#/c/280047/2/nova/scheduler/shared_state_manager.py L55
[2] https://review.openstack.org/#/c/280047/2/setup.cfg L194
[3] https://review.openstack.org/#/c/271823
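
To make the stevedore loading mentioned above concrete, a hedged sketch of how
a pluggable host manager could be registered and loaded (the entry-point
namespace and names here are assumptions; the real wiring is in [1] and [2]):

    # The plugin would be declared in setup.cfg, roughly:
    #   [entry_points]
    #   nova.scheduler.host_manager =
    #       host_manager = nova.scheduler.host_manager:HostManager
    #       shared_host_manager = nova.scheduler.shared_state_manager:SharedHostManager
    from stevedore import driver

    def load_host_manager(name):
        mgr = driver.DriverManager(
            namespace='nova.scheduler.host_manager',   # assumed namespace
            name=name,                                 # e.g. 'shared_host_manager'
            invoke_on_load=True)
        return mgr.driver

    # host_manager = load_host_manager('shared_host_manager')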

> 
> So a big question for me is, does the new scheduler interface work if you 
> look at
> slotting in your prototype scheduler?
> 
> Specifically I am thinking about this interface:
> https://github.com/openstack/nova/blob/master/nova/scheduler/client/__init_
> _.py


> There are several problems in this update model, proven in experiments[3]:
> >> 1. Performance: The scheduler performance is largely affected by db access
> in retrieving compute node records. The db block time of a single request is
> 355ms in average in the deployment of 3 compute nodes, compared with only
> 3ms in in-memory decision-making. Imagine there could be at most 1k nodes,
> even 10k nodes in the future.
> >> 2. Race conditions: This is not only a parallel-scheduler problem,
> >> but also a problem using only one scheduler. The detailed analysis of one-
> scheduler-problem is located in bug analysis[2]. In short, there is a gap 
> between
> the scheduler makes a decision in host state cache and the compute node
> updates its in-db resource record according to that decision in resource 
> tracker.
> A recent scheduler resource consumption in cache can be lost and overwritten
> by compute node data because of it, result in cache inconsistency and
> unexpected retries. In a one-scheduler experiment using 3-node deployment,
> there are 7 retries out of 31 concurrent schedule requests recorded, results 
> in
> 22.6% extra performance overhead.
> >> 3. Parallel scheduler support: The design of filter scheduler leads to an 
> >> "even
> worse" performance result using parallel schedulers. In the same experiment
> with 4 schedulers on separate machines, the average db block time is increased
> to 697ms per request and there are 16 retries out of 31 schedule requests,
> namely 51.6% extra overhead.
> >
> > This mostly agrees with recent tests I've been doing simulating 1000
> > compute nodes with the fake virt driver.
> 
> Overall this agrees with what I saw in production before moving us to the
> caching scheduler driver.
> 
> I would love a nova functional test that does that test. It will help us 
> compare
> these different schedulers and find the strengths and weaknesses.

I'm also working on implementing the functional tests of nova scheduler, there
is a patch showing my latest progress: https://review.openstack.org/#/c/281825/ 

IMO scheduler functional tests are not good at testing the real performance of
different schedulers, because all of the services are running as green threads
instead of real processes. I think the better way to analyze the real performance
and the strengths and weaknesses is to start services in different processes with
the fake virt driver (i.e. Clint Byrum's work) or Jay Pipes's work in emulating
different designs.

> >> 2. Since 

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-19 Thread John Garbutt
On 17 February 2016 at 17:52, Clint Byrum  wrote:
> Excerpts from Cheng, Yingxin's message of 2016-02-14 21:21:28 -0800:
>> Hi,
>>
>> I've uploaded a prototype https://review.openstack.org/#/c/280047/ to 
>> testify its design goals in accuracy, performance, reliability and 
>> compatibility improvements. It will also be an Austin Summit Session if 
>> elected: 
>> https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316

Long term, I see a world where there are multiple schedulers Nova is
able to use, depending on the deployment scenario.

We have tried to stop any more schedulers going in-tree (like the
solver scheduler) while we get the interface between the
nova-scheduler and the rest of Nova straightened out, to make that
much easier.

So a big question for me is, does the new scheduler interface work if
you look at slotting in your prototype scheduler?

Specifically I am thinking about this interface:
https://github.com/openstack/nova/blob/master/nova/scheduler/client/__init__.py

>> I want to gather opinions about this idea:
>> 1. Is this feature possible to be accepted in the Newton release?
>> 2. Suggestions to improve its design and compatibility.
>> 3. Possibilities to integrate with resource-provider bp series: I know 
>> resource-provider is the major direction of Nova scheduler, and there will 
>> be fundamental changes in the future, especially according to the bp 
>> https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
>>  However, this prototype proposes a much faster and compatible way to make 
>> schedule decisions based on scheduler caches. The in-memory decisions are 
>> made at the same speed with the caching scheduler, but the caches are kept 
>> consistent with compute nodes as quickly as possible without db refreshing.
>>
>> Here is the detailed design of the mentioned prototype:
>>
>> >>
>> Background:
>> The host state cache maintained by host manager is the scheduler resource 
>> view during schedule decision making. It is updated whenever a request is 
>> received[1], and all the compute node records are retrieved from db every 
>> time. There are several problems in this update model, proven in 
>> experiments[3]:
>> 1. Performance: The scheduler performance is largely affected by db access 
>> in retrieving compute node records. The db block time of a single request is 
>> 355ms in average in the deployment of 3 compute nodes, compared with only 
>> 3ms in in-memory decision-making. Imagine there could be at most 1k nodes, 
>> even 10k nodes in the future.
>> 2. Race conditions: This is not only a parallel-scheduler problem, but also 
>> a problem using only one scheduler. The detailed analysis of 
>> one-scheduler-problem is located in bug analysis[2]. In short, there is a 
>> gap between the scheduler makes a decision in host state cache and the
>> compute node updates its in-db resource record according to that decision in 
>> resource tracker. A recent scheduler resource consumption in cache can be 
>> lost and overwritten by compute node data because of it, result in cache 
>> inconsistency and unexpected retries. In a one-scheduler experiment using 
>> 3-node deployment, there are 7 retries out of 31 concurrent schedule 
>> requests recorded, results in 22.6% extra performance overhead.
>> 3. Parallel scheduler support: The design of filter scheduler leads to an 
>> "even worse" performance result using parallel schedulers. In the same 
>> experiment with 4 schedulers on separate machines, the average db block time 
>> is increased to 697ms per request and there are 16 retries out of 31 
>> schedule requests, namely 51.6% extra overhead.
>
> This mostly agrees with recent tests I've been doing simulating 1000
> compute nodes with the fake virt driver.

Overall this agrees with what I saw in production before moving us to
the caching scheduler driver.

I would love a nova functional test that does that test. It will help
us compare these different schedulers and find the strengths and
weaknesses.

> My retry rate is much lower,
> because there's less window for race conditions since there is no latency
> for the time between nova-compute getting the message that the VM is
> scheduled to it, and responding with a host update. Note that your
> database latency numbers seem much higher, we see about 200ms, and I
> wonder if you are running in a very resource constrained database
> instance.

Just to double check, you are using pymysql rather than MySQL-python
as the sqlalchemy backend?

If you use a driver that doesn't work well with eventlet, things can
get very bad, very quickly. Particularly because of the way the
scheduling works around handing back the results of the DB call. You
can get some benefits by shrinking the db and greenlet pools to reduce
the concurrency.
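
A hedged illustration of the point (hypothetical connection details, not Nova's
config code): the dialect is chosen in the SQLAlchemy URL, and the pool settings
bound how many DB calls can be in flight at once.

    from sqlalchemy import create_engine

    # "mysql+pymysql://" selects the pure-Python driver that cooperates with
    # eventlet's monkey-patched sockets; "mysql+mysqldb://" would pick the
    # C-based MySQL-python driver that blocks the whole hub on every call.
    engine = create_engine(
        "mysql+pymysql://nova:secret@127.0.0.1/nova",   # hypothetical credentials
        pool_size=5,         # analogous to shrinking the db pool
        max_overflow=5)      # cap on extra connections beyond the pool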

>> Improvements:
>> This prototype solved the mentioned issues above by implementing a new 

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-18 Thread Chris Friesen

On 02/17/2016 06:59 AM, Chris Dent wrote:


The advantage of a centralized datastore for that information is
that it provides administrative control (e.g. reserving resources for
other needs) and visibility. That level of command and control seems
to be something people really want (unfortunately).


I don't know if it necessarily requires a centralized datastore, but there is 
definitely interest in some sort of "reserving" of resources.


For instance, an orchestrator may want to do a two-stage setup where it reserves 
all the resources before actually trying to bring everything up.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-18 Thread Nikola Đipanov
On 02/15/2016 09:27 AM, Sylvain Bauza wrote:
> 
> 
> Le 15/02/2016 06:21, Cheng, Yingxin a écrit :
>>
>> Hi,
>>
>>  
>>
>> I’ve uploaded a prototype https://review.openstack.org/#/c/280047/
>>  to testify its design goals
>> in accuracy, performance, reliability and compatibility improvements.
>> It will also be an Austin Summit Session if elected:
>> https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316
>>
>>
>>  
>>
>> I want to gather opinions about this idea:
>>
>> 1. Is this feature possible to be accepted in the Newton release?
>>
> 
> Such feature requires a spec file to be written
> http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged
> 
> Ideally, I'd like to see your below ideas written in that spec file so
> it would be the best way to discuss on the design.
> 
> 

I really cannot help but protest this!

There is actual code posted, and we go back and ask people to write
documents without even bothering to look at the code. That makes no
sense to me!

I'll go and comment on the proposed code:

https://review.openstack.org/#/c/280047/

Which has infinitely more information about the idea than a random text
document.

>> 2. Suggestions to improve its design and compatibility.
>>
> 
> I don't want to go into details here (that's rather the goal of the spec
> for that), but my biggest concerns would be when reviewing the spec :
>  - how this can meet the OpenStack mission statement (ie. ubiquitous
> solution that would be easy to install and massively scalable)
>  - how this can be integrated with the existing (filters, weighers) to
> provide a clean and simple path for operators to upgrade
>  - how this can be supporting rolling upgrades (old computes sending
> updates to new scheduler)
>  - how can we test it
>  - can we have the feature optional for operators
> 

This is precisely how we make sure there is no innovation happening in
Nova ever.

Not all of the above have to be answered for the idea to have technical
merit and be useful to some users. We should be happy to have feature
branches like this available for people to try out and use and iterate
on before we slam developers with our "you need to be this tall to ride"
list.

N.

> 
>> 3. Possibilities to integrate with resource-provider bp series: I know
>> resource-provider is the major direction of Nova scheduler, and there
>> will be fundamental changes in the future, especially according to the
>> bp
>> https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
>> However, this prototype proposes a much faster and compatible way to
>> make schedule decisions based on scheduler caches. The in-memory
>> decisions are made at the same speed with the caching scheduler, but
>> the caches are kept consistent with compute nodes as quickly as
>> possible without db refreshing.
>>
>>  
>>
> 
> That's the key point, thanks for noticing our priorities. So, you know
> that our resource modeling is drastically subject to change in Mitaka
> and Newton. That is the new game, so I'd love to see how you plan to
> interact with that.
> Ideally, I'd appreciate it if Jay Pipes, Chris Dent and you could share
> your ideas, because all of you have great ideas for improving a
> currently frustrating solution.
> 
> -Sylvain
> 
> 
>> Here is the detailed design of the mentioned prototype:
>>
>>  
>>
>> >>
>>
>> Background:
>>
>> The host state cache maintained by host manager is the scheduler
>> resource view during schedule decision making. It is updated whenever
>> a request is received[1], and all the compute node records are
>> retrieved from db every time. There are several problems in this
>> update model, proven in experiments[3]:
>>
>> 1. Performance: The scheduler performance is largely affected by db
>> access in retrieving compute node records. The db block time of a
>> single request is 355ms in average in the deployment of 3 compute
>> nodes, compared with only 3ms in in-memory decision-making. Imagine
>> there could be at most 1k nodes, even 10k nodes in the future.
>>
>> 2. Race conditions: This is not only a parallel-scheduler problem, but
>> also a problem using only one scheduler. The detailed analysis of
>> one-scheduler-problem is located in bug analysis[2]. In short, there
>> is a gap between the scheduler makes a decision in host state cache
>> and the
>>
>> compute node updates its in-db resource record according to that
>> decision in resource tracker. A recent scheduler resource consumption
>> in cache can be lost and overwritten by compute node data because of
>> it, result in cache inconsistency and unexpected retries. In a
>> one-scheduler experiment using 3-node deployment, there are 7 retries
>> out of 31 concurrent schedule requests recorded, results in 22.6%
>> extra performance overhead.
>>
>> 3. Parallel scheduler support: The design of filter scheduler leads to
>> 

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-17 Thread Cheng, Yingxin
> -Original Message-
> From: Clint Byrum [mailto:cl...@fewbar.com]
> Sent: Thursday, February 18, 2016 1:53 AM
> To: openstack-dev <openstack-dev@lists.openstack.org>
> Subject: Re: [openstack-dev] [nova] A prototype implementation towards the
> "shared state scheduler"
> 
> Excerpts from Cheng, Yingxin's message of 2016-02-14 21:21:28 -0800:
> > Hi,
> >
> > I've uploaded a prototype https://review.openstack.org/#/c/280047/ to
> > testify its design goals in accuracy, performance, reliability and
> > compatibility improvements. It will also be an Austin Summit Session
> > if elected:
> > https://www.openstack.org/summit/austin-2016/vote-for-speakers/Present
> > ation/7316
> >
> > I want to gather opinions about this idea:
> > 1. Is this feature possible to be accepted in the Newton release?
> > 2. Suggestions to improve its design and compatibility.
> > 3. Possibilities to integrate with resource-provider bp series: I know 
> > resource-
> provider is the major direction of Nova scheduler, and there will be 
> fundamental
> changes in the future, especially according to the bp
> https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-
> providers-scheduler.rst. However, this prototype proposes a much faster and
> compatible way to make schedule decisions based on scheduler caches. The in-
> memory decisions are made at the same speed with the caching scheduler, but
> the caches are kept consistent with compute nodes as quickly as possible
> without db refreshing.
> >
> > Here is the detailed design of the mentioned prototype:
> >
> > >>
> > Background:
> > The host state cache maintained by host manager is the scheduler resource
> view during schedule decision making. It is updated whenever a request is
> received[1], and all the compute node records are retrieved from db every 
> time.
> There are several problems in this update model, proven in experiments[3]:
> > 1. Performance: The scheduler performance is largely affected by db access 
> > in
> retrieving compute node records. The db block time of a single request is 
> 355ms
> in average in the deployment of 3 compute nodes, compared with only 3ms in
> in-memory decision-making. Imagine there could be at most 1k nodes, even 10k
> nodes in the future.
> > 2. Race conditions: This is not only a parallel-scheduler problem, but
> > also a problem using only one scheduler. The detailed analysis of one-
> scheduler-problem is located in bug analysis[2]. In short, there is a gap 
> between
> the scheduler makes a decision in host state cache and the compute node
> updates its in-db resource record according to that decision in resource 
> tracker.
> A recent scheduler resource consumption in cache can be lost and overwritten
> by compute node data because of it, result in cache inconsistency and
> unexpected retries. In a one-scheduler experiment using 3-node deployment,
> there are 7 retries out of 31 concurrent schedule requests recorded, results 
> in
> 22.6% extra performance overhead.
> > 3. Parallel scheduler support: The design of filter scheduler leads to an 
> > "even
> worse" performance result using parallel schedulers. In the same experiment
> with 4 schedulers on separate machines, the average db block time is increased
> to 697ms per request and there are 16 retries out of 31 schedule requests,
> namely 51.6% extra overhead.
> 
> 
> This mostly agrees with recent tests I've been doing simulating 1000 compute
> nodes with the fake virt driver. My retry rate is much lower, because there's 
> less
> window for race conditions since there is no latency for the time between 
> nova-
> compute getting the message that the VM is scheduled to it, and responding
> with a host update. Note that your database latency numbers seem much higher,
> we see about 200ms, and I wonder if you are running in a very resource
> constrained database instance.

Yes, I only have 4 cores and 16GB RAM in my desktop and I booted up 4 VMs for
developing, testing and debugging this prototype.
Moreover, those schedulers are deployed on separate hosts, so there is more
latency in my environment.

You have a great test environment for schedulers; you must have spent a lot of
effort in managing 1000 compute nodes, collecting logs and doing automatic
analysis.

> 
> >
> > Improvements:
> > This prototype solved the mentioned issues above by implementing a new
> update model to scheduler host state cache. Instead of refreshing caches from
> db, every compute node maintains its accurate version of host state cache
> updated by the resource tracke

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-17 Thread Clint Byrum
Excerpts from Cheng, Yingxin's message of 2016-02-14 21:21:28 -0800:
> Hi,
> 
> I've uploaded a prototype https://review.openstack.org/#/c/280047/ to testify 
> its design goals in accuracy, performance, reliability and compatibility 
> improvements. It will also be an Austin Summit Session if elected: 
> https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316
> 
> I want to gather opinions about this idea:
> 1. Is this feature possible to be accepted in the Newton release?
> 2. Suggestions to improve its design and compatibility.
> 3. Possibilities to integrate with resource-provider bp series: I know 
> resource-provider is the major direction of Nova scheduler, and there will be 
> fundamental changes in the future, especially according to the bp 
> https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
>  However, this prototype proposes a much faster and compatible way to make 
> schedule decisions based on scheduler caches. The in-memory decisions are 
> made at the same speed with the caching scheduler, but the caches are kept 
> consistent with compute nodes as quickly as possible without db refreshing.
> 
> Here is the detailed design of the mentioned prototype:
> 
> >>
> Background:
> The host state cache maintained by host manager is the scheduler resource 
> view during schedule decision making. It is updated whenever a request is 
> received[1], and all the compute node records are retrieved from db every 
> time. There are several problems in this update model, proven in 
> experiments[3]:
> 1. Performance: The scheduler performance is largely affected by db access in 
> retrieving compute node records. The db block time of a single request is 
> 355ms in average in the deployment of 3 compute nodes, compared with only 3ms 
> in in-memory decision-making. Imagine there could be at most 1k nodes, even 
> 10k nodes in the future.
> 2. Race conditions: This is not only a parallel-scheduler problem, but also a 
> problem using only one scheduler. The detailed analysis of 
> one-scheduler-problem is located in bug analysis[2]. In short, there is a gap 
> between the scheduler makes a decision in host state cache and the
> compute node updates its in-db resource record according to that decision in 
> resource tracker. A recent scheduler resource consumption in cache can be 
> lost and overwritten by compute node data because of it, result in cache 
> inconsistency and unexpected retries. In a one-scheduler experiment using 
> 3-node deployment, there are 7 retries out of 31 concurrent schedule requests 
> recorded, results in 22.6% extra performance overhead.
> 3. Parallel scheduler support: The design of filter scheduler leads to an 
> "even worse" performance result using parallel schedulers. In the same 
> experiment with 4 schedulers on separate machines, the average db block time 
> is increased to 697ms per request and there are 16 retries out of 31 schedule 
> requests, namely 51.6% extra overhead.
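
The cache overwrite in point 2 above can be pictured with a tiny example (toy
code, not Nova's): the scheduler's in-memory deduction is wiped out by the next
full report from a compute node that has not yet applied the claim.

    scheduler_cache = {'node1_free_ram_mb': 8192}

    # 1. The scheduler places an instance and deducts it from its own cache.
    scheduler_cache['node1_free_ram_mb'] -= 2048              # now 6144

    # 2. Before the compute node has processed that boot, it reports its
    #    current (still unclaimed) state, overwriting the deduction.
    stale_report_from_compute = {'node1_free_ram_mb': 8192}
    scheduler_cache.update(stale_report_from_compute)          # back to 8192

    # 3. The next request sees 8192 MB free again, over-commits, and is
    #    retried when the compute node's own claim eventually fails.
    print(scheduler_cache['node1_free_ram_mb'])                # -> 8192, not 6144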


This mostly agrees with recent tests I've been doing simulating 1000
compute nodes with the fake virt driver. My retry rate is much lower,
because there's less window for race conditions since there is no latency
for the time between nova-compute getting the message that the VM is
scheduled to it, and responding with a host update. Note that your
database latency numbers seem much higher, we see about 200ms, and I
wonder if you are running in a very resource constrained database
instance.

> 
> Improvements:
> This prototype solved the mentioned issues above by implementing a new update 
> model to scheduler host state cache. Instead of refreshing caches from db, 
> every compute node maintains its accurate version of host state cache updated 
> by the resource tracker, and sends incremental updates directly to 
> schedulers. So the scheduler cache are synchronized to the correct state as 
> soon as possible with the lowest overhead. Also, scheduler will send resource 
> claim with its decision to the target compute node. The compute node can 
> decide whether the resource claim is successful immediately by its local host 
> state cache and send responds back ASAP. With all the claims are tracked from 
> schedulers to compute nodes, no false overwrites will happen, and thus the 
> gaps between scheduler cache and real compute node states are minimized. The 
> benefits are obvious with recorded experiments[3] compared with caching 
> scheduler and filter scheduler:

You don't mention this, but I'm assuming this is true: At startup of a
new shared state scheduler, it fills its host state cache from the
database.

> 1. There is no db block time during scheduler decision making, the average 
> decision time per request is about 3ms in both single and multiple scheduler 
> scenarios, which is equal to the in-memory decision time of filter scheduler 
> and caching scheduler.
> 2. Since the scheduler 

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-17 Thread Cheng, Yingxin

On Wed, 17 February 2016, Sylvain Bauza wrote

(sorry, quoting off-context, but I feel it's a side point, not the main 
discussion)
Le 17/02/2016 16:40, Cheng, Yingxin a écrit :
IMHO, the authority to allocate resources is not limited to compute nodes, but
also includes the network service, storage service and all other services which
have the authority to manage their own resources. Those "shared" resources are
coming from external services (i.e. systems) which are not the compute service.
They all have the responsibility to push their own resource updates to schedulers,
and to make resource reservations and consumptions. The resource provider series
provides a flexible representation of all kinds of resources, so that the scheduler
can handle them without having specific knowledge of all the resources.

No, IMHO, the authority has to stay with the entity which physically creates the
instance and owns its lifecycle. What the user wants when booting is an
instance, not something else. They can express some SLA by providing more context,
implicitly (through aggregates or flavors) or explicitly (through hints or AZs),
that could be non-compute-related (say a network segment locality or a
volume-related thing), but in the end it will create an instance on a compute
node that matches the requirements.

Cinder and Neutron shouldn't manage which instances are on which hosts; they 
just have to provide the resource types and the possible allocations (like a taken 
port).

-Sylvain

Yes, on second thought, the Cinder project also has its own scheduler, so it is not 
the responsibility of nova-scheduler to schedule all pieces of resources. 
Nova-scheduler is responsible for booting instances; its scope is limited to 
compute services.
-Yingxin


Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-17 Thread Sylvain Bauza
(sorry, quoting off-context, but I feel it's a side point, not the main 
discussion)



On 17/02/2016 16:40, Cheng, Yingxin wrote:


IMHO, the authority to allocate resources is not limited to compute 
nodes; it also includes the network service, storage service, and all other 
services that have the authority to manage their own resources. Those 
“shared” resources come from external services (i.e. systems) 
which are not the compute service. They all have the responsibility to push 
their own resource updates to the schedulers, and to make resource reservations 
and consumption. The resource provider series provides a flexible 
representation of all kinds of resources, so that the scheduler can handle 
them without specific knowledge of each resource.




No, IMHO, the authority has to stay with the entity which physically creates 
the instance and owns its lifecycle. What the user wants when booting is 
an instance, not something else. He can express some SLA by providing 
more context, either implicitly (through aggregates or flavors) or explicitly 
(through hints or AZs), that may not be compute-related (say a network segment 
locality or a volume-related thing), but in the end it will create an 
instance on a compute node that matches the requirements.


Cinder and Neutron shouldn't manage which instances are on which hosts; 
they just have to provide the resource types and the possible allocations 
(like a taken port).


-Sylvain



Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-17 Thread Cheng, Yingxin
On Wed, 17 February 2016, Sylvain Bauza wrote
On 17/02/2016 12:59, Chris Dent wrote:
On Wed, 17 Feb 2016, Cheng, Yingxin wrote:


To better illustrate the differences between the shared-state, resource-
provider and legacy schedulers, I've drawn 3 simplified pictures [1]
emphasizing the location of the resource view, the location of claim and
resource consumption, and the resource update/refresh pattern in the three
kinds of schedulers. Hoping I'm correct in the "resource-provider
scheduler" part.

That's a useful visual aid, thank you. It aligns pretty well with my
understanding of each idea.

A thing that may be missing, which may help in exploring the usefulness
of each idea, is a representation of resources which are separate
from compute nodes and shared by them, such as shared disk or pools
of network addresses. In addition some would argue that we need to
see bare-metal nodes for a complete picture.

One of the driving motivations of the resource-provider work is to
make it possible to adequately and accurately track and consume the
shared resources. The legacy scheduler currently fails to do that
well. As you correctly point out, it does this by having "strict
centralized consistency" as a design goal.

So, to be clear, I'm really happy to see the resource-providers series, for many 
reasons:
 - it will help us get a nice facade for getting the resources and 
attributing them
 - it will help shared-storage deployments by making sure that we don't have 
resource problems when a resource is shared
 - it will create a possibility for external resource providers to provide some 
resource types to Nova so the Nova scheduler could use them (like Neutron-
related resources)

I really want to have that implemented in Mitaka and Newton, and I'm totally 
on board and supporting it.

To be clear, the only problem I see with the series is [2], not the whole series.

@cdent:
As far as I know, some resources are defined as "shared" simply because they are 
not resources of the compute node service. In other words, the compute node 
resource tracker does not have authority over those "shared" resources. For 
example, the "shared" storage resources are actually managed by the storage 
service, and the "shared" network resource "IP pool" is actually owned by the 
network service. If all those resources are labeled "shared" only because they 
are not owned by compute node services, the 
shared-resource-tracking/consumption problem can be solved by implementing 
resource trackers in all the authorized services. Those resource trackers would 
constantly provide incremental updates to the schedulers, and would have the 
responsibility to reserve and consume resources independently and in a 
distributed way, no matter where they are from: compute service, storage service, 
network service, etc.

As can be seen in the illustrations [1], the main compatibility issue
between shared-state and resource-provider scheduler is caused by the
different location of claim/consumption and the assumed consistent
resource view. IMO unless the claims are allowed to happen in both
places(resource tracker and resource-provider db), it seems difficult
to make shared-state and resource-provider scheduler work together.

Yes, but doing claims twice feels intuitively redundant.

As I've explored this space I've often wondered why we feel it is
necessary to persist the resource data at all. Your shared-state
model is appealing because it lets the concrete resource(-provider)
be the authority about its own resources. That is information which
it can broadcast as it changes or on intervals (or both) to other
things which need that information. That feels like the correct
architecture in a massively distributed system, especially one where
resources are not scarce.

So, IMHO, we should only have the compute nodes being the authority for 
allocating resources. There are many reasons for that which I provided in the spec 
review, but I can reply again :
- #1 If we consider that an external system, as a resource provider, 
will provide a single resource class usage (like network segment availability), 
it will still require the instance to be spawned *for* consuming that resource 
class, even if the scheduler accounts for it. That would mean that the 
scheduler would have to manage a list of allocations with a TTL, and periodically 
verify that the allocation succeeded by asking the external system (or getting 
feedback from the external system). See, that's racy.
- #2 the scheduler is just a decision maker; in any case it doesn't 
account for the real instance creation (it doesn't hold the ownership of the 
instance). Making it accountable for instance usage is very 
difficult. Take for example a request for CPU pinning or NUMA affinity. The 
user can't really express which pin of the pCPU he will get; that's the compute 
node which will do that for him. Of course, the scheduler will help pick a 
host that can fit the request, but the real pinning

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-17 Thread Sylvain Bauza



On 17/02/2016 12:59, Chris Dent wrote:

On Wed, 17 Feb 2016, Cheng, Yingxin wrote:


To better illustrate the differences between the shared-state, resource-
provider and legacy schedulers, I've drawn 3 simplified pictures [1]
emphasizing the location of the resource view, the location of claim and
resource consumption, and the resource update/refresh pattern in the three
kinds of schedulers. Hoping I'm correct in the "resource-provider
scheduler" part.


That's a useful visual aid, thank you. It aligns pretty well with my
understanding of each idea.

A thing that may be missing, which may help in exploring the usefulness
of each idea, is a representation of resources which are separate
from compute nodes and shared by them, such as shared disk or pools
of network addresses. In addition some would argue that we need to
see bare-metal nodes for a complete picture.

One of the driving motivations of the resource-provider work is to
make it possible to adequately and accurately track and consume the
shared resources. The legacy scheduler currently fails to do that
well. As you correctly point out, it does this by having "strict
centralized consistency" as a design goal.



So, to be clear, I'm really happy to see the resource-providers series, 
for many reasons:
 - it will help us get a nice facade for getting the resources and 
attributing them
 - it will help shared-storage deployments by making sure that we 
don't have resource problems when a resource is shared
 - it will create a possibility for external resource providers to 
provide some resource types to Nova so the Nova scheduler could use them 
(like Neutron-related resources)


I really want to have that implemented in Mitaka and Newton, and I'm 
totally on board and supporting it.


To be clear, the only problem I see with the series is [2], not the whole series.




As can be seen in the illustrations [1], the main compatibility issue
between shared-state and resource-provider scheduler is caused by the
different location of claim/consumption and the assumed consistent
resource view. IMO unless the claims are allowed to happen in both
places(resource tracker and resource-provider db), it seems difficult
to make shared-state and resource-provider scheduler work together.


Yes, but doing claims twice feels intuitively redundant.

As I've explored this space I've often wondered why we feel it is
necessary to persist the resource data at all. Your shared-state
model is appealing because it lets the concrete resource(-provider)
be the authority about its own resources. That is information which
it can broadcast as it changes or on intervals (or both) to other
things which need that information. That feels like the correct
architecture in a massively distributed system, especially one where
resources are not scarce.


So, IMHO, we should only have the compute nodes being the authority for 
allocating resources. There are many reasons for that which I provided in the 
spec review, but I can reply again :


 * #1 If we consider that an external system, as a resource provider,
   will provide a single resource class usage (like network segment
   availability), it will still require the instance to be spawned
   *for* consuming that resource class, even if the scheduler accounts
   for it. That would mean that the scheduler would have to manage a
   list of allocations with a TTL, and periodically verify that the
   allocation succeeded by asking the external system (or getting
   feedback from the external system). See, that's racy (see the rough
   sketch after this list).
 * #2 the scheduler is just a decision maker; in any case it doesn't
   account for the real instance creation (it doesn't hold the
   ownership of the instance). Making it accountable for instance
   usage is very difficult. Take for example a request for
   CPU pinning or NUMA affinity. The user can't really express which
   pin of the pCPU he will get; that's the compute node which will do
   that for him. Of course, the scheduler will help pick a host
   that can fit the request, but the real pinning will happen in the
   compute node.
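
(A rough sketch of the bookkeeping described in #1, and argued against here; every name in it is hypothetical, it is only meant to show why the TTL approach is racy.)

    import time

    class PendingAllocations(object):
        """Allocations the scheduler would have to babysit (illustration only)."""

        def __init__(self, external_client, ttl=30.0):
            self.external = external_client    # e.g. a Neutron-like service client
            self.ttl = ttl
            self.pending = {}                  # allocation id -> (resource, created_at)

        def add(self, alloc_id, resource):
            self.pending[alloc_id] = (resource, time.time())

        def verify(self):
            # Periodic task: ask the external system whether each allocation
            # really happened, and expire the ones that timed out. Between two
            # runs of this loop the scheduler's view can be stale, which is
            # exactly the race being pointed out above.
            for alloc_id, (resource, created) in list(self.pending.items()):
                if self.external.allocation_exists(alloc_id):
                    del self.pending[alloc_id]
                elif time.time() - created > self.ttl:
                    del self.pending[alloc_id]     # give the resource back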


Also, I'm very interested in keeping an optimistic scheduler which 
wouldn't lock the entire view of the world anytime a request comes in. 
There are many papers showing different architectures and benchmarks 
against different possibilities and, TBH, I'm very concerned about the 
scaling effect.
Also, we should keep in mind our new paradigm called Cells V2, which 
implies a global distributed scheduler handling all requests. Having 
it follow the same design tenets as OpenStack [3] by using an 
"eventually consistent shared state" makes my gut say that I'd love 
to see that.






The advantage of a centralized datastore for that information is
that it provides administrative control (e.g. reserving resources for
other needs) and visibility. That level of command and control seems
to be something people really want (unfortunately).




My point is that while I truly 

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-17 Thread Chris Dent

On Wed, 17 Feb 2016, Cheng, Yingxin wrote:


To better illustrate the differences between the shared-state, resource-
provider and legacy schedulers, I've drawn 3 simplified pictures [1]
emphasizing the location of the resource view, the location of claim and
resource consumption, and the resource update/refresh pattern in the three
kinds of schedulers. Hoping I'm correct in the "resource-provider
scheduler" part.


That's a useful visual aid, thank you. It aligns pretty well with my
understanding of each idea.

A thing that may be missing, which may help in exploring the usefulness
of each idea, is a representation of resources which are separate
from compute nodes and shared by them, such as shared disk or pools
of network addresses. In addition some would argue that we need to
see bare-metal nodes for a complete picture.

One of the driving motivations of the resource-provider work is to
make it possible to adequately and accurately track and consume the
shared resources. The legacy scheduler currently fails to do that
well. As you correctly point out, it does this by having "strict
centralized consistency" as a design goal.


As can be seen in the illustrations [1], the main compatibility issue
between shared-state and resource-provider scheduler is caused by the
different location of claim/consumption and the assumed consistent
resource view. IMO unless the claims are allowed to happen in both
places(resource tracker and resource-provider db), it seems difficult
to make shared-state and resource-provider scheduler work together.


Yes, but doing claims twice feels intuitively redundant.

As I've explored this space I've often wondered why we feel it is
necessary to persist the resource data at all. Your shared-state
model is appealing because it lets the concrete resource(-provider)
be the authority about its own resources. That is information which
it can broadcast as it changes or on intervals (or both) to other
things which need that information. That feels like the correct
architecture in a massively distributed system, especially one where
resources are not scarce.

The advantage of a centralized datastore for that information is
that it provides administrative control (e.g. reserving resources for
other needs) and visibility. That level of command and control seems
to be something people really want (unfortunately).

--
Chris Dent   http://anticdent.org/
freenode: cdent  tw: @anticdent


Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-16 Thread Cheng, Yingxin
To better illustrate the differences between the shared-state, resource-provider 
and legacy schedulers, I've drawn 3 simplified pictures [1] emphasizing the 
location of the resource view, the location of claim and resource consumption, and 
the resource update/refresh pattern in the three kinds of schedulers. Hoping I'm 
correct in the "resource-provider scheduler" part.


A point of view from my analysis comparing the three schedulers (before the real 
experiments):
1. Performance: The performance bottleneck of the resource-provider and legacy 
schedulers comes from the centralized db and the scheduler cache refreshing. It can 
be alleviated by switching to a stand-alone high-performance database, and the 
cache refreshing is designed to be replaced by direct SQL queries according 
to the resource-provider scheduler spec [2]. The performance bottleneck of the 
shared-state scheduler may come from the overwhelming number of update messages; 
it can also be alleviated by switching to a stand-alone distributed message queue 
and by using the "MessagePipe" to merge messages (a rough sketch of such merging 
appears after this list).
2. Final decision accuracy: I think the accuracy of the final decision is high 
in all three schedulers, because until now the consistent resource view and the 
final resource consumption with claims are all in the same place: the resource 
trackers in the shared-state and legacy schedulers, and the resource-provider db 
in the resource-provider scheduler.
3. Scheduler decision accuracy: IMO the order of accuracy of a single schedule 
decision is resource-provider > shared-state >> legacy scheduler. The 
resource-provider scheduler can get the accurate resource view directly from the 
db. The shared-state scheduler gets the most accurate resource view by 
constantly collecting updates from resource trackers and by tracking the 
scheduler claims from schedulers to RTs. The legacy scheduler's decision is the 
worst because it doesn't track its claims and gets resource views from compute 
node records, which are not that accurate.
4. Design goal difference:
The fundamental design goal of the two new schedulers is different. Copying my 
views from [2], I think it is the choice between "loose distributed 
consistency with retries" and "strict centralized consistency with locks".
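
(I don't know the internals of the prototype's "MessagePipe", so the sketch below is only my assumption of the general idea behind merging: consecutive deltas for the same host are coalesced so that schedulers receive one message instead of many.)

    import collections

    class MessagePipe(object):
        """Coalesces per-host deltas before they are sent (illustration only)."""

        def __init__(self, send_func):
            self.send = send_func
            self.buffered = collections.defaultdict(dict)   # host -> merged delta

        def push(self, host, delta):
            merged = self.buffered[host]
            for key, change in delta.items():
                merged[key] = merged.get(key, 0) + change

        def flush(self):
            # Called periodically or when the queue drains: one message per
            # host, carrying the net change since the last flush.
            for host, merged in self.buffered.items():
                self.send(host, merged)
            self.buffered.clear()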


As can be seen in the illustrations [1], the main compatibility issue between 
shared-state and resource-provider scheduler is caused by the different 
location of claim/consumption and the assumed consistent resource view. IMO 
unless the claims are allowed to happen in both places(resource tracker and 
resource-provider db), it seems difficult to make shared-state and 
resource-provider scheduler work together.


[1] 
https://docs.google.com/document/d/1iNUkmnfEH3bJHDSenmwE4A1Sgm3vnqa6oZJWtTumAkw/edit?usp=sharing
[2] https://review.openstack.org/#/c/271823/


Regards,
-Yingxin

From: Sylvain Bauza [mailto:sba...@redhat.com]
Sent: Monday, February 15, 2016 9:48 PM
To: OpenStack Development Mailing List (not for usage questions) 
<openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [nova] A prototype implementation towards the 
"shared state scheduler"


On 15/02/2016 10:48, Cheng, Yingxin wrote:
Thanks Sylvain,

1. The below ideas will be extended to a spec ASAP.

Nice, looking forward to it then :-)


2. Thanks for raising concerns I had not thought of yet; they will be addressed in 
the spec soon.

3. Let me copy my thoughts from another thread about the integration with 
resource-provider:
The idea is about "Only the compute node knows its own final compute-node resource 
view" or "The accurate resource view only exists at the place where it is 
actually consumed." I.e., the incremental updates can only come from the actual 
"consumption" action, no matter where it happens (e.g. compute node, storage 
service, network service, etc.). Borrowing the terms from resource-provider, 
compute nodes can maintain their accurate version of the "compute-node-inventory" 
cache, and can send incremental updates because they actually consume compute 
resources; furthermore, the storage service can also maintain an accurate version 
of a "storage-inventory" cache and send incremental updates if it also consumes 
storage resources. If there are central services in charge of consuming all the 
resources, the accurate cache and updates must come from them.
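
(A small sketch of that principle, with made-up names: the service that actually consumes a resource owns the authoritative inventory and emits the incremental update itself.)

    class LocalInventory(object):
        """Authoritative per-service inventory cache (illustration only)."""

        def __init__(self, notify_schedulers, totals):
            self.notify = notify_schedulers    # callback fanning out to schedulers
            self.free = dict(totals)           # e.g. {'VCPU': 32, 'MEMORY_MB': 65536}

        def consume(self, request):
            # Claim locally first (this is the only accurate view), then
            # broadcast the delta so the scheduler caches converge.
            if any(self.free.get(rc, 0) < amount for rc, amount in request.items()):
                return False
            for rc, amount in request.items():
                self.free[rc] -= amount
            self.notify({rc: -amount for rc, amount in request.items()})
            return True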


That is one of the things I'd like to see in your spec, and how you could 
interact with the new model.
Thanks,
-Sylvain




Regards,
-Yingxin

From: Sylvain Bauza [mailto:sba...@redhat.com]
Sent: Monday, February 15, 2016 5:28 PM
To: OpenStack Development Mailing List (not for usage questions) 
<openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [nova] A prototype implementation towards the 
"shared state scheduler"


On 15/02/2016 06:21, Cheng, Yingxin wrote:
Hi,

I've uploaded 

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-15 Thread Ed Leafe

On 02/15/2016 03:27 AM, Sylvain Bauza wrote:

> - can we have the feature optional for operators

One thing that concerns me is the lesson learned from simply having a
compute node's instance information sent and persisted in memory. That
was resisted by several large operators, due to overhead. This
proposal will have to store that and more in memory.

--
Ed Leafe



Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-15 Thread Sylvain Bauza



On 15/02/2016 10:48, Cheng, Yingxin wrote:


Thanks Sylvain,

1. The below ideas will be extended to a spec ASAP.



Nice, looking forward to it then :-)


2. Thanks for raising concerns I had not thought of yet; they will be 
addressed in the spec soon.


3. Let me copy my thoughts from another thread about the integration 
with resource-provider:


The idea is about “Only the compute node knows its own final compute-node 
resource view” or “The accurate resource view only exists at the place 
where it is actually consumed.” I.e., the incremental updates can only 
come from the actual “consumption” action, no matter where it happens 
(e.g. compute node, storage service, network service, etc.). Borrowing 
the terms from resource-provider, compute nodes can maintain their 
accurate version of the “compute-node-inventory” cache, and can send 
incremental updates because they actually consume compute resources; 
furthermore, the storage service can also maintain an accurate version 
of a “storage-inventory” cache and send incremental updates if it also 
consumes storage resources. If there are central services in charge of 
consuming all the resources, the accurate cache and updates must come 
from them.




That is one of the things I'd like to see in your spec, and how you 
could interact with the new model.

Thanks,
-Sylvain



Regards,

-Yingxin

*From:*Sylvain Bauza [mailto:sba...@redhat.com]
*Sent:* Monday, February 15, 2016 5:28 PM
*To:* OpenStack Development Mailing List (not for usage questions) 
<openstack-dev@lists.openstack.org>
*Subject:* Re: [openstack-dev] [nova] A prototype implementation 
towards the "shared state scheduler"


On 15/02/2016 06:21, Cheng, Yingxin wrote:

Hi,

I’ve uploaded a prototype https://review.openstack.org/#/c/280047/
<https://review.openstack.org/#/c/280047/> to demonstrate its design
goals of accuracy, performance, reliability and compatibility
improvements. It will also be an Austin Summit Session if elected:

https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316


I want to gather opinions about this idea:

1. Is this feature possible to be accepted in the Newton release?


Such feature requires a spec file to be written 
http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged


Ideally, I'd like to see your below ideas written in that spec file so 
it would be the best way to discuss on the design.




2. Suggestions to improve its design and compatibility.


I don't want to go into details here (that's rather the goal of the 
spec for that), but my biggest concerns would be when reviewing the spec :
 - how this can meet the OpenStack mission statement (ie. ubiquitous 
solution that would be easy to install and massively scalable)
 - how this can be integrated with the existing (filters, weighers) to 
provide a clean and simple path for operators to upgrade
 - how this can be supporting rolling upgrades (old computes sending 
updates to new scheduler)

 - how can we test it
 - can we have the feature optional for operators



3. Possibilities to integrate with resource-provider bp series: I
know resource-provider is the major direction of Nova scheduler,
and there will be fundamental changes in the future, especially
according to the bp

https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
However, this prototype proposes a much faster and compatible way
to make schedule decisions based on scheduler caches. The
in-memory decisions are made at the same speed with the caching
scheduler, but the caches are kept consistent with compute nodes
as quickly as possible without db refreshing.


That's the key point, thanks for noticing our priorities. So, you know 
that our resource modeling is drastically subject to change in Mitaka 
and Newton. That is the new game, so I'd love to see how you plan to 
interact with that.
Ideally, I'd appreciate it if Jay Pipes, Chris Dent and you could share 
your ideas, because all of you have great ideas to improve a 
currently frustrating solution.


-Sylvain



Here is the detailed design of the mentioned prototype:

>>

Background:

The host state cache maintained by host manager is the scheduler
resource view during schedule decision making. It is updated
whenever a request is received[1], and all the compute node
records are retrieved from db every time. There are several
problems in this update model, proven in experiments[3]:

1. Performance: The scheduler performance is largely affected by
db access in retrieving compute node records. The db block time of
a single request is 355ms in average in the deployment of 3
compute nodes, compared with only 3ms in in-memory
decision-making. Imagine there could be at most 1k nodes, even 10k
nodes in the future.

2. Race conditions:

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-15 Thread Cheng, Yingxin
Thanks Sylvain,

1. The below ideas will be extended to a spec ASAP.

2. Thanks for providing concerns I've not thought it yet, they will be in the 
spec soon.

3. Let me copy my thoughts from another thread about the integration with 
resource-provider:
The idea is about "Only compute node knows its own final compute-node resource 
view" or "The accurate resource view only exists at the place where it is 
actually consumed." I.e., The incremental updates can only come from the actual 
"consumption" action, no matter where it is(e.g. compute node, storage service, 
network service, etc.). Borrow the terms from resource-provider, compute nodes 
can maintain its accurate version of "compute-node-inventory" cache, and can 
send incremental updates because it actually consumes compute resources, 
furthermore, storage service can also maintain an accurate version of 
"storage-inventory" cache and send incremental updates if it also consumes 
storage resources. If there are central services in charge of consuming all the 
resources, the accurate cache and updates must come from them.


Regards,
-Yingxin

From: Sylvain Bauza [mailto:sba...@redhat.com]
Sent: Monday, February 15, 2016 5:28 PM
To: OpenStack Development Mailing List (not for usage questions) 
<openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [nova] A prototype implementation towards the 
"shared state scheduler"


On 15/02/2016 06:21, Cheng, Yingxin wrote:
Hi,

I've uploaded a prototype https://review.openstack.org/#/c/280047/ to demonstrate 
its design goals of accuracy, performance, reliability and compatibility 
improvements. It will also be an Austin Summit Session if elected: 
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316

I want to gather opinions about this idea:
1. Is this feature possible to be accepted in the Newton release?

Such feature requires a spec file to be written 
http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged

Ideally, I'd like to see your below ideas written in that spec file so it would 
be the best way to discuss on the design.



2. Suggestions to improve its design and compatibility.

I don't want to go into details here (that's rather the goal of the spec for 
that), but my biggest concerns would be when reviewing the spec :
 - how this can meet the OpenStack mission statement (ie. ubiquitous solution 
that would be easy to install and massively scalable)
 - how this can be integrated with the existing (filters, weighers) to provide 
a clean and simple path for operators to upgrade
 - how this can be supporting rolling upgrades (old computes sending updates to 
new scheduler)
 - how can we test it
 - can we have the feature optional for operators



3. Possibilities to integrate with resource-provider bp series: I know 
resource-provider is the major direction of Nova scheduler, and there will be 
fundamental changes in the future, especially according to the bp 
https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
 However, this prototype proposes a much faster and compatible way to make 
schedule decisions based on scheduler caches. The in-memory decisions are made 
at the same speed with the caching scheduler, but the caches are kept 
consistent with compute nodes as quickly as possible without db refreshing.


That's the key point, thanks for noticing our priorities. So, you know that our 
resource modeling is drastically subject to change in Mitaka and Newton. That 
is the new game, so I'd love to see how you plan to interact with that.
Ideally, I'd appreciate it if Jay Pipes, Chris Dent and you could share your ideas, 
because all of you have great ideas to improve a currently frustrating 
solution.

-Sylvain



Here is the detailed design of the mentioned prototype:

>>
Background:
The host state cache maintained by host manager is the scheduler resource view 
during schedule decision making. It is updated whenever a request is 
received[1], and all the compute node records are retrieved from db every time. 
There are several problems in this update model, proven in experiments[3]:
1. Performance: The scheduler performance is largely affected by db access in 
retrieving compute node records. The db block time of a single request is 355ms 
in average in the deployment of 3 compute nodes, compared with only 3ms in 
in-memory decision-making. Imagine there could be at most 1k nodes, even 10k 
nodes in the future.
2. Race conditions: This is not only a parallel-scheduler problem, but also a 
problem using only one scheduler. The detailed analysis of 
one-scheduler-problem is located in bug analysis[2]. In short, there is a gap 
between the scheduler makes a decision in host state cache and the
compute node updates its in-db resource record according to that decision in 
resource tracker. A rec

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-15 Thread Sylvain Bauza



On 15/02/2016 06:21, Cheng, Yingxin wrote:


Hi,

I’ve uploaded a prototype https://review.openstack.org/#/c/280047/ 
to demonstrate its design goals 
of accuracy, performance, reliability and compatibility improvements. 
It will also be an Austin Summit Session if elected: 
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316 



I want to gather opinions about this idea:

1. Is this feature possible to be accepted in the Newton release?



Such feature requires a spec file to be written 
http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged


Ideally, I'd like to see your below ideas written in that spec file so 
it would be the best way to discuss on the design.




2. Suggestions to improve its design and compatibility.



I don't want to go into details here (that's rather the goal of the spec 
for that), but my biggest concerns would be when reviewing the spec :
 - how this can meet the OpenStack mission statement (ie. ubiquitous 
solution that would be easy to install and massively scalable)
 - how this can be integrated with the existing (filters, weighers) to 
provide a clean and simple path for operators to upgrade
 - how this can be supporting rolling upgrades (old computes sending 
updates to new scheduler)

 - how can we test it
 - can we have the feature optional for operators


3. Possibilities to integrate with resource-provider bp series: I know 
resource-provider is the major direction of Nova scheduler, and there 
will be fundamental changes in the future, especially according to the 
bp 
https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst. 
However, this prototype proposes a much faster and compatible way to 
make schedule decisions based on scheduler caches. The in-memory 
decisions are made at the same speed with the caching scheduler, but 
the caches are kept consistent with compute nodes as quickly as 
possible without db refreshing.




That's the key point, thanks for noticing our priorities. So, you know 
that our resource modeling is drastically subject to change in Mitaka 
and Newton. That is the new game, so I'd love to see how you plan to 
interact with that.
Ideally, I'd appreciate it if Jay Pipes, Chris Dent and you could share 
your ideas, because all of you have great ideas to improve a 
currently frustrating solution.


-Sylvain



Here is the detailed design of the mentioned prototype:

>>

Background:

The host state cache maintained by host manager is the scheduler 
resource view during schedule decision making. It is updated whenever 
a request is received[1], and all the compute node records are 
retrieved from db every time. There are several problems in this 
update model, proven in experiments[3]:


1. Performance: The scheduler performance is largely affected by db 
access in retrieving compute node records. The db block time of a 
single request is 355ms in average in the deployment of 3 compute 
nodes, compared with only 3ms in in-memory decision-making. Imagine 
there could be at most 1k nodes, even 10k nodes in the future.


2. Race conditions: This is not only a parallel-scheduler problem, but 
also a problem when using only one scheduler. The detailed analysis of the 
one-scheduler problem is located in the bug analysis[2]. In short, there 
is a gap between when the scheduler makes a decision in the host state cache 
and when the compute node updates its in-db resource record according to that 
decision in the resource tracker. A recent scheduler resource consumption 
in the cache can be lost and overwritten by compute node data because of 
this, resulting in cache inconsistency and unexpected retries. In a 
one-scheduler experiment using a 3-node deployment, there were 7 retries 
out of 31 concurrent schedule requests recorded, resulting in 22.6% 
extra performance overhead. (A toy timeline of this gap appears after 
this list.)


3. Parallel scheduler support: The design of filter scheduler leads to 
an "even worse" performance result using parallel schedulers. In the 
same experiment with 4 schedulers on separate machines, the average db 
block time is increased to 697ms per request and there are 16 retries 
out of 31 schedule requests, namely 51.6% extra overhead.
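
(A toy timeline of the gap described in point 2 above; this is not Nova code, just the sequence of events spelled out.)

    # The scheduler consumes from its in-memory cache, but the next DB refresh
    # overwrites that consumption before the compute node has persisted it.
    cache = {'host1': {'free_ram_mb': 4096}}
    db    = {'host1': {'free_ram_mb': 4096}}   # compute node has not written yet

    # t1: scheduler places a 2048 MB instance and consumes from its cache
    cache['host1']['free_ram_mb'] -= 2048       # cache now says 2048

    # t2: a new request arrives; the filter scheduler refreshes from the DB
    cache['host1'] = dict(db['host1'])          # back to 4096, the consumption is lost

    # t3: the scheduler can now place another large instance on host1;
    #     the compute node will fail the claim and trigger a retry.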


Improvements:

This prototype solved the mentioned issues above by implementing a new 
update model for the scheduler host state cache. Instead of refreshing 
caches from the db, every compute node maintains its accurate version of 
the host state cache, updated by the resource tracker, and sends 
incremental updates directly to the schedulers. So the scheduler caches 
are synchronized to the correct state as soon as possible with the lowest 
overhead. Also, the scheduler will send a resource claim with its decision 
to the target compute node. The compute node can decide whether the 
resource claim is successful immediately by its local host state cache 
and send responses back ASAP. With all the claims tracked from 
schedulers to

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-14 Thread Boris Pavlovic
Yingxin,


Basically, what we implemented was next:

- Scheduler consumes RPC updates from Computes
- Scheduler keeps world state in memory (and each message from compute is
treated like an incremental update)
- Incremental update is shared across multiple instances of schedulers
  (so one message from computes is only consumed once)
- Schema-less host state (to be able to use a single scheduler service for
all resources)

^ All this was done in backward compatible way and it was really easy to
migrate.
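
(A rough sketch of what "schema-less host state" could mean in practice; this is my reading, not the actual no-db-scheduler code: each message is just a partial dict merged into the in-memory world state, so new resource types need no schema change.)

    world_state = {}

    def consume_update(msg):
        # msg example: {'host': 'node-3',
        #               'data': {'disk': {'free_gb': 120}, 'pci': {'gpu': 2}}}
        host = world_state.setdefault(msg['host'], {})
        for group, values in msg['data'].items():
            host.setdefault(group, {}).update(values)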


If this had been accepted, we were planning to work on making the scheduler not
depend on Nova (which is actually quite a simple task after those changes)
and moving that code outside of Nova.

So the solutions are quite similar overall.
I hope you'll have more luck getting them in upstream.


Best regards,
Boris Pavlovic

On Sun, Feb 14, 2016 at 11:08 PM, Cheng, Yingxin <yingxin.ch...@intel.com>
wrote:

> Thanks Boris, the idea is quite similar in “Do not have db accesses during
> scheduler decision making” because db blocks are introduced at the same
> time, this is very bad for the lock-free design of nova scheduler.
>
>
>
> Another important idea is that “Only compute node knows its own final
> compute-node resource view” or “The accurate resource view only exists at
> the place where it is actually consumed.” I.e., The incremental updates can
> only come from the actual “consumption” action, no matter where it is(e.g.
> compute node, storage service, network service, etc.). Borrow the terms
> from resource-provider, compute nodes can maintain its accurate version of
> “compute-node-inventory” cache, and can send incremental updates because it
> actually consumes compute resources, furthermore, storage service can also
> maintain an accurate version of “storage-inventory” cache and send
> incremental updates if it also consumes storage resources. If there are
> central services in charge of consuming all the resources, the accurate
> cache and updates must come from them.
>
>
>
> The third idea is “compatibility”. This prototype focuses on a very small
> scope by only introducing a new host_manager driver “shared_host_manager”
> with minor other changes. The driver can be changed back to “host_manager”
> very easily. It can also run with filter schedulers and caching schedulers.
> Most importantly, the filtering and weighing algorithms are kept unchanged.
> So more changes can be introduced for the complete version of “shared state
> scheduler” because it is evolving in a gradual way.
>
>
>
>
>
> Regards,
>
> -Yingxin
>
>
>
> *From:* Boris Pavlovic [mailto:bo...@pavlovic.me]
> *Sent:* Monday, February 15, 2016 1:59 PM
> *To:* OpenStack Development Mailing List (not for usage questions) <
> openstack-dev@lists.openstack.org>
> *Subject:* Re: [openstack-dev] [nova] A prototype implementation towards
> the "shared state scheduler"
>
>
>
> Yingxin,
>
>
>
> This looks quite similar to the work of this bp:
>
> https://blueprints.launchpad.net/nova/+spec/no-db-scheduler
>
>
>
> It's really nice that somebody is still trying to push scheduler
> refactoring in this way.
>
> Thanks.
>
>
>
> Best regards,
>
> Boris Pavlovic
>
>
>
> On Sun, Feb 14, 2016 at 9:21 PM, Cheng, Yingxin <yingxin.ch...@intel.com>
> wrote:
>
> Hi,
>
>
>
> I’ve uploaded a prototype https://review.openstack.org/#/c/280047/ to
> demonstrate its design goals of accuracy, performance, reliability and
> compatibility improvements. It will also be an Austin Summit Session if
> elected:
> https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316
>
>
>
> I want to gather opinions about this idea:
>
> 1. Is this feature possible to be accepted in the Newton release?
>
> 2. Suggestions to improve its design and compatibility.
>
> 3. Possibilities to integrate with resource-provider bp series: I know
> resource-provider is the major direction of Nova scheduler, and there will
> be fundamental changes in the future, especially according to the bp
> https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
> However, this prototype proposes a much faster and compatible way to make
> schedule decisions based on scheduler caches. The in-memory decisions are
> made at the same speed with the caching scheduler, but the caches are kept
> consistent with compute nodes as quickly as possible without db refreshing.
>
>
>
> Here is the detailed design of the mentioned prototype:
>
>
>
> >>
>
> Background:
>
> The host state cache maintained by host manager is the scheduler resource
> view during schedule decision making. 

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-14 Thread Cheng, Yingxin
Thanks Boris, the idea is quite similar in “Do not have db accesses during 
scheduler decision making”, because db blocks are introduced at the same time; 
this is very bad for the lock-free design of the nova scheduler.

Another important idea is that “Only the compute node knows its own final 
compute-node resource view” or “The accurate resource view only exists at the 
place where it is actually consumed.” I.e., the incremental updates can only 
come from the actual “consumption” action, no matter where it happens (e.g. 
compute node, storage service, network service, etc.). Borrowing the terms from 
resource-provider, compute nodes can maintain their accurate version of the 
“compute-node-inventory” cache, and can send incremental updates because they 
actually consume compute resources; furthermore, the storage service can also 
maintain an accurate version of a “storage-inventory” cache and send incremental 
updates if it also consumes storage resources. If there are central services in 
charge of consuming all the resources, the accurate cache and updates must come 
from them.

The third idea is “compatibility”. This prototype focuses on a very small scope 
by only introducing a new host_manager driver “shared_host_manager” with minor 
other changes. The driver can be changed back to “host_manager” very easily. It 
can also run with filter schedulers and caching schedulers. Most importantly, 
the filtering and weighing algorithms are kept unchanged. So more changes can 
be introduced for the complete version of “shared state scheduler” because it 
is evolving in a gradual way.
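
(In outline, this is a simplified sketch rather than the actual code in the review: the idea is a drop-in subclass of the existing host manager, so the filters and weighers keep working unchanged; _shared_cache below is just a placeholder name for the incrementally updated state.)

    from nova.scheduler import host_manager

    class SharedHostManager(host_manager.HostManager):
        """Host manager fed by incremental updates instead of DB refreshes."""

        def get_all_host_states(self, context):
            # Return host states built from the locally maintained,
            # incrementally updated cache rather than re-reading the DB.
            return iter(self._shared_cache.values())

Switching back would then presumably just be a matter of pointing the scheduler_host_manager option in nova.conf at the stock driver again.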


Regards,
-Yingxin

From: Boris Pavlovic [mailto:bo...@pavlovic.me]
Sent: Monday, February 15, 2016 1:59 PM
To: OpenStack Development Mailing List (not for usage questions) 
<openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [nova] A prototype implementation towards the 
"shared state scheduler"

Yingxin,

This looks quite similar to the work of this bp:
https://blueprints.launchpad.net/nova/+spec/no-db-scheduler

It's really nice that somebody is still trying to push scheduler refactoring in 
this way.
Thanks.

Best regards,
Boris Pavlovic

On Sun, Feb 14, 2016 at 9:21 PM, Cheng, Yingxin 
<yingxin.ch...@intel.com> wrote:
Hi,

I’ve uploaded a prototype https://review.openstack.org/#/c/280047/ to demonstrate 
its design goals of accuracy, performance, reliability and compatibility 
improvements. It will also be an Austin Summit Session if elected: 
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316

I want to gather opinions about this idea:
1. Is this feature possible to be accepted in the Newton release?
2. Suggestions to improve its design and compatibility.
3. Possibilities to integrate with resource-provider bp series: I know 
resource-provider is the major direction of Nova scheduler, and there will be 
fundamental changes in the future, especially according to the bp 
https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
 However, this prototype proposes a much faster and compatible way to make 
schedule decisions based on scheduler caches. The in-memory decisions are made 
at the same speed with the caching scheduler, but the caches are kept 
consistent with compute nodes as quickly as possible without db refreshing.

Here is the detailed design of the mentioned prototype:

>>
Background:
The host state cache maintained by host manager is the scheduler resource view 
during schedule decision making. It is updated whenever a request is 
received[1], and all the compute node records are retrieved from db every time. 
There are several problems in this update model, proven in experiments[3]:
1. Performance: The scheduler performance is largely affected by db access in 
retrieving compute node records. The db block time of a single request is 355ms 
in average in the deployment of 3 compute nodes, compared with only 3ms in 
in-memory decision-making. Imagine there could be at most 1k nodes, even 10k 
nodes in the future.
2. Race conditions: This is not only a parallel-scheduler problem, but also a 
problem using only one scheduler. The detailed analysis of 
one-scheduler-problem is located in bug analysis[2]. In short, there is a gap 
between the scheduler makes a decision in host state cache and the compute node 
updates its in-db resource record according to that decision in resource 
tracker. A recent scheduler resource consumption in cache can be lost and 
overwritten by compute node data because of it, result in cache inconsistency 
and unexpected retries. In a one-scheduler experiment using 3-node deployment, 
there are 7 retries out of 31 concurrent schedule requests recorded, results in 
22.6% extra performance overhead.
3. Parallel scheduler support: The design of filter scheduler leads to an "even 
worse" performance result using parallel scheduler

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-14 Thread Boris Pavlovic
Yingxin,

This looks quite similar to the work of this bp:
https://blueprints.launchpad.net/nova/+spec/no-db-scheduler

It's really nice that somebody is still trying to push scheduler
refactoring in this way.
Thanks.

Best regards,
Boris Pavlovic

On Sun, Feb 14, 2016 at 9:21 PM, Cheng, Yingxin 
wrote:

> Hi,
>
>
>
> I’ve uploaded a prototype https://review.openstack.org/#/c/280047/ to
> demonstrate its design goals of accuracy, performance, reliability and
> compatibility improvements. It will also be an Austin Summit Session if
> elected:
> https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316
>
>
>
> I want to gather opinions about this idea:
>
> 1. Is this feature possible to be accepted in the Newton release?
>
> 2. Suggestions to improve its design and compatibility.
>
> 3. Possibilities to integrate with resource-provider bp series: I know
> resource-provider is the major direction of Nova scheduler, and there will
> be fundamental changes in the future, especially according to the bp
> https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
> However, this prototype proposes a much faster and compatible way to make
> schedule decisions based on scheduler caches. The in-memory decisions are
> made at the same speed with the caching scheduler, but the caches are kept
> consistent with compute nodes as quickly as possible without db refreshing.
>
>
>
> Here is the detailed design of the mentioned prototype:
>
>
>
> >>
>
> Background:
>
> The host state cache maintained by host manager is the scheduler resource
> view during schedule decision making. It is updated whenever a request is
> received[1], and all the compute node records are retrieved from db every
> time. There are several problems in this update model, proven in
> experiments[3]:
>
> 1. Performance: The scheduler performance is largely affected by db access
> in retrieving compute node records. The db block time of a single request
> is 355ms in average in the deployment of 3 compute nodes, compared with
> only 3ms in in-memory decision-making. Imagine there could be at most 1k
> nodes, even 10k nodes in the future.
>
> 2. Race conditions: This is not only a parallel-scheduler problem, but
> also a problem using only one scheduler. The detailed analysis of
> one-scheduler-problem is located in bug analysis[2]. In short, there is a
> gap between the scheduler makes a decision in host state cache and the
>
> compute node updates its in-db resource record according to that decision
> in resource tracker. A recent scheduler resource consumption in cache can
> be lost and overwritten by compute node data because of it, result in cache
> inconsistency and unexpected retries. In a one-scheduler experiment using
> 3-node deployment, there are 7 retries out of 31 concurrent schedule
> requests recorded, results in 22.6% extra performance overhead.
>
> 3. Parallel scheduler support: The design of filter scheduler leads to an
> "even worse" performance result using parallel schedulers. In the same
> experiment with 4 schedulers on separate machines, the average db block
> time is increased to 697ms per request and there are 16 retries out of 31
> schedule requests, namely 51.6% extra overhead.
>
>
>
> Improvements:
>
> This prototype solved the mentioned issues above by implementing a new
> update model for the scheduler host state cache. Instead of refreshing caches
> from the db, every compute node maintains its accurate version of the host state
> cache, updated by the resource tracker, and sends incremental updates
> directly to the schedulers. So the scheduler caches are synchronized to the
> correct state as soon as possible with the lowest overhead. Also, the scheduler
> will send a resource claim with its decision to the target compute node. The
> compute node can decide whether the resource claim is successful
> immediately by its local host state cache and send responses back ASAP. With
> all the claims tracked from schedulers to compute nodes, no false
> overwrites will happen, and thus the gaps between the scheduler cache and the
> real compute node states are minimized. The benefits are obvious in the recorded
> experiments[3] compared with the caching scheduler and filter scheduler:
>
> 1. There is no db block time during scheduler decision making, the average
> decision time per request is about 3ms in both single and multiple
> scheduler scenarios, which is equal to the in-memory decision time of
> filter scheduler and caching scheduler.
>
> 2. Since the scheduler claims are tracked and the "false overwrite" is
> eliminated, there should be 0 retries in one-scheduler deployment, as
> proven in the experiment. Thanks to the quick claim responding
> implementation, there are only 2 retries out of 31 requests in the
> 4-scheduler experiment.
>
> 3. All the filtering and weighing algorithms are compatible because the
> data structure of HostState is