Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-10 Thread Attila Fazekas




- Original Message -
 From: Jay Pipes jaypi...@gmail.com
 To: openstack-dev@lists.openstack.org
 Sent: Wednesday, March 4, 2015 9:22:43 PM
 Subject: Re: [openstack-dev] [nova] blueprint about multiple workers 
 supported in nova-scheduler
 
 On 03/04/2015 01:51 AM, Attila Fazekas wrote:
  Hi,
 
  I wonder what is the planned future of the scheduling.
 
  The scheduler does a lot of queries touching a high number of fields,
  which is CPU expensive when you are using sqlalchemy-orm.
  Has anyone tried to switch those operations to sqlalchemy-core ?
 
 Actually, the scheduler does virtually no SQLAlchemy ORM queries. Almost
 all database access is serialized from the nova-scheduler through the
 nova-conductor service via the nova.objects remoting framework.
 

It does not help you.

  The scheduler does a lot of things in the application, like filtering,
  that could be done on the DB level more efficiently. Why is it not done
  on the DB side ?
 
 That's a pretty big generalization. Many filters (check out NUMA
 configuration, host aggregate extra_specs matching, any of the JSON
 filters, etc) don't lend themselves to SQL column-based sorting and
 filtering.
 

What a basic SQL query can do
and what the limits of SQL are, are two different things.
Even if you do not move everything to the DB side,
the dataset the application needs to deal with could be limited.
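
To make this concrete, a rough sketch (not part of the original mail) of
DB-side pre-filtering with sqlalchemy-core, in modern SQLAlchemy style; the
table shape and the DSN are illustrative assumptions, not the real Nova schema:

from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, select

engine = create_engine("mysql+pymysql://nova:secret@db/nova")  # hypothetical DSN
meta = MetaData()

# Assumed minimal shape of a compute_nodes table -- the real schema differs.
compute_nodes = Table(
    "compute_nodes", meta,
    Column("hypervisor_hostname", String(255)),
    Column("memory_mb", Integer),
    Column("memory_mb_used", Integer),
    Column("vcpus", Integer),
    Column("vcpus_used", Integer),
)

def candidate_hosts(conn, flavor):
    # Cheap, indexable predicates go into SQL; the complex filters (NUMA,
    # JSON matching, ...) can still run in the application afterwards,
    # over a much smaller dataset.
    cn = compute_nodes.c
    stmt = select(cn.hypervisor_hostname).where(
        cn.memory_mb - cn.memory_mb_used >= flavor["memory_mb"],
        cn.vcpus - cn.vcpus_used >= flavor["vcpus"],
    )
    return conn.execute(stmt).fetchall()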

  There are use cases where the scheduler would need to know even more data.
  Is there a plan for keeping `everything` in every scheduler process's memory
  up-to-date ?
  (Maybe zookeeper)
 
 Zookeeper has nothing to do with scheduling decisions -- only whether or
 not a compute node's service descriptor is active or not. The end goal
 (after splitting the Nova scheduler out into Gantt hopefully at the
 start of the L release cycle) is to have the Gantt database be more
 optimized to contain the resource usage amounts of all resources
 consumed in the entire cloud, and to use partitioning/sharding to scale
 the scheduler subsystem, instead of having each scheduler process handle
 requests for all resources in the cloud (or cell...)
 
What the current optional usage of zookeeper is,
and what it could be used for, are two very different things.
Resource tracking is possible.

  The opposite way would be to move most operations to the DB side,
  since the DB already knows everything.
  (stored procedures ?)
 
 See above. This assumes that the data the scheduler is iterating over is
 well-structured and consistent, and that is a false assumption.

With stored procedures you can do almost anything,
and in many cases it is more readable than a complex query.
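
As an illustration only - the procedure, table and columns below are made-up
stand-ins, not Nova's schema - a host-claim stored procedure in MySQL,
installed and called through SQLAlchemy, could look roughly like this:

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://nova:secret@db/nova")  # hypothetical DSN

DDL = """
CREATE PROCEDURE claim_host(IN host VARCHAR(255), IN mem INT)
BEGIN
  -- claim memory only if the host still has room; one server-side step
  UPDATE compute_nodes
     SET memory_mb_used = memory_mb_used + mem
   WHERE hypervisor_hostname = host
     AND memory_mb - memory_mb_used >= mem;
  -- 1 row updated means the claim succeeded, 0 means pick another host
  SELECT ROW_COUNT() AS claimed;
END
"""

with engine.begin() as conn:
    conn.execute(text(DDL))

with engine.connect() as conn:
    claimed = conn.execute(text("CALL claim_host(:h, :m)"),
                           {"h": "node-1", "m": 2048}).scalar()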

 
 Best,
 -jay
 
  Best Regards,
  Attila
 
 
  - Original Message -
  From: Rui Chen chenrui.m...@gmail.com
  To: OpenStack Development Mailing List (not for usage questions)
  openstack-dev@lists.openstack.org
  Sent: Wednesday, March 4, 2015 4:51:07 AM
  Subject: [openstack-dev] [nova] blueprint about multiple workers supported
 in nova-scheduler
 
  Hi all,
 
  I want to make it easy to launch a bunch of scheduler processes on a host;
  multiple scheduler workers will make use of the host's multiple processors
  and enhance the performance of nova-scheduler.
  
  I have registered a blueprint and committed a patch to implement it.
  https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support
  
  This patch has been applied in our performance environment and passed some
  test cases, like concurrently booting multiple instances; so far we didn't
  find any inconsistency issues.
  
  IMO, nova-scheduler should be easy to scale horizontally; multiple workers
  should be supported as an out-of-the-box feature.
 
  Please feel free to discuss this feature, thanks.
 
  Best Regards
 
 
 

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-10 Thread Nikola Đipanov
On 03/06/2015 03:19 PM, Attila Fazekas wrote:
 Looks like we need some kind of _per compute node_ mutex in the critical
 section; multiple schedulers MAY be able to schedule to two compute nodes
 at the same time, but not to the same compute node.
 
 If we don't want to introduce another required component or
 reinvent the wheel, there are some possible tricks with the existing
 globally visible components, like the RDBMS.
 
 A `randomized` destination choice is recommended in most of the possible
 solutions; the alternatives are much more complex.
 
 One SQL example:
 
 * Add a `sched_cnt` (default=0, Integer) field to a hypervisors-related table.
 
 When the scheduler picks one (or multiple) node(s), it needs to verify that
 the node(s) are still good before sending the message to the n-cpu.
 
 It can be done by re-reading ONLY the picked hypervisor(s)' related data
 with `LOCK IN SHARE MODE`.
 If the destination hypervisors are still OK:
 
 Increase the sched_cnt value by exactly 1,
 and test whether the UPDATE really updated the required number of rows;
 the WHERE part needs to contain the previous value.
 
 You also need to update the resource usage on the hypervisor
 by the expected cost of the new VMs.
 
 If at least one selected node was OK, the transaction can be COMMITed.
 If you were able to COMMIT the transaction, the relevant messages
 can be sent.
 
 The whole process needs to be repeated with the items which did not pass
 the post-verification.
 
 If message sending failed, `act like` migrating the VM to another host.
 
 If multiple schedulers try to pick multiple different hosts in different
 orders, it can lead to a DEADLOCK situation.
 Solution: try to have all schedulers acquire the shared RW locks in the
 same order, at the end.
 
 Galera multi-writer (Active-Active) implication:
 As always, retry on deadlock.
 
 n-sch + n-cpu crash at the same time:
 * If the scheduling was not finished properly, it might be fixed manually,
 or we need to decide which still-alive scheduler instance is
 responsible for fixing the particular scheduling.
 

So if I am reading the above correctly - you are basically proposing to
move claims to the scheduler: we would atomically check if there were
changes since the time we picked the host, with the UPDATE .. WHERE using
LOCK IN SHARE MODE (assuming REPEATABLE READ is the isolation level used),
and then update the usage - a.k.a. doing the claim in the same
transaction.

The issue here is that we still have a window between sending the
message, and the message getting picked up by the compute host (or
timing out) or the instance outright failing, so for sure we will need
to ack/nack the claim in some way on the compute side.

I believe something like this has come up before under the umbrella term
of moving claims to the scheduler, and was discussed in some detail on
the latest Nova mid-cycle meetup, but the only artifacts I could find were a
few lines on this etherpad Sylvain pointed me to [1] that I am copying here:


* White board the scheduler service interface
 ** note: this design won't change the existing way/logic of reconciling
nova db != hypervisor view
 ** gantt should just return claim ids, not entire claim objects
 ** claims are acked as being in use via the resource tracker updates
from nova-compute
 ** we still need scheduler retries for exceptional situations (admins
doing things outside openstack, hardware changes / failures)
 ** retry logic in conductor? probably a separate item/spec


As you can see - not much to go on (but that is material for a separate
thread that I may start soon).

The problem I have with this particular approach is that while it claims
to fix some of the races (and probably does), it does so by 1) turning
the current scheduling mechanism on its head and 2) not giving any
thought to the trade-offs that it makes. For example, we may get
more correct scheduling in the general case, and the correctness will not
be affected by the number of workers, but how does the fact that we now
do locking DB access on every request fare against the retry mechanism
for some of the more common usage patterns? What is the increased
overhead of calling back to the scheduler to confirm the claim? In the
end - how do we even measure that we are going in the right direction
with the new design?

I personally think that different workloads will have different needs
from the scheduler in terms of response times and tolerance to failure,
and that we need to design for that. So, as an example, a cloud operator
with very simple scheduling requirements may want to go for the no-locking
approach and optimize for response times, allowing for a small
number of instances to fail under high load/utilization due to retries,
while others with more complicated scheduling requirements, or less
tolerance for data inconsistency, might want to trade in response times
by doing locking claims in the scheduler. Some similar trade-offs and
how to 

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-10 Thread Attila Fazekas




- Original Message -
 From: Nikola Đipanov ndipa...@redhat.com
 To: openstack-dev@lists.openstack.org
 Sent: Tuesday, March 10, 2015 10:53:01 AM
 Subject: Re: [openstack-dev] [nova] blueprint about multiple workers 
 supported in nova-scheduler
 
 On 03/06/2015 03:19 PM, Attila Fazekas wrote:
  Looks like we need some kind of _per compute node_ mutex in the critical
  section; multiple schedulers MAY be able to schedule to two compute nodes
  at the same time, but not to the same compute node.
  
  If we don't want to introduce another required component or
  reinvent the wheel, there are some possible tricks with the existing
  globally visible components, like the RDBMS.
  
  A `randomized` destination choice is recommended in most of the possible
  solutions; the alternatives are much more complex.
  
  One SQL example:
  
  * Add a `sched_cnt` (default=0, Integer) field to a hypervisors-related table.
  
  When the scheduler picks one (or multiple) node(s), it needs to verify that
  the node(s) are still good before sending the message to the n-cpu.
  
  It can be done by re-reading ONLY the picked hypervisor(s)' related data
  with `LOCK IN SHARE MODE`.
  If the destination hypervisors are still OK:
  
  Increase the sched_cnt value by exactly 1,
  and test whether the UPDATE really updated the required number of rows;
  the WHERE part needs to contain the previous value.
  
  You also need to update the resource usage on the hypervisor
  by the expected cost of the new VMs.
  
  If at least one selected node was OK, the transaction can be COMMITed.
  If you were able to COMMIT the transaction, the relevant messages
  can be sent.
  
  The whole process needs to be repeated with the items which did not pass
  the post-verification.
  
  If message sending failed, `act like` migrating the VM to another host.
  
  If multiple schedulers try to pick multiple different hosts in different
  orders, it can lead to a DEADLOCK situation.
  Solution: try to have all schedulers acquire the shared RW locks in the
  same order, at the end.
  
  Galera multi-writer (Active-Active) implication:
  As always, retry on deadlock.
  
  n-sch + n-cpu crash at the same time:
  * If the scheduling was not finished properly, it might be fixed manually,
  or we need to decide which still-alive scheduler instance is
  responsible for fixing the particular scheduling.
  
 
 So if I am reading the above correctly - you are basically proposing to
 move claims to the scheduler: we would atomically check if there were
 changes since the time we picked the host, with the UPDATE .. WHERE using
 LOCK IN SHARE MODE (assuming REPEATABLE READ is the isolation level used),
 and then update the usage - a.k.a. doing the claim in the same
 transaction.
 
 The issue here is that we still have a window between sending the
 message, and the message getting picked up by the compute host (or
 timing out) or the instance outright failing, so for sure we will need
 to ack/nack the claim in some way on the compute side.
 
 I believe something like this has come up before under the umbrella term
 of moving claims to the scheduler, and was discussed in some detail on
 the latest Nova mid-cycle meetup, but the only artifacts I could find were a
 few lines on this etherpad Sylvain pointed me to [1] that I am copying here:
 

 
 * White board the scheduler service interface
  ** note: this design won't change the existing way/logic of reconciling
 nova db != hypervisor view
  ** gantt should just return claim ids, not entire claim objects
  ** claims are acked as being in use via the resource tracker updates
 from nova-compute
  ** we still need scheduler retries for exceptional situations (admins
 doing things outside openstack, hardware changes / failures)
  ** retry logic in conductor? probably a separate item/spec
 
 
 As you can see - not much to go on (but that is material for a separate
 thread that I may start soon).

In my example, the resource needs to be considered as used before we get
anything back from the compute.
The resource can be `freed` during error handling,
hopefully by migrating to another node.
 
 The problem I have with this particular approach is that while it claims
 to fix some of the races (and probably does), it does so by 1) turning
 the current scheduling mechanism on its head and 2) not giving any
 thought to the trade-offs that it makes. For example, we may get
 more correct scheduling in the general case, and the correctness will not
 be affected by the number of workers, but how does the fact that we now
 do locking DB access on every request fare against the retry mechanism
 for some of the more common usage patterns? What is the increased
 overhead of calling back to the scheduler to confirm the claim? In the
 end - how do we even measure that we are going in the right direction
 with the new design?
 
 I personally think that different workloads will have different needs
 from the scheduler

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-10 Thread Attila Fazekas




- Original Message -
 From: Attila Fazekas afaze...@redhat.com
 To: OpenStack Development Mailing List (not for usage questions) 
 openstack-dev@lists.openstack.org
 Sent: Tuesday, March 10, 2015 12:48:00 PM
 Subject: Re: [openstack-dev] [nova] blueprint about multiple workers 
 supported in nova-scheduler
 
 
 
 
 
 - Original Message -
  From: Nikola Đipanov ndipa...@redhat.com
  To: openstack-dev@lists.openstack.org
  Sent: Tuesday, March 10, 2015 10:53:01 AM
  Subject: Re: [openstack-dev] [nova] blueprint about multiple workers
  supported in nova-scheduler
  
  On 03/06/2015 03:19 PM, Attila Fazekas wrote:
   Looks like we need some kind of _per compute node_ mutex in the critical
   section; multiple schedulers MAY be able to schedule to two compute nodes
   at the same time, but not to the same compute node.
   
   If we don't want to introduce another required component or
   reinvent the wheel, there are some possible tricks with the existing
   globally visible components, like the RDBMS.
   
   A `randomized` destination choice is recommended in most of the possible
   solutions; the alternatives are much more complex.
   
   One SQL example:
   
   * Add a `sched_cnt` (default=0, Integer) field to a hypervisors-related
   table.
   
   When the scheduler picks one (or multiple) node(s), it needs to verify
   that the node(s) are still good before sending the message to the n-cpu.
   
   It can be done by re-reading ONLY the picked hypervisor(s)' related data
   with `LOCK IN SHARE MODE`.
   If the destination hypervisors are still OK:
   
   Increase the sched_cnt value by exactly 1,
   and test whether the UPDATE really updated the required number of rows;
   the WHERE part needs to contain the previous value.
   
   You also need to update the resource usage on the hypervisor
   by the expected cost of the new VMs.
   
   If at least one selected node was OK, the transaction can be COMMITed.
   If you were able to COMMIT the transaction, the relevant messages
   can be sent.
   
   The whole process needs to be repeated with the items which did not pass
   the post-verification.
   
   If message sending failed, `act like` migrating the VM to another host.
   
   If multiple schedulers try to pick multiple different hosts in different
   orders, it can lead to a DEADLOCK situation.
   Solution: try to have all schedulers acquire the shared RW locks in the
   same order, at the end.
   
   Galera multi-writer (Active-Active) implication:
   As always, retry on deadlock.
   
   n-sch + n-cpu crash at the same time:
   * If the scheduling was not finished properly, it might be fixed manually,
   or we need to decide which still-alive scheduler instance is
   responsible for fixing the particular scheduling.
   
  
  So if I am reading the above correctly - you are basically proposing to
  move claims to the scheduler: we would atomically check if there were
  changes since the time we picked the host, with the UPDATE .. WHERE using
  LOCK IN SHARE MODE (assuming REPEATABLE READ is the isolation level used),
  and then update the usage - a.k.a. doing the claim in the same
  transaction.
  
  The issue here is that we still have a window between sending the
  message, and the message getting picked up by the compute host (or
  timing out) or the instance outright failing, so for sure we will need
  to ack/nack the claim in some way on the compute side.
  
  I believe something like this has come up before under the umbrella term
  of moving claims to the scheduler, and was discussed in some detail on
  the latest Nova mid-cycle meetup, but the only artifacts I could find were a
  few lines on this etherpad Sylvain pointed me to [1] that I am copying
  here:
  
 
  
  * White board the scheduler service interface
   ** note: this design won't change the existing way/logic of reconciling
  nova db != hypervisor view
   ** gantt should just return claim ids, not entire claim objects
   ** claims are acked as being in use via the resource tracker updates
  from nova-compute
   ** we still need scheduler retries for exceptional situations (admins
  doing things outside openstack, hardware changes / failures)
   ** retry logic in conductor? probably a separate item/spec
  
  
  As you can see - not much to go on (but that is material for a separate
  thread that I may start soon).
 
 In my example, the resource needs to be considered as used before we get
 anything back from the compute.
 The resource can be `freed` during error handling,
 hopefully by migrating to another node.
  
  The problem I have with this particular approach is that while it claims
  to fix some of the races (and probably does), it does so by 1) turning
  the current scheduling mechanism on its head and 2) not giving any
  thought to the trade-offs that it makes. For example, we may get
  more correct scheduling in the general case, and the correctness will not
  be affected

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-06 Thread Attila Fazekas
Looks like we need some kind of _per compute node_ mutex in the critical
section; multiple schedulers MAY be able to schedule to two compute nodes
at the same time, but not to the same compute node.

If we don't want to introduce another required component or
reinvent the wheel, there are some possible tricks with the existing
globally visible components, like the RDBMS.

A `randomized` destination choice is recommended in most of the possible
solutions; the alternatives are much more complex.

One SQL example:

* Add a `sched_cnt` (default=0, Integer) field to a hypervisors-related table.

When the scheduler picks one (or multiple) node(s), it needs to verify that
the node(s) are still good before sending the message to the n-cpu.

It can be done by re-reading ONLY the picked hypervisor(s)' related data
with `LOCK IN SHARE MODE`.
If the destination hypervisors are still OK:

Increase the sched_cnt value by exactly 1,
and test whether the UPDATE really updated the required number of rows;
the WHERE part needs to contain the previous value.

You also need to update the resource usage on the hypervisor
by the expected cost of the new VMs.

If at least one selected node was OK, the transaction can be COMMITed.
If you were able to COMMIT the transaction, the relevant messages
can be sent.

The whole process needs to be repeated with the items which did not pass
the post-verification.

If message sending failed, `act like` migrating the VM to another host.

If multiple schedulers try to pick multiple different hosts in different
orders, it can lead to a DEADLOCK situation.
Solution: try to have all schedulers acquire the shared RW locks in the
same order, at the end.

Galera multi-writer (Active-Active) implication:
As always, retry on deadlock.

n-sch + n-cpu crash at the same time:
* If the scheduling was not finished properly, it might be fixed manually,
or we need to decide which still-alive scheduler instance is
responsible for fixing the particular scheduling.
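
To make the example concrete, a rough sketch of the claim transaction
described above, in modern SQLAlchemy style; table and column names are
illustrative stand-ins, not the real Nova schema:

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://nova:secret@db/nova")  # hypothetical DSN

def try_claim(host, mem_mb):
    """Return True if the claim on `host` succeeded, False if we must retry."""
    try:
        with engine.begin() as conn:  # one transaction, COMMIT on success
            # Re-read ONLY the picked hypervisor's row, under a shared lock.
            row = conn.execute(
                text("SELECT sched_cnt, memory_mb, memory_mb_used"
                     " FROM compute_nodes WHERE hypervisor_hostname = :h"
                     " LOCK IN SHARE MODE"),
                {"h": host},
            ).one()
            if row.memory_mb - row.memory_mb_used < mem_mb:
                return False  # host is no longer good, pick another one
            # Bump sched_cnt by exactly 1; the WHERE clause carries the
            # previously read value, so a concurrent claim makes this a no-op.
            res = conn.execute(
                text("UPDATE compute_nodes"
                     " SET sched_cnt = sched_cnt + 1,"
                     "     memory_mb_used = memory_mb_used + :m"
                     " WHERE hypervisor_hostname = :h AND sched_cnt = :c"),
                {"m": mem_mb, "h": host, "c": row.sched_cnt},
            )
            return res.rowcount == 1
    except Exception:
        # e.g. a (Galera) deadlock or certification failure:
        # as always, retry on deadlock
        return False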


- Original Message -
 From: Nikola Đipanov ndipa...@redhat.com
 To: openstack-dev@lists.openstack.org
 Sent: Friday, March 6, 2015 10:29:52 AM
 Subject: Re: [openstack-dev] [nova] blueprint about multiple workers 
 supported in nova-scheduler
 
 On 03/06/2015 01:56 AM, Rui Chen wrote:
  Thank you very much for the in-depth discussion about this topic, @Nikola
  and @Sylvain.
  
  I agree that we should solve the technical debt first, and then make
  the scheduler better.
  
 
 That was not necessarily my point.
 
 I would be happy to see work on how to make the scheduler less volatile
 when run in parallel, but the solution must acknowledge the eventually
 (or never really) consistent nature of the data the scheduler has to operate
 on (in its current design - there is also the possibility of offering
 an alternative design).
 
 I'd say that fixing the technical debt that is aimed at splitting the
 scheduler out of Nova is a mostly orthogonal effort.
 
 There have been several proposals in the past for how to make the
 scheduler horizontally scalable and improve its performance. One that I
 remember from the Atlanta summit time-frame was the work done by Boris
 and his team [1] (they actually did some profiling and based their work
 on the bottlenecks they found). There are also some nice ideas in the
 bug lifeless filed [2] since this behaviour particularly impacts ironic.
 
 N.
 
 [1] https://blueprints.launchpad.net/nova/+spec/no-db-scheduler
 [2] https://bugs.launchpad.net/nova/+bug/1341420
 
 
  Best Regards.
  
  2015-03-05 21:12 GMT+08:00 Sylvain Bauza sba...@redhat.com:
  
  
  On 05/03/2015 13:00, Nikola Đipanov wrote:
  
  On 03/04/2015 09:23 AM, Sylvain Bauza wrote:
  
  On 04/03/2015 04:51, Rui Chen wrote:
  
  Hi all,
  
  I want to make it easy to launch a bunch of scheduler processes on a
  host; multiple scheduler workers will make use of the host's multiple
  processors and enhance the performance of nova-scheduler.
  
  I have registered a blueprint and committed a patch to implement it.
  
  https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support
  
  This patch has been applied in our performance environment and passed
  some test cases, like concurrently booting multiple instances; so far
  we didn't find any inconsistency issues.
  
  IMO, nova-scheduler should be easy to scale horizontally; multiple
  workers should be supported as an out-of-the-box feature.
  
  Please feel free to discuss this feature, thanks

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-06 Thread Attila Fazekas




- Original Message -
 From: Attila Fazekas afaze...@redhat.com
 To: OpenStack Development Mailing List (not for usage questions) 
 openstack-dev@lists.openstack.org
 Sent: Friday, March 6, 2015 4:19:18 PM
 Subject: Re: [openstack-dev] [nova] blueprint about multiple workers 
 supported in nova-scheduler
 
 Looks like we need some kind of _per compute node_ mutex in the critical
 section; multiple schedulers MAY be able to schedule to two compute nodes
 at the same time, but not to the same compute node.
 
 If we don't want to introduce another required component or
 reinvent the wheel, there are some possible tricks with the existing
 globally visible components, like the RDBMS.
 
 A `randomized` destination choice is recommended in most of the possible
 solutions; the alternatives are much more complex.
 
 One SQL example:
 
 * Add a `sched_cnt` (default=0, Integer) field to a hypervisors-related table.
 
 When the scheduler picks one (or multiple) node(s), it needs to verify that
 the node(s) are still good before sending the message to the n-cpu.
 
 It can be done by re-reading ONLY the picked hypervisor(s)' related data
 with `LOCK IN SHARE MODE`.
 If the destination hypervisors are still OK:
 
 Increase the sched_cnt value by exactly 1,
 and test whether the UPDATE really updated the required number of rows;
 the WHERE part needs to contain the previous value.

This part is very likely not needed if all the schedulers need
to update the same field for the same host, and they
acquire the RW lock for reading before upgrading it to a WRITE lock.

Another strategy might consider pre-acquiring the write lock only,
but the write intent is not certain before we re-read and verify the data.
 
 
 You also need to update the resource usage on the hypervisor
 by the expected cost of the new VMs.
 
 If at least one selected node was OK, the transaction can be COMMITed.
 If you were able to COMMIT the transaction, the relevant messages
 can be sent.
 
 The whole process needs to be repeated with the items which did not pass
 the post-verification.
 
 If message sending failed, `act like` migrating the VM to another host.
 
 If multiple schedulers try to pick multiple different hosts in different
 orders, it can lead to a DEADLOCK situation.
 Solution: try to have all schedulers acquire the shared RW locks in the
 same order, at the end.
 
 Galera multi-writer (Active-Active) implication:
 As always, retry on deadlock.
 
 n-sch + n-cpu crash at the same time:
 * If the scheduling was not finished properly, it might be fixed manually,
 or we need to decide which still-alive scheduler instance is
 responsible for fixing the particular scheduling.
 
 
 - Original Message -
  From: Nikola Đipanov ndipa...@redhat.com
  To: openstack-dev@lists.openstack.org
  Sent: Friday, March 6, 2015 10:29:52 AM
  Subject: Re: [openstack-dev] [nova] blueprint about multiple workers
  supported in nova-scheduler
  
  On 03/06/2015 01:56 AM, Rui Chen wrote:
   Thank you very much for the in-depth discussion about this topic, @Nikola
   and @Sylvain.
   
   I agree that we should solve the technical debt first, and then make
   the scheduler better.
   
  
  That was not necessarily my point.
  
  I would be happy to see work on how to make the scheduler less volatile
  when run in parallel, but the solution must acknowledge the eventually
  (or never really) consistent nature of the data the scheduler has to operate
  on (in its current design - there is also the possibility of offering
  an alternative design).
  
  I'd say that fixing the technical debt that is aimed at splitting the
  scheduler out of Nova is a mostly orthogonal effort.
  
  There have been several proposals in the past for how to make the
  scheduler horizontally scalable and improve its performance. One that I
  remember from the Atlanta summit time-frame was the work done by Boris
  and his team [1] (they actually did some profiling and based their work
  on the bottlenecks they found). There are also some nice ideas in the
  bug lifeless filed [2] since this behaviour particularly impacts ironic.
  
  N.
  
  [1] https://blueprints.launchpad.net/nova/+spec/no-db-scheduler
  [2] https://bugs.launchpad.net/nova/+bug/1341420
  
  
   Best Regards.
   
   2015-03-05 21:12 GMT+08:00 Sylvain Bauza sba...@redhat.com:
   
   
   On 05/03/2015 13:00, Nikola Đipanov wrote:
   
   On 03/04/2015 09:23 AM, Sylvain Bauza wrote:
   
   On 04/03/2015 04:51, Rui Chen wrote:
   
   Hi all,
   
   I want to make it easy to launch a bunch of scheduler processes on a
   host; multiple scheduler workers will make use of the host's multiple
   processors and enhance the performance of nova-scheduler.
   
   I have registered a blueprint and committed a patch to
   implement

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-06 Thread Mehdi Abaakouk


Hi, just some oslo.messaging thoughts about having multiple 
nova-scheduler processes (can also apply to any other daemon acting as 
rpc server),


nova-scheduler uses service.Service.create() to create an rpc server; that
one is identified by a 'topic' and a 'server' (the
oslo.messaging.Target).
Creating multiple workers like [1] does will result in all workers
sharing the same identity. This is usually because the 'server' is
set to the 'hostname', to make our life easier.
With rabbitmq for example, the 'server' attribute of the
oslo.messaging.Target is used for a queue name; you usually have the
following queues created:


scheduler
scheduler.scheduler-node-1
scheduler.scheduler-node-2
scheduler.scheduler-node-3
...

Keeping things as-is will result in messages that go to
scheduler.scheduler-node-1 being processed randomly by the first ready
worker. You will not be able to identify workers from the amqp point of
view.
The side effect of that is that if a worker gets stuck (bug or whatever) and
doesn't consume messages anymore, we will not be able to see it. One of
the other workers will continue to report that scheduler-node-1 works and
consume new messages even if all of them are dead/stuck except one.


So I think that each rpc server (each worker) should have a different
'server', to get amqp queues like this:


scheduler
scheduler.scheduler-node-1-worker-1
scheduler.scheduler-node-1-worker-2
scheduler.scheduler-node-1-worker-3
scheduler.scheduler-node-2-worker-1
scheduler.scheduler-node-2-worker-2
scheduler.scheduler-node-3-worker-1
scheduler.scheduler-node-3-worker-2
...
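
A minimal sketch of what that could look like with oslo.messaging; the
per-worker naming scheme here is an illustration, not an existing option:

import socket

from oslo_config import cfg
import oslo_messaging

def start_scheduler_rpc_server(endpoints, worker_id):
    transport = oslo_messaging.get_transport(cfg.CONF)
    # Each worker advertises its own 'server', so it gets its own queue
    # (e.g. scheduler.scheduler-node-1-worker-2) and a stuck worker stays
    # visible from the amqp side instead of hiding behind a shared queue.
    target = oslo_messaging.Target(
        topic="scheduler",
        server="%s-worker-%d" % (socket.gethostname(), worker_id),
    )
    server = oslo_messaging.get_rpc_server(transport, target, endpoints,
                                           executor="eventlet")
    server.start()
    return server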

Cheers,


[1] https://review.openstack.org/#/c/159382/
---
Mehdi Abaakouk
mail: sil...@sileht.net
irc: sileht

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-06 Thread Nikola Đipanov
On 03/06/2015 01:56 AM, Rui Chen wrote:
 Thank you very much for the in-depth discussion about this topic, @Nikola
 and @Sylvain.
 
 I agree that we should solve the technical debt first, and then make
 the scheduler better.
 

That was not necessarily my point.

I would be happy to see work on how to make the scheduler less volatile
when run in parallel, but the solution must acknowledge the eventually
(or never really) consistent nature of the data the scheduler has to operate
on (in its current design - there is also the possibility of offering
an alternative design).

I'd say that fixing the technical debt that is aimed at splitting the
scheduler out of Nova is a mostly orthogonal effort.

There have been several proposals in the past for how to make the
scheduler horizontally scalable and improve its performance. One that I
remember from the Atlanta summit time-frame was the work done by Boris
and his team [1] (they actually did some profiling and based their work
on the bottlenecks they found). There are also some nice ideas in the
bug lifeless filed [2] since this behaviour particularly impacts ironic.

N.

[1] https://blueprints.launchpad.net/nova/+spec/no-db-scheduler
[2] https://bugs.launchpad.net/nova/+bug/1341420


 Best Regards.
 
 2015-03-05 21:12 GMT+08:00 Sylvain Bauza sba...@redhat.com:
 
 
 On 05/03/2015 13:00, Nikola Đipanov wrote:
 
 On 03/04/2015 09:23 AM, Sylvain Bauza wrote:
 
  On 04/03/2015 04:51, Rui Chen wrote:
  
  Hi all,
  
  I want to make it easy to launch a bunch of scheduler processes on a
  host; multiple scheduler workers will make use of the host's multiple
  processors and enhance the performance of nova-scheduler.
  
  I have registered a blueprint and committed a patch to implement it.
  
  https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support
  
  This patch has been applied in our performance environment and passed
  some test cases, like concurrently booting multiple instances; so far
  we didn't find any inconsistency issues.
  
  IMO, nova-scheduler should be easy to scale horizontally; multiple
  workers should be supported as an out-of-the-box feature.
  
  Please feel free to discuss this feature, thanks.
 
 
  As I said when reviewing your patch, I think the problem is not just
  making sure that the scheduler is thread-safe, it's more about how the
  Scheduler is accounting resources and providing a retry if those
  consumed resources are higher than what's available.
  
  Here, the main problem is that two workers can actually consume two
  distinct resources on the same HostState object. In that case, the
  HostState object is decremented by the number of taken resources
  (modulo what a resource which is not an Integer means...) for both,
  but nowhere in that section does it check whether it overrides the
  resource usage. As I said, it's not just about decorating a semaphore,
  it's more about rethinking how the Scheduler is managing its resources.
  
  
  That's why I'm -1 on your patch until [1] gets merged. Once this BP is
  implemented, we will have a set of classes for managing heterogeneous
  types of resources and consuming them, so it would be quite easy to
  provide a check against them in the consume_from_instance() method.
 
  I feel that the above explanation does not give the full picture in
  addition to being factually incorrect in several places. I have come to
  realize that the current behaviour of the scheduler is subtle enough
  that just reading the code is not enough to understand all the edge
  cases that can come up. The evidence being that it trips up even people
  that have spent significant time working on the code.
  
  It is also important to consider the design choices in terms of
  tradeoffs that they were trying to make.
  
  So here are some facts about the way Nova does scheduling of instances
  to compute hosts, considering the amount of resources requested by the
  flavor (we will try to put the facts into a bigger picture later):
  
  * Scheduler receives a request to choose hosts for one or more
  instances.
  * Upon every request 

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-05 Thread Rui Chen
Thank you very much for the in-depth discussion about this topic, @Nikola and
@Sylvain.

I agree that we should solve the technical debt first, and then make the
scheduler better.

Best Regards.

2015-03-05 21:12 GMT+08:00 Sylvain Bauza sba...@redhat.com:


 On 05/03/2015 13:00, Nikola Đipanov wrote:

  On 03/04/2015 09:23 AM, Sylvain Bauza wrote:

 On 04/03/2015 04:51, Rui Chen wrote:

 Hi all,

 I want to make it easy to launch a bunch of scheduler processes on a
 host; multiple scheduler workers will make use of the host's multiple
 processors and enhance the performance of nova-scheduler.

 I have registered a blueprint and committed a patch to implement it.
 https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support

 This patch has been applied in our performance environment and passed some
 test cases, like concurrently booting multiple instances; so far we
 didn't find any inconsistency issues.

 IMO, nova-scheduler should be easy to scale horizontally; multiple
 workers should be supported as an out-of-the-box feature.

 Please feel free to discuss this feature, thanks.


 As I said when reviewing your patch, I think the problem is not just
 making sure that the scheduler is thread-safe, it's more about how the
 Scheduler is accounting resources and providing a retry if those
 consumed resources are higher than what's available.

 Here, the main problem is that two workers can actually consume two
 distinct resources on the same HostState object. In that case, the
 HostState object is decremented by the number of taken resources (modulo
 what a resource which is not an Integer means...) for both, but nowhere
 in that section does it check whether it overrides the resource usage. As
 I said, it's not just about decorating a semaphore, it's more about
 rethinking how the Scheduler is managing its resources.


 That's why I'm -1 on your patch until [1] gets merged. Once this BP is
 implemented, we will have a set of classes for managing heterogeneous
 types of resources and consuming them, so it would be quite easy to provide
 a check against them in the consume_from_instance() method.

  I feel that the above explanation does not give the full picture in
 addition to being factually incorrect in several places. I have come to
 realize that the current behaviour of the scheduler is subtle enough
 that just reading the code is not enough to understand all the edge
 cases that can come up. The evidence being that it trips up even people
 that have spent significant time working on the code.

 It is also important to consider the design choices in terms of
 tradeoffs that they were trying to make.

 So here are some facts about the way Nova does scheduling of instances
 to compute hosts, considering the amount of resources requested by the
 flavor (we will try to put the facts into a bigger picture later):

 * Scheduler receives a request to choose hosts for one or more instances.
 * Upon every request (_not_ for every instance as there may be several
 instances in a request) the scheduler learns the state of the resources
 on all compute nodes from the central DB. This state may be inaccurate
 (meaning out of date).
 * Compute resources are updated by each compute host periodically. This
 is done by updating the row in the DB.
 * The wall-clock time difference between the scheduler deciding to
 schedule an instance, and the resource consumption being reflected in
 the data the scheduler learns from the DB can be arbitrarily long (due
 to load on the compute nodes and latency of message arrival).
 * To cope with the above, there is a concept of retrying the request
 that fails on a certain compute node due to the scheduling decision
 being made with data stale at the moment of build, by default we will
 retry 3 times before giving up.
 * When running multiple instances, decisions are made in a loop, and
 internal in-memory view of the resources gets updated (the widely
 misunderstood consume_from_instance method is used for this), so as to
 keep subsequent decisions as accurate as possible. As was described
 above, this is all thrown away once the request is finished.

 Now that we understand the above, we can start to consider what changes
 when we introduce several concurrent scheduler processes.

 Several cases come to mind:
 * Concurrent requests will no longer be serialized on reading the state
 of all hosts (due to how eventlet interacts with mysql driver).
 * In the presence of a single request for a large number of instances
 there is going to be a drift in accuracy of the decisions made by other
 schedulers as they will not have accounted for any of the instances
 until they actually get claimed on their respective hosts.

 All of the above limitations will likely not pose a problem under normal
 load and usage and can cause issues to start appearing when nodes are
 close to full or when there is heavy load. Also this changes drastically
 based on how we actually choose to 

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-05 Thread Sylvain Bauza


On 05/03/2015 08:54, Rui Chen wrote:
We will face the same issue in the multiple nova-scheduler process case,
like Sylvain says, right?


Two processes/workers can actually consume two distinct resources on 
the same HostState.




No. The problem I mentioned was related to having multiple threads 
accessing the same object in memory.
By running multiple schedulers on different hosts and listening to the 
same RPC topic, it would work - with some caveats about race conditions 
too, but that's unrelated to your proposal -


If you want to run multiple nova-scheduler services, then just fire them 
up on separate machines (that's HA, eh) and that will work.


-Sylvain





2015-03-05 13:26 GMT+08:00 Alex Xu sou...@gmail.com:


Rui, you can still run multiple nova-scheduler processes now.


2015-03-05 10:55 GMT+08:00 Rui Chen chenrui.m...@gmail.com:

Looks like it's a complicated problem, and nova-scheduler
can't scale-out horizontally in active/active mode.

Maybe we should illustrate the problem in the HA docs.


http://docs.openstack.org/high-availability-guide/content/_schedulers.html

Thanks for everybody's attention.

2015-03-05 5:38 GMT+08:00 Mike Bayer mba...@redhat.com:



Attila Fazekas afaze...@redhat.com wrote:

 Hi,

 I wonder what is the planned future of the scheduling.

 The scheduler does a lot of queries touching a high number of fields,
 which is CPU expensive when you are using sqlalchemy-orm.
 Has anyone tried to switch those operations to
sqlalchemy-core ?

An upcoming feature in SQLAlchemy 1.0 will remove the vast majority of
CPU overhead from the query side of SQLAlchemy ORM by caching all the
work done up until the SQL is emitted, including all the function
overhead of building up the Query object, producing a core select()
object internally from the Query, working out a large part of the
object fetch strategies, and finally the string compilation of the
select() into a string as well as organizing the typing information
for result columns. With a query that is constructed using the “Baked”
feature, all of these steps are cached in memory and held persistently;
the same query can then be re-used at which point all of these steps
are skipped. The system produces the cache key based on the in-place
construction of the Query using lambdas so no major changes to code
structure are needed; just the way the Query modifications are
performed needs to be preceded with “lambda q:”, essentially.

With this approach, the traditional session.query(Model) approach can
go from start to SQL being emitted with an order of magnitude less
function calls. On the fetch side, fetching individual columns instead
of full entities has always been an option with ORM and is about the
same speed as a Core fetch of rows. So using ORM with minimal changes
to existing ORM code you can get performance even better than you’d
get using Core directly, since caching of the string compilation is
also added.

On the persist side, the new bulk insert / update features provide a
bridge from ORM-mapped objects to bulk inserts/updates without any
unit of work sorting going on. ORM mapped objects are still more
expensive to use in that instantiation and state change is still more
expensive, but bulk insert/update accepts dictionaries as well, which
again is competitive with a straight Core insert.

Both of these features are completed in the master branch, the “baked
query” feature just needs documentation, and I’m basically two or
three tickets away from beta releases of 1.0. The “Baked” feature
itself lives as an extension and if we really wanted, I could backport
it into oslo.db as well so that it works against 0.9.

So I’d ask that folks please hold off on any kind of migration from
ORM to Core for performance reasons. I’ve spent the past several
months adding features directly to SQLAlchemy that allow an ORM-based
app to have routes to operations that perform just as fast as that of 

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-05 Thread Rui Chen
My BP aims at launching multiple nova-scheduler processes on a host, like
nova-conductor does.

If we run multiple nova-scheduler services on separate hosts, that will
work; will forking multiple nova-scheduler
child processes on a host work too? Different child processes have
different HostState objects in their own memory;
the only difference from HA is just launching all the scheduler processes
on one host.

I'm sorry to take up some time, I just want to clarify it.
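
For reference, a rough sketch of what forking workers amounts to, using
oslo-style service launching the way nova-conductor does; the module paths
and the worker count are assumptions for illustration:

import sys

from oslo_config import cfg
from nova import config, service

CONF = cfg.CONF

def main():
    config.parse_args(sys.argv)
    server = service.Service.create(binary='nova-scheduler',
                                    topic=CONF.scheduler_topic)
    # workers > 1 forks child processes, each with its own in-memory
    # HostState view -- equivalent to running N schedulers on one host.
    service.serve(server, workers=4)  # hypothetical worker count
    service.wait()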


2015-03-05 17:12 GMT+08:00 Sylvain Bauza sba...@redhat.com:


 On 05/03/2015 08:54, Rui Chen wrote:

 We will face the same issue in the multiple nova-scheduler process case, like
 Sylvain says, right?

  Two processes/workers can actually consume two distinct resources on the
 same HostState.


 No. The problem I mentioned was related to having multiple threads
 accessing the same object in memory.
 By running multiple schedulers on different hosts and listening to the
 same RPC topic, it would work - with some caveats about race conditions
 too, but that's unrelated to your proposal -

 If you want to run multiple nova-scheduler services, then just fire them
 up on separate machines (that's HA, eh) and that will work.

 -Sylvain





 2015-03-05 13:26 GMT+08:00 Alex Xu sou...@gmail.com:

 Rui, you can still run multiple nova-scheduler processes now.


 2015-03-05 10:55 GMT+08:00 Rui Chen chenrui.m...@gmail.com:

 Looks like it's a complicated problem, and nova-scheduler can't
 scale-out horizontally in active/active mode.

  Maybe we should illustrate the problem in the HA docs.


 http://docs.openstack.org/high-availability-guide/content/_schedulers.html

 Thanks for everybody's attention.

 2015-03-05 5:38 GMT+08:00 Mike Bayer mba...@redhat.com:



 Attila Fazekas afaze...@redhat.com wrote:

  Hi,
 
  I wonder what is the planned future of the scheduling.
 
  The scheduler does a lot of queries touching a high number of fields,
  which is CPU expensive when you are using sqlalchemy-orm.
  Has anyone tried to switch those operations to sqlalchemy-core ?

 An upcoming feature in SQLAlchemy 1.0 will remove the vast majority of CPU
 overhead from the query side of SQLAlchemy ORM by caching all the work done
 up until the SQL is emitted, including all the function overhead of
 building up the Query object, producing a core select() object internally
 from the Query, working out a large part of the object fetch strategies,
 and finally the string compilation of the select() into a string as well as
 organizing the typing information for result columns. With a query that is
 constructed using the “Baked” feature, all of these steps are cached in
 memory and held persistently; the same query can then be re-used at which
 point all of these steps are skipped. The system produces the cache key
 based on the in-place construction of the Query using lambdas so no major
 changes to code structure are needed; just the way the Query modifications
 are performed needs to be preceded with “lambda q:”, essentially.
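
For illustration, the documented usage pattern of the Baked extension looks
roughly like this; the ComputeNode model is a hypothetical minimal stand-in,
not Nova's real model:

from sqlalchemy import Column, Integer, String, bindparam
from sqlalchemy.ext import baked
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class ComputeNode(Base):  # hypothetical minimal model
    __tablename__ = "compute_nodes"
    id = Column(Integer, primary_key=True)
    hypervisor_hostname = Column(String(255))

bakery = baked.bakery()

def hosts_by_name(session, hostname):
    # The lambdas form the cache key; the Query construction overhead is
    # paid once, then skipped on every subsequent call.
    q = bakery(lambda session: session.query(ComputeNode))
    q += lambda q: q.filter(
        ComputeNode.hypervisor_hostname == bindparam("hostname"))
    return q(session).params(hostname=hostname).all()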

 With this approach, the traditional session.query(Model) approach can go
 from start to SQL being emitted with an order of magnitude less function
 calls. On the fetch side, fetching individual columns instead of full
 entities has always been an option with ORM and is about the same speed as
 a Core fetch of rows. So using ORM with minimal changes to existing ORM
 code you can get performance even better than you’d get using Core
 directly, since caching of the string compilation is also added.

 On the persist side, the new bulk insert / update features provide a
 bridge from ORM-mapped objects to bulk inserts/updates without any unit of
 work sorting going on. ORM mapped objects are still more expensive to use
 in that instantiation and state change is still more expensive, but bulk
 insert/update accepts dictionaries as well, which again is competitive
 with a straight Core insert.
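
And the bulk path might look like this, re-using the hypothetical
ComputeNode model from the previous sketch; bulk_insert_mappings() is the
1.0-era Session API:

rows = [
    {"hypervisor_hostname": "node-%d" % i}  # plain dicts, no ORM objects
    for i in range(1000)
]
session.bulk_insert_mappings(ComputeNode, rows)
session.commit()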

 Both of these features are completed in the master branch, the “baked
 query” feature just needs documentation, and I’m basically two or three
 tickets away from beta releases of 1.0. The “Baked” feature itself lives
 as an extension and if we really wanted, I could backport it into oslo.db
 as well so that it works against 0.9.

 So I’d ask that folks please hold off on any kind of migration from ORM to
 Core for performance reasons. I’ve spent the past several months adding
 features directly to SQLAlchemy that allow an ORM-based app to have routes
 to operations that perform just as fast as that of Core without a rewrite
 of code.

  The scheduler does a lot of things in the application, like filtering,
  that could be done on the DB level more efficiently. Why is it not done
  on the DB side ?
 
  There are use cases where the scheduler would need to know even more
 data.
  Is there a plan for keeping `everything` in every scheduler process's
 memory up-to-date ?
  (Maybe zookeeper)
 
  The opposite way would be to move most operations to the DB side,
  since the DB 

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-05 Thread Nikola Đipanov
On 03/04/2015 09:23 AM, Sylvain Bauza wrote:
 
 Le 04/03/2015 04:51, Rui Chen a écrit :
 Hi all,

 I want to make it easy to launch a bunch of scheduler processes on a
 host; multiple scheduler workers will make use of the host's multiple
 processors and enhance the performance of nova-scheduler.

 I have registered a blueprint and committed a patch to implement it.
 https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support

 This patch has been applied in our performance environment and passed some
 test cases, like concurrently booting multiple instances; so far we
 didn't find any inconsistency issues.

 IMO, nova-scheduler should be easy to scale horizontally; multiple
 workers should be supported as an out-of-the-box feature.

 Please feel free to discuss this feature, thanks.
 
 
 As I said when reviewing your patch, I think the problem is not just
 making sure that the scheduler is thread-safe, it's more about how the
 Scheduler is accounting resources and providing a retry if those
 consumed resources are higher than what's available.
 
 Here, the main problem is that two workers can actually consume two
 distinct resources on the same HostState object. In that case, the
 HostState object is decremented by the number of taken resources (modulo
 what a resource which is not an Integer means...) for both, but nowhere
 in that section does it check whether it overrides the resource usage. As
 I said, it's not just about decorating a semaphore, it's more about
 rethinking how the Scheduler is managing its resources.
 
 
 That's why I'm -1 on your patch until [1] gets merged. Once this BP is
 implemented, we will have a set of classes for managing heterogeneous
 types of resources and consuming them, so it would be quite easy to provide
 a check against them in the consume_from_instance() method.
 

I feel that the above explanation does not give the full picture in
addition to being factually incorrect in several places. I have come to
realize that the current behaviour of the scheduler is subtle enough
that just reading the code is not enough to understand all the edge
cases that can come up. The evidence being that it trips up even people
that have spent significant time working on the code.

It is also important to consider the design choices in terms of
tradeoffs that they were trying to make.

So here are some facts about the way Nova does scheduling of instances
to compute hosts, considering the amount of resources requested by the
flavor (we will try to put the facts into a bigger picture later):

* Scheduler receives a request to choose hosts for one or more instances.
* Upon every request (_not_ for every instance as there may be several
instances in a request) the scheduler learns the state of the resources
on all compute nodes from the central DB. This state may be inaccurate
(meaning out of date).
* Compute resources are updated by each compute host periodically. This
is done by updating the row in the DB.
* The wall-clock time difference between the scheduler deciding to
schedule an instance, and the resource consumption being reflected in
the data the scheduler learns from the DB can be arbitrarily long (due
to load on the compute nodes and latency of message arrival).
* To cope with the above, there is a concept of retrying the request
that fails on a certain compute node due to the scheduling decision
being made with data stale at the moment of build, by default we will
retry 3 times before giving up.
* When running multiple instances, decisions are made in a loop, and
internal in-memory view of the resources gets updated (the widely
misunderstood consume_from_instance method is used for this), so as to
keep subsequent decisions as accurate as possible. As was described
above, this is all thrown away once the request is finished.

Now that we understand the above, we can start to consider what changes
when we introduce several concurrent scheduler processes.

Several cases come to mind:
* Concurrent requests will no longer be serialized on reading the state
of all hosts (due to how eventlet interacts with mysql driver).
* In the presence of a single request for a large number of instances
there is going to be a drift in accuracy of the decisions made by other
schedulers as they will not have accounted for any of the instances
until they actually get claimed on their respective hosts.

All of the above limitations will likely not pose a problem under normal
load and usage and can cause issues to start appearing when nodes are
close to full or when there is heavy load. Also this changes drastically
based on how we actually choose to utilize hosts (see a very interesting
Ironic bug [1])

Whether any of the above matters to users depends heavily on their
use-case, though. This is why I feel we should be providing more information.

Finally - I think it is important to accept that the scheduler service
will always have to operate under the assumptions of stale data, and

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-05 Thread Sylvain Bauza


On 05/03/2015 13:00, Nikola Đipanov wrote:

On 03/04/2015 09:23 AM, Sylvain Bauza wrote:

On 04/03/2015 04:51, Rui Chen wrote:

Hi all,

I want to make it easy to launch a bunch of scheduler processes on a
host; multiple scheduler workers will make use of the host's multiple
processors and enhance the performance of nova-scheduler.

I have registered a blueprint and committed a patch to implement it.
https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support

This patch has been applied in our performance environment and passed some
test cases, like concurrently booting multiple instances; so far we
didn't find any inconsistency issues.

IMO, nova-scheduler should be easy to scale horizontally; multiple
workers should be supported as an out-of-the-box feature.

Please feel free to discuss this feature, thanks.


As I said when reviewing your patch, I think the problem is not just
making sure that the scheduler is thread-safe, it's more about how the
Scheduler is accounting resources and providing a retry if those
consumed resources are higher than what's available.

Here, the main problem is that two workers can actually consume two
distinct resources on the same HostState object. In that case, the
HostState object is decremented by the amount of resources taken (modulo
what a resource that is not an integer means...) for both, but nowhere
in that section does it check whether the resource usage is exceeded. As
I said, it's not just about adding a semaphore; it's more about
rethinking how the Scheduler is managing its resources.


That's why I'm -1 on your patch until [1] gets merged. Once this BP is
implemented, we will have a set of classes for managing and consuming
heterogeneous types of resources, so it would be quite easy to provide
a check against them in the consume_from_instance() method.


I feel that the above explanation does not give the full picture, in
addition to being factually incorrect in several places. I have come to
realize that the current behaviour of the scheduler is subtle enough
that just reading the code is not enough to understand all the edge
cases that can come up. The evidence being that it trips up even people
who have spent significant time working on the code.

It is also important to consider these design choices in terms of the
tradeoffs they were trying to make.

So here are some facts about the way Nova does scheduling of instances
to compute hosts, considering the amount of resources requested by the
flavor (we will try to put the facts into a bigger picture later):

* The scheduler receives a request to choose hosts for one or more instances.
* Upon every request (_not_ for every instance as there may be several
instances in a request) the scheduler learns the state of the resources
on all compute nodes from the central DB. This state may be inaccurate
(meaning out of date).
* Compute resources are updated by each compute host periodically. This
is done by updating the corresponding row in the DB.
* The wall-clock time difference between the scheduler deciding to
schedule an instance, and the resource consumption being reflected in
the data the scheduler learns from the DB can be arbitrarily long (due
to load on the compute nodes and latency of message arrival).
* To cope with the above, there is a concept of retrying a request
that fails on a certain compute node because the scheduling decision
was made with data that was stale at the moment of the build; by
default we retry 3 times before giving up.
* When running multiple instances, decisions are made in a loop, and
an internal in-memory view of the resources gets updated (the widely
misunderstood consume_from_instance method is used for this; see the
sketch after this list), so as to keep subsequent decisions as accurate
as possible. As was described above, this is all thrown away once the
request is finished.
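
A minimal sketch of that select-and-consume loop, assuming a simplified
HostState that tracks only RAM (the real method updates many more
fields):

    # Hypothetical sketch of the per-request scheduling loop.
    def schedule_instances(host_states, instance_ram_mb, num_instances):
        chosen = []
        for _ in range(num_instances):
            # pick the host with the most free RAM in our local view
            host = max(host_states, key=lambda h: h.free_ram_mb)
            chosen.append(host)
            # update the in-memory view so the next iteration sees it;
            # this is consume_from_instance in spirit
            host.free_ram_mb -= instance_ram_mb
        # the local view is discarded once the request is finished
        return chosen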

Now that we understand the above, we can start to consider what changes
when we introduce several concurrent scheduler processes.

Several cases come to mind:
* Concurrent requests will no longer be serialized on reading the state
of all hosts (due to how eventlet interacts with the MySQL driver).
* In the presence of a single request for a large number of instances
there is going to be a drift in the accuracy of the decisions made by
other schedulers, as they will not have accounted for any of those
instances until the instances actually get claimed on their respective
hosts (illustrated below).
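
To make the drift concrete, a toy illustration (host name and numbers
are made up): two scheduler processes each load the same snapshot from
the DB, so neither sees the other's placement until the claims land on
the compute node:

    # Hypothetical illustration of two schedulers sharing a stale view.
    db_row = {'host': 'node1', 'free_ram_mb': 2048}
    view_a = dict(db_row)  # scheduler A's in-memory view
    view_b = dict(db_row)  # scheduler B's in-memory view

    # each scheduler independently places a 1536 MB instance on node1
    view_a['free_ram_mb'] -= 1536  # A believes 512 MB remain
    view_b['free_ram_mb'] -= 1536  # B also believes 512 MB remain
    # node1 is actually oversubscribed by 1024 MB; one build fails at
    # claim time and goes through the retry described above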

All of the above limitations will likely not pose a problem under normal
load and usage, but can cause issues to start appearing when nodes are
close to full or when there is heavy load. Also, this changes drastically
based on how we actually choose to utilize hosts (see a very interesting
Ironic bug [1]).

Whether any of the above matters to users depends heavily on their
use case, though. This is why I feel we should be providing more information.

Finally - I think it is important to accept that the scheduler service
will always have to operate under the assumptions of 

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-04 Thread Mike Bayer


Attila Fazekas afaze...@redhat.com wrote:

 Hi,
 
 I wonder what the planned future of the scheduling is.
 
 The scheduler does a lot of queries with a high number of fields,
 which is CPU-expensive when you are using sqlalchemy-orm.
 Has anyone tried to switch those operations to sqlalchemy-core?

An upcoming feature in SQLAlchemy 1.0 will remove the vast majority of CPU
overhead from the query side of SQLAlchemy ORM by caching all the work done
up until the SQL is emitted, including all the function overhead of building
up the Query object, producing a core select() object internally from the
Query, working out a large part of the object fetch strategies, and finally
the string compilation of the select() into a string as well as organizing
the typing information for result columns. With a query that is constructed
using the “Baked” feature, all of these steps are cached in memory and held
persistently; the same query can then be re-used at which point all of these
steps are skipped. The system produces the cache key based on the in-place
construction of the Query using lambdas so no major changes to code
structure are needed; just the way the Query modifications are performed
needs to be preceded with “lambda q:”, essentially.
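
As a minimal sketch of that usage (the Host model and its table are
made up for illustration; the bakery API shown is from the 1.0
development branch and details may shift before release):

    # Hypothetical sketch of the "baked query" extension.
    from sqlalchemy import Column, Integer, bindparam
    from sqlalchemy.ext import baked
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Host(Base):
        __tablename__ = 'hosts'  # made-up table for illustration
        id = Column(Integer, primary_key=True)
        free_ram_mb = Column(Integer)

    bakery = baked.bakery()

    def hosts_with_free_ram(session, min_ram_mb):
        bq = bakery(lambda s: s.query(Host))
        bq += lambda q: q.filter(Host.free_ram_mb >= bindparam('ram'))
        # the construction steps above are cached; repeat calls with
        # the same lambdas skip straight to execution
        return bq(session).params(ram=min_ram_mb).all()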

With this approach, the traditional session.query(Model) approach can go
from start to SQL being emitted with an order of magnitude fewer function
calls. On the fetch side, fetching individual columns instead of full
entities has always been an option with ORM and is about the same speed as a
Core fetch of rows. So using ORM with minimal changes to existing ORM code
you can get performance even better than you’d get using Core directly,
since caching of the string compilation is also added.

On the persist side, the new bulk insert / update features provide a bridge
from ORM-mapped objects to bulk inserts/updates without any unit of work
sorting going on. ORM-mapped objects are still more expensive to use, in that
instantiation and state change cost more, but bulk
insert/update accepts dictionaries as well, which again is competitive with
a straight Core insert.
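
For instance, a minimal sketch of the dictionary path, assuming an
ordinary Session and the made-up Host model from the previous sketch:

    # Hypothetical sketch of the new 1.0 bulk-insert API.
    rows = [{'id': i, 'free_ram_mb': 4096} for i in range(1000)]
    session.bulk_insert_mappings(Host, rows)  # no unit-of-work sorting
    session.commit()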

Both of these features are completed in the master branch, the “baked query”
feature just needs documentation, and I’m basically two or three tickets
away from beta releases of 1.0. The “Baked” feature itself lives as an
extension and if we really wanted, I could backport it into oslo.db as well
so that it works against 0.9.

So I’d ask that folks please hold off on any kind of migration from ORM to
Core for performance reasons. I’ve spent the past several months adding
features directly to SQLAlchemy that allow an ORM-based app to have routes
to operations that perform just as fast as that of Core without a rewrite of
code.

 The scheduler does a lot of things in the application, like filtering,
 that could be done more efficiently at the DB level. Why is it not done
 on the DB side?
 
 There are use cases where the scheduler would need to know even more data.
 Is there a plan for keeping `everything` in every scheduler process's
 memory up-to-date?
 (Maybe zookeeper)
 
 The opposite way would be to move most operations to the DB side,
 since the DB already knows everything.
 (stored procedures?)
 
 Best Regards,
 Attila
 
 
 - Original Message -
 From: Rui Chen chenrui.m...@gmail.com
 To: OpenStack Development Mailing List (not for usage questions) 
 openstack-dev@lists.openstack.org
 Sent: Wednesday, March 4, 2015 4:51:07 AM
 Subject: [openstack-dev] [nova] blueprint about multiple workers supported   
 in nova-scheduler
 
 Hi all,
 
 I want to make it easy to launch a bunch of scheduler processes on a host;
 multiple scheduler workers will make use of the host's multiple processors
 and enhance the performance of nova-scheduler.
 
 I have registered a blueprint and committed a patch to implement it.
 https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support
 
 This patch has been applied in our performance environment and passed some
 test cases, like concurrently booting multiple instances; so far we didn't
 find any inconsistency issues.
 
 IMO, nova-scheduler should be easy to scale horizontally; multiple
 workers should be supported as an out-of-the-box feature.
 
 Please feel free to discuss this feature, thanks.
 
 Best Regards
 
 

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-04 Thread Jay Pipes

On 03/04/2015 01:51 AM, Attila Fazekas wrote:

Hi,

I wonder what the planned future of the scheduling is.

The scheduler does a lot of queries with a high number of fields,
which is CPU-expensive when you are using sqlalchemy-orm.
Has anyone tried to switch those operations to sqlalchemy-core?


Actually, the scheduler does virtually no SQLAlchemy ORM queries. Almost 
all database access is serialized from the nova-scheduler through the 
nova-conductor service via the nova.objects remoting framework.



The scheduler does a lot of things in the application, like filtering,
that could be done more efficiently at the DB level. Why is it not done
on the DB side?


That's a pretty big generalization. Many filters (check out NUMA 
configuration, host aggregate extra_specs matching, any of the JSON 
filters, etc) don't lend themselves to SQL column-based sorting and 
filtering.
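
For example, nova's JsonFilter evaluates an arbitrary boolean expression
passed as a JSON-encoded scheduler hint against each host's state,
roughly as below (the numbers are made up); expressing this kind of
per-request predicate in SQL would require building queries dynamically:

    # "free RAM >= 1024 MB AND free disk >= 200 GB", as consumed by
    # the JsonFilter via the 'query' scheduler hint.
    import json

    query = json.dumps(['and',
                        ['>=', '$free_ram_mb', 1024],
                        ['>=', '$free_disk_mb', 200 * 1024]])
    filter_properties = {'scheduler_hints': {'query': query}}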



There are use cases where the scheduler would need to know even more data.
Is there a plan for keeping `everything` in every scheduler process's
memory up-to-date?
(Maybe zookeeper)


Zookeeper has nothing to do with scheduling decisions -- only with whether
or not a compute node's service descriptor is active. The end goal
(after splitting the Nova scheduler out into Gantt hopefully at the 
start of the L release cycle) is to have the Gantt database be more 
optimized to contain the resource usage amounts of all resources 
consumed in the entire cloud, and to use partitioning/sharding to scale 
the scheduler subsystem, instead of having each scheduler process handle 
requests for all resources in the cloud (or cell...)



The opposite way would be to move most operations to the DB side,
since the DB already knows everything.
(stored procedures?)


See above. This assumes that the data the scheduler is iterating over is 
well-structured and consistent, and that is a false assumption.


Best,
-jay


Best Regards,
Attila


- Original Message -

From: Rui Chen chenrui.m...@gmail.com
To: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Sent: Wednesday, March 4, 2015 4:51:07 AM
Subject: [openstack-dev] [nova] blueprint about multiple workers supported  
in nova-scheduler

Hi all,

I want to make it easy to launch a bunch of scheduler processes on a host;
multiple scheduler workers will make use of the host's multiple processors
and enhance the performance of nova-scheduler.

I have registered a blueprint and committed a patch to implement it.
https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support

This patch has been applied in our performance environment and passed some
test cases, like concurrently booting multiple instances; so far we didn't
find any inconsistency issues.

IMO, nova-scheduler should be easy to scale horizontally; multiple
workers should be supported as an out-of-the-box feature.

Please feel free to discuss this feature, thanks.

Best Regards




Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-04 Thread Rui Chen
Looks like it's a complicated problem, and nova-scheduler can't scale out
horizontally in active/active mode.

Maybe we should illustrate the problem in the HA docs.

http://docs.openstack.org/high-availability-guide/content/_schedulers.html

Thanks for everybody's attention.

2015-03-05 5:38 GMT+08:00 Mike Bayer mba...@redhat.com:



 Attila Fazekas afaze...@redhat.com wrote:

  Hi,
 
  I wonder what the planned future of the scheduling is.
 
  The scheduler does a lot of queries with a high number of fields,
  which is CPU-expensive when you are using sqlalchemy-orm.
  Has anyone tried to switch those operations to sqlalchemy-core?

 An upcoming feature in SQLAlchemy 1.0 will remove the vast majority of CPU
 overhead from the query side of SQLAlchemy ORM by caching all the work done
 up until the SQL is emitted, including all the function overhead of
 building
 up the Query object, producing a core select() object internally from the
 Query, working out a large part of the object fetch strategies, and finally
 the string compilation of the select() into a string as well as organizing
 the typing information for result columns. With a query that is constructed
 using the “Baked” feature, all of these steps are cached in memory and held
 persistently; the same query can then be re-used at which point all of
 these
 steps are skipped. The system produces the cache key based on the in-place
 construction of the Query using lambdas so no major changes to code
 structure are needed; just the way the Query modifications are performed
 needs to be preceded with “lambda q:”, essentially.

 With this approach, the traditional session.query(Model) approach can go
 from start to SQL being emitted with an order of magnitude fewer function
 calls. On the fetch side, fetching individual columns instead of full
 entities has always been an option with ORM and is about the same speed as
 a
 Core fetch of rows. So using ORM with minimal changes to existing ORM code
 you can get performance even better than you’d get using Core directly,
 since caching of the string compilation is also added.

 On the persist side, the new bulk insert / update features provide a bridge
 from ORM-mapped objects to bulk inserts/updates without any unit of work
 sorting going on. ORM-mapped objects are still more expensive to use, in
 that instantiation and state change cost more, but bulk
 insert/update accepts dictionaries as well, which again is competitive with
 a straight Core insert.

 Both of these features are completed in the master branch, the “baked
 query”
 feature just needs documentation, and I’m basically two or three tickets
 away from beta releases of 1.0. The “Baked” feature itself lives as an
 extension and if we really wanted, I could backport it into oslo.db as well
 so that it works against 0.9.

 So I’d ask that folks please hold off on any kind of migration from ORM to
 Core for performance reasons. I’ve spent the past several months adding
 features directly to SQLAlchemy that allow an ORM-based app to have routes
 to operations that perform just as fast as that of Core without a rewrite
 of
 code.

  The scheduler does a lot of things in the application, like filtering,
  that could be done more efficiently at the DB level. Why is it not done
  on the DB side?
 
  There are use cases where the scheduler would need to know even more data.
  Is there a plan for keeping `everything` in every scheduler process's
  memory up-to-date?
  (Maybe zookeeper)
 
  The opposite way would be to move most operations to the DB side,
  since the DB already knows everything.
  (stored procedures?)
 
  Best Regards,
  Attila
 
 
  - Original Message -
  From: Rui Chen chenrui.m...@gmail.com
  To: OpenStack Development Mailing List (not for usage questions) 
 openstack-dev@lists.openstack.org
  Sent: Wednesday, March 4, 2015 4:51:07 AM
  Subject: [openstack-dev] [nova] blueprint about multiple workers
 supported   in nova-scheduler
 
  Hi all,
 
  I want to make it easy to launch a bunch of scheduler processes on a
  host; multiple scheduler workers will make use of the host's multiple
  processors and enhance the performance of nova-scheduler.
 
  I have registered a blueprint and committed a patch to implement it.
 
 https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support
 
  This patch has been applied in our performance environment and passed
  some test cases, like concurrently booting multiple instances; so far
  we didn't find any inconsistency issues.
 
  IMO, nova-scheduler should be easy to scale horizontally; multiple
  workers should be supported as an out-of-the-box feature.
 
  Please feel free to discuss this feature, thanks.
 
  Best Regards
 
 
 

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-04 Thread Rui Chen
We will face the same issue in the multiple nova-scheduler process case, like
Sylvain says, right?

Two processes/workers can actually consume two distinct resources on the
same HostState.




2015-03-05 13:26 GMT+08:00 Alex Xu sou...@gmail.com:

 Rui, you can still run multiple nova-scheduler processes now.


 2015-03-05 10:55 GMT+08:00 Rui Chen chenrui.m...@gmail.com:

  Looks like it's a complicated problem, and nova-scheduler can't scale out
 horizontally in active/active mode.

 Maybe we should illustrate the problem in the HA docs.

 http://docs.openstack.org/high-availability-guide/content/_schedulers.html

 Thanks for everybody's attention.

 2015-03-05 5:38 GMT+08:00 Mike Bayer mba...@redhat.com:



 Attila Fazekas afaze...@redhat.com wrote:

  Hi,
 
  I wonder what the planned future of the scheduling is.
 
  The scheduler does a lot of queries with a high number of fields,
  which is CPU-expensive when you are using sqlalchemy-orm.
  Has anyone tried to switch those operations to sqlalchemy-core?

 An upcoming feature in SQLAlchemy 1.0 will remove the vast majority of
 CPU
 overhead from the query side of SQLAlchemy ORM by caching all the work
 done
 up until the SQL is emitted, including all the function overhead of
 building
 up the Query object, producing a core select() object internally from the
 Query, working out a large part of the object fetch strategies, and
 finally
 the string compilation of the select() into a string as well as
 organizing
 the typing information for result columns. With a query that is
 constructed
 using the “Baked” feature, all of these steps are cached in memory and
 held
 persistently; the same query can then be re-used at which point all of
 these
 steps are skipped. The system produces the cache key based on the
 in-place
 construction of the Query using lambdas so no major changes to code
 structure are needed; just the way the Query modifications are performed
 needs to be preceded with “lambda q:”, essentially.

 With this approach, the traditional session.query(Model) approach can go
 from start to SQL being emitted with an order of magnitude fewer function
 calls. On the fetch side, fetching individual columns instead of full
 entities has always been an option with ORM and is about the same speed
 as a
 Core fetch of rows. So using ORM with minimal changes to existing ORM
 code
 you can get performance even better than you’d get using Core directly,
 since caching of the string compilation is also added.

 On the persist side, the new bulk insert / update features provide a
 bridge
 from ORM-mapped objects to bulk inserts/updates without any unit of work
 sorting going on. ORM-mapped objects are still more expensive to use, in
 that instantiation and state change cost more, but bulk
 insert/update accepts dictionaries as well, which again is competitive
 with
 a straight Core insert.

 Both of these features are completed in the master branch, the “baked
 query”
 feature just needs documentation, and I’m basically two or three tickets
 away from beta releases of 1.0. The “Baked” feature itself lives as an
 extension and if we really wanted, I could backport it into oslo.db as
 well
 so that it works against 0.9.

 So I’d ask that folks please hold off on any kind of migration from ORM
 to
 Core for performance reasons. I’ve spent the past several months adding
 features directly to SQLAlchemy that allow an ORM-based app to have
 routes
 to operations that perform just as fast as that of Core without a
 rewrite of
 code.

  The scheduler does a lot of things in the application, like filtering,
  that could be done more efficiently at the DB level. Why is it not done
  on the DB side?
 
  There are use cases where the scheduler would need to know even more data.
  Is there a plan for keeping `everything` in every scheduler process's
  memory up-to-date?
  (Maybe zookeeper)
 
  The opposite way would be to move most operations to the DB side,
  since the DB already knows everything.
  (stored procedures?)
 
  Best Regards,
  Attila
 
 
  - Original Message -
  From: Rui Chen chenrui.m...@gmail.com
  To: OpenStack Development Mailing List (not for usage questions) 
 openstack-dev@lists.openstack.org
  Sent: Wednesday, March 4, 2015 4:51:07 AM
  Subject: [openstack-dev] [nova] blueprint about multiple workers
 supported   in nova-scheduler
 
  Hi all,
 
  I want to make it easy to launch a bunch of scheduler processes on a
  host; multiple scheduler workers will make use of the host's multiple
  processors and enhance the performance of nova-scheduler.
 
  I have registered a blueprint and committed a patch to implement it.
 
 https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support
 
  This patch has been applied in our performance environment and passed
  some test cases, like concurrently booting multiple instances; so far
  we didn't find any inconsistency issues.
 
  IMO, nova-scheduler should be easy to scale horizontally; the
  multiple 

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-04 Thread Alex Xu
Rui, you can still run multiple nova-scheduler processes now.

2015-03-05 10:55 GMT+08:00 Rui Chen chenrui.m...@gmail.com:

 Looks like it's a complicated problem, and nova-scheduler can't scale out
 horizontally in active/active mode.

 Maybe we should illustrate the problem in the HA docs.

 http://docs.openstack.org/high-availability-guide/content/_schedulers.html

 Thanks for everybody's attention.

 2015-03-05 5:38 GMT+08:00 Mike Bayer mba...@redhat.com:



 Attila Fazekas afaze...@redhat.com wrote:

  Hi,
 
  I wonder what the planned future of the scheduling is.
 
  The scheduler does a lot of queries with a high number of fields,
  which is CPU-expensive when you are using sqlalchemy-orm.
  Has anyone tried to switch those operations to sqlalchemy-core?

 An upcoming feature in SQLAlchemy 1.0 will remove the vast majority of CPU
 overhead from the query side of SQLAlchemy ORM by caching all the work
 done
 up until the SQL is emitted, including all the function overhead of
 building
 up the Query object, producing a core select() object internally from the
 Query, working out a large part of the object fetch strategies, and
 finally
 the string compilation of the select() into a string as well as organizing
 the typing information for result columns. With a query that is
 constructed
 using the “Baked” feature, all of these steps are cached in memory and
 held
 persistently; the same query can then be re-used at which point all of
 these
 steps are skipped. The system produces the cache key based on the in-place
 construction of the Query using lambdas so no major changes to code
 structure are needed; just the way the Query modifications are performed
 needs to be preceded with “lambda q:”, essentially.

 With this approach, the traditional session.query(Model) approach can go
 from start to SQL being emitted with an order of magnitude fewer function
 calls. On the fetch side, fetching individual columns instead of full
 entities has always been an option with ORM and is about the same speed
 as a
 Core fetch of rows. So using ORM with minimal changes to existing ORM code
 you can get performance even better than you’d get using Core directly,
 since caching of the string compilation is also added.

 On the persist side, the new bulk insert / update features provide a
 bridge
 from ORM-mapped objects to bulk inserts/updates without any unit of work
 sorting going on. ORM-mapped objects are still more expensive to use, in
 that instantiation and state change cost more, but bulk
 insert/update accepts dictionaries as well, which again is competitive
 with
 a straight Core insert.

 Both of these features are completed in the master branch, the “baked
 query”
 feature just needs documentation, and I’m basically two or three tickets
 away from beta releases of 1.0. The “Baked” feature itself lives as an
 extension and if we really wanted, I could backport it into oslo.db as
 well
 so that it works against 0.9.

 So I’d ask that folks please hold off on any kind of migration from ORM to
 Core for performance reasons. I’ve spent the past several months adding
 features directly to SQLAlchemy that allow an ORM-based app to have routes
 to operations that perform just as fast as that of Core without a rewrite
 of
 code.

  The scheduler does a lot of things in the application, like filtering,
  that could be done more efficiently at the DB level. Why is it not done
  on the DB side?
 
  There are use cases where the scheduler would need to know even more data.
  Is there a plan for keeping `everything` in every scheduler process's
  memory up-to-date?
  (Maybe zookeeper)
 
  The opposite way would be to move most operations to the DB side,
  since the DB already knows everything.
  (stored procedures?)
 
  Best Regards,
  Attila
 
 
  - Original Message -
  From: Rui Chen chenrui.m...@gmail.com
  To: OpenStack Development Mailing List (not for usage questions) 
 openstack-dev@lists.openstack.org
  Sent: Wednesday, March 4, 2015 4:51:07 AM
  Subject: [openstack-dev] [nova] blueprint about multiple workers
 supported   in nova-scheduler
 
  Hi all,
 
  I want to make it easy to launch a bunch of scheduler processes on a
  host; multiple scheduler workers will make use of the host's multiple
  processors and enhance the performance of nova-scheduler.
 
  I have registered a blueprint and committed a patch to implement it.
 
 https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support
 
  This patch has been applied in our performance environment and passed
  some test cases, like concurrently booting multiple instances; so far
  we didn't find any inconsistency issues.
 
  IMO, nova-scheduler should be easy to scale horizontally; multiple
  workers should be supported as an out-of-the-box feature.
 
  Please feel free to discuss this feature, thanks.
 
  Best Regards
 
 
 

Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-04 Thread Attila Fazekas
Hi,

I wonder what the planned future of the scheduling is.

The scheduler does a lot of queries with a high number of fields,
which is CPU-expensive when you are using sqlalchemy-orm.
Has anyone tried to switch those operations to sqlalchemy-core?

The scheduler does a lot of things in the application, like filtering,
that could be done more efficiently at the DB level. Why is it not done
on the DB side?

There are use cases where the scheduler would need to know even more data.
Is there a plan for keeping `everything` in every scheduler process's
memory up-to-date?
(Maybe zookeeper)

The opposite way would be to move most operations to the DB side,
since the DB already knows everything.
(stored procedures?)
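
To make the first point concrete, a sketch of how a coarse capacity
filter could be pushed into the query itself, using sqlalchemy-core
against a hypothetical cut-down compute_nodes table (the column names
are illustrative, not nova's actual schema):

    # Hypothetical sketch: pre-filter hosts in SQL instead of Python.
    from sqlalchemy import (MetaData, Table, Column, Integer, String,
                            select)

    metadata = MetaData()
    compute_nodes = Table(
        'compute_nodes', metadata,
        Column('hypervisor_hostname', String(255)),
        Column('free_ram_mb', Integer),
        Column('free_disk_gb', Integer))

    def candidate_hosts(conn, ram_mb, disk_gb):
        query = select([compute_nodes.c.hypervisor_hostname]).where(
            (compute_nodes.c.free_ram_mb >= ram_mb) &
            (compute_nodes.c.free_disk_gb >= disk_gb))
        return [row[0] for row in conn.execute(query)]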

Best Regards,
Attila


- Original Message -
 From: Rui Chen chenrui.m...@gmail.com
 To: OpenStack Development Mailing List (not for usage questions) 
 openstack-dev@lists.openstack.org
 Sent: Wednesday, March 4, 2015 4:51:07 AM
 Subject: [openstack-dev] [nova] blueprint about multiple workers supported
 in nova-scheduler
 
 Hi all,
 
 I want to make it easy to launch a bunch of scheduler processes on a host;
 multiple scheduler workers will make use of the host's multiple processors
 and enhance the performance of nova-scheduler.
 
 I have registered a blueprint and committed a patch to implement it.
 https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support
 
 This patch has been applied in our performance environment and passed some
 test cases, like concurrently booting multiple instances; so far we didn't
 find any inconsistency issues.
 
 IMO, nova-scheduler should be easy to scale horizontally; multiple
 workers should be supported as an out-of-the-box feature.
 
 Please feel free to discuss this feature, thanks.
 
 Best Regards
 
 


Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

2015-03-04 Thread Sylvain Bauza


On 04/03/2015 04:51, Rui Chen wrote:

Hi all,

I want to make it easy to launch a bunch of scheduler processes on a
host; multiple scheduler workers will make use of the host's multiple
processors and enhance the performance of nova-scheduler.


I have registered a blueprint and committed a patch to implement it.
https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support

This patch has been applied in our performance environment and passed
some test cases, like concurrently booting multiple instances; so far
we didn't find any inconsistency issues.


IMO, nova-scheduler should be easy to scale horizontally; multiple
workers should be supported as an out-of-the-box feature.


Please feel free to discuss this feature, thanks.



As I said when reviewing your patch, I think the problem is not just
making sure that the scheduler is thread-safe; it's more about how the
scheduler is accounting for resources and providing a retry if those
consumed resources are higher than what's available.


Here, the main problem is that two workers can actually consume two
distinct resources on the same HostState object. In that case, the
HostState object is decremented by the amount of resources taken (modulo
what a resource that is not an integer means...) for both, but nowhere
in that section does it check whether the resource usage is exceeded. As
I said, it's not just about adding a semaphore; it's more about
rethinking how the Scheduler is managing its resources.
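
A toy illustration of that gap, with a deliberately simplified HostState
(the real class tracks many more fields; the numbers are made up):

    # Hypothetical sketch of the unchecked consumption described above.
    class HostState(object):
        def __init__(self, free_ram_mb):
            self.free_ram_mb = free_ram_mb

        def consume_from_instance(self, instance_ram_mb):
            # decrements blindly: nothing verifies the result stays >= 0
            self.free_ram_mb -= instance_ram_mb

    host = HostState(free_ram_mb=2048)
    # two workers each decide the same host fits their 1536 MB instance
    host.consume_from_instance(1536)
    host.consume_from_instance(1536)
    print(host.free_ram_mb)  # -1024: oversubscribed, seen only at claim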



That's why I'm -1 on your patch until [1] gets merged. Once this BP is
implemented, we will have a set of classes for managing and consuming
heterogeneous types of resources, so it would be quite easy to provide
a check against them in the consume_from_instance() method.


-Sylvain

[1] 
http://specs.openstack.org/openstack/nova-specs/specs/kilo/approved/resource-objects.html


Best Regards


