Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
----- Original Message -----
From: Jay Pipes jaypi...@gmail.com
To: openstack-dev@lists.openstack.org
Sent: Wednesday, March 4, 2015 9:22:43 PM
Subject: Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

> On 03/04/2015 01:51 AM, Attila Fazekas wrote:
>> Hi,
>>
>> I wonder what the planned future of the scheduling is. The scheduler does a lot of high-field-count queries, which are CPU-expensive when you are using sqlalchemy-orm. Has anyone tried to switch those operations to sqlalchemy-core?
>
> Actually, the scheduler does virtually no SQLAlchemy ORM queries. Almost all database access is serialized from the nova-scheduler through the nova-conductor service via the nova.objects remoting framework.

It does not help you.

>> The scheduler does a lot of things in the application, like filtering, which could be done on the DB level more efficiently. Why is it not done on the DB side?
>
> That's a pretty big generalization. Many filters (check out NUMA configuration, host aggregate extra_specs matching, any of the JSON filters, etc.) don't lend themselves to SQL column-based sorting and filtering.

What a basic SQL query can do and what the limit of SQL is are two different things. Even if you do not move everything to the DB side, the dataset the application needs to deal with could be limited.

>> There are use cases when the scheduler would need to know even more data. Is there a plan for keeping `everything` in all scheduler processes' memory, up-to-date? (Maybe zookeeper)
>
> Zookeeper has nothing to do with scheduling decisions -- only whether or not a compute node's service descriptor is active.
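Attila's point about limiting the dataset on the DB side can be illustrated with a minimal, hypothetical sketch (stdlib sqlite3 with an invented table, not Nova's actual schema): a WHERE clause does the capacity filtering that would otherwise require loading every host row and iterating in Python.

```python
import sqlite3

# Hypothetical schema, purely for illustration of DB-side vs
# application-side filtering; this is not Nova's compute_nodes table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compute_nodes (host TEXT, free_ram_mb INTEGER)")
conn.executemany("INSERT INTO compute_nodes VALUES (?, ?)",
                 [("node-1", 2048), ("node-2", 512), ("node-3", 8192)])

# Application-side filtering: fetch every row, then filter in Python.
all_hosts = conn.execute("SELECT host, free_ram_mb FROM compute_nodes").fetchall()
app_filtered = [h for h, ram in all_hosts if ram >= 1024]

# DB-side filtering: let the WHERE clause limit the dataset instead.
db_filtered = [h for (h,) in conn.execute(
    "SELECT host FROM compute_nodes WHERE free_ram_mb >= ?", (1024,))]

assert app_filtered == db_filtered == ["node-1", "node-3"]
```

Both sides of the thread are visible here: a simple RAM filter maps trivially onto SQL, while filters like the NUMA or JSON ones Jay mentions would not.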
> The end goal (after splitting the Nova scheduler out into Gantt, hopefully at the start of the L release cycle) is to have the Gantt database be more optimized to contain the resource usage amounts of all resources consumed in the entire cloud, and to use partitioning/sharding to scale the scheduler subsystem, instead of having each scheduler process handle requests for all resources in the cloud (or cell...)

What the current optional usage of zookeeper is and what it could be used for are very different things. The resource tracking is possible.

>> The opposite way would be to move most operations to the DB side, since the DB already knows everything. (stored procedures?)
>
> See above. This assumes that the data the scheduler is iterating over is well-structured and consistent, and that is a false assumption.

With stored procedures you can do almost anything, and in many cases it is more readable than a complex query.

> Best,
> -jay

Best Regards,
Attila

----- Original Message -----
From: Rui Chen chenrui.m...@gmail.com
To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org
Sent: Wednesday, March 4, 2015 4:51:07 AM
Subject: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

Hi all,

I want to make it easy to launch a bunch of scheduler processes on a host; multiple scheduler workers will make use of the host's multiple processors and enhance the performance of nova-scheduler. I have registered a blueprint and committed a patch to implement it.

https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support

This patch has been applied in our performance environment and passes some test cases, like concurrent booting of multiple instances; so far we have not found an inconsistency issue. IMO, nova-scheduler should be easy to scale horizontally, and multiple workers should be supported as an out-of-the-box feature.

Please feel free to discuss this feature, thanks.
Best Regards

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
On 03/06/2015 03:19 PM, Attila Fazekas wrote:
> Looks like we need some kind of _per compute node_ mutex in the critical section: multiple schedulers MAY be able to schedule to two compute nodes at the same time, but not to the same compute node.
>
> If we don't want to introduce another required component or reinvent the wheel, there are some possible tricks with the existing globally visible components, like the RDBMS. A `randomized` destination choice is recommended in most of the possible solutions; the alternatives are much more complex.
>
> One SQL example:
> * Add a `sched_cnt` (Integer, default=0) field to a hypervisor-related table.
>
> When the scheduler picks one (or multiple) node(s), it needs to verify that the node(s) are still good before sending the message to the n-cpu. It can be done by re-reading ONLY the picked hypervisor(s)' related data with `LOCK IN SHARE MODE`. If the destination hypervisors are still OK: increase the sched_cnt value by exactly 1, and test whether the UPDATE really updated the required number of rows; the WHERE part needs to contain the previous value. You also need to update the resource usage on the hypervisor by the expected cost of the new VMs. If at least one selected node was OK, the transaction can be COMMITted. If you were able to COMMIT the transaction, the relevant messages can be sent. The whole process needs to be repeated with the items which did not pass the post-verification. If a message send failed, `act like` migrating the VM to another host.
>
> If multiple schedulers try to pick multiple different hosts in a different order, it can lead to a DEADLOCK situation. Solution: try to have all schedulers acquire the shared RW locks in the same order, at the end.
>
> Galera multi-writer (Active-Active) implication: as always, retry on deadlock.
> n-sch + n-cpu crash at the same time:
> * If the scheduling is not finished properly, it might be fixed manually, or we need to solve which still-alive scheduler instance is responsible for fixing the particular scheduling.

So if I am reading the above correctly - you are basically proposing to move claims to the scheduler: we would atomically check whether there were changes since the time we picked the host, with the UPDATE .. WHERE using LOCK IN SHARE MODE (assuming REPEATABLE READ is the isolation level used), and then update the usage, a.k.a. do the claim, in the same transaction. The issue here is that we still have a window between sending the message and the message getting picked up by the compute host (or timing out), or the instance outright failing, so for sure we will need to ack/nack the claim in some way on the compute side.

I believe something like this has come up before under the umbrella term of moving claims to the scheduler, and was discussed in some detail at the latest Nova mid-cycle meetup, but the only artifacts I could find were a few lines on this etherpad Sylvain pointed me to [1] that I am copying here:

* White board the scheduler service interface
** note: this design won't change the existing way/logic of reconciling nova db != hypervisor view
** gantt should just return claim ids, not entire claim objects
** claims are acked as being in use via the resource tracker updates from nova-compute
** we still need scheduler retries for exceptional situations (admins doing things outside openstack, hardware changes / failures)
** retry logic in conductor? probably a separate item/spec

As you can see - not much to go on (but that is material for a separate thread that I may start soon). The problem I have with this particular approach is that while it claims to fix some of the races (and probably does), it does so by 1) turning the current scheduling mechanism on its head and 2) not providing any thought into the trade-offs that it will make.
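The etherpad notes quoted above imply a claim lifecycle: the scheduler returns a claim id after deducting resources, and the compute side later acks the claim (via resource tracker updates) or nacks it so the resources can be freed and the request retried. The sketch below is a rough, hypothetical model of that lifecycle; the `ClaimTable` class and its method names are illustrative and are not Nova's or Gantt's actual API.

```python
import itertools

class ClaimTable:
    """Hypothetical claim registry, sketching the etherpad's ack/nack idea."""
    _ids = itertools.count(1)

    def __init__(self):
        self.claims = {}  # claim_id -> state

    def create(self, host, ram_mb):
        # Resources are considered used as soon as the claim is created,
        # closing the window between scheduling and the compute's reply.
        claim_id = next(self._ids)
        self.claims[claim_id] = "pending"
        return claim_id

    def ack(self, claim_id):
        # The resource tracker on nova-compute confirmed the instance.
        self.claims[claim_id] = "in-use"

    def nack(self, claim_id):
        # Boot failed or timed out: release the claim and retry elsewhere.
        self.claims[claim_id] = "released"

table = ClaimTable()
cid = table.create("node-1", ram_mb=2048)
table.ack(cid)
assert table.claims[cid] == "in-use"
```

This also makes Nikola's concern concrete: every claim now needs a second round-trip (the ack/nack) back to whatever owns this table.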
For example, we may get more correct scheduling in the general case, and the correctness will not be affected by the number of workers, but how does the fact that we now do locking DB access on every request fare against the retry mechanism for some of the more common usage patterns? What is the increased overhead of calling back to the scheduler to confirm the claim? In the end - how do we even measure that we are going in the right direction with the new design?

I personally think that different workloads will have different needs from the scheduler in terms of response times and tolerance to failure, and that we need to design for that. So, as an example, a cloud operator with very simple scheduling requirements may want to go for the no-locking approach and optimize for response times, allowing a small number of instances to fail under high load/utilization due to retries, while others with more complicated scheduling requirements, or less tolerance for data inconsistency, might want to trade response times for doing locking claims in the scheduler. Some similar trade-offs and how to
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
----- Original Message -----
From: Nikola Đipanov ndipa...@redhat.com
To: openstack-dev@lists.openstack.org
Sent: Tuesday, March 10, 2015 10:53:01 AM
Subject: Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

> On 03/06/2015 03:19 PM, Attila Fazekas wrote:
>> Looks like we need some kind of _per compute node_ mutex in the critical section: multiple schedulers MAY be able to schedule to two compute nodes at the same time, but not to the same compute node.
>>
>> If we don't want to introduce another required component or reinvent the wheel, there are some possible tricks with the existing globally visible components, like the RDBMS. A `randomized` destination choice is recommended in most of the possible solutions; the alternatives are much more complex.
>>
>> One SQL example:
>> * Add a `sched_cnt` (Integer, default=0) field to a hypervisor-related table.
>>
>> When the scheduler picks one (or multiple) node(s), it needs to verify that the node(s) are still good before sending the message to the n-cpu. It can be done by re-reading ONLY the picked hypervisor(s)' related data with `LOCK IN SHARE MODE`. If the destination hypervisors are still OK: increase the sched_cnt value by exactly 1, and test whether the UPDATE really updated the required number of rows; the WHERE part needs to contain the previous value. You also need to update the resource usage on the hypervisor by the expected cost of the new VMs. If at least one selected node was OK, the transaction can be COMMITted. If you were able to COMMIT the transaction, the relevant messages can be sent. The whole process needs to be repeated with the items which did not pass the post-verification. If a message send failed, `act like` migrating the VM to another host.
>>
>> If multiple schedulers try to pick multiple different hosts in a different order, it can lead to a DEADLOCK situation. Solution: try to have all schedulers acquire the shared RW locks in the same order, at the end.
>> Galera multi-writer (Active-Active) implication: as always, retry on deadlock.
>>
>> n-sch + n-cpu crash at the same time:
>> * If the scheduling is not finished properly, it might be fixed manually, or we need to solve which still-alive scheduler instance is responsible for fixing the particular scheduling.
>
> So if I am reading the above correctly - you are basically proposing to move claims to the scheduler: we would atomically check whether there were changes since the time we picked the host, with the UPDATE .. WHERE using LOCK IN SHARE MODE (assuming REPEATABLE READ is the isolation level used), and then update the usage, a.k.a. do the claim, in the same transaction. The issue here is that we still have a window between sending the message and the message getting picked up by the compute host (or timing out), or the instance outright failing, so for sure we will need to ack/nack the claim in some way on the compute side.
>
> I believe something like this has come up before under the umbrella term of moving claims to the scheduler, and was discussed in some detail at the latest Nova mid-cycle meetup, but the only artifacts I could find were a few lines on this etherpad Sylvain pointed me to [1] that I am copying here:
>
> * White board the scheduler service interface
> ** note: this design won't change the existing way/logic of reconciling nova db != hypervisor view
> ** gantt should just return claim ids, not entire claim objects
> ** claims are acked as being in use via the resource tracker updates from nova-compute
> ** we still need scheduler retries for exceptional situations (admins doing things outside openstack, hardware changes / failures)
> ** retry logic in conductor? probably a separate item/spec
>
> As you can see - not much to go on (but that is material for a separate thread that I may start soon).

In my example, the resource needs to be considered as used before we get anything back from the compute.
The resource can be `freed` during error handling, hopefully by migrating to another node.

> The problem I have with this particular approach is that while it claims to fix some of the races (and probably does), it does so by 1) turning the current scheduling mechanism on its head and 2) not providing any thought into the trade-offs that it will make. For example, we may get more correct scheduling in the general case, and the correctness will not be affected by the number of workers, but how does the fact that we now do locking DB access on every request fare against the retry mechanism for some of the more common usage patterns? What is the increased overhead of calling back to the scheduler to confirm the claim? In the end - how do we even measure that we are going in the right direction with the new design? I personally think that different workloads will have different needs from the scheduler
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
----- Original Message -----
From: Attila Fazekas afaze...@redhat.com
To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org
Sent: Tuesday, March 10, 2015 12:48:00 PM
Subject: Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

> ----- Original Message -----
> From: Nikola Đipanov ndipa...@redhat.com
> To: openstack-dev@lists.openstack.org
> Sent: Tuesday, March 10, 2015 10:53:01 AM
> Subject: Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
>
>> On 03/06/2015 03:19 PM, Attila Fazekas wrote:
>>> Looks like we need some kind of _per compute node_ mutex in the critical section: multiple schedulers MAY be able to schedule to two compute nodes at the same time, but not to the same compute node.
>>>
>>> If we don't want to introduce another required component or reinvent the wheel, there are some possible tricks with the existing globally visible components, like the RDBMS. A `randomized` destination choice is recommended in most of the possible solutions; the alternatives are much more complex.
>>>
>>> One SQL example:
>>> * Add a `sched_cnt` (Integer, default=0) field to a hypervisor-related table.
>>>
>>> When the scheduler picks one (or multiple) node(s), it needs to verify that the node(s) are still good before sending the message to the n-cpu. It can be done by re-reading ONLY the picked hypervisor(s)' related data with `LOCK IN SHARE MODE`. If the destination hypervisors are still OK: increase the sched_cnt value by exactly 1, and test whether the UPDATE really updated the required number of rows; the WHERE part needs to contain the previous value. You also need to update the resource usage on the hypervisor by the expected cost of the new VMs. If at least one selected node was OK, the transaction can be COMMITted. If you were able to COMMIT the transaction, the relevant messages can be sent. The whole process needs to be repeated with the items which did not pass the post-verification.
>>> If a message send failed, `act like` migrating the VM to another host.
>>>
>>> If multiple schedulers try to pick multiple different hosts in a different order, it can lead to a DEADLOCK situation. Solution: try to have all schedulers acquire the shared RW locks in the same order, at the end.
>>>
>>> Galera multi-writer (Active-Active) implication: as always, retry on deadlock.
>>>
>>> n-sch + n-cpu crash at the same time:
>>> * If the scheduling is not finished properly, it might be fixed manually, or we need to solve which still-alive scheduler instance is responsible for fixing the particular scheduling.
>>
>> So if I am reading the above correctly - you are basically proposing to move claims to the scheduler: we would atomically check whether there were changes since the time we picked the host, with the UPDATE .. WHERE using LOCK IN SHARE MODE (assuming REPEATABLE READ is the isolation level used), and then update the usage, a.k.a. do the claim, in the same transaction. The issue here is that we still have a window between sending the message and the message getting picked up by the compute host (or timing out), or the instance outright failing, so for sure we will need to ack/nack the claim in some way on the compute side.
>>
>> I believe something like this has come up before under the umbrella term of moving claims to the scheduler, and was discussed in some detail at the latest Nova mid-cycle meetup, but the only artifacts I could find were a few lines on this etherpad Sylvain pointed me to [1] that I am copying here:
>>
>> * White board the scheduler service interface
>> ** note: this design won't change the existing way/logic of reconciling nova db != hypervisor view
>> ** gantt should just return claim ids, not entire claim objects
>> ** claims are acked as being in use via the resource tracker updates from nova-compute
>> ** we still need scheduler retries for exceptional situations (admins doing things outside openstack, hardware changes / failures)
>> ** retry logic in conductor?
>> probably a separate item/spec
>>
>> As you can see - not much to go on (but that is material for a separate thread that I may start soon).
>
> In my example, the resource needs to be considered as used before we get anything back from the compute. The resource can be `freed` during error handling, hopefully by migrating to another node.
>
>> The problem I have with this particular approach is that while it claims to fix some of the races (and probably does), it does so by 1) turning the current scheduling mechanism on its head and 2) not providing any thought into the trade-offs that it will make. For example, we may get more correct scheduling in the general case, and the correctness will not be affected
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
Looks like we need some kind of _per compute node_ mutex in the critical section: multiple schedulers MAY be able to schedule to two compute nodes at the same time, but not to the same compute node.

If we don't want to introduce another required component or reinvent the wheel, there are some possible tricks with the existing globally visible components, like the RDBMS. A `randomized` destination choice is recommended in most of the possible solutions; the alternatives are much more complex.

One SQL example:
* Add a `sched_cnt` (Integer, default=0) field to a hypervisor-related table.

When the scheduler picks one (or multiple) node(s), it needs to verify that the node(s) are still good before sending the message to the n-cpu. This can be done by re-reading ONLY the picked hypervisor(s)' related data with `LOCK IN SHARE MODE`. If the destination hypervisors are still OK: increase the sched_cnt value by exactly 1, and test whether the UPDATE really updated the required number of rows; the WHERE part needs to contain the previous value. You also need to update the resource usage on the hypervisor by the expected cost of the new VMs. If at least one selected node was OK, the transaction can be COMMITted. If you were able to COMMIT the transaction, the relevant messages can be sent. The whole process needs to be repeated with the items which did not pass the post-verification. If a message send failed, `act like` migrating the VM to another host.

If multiple schedulers try to pick multiple different hosts in a different order, it can lead to a DEADLOCK situation. Solution: try to have all schedulers acquire the shared RW locks in the same order, at the end.

Galera multi-writer (Active-Active) implication: as always, retry on deadlock.

n-sch + n-cpu crash at the same time:
* If the scheduling is not finished properly, it might be fixed manually, or we need to solve which still-alive scheduler instance is responsible for fixing the particular scheduling.
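The check-and-increment scheme above boils down to an optimistic compare-and-swap UPDATE: bump `sched_cnt` and deduct resources only if nobody else changed the row since we read it. The sketch below uses stdlib sqlite3 purely for illustration, so the InnoDB-specific `LOCK IN SHARE MODE` re-read is omitted and only the compare-and-swap part is shown; the table and column names are hypothetical.

```python
import sqlite3

# Hypothetical hypervisors table with the proposed sched_cnt field.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hypervisors (host TEXT, sched_cnt INTEGER, free_ram_mb INTEGER)")
conn.execute("INSERT INTO hypervisors VALUES ('node-1', 0, 4096)")

def try_claim(conn, host, seen_sched_cnt, ram_mb):
    """Claim resources only if no other scheduler raced us since our read.

    The WHERE clause carries the previously seen sched_cnt, so a
    concurrent claim (which incremented it) makes our UPDATE match
    zero rows -- the signal to re-read and retry, per the scheme above.
    """
    cur = conn.execute(
        "UPDATE hypervisors SET sched_cnt = sched_cnt + 1, "
        "free_ram_mb = free_ram_mb - ? "
        "WHERE host = ? AND sched_cnt = ? AND free_ram_mb >= ?",
        (ram_mb, host, seen_sched_cnt, ram_mb))
    return cur.rowcount == 1

assert try_claim(conn, "node-1", 0, 1024) is True   # first scheduler wins
assert try_claim(conn, "node-1", 0, 1024) is False  # stale sched_cnt loses
```

The second call fails because the first one already moved `sched_cnt` to 1, which is exactly the per-compute-node mutual exclusion the post argues for.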
----- Original Message -----
From: Nikola Đipanov ndipa...@redhat.com
To: openstack-dev@lists.openstack.org
Sent: Friday, March 6, 2015 10:29:52 AM
Subject: Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

> On 03/06/2015 01:56 AM, Rui Chen wrote:
>> Thank you very much for the in-depth discussion about this topic, @Nikola and @Sylvain. I agree that we should solve the technical debt first, and then make the scheduler better.
>
> That was not necessarily my point. I would be happy to see work on how to make the scheduler less volatile when run in parallel, but the solution must acknowledge the eventually (or never really) consistent nature of the data the scheduler has to operate on (in its current design - there is also the possibility of offering an alternative design). I'd say that fixing the technical debt that is aimed at splitting the scheduler out of Nova is a mostly orthogonal effort.
>
> There have been several proposals in the past for how to make the scheduler horizontally scalable and improve its performance. One that I remember from the Atlanta summit time-frame was the work done by Boris and his team [1] (they actually did some profiling and based their work on the bottlenecks they found). There are also some nice ideas in the bug lifeless filed [2], since this behaviour particularly impacts ironic.
>
> N.
>
> [1] https://blueprints.launchpad.net/nova/+spec/no-db-scheduler
> [2] https://bugs.launchpad.net/nova/+bug/1341420
>
>> Best Regards.
>>
>> 2015-03-05 21:12 GMT+08:00 Sylvain Bauza sba...@redhat.com:
>>> Le 05/03/2015 13:00, Nikola Đipanov a écrit :
>>>> On 03/04/2015 09:23 AM, Sylvain Bauza wrote:
>>>>> Le 04/03/2015 04:51, Rui Chen a écrit :
>>>>>> Hi all,
>>>>>> I want to make it easy to launch a bunch of scheduler processes on a host; multiple scheduler workers will make use of the host's multiple processors and enhance the performance of nova-scheduler. I have registered a blueprint and committed a patch to implement it.
https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support

This patch has been applied in our performance environment and passes some test cases, like concurrent booting of multiple instances; so far we have not found an inconsistency issue. IMO, nova-scheduler should be easy to scale horizontally, and multiple workers should be supported as an out-of-the-box feature. Please feel free to discuss this feature, thanks
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
----- Original Message -----
From: Attila Fazekas afaze...@redhat.com
To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org
Sent: Friday, March 6, 2015 4:19:18 PM
Subject: Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

> Looks like we need some kind of _per compute node_ mutex in the critical section: multiple schedulers MAY be able to schedule to two compute nodes at the same time, but not to the same compute node.
>
> If we don't want to introduce another required component or reinvent the wheel, there are some possible tricks with the existing globally visible components, like the RDBMS. A `randomized` destination choice is recommended in most of the possible solutions; the alternatives are much more complex.
>
> One SQL example:
> * Add a `sched_cnt` (Integer, default=0) field to a hypervisor-related table.
>
> When the scheduler picks one (or multiple) node(s), it needs to verify that the node(s) are still good before sending the message to the n-cpu. It can be done by re-reading ONLY the picked hypervisor(s)' related data with `LOCK IN SHARE MODE`. If the destination hypervisors are still OK: increase the sched_cnt value by exactly 1, and test whether the UPDATE really updated the required number of rows; the WHERE part needs to contain the previous value.

This part is very likely not needed if all schedulers need to update the same field regarding the same host, and they acquire the RW lock for reading before they change it to a WRITE lock. Another strategy might consider pre-acquiring the write lock only, but the write intent is not certain before we re-read and verify the data.

> You also need to update the resource usage on the hypervisor by the expected cost of the new VMs. If at least one selected node was OK, the transaction can be COMMITted. If you were able to COMMIT the transaction, the relevant messages can be sent.
> The whole process needs to be repeated with the items which did not pass the post-verification. If a message send failed, `act like` migrating the VM to another host.
>
> If multiple schedulers try to pick multiple different hosts in a different order, it can lead to a DEADLOCK situation. Solution: try to have all schedulers acquire the shared RW locks in the same order, at the end.
>
> Galera multi-writer (Active-Active) implication: as always, retry on deadlock.
>
> n-sch + n-cpu crash at the same time:
> * If the scheduling is not finished properly, it might be fixed manually, or we need to solve which still-alive scheduler instance is responsible for fixing the particular scheduling.

----- Original Message -----
From: Nikola Đipanov ndipa...@redhat.com
To: openstack-dev@lists.openstack.org
Sent: Friday, March 6, 2015 10:29:52 AM
Subject: Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

> On 03/06/2015 01:56 AM, Rui Chen wrote:
>> Thank you very much for the in-depth discussion about this topic, @Nikola and @Sylvain. I agree that we should solve the technical debt first, and then make the scheduler better.
>
> That was not necessarily my point. I would be happy to see work on how to make the scheduler less volatile when run in parallel, but the solution must acknowledge the eventually (or never really) consistent nature of the data the scheduler has to operate on (in its current design - there is also the possibility of offering an alternative design). I'd say that fixing the technical debt that is aimed at splitting the scheduler out of Nova is a mostly orthogonal effort.
>
> There have been several proposals in the past for how to make the scheduler horizontally scalable and improve its performance. One that I remember from the Atlanta summit time-frame was the work done by Boris and his team [1] (they actually did some profiling and based their work on the bottlenecks they found).
> There are also some nice ideas in the bug lifeless filed [2], since this behaviour particularly impacts ironic.
>
> N.
>
> [1] https://blueprints.launchpad.net/nova/+spec/no-db-scheduler
> [2] https://bugs.launchpad.net/nova/+bug/1341420
>
>> Best Regards.
>>
>> 2015-03-05 21:12 GMT+08:00 Sylvain Bauza sba...@redhat.com:
>>> Le 05/03/2015 13:00, Nikola Đipanov a écrit :
>>>> On 03/04/2015 09:23 AM, Sylvain Bauza wrote:
>>>>> Le 04/03/2015 04:51, Rui Chen a écrit :
>>>>>> Hi all,
>>>>>> I want to make it easy to launch a bunch of scheduler processes on a host; multiple scheduler workers will make use of the host's multiple processors and enhance the performance of nova-scheduler. I have registered a blueprint and committed a patch to implement
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
Hi,

Just some oslo.messaging thoughts about having multiple nova-scheduler processes (this can also apply to any other daemon acting as an rpc server). nova-scheduler uses service.Service.create() to create an rpc server, identified by a 'topic' and a 'server' (the oslo.messaging.Target). Creating multiple workers as [1] does will result in all workers sharing the same identity, because the 'server' is usually set to the 'hostname', to make our lives easier.

With rabbitmq, for example, the 'server' attribute of the oslo.messaging.Target is used for a queue name; you usually have the following queues created:

scheduler
scheduler.scheduler-node-1
scheduler.scheduler-node-2
scheduler.scheduler-node-3
...

Keeping things as-is will mean that messages going to scheduler.scheduler-node-1 will be processed randomly by the first ready worker, and you will not be able to identify workers from the amqp point of view. The side effect is that if a worker gets stuck, through a bug or whatever, and doesn't consume messages anymore, we will not be able to see it: one of the other workers will continue to announce that scheduler-node-1 works and consume new messages, even if all of them are dead/stuck except one.

So I think each rpc server (each worker) should have a different 'server', to get amqp queues like:

scheduler
scheduler.scheduler-node-1-worker-1
scheduler.scheduler-node-1-worker-2
scheduler.scheduler-node-1-worker-3
scheduler.scheduler-node-2-worker-1
scheduler.scheduler-node-2-worker-2
scheduler.scheduler-node-3-worker-1
scheduler.scheduler-node-3-worker-2
...

Cheers,

[1] https://review.openstack.org/#/c/159382/

---
Mehdi Abaakouk
mail: sil...@sileht.net
irc: sileht
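Mehdi's per-worker naming scheme can be sketched with plain strings, assuming the `topic.server` queue-name convention he describes (no actual oslo.messaging calls are made here; the helper function is illustrative only):

```python
# Sketch: derive a distinct amqp queue name per worker, so each worker is
# individually visible instead of all workers draining one shared queue.
# Assumes the rabbitmq naming convention `topic` and `topic.server`
# described in the mail above.
def worker_queues(topic, hosts, workers_per_host):
    queues = [topic]  # the shared topic queue stays as-is
    for host in hosts:
        for worker in range(1, workers_per_host + 1):
            # One queue per worker: a stuck worker now leaves a visibly
            # growing queue instead of being masked by its siblings.
            queues.append(f"{topic}.{host}-worker-{worker}")
    return queues

qs = worker_queues("scheduler", ["scheduler-node-1", "scheduler-node-2"], 2)
assert qs[0] == "scheduler"
assert "scheduler.scheduler-node-1-worker-2" in qs
```

In real code the per-worker suffix would go into the `server` attribute of the oslo.messaging Target when each forked worker creates its rpc server.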
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
On 03/06/2015 01:56 AM, Rui Chen wrote:
> Thank you very much for the in-depth discussion about this topic, @Nikola and @Sylvain. I agree that we should solve the technical debt first, and then make the scheduler better.

That was not necessarily my point. I would be happy to see work on how to make the scheduler less volatile when run in parallel, but the solution must acknowledge the eventually (or never really) consistent nature of the data the scheduler has to operate on (in its current design - there is also the possibility of offering an alternative design). I'd say that fixing the technical debt that is aimed at splitting the scheduler out of Nova is a mostly orthogonal effort.

There have been several proposals in the past for how to make the scheduler horizontally scalable and improve its performance. One that I remember from the Atlanta summit time-frame was the work done by Boris and his team [1] (they actually did some profiling and based their work on the bottlenecks they found). There are also some nice ideas in the bug lifeless filed [2], since this behaviour particularly impacts ironic.

N.

[1] https://blueprints.launchpad.net/nova/+spec/no-db-scheduler
[2] https://bugs.launchpad.net/nova/+bug/1341420

> Best Regards.
>
> 2015-03-05 21:12 GMT+08:00 Sylvain Bauza sba...@redhat.com:
>> Le 05/03/2015 13:00, Nikola Đipanov a écrit :
>>> On 03/04/2015 09:23 AM, Sylvain Bauza wrote:
>>>> Le 04/03/2015 04:51, Rui Chen a écrit :
>>>>> Hi all,
>>>>> I want to make it easy to launch a bunch of scheduler processes on a host; multiple scheduler workers will make use of the host's multiple processors and enhance the performance of nova-scheduler. I have registered a blueprint and committed a patch to implement it.
https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support This patch has been applied in our performance environment and passed some test cases, like concurrent booting of multiple instances; so far we didn't find any inconsistency issues. IMO, nova-scheduler should be easy to scale horizontally, and multiple workers should be supported as an out-of-the-box feature. Please feel free to discuss this feature, thanks. As I said when reviewing your patch, I think the problem is not just making sure that the scheduler is thread-safe; it's more about how the Scheduler is accounting resources and providing a retry if those consumed resources are higher than what's available. Here, the main problem is that two workers can actually consume two distinct resources on the same HostState object. In that case, the HostState object is decremented by the number of taken resources (modulo the question of what it means to consume a resource which is not an Integer...) for both, but nowhere in that section does it check whether that overruns the resource usage. As I said, it's not just about decorating a semaphore; it's more about rethinking how the Scheduler is managing its resources. That's why I'm -1 on your patch until [1] gets merged. Once this BP is implemented, we will have a set of classes for managing heterogeneous types of resources and consuming them, so it would be quite easy to provide a check against them in the consume_from_instance() method. I feel that the above explanation does not give the full picture, in addition to being factually incorrect in several places. I have come to realize that the current behaviour of the scheduler is subtle enough that just reading the code is not enough to understand all the edge cases that can come up. The evidence being that it trips up even people that have spent significant time working on the code.
It is also important to consider the design choices in terms of the tradeoffs that they were trying to make. So here are some facts about the way Nova does scheduling of instances to compute hosts, considering the amount of resources requested by the flavor (we will try to put the facts into a bigger picture later): * Scheduler receives a request to choose hosts for one or more instances. * Upon every request (_not_ for every instance, as there may be several instances in a request) the scheduler learns the state of the resources on all compute nodes from the central DB.
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
Thank you very much for the in-depth discussion about this topic, @Nikola and @Sylvain. I agree that we should solve the technical debt first, and then make the scheduler better. Best Regards. 2015-03-05 21:12 GMT+08:00 Sylvain Bauza sba...@redhat.com: Le 05/03/2015 13:00, Nikola Đipanov a écrit : On 03/04/2015 09:23 AM, Sylvain Bauza wrote: Le 04/03/2015 04:51, Rui Chen a écrit : Hi all, I want to make it easy to launch a bunch of scheduler processes on a host; multiple scheduler workers will make use of the host's multiple processors and enhance the performance of nova-scheduler. I registered a blueprint and committed a patch to implement it. https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support This patch has been applied in our performance environment and passed some test cases, like concurrent booting of multiple instances; so far we didn't find any inconsistency issues. IMO, nova-scheduler should be easy to scale horizontally, and multiple workers should be supported as an out-of-the-box feature. Please feel free to discuss this feature, thanks. As I said when reviewing your patch, I think the problem is not just making sure that the scheduler is thread-safe; it's more about how the Scheduler is accounting resources and providing a retry if those consumed resources are higher than what's available. Here, the main problem is that two workers can actually consume two distinct resources on the same HostState object. In that case, the HostState object is decremented by the number of taken resources (modulo the question of what it means to consume a resource which is not an Integer...) for both, but nowhere in that section does it check whether that overruns the resource usage. As I said, it's not just about decorating a semaphore; it's more about rethinking how the Scheduler is managing its resources. That's why I'm -1 on your patch until [1] gets merged.
Once this BP is implemented, we will have a set of classes for managing heterogeneous types of resources and consuming them, so it would be quite easy to provide a check against them in the consume_from_instance() method. I feel that the above explanation does not give the full picture, in addition to being factually incorrect in several places. I have come to realize that the current behaviour of the scheduler is subtle enough that just reading the code is not enough to understand all the edge cases that can come up. The evidence being that it trips up even people that have spent significant time working on the code. It is also important to consider the design choices in terms of the tradeoffs that they were trying to make. So here are some facts about the way Nova does scheduling of instances to compute hosts, considering the amount of resources requested by the flavor (we will try to put the facts into a bigger picture later): * Scheduler receives a request to choose hosts for one or more instances. * Upon every request (_not_ for every instance, as there may be several instances in a request) the scheduler learns the state of the resources on all compute nodes from the central DB. This state may be inaccurate (meaning out of date). * Compute resources are updated by each compute host periodically. This is done by updating the row in the DB. * The wall-clock time difference between the scheduler deciding to schedule an instance, and the resource consumption being reflected in the data the scheduler learns from the DB, can be arbitrarily long (due to load on the compute nodes and latency of message arrival). * To cope with the above, there is a concept of retrying a request that fails on a certain compute node due to the scheduling decision being made with data stale at the moment of build; by default we will retry 3 times before giving up.
* When running multiple instances, decisions are made in a loop, and an internal in-memory view of the resources gets updated (the widely misunderstood consume_from_instance method is used for this), so as to keep subsequent decisions as accurate as possible. As was described above, this is all thrown away once the request is finished. Now that we understand the above, we can start to consider what changes when we introduce several concurrent scheduler processes. Several cases come to mind: * Concurrent requests will no longer be serialized on reading the state of all hosts (due to how eventlet interacts with the mysql driver). * In the presence of a single request for a large number of instances, there is going to be a drift in the accuracy of the decisions made by other schedulers, as they will not have accounted for any of the instances until those actually get claimed on their respective hosts. All of the above limitations will likely not pose a problem under normal load and usage, but issues can start appearing when nodes are close to full or when there is heavy load. Also, this changes drastically based on how we actually choose to utilize hosts (see a very interesting Ironic bug [1])
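The per-request loop and in-memory accounting described in the bullets above can be sketched as a toy, stdlib-only illustration. The class and field names below are simplified stand-ins, not Nova's actual code:

```python
# Toy sketch of the scheduler's per-request in-memory accounting:
# each decision in the loop decrements a local HostState view so later
# decisions in the SAME request see it; the view is discarded afterwards.
class HostState:
    def __init__(self, host, free_ram_mb, free_vcpus):
        self.host = host
        self.free_ram_mb = free_ram_mb
        self.free_vcpus = free_vcpus

    def consume_from_instance(self, flavor):
        # Update only this process's local view of the host.
        self.free_ram_mb -= flavor["ram_mb"]
        self.free_vcpus -= flavor["vcpus"]

def schedule(hosts, flavors):
    """Pick a host per instance, updating the in-memory view as we go."""
    placements = []
    for flavor in flavors:
        candidates = [h for h in hosts
                      if h.free_ram_mb >= flavor["ram_mb"]
                      and h.free_vcpus >= flavor["vcpus"]]
        if not candidates:
            placements.append(None)  # the real system would retry elsewhere
            continue
        best = max(candidates, key=lambda h: h.free_ram_mb)
        best.consume_from_instance(flavor)
        placements.append(best.host)
    return placements

hosts = [HostState("node-1", 4096, 4), HostState("node-2", 2048, 2)]
print(schedule(hosts, [{"ram_mb": 2048, "vcpus": 2}] * 3))
# -> ['node-1', 'node-1', 'node-2']
```

A second scheduler process holding its own copy of `hosts` would never see these decrements until the instances are claimed on the compute nodes, which is exactly the accuracy drift described above.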
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
Le 05/03/2015 08:54, Rui Chen a écrit : We will face the same issue in the multiple nova-scheduler process case, like Sylvain says, right? Two processes/workers can actually consume two distinct resources on the same HostState. No. The problem I mentioned was related to having multiple threads accessing the same object in memory. Running multiple schedulers on different hosts and listening to the same RPC topic would work - with some caveats about race conditions too, but that's unrelated to your proposal. If you want to run multiple nova-scheduler services, then just fire them up on separate machines (that's HA, eh) and that will work. -Sylvain 2015-03-05 13:26 GMT+08:00 Alex Xu sou...@gmail.com: Rui, you can still run multiple nova-scheduler processes now. 2015-03-05 10:55 GMT+08:00 Rui Chen chenrui.m...@gmail.com: Looks like it's a complicated problem, and nova-scheduler can't scale out horizontally in active/active mode. Maybe we should illustrate the problem in the HA docs. http://docs.openstack.org/high-availability-guide/content/_schedulers.html Thanks for everybody's attention. 2015-03-05 5:38 GMT+08:00 Mike Bayer mba...@redhat.com: Attila Fazekas afaze...@redhat.com wrote: Hi, I wonder what the planned future of scheduling is. The scheduler does a lot of queries over a high number of fields, which is CPU-expensive when you are using sqlalchemy-orm. Has anyone tried to switch those operations to sqlalchemy-core?
An upcoming feature in SQLAlchemy 1.0 will remove the vast majority of CPU overhead from the query side of SQLAlchemy ORM by caching all the work done up until the SQL is emitted, including all the function overhead of building up the Query object, producing a core select() object internally from the Query, working out a large part of the object fetch strategies, and finally the string compilation of the select() into a string, as well as organizing the typing information for result columns. With a query that is constructed using the “Baked” feature, all of these steps are cached in memory and held persistently; the same query can then be re-used, at which point all of these steps are skipped. The system produces the cache key based on the in-place construction of the Query using lambdas, so no major changes to code structure are needed; just the way the Query modifications are performed needs to be preceded with “lambda q:”, essentially. With this approach, the traditional session.query(Model) approach can go from start to SQL being emitted with an order of magnitude fewer function calls. On the fetch side, fetching individual columns instead of full entities has always been an option with ORM and is about the same speed as a Core fetch of rows. So using ORM with minimal changes to existing ORM code you can get performance even better than you’d get using Core directly, since caching of the string compilation is also added. On the persist side, the new bulk insert / update features provide a bridge from ORM-mapped objects to bulk inserts/updates without any unit of work sorting going on. ORM-mapped objects are still more expensive to use in that instantiation and state change is still more expensive, but bulk insert/update accepts dictionaries as well, which again is competitive with a straight Core insert.
Both of these features are completed in the master branch; the “baked query” feature just needs documentation, and I’m basically two or three tickets away from beta releases of 1.0. The “Baked” feature itself lives as an extension and, if we really wanted, I could backport it into oslo.db as well so that it works against 0.9. So I’d ask that folks please hold off on any kind of migration from ORM to Core for performance reasons. I’ve spent the past several months adding features directly to SQLAlchemy that allow an ORM-based app to have routes to operations that perform just as fast as that of Core without a rewrite of code.
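For readers curious how lambda-based caching can skip the construction work, here is a stdlib-only toy illustrating the principle. This is not the sqlalchemy.ext.baked API — the class and names below are invented for illustration; only the idea (lambdas' code objects as a cache key, expensive building done once) mirrors the description above:

```python
# Toy "bakery": the query construction is expressed as lambdas whose code
# objects form the cache key, so the expensive string-building runs once
# and is skipped on re-use. Illustrative only; not SQLAlchemy's real API.
class Bakery:
    def __init__(self):
        self._cache = {}
        self.compile_count = 0  # counts how often the expensive work ran

    def bake(self, steps):
        # Same construction at the same source locations => same key.
        key = tuple(fn.__code__ for fn in steps)
        if key not in self._cache:
            self.compile_count += 1
            sql = ""
            for fn in steps:
                sql = fn(sql)  # each lambda appends its clause
            self._cache[key] = sql
        return self._cache[key]

bakery = Bakery()
steps = [lambda q: q + "SELECT * FROM compute_nodes",
         lambda q: q + " WHERE free_ram_mb > :ram"]
first = bakery.bake(steps)
second = bakery.bake(steps)  # cache hit: construction skipped entirely
```

The real feature caches far more than a string (the core select(), fetch strategies, result typing), but the shape is the same: the per-call cost collapses to a dictionary lookup.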
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
My BP aims at launching multiple nova-scheduler processes on a host, like nova-conductor. If running multiple nova-scheduler services on separate hosts works, will forking multiple nova-scheduler child processes on a host work too? Different child processes have different HostState objects in their own memory; the only difference from HA is that all the scheduler processes are launched on one host. I'm sorry to take up some time, I just want to clarify it. 2015-03-05 17:12 GMT+08:00 Sylvain Bauza sba...@redhat.com: Le 05/03/2015 08:54, Rui Chen a écrit : We will face the same issue in the multiple nova-scheduler process case, like Sylvain says, right? Two processes/workers can actually consume two distinct resources on the same HostState. No. The problem I mentioned was related to having multiple threads accessing the same object in memory. Running multiple schedulers on different hosts and listening to the same RPC topic would work - with some caveats about race conditions too, but that's unrelated to your proposal. If you want to run multiple nova-scheduler services, then just fire them up on separate machines (that's HA, eh) and that will work. -Sylvain 2015-03-05 13:26 GMT+08:00 Alex Xu sou...@gmail.com: Rui, you can still run multiple nova-scheduler processes now. 2015-03-05 10:55 GMT+08:00 Rui Chen chenrui.m...@gmail.com: Looks like it's a complicated problem, and nova-scheduler can't scale out horizontally in active/active mode. Maybe we should illustrate the problem in the HA docs. http://docs.openstack.org/high-availability-guide/content/_schedulers.html Thanks for everybody's attention. 2015-03-05 5:38 GMT+08:00 Mike Bayer mba...@redhat.com: Attila Fazekas afaze...@redhat.com wrote: Hi, I wonder what the planned future of scheduling is. The scheduler does a lot of queries over a high number of fields, which is CPU-expensive when you are using sqlalchemy-orm. Has anyone tried to switch those operations to sqlalchemy-core?
An upcoming feature in SQLAlchemy 1.0 will remove the vast majority of CPU overhead from the query side of SQLAlchemy ORM by caching all the work done up until the SQL is emitted, including all the function overhead of building up the Query object, producing a core select() object internally from the Query, working out a large part of the object fetch strategies, and finally the string compilation of the select() into a string, as well as organizing the typing information for result columns. With a query that is constructed using the “Baked” feature, all of these steps are cached in memory and held persistently; the same query can then be re-used, at which point all of these steps are skipped. The system produces the cache key based on the in-place construction of the Query using lambdas, so no major changes to code structure are needed; just the way the Query modifications are performed needs to be preceded with “lambda q:”, essentially. With this approach, the traditional session.query(Model) approach can go from start to SQL being emitted with an order of magnitude fewer function calls. On the fetch side, fetching individual columns instead of full entities has always been an option with ORM and is about the same speed as a Core fetch of rows. So using ORM with minimal changes to existing ORM code you can get performance even better than you’d get using Core directly, since caching of the string compilation is also added. On the persist side, the new bulk insert / update features provide a bridge from ORM-mapped objects to bulk inserts/updates without any unit of work sorting going on. ORM-mapped objects are still more expensive to use in that instantiation and state change is still more expensive, but bulk insert/update accepts dictionaries as well, which again is competitive with a straight Core insert.
Both of these features are completed in the master branch; the “baked query” feature just needs documentation, and I’m basically two or three tickets away from beta releases of 1.0. The “Baked” feature itself lives as an extension and, if we really wanted, I could backport it into oslo.db as well so that it works against 0.9. So I’d ask that folks please hold off on any kind of migration from ORM to Core for performance reasons. I’ve spent the past several months adding features directly to SQLAlchemy that allow an ORM-based app to have routes to operations that perform just as fast as that of Core without a rewrite of code. The scheduler does a lot of things in the application, like filtering, that could be done on the DB level more efficiently. Why is it not done on the DB side? There are use cases when the scheduler would need to know even more data. Is there a plan for keeping `everything` in all schedulers' process memory up-to-date? (Maybe zookeeper) The opposite way would be to move most operations to the DB side, since the DB already knows everything. (stored procedures?)
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
On 03/04/2015 09:23 AM, Sylvain Bauza wrote: Le 04/03/2015 04:51, Rui Chen a écrit : Hi all, I want to make it easy to launch a bunch of scheduler processes on a host; multiple scheduler workers will make use of the host's multiple processors and enhance the performance of nova-scheduler. I registered a blueprint and committed a patch to implement it. https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support This patch has been applied in our performance environment and passed some test cases, like concurrent booting of multiple instances; so far we didn't find any inconsistency issues. IMO, nova-scheduler should be easy to scale horizontally, and multiple workers should be supported as an out-of-the-box feature. Please feel free to discuss this feature, thanks. As I said when reviewing your patch, I think the problem is not just making sure that the scheduler is thread-safe; it's more about how the Scheduler is accounting resources and providing a retry if those consumed resources are higher than what's available. Here, the main problem is that two workers can actually consume two distinct resources on the same HostState object. In that case, the HostState object is decremented by the number of taken resources (modulo the question of what it means to consume a resource which is not an Integer...) for both, but nowhere in that section does it check whether that overruns the resource usage. As I said, it's not just about decorating a semaphore; it's more about rethinking how the Scheduler is managing its resources. That's why I'm -1 on your patch until [1] gets merged. Once this BP is implemented, we will have a set of classes for managing heterogeneous types of resources and consuming them, so it would be quite easy to provide a check against them in the consume_from_instance() method. I feel that the above explanation does not give the full picture, in addition to being factually incorrect in several places.
I have come to realize that the current behaviour of the scheduler is subtle enough that just reading the code is not enough to understand all the edge cases that can come up. The evidence being that it trips up even people that have spent significant time working on the code. It is also important to consider the design choices in terms of the tradeoffs that they were trying to make. So here are some facts about the way Nova does scheduling of instances to compute hosts, considering the amount of resources requested by the flavor (we will try to put the facts into a bigger picture later): * Scheduler receives a request to choose hosts for one or more instances. * Upon every request (_not_ for every instance, as there may be several instances in a request) the scheduler learns the state of the resources on all compute nodes from the central DB. This state may be inaccurate (meaning out of date). * Compute resources are updated by each compute host periodically. This is done by updating the row in the DB. * The wall-clock time difference between the scheduler deciding to schedule an instance, and the resource consumption being reflected in the data the scheduler learns from the DB, can be arbitrarily long (due to load on the compute nodes and latency of message arrival). * To cope with the above, there is a concept of retrying a request that fails on a certain compute node due to the scheduling decision being made with data stale at the moment of build; by default we will retry 3 times before giving up. * When running multiple instances, decisions are made in a loop, and an internal in-memory view of the resources gets updated (the widely misunderstood consume_from_instance method is used for this), so as to keep subsequent decisions as accurate as possible. As was described above, this is all thrown away once the request is finished. Now that we understand the above, we can start to consider what changes when we introduce several concurrent scheduler processes.
Several cases come to mind: * Concurrent requests will no longer be serialized on reading the state of all hosts (due to how eventlet interacts with the mysql driver). * In the presence of a single request for a large number of instances, there is going to be a drift in the accuracy of the decisions made by other schedulers, as they will not have accounted for any of the instances until those actually get claimed on their respective hosts. All of the above limitations will likely not pose a problem under normal load and usage, but issues can start appearing when nodes are close to full or when there is heavy load. Also, this changes drastically based on how we actually choose to utilize hosts (see a very interesting Ironic bug [1]). Whether any of the above matters to users depends heavily on their use-case, though. This is why I feel we should be providing more information. Finally - I think it is important to accept that the scheduler service will always have to operate under the assumptions of stale data, and
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
Attila Fazekas afaze...@redhat.com wrote: Hi, I wonder what the planned future of scheduling is. The scheduler does a lot of queries over a high number of fields, which is CPU-expensive when you are using sqlalchemy-orm. Has anyone tried to switch those operations to sqlalchemy-core? An upcoming feature in SQLAlchemy 1.0 will remove the vast majority of CPU overhead from the query side of SQLAlchemy ORM by caching all the work done up until the SQL is emitted, including all the function overhead of building up the Query object, producing a core select() object internally from the Query, working out a large part of the object fetch strategies, and finally the string compilation of the select() into a string, as well as organizing the typing information for result columns. With a query that is constructed using the “Baked” feature, all of these steps are cached in memory and held persistently; the same query can then be re-used, at which point all of these steps are skipped. The system produces the cache key based on the in-place construction of the Query using lambdas, so no major changes to code structure are needed; just the way the Query modifications are performed needs to be preceded with “lambda q:”, essentially. With this approach, the traditional session.query(Model) approach can go from start to SQL being emitted with an order of magnitude fewer function calls. On the fetch side, fetching individual columns instead of full entities has always been an option with ORM and is about the same speed as a Core fetch of rows. So using ORM with minimal changes to existing ORM code you can get performance even better than you’d get using Core directly, since caching of the string compilation is also added. On the persist side, the new bulk insert / update features provide a bridge from ORM-mapped objects to bulk inserts/updates without any unit of work sorting going on.
ORM-mapped objects are still more expensive to use in that instantiation and state change is still more expensive, but bulk insert/update accepts dictionaries as well, which again is competitive with a straight Core insert. Both of these features are completed in the master branch; the “baked query” feature just needs documentation, and I’m basically two or three tickets away from beta releases of 1.0. The “Baked” feature itself lives as an extension and, if we really wanted, I could backport it into oslo.db as well so that it works against 0.9. So I’d ask that folks please hold off on any kind of migration from ORM to Core for performance reasons. I’ve spent the past several months adding features directly to SQLAlchemy that allow an ORM-based app to have routes to operations that perform just as fast as that of Core without a rewrite of code. The scheduler does a lot of things in the application, like filtering, that could be done on the DB level more efficiently. Why is it not done on the DB side? There are use cases when the scheduler would need to know even more data. Is there a plan for keeping `everything` in all schedulers' process memory up-to-date? (Maybe zookeeper) The opposite way would be to move most operations to the DB side, since the DB already knows everything. (stored procedures?) Best Regards, Attila - Original Message - From: Rui Chen chenrui.m...@gmail.com To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org Sent: Wednesday, March 4, 2015 4:51:07 AM Subject: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler Hi all, I want to make it easy to launch a bunch of scheduler processes on a host; multiple scheduler workers will make use of the host's multiple processors and enhance the performance of nova-scheduler. I registered a blueprint and committed a patch to implement it.
https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support This patch has been applied in our performance environment and passed some test cases, like concurrent booting of multiple instances; so far we didn't find any inconsistency issues. IMO, nova-scheduler should be easy to scale horizontally, and multiple workers should be supported as an out-of-the-box feature. Please feel free to discuss this feature, thanks. Best Regards
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
On 03/04/2015 01:51 AM, Attila Fazekas wrote: Hi, I wonder what the planned future of scheduling is. The scheduler does a lot of queries over a high number of fields, which is CPU-expensive when you are using sqlalchemy-orm. Has anyone tried to switch those operations to sqlalchemy-core? Actually, the scheduler does virtually no SQLAlchemy ORM queries. Almost all database access is serialized from the nova-scheduler through the nova-conductor service via the nova.objects remoting framework. The scheduler does a lot of things in the application, like filtering, that could be done on the DB level more efficiently. Why is it not done on the DB side? That's a pretty big generalization. Many filters (check out NUMA configuration, host aggregate extra_specs matching, any of the JSON filters, etc) don't lend themselves to SQL column-based sorting and filtering. There are use cases when the scheduler would need to know even more data. Is there a plan for keeping `everything` in all schedulers' process memory up-to-date? (Maybe zookeeper) Zookeeper has nothing to do with scheduling decisions -- only whether or not a compute node's service descriptor is active. The end goal (after splitting the Nova scheduler out into Gantt, hopefully at the start of the L release cycle) is to have the Gantt database be more optimized to contain the resource usage amounts of all resources consumed in the entire cloud, and to use partitioning/sharding to scale the scheduler subsystem, instead of having each scheduler process handle requests for all resources in the cloud (or cell...) The opposite way would be to move most operations to the DB side, since the DB already knows everything. (stored procedures?) See above. This assumes that the data the scheduler is iterating over is well-structured and consistent, and that is a false assumption.
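To illustrate why such filters live in the application: a JSON-style filter evaluates an arbitrary boolean expression tree against each host's state, which has no direct SQL column-predicate equivalent. Here is a toy sketch; the mini-language below is illustrative only and is not Nova's actual JsonFilter syntax:

```python
# Toy recursive evaluator for a JSON-style host filter expression.
# Hosts are plain dicts here; real HostState data is richer and less
# uniform, which is exactly why this resists a SQL translation.
def evaluate(expr, host):
    op = expr[0]
    if op == "and":
        return all(evaluate(e, host) for e in expr[1:])
    if op == "or":
        return any(evaluate(e, host) for e in expr[1:])
    if op == ">=":
        return host.get(expr[1], 0) >= expr[2]
    if op == "in":
        return expr[2] in host.get(expr[1], [])
    raise ValueError("unknown op %r" % op)

hosts = [
    {"host": "node-1", "free_ram_mb": 8192, "capabilities": ["ssd"]},
    {"host": "node-2", "free_ram_mb": 1024, "capabilities": []},
]
# "at least 2 GB free AND reports the ssd capability"
expr = ["and", [">=", "free_ram_mb", 2048], ["in", "capabilities", "ssd"]]
passing = [h["host"] for h in hosts if evaluate(expr, h)]
print(passing)  # -> ['node-1']
```

Nested and/or trees over arbitrary, loosely-structured host attributes are trivial in application code, but would require dynamic SQL generation (or stored procedures) to push down to the database.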
Best, -jay
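Jay's point about filters such as the JSON filters is that the predicate is an arbitrary nested expression evaluated per host in Python, which has no direct translation to a SQL WHERE clause. A toy evaluator makes that concrete — this is illustrative only, not nova's real filter API, and the host fields are made up:

```python
# Evaluate a nested ["and"/"or"/"not"/op, ...] query tree against a
# host dict -- the kind of per-host predicate that resists SQL.
def matches(host, query):
    op = query[0]
    if op == "and":
        return all(matches(host, q) for q in query[1:])
    if op == "or":
        return any(matches(host, q) for q in query[1:])
    if op == "not":
        return not matches(host, query[1])
    key, value = query[1], query[2]
    if op == ">=":
        return host[key] >= value
    if op == "==":
        return host[key] == value
    raise ValueError("unknown op %r" % op)

hosts = [
    {"name": "node1", "free_ram_mb": 4096, "hypervisor": "kvm"},
    {"name": "node2", "free_ram_mb": 1024, "hypervisor": "kvm"},
]
query = ["and", [">=", "free_ram_mb", 2048], ["==", "hypervisor", "kvm"]]
print([h["name"] for h in hosts if matches(h, query)])  # ['node1']
```

Because the query shape is data, not schema, pushing it into a fixed SQL statement would require generating SQL per request — which is part of why the filtering stays application-side.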
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
Looks like it's a complicated problem, and nova-scheduler can't scale out horizontally in active/active mode. Maybe we should illustrate the problem in the HA docs. http://docs.openstack.org/high-availability-guide/content/_schedulers.html Thanks for everybody's attention.

2015-03-05 5:38 GMT+08:00 Mike Bayer mba...@redhat.com:

Attila Fazekas afaze...@redhat.com wrote: Hi, I wonder what the planned future of the scheduling is. The scheduler does a lot of high-field-count queries, which are CPU-expensive when you are using the SQLAlchemy ORM. Has anyone tried to switch those operations to SQLAlchemy Core?

An upcoming feature in SQLAlchemy 1.0 will remove the vast majority of CPU overhead from the query side of the SQLAlchemy ORM by caching all the work done up until the SQL is emitted, including all the function overhead of building up the Query object, producing a Core select() object internally from the Query, working out a large part of the object fetch strategies, and finally the string compilation of the select() into a string, as well as organizing the typing information for result columns.

With a query that is constructed using the “Baked” feature, all of these steps are cached in memory and held persistently; the same query can then be re-used, at which point all of these steps are skipped. The system produces the cache key based on the in-place construction of the Query using lambdas, so no major changes to code structure are needed; essentially, just the way the Query modifications are performed needs to be preceded with “lambda q:”.
With this approach, the traditional session.query(Model) approach can go from start to SQL being emitted with an order of magnitude fewer function calls. On the fetch side, fetching individual columns instead of full entities has always been an option with the ORM and is about the same speed as a Core fetch of rows. So using the ORM with minimal changes to existing ORM code, you can get performance even better than you'd get using Core directly, since caching of the string compilation is also added.

On the persist side, the new bulk insert/update features provide a bridge from ORM-mapped objects to bulk inserts/updates without any unit-of-work sorting going on. ORM-mapped objects are still more expensive to use, in that instantiation and state change are still more expensive, but bulk insert/update accepts dictionaries as well, which again is competitive with a straight Core insert.

Both of these features are complete in the master branch; the “baked query” feature just needs documentation, and I'm basically two or three tickets away from beta releases of 1.0. The “Baked” feature itself lives as an extension, and if we really wanted, I could backport it into oslo.db as well so that it works against 0.9.

So I'd ask that folks please hold off on any kind of migration from ORM to Core for performance reasons. I've spent the past several months adding features directly to SQLAlchemy that allow an ORM-based app to have routes to operations that perform just as fast as Core, without a rewrite of code.

The scheduler does a lot of things in the application, like filtering, that could be done at the DB level more efficiently. Why is it not done on the DB side? There are use cases where the scheduler would need to know even more data. Is there a plan for keeping `everything` in every scheduler process's memory up to date? (Maybe ZooKeeper?) The opposite way would be to move most operations to the DB side, since the DB already knows everything. (Stored procedures?)
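The caching mechanism Mike describes can be illustrated with a toy re-implementation — this is *not* the real sqlalchemy.ext.baked API, just the idea: the lambdas' code objects form a stable cache key, so the expensive build-and-compile step runs once per query shape no matter how often the query executes.

```python
# Toy sketch of the "baked query" caching idea (names are invented).
class Bakery:
    def __init__(self):
        self._cache = {}
        self.compile_count = 0  # counts the expensive compilations

    def _compile(self, steps):
        # Stand-in for building a Query, producing a select(), and
        # compiling it to a SQL string.
        self.compile_count += 1
        sql = "SELECT * FROM instances"
        for step in steps:
            sql = step(sql)
        return sql

    def bake(self, *steps):
        # Key on the lambdas' code objects, so the same construction
        # site reuses the cached compilation on every later call.
        key = tuple(s.__code__ for s in steps)
        if key not in self._cache:
            self._cache[key] = self._compile(steps)
        return self._cache[key]

bakery = Bakery()
add_filter = lambda sql: sql + " WHERE host = :host"
for _ in range(1000):
    sql = bakery.bake(add_filter)
print(sql)                   # SELECT * FROM instances WHERE host = :host
print(bakery.compile_count)  # 1 -- compiled once, reused 999 times
```

The real feature additionally caches fetch strategies and result typing, but the cache-key-from-lambdas trick is the part that lets existing code adopt it with minimal restructuring.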
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
We will face the same issue in the multiple nova-scheduler process case, like Sylvain says, right? Two processes/workers can actually consume two distinct resources on the same HostState.

2015-03-05 13:26 GMT+08:00 Alex Xu sou...@gmail.com: Rui, you still can run multiple nova-scheduler processes now.
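The lost update Rui is asking about can be replayed deterministically with a simplified HostState stand-in (not nova's actual class): both workers check against the same snapshot, both checks pass, and the second write silently overwrites the first claim.

```python
# Deterministic replay of the two-worker race on a shared HostState.
class HostState:
    def __init__(self, free_ram_mb):
        self.free_ram_mb = free_ram_mb

host = HostState(free_ram_mb=3000)

# Both workers snapshot free RAM before either one writes its claim.
snapshot_a = host.free_ram_mb   # worker A sees 3000
snapshot_b = host.free_ram_mb   # worker B also sees 3000

request = 2048
assert snapshot_a >= request    # A's availability check passes
assert snapshot_b >= request    # B's availability check passes too

host.free_ram_mb = snapshot_a - request   # A records its claim
host.free_ram_mb = snapshot_b - request   # B overwrites A's claim

print(host.free_ram_mb)  # 952 -- yet 4096 MB were promised from a 3000 MB host
```

The same interleaving can happen between threads in one process or between separate scheduler processes reading stale host stats; that is why a semaphore alone doesn't solve the multi-process case.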
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
Rui, you still can run multiple nova-scheduler processes now.
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
Hi, I wonder what the planned future of the scheduling is.

The scheduler does a lot of high-field-count queries, which are CPU-expensive when you are using the SQLAlchemy ORM. Has anyone tried to switch those operations to SQLAlchemy Core?

The scheduler does a lot of things in the application, like filtering, that could be done at the DB level more efficiently. Why is it not done on the DB side?

There are use cases where the scheduler would need to know even more data. Is there a plan for keeping `everything` in every scheduler process's memory up to date? (Maybe ZooKeeper?)

The opposite way would be to move most operations to the DB side, since the DB already knows everything. (Stored procedures?)

Best Regards, Attila

- Original Message - From: Rui Chen chenrui.m...@gmail.com To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org Sent: Wednesday, March 4, 2015 4:51:07 AM Subject: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
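For simple numeric resources, Attila's "do it on the DB side" suggestion is easy to sketch. This uses a made-up compute_nodes table with stdlib sqlite3; real nova filters are more involved, as other replies in the thread point out:

```python
# Let the database filter candidate hosts numerically instead of
# pulling every host row into Python and filtering there.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compute_nodes (host TEXT, free_ram_mb INT, vcpus_free INT)"
)
conn.executemany(
    "INSERT INTO compute_nodes VALUES (?, ?, ?)",
    [("node1", 8192, 8), ("node2", 1024, 2), ("node3", 4096, 0)],
)

# RamFilter/CoreFilter-style checks collapse into one WHERE clause;
# only hosts that can fit the request ever reach the application.
rows = conn.execute(
    "SELECT host FROM compute_nodes WHERE free_ram_mb >= ? AND vcpus_free >= ?",
    (2048, 1),
).fetchall()
print([r[0] for r in rows])  # ['node1']
```

Even without moving every filter into SQL, a coarse WHERE clause like this could shrink the dataset the Python-side filters then iterate over.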
Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
On 04/03/2015 04:51, Rui Chen wrote:

As I said when reviewing your patch, I think the problem is not just making sure that the scheduler is thread-safe; it's more about how the scheduler accounts for resources and provides a retry if the consumed resources exceed what's available.

Here, the main problem is that two workers can actually consume two distinct resources on the same HostState object. In that case, the HostState object is decremented by the amount of resources taken (modulo what a resource that is not an integer means...) for both, but nowhere in that section does it check whether it has oversubscribed the resource usage.

As I said, it's not just about decorating a semaphore; it's more about rethinking how the scheduler manages its resources. That's why I'm -1 on your patch until [1] gets merged. Once this BP is implemented, we will have a set of classes for managing heterogeneous types of resources and consuming them, so it would be quite easy to provide a check against them in the consume_from_instance() method.
-Sylvain

[1] http://specs.openstack.org/openstack/nova-specs/specs/kilo/approved/resource-objects.html
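One way to read Sylvain's objection: the availability check and the decrement must form a single atomic step, with overcommit surfacing as a retryable error rather than a silent oversubscription. A hypothetical sketch — not nova's actual consume_from_instance() — which also only protects threads within one process, not multiple scheduler processes:

```python
import threading

class HostState:
    def __init__(self, free_ram_mb):
        self.free_ram_mb = free_ram_mb
        self._lock = threading.Lock()

    def consume(self, ram_mb):
        # Check and decrement under one lock, so no two claims can
        # both pass the check against the same stale snapshot.
        with self._lock:
            if self.free_ram_mb < ram_mb:
                raise ValueError("overcommit: retry on another host")
            self.free_ram_mb -= ram_mb

host = HostState(free_ram_mb=3000)
host.consume(2048)        # first claim fits
try:
    host.consume(2048)    # second claim is rejected, not silently applied
except ValueError as exc:
    print(exc)            # overcommit: retry on another host
print(host.free_ram_mb)   # 952
```

Extending the same guarantee across processes needs a shared claim store (the direction of the resource-objects spec Sylvain links), since an in-process lock cannot serialize two schedulers.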