Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM
On 25 February 2015 at 16:52, Carl Baldwin wrote:

> On Mon, Feb 23, 2015 at 5:07 AM, Salvatore Orlando wrote:
> > Lazy-Stacker summary:
> > I am doing some work on Neutron IPAM code for IP allocation, and I need to find out whether it's better to use db locking queries (SELECT ... FOR UPDATE) or some sort of non-blocking algorithm. Some measures suggest that for this specific problem db-level locking is more efficient even when using multi-master DB clusters, which kind of counters recent findings by other contributors [2]... but also backs those from others [7].
> >
> > The long and boring thing:
> >
> > In the past few months there's been a fair amount of discussion concerning the use of locking queries (ie: SELECT ... FOR UPDATE) in various OpenStack projects, especially in connection with the fact that this kind of query is likely to trigger write-set certification failures in synchronous db replication engines such as MySQL Galera - as pointed out in [1] and thoroughly analysed in [2]. Neutron, in particular, uses this construct often - and the correctness of the whole IPAM logic depends on it. In this regard Neutron is also guilty of not implementing any sort of retry mechanism. Eugene, Rossella, and others have been working towards fixing this issue. Some examples of their work can be found at [3] and [4].
> >
> > However, techniques such as "compare and swap" described in [2] can be used to implement non-blocking algorithms which avoid the write-set certification issue altogether. A neutron example of such a technique can be found in [5]. However, a bug in this process found by Attila [6], and the subsequent gerrit discussion [7], raised further questions on locking vs non-blocking solutions. At this stage, we know that using database-level locks triggers write-set certification failures in synchronous multi-master modes. These failures are reported as deadlock errors. Such errors can be dealt with by retrying the transaction. However, this is claimed to be inefficient in [2], whereas Attila in [7] argues the opposite.
> >
> > As I'm rewriting Neutron's IPAM logic as a part of [8], which heavily relies on this construct, I decided to spend some time putting everything to the test. To this aim I devised 3 algorithms.
> >
> > A-1) The classic one, which locks the availability ranges for a subnet until allocation is complete, augmented with a simple and brutal retry mechanism [9].
>
> So, similar to today's implementation.
>
> > A-2) A wait-free 2-step process, in which the first step creates the IP allocation entry, and the second completes it by adjusting availability ranges [10]. Primary key violations will trigger retries.
>
> My summary: queries RESERVED ip requests and subtracts them from availability. Attempts to generate an IP from the resulting set. Adjusts availability in an idempotent way as a post-step. Does that sound about right?

Correct.

> Seems there are four flavors of this. Did you compare them? Which one did you use to compare with the other two?

There are indeed 4 flavors: random and sequential, and for each of them with or without a preliminary range check to verify whether the IP address has already been removed by a concurrent thread, so that the process can be skipped. Random allocation works terribly with availability ranges. This is mainly because adjusting them with random allocation in the great majority of cases means doing at least 1 update + 1 insert query, whereas sequential allocation usually results in a single update query. The range check before retrying brings improvements in the majority of cases tested in this experimental campaign. However, its effectiveness decreases when the concurrency level is low, so it might not be the best solution.
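For illustration, a "simple and brutal" retry wrapper of the kind A-1 relies on might look roughly like the sketch below. This is not the actual Neutron code: `DBDeadlock`, the retry limit, and the backoff policy are stand-ins (in practice oslo.db's deadlock exception and a tuned policy would be used).

```python
import functools
import random
import time


class DBDeadlock(Exception):
    """Stand-in for the deadlock error a Galera write-set
    certification failure is reported as."""


MAX_RETRIES = 10


def retry_on_deadlock(func):
    """Re-run the whole transaction when it fails certification."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(MAX_RETRIES):
            try:
                return func(*args, **kwargs)
            except DBDeadlock:
                # Brutal: back off briefly and retry from scratch.
                time.sleep(random.uniform(0, 0.001) * attempt)
        raise DBDeadlock("gave up after %d attempts" % MAX_RETRIES)
    return wrapper
```

The decorated function must be safe to re-run from the top, since the failed transaction was rolled back in its entirety.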
> > A-3) A wait-free, lock-free 3-step process [11], which adopts compare-and-swap [2] and the same election process as the bully algorithm [12].
>
> Also, 2 flavors of this one: random and sequential. I'll admit I did not fully grok this implementation. It looks to me like it would be inefficient.

It sounds inefficient because we make very few assumptions about the underlying database. I think its effectiveness can be improved; nevertheless, it is its complexity that worries me.

> > Unfortunately neutron does not create IP records for every address in a subnet's CIDR - this would be fairly bad for IPv6 networks. Therefore we cannot simply apply the simple CAS technique introduced in [2].
>
> What if it created records for only N addresses and did the same sort of thing with those? That would mitigate the issue with IPv6 and large IPv4 subnets. If those N are exhausted then we'd have to run a routine to create more records. Multiple workers could run that routine at the same time but I imagine that they would
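For readers unfamiliar with the compare-and-swap idea from [2]: the essence is that the UPDATE carries the previously observed value in its WHERE clause, so it only succeeds if nobody changed the row in between, and the affected-row count tells you whether you won the race. A minimal sketch against an invented availability-range schema (sqlite3 used purely for illustration; the real Neutron tables and queries differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE avail_ranges "
             "(subnet_id TEXT, first_ip INTEGER, last_ip INTEGER)")
conn.execute("INSERT INTO avail_ranges VALUES ('subnet-a', 2, 254)")


def allocate_first_free(conn, subnet_id):
    """Claim the lowest free address with compare-and-swap: the
    UPDATE only succeeds if the range still starts where we saw it."""
    first, last = conn.execute(
        "SELECT first_ip, last_ip FROM avail_ranges WHERE subnet_id = ?",
        (subnet_id,)).fetchone()
    cur = conn.execute(
        "UPDATE avail_ranges SET first_ip = first_ip + 1 "
        "WHERE subnet_id = ? AND first_ip = ?",
        (subnet_id, first))
    if cur.rowcount == 1:
        return first      # nobody raced us; the swap succeeded
    return None           # lost the race: caller re-reads and retries
```

No row is ever locked, so there is nothing for Galera certification to reject beyond the ordinary first-committer-wins conflict, which surfaces here as a zero row count rather than a deadlock error.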
Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM
On 2/25/15, 10:52 AM, "Carl Baldwin" wrote: > >Wondering if Kilo should just focus on creating the interface which >will allow us to create multiple implementations and swap them out >during the Liberty development cycle. Hopefully, this could include >even something like your option 2 below. > >I think I would go with the locking solution first because it is most >like what we have today. +1. The pluggable framework will make it easy to swap these in and out, to get real-world experience with each. __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM
On 25 February 2015 at 02:35, Robert Collins wrote:

> On 24 February 2015 at 01:07, Salvatore Orlando wrote:
> > Lazy-Stacker summary:
> ...
> > In the medium term, there are a few things we might consider for Neutron's "built-in IPAM".
> > 1) Move the allocation logic out of the driver, thus making IPAM an independent service. The API workers will then communicate with the IPAM service through a message bus, where IP allocation requests will be "naturally serialized"
> > 2) Use 3rd-party software such as dogpile, zookeeper, or even memcached to implement distributed coordination. I have nothing against it, and I reckon Neutron can only benefit from it (in case you're considering arguing that "it does not scale", please also provide solid arguments to support your claim!). Nevertheless, I do believe API request processing should proceed undisturbed as much as possible. If processing an API request requires distributed coordination among several components then it probably means that an asynchronous paradigm is more suitable for that API request.
>
> So data is great. It sounds like as long as we have an appropriate retry decorator in place, write locks are better here, at least for up to 30 threads. But can we trust the data?

Not unless you can prove the process used to obtain them is correct. Otherwise we'd still think the sun rotates around the earth.

> One thing I'm not clear on is the SQL statement count. You say 100 queries for A-1 with a time on Galera of 0.06*1.2=0.072 seconds per allocation? So is that 2 queries over 50 allocations over 20 threads?

The query count reported in the thread is for a single-node test. The numbers for the Galera tests are on github, and if you have a Galera environment you can try and run the experiment there too. The algorithm should indeed perform a single select query for each IP allocation, and the reported number does appear to be far too high. It is coming from sqlalchemy hooks, so I guess it's reliable. It's worth noting that the count includes all queries, including those for setting up the environment and for verifying the algorithm's successful completion, so those should be subtracted. I can easily enable debug logging and provide a detailed breakdown of db operations for every algorithm.

> I'm not clear on what the request parameter in the test json files does, and AFAICT your threads each do one request each. As such I suspect that you may be seeing less concurrency - and thus contention - than real-world setups where APIs are deployed to run worker processes in separate processes and requests are coming in willy-nilly. The size of each algorithm's workload is so small that it's feasible to imagine the thread completing before the GIL bytecode-count check triggers (see https://docs.python.org/2/library/sys.html#sys.setcheckinterval) and the GIL's lack of fairness would exacerbate that.

I have a retry counter which testifies that contention is actually occurring. Indeed, algorithms which do sequential allocation see a lot of contention, so I do not think that I'm just fooling myself and the tests are actually running serially! Anyway, the multiprocessing suggestion is very valid and I will repeat the experiments (I'm afraid that won't happen before Friday), because I did not consider the GIL aspect you mention, as I dumbly expected that Python would simply spawn a different pthread for each thread and let the OS do the scheduling.

> If I may suggest:
> - use multiprocessing or some other worker-pool approach rather than threads
> - or set setcheckinterval down low (e.g. to 20 or something)
> - do multiple units of work (in separate transactions) within each worker, aim for e.g. 10 seconds of work or some such.

This last suggestion also makes sense.

> - log with enough detail that we can report on the actual concurrency achieved. E.g. log the time in us when each transaction starts and finishes, then we can assess how many concurrent requests were actually running.

I put simple output on github only, but full debug logging can be achieved by simply changing a constant. However, I'm collecting the number of retries for each thread as an indirect marker of the concurrency level.

> If the results are still the same - great, full steam ahead. If not, well let's revisit :)

Obviously. We're not religious here. We'll simply do what the data suggest is the best way forward.

> -Rob
>
> --
> Robert Collins
> Distinguished Technologist
> HP Converged Cloud
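Robert's worker-pool and logging suggestions could be sketched roughly as below. This is illustrative only: the pool size, the per-worker workload, and the placeholder transaction are made up, not the actual test harness, and the point is simply that separate processes sidestep the GIL while per-transaction start/finish timestamps let one report the concurrency actually achieved.

```python
import multiprocessing
import time


def worker(n_allocations):
    """Run several units of work in one OS process, recording
    (start, finish) timestamps for each simulated transaction so the
    real overlap between workers can be measured afterwards."""
    timings = []
    for _ in range(n_allocations):
        start = time.time()
        # ... one IP allocation in its own transaction would go here ...
        timings.append((start, time.time()))
    return timings


if __name__ == "__main__":
    # 8 worker processes, 50 allocations each; contention between them
    # is now scheduled by the OS, not by the interpreter's GIL.
    with multiprocessing.Pool(processes=8) as pool:
        results = pool.map(worker, [50] * 8)
    # Flatten the timestamps to assess how many transactions actually
    # overlapped during the run.
    all_timings = [t for per_worker in results for t in per_worker]
    print("transactions recorded:", len(all_timings))
```

With the timestamps collected, concurrency at any instant is just the number of (start, finish) intervals containing that instant.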
Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM
On 25 February 2015 at 13:50, Eugene Nikanorov wrote:

> Thanks for putting this all together, Salvatore.
>
> I just want to comment on this suggestion:
>
> > 1) Move the allocation logic out of the driver, thus making IPAM an independent service. The API workers will then communicate with the IPAM service through a message bus, where IP allocation requests will be "naturally serialized"
>
> Right now port creation is already a distributed process involving several parties. Adding one more actor outside Neutron, communicated with over a message bus just to serialize requests, makes me think of how terrible troubleshooting could be under load, when communication over the mq slows down or is interrupted. Not to mention that such a service would be a SPoF and a contention point. So, this of course could be an option, but personally I'd not like to see it as the default.

Basically here I'm just braindumping. I have no idea whether this could be scalable, reliable or maintainable (see my reply to Clint's post). I wish I could prototype code for this, but I'm terribly slow. The days when I was able to produce thousands of working LOCs per day are long gone. Anyway, it is right that port creation is already a fairly complex workflow. However, IPAM will in any case be a synchronous operation within this workflow: if the IPAM process does not complete, port wiring and securing in the agents cannot occur. So I do not expect it to add significant difficulties to troubleshooting, for which I might add that the issue is not really due to complex communication patterns, but to the fact that Neutron still does not have a decent mechanism to correlate events occurring on the server and in the agents, thus forcing developers and operators to read logs as if they were hieroglyphics.

> Thanks,
> Eugene.
>
> On Wed, Feb 25, 2015 at 4:35 AM, Robert Collins wrote:
>
>> On 24 February 2015 at 01:07, Salvatore Orlando wrote:
>> > Lazy-Stacker summary:
>> ...
>> > In the medium term, there are a few things we might consider for Neutron's "built-in IPAM".
>> > 1) Move the allocation logic out of the driver, thus making IPAM an independent service. The API workers will then communicate with the IPAM service through a message bus, where IP allocation requests will be "naturally serialized"
>> > 2) Use 3rd-party software such as dogpile, zookeeper, or even memcached to implement distributed coordination. I have nothing against it, and I reckon Neutron can only benefit from it (in case you're considering arguing that "it does not scale", please also provide solid arguments to support your claim!). Nevertheless, I do believe API request processing should proceed undisturbed as much as possible. If processing an API request requires distributed coordination among several components then it probably means that an asynchronous paradigm is more suitable for that API request.
>>
>> So data is great. It sounds like as long as we have an appropriate retry decorator in place, write locks are better here, at least for up to 30 threads. But can we trust the data?
>>
>> One thing I'm not clear on is the SQL statement count. You say 100 queries for A-1 with a time on Galera of 0.06*1.2=0.072 seconds per allocation? So is that 2 queries over 50 allocations over 20 threads?
>>
>> I'm not clear on what the request parameter in the test json files does, and AFAICT your threads each do one request each. As such I suspect that you may be seeing less concurrency - and thus contention - than real-world setups where APIs are deployed to run worker processes in separate processes and requests are coming in willy-nilly. The size of each algorithm's workload is so small that it's feasible to imagine the thread completing before the GIL bytecode-count check triggers (see https://docs.python.org/2/library/sys.html#sys.setcheckinterval) and the GIL's lack of fairness would exacerbate that.
>>
>> If I may suggest:
>> - use multiprocessing or some other worker-pool approach rather than threads
>> - or set setcheckinterval down low (e.g. to 20 or something)
>> - do multiple units of work (in separate transactions) within each worker, aim for e.g. 10 seconds of work or some such.
>> - log with enough detail that we can report on the actual concurrency achieved. E.g. log the time in us when each transaction starts and finishes, then we can assess how many concurrent requests were actually running.
>>
>> If the results are still the same - great, full steam ahead. If not, well let's revisit :)
>>
>> -Rob
>>
>> --
>> Robert Collins
>> Distinguished Technologist
>> HP Converged Cloud
Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM
On Mon, Feb 23, 2015 at 5:07 AM, Salvatore Orlando wrote:

> Lazy-Stacker summary:
> I am doing some work on Neutron IPAM code for IP allocation, and I need to find out whether it's better to use db locking queries (SELECT ... FOR UPDATE) or some sort of non-blocking algorithm. Some measures suggest that for this specific problem db-level locking is more efficient even when using multi-master DB clusters, which kind of counters recent findings by other contributors [2]... but also backs those from others [7].
>
> The long and boring thing:
>
> In the past few months there's been a fair amount of discussion concerning the use of locking queries (ie: SELECT ... FOR UPDATE) in various OpenStack projects, especially in connection with the fact that this kind of query is likely to trigger write-set certification failures in synchronous db replication engines such as MySQL Galera - as pointed out in [1] and thoroughly analysed in [2]. Neutron, in particular, uses this construct often - and the correctness of the whole IPAM logic depends on it. In this regard Neutron is also guilty of not implementing any sort of retry mechanism. Eugene, Rossella, and others have been working towards fixing this issue. Some examples of their work can be found at [3] and [4].
>
> However, techniques such as "compare and swap" described in [2] can be used to implement non-blocking algorithms which avoid the write-set certification issue altogether. A neutron example of such a technique can be found in [5]. However, a bug in this process found by Attila [6], and the subsequent gerrit discussion [7], raised further questions on locking vs non-blocking solutions. At this stage, we know that using database-level locks triggers write-set certification failures in synchronous multi-master modes. These failures are reported as deadlock errors. Such errors can be dealt with by retrying the transaction. However, this is claimed to be inefficient in [2], whereas Attila in [7] argues the opposite.
>
> As I'm rewriting Neutron's IPAM logic as a part of [8], which heavily relies on this construct, I decided to spend some time putting everything to the test. To this aim I devised 3 algorithms.
>
> A-1) The classic one, which locks the availability ranges for a subnet until allocation is complete, augmented with a simple and brutal retry mechanism [9].

So, similar to today's implementation.

> A-2) A wait-free 2-step process, in which the first step creates the IP allocation entry, and the second completes it by adjusting availability ranges [10]. Primary key violations will trigger retries.

My summary: queries RESERVED ip requests and subtracts them from availability. Attempts to generate an IP from the resulting set. Adjusts availability in an idempotent way as a post-step. Does that sound about right?

Seems there are four flavors of this. Did you compare them? Which one did you use to compare with the other two?

> A-3) A wait-free, lock-free 3-step process [11], which adopts compare-and-swap [2] and the same election process as the bully algorithm [12].

Also, 2 flavors of this one: random and sequential. I'll admit I did not fully grok this implementation. It looks to me like it would be inefficient.

> Unfortunately neutron does not create IP records for every address in a subnet's CIDR - this would be fairly bad for IPv6 networks. Therefore we cannot simply apply the simple CAS technique introduced in [2].

What if it created records for only N addresses and did the same sort of thing with those? That would mitigate the issue with IPv6 and large IPv4 subnets. If those N are exhausted then we'd have to run a routine to create more records. Multiple workers could run that routine at the same time but I imagine that they would all attempt to insert the same set of new records and only one would win. The others would hit key violations and try to allocate again.

> I then tested them under arbitrary concurrent load from multiple workers. The tests were performed first on a single-node backend, and then on a 3-node mysql Galera cluster, installed and configured using the official documentation [13]. The single-node tests showed, as might have been expected, that the algorithm leveraging db-level locks is way more efficient [14]. While it's pointless discussing the full results, one important aspect here is that db-level locks are a lot more "gentle" with the database, and this results in consistently faster execution times. With 20 concurrent threads, every thread completed in about 0.06 seconds with db-level locks (A-1). With A-2 it took about 0.45 seconds, and with A-3 0.32. With A-2 we saw over 500 queries, a little more than 200 with A-3, and just a bit more than 100 with A-1. It's worth noting that what makes A-3 faster than A-2 is a random allocation strategy which drastically reduces collisions. A-3's performance with sequential
Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM
Thanks Clint. I think you are bringing an interesting and disruptive perspective into this discussion. Disruptive because one thing that has not been considered so far in this thread is that perhaps we don't need to leverage multi-master capabilities for write operations at all. More comments inline.

Salvatore

On 25 February 2015 at 14:40, Clint Byrum wrote:

> Excerpts from Salvatore Orlando's message of 2015-02-23 04:07:38 -0800:
> > Lazy-Stacker summary:
> > I am doing some work on Neutron IPAM code for IP allocation, and I need to find out whether it's better to use db locking queries (SELECT ... FOR UPDATE) or some sort of non-blocking algorithm. Some measures suggest that for this specific problem db-level locking is more efficient even when using multi-master DB clusters, which kind of counters recent findings by other contributors [2]... but also backs those from others [7].
>
> Thanks Salvatore, the story and data you produced is quite interesting.
>
> > With the test on the Galera cluster I was expecting a terrible slowdown in A-1 because of deadlocks caused by certification failures. I was extremely disappointed that the slowdown I measured, however, does not make any of the other algorithms a viable alternative. On the Galera cluster I did not run extensive collections for A-2. Indeed primary key violations seem to trigger db deadlocks because of failed write-set certification too (but I have not yet tested this). I ran tests with 10 threads on each node, for a total of 30 workers. Some results are available at [15]. There was indeed a slowdown in A-1 (about 20%), whereas A-3 performance stayed pretty much constant. Regardless, A-1 was still at least 3 times faster than A-3. As A-3's queries are mostly selects (about 75% of them), use of caches might make it a lot faster; also the algorithm is probably inefficient and can be optimised in several areas. Still, I suspect it can be made faster than A-1. At this stage I am leaning towards adopting db-level locks with retries for Neutron's IPAM. However, since I never trust myself, I wonder if there is something important that I'm neglecting that will hit me down the road.
>
> The thing is, nobody should actually be running blindly with writes being sprayed out to all nodes in a Galera cluster. So A-1 won't slow down _at all_ if you just use Galera as an ACTIVE/PASSIVE write master. It won't scale any worse for writes, since all writes go to all nodes anyway.

If writes are sent to a single node, obviously there will be no contention at all. I think Mike in some other thread mentioned the intention of supporting this scheme as part of the db facade work he's doing. On the other hand, as you say, in this way you'll effectively be using the DB cluster as ACTIVE/PASSIVE w.r.t. writes. If it is proven that there won't be any performance or scalability penalty, then I think I have no concern about adopting this model. Note that I'm not saying this because I don't trust you, but merely because of my ignorance: I understand that in principle what you say is correct, but since Galera replicates through write sets rather than simply by replaying transaction logs, things might not be as obvious, and I'd like to look at some data supporting this claim.

> For reads we can very easily start to identify hot-spot reads that can be sent to all nodes and are tolerant of a few seconds' latency.

A few seconds' latency sounds rather bad, but obviously we need to put everything into the right context. Indeed it seems to me that by "hot-spot" you mean reads which are performed fairly often? In your opinion, what would be the caveats in always distributing reads to all nodes?

> > In the medium term, there are a few things we might consider for Neutron's "built-in IPAM".
> > 1) Move the allocation logic out of the driver, thus making IPAM an independent service. The API workers will then communicate with the IPAM service through a message bus, where IP allocation requests will be "naturally serialized"
>
> This would rely on said message bus guaranteeing ordered delivery. That is going to scale far worse, and be more complicated to maintain, than Galera with a few retries on failover.

I suspect this as well, but on the other hand I believe some components adopted by several OpenStack projects adopt exactly this pattern. I have however no data point to make any sort of claim regarding their scalability. From an architectural perspective you can make things better by scaling out the RPC server with multiple workers, but then you'd be back at the starting point! When I propose stuff on the mailing list I typically do full brain dumps, even the terrible ideas, which unfortunately occur to me very frequently.

> > 2) Use 3rd-party software such as dogpile, zookeeper, or even memcached to implement distributed coordina
Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM
Excerpts from Salvatore Orlando's message of 2015-02-23 04:07:38 -0800:

> Lazy-Stacker summary:
> I am doing some work on Neutron IPAM code for IP allocation, and I need to find out whether it's better to use db locking queries (SELECT ... FOR UPDATE) or some sort of non-blocking algorithm. Some measures suggest that for this specific problem db-level locking is more efficient even when using multi-master DB clusters, which kind of counters recent findings by other contributors [2]... but also backs those from others [7].

Thanks Salvatore, the story and data you produced is quite interesting.

> With the test on the Galera cluster I was expecting a terrible slowdown in A-1 because of deadlocks caused by certification failures. I was extremely disappointed that the slowdown I measured, however, does not make any of the other algorithms a viable alternative. On the Galera cluster I did not run extensive collections for A-2. Indeed primary key violations seem to trigger db deadlocks because of failed write-set certification too (but I have not yet tested this). I ran tests with 10 threads on each node, for a total of 30 workers. Some results are available at [15]. There was indeed a slowdown in A-1 (about 20%), whereas A-3 performance stayed pretty much constant. Regardless, A-1 was still at least 3 times faster than A-3. As A-3's queries are mostly selects (about 75% of them), use of caches might make it a lot faster; also the algorithm is probably inefficient and can be optimised in several areas. Still, I suspect it can be made faster than A-1. At this stage I am leaning towards adopting db-level locks with retries for Neutron's IPAM. However, since I never trust myself, I wonder if there is something important that I'm neglecting that will hit me down the road.

The thing is, nobody should actually be running blindly with writes being sprayed out to all nodes in a Galera cluster. So A-1 won't slow down _at all_ if you just use Galera as an ACTIVE/PASSIVE write master. It won't scale any worse for writes, since all writes go to all nodes anyway. For reads we can very easily start to identify hot-spot reads that can be sent to all nodes and are tolerant of a few seconds' latency.

> In the medium term, there are a few things we might consider for Neutron's "built-in IPAM".
> 1) Move the allocation logic out of the driver, thus making IPAM an independent service. The API workers will then communicate with the IPAM service through a message bus, where IP allocation requests will be "naturally serialized"

This would rely on said message bus guaranteeing ordered delivery. That is going to scale far worse, and be more complicated to maintain, than Galera with a few retries on failover.

> 2) Use 3rd-party software such as dogpile, zookeeper, or even memcached to implement distributed coordination. I have nothing against it, and I reckon Neutron can only benefit from it (in case you're considering arguing that "it does not scale", please also provide solid arguments to support your claim!). Nevertheless, I do believe API request processing should proceed undisturbed as much as possible. If processing an API request requires distributed coordination among several components then it probably means that an asynchronous paradigm is more suitable for that API request.

If we all decide that having a load balancer sending all writes and reads to one Galera node is not acceptable for some reason, then we should consider a distributed locking method that might scale better, like ZK/etcd or the like. But I think just figuring out why we want to send all writes and reads to all nodes is a better short/medium-term goal.
Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM
Thanks for putting this all together, Salvatore. I just want to comment on this suggestion: > 1) Move the allocation logic out of the driver, thus making IPAM an independent service. The API workers will then communicate with the IPAM service through a message bus, where IP allocation requests will be "naturally serialized" Right now port creation is already a distributed process involving several parties. Adding one more actor outside Neutron which can be communicated with message bus just to serialize requests makes me think of how terrible troubleshooting could be in case of applied load, when communication over mq slows down or interrupts. Not to say such service would be SPoF and a contention point. So, this of course could be an option, but personally I'd not like to see it as a default. Thanks, Eugene. On Wed, Feb 25, 2015 at 4:35 AM, Robert Collins wrote: > On 24 February 2015 at 01:07, Salvatore Orlando > wrote: > > Lazy-Stacker summary: > ... > > In the medium term, there are a few things we might consider for > Neutron's > > "built-in IPAM". > > 1) Move the allocation logic out of the driver, thus making IPAM an > > independent service. The API workers will then communicate with the IPAM > > service through a message bus, where IP allocation requests will be > > "naturally serialized" > > 2) Use 3-party software as dogpile, zookeeper but even memcached to > > implement distributed coordination. I have nothing against it, and I > reckon > > Neutron can only benefit for it (in case you're considering of arguing > that > > "it does not scale", please also provide solid arguments to support your > > claim!). Nevertheless, I do believe API request processing should proceed > > undisturbed as much as possible. If processing an API requests requires > > distributed coordination among several components then it probably means > > that an asynchronous paradigm is more suitable for that API request. > > So data is great. 
It sounds like as long as we have an appropriate > retry decorator in place, write locks are better here, at least > for up to 30 threads. But can we trust the data? > > One thing I'm not clear on is the SQL statement count. You say 100 > queries for A-1 with a time on Galera of 0.06*1.2=0.072 seconds per > allocation ? So is that 2 queries over 50 allocations over 20 threads? > > I'm not clear on what the request parameter in the test json files > does, and AFAICT your threads each do one request each. As such I > suspect that you may be seeing less concurrency - and thus contention > - than real-world setups where APIs are deployed to run worker > processes in separate processes and requests are coming in > willy-nilly. The size of each algorithm's workload is so small that it's > feasible to imagine a thread completing before the GIL bytecode > check interval triggers (see > https://docs.python.org/2/library/sys.html#sys.setcheckinterval), and > the GIL's lack of fairness would exacerbate that. > > If I may suggest: > - use multiprocessing or some other worker-pool approach rather than > threads > - or set setcheckinterval down low (e.g. to 20 or something) > - do multiple units of work (in separate transactions) within each > worker, aim for e.g. 10 seconds of work or some such. > - log with enough detail that we can report on the actual concurrency > achieved. E.g. log the time in us when each transaction starts and > finishes, then we can assess how many concurrent requests were > actually running. > > If the results are still the same - great, full steam ahead. 
If not, > well, let's revisit :) > > -Rob > > > -- > Robert Collins > Distinguished Technologist > HP Converged Cloud > > __ > OpenStack Development Mailing List (not for usage questions) > Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM
On 24 February 2015 at 01:07, Salvatore Orlando wrote: > Lazy-Stacker summary: ... > In the medium term, there are a few things we might consider for Neutron's > "built-in IPAM". > 1) Move the allocation logic out of the driver, thus making IPAM an > independent service. The API workers will then communicate with the IPAM > service through a message bus, where IP allocation requests will be > "naturally serialized" > 2) Use 3-party software as dogpile, zookeeper but even memcached to > implement distributed coordination. I have nothing against it, and I reckon > Neutron can only benefit for it (in case you're considering of arguing that > "it does not scale", please also provide solid arguments to support your > claim!). Nevertheless, I do believe API request processing should proceed > undisturbed as much as possible. If processing an API requests requires > distributed coordination among several components then it probably means > that an asynchronous paradigm is more suitable for that API request. So data is great. It sounds like as long as we have an appropriate retry decorator in place, write locks are better here, at least for up to 30 threads. But can we trust the data? One thing I'm not clear on is the SQL statement count. You say 100 queries for A-1 with a time on Galera of 0.06*1.2=0.072 seconds per allocation ? So is that 2 queries over 50 allocations over 20 threads? I'm not clear on what the request parameter in the test json files does, and AFAICT your threads each do one request each. As such I suspect that you may be seeing less concurrency - and thus contention - than real-world setups where APIs are deployed to run worker processes in separate processes and requests are coming in willy-nilly. 
The size of each algorithm's workload is so small that it's feasible to imagine a thread completing before the GIL bytecode check interval triggers (see https://docs.python.org/2/library/sys.html#sys.setcheckinterval), and the GIL's lack of fairness would exacerbate that. If I may suggest: - use multiprocessing or some other worker-pool approach rather than threads - or set setcheckinterval down low (e.g. to 20 or something) - do multiple units of work (in separate transactions) within each worker, aim for e.g. 10 seconds of work or some such. - log with enough detail that we can report on the actual concurrency achieved. E.g. log the time in us when each transaction starts and finishes, then we can assess how many concurrent requests were actually running. If the results are still the same - great, full steam ahead. If not, well, let's revisit :) -Rob -- Robert Collins Distinguished Technologist HP Converged Cloud
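[Editor's note] Robert's checklist could be sketched as a harness along these lines. This is a purely hypothetical illustration, not the actual benchmark code from the thread: the DB transaction is simulated with a short sleep, and the names (worker, max_concurrency, UNITS_PER_WORKER) are made up for the example.

```python
# Sketch of the suggested methodology: a process pool instead of threads
# (sidestepping the GIL), multiple units of work per worker, and per-unit
# start/finish timestamps in microseconds so the achieved concurrency can
# be reconstructed afterwards.
import time
from multiprocessing import Pool

UNITS_PER_WORKER = 5  # several transactions per worker, not just one


def worker(worker_id):
    """Run several units of work, recording (start_us, end_us) per unit."""
    spans = []
    for _ in range(UNITS_PER_WORKER):
        start = time.time()
        time.sleep(0.01)  # stand-in for one IP-allocation transaction
        end = time.time()
        spans.append((int(start * 1e6), int(end * 1e6)))
    return spans


def max_concurrency(spans):
    """Sweep start/end events to find the peak number of overlapping spans."""
    events = [(s, 1) for s, _ in spans] + [(e, -1) for _, e in spans]
    level = peak = 0
    for _, delta in sorted(events):
        level += delta
        peak = max(peak, level)
    return peak


if __name__ == '__main__':
    with Pool(processes=4) as pool:
        all_spans = [s for spans in pool.map(worker, range(4)) for s in spans]
    print("transactions:", len(all_spans))
    print("peak concurrency:", max_concurrency(all_spans))
```

The sweep over logged timestamps is the part that answers Robert's last point: it reports how many transactions were actually in flight at once, rather than how many workers were nominally started.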
[openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM
Lazy-Stacker summary: I am doing some work on Neutron IPAM code for IP allocation, and I needed to find out whether it's better to use db locking queries (SELECT ... FOR UPDATE) or some sort of non-blocking algorithm. Some measures suggest that for this specific problem db-level locking is more efficient even when using multi-master DB clusters, which kind of counters recent findings by other contributors [2]... but also backs those from others [7]. The long and boring thing: In the past few months there's been a fair amount of discussion concerning the use of locking queries (i.e. SELECT ... FOR UPDATE) in various OpenStack projects, especially in connection with the fact that this kind of query is likely to trigger write-set certification failures in synchronous db replication engines such as MySQL Galera - as pointed out in [1] and thoroughly analysed in [2]. Neutron, in particular, uses this construct often - and the correctness of the whole IPAM logic depends on it. In this regard Neutron is also guilty of not implementing any sort of retry mechanism. Eugene, Rossella, and others have been working towards fixing this issue. Some examples of their work can be found at [3] and [4]. However, techniques such as "compare and swap" described in [2] can be used to implement non-blocking algorithms which avoid the write-set certification issue altogether. A Neutron example of such a technique can be found in [5]. A bug in this process found by Attila [6], and the subsequent gerrit discussion [7], raised further questions on locking vs non-blocking solutions. At this stage, we know that using database-level locks triggers write-set certification failures in synchronous multi-master mode. These failures are reported as deadlock errors. Such errors can be dealt with by retrying the transaction. However, this is claimed to be inefficient in [2], whereas Attila in [7] argues the opposite. 
As I'm rewriting Neutron's IPAM logic as a part of [8], which heavily relies on this construct, I decided to spend some time putting everything to the test. To this aim I devised 3 algorithms. A-1) The classic one, which locks the availability ranges for a subnet until allocation is complete, augmented with a simple and brutal retry mechanism [9]. A-2) A wait-free 2-step process, in which the first step creates the IP allocation entry, and the second completes it by adjusting availability ranges [10]. Primary key violations will trigger retries. A-3) A wait-free, lock-free 3-step process [11], which adopts compare-and-swap [2] and the same election process as the bully algorithm [12]. Unfortunately Neutron does not create IP records for every address in a subnet's CIDR - this would be fairly bad for IPv6 networks - so we cannot simply apply the simple CAS technique introduced in [2]. I then tested them under arbitrary concurrent load from multiple workers. The tests were performed first on a single-node backend, and then on a 3-node MySQL Galera cluster, installed and configured using the official documentation [13]. The single-node tests showed, as might have been expected, that the algorithm leveraging db-level locks is way more efficient [14]. While it's pointless to discuss the full results here, one important aspect is that db-level locks are a lot more "gentle" with the database, and this results in consistently faster execution times. With 20 concurrent threads, every thread completed in about 0.06 seconds with db-level locks (A-1). With A-2 it took about 0.45 seconds, and with A-3 0.32. With A-2 we saw over 500 queries, a little more than 200 with A-3, and just a bit more than 100 with A-1. It's worth noting that what makes A-3 faster than A-2 is a random allocation strategy which drastically reduces collisions. A-3's performance with sequential allocations is much worse (the avg. 
thread runtime with 20 concurrent threads is about 0.75 seconds). With the tests on the Galera cluster I was expecting a terrible slowdown in A-1 because of deadlocks caused by certification failures. I was extremely disappointed that the slowdown I measured does not, however, make any of the other algorithms a viable alternative. On the Galera cluster I did not run extensive collections for A-2. Indeed, primary key violations seem to trigger db deadlocks because of failed write-set certification too (but I have not yet tested this). I ran tests with 10 threads on each node, for a total of 30 workers. Some results are available at [15]. There was indeed a slowdown in A-1 (about 20%), whereas A-3's performance stayed pretty much constant. Regardless, A-1 was still at least 3 times faster than A-3. As A-3's queries are mostly SELECTs (about 75% of them), the use of caches might make it a lot faster; also, the algorithm is probably inefficient and can be optimised in several areas. Still, I suspect it can be made faster than A-1. At this stage I am leaning towards adopting db-level locks with retries for Neutron's IPAM. However, since I never trus