Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM

2015-02-25 Thread Eugene Nikanorov
Thanks for putting this all together, Salvatore.

I just want to comment on this suggestion:
 1) Move the allocation logic out of the driver, thus making IPAM an
independent service. The API workers will then communicate with the IPAM
service through a message bus, where IP allocation requests will be
naturally serialized

Right now port creation is already a distributed process involving several
parties.
Adding one more actor outside Neutron, communicated with over a message bus
just to serialize requests, makes me think of how terrible troubleshooting
could become under load, when communication over the MQ slows down or is
interrupted.
Not to mention that such a service would be a SPoF and a contention point.
So, this of course could be an option, but personally I'd not like to see
it as a default.

Thanks,
Eugene.

On Wed, Feb 25, 2015 at 4:35 AM, Robert Collins robe...@robertcollins.net
wrote:

 On 24 February 2015 at 01:07, Salvatore Orlando sorla...@nicira.com
 wrote:
  Lazy-Stacker summary:
 ...
  In the medium term, there are a few things we might consider for
 Neutron's
  built-in IPAM.
  1) Move the allocation logic out of the driver, thus making IPAM an
  independent service. The API workers will then communicate with the IPAM
  service through a message bus, where IP allocation requests will be
  naturally serialized
  2) Use 3rd-party software such as dogpile, zookeeper, or even memcached to
  implement distributed coordination. I have nothing against it, and I reckon
  Neutron can only benefit from it (in case you're considering arguing that
  it does not scale, please also provide solid arguments to support your
  claim!). Nevertheless, I do believe API request processing should proceed
  undisturbed as much as possible. If processing an API request requires
  distributed coordination among several components, then it probably means
  that an asynchronous paradigm is more suitable for that API request.

 So data is great. It sounds like as long as we have an appropriate
 retry decorator in place, that write locks are better here, at least
 for up to 30 threads. But can we trust the data?

 One thing I'm not clear on is the SQL statement count.  You say 100
 queries for A-1 with a time on Galera of 0.06*1.2=0.072 seconds per
 allocation ? So is that 2 queries over 50 allocations over 20 threads?

 I'm not clear on what the request parameter in the test json files
 does, and AFAICT your threads each do one request each. As such I
 suspect that you may be seeing less concurrency - and thus contention
 - than real-world setups where APIs are deployed to run worker
 processes in separate processes and requests are coming in
 willy-nilly. The size of each algorithm's workload is so small that it's
 feasible to imagine the thread completing before the GIL bytecode-count
 check triggers (see
 https://docs.python.org/2/library/sys.html#sys.setcheckinterval) and
 the GIL's lack of fairness would exacerbate that.

 If I may suggest:
  - use multiprocessing or some other worker-pool approach rather than
 threads
  - or set setcheckinterval down low (e.g. to 20 or something)
  - do multiple units of work (in separate transactions) within each
 worker, aim for e.g. 10 seconds of work or some such.
  - log with enough detail that we can report on the actual concurrency
 achieved. E.g. log the time in us when each transaction starts and
 finishes, then we can assess how many concurrent requests were
 actually running.

 If the results are still the same - great, full steam ahead. If not,
 well let's revisit :)

 -Rob


 --
 Robert Collins rbtcoll...@hp.com
 Distinguished Technologist
 HP Converged Cloud




Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM

2015-02-25 Thread Clint Byrum
Excerpts from Salvatore Orlando's message of 2015-02-23 04:07:38 -0800:
 Lazy-Stacker summary:
 I am doing some work on Neutron IPAM code for IP Allocation, and I need to
 find out whether it's better to use db locking queries (SELECT ... FOR UPDATE)
 or some sort of non-blocking algorithm.
 Some measures suggest that for this specific problem db-level locking is
 more efficient even when using multi-master DB clusters, which kind of
 counters recent findings by other contributors [2]... but also backs those
 from others [7].
 

Thanks Salvatore, the story and data you produced is quite interesting.

 
 With the test on the Galera cluster I was expecting a terrible slowdown in
 A-1 because of deadlocks caused by certification failures. I was extremely
 disappointed that the slowdown I measured however does not make any of the
 other algorithms a viable alternative.
 On the Galera cluster I did not run extensive collections for A-2. Indeed
 primary key violations seem to trigger db deadlocks because of failed write
 set certification too (but I have not yet tested this).
 I ran tests with 10 threads on each node, for a total of 30 workers. Some
 results are available at [15]. There was indeed a slow down in A-1 (about
 20%), whereas A-3 performance stayed pretty much constant. Regardless, A-1
 was still at least 3 times faster than A-3.
 As A-3's queries are mostly select (about 75% of them) use of caches might
 make it a lot faster; also the algorithm is probably inefficient and can be
 optimised in several areas. Still, I suspect it can be made faster than
 A-1. At this stage I am leaning towards adopting db-level locks with
 retries for Neutron's IPAM. However, since I never trust myself, I wonder
 if there is something important that I'm neglecting and will hit me down
 the road.
 

The thing is, nobody should actually be running blindly with writes
being sprayed out to all nodes in a Galera cluster. So A-1 won't slow
down _at all_ if you just use Galera as an ACTIVE/PASSIVE write master.
It won't scale any worse for writes, since all writes go to all nodes
anyway. For reads we can very easily start to identify hot-spot reads
that can be sent to all nodes and are tolerant of a few seconds latency.
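
Something along these lines would do it (just a sketch; the engine URLs and
the routing rule are purely illustrative, and a real facade would also need
to pin locking reads like SELECT ... FOR UPDATE to the writer):

    import random

    from sqlalchemy import create_engine
    from sqlalchemy.orm import Session, sessionmaker

    # One Galera node takes every write; the others only serve reads that
    # can tolerate a bit of replication latency.
    writer = create_engine("mysql+pymysql://user:pass@galera-node1/neutron")
    readers = [
        create_engine("mysql+pymysql://user:pass@galera-node2/neutron"),
        create_engine("mysql+pymysql://user:pass@galera-node3/neutron"),
    ]

    class RoutingSession(Session):
        def get_bind(self, mapper=None, clause=None, **kwargs):
            # Flushes (i.e. writes) always go to the single writer node;
            # everything else is spread across the remaining nodes.
            if self._flushing:
                return writer
            return random.choice(readers)

    SessionFactory = sessionmaker(class_=RoutingSession)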

 In the medium term, there are a few things we might consider for Neutron's
 built-in IPAM.
 1) Move the allocation logic out of the driver, thus making IPAM an
 independent service. The API workers will then communicate with the IPAM
 service through a message bus, where IP allocation requests will be
 naturally serialized

This would rely on said message bus guaranteeing ordered delivery. That
is going to scale far worse, and be more complicated to maintain, than
Galera with a few retries on failover.

 2) Use 3rd-party software such as dogpile, zookeeper, or even memcached to
 implement distributed coordination. I have nothing against it, and I reckon
 Neutron can only benefit from it (in case you're considering arguing that
 it does not scale, please also provide solid arguments to support your
 claim!). Nevertheless, I do believe API request processing should proceed
 undisturbed as much as possible. If processing an API request requires
 distributed coordination among several components, then it probably means
 that an asynchronous paradigm is more suitable for that API request.
 

If we all decide that having a load balancer sending all writes and
reads to one Galera node is not acceptable for some reason, then we
should consider a distributed locking method that might scale better,
like ZK/etcd or the like. But I think just figuring out why we want to
send all writes and reads to all nodes is a better short/medium term
goal.



Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM

2015-02-25 Thread Salvatore Orlando
Thanks Clint.

I think you are bringing an interesting and disruptive perspective into
this discussion.
Disruptive because one thing that has not been considered so far in this
thread is that perhaps we don't need to leverage multi-master
capabilities for write operations at all.

More comments inline,
Salvatore

On 25 February 2015 at 14:40, Clint Byrum cl...@fewbar.com wrote:

 Excerpts from Salvatore Orlando's message of 2015-02-23 04:07:38 -0800:
  Lazy-Stacker summary:
  I am doing some work on Neutron IPAM code for IP Allocation, and I need to
  find out whether it's better to use db locking queries (SELECT ... FOR
 UPDATE)
  or some sort of non-blocking algorithm.
  Some measures suggest that for this specific problem db-level locking is
  more efficient even when using multi-master DB clusters, which kind of
  counters recent findings by other contributors [2]... but also backs
 those
  from others [7].
 

 Thanks Salvatore, the story and data you produced is quite interesting.

 
  With the test on the Galera cluster I was expecting a terrible slowdown
 in
  A-1 because of deadlocks caused by certification failures. I was
 extremely
  disappointed that the slowdown I measured however does not make any of
 the
  other algorithms a viable alternative.
  On the Galera cluster I did not run extensive collections for A-2. Indeed
  primary key violations seem to trigger db deadlocks because of failed write
  set certification too (but I have not yet tested this).
  I ran tests with 10 threads on each node, for a total of 30 workers. Some
  results are available at [15]. There was indeed a slow down in A-1 (about
  20%), whereas A-3 performance stayed pretty much constant. Regardless,
 A-1
  was still at least 3 times faster than A-3.
  As A-3's queries are mostly select (about 75% of them) use of caches
 might
  make it a lot faster; also the algorithm is probably inefficient and can
 be
  optimised in several areas. Still, I suspect it can be made faster than
  A-1. At this stage I am leaning towards adopting db-level locks with
  retries for Neutron's IPAM. However, since I never trust myself, I wonder
  if there is something important that I'm neglecting and will hit me down
  the road.
 

 The thing is, nobody should actually be running blindly with writes
 being sprayed out to all nodes in a Galera cluster. So A-1 won't slow
 down _at all_ if you just use Galera as an ACTIVE/PASSIVE write master.
 It won't scale any worse for writes, since all writes go to all nodes
 anyway.


If writes are sent to a single node, obviously there will be no contention
at all.
I think Mike in some other thread mentioned the intention of supporting
this scheme as part of the db facade work he's doing.
On the other hand, as you say in this way you'll be effectively using the
DB cluster as ACTIVE/PASSIVE w.r.t writes.
If the fact that there won't be any performance or scalability penalty is
proven, then I think I have no concern about adopting this model. Note that
I'm not saying this because I don't trust you, but merely because of my
ignorance - I understand that in principle what you say is correct, but
since Galera replicates through write sets and not simply by replaying
transaction logs, things might not be as obvious, and perhaps I'd look at
some data supporting this claim.


 For reads we can very easily start to identify hot-spot reads
 that can be sent to all nodes and are tolerant of a few seconds latency.


A few seconds of latency sounds rather bad, but obviously we need to put
everything into the right context.
Indeed it seems to me that by hot-spot you mean reads which are performed
fairly often?
In your opinion, what would be the caveats in always distributing reads to
all nodes?



  In the medium term, there are a few things we might consider for
 Neutron's
  built-in IPAM.
  1) Move the allocation logic out of the driver, thus making IPAM an
  independent service. The API workers will then communicate with the IPAM
  service through a message bus, where IP allocation requests will be
  naturally serialized

 This would rely on said message bus guaranteeing ordered delivery. That
 is going to scale far worse, and be more complicated to maintain, than
 Galera with a few retries on failover.


I suspect this as well, but on the other hand I believe some components
adopted by several OpenStack projects use exactly this pattern. I have,
however, no data points to make any sort of claim regarding their scalability.
From an architectural perspective you can make things better by scaling out
the RPC server with multiple workers, but then you'd be again at the
starting point!
When I propose stuff on the mailing list I typically do full brain dumps,
even the terrible ideas, which unfortunately occur to me very
frequently.



  2) Use 3rd-party software such as dogpile, zookeeper, or even memcached to
  implement distributed coordination. I have nothing against it, and I reckon
  Neutron can only benefit from it (in case 

Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM

2015-02-25 Thread Salvatore Orlando
On 25 February 2015 at 13:50, Eugene Nikanorov enikano...@mirantis.com
wrote:

 Thanks for putting this all together, Salvatore.

 I just want to comment on this suggestion:
  1) Move the allocation logic out of the driver, thus making IPAM an
 independent service. The API workers will then communicate with the IPAM
 service through a message bus, where IP allocation requests will be
 naturally serialized

 Right now port creation is already a distributed process involving several
 parties.
  Adding one more actor outside Neutron, communicated with over a message
  bus just to serialize requests, makes me think of how terrible
  troubleshooting could become under load, when communication over the MQ
  slows down or is interrupted.
  Not to mention that such a service would be a SPoF and a contention point.
 So, this of course could be an option, but personally I'd not like to see
 it as a default.


Basically here I'm just braindumping. I have no idea whether this could
be scalable, reliable or maintainable (see reply to Clint's post). I wish I
could prototype code for this, but I'm terribly slow. The days when I was
able to produce thousands of working LOCs per day are long gone.

Anyway it is right that port creation is already a fairly complex workflow.
However, IPAM will in any case be a synchronous operation within this workflow.
Indeed, if the IPAM process does not complete, port wiring and securing in
the agents cannot occur. So I do not expect it to add significant
difficulties in troubleshooting, for which I might add that the issue is
not really due to complex communication patterns, but to the fact that
Neutron still does not have a decent mechanism to correlate events
occurring on the server and in the agents, thus forcing developers and
operators to read logs as if they were hieroglyphics.




 Thanks,
 Eugene.

 On Wed, Feb 25, 2015 at 4:35 AM, Robert Collins robe...@robertcollins.net
  wrote:

 On 24 February 2015 at 01:07, Salvatore Orlando sorla...@nicira.com
 wrote:
  Lazy-Stacker summary:
 ...
  In the medium term, there are a few things we might consider for
 Neutron's
  built-in IPAM.
  1) Move the allocation logic out of the driver, thus making IPAM an
  independent service. The API workers will then communicate with the IPAM
  service through a message bus, where IP allocation requests will be
  naturally serialized
  2) Use 3rd-party software such as dogpile, zookeeper, or even memcached to
  implement distributed coordination. I have nothing against it, and I reckon
  Neutron can only benefit from it (in case you're considering arguing that
  it does not scale, please also provide solid arguments to support your
  claim!). Nevertheless, I do believe API request processing should proceed
  undisturbed as much as possible. If processing an API request requires
  distributed coordination among several components then it probably means
  that an asynchronous paradigm is more suitable for that API request.

 So data is great. It sounds like as long as we have an appropriate
 retry decorator in place, that write locks are better here, at least
 for up to 30 threads. But can we trust the data?

 One thing I'm not clear on is the SQL statement count.  You say 100
 queries for A-1 with a time on Galera of 0.06*1.2=0.072 seconds per
 allocation ? So is that 2 queries over 50 allocations over 20 threads?

 I'm not clear on what the request parameter in the test json files
 does, and AFAICT your threads each do one request each. As such I
 suspect that you may be seeing less concurrency - and thus contention
 - than real-world setups where APIs are deployed to run worker
 processes in separate processes and requests are coming in
  willy-nilly. The size of each algorithm's workload is so small that it's
  feasible to imagine the thread completing before the GIL bytecode-count
  check triggers (see
 https://docs.python.org/2/library/sys.html#sys.setcheckinterval) and
 the GIL's lack of fairness would exacerbate that.

 If I may suggest:
  - use multiprocessing or some other worker-pool approach rather than
 threads
  - or set setcheckinterval down low (e.g. to 20 or something)
  - do multiple units of work (in separate transactions) within each
  worker, aim for e.g. 10 seconds of work or some such.
  - log with enough detail that we can report on the actual concurrency
 achieved. E.g. log the time in us when each transaction starts and
 finishes, then we can assess how many concurrent requests were
 actually running.

 If the results are still the same - great, full steam ahead. If not,
  well let's revisit :)

 -Rob


 --
 Robert Collins rbtcoll...@hp.com
 Distinguished Technologist
 HP Converged Cloud




 

Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM

2015-02-25 Thread Carl Baldwin
On Mon, Feb 23, 2015 at 5:07 AM, Salvatore Orlando sorla...@nicira.com wrote:
 Lazy-Stacker summary:
 I am doing some work on Neutron IPAM code for IP Allocation, and I need to
 find out whether it's better to use db locking queries (SELECT ... FOR UPDATE)
 or some sort of non-blocking algorithm.
 Some measures suggest that for this specific problem db-level locking is
 more efficient even when using multi-master DB clusters, which kind of
 counters recent findings by other contributors [2]... but also backs those
 from others [7].

 The long and boring thing:

 In the past few months there's been a fair amount of discussion concerning
 the use of locking queries (ie: SELECT ... FOR UPDATE) in various OpenStack
 projects, especially in connection with the fact that this kind of queries
 is likely to trigger write set certification failures in synchronous db
 replication engines such as MySQL Galera - as pointed out in [1] and
 thoroughly analysed in [2].
 Neutron, in particular, uses this construct often - and the correctness of
 the whole IPAM logic depends on it. In this regard Neutron is also guilty of
 not implementing any sort of retry mechanism. Eugene, Rossella, and others
 have been working towards fixing this issue. Some examples of their work can
 be found at [3] and [4].

 However, techniques such as compare and swap described in [2] can be used
 for implementing non-blocking algorithms which avoid the write-set
 certification issue altogether.
 A neutron example of such technique can be found in [5]. However, a bug in
 this process found by Attila [6], and the subsequent gerrit discussion [7],
 raised further questions on locking vs non-blocking solutions.
 At this stage, we know that using database-level locks triggers write-set
 certification failures in synchronous multi-master modes. These failures are
 reported as deadlock errors. Such errors can be dealt with by retrying the
 transaction. However, this is claimed to be inefficient in [2], whereas
 Attila in [7] argues the opposite.

 As I'm rewriting Neutron's IPAM logic as a part of [8], which heavily relies
 on this construct, I decided to spend some time to put everything to the
 test.
 To this aim I devised 3 algorithms.

 A-1) The classic one which locks the availability ranges for a subnet until
 allocation is complete, augmented with a simple and brutal retry mechanism
 [9].

So, similar to today's implementation.

 A-2) A wait-free 2 step process, in which the first one creates the IP
 allocation entry, and the second completes it by adjusting availability
 ranges [10]. Primary key violations will trigger retries.

My summary:  queries RESERVED ip requests and subtracts them from
availability.  Attempts to generate an IP from the resulting set.
Adjust availability in an idempotent way as a post-step.  Does that
sound about right?

Seems there are four flavors of this.  Did you compare them?  Which
one did you use to compare with the other two?

 A-3) A wait-free, lock-free 3 step process [11], which adopts
 compare-and-swap [2] and the same election process as the bully algorithm
 [12].

Also, 2 flavors of this one:  random and sequential.  I'll admit I did
not fully grok this implementation.  It looks to me like it would be
inefficient.

 Unfortunately neutron does not create IP records for every address in a
 subnet's CIDR - this would be fairly bad for IPv6 networks. Therefore we
 cannot simply apply the simple CAS technique introduced in [2].

What if it created records for only N addresses and did the same sort
of thing with those?  That would mitigate the issue with IPv6 and
large IPv4 subnets.  If those N are exhausted then we'd have to run a
routine to create more records.  Multiple workers could run that
routine at the same time but I imagine that they would all attempt to
insert the same set of new records and only one would win.  The others
would hit key violations and try to allocate again.
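
Roughly like this (a sketch only; table and helper names are made up, not
Neutron's actual schema):

    from sqlalchemy import text
    from sqlalchemy.exc import IntegrityError

    WINDOW = 50  # how many candidate address rows to materialize at a time

    def extend_pool(engine, subnet_id, candidate_ips):
        """Try to pre-create a window of candidate rows; duplicates are fine."""
        created = 0
        for ip in candidate_ips[:WINDOW]:
            try:
                with engine.begin() as conn:
                    conn.execute(
                        text("INSERT INTO ipavailability (subnet_id, ip_address)"
                             " VALUES (:sid, :ip)"),
                        {"sid": subnet_id, "ip": ip},
                    )
                created += 1
            except IntegrityError:
                # Another worker inserted this row first - not an error.
                # (On Galera the loser may also see this as a deadlock error.)
                pass
        return created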

 I then tested them under arbitrary concurrent load from multiple workers.
 The tests were performed first on a single node backend, and then a 3-node
 mysql Galera cluster, installed and configured using the official
 documentation [13].
 The single-node tests showed, as it might have been expected, that the
 algorithm leveraging db-level locks is way more efficient [14]
 While it's pointless discussing the full results, one important aspect here
 is that db-level locks are a lot more gentle with the database, and
 this results in consistently faster execution times.
 With 20 concurrent threads, every thread completed in about 0.06 seconds
 with db-level locks (A-1). With A-2 it took about 0.45 seconds, and with A-3
 0.32. With A-2 we saw over 500 queries, a little more than 200 with A-3, and
 just a bit more than 100 with A-1.
 It's worth noting that what makes A-3 faster than A-2 is a random allocation
 strategy which drastically reduces collisions. A-3's performance with
 sequential allocations is much worse (the avg. thread runtime 

Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM

2015-02-25 Thread Salvatore Orlando
On 25 February 2015 at 16:52, Carl Baldwin c...@ecbaldwin.net wrote:

 jOn Mon, Feb 23, 2015 at 5:07 AM, Salvatore Orlando sorla...@nicira.com
 wrote:
  Lazy-Stacker summary:
  I am doing some work on Neutron IPAM code for IP Allocation, and I need to
  find out whether it's better to use db locking queries (SELECT ... FOR
 UPDATE)
  or some sort of non-blocking algorithm.
  Some measures suggest that for this specific problem db-level locking is
  more efficient even when using multi-master DB clusters, which kind of
  counters recent findings by other contributors [2]... but also backs
 those
  from others [7].
 
  The long and boring thing:
 
  In the past few months there's been a fair amount of discussion
 concerning
  the use of locking queries (ie: SELECT ... FOR UPDATE) in various
 OpenStack
  projects, especially in connection with the fact that this kind of
 queries
  is likely to trigger write set certification failures in synchronous db
  replication engines such as MySQL Galera - as pointed out in [1] and
  thoroughly analysed in [2].
  Neutron, in particular, uses this construct often - and the correctness
 of
  the whole IPAM logic depends on it. In this regard Neutron is also
 guilty of
  not implementing any sort of retry mechanism. Eugene, Rossella, and
 others
  have been working towards fixing this issue. Some examples of their work
 can
  be found at [3] and [4].
 
  However, techniques such as compare and swap described in [2] can be
 used
  for implementing non-blocking algorithms which avoid the write-set
  certification issue altogether.
  A neutron example of such technique can be found in [5]. However, a bug
 in
  this process found by Attila [6], and the subsequent gerrit discussion
 [7],
  raised further questions on locking vs non-blocking solutions.
  At this stage, we know that using database-level locks triggers write-set
  certification failures in synchronous multi-master modes. These failures are
  reported as deadlock errors. Such errors can be dealt with by retrying the
  transaction. However, this is claimed to be inefficient in [2], whereas
  Attila in [7] argues the opposite.
 
  As I'm rewriting Neutron's IPAM logic as a part of [8], which heavily
 relies
  on this construct, I decided to spend some time to put everything to the
  test.
  To this aim I devised 3 algorithms.
 
  A-1) The classic one which locks the availability ranges for a subnet
 until
  allocation is complete, augmented with a simple and brutal retry
 mechanism
  [9].

 So, similar to today's implementation.

  A-2) A wait-free 2 step process, in which the first one creates the IP
  allocation entry, and the second completes it by adjusting availability
  ranges [10]. Primary key violations will trigger retries.

 My summary:  queries RESERVED ip requests and subtracts them from
 availability.  Attempts to generate an IP from the resulting set.
 Adjust availability in an idempotent way as a post-step.  Does that
 sound about right?


Correct.
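
In skeleton form the two steps look more or less like this (made-up table
names and helpers, not the actual code in [10]):

    from sqlalchemy import text
    from sqlalchemy.exc import IntegrityError

    def adjust_availability_ranges(engine, subnet_id, ip):
        # Placeholder for the idempotent range-adjustment post-step.
        pass

    def allocate(engine, subnet_id, candidate_ips):
        for ip in candidate_ips:
            try:
                # Step 1: claim the address. The primary key on
                # (subnet_id, ip_address) detects a concurrent winner.
                with engine.begin() as conn:
                    conn.execute(
                        text("INSERT INTO ipallocations (subnet_id, ip_address)"
                             " VALUES (:sid, :ip)"),
                        {"sid": subnet_id, "ip": ip},
                    )
            except IntegrityError:
                continue  # lost the race for this address, try the next one
            # Step 2: adjust availability ranges as an idempotent post-step.
            adjust_availability_ranges(engine, subnet_id, ip)
            return ip
        raise RuntimeError("no addresses available")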


 Seems there are four flavors of this.  Did you compare them?  Which
 one did you use to compare with the other two?


There are indeed 4 flavors. Random and sequential, and for each one of them
with or without a preliminary range check to verify whether the IP address
has already been removed by a concurrent thread and hence skip the process.

Random allocation works terribly with availability ranges. This is mainly
because adjusting them with random allocation in the great majority of
cases means doing at least 1 update + 1 insert query (picking an address
from the middle of a range splits it in two), whereas sequential allocation
usually results in a single update query (taking the first address merely
shrinks the range).
The range check before retrying brings improvements in the majority of
cases tested in this experimental campaign. However, its effectiveness
decreases when the concurrency level is low, so it might not be the best
solution.


  A-3) A wait-free, lock-free 3 step process [11], which adopts
  compare-and-swap [2] and the same election process as the bully algorithm
  [12].

 Also, 2 flavors of this one:  random and sequential.  I'll admit I did
 not fully grok this implementation.  It looks to me like it would be
 inefficient.


It sounds inefficient because we are making very few assumptions about the
underlying database. I think its effectiveness can be improved;
nevertheless it is its complexity that worries me.


  Unfortunately neutron does not create IP records for every address in a
  subnet's CIDR - this would be fairly bad for IPv6 networks. Therefore we
  cannot simply apply the simple CAS technique introduced in [2].

 What if it created records for only N addresses and did the same sort
 of thing with those?  That would mitigate the issue with IPv6 and
 large IPv4 subnets.  If those N are exhausted then we'd have to run a
 routine to create more records.  Multiple workers could run that
 routine at the same time but I imagine that they would all attempt to
 insert the same set of new records and only one would win.  The others
 would hit key 

Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM

2015-02-25 Thread John Belamaric


On 2/25/15, 10:52 AM, Carl Baldwin c...@ecbaldwin.net wrote:

Wondering if Kilo should just focus on creating the interface which
will allow us to create multiple implementations and swap them out
during the Liberty development cycle.  Hopefully, this could include
even something like your option 2 below.

I think I would go with the locking solution first because it is most
like what we have today.

+1. The pluggable framework will make it easy to swap these in and out, to
get real-world experience with each.
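
To illustrate what I mean by the boundary (hypothetical names, not the
actual interface under review), it could be as small as this, with the
A-1/A-2/A-3 variants living behind it:

    import abc

    class IPAMDriver(abc.ABC):
        """Hypothetical pluggable boundary; the REST layer only sees this."""

        @abc.abstractmethod
        def allocate_ip(self, subnet_id, ip_address=None):
            """Return an allocated address (a specific one if requested)."""

        @abc.abstractmethod
        def release_ip(self, subnet_id, ip_address):
            """Return ip_address to the subnet's availability pool."""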




Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM

2015-02-24 Thread Robert Collins
On 24 February 2015 at 01:07, Salvatore Orlando sorla...@nicira.com wrote:
 Lazy-Stacker summary:
...
 In the medium term, there are a few things we might consider for Neutron's
 built-in IPAM.
 1) Move the allocation logic out of the driver, thus making IPAM an
 independent service. The API workers will then communicate with the IPAM
 service through a message bus, where IP allocation requests will be
 naturally serialized
 2) Use 3rd-party software such as dogpile, zookeeper, or even memcached to
 implement distributed coordination. I have nothing against it, and I reckon
 Neutron can only benefit from it (in case you're considering arguing that
 it does not scale, please also provide solid arguments to support your
 claim!). Nevertheless, I do believe API request processing should proceed
 undisturbed as much as possible. If processing an API request requires
 distributed coordination among several components then it probably means
 that an asynchronous paradigm is more suitable for that API request.

So data is great. It sounds like as long as we have an appropriate
retry decorator in place, that write locks are better here, at least
for up to 30 threads. But can we trust the data?

One thing I'm not clear on is the SQL statement count.  You say 100
queries for A-1 with a time on Galera of 0.06*1.2=0.072 seconds per
allocation ? So is that 2 queries over 50 allocations over 20 threads?

I'm not clear on what the request parameter in the test json files
does, and AFAICT your threads each do one request each. As such I
suspect that you may be seeing less concurrency - and thus contention
- than real-world setups where APIs are deployed to run worker
processes in separate processes and requests are coming in
willy-nilly. The size of each algorithm's workload is so small that it's
feasible to imagine the thread completing before the GIL bytecode-count
check triggers (see
https://docs.python.org/2/library/sys.html#sys.setcheckinterval) and
the GIL's lack of fairness would exacerbate that.

If I may suggest:
 - use multiprocessing or some other worker-pool approach rather than threads
 - or set setcheckinterval down low (e.g. to 20 or something)
 - do multiple units of work (in separate transactions) within each
worker, aim for e.g. 10 seconds of work or some such.
 - log with enough detail that we can report on the actual concurrency
achieved. E.g. log the time in us when each transaction starts and
finishes, then we can assess how many concurrent requests were
actually running.
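
Concretely, I'm picturing a harness roughly like this (the per-transaction
body is a placeholder):

    import logging
    import multiprocessing
    import time

    logging.basicConfig(format="%(message)s", level=logging.INFO)

    def one_allocation():
        # Placeholder for a single unit of work (one IP allocation in its
        # own transaction).
        time.sleep(0.01)

    def worker(worker_id, duration_s=10.0):
        deadline = time.time() + duration_s
        done = 0
        while time.time() < deadline:
            start_us = int(time.time() * 1e6)
            one_allocation()
            end_us = int(time.time() * 1e6)
            # Log start/finish in microseconds so the actual concurrency
            # achieved can be reconstructed afterwards.
            logging.info("worker=%d txn=%d start_us=%d end_us=%d",
                         worker_id, done, start_us, end_us)
            done += 1
        return done

    if __name__ == "__main__":
        # Separate processes sidestep the GIL entirely.
        with multiprocessing.Pool(processes=20) as pool:
            totals = pool.map(worker, range(20))
        print("transactions per worker:", totals)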

If the results are still the same - great, full steam ahead. If not,
well let's revisit :)

-Rob


-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud



[openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM

2015-02-23 Thread Salvatore Orlando
Lazy-Stacker summary:
I am doing some work on Neutron IPAM code for IP Allocation, and I need to
find out whether it's better to use db locking queries (SELECT ... FOR UPDATE)
or some sort of non-blocking algorithm.
Some measures suggest that for this specific problem db-level locking is
more efficient even when using multi-master DB clusters, which kind of
counters recent findings by other contributors [2]... but also backs those
from others [7].

The long and boring thing:

In the past few months there's been a fair amount of discussion concerning
the use of locking queries (ie: SELECT ... FOR UPDATE) in various OpenStack
projects, especially in connection with the fact that this kind of queries
is likely to trigger write set certification failures in synchronous db
replication engines such as MySQL Galera - as pointed out in [1] and
thoroughly analysed in [2].
Neutron, in particular, uses this construct often - and the correctness of
the whole IPAM logic depends on it. In this regard Neutron is also guilty
of not implementing any sort of retry mechanism. Eugene, Rossella, and
others have been working towards fixing this issue. Some examples of their
work can be found at [3] and [4].

However, techniques such as compare and swap described in [2] can be used
for implementing non-blocking algorithms which avoid the write-set
certification issue altogether.
A neutron example of such technique can be found in [5]. However, a bug in
this process found by Attila [6], and the subsequent gerrit discussion [7],
raised further questions on locking vs non-blocking solutions.
At this stage, we know that using database-level locks triggers write-set
certification failures in synchronous multi-master modes. These failures are
reported as deadlock errors. Such errors can be dealt with by retrying the
transaction. However, this is claimed to be inefficient in [2], whereas
Attila in [7] argues the opposite.

As I'm rewriting Neutron's IPAM logic as a part of [8], which heavily
relies on this construct, I decided to spend some time to put everything to
the test.
To this aim I devised 3 algorithms.

A-1) The classic one which locks the availability ranges for a subnet until
allocation is complete, augmented with a simple and brutal retry mechanism
[9] (sketched just after this list).
A-2) A wait-free 2 step process, in which the first one creates the IP
allocation entry, and the second completes it by adjusting availability
ranges [10]. Primary key violations will trigger retries.
A-3) A wait-free, lock-free 3 step process [11], which adopts
compare-and-swap [2] and the same election process as the bully algorithm
[12].
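
For A-1, the gist is the following (a simplified sketch with illustrative
table names, not the actual patch in [9]):

    import time

    from sqlalchemy import create_engine, text
    from sqlalchemy.exc import OperationalError

    engine = create_engine("mysql+pymysql://user:pass@localhost/neutron_test")

    def allocate_ip(subnet_id, max_retries=10):
        for attempt in range(max_retries):
            try:
                with engine.begin() as conn:
                    # Lock the availability ranges for this subnet for the
                    # duration of the transaction.
                    row = conn.execute(
                        text("SELECT id, first_ip FROM availabilityranges"
                             " WHERE subnet_id = :sid LIMIT 1 FOR UPDATE"),
                        {"sid": subnet_id},
                    ).first()
                    if row is None:
                        raise RuntimeError("subnet exhausted")
                    # ... take row.first_ip, shrink or split the range and
                    # insert the IP allocation record, all in this
                    # transaction ...
                    return row.first_ip
            except OperationalError:
                # On Galera, certification failures surface as deadlock
                # errors; the "simple and brutal" answer is back off and retry.
                time.sleep(0.01 * (attempt + 1))
        raise RuntimeError("allocation failed after %d retries" % max_retries)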

Unfortunately neutron does not create IP records for every address in a
subnet's CIDR - this would be fairly bad for IPv6 networks. Therefore we
cannot simply apply the simple CAS technique introduced in [2].
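
For reference, the technique in [2] boils down to an optimistic, guarded
UPDATE along these lines (illustrative names), which is precisely why it
needs a pre-existing row per address to flip:

    from sqlalchemy import text

    def try_claim(conn, subnet_id, ip):
        # Flip the row from AVAILABLE to ALLOCATED only if nobody got there
        # first; zero affected rows means we lost the race and must re-read.
        result = conn.execute(
            text("UPDATE ipallocations SET status = 'ALLOCATED'"
                 " WHERE subnet_id = :sid AND ip_address = :ip"
                 " AND status = 'AVAILABLE'"),
            {"sid": subnet_id, "ip": ip},
        )
        return result.rowcount == 1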

I then tested them under arbitrary concurrent load from multiple workers.
The tests were performed first on a single node backend, and then a 3-node
mysql Galera cluster, installed and configured using the official
documentation [13].
The single-node tests showed, as it might have been expected, that the
algorithm leveraging db-level locks is way more efficient [14].
While it's pointless discussing the full results, one important aspect here
is that db-level locks are a lot more gentle with the database, and
this results in consistently faster execution times.
With 20 concurrent threads, every thread completed in about 0.06 seconds
with db-level locks (A-1). With A-2 it took about 0.45 seconds, and with
A-3 0.32. With A-2 we saw over 500 queries, a little more than 200 with
A-3, and just a bit more than 100 with A-1.
It's worth noting that what makes A-3 faster than A-2 is a random
allocation strategy which drastically reduces collisions. A-3's performance
with sequential allocations is much worse (the avg. thread runtime with 20
concurrent threads is about 0.75 seconds)

With the test on the Galera cluster I was expecting a terrible slowdown in
A-1 because of deadlocks caused by certification failures. I was extremely
disappointed that the slowdown I measured however does not make any of the
other algorithms a viable alternative.
On the Galera cluster I did not run extensive collections for A-2. Indeed
primary key violations seem to trigger db deadlocks because of failed write
set certification too (but I have not yet tested this).
I ran tests with 10 threads on each node, for a total of 30 workers. Some
results are available at [15]. There was indeed a slow down in A-1 (about
20%), whereas A-3 performance stayed pretty much constant. Regardless, A-1
was still at least 3 times faster than A-3.
As A-3's queries are mostly select (about 75% of them) use of caches might
make it a lot faster; also the algorithm is probably inefficient and can be
optimised in several areas. Still, I suspect it can be made faster than
A-1. At this stage I am leaning towards adopting db-level locks with
retries for Neutron's IPAM. However, since I never trust