Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-12 Thread Attila Fazekas


- Original Message -
From: Attila Fazekas afaze...@redhat.com
To: Jay Pipes jaypi...@gmail.com
Cc: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org, Pavel Kholkin pkhol...@mirantis.com
Sent: Thursday, February 12, 2015 11:52:39 AM
Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody 
should know about Galera





- Original Message -
 From: Jay Pipes jaypi...@gmail.com
 To: Attila Fazekas afaze...@redhat.com
 Cc: OpenStack Development Mailing List (not for usage questions) 
 openstack-dev@lists.openstack.org, Pavel
 Kholkin pkhol...@mirantis.com
 Sent: Wednesday, February 11, 2015 9:52:55 PM
 Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody 
 should know about Galera
 
 On 02/11/2015 06:34 AM, Attila Fazekas wrote:
  - Original Message -
  From: Jay Pipes jaypi...@gmail.com
  To: Attila Fazekas afaze...@redhat.com
  Cc: OpenStack Development Mailing List (not for usage questions)
  openstack-dev@lists.openstack.org, Pavel
  Kholkin pkhol...@mirantis.com
  Sent: Tuesday, February 10, 2015 7:32:11 PM
  Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody
  should know about Galera
 
  On 02/10/2015 06:28 AM, Attila Fazekas wrote:
  - Original Message -
  From: Jay Pipes jaypi...@gmail.com
  To: Attila Fazekas afaze...@redhat.com, OpenStack Development
  Mailing
  List (not for usage questions)
  openstack-dev@lists.openstack.org
  Cc: Pavel Kholkin pkhol...@mirantis.com
  Sent: Monday, February 9, 2015 7:15:10 PM
  Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things
  everybody
  should know about Galera
 
  On 02/09/2015 01:02 PM, Attila Fazekas wrote:
  I do not see why not to use `FOR UPDATE` even with multi-writer, or
  does the retry/swap way really solve anything here?
  snip
  Have I missed something?
 
  Yes. Galera does not replicate the (internal to InnoDB) row-level locks
  that are needed to support SELECT FOR UPDATE statements across multiple
  cluster nodes.
 
  Galera does not replicate the row-level locks created by UPDATE/INSERT
  ...
  So what should I do with the UPDATE?
 
  No, Galera replicates the write sets (binary log segments) for
  UPDATE/INSERT/DELETE statements -- the things that actually
  change/add/remove records in DB tables. No locks are replicated, ever.
 
  Galera does not do any replication at UPDATE/INSERT/DELETE time.
 
  $ mysql
  use test;
  CREATE TABLE test (id integer PRIMARY KEY AUTO_INCREMENT, data CHAR(64));
 
  (echo 'use test; BEGIN;'; while true ; do echo "INSERT INTO test(data)
  VALUES ('test');"; done ) | mysql
 
  Writer1 is busy, but the other nodes have not noticed anything about the
  above pending transaction; for them this transaction does not exist as
  long as you do not issue a COMMIT.

  Any DML/DQL you issue without a COMMIT has not happened from the other
  nodes' perspective.
 
  Replication happens at COMMIT time if the `write set` is not empty.
 
 We're going in circles here. I was just pointing out that SELECT ... FOR
 UPDATE will never replicate anything. INSERT/UPDATE/DELETE statements
 will cause a write-set to be replicated (yes, upon COMMIT of the
 containing transaction).
 
 Please see my repeated statements in this thread and others that the
 compare-and-swap technique is dependent on issuing *separate*
 transactions for each SELECT and UPDATE statement...
 
  When a transaction wins the voting, the other nodes roll back every local
  transaction that held a conflicting row lock.
 
 A SELECT statement in a separate transaction does not ever trigger a
 ROLLBACK, nor will an UPDATE statement that does not match any rows.
 That is IMO how increased throughput is achieved in the compare-and-swap
 technique versus the SELECT FOR UPDATE technique.
 
Yes, I mentioned this approach in one bug [0].

But the related changes under review actually work as I said [1][2][3],
and the SELECT is not in a separate dedicated transaction.


[0] https://bugs.launchpad.net/neutron/+bug/1410854 [sorry I sent a wrong link 
before]
[1] https://review.openstack.org/#/c/143837/
[2] https://review.openstack.org/#/c/153558/
[3] https://review.openstack.org/#/c/149261/
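
For concreteness, a minimal sketch of the compare-and-swap retry pattern
discussed above, using the `books` example from later in the thread. This is
illustrative only (PyMySQL, the hostname, credentials, and the helper name are
assumptions, not Nova's actual code):

import pymysql

# autocommit=True makes every statement its own transaction, which is the
# precondition described above for the CAS technique.
conn = pymysql.connect(host='node1', user='test', password='test',
                       database='test', autocommit=True)

def set_genre(book_id, new_genre, retries=5):
    cur = conn.cursor()
    for _ in range(retries):
        # Transaction 1: a plain read; it never triggers a ROLLBACK.
        cur.execute("SELECT genre FROM books WHERE id = %s", (book_id,))
        row = cur.fetchone()
        if row is None:
            return False
        old_genre = row[0]
        # Transaction 2: the swap; it matches zero rows if another writer
        # changed the record in between.
        cur.execute(
            "UPDATE books SET genre = %s WHERE id = %s AND genre = %s",
            (new_genre, book_id, old_genre))
        if cur.rowcount == 1:
            return True  # swap succeeded
        # rowcount == 0: no ROLLBACK needed, just re-read and retry.
    return False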

 -jay
 
 -jay
 




Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-11 Thread Jay Pipes

On 02/11/2015 06:34 AM, Attila Fazekas wrote:

- Original Message -

From: Jay Pipes jaypi...@gmail.com
To: Attila Fazekas afaze...@redhat.com
Cc: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org, Pavel
Kholkin pkhol...@mirantis.com
Sent: Tuesday, February 10, 2015 7:32:11 PM
Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody 
should know about Galera

On 02/10/2015 06:28 AM, Attila Fazekas wrote:

- Original Message -

From: Jay Pipes jaypi...@gmail.com
To: Attila Fazekas afaze...@redhat.com, OpenStack Development Mailing
List (not for usage questions)
openstack-dev@lists.openstack.org
Cc: Pavel Kholkin pkhol...@mirantis.com
Sent: Monday, February 9, 2015 7:15:10 PM
Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody
should know about Galera

On 02/09/2015 01:02 PM, Attila Fazekas wrote:

I do not see why not to use `FOR UPDATE` even with multi-writer, or
does the retry/swap way really solve anything here?

snip

Have I missed something?


Yes. Galera does not replicate the (internal to InnoDB) row-level locks
that are needed to support SELECT FOR UPDATE statements across multiple
cluster nodes.


Galera does not replicate the row-level locks created by UPDATE/INSERT ...
So what should I do with the UPDATE?


No, Galera replicates the write sets (binary log segments) for
UPDATE/INSERT/DELETE statements -- the things that actually
change/add/remove records in DB tables. No locks are replicated, ever.


Galera does not do any replication at UPDATE/INSERT/DELETE time.

$ mysql
use test;
CREATE TABLE test (id integer PRIMARY KEY AUTO_INCREMENT, data CHAR(64));

(echo 'use test; BEGIN;'; while true ; do echo "INSERT INTO test(data) VALUES 
('test');"; done ) | mysql

Writer1 is busy, but the other nodes have not noticed anything about the above 
pending transaction; for them this transaction does not exist as long as you 
do not issue a COMMIT.

Any DML/DQL you issue without a COMMIT has not happened from the other 
nodes' perspective.

Replication happens at COMMIT time if the `write set` is not empty.


We're going in circles here. I was just pointing out that SELECT ... FOR 
UPDATE will never replicate anything. INSERT/UPDATE/DELETE statements 
will cause a write-set to be replicated (yes, upon COMMIT of the 
containing transaction).


Please see my repeated statements in this thread and others that the 
compare-and-swap technique is dependent on issuing *separate* 
transactions for each SELECT and UPDATE statement...



When a transaction wins the voting, the other nodes roll back every local
transaction that held a conflicting row lock.


A SELECT statement in a separate transaction does not ever trigger a 
ROLLBACK, nor will an UPDATE statement that does not match any rows. 
That is IMO how increased throughput is achieved in the compare-and-swap 
technique versus the SELECT FOR UPDATE technique.


-jay

-jay



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-11 Thread Jay Pipes

On 02/11/2015 07:58 AM, Matthew Booth wrote:

On 10/02/15 18:29, Jay Pipes wrote:

On 02/10/2015 09:47 AM, Matthew Booth wrote:

On 09/02/15 18:15, Jay Pipes wrote:

On 02/09/2015 01:02 PM, Attila Fazekas wrote:

I do not see why not to use `FOR UPDATE` even with multi-writer, or
does the retry/swap way really solve anything here?

snip

Have I missed something?


Yes. Galera does not replicate the (internal to InnoDB) row-level locks
that are needed to support SELECT FOR UPDATE statements across multiple
cluster nodes.

https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ



Is that the right link, Jay? I'm taking your word on the write-intent
locks not being replicated, but that link seems to say the opposite.


This link is better:

http://www.percona.com/blog/2014/09/11/openstack-users-shed-light-on-percona-xtradb-cluster-deadlock-issues/


Specifically the line:

The local record lock held by the started transaction on pxc1 didn’t
play any part in replication or certification (replication happens at
commit time, there was no commit there yet).


Thanks, Jay, that's a great article.

Based on that, I think I may have misunderstood what you were saying
before. I currently understand that the behaviour of select ... for
update is correct on Galera, it's just not very efficient. Correct in
this case meaning it aborts the transaction due to a correctly detected
lock conflict.

FWIW, that was pretty much my original understanding, but without the
detail.

To expand: Galera doesn't replicate write intent locks, but it turns out
it doesn't have to for correctness. The reason is that the conflict
between a local write intent lock and a remote write, which is
replicated, will always be detected during or before local certification.


Exactly correct.

Best,
-jay



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-11 Thread Attila Fazekas




- Original Message -
 From: Jay Pipes jaypi...@gmail.com
 To: Attila Fazekas afaze...@redhat.com
 Cc: OpenStack Development Mailing List (not for usage questions) 
 openstack-dev@lists.openstack.org, Pavel
 Kholkin pkhol...@mirantis.com
 Sent: Tuesday, February 10, 2015 7:32:11 PM
 Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody 
 should know about Galera
 
 On 02/10/2015 06:28 AM, Attila Fazekas wrote:
  - Original Message -
  From: Jay Pipes jaypi...@gmail.com
  To: Attila Fazekas afaze...@redhat.com, OpenStack Development Mailing
  List (not for usage questions)
  openstack-dev@lists.openstack.org
  Cc: Pavel Kholkin pkhol...@mirantis.com
  Sent: Monday, February 9, 2015 7:15:10 PM
  Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody
  should know about Galera
 
  On 02/09/2015 01:02 PM, Attila Fazekas wrote:
  I do not see why not to use `FOR UPDATE` even with multi-writer, or
  does the retry/swap way really solve anything here?
  snip
  Have I missed something?
 
  Yes. Galera does not replicate the (internal to InnoDB) row-level locks
  that are needed to support SELECT FOR UPDATE statements across multiple
  cluster nodes.
 
  Galera does not replicate the row-level locks created by UPDATE/INSERT ...
  So what should I do with the UPDATE?
 
 No, Galera replicates the write sets (binary log segments) for
 UPDATE/INSERT/DELETE statements -- the things that actually
 change/add/remove records in DB tables. No locks are replicated, ever.

Galera does not do any replication at UPDATE/INSERT/DELETE time. 

$ mysql
use test;
CREATE TABLE test (id integer PRIMARY KEY AUTO_INCREMENT, data CHAR(64));

(echo 'use test; BEGIN;'; while true ; do echo "INSERT INTO test(data) VALUES 
('test');"; done ) | mysql

Writer1 is busy, but the other nodes have not noticed anything about the above 
pending transaction; for them this transaction does not exist as long as you 
do not issue a COMMIT.

Any DML/DQL you issue without a COMMIT has not happened from the other 
nodes' perspective.

Replication happens at COMMIT time if the `write set` is not empty.

When a transaction wins the voting, the other nodes roll back every local
transaction that held a conflicting row lock.
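
A minimal sketch of the visibility behavior described above, assuming two
cluster nodes reachable as node1 and node2 and the test table created earlier
(PyMySQL, hostnames, and credentials are illustrative):

import pymysql

w1 = pymysql.connect(host='node1', user='test', password='test',
                     database='test')
w2 = pymysql.connect(host='node2', user='test', password='test',
                     database='test')

c1 = w1.cursor()
c1.execute("INSERT INTO test(data) VALUES ('test')")  # pending, not committed

c2 = w2.cursor()
c2.execute("SELECT COUNT(*) FROM test")
print(c2.fetchone())  # the pending row is invisible: nothing was replicated

w1.commit()  # only now is the write set certified and replicated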


  Why should I handle the FOR UPDATE differently?
 
 Because SELECT FOR UPDATE doesn't change any rows, and therefore does
 not trigger any replication event in Galera.

What matters is whether the full transaction changed any row at COMMIT time or not.
The DML statements themselves do not start replication, just as `SELECT FOR UPDATE` 
does not.


 See here:
 
 http://www.percona.com/blog/2014/09/11/openstack-users-shed-light-on-percona-xtradb-cluster-deadlock-issues/
 
 -jay
 
  https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ
 
  Best,
  -jay
 
 



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-11 Thread Matthew Booth
On 10/02/15 18:29, Jay Pipes wrote:
 On 02/10/2015 09:47 AM, Matthew Booth wrote:
 On 09/02/15 18:15, Jay Pipes wrote:
 On 02/09/2015 01:02 PM, Attila Fazekas wrote:
 I do not see why not to use `FOR UPDATE` even with multi-writer, or
 does the retry/swap way really solve anything here?
 snip
 Have I missed something?

 Yes. Galera does not replicate the (internal to InnoDB) row-level locks
 that are needed to support SELECT FOR UPDATE statements across multiple
 cluster nodes.

 https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ


 Is that the right link, Jay? I'm taking your word on the write-intent
 locks not being replicated, but that link seems to say the opposite.
 
 This link is better:
 
 http://www.percona.com/blog/2014/09/11/openstack-users-shed-light-on-percona-xtradb-cluster-deadlock-issues/
 
 
 Specifically the line:
 
 The local record lock held by the started transaction on pxc1 didn’t
 play any part in replication or certification (replication happens at
 commit time, there was no commit there yet).

Thanks, Jay, that's a great article.

Based on that, I think I may have misunderstood what you were saying
before. I currently understand that the behaviour of select ... for
update is correct on Galera, it's just not very efficient. Correct in
this case meaning it aborts the transaction due to a correctly detected
lock conflict.

FWIW, that was pretty much my original understanding, but without the
detail.

To expand: Galera doesn't replicate write intent locks, but it turns out
it doesn't have to for correctness. The reason is that the conflict
between a local write intent lock and a remote write, which is
replicated, will always be detected during or before local certification.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-10 Thread Matthew Booth
On 09/02/15 18:15, Jay Pipes wrote:
 On 02/09/2015 01:02 PM, Attila Fazekas wrote:
 I do not see why not to use `FOR UPDATE` even with multi-writer, or
 does the retry/swap way really solve anything here?
 snip
 Have I missed something?
 
 Yes. Galera does not replicate the (internal to InnoDB) row-level locks
 that are needed to support SELECT FOR UPDATE statements across multiple
 cluster nodes.
 
 https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ

Is that the right link, Jay? I'm taking your word on the write-intent
locks not being replicated, but that link seems to say the opposite.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-10 Thread Attila Fazekas




- Original Message -
 From: Jay Pipes jaypi...@gmail.com
 To: openstack-dev@lists.openstack.org
 Sent: Monday, February 9, 2015 9:36:45 PM
 Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody 
 should know about Galera
 
 On 02/09/2015 03:10 PM, Clint Byrum wrote:
  Excerpts from Jay Pipes's message of 2015-02-09 10:15:10 -0800:
  On 02/09/2015 01:02 PM, Attila Fazekas wrote:
  I do not see why not to use `FOR UPDATE` even with multi-writer, or
  does the retry/swap way really solve anything here?
  snip
  Have I missed something?
 
  Yes. Galera does not replicate the (internal to InnoDB) row-level locks
  that are needed to support SELECT FOR UPDATE statements across multiple
  cluster nodes.
 
  https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ
 
  Attila acknowledged that. What Attila was saying was that by using it
  with Galera, the box that is doing the FOR UPDATE locks will simply fail
  upon commit because a conflicting commit has already happened and arrived
  from the node that accepted the write. Further what Attila is saying is
  that this means there is not such an obvious advantage to the CAS method,
  since the rollback and the # updated rows == 0 are effectively equivalent
  at this point, seeing as the prior commit has already arrived and thus
  will not need to wait to fail certification and be rolled back.
 
 No, that is not correct. In the case of the CAS technique, the frequency
 of rollbacks due to certification failure is demonstrably less than when
 using SELECT FOR UPDATE and relying on the certification timeout error
 to signal a deadlock.
 
  I am not entirely certain that is true though, as I think what will
  happen in sequential order is:
 
  writer1: UPDATE books SET genre = 'Scifi' WHERE genre = 'sciencefiction';
  writer1: -- send in-progress update to cluster
  writer2: SELECT FOR UPDATE books WHERE id=3;
  writer1: COMMIT
  writer1: -- try to certify commit in cluster
  ** Here is where I stop knowing for sure what happens **
  writer2: certifies writer1's transaction or blocks?
 
 It will certify writer1's transaction. It will only block another thread
 hitting writer2 requesting write locks or write-intent read locks on the
 same records.
 
  writer2: UPDATE books SET genre = 'sciencefiction' WHERE id=3;
  writer2: COMMIT -- One of them is rolled back.
 

The other transaction can be rolled back before you do an actual commit:
writer1: BEGIN
writer2: BEGIN
writer1: update test set val=42 where id=1;
writer2: update test set val=42 where id=1;
writer1: COMMIT
writer2: show variables;
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting 
transaction

As you can see, the 2nd transaction failed without issuing a COMMIT, after the 
1st one had committed.
You can send anything to mysql on writer2 at this point;
even invalid statements return with `Deadlock`.

  So, at that point where I'm not sure (please some Galera expert tell
  me):
 
  If what happens is as I suggest, writer1's transaction is certified,
  then that just means the lock sticks around blocking stuff on writer2,
  but that the data is updated and it is certain that writer2's commit will
  be rolled back. However, if it blocks waiting on the lock to resolve,
  then I'm at a loss to determine which transaction would be rolled back,
  but I am thinking that it makes sense that the transaction from writer2
  would be rolled back, because the commit is later.
 
 That is correct. writer2's transaction would be rolled back. The
 difference is that the CAS method would NOT trigger a ROLLBACK. It would
 instead return 0 rows affected, because the UPDATE statement would
 instead look like this:
 
 UPDATE books SET genre = 'sciencefiction' WHERE id = 3 AND genre = 'SciFi';
 
 And the return of 0 rows affected would trigger a simple retry of the
 read and then update attempt on writer2 instead of dealing with ROLLBACK
 semantics on the transaction.
 
 Note that in the CAS method, the SELECT statement and the UPDATE are in
 completely different transactions. This is a very important thing to
 keep in mind.
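
To make those transaction boundaries explicit, a sketch of the two-transaction
structure just described (PyMySQL; connection details and the surrounding
retry logic are illustrative):

import pymysql

conn = pymysql.connect(host='node1', user='test', password='test',
                       database='test')
cur = conn.cursor()

# Transaction 1: read the current value and commit immediately.
cur.execute("SELECT genre FROM books WHERE id = 3")
(genre,) = cur.fetchone()
conn.commit()

# Transaction 2: the compare-and-swap UPDATE in its own transaction.
cur.execute("UPDATE books SET genre = 'sciencefiction' "
            "WHERE id = 3 AND genre = %s", (genre,))
affected = cur.rowcount
conn.commit()

if affected == 0:
    # Another writer changed the row first: re-read and retry.
    # No ROLLBACK is involved.
    pass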
 
  All this to say that usually the reason for SELECT FOR UPDATE is not
  to only do an update (the transactional semantics handle that), but
  also to prevent the old row from being seen again, which, as Jay says,
  it cannot do.  So I believe you are both correct:
 
  * Attila, yes I think you're right that CAS is not any more efficient
  at replacing SELECT FOR UPDATE from a blocking standpoint.
 
 It is more efficient because there are far fewer ROLLBACKs of
 transactions occurring in the system.
 
 If you look at a slow query log (with a 0 slow query time) for a MySQL
 Galera server in a multi-write cluster during a run of Tempest or Rally,
 you will notice that the number of ROLLBACK statements is extraordinary.
 AFAICR, when Peter Boros and I benchmarked a Rally launch and delete 10K
 VM run, we saw that nearly 11% of *total* queries executed against the server 
were ROLLBACKs.

Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-10 Thread Attila Fazekas




- Original Message -
 From: Jay Pipes jaypi...@gmail.com
 To: Attila Fazekas afaze...@redhat.com, OpenStack Development Mailing 
 List (not for usage questions)
 openstack-dev@lists.openstack.org
 Cc: Pavel Kholkin pkhol...@mirantis.com
 Sent: Monday, February 9, 2015 7:15:10 PM
 Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody 
 should know about Galera
 
 On 02/09/2015 01:02 PM, Attila Fazekas wrote:
  I do not see why not to use `FOR UPDATE` even with multi-writer, or
  does the retry/swap way really solve anything here?
 snip
  Have I missed something?
 
 Yes. Galera does not replicate the (internal to InnoDB) row-level locks
 that are needed to support SELECT FOR UPDATE statements across multiple
 cluster nodes.
 

Galera does not replicate the row-level locks created by UPDATE/INSERT ...
So what should I do with the UPDATE?

Why should I handle the FOR UPDATE differently?

 https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ
 
 Best,
 -jay
 



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-10 Thread Jay Pipes

On 02/10/2015 06:28 AM, Attila Fazekas wrote:

- Original Message -

From: Jay Pipes jaypi...@gmail.com
To: Attila Fazekas afaze...@redhat.com, OpenStack Development Mailing List (not 
for usage questions)
openstack-dev@lists.openstack.org
Cc: Pavel Kholkin pkhol...@mirantis.com
Sent: Monday, February 9, 2015 7:15:10 PM
Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody 
should know about Galera

On 02/09/2015 01:02 PM, Attila Fazekas wrote:

I do not see why not to use `FOR UPDATE` even with multi-writer, or
does the retry/swap way really solve anything here?

snip

Have I missed something?


Yes. Galera does not replicate the (internal to InnoDB) row-level locks
that are needed to support SELECT FOR UPDATE statements across multiple
cluster nodes.


Galera does not replicate the row-level locks created by UPDATE/INSERT ...
So what should I do with the UPDATE?


No, Galera replicates the write sets (binary log segments) for 
UPDATE/INSERT/DELETE statements -- the things that actually 
change/add/remove records in DB tables. No locks are replicated, ever.



Why should I handle the FOR UPDATE differently?


Because SELECT FOR UPDATE doesn't change any rows, and therefore does 
not trigger any replication event in Galera.


See here:

http://www.percona.com/blog/2014/09/11/openstack-users-shed-light-on-percona-xtradb-cluster-deadlock-issues/

-jay


https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ

Best,
-jay





Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-10 Thread Jay Pipes

On 02/10/2015 09:47 AM, Matthew Booth wrote:

On 09/02/15 18:15, Jay Pipes wrote:

On 02/09/2015 01:02 PM, Attila Fazekas wrote:

I do not see why not to use `FOR UPDATE` even with multi-writer, or
does the retry/swap way really solve anything here?

snip

Have I missed something?


Yes. Galera does not replicate the (internal to InnoDB) row-level locks
that are needed to support SELECT FOR UPDATE statements across multiple
cluster nodes.

https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ


Is that the right link, Jay? I'm taking your word on the write-intent
locks not being replicated, but that link seems to say the opposite.


This link is better:

http://www.percona.com/blog/2014/09/11/openstack-users-shed-light-on-percona-xtradb-cluster-deadlock-issues/

Specifically the line:

The local record lock held by the started transaction on pxc1 didn’t 
play any part in replication or certification (replication happens at 
commit time, there was no commit there yet).


-jay



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-09 Thread Attila Fazekas




- Original Message -
 From: Jay Pipes jaypi...@gmail.com
 To: openstack-dev@lists.openstack.org, Pavel Kholkin pkhol...@mirantis.com
 Sent: Wednesday, February 4, 2015 8:04:10 PM
 Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody 
 should know about Galera
 
 On 02/04/2015 12:05 PM, Sahid Orentino Ferdjaoui wrote:
  On Wed, Feb 04, 2015 at 04:30:32PM +, Matthew Booth wrote:
  I've spent a few hours today reading about Galera, a clustering solution
  for MySQL. Galera provides multi-master 'virtually synchronous'
  replication between multiple mysql nodes. i.e. I can create a cluster of
  3 mysql dbs and read and write from any of them with certain consistency
  guarantees.
 
  I am no expert[1], but this is a TL;DR of a couple of things which I
  didn't know, but feel I should have done. The semantics are important to
  application design, which is why we should all be aware of them.
 
 
  * Commit will fail if there is a replication conflict
 
  foo is a table with a single field, which is its primary key.
 
  A: start transaction;
  B: start transaction;
  A: insert into foo values(1);
  B: insert into foo values(1); -- 'regular' DB would block here, and
 report an error on A's commit
  A: commit; -- success
  B: commit; -- KABOOM
 
  Confusingly, Galera will report a 'deadlock' to node B, despite this not
  being a deadlock by any definition I'm familiar with.
 
 It is a failure to certify the writeset, which bubbles up as an InnoDB
 deadlock error. See my article here:
 
 http://www.joinfu.com/2015/01/understanding-reservations-concurrency-locking-in-nova/
 
 Which explains this.
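
A sketch of how that certification failure reaches the client, following the
A/B example above (PyMySQL is assumed; the exception class and error-code
mapping are driver details, shown here only as an illustration):

import pymysql

conn_b = pymysql.connect(host='node2', user='test', password='test',
                         database='test')
cur_b = conn_b.cursor()
cur_b.execute("INSERT INTO foo VALUES (1)")  # session B, not yet committed

# ... meanwhile session A on another node inserts the same key and commits ...

try:
    conn_b.commit()  # B's write set fails certification here
except pymysql.err.OperationalError as exc:
    code, msg = exc.args
    if code == 1213:  # ER_LOCK_DEADLOCK: the 'deadlock' described above
        pass  # retry the whole transaction from the top
    else:
        raise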

I do not see why not to use `FOR UPDATE` even with multi-writer, or
does the retry/swap way really solve anything here?

Using 'FOR UPDATE' with the 'repeatable read' isolation level still
seems more efficient, and it has several advantages.

* The SELECT with 'FOR UPDATE' will read the committed version, so you
  do not really need to worry about when the transaction actually
  started. You will get fresh data before you reach the actual UPDATE.

* In the article, the example query will not return the new version of
  the data in the same transaction even if you are retrying, so you
  need to restart the transaction anyway.

  When you are using the 'FOR UPDATE' way, if any other transaction
  successfully commits a conflicting row on any other Galera writer,
  your pending transaction will be rolled back at your next statement,
  WITHOUT spending any time certifying that transaction. From this
  perspective, checking the affected-row count after the UPDATE
  (`compare and swap`) or handling an exception does not make any
  difference.

* Using FOR UPDATE in a Galera transaction (multi-writer) is no more
  evil than using UPDATE; a concurrent commit invalidates both of them
  in the same way (DBDeadlock).

* With just a `single writer`, 'FOR UPDATE' does not let other threads
  do useless work while wasting resources.

* The swap way can also be rolled back by Galera almost anywhere
  (DBDeadlock). In the end, the swap way looks like it has just
  replaced the exception handling with a return-code check + manual
  transaction restart.

Have I missed something?

  Yes! And if I can add more information (I hope I do not make a
  mistake), I think it's a known issue which comes from MySQL; that is why
  we have a decorator to do a retry and so handle this case here:
 
 
  http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177
 
 It's not an issue with MySQL. It's an issue with any database code that
 is highly contentious.
 
 Almost all highly distributed or concurrent applications need to handle
 deadlock issues, and the most common way to handle deadlock issues on
 database records is using a retry technique. There's nothing new about
 that with Galera.
 
 The issue with our use of the @_retry_on_deadlock decorator is *not*
 that the retry decorator is not needed, but rather it is used too
 frequently. The compare-and-swap technique I describe in the article
 above dramatically* reduces the number of deadlocks that occur (and need
 to be handled by the @_retry_on_deadlock decorator) and dramatically
 reduces the contention over critical database sections.
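
A hedged sketch of a retry-on-deadlock decorator in the spirit of the one
referenced above; the names, the exception class, and the back-off policy are
illustrative, not Nova's exact code:

import functools
import time

class DBDeadlock(Exception):
    """Stand-in for the deadlock error raised by the DB layer."""

def retry_on_deadlock(func, retries=5, delay=0.1):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(retries):
            try:
                return func(*args, **kwargs)
            except DBDeadlock:
                if attempt == retries - 1:
                    raise            # give up after the last attempt
                time.sleep(delay)    # brief back-off, then retry the call
    return wrapper

@retry_on_deadlock
def reserve_quota():
    ...  # transaction body that may fail certification on Galera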
 
 Best,
 -jay
 
 * My colleague Pavel Kholkin is putting together the results of a
 benchmark run that compares the compare-and-swap method with the raw
 @_retry_on_deadlock decorator method. Spoiler: the compare-and-swap
 method cuts the runtime of the benchmark by almost *half*.
 
  Essentially, anywhere that a regular DB would block, Galera will not
  block transactions on different nodes. Instead, it will cause one of the
  transactions to fail on commit. This is still ACID, but the semantics
  are quite different.
 
  The impact of this is that code which makes correct use of locking may
  still fail with a 'deadlock'. The solution to this is to either fail

Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-09 Thread Jay Pipes

On 02/09/2015 01:02 PM, Attila Fazekas wrote:

I do not see why not to use `FOR UPDATE` even with multi-writer, or
does the retry/swap way really solve anything here?

snip

Have I missed something?


Yes. Galera does not replicate the (internal to InnoDB) row-level locks 
that are needed to support SELECT FOR UPDATE statements across multiple 
cluster nodes.


https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ

Best,
-jay



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-09 Thread Jay Pipes

On 02/09/2015 03:10 PM, Clint Byrum wrote:

Excerpts from Jay Pipes's message of 2015-02-09 10:15:10 -0800:

On 02/09/2015 01:02 PM, Attila Fazekas wrote:

I do not see why not to use `FOR UPDATE` even with multi-writer, or
does the retry/swap way really solve anything here?

snip

Have I missed something?


Yes. Galera does not replicate the (internal to InnoDB) row-level locks
that are needed to support SELECT FOR UPDATE statements across multiple
cluster nodes.

https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ


Attila acknowledged that. What Attila was saying was that by using it
with Galera, the box that is doing the FOR UPDATE locks will simply fail
upon commit because a conflicting commit has already happened and arrived
from the node that accepted the write. Further what Attila is saying is
that this means there is not such an obvious advantage to the CAS method,
since the rollback and the # updated rows == 0 are effectively equivalent
at this point, seeing as the prior commit has already arrived and thus
will not need to wait to fail certification and be rolled back.


No, that is not correct. In the case of the CAS technique, the frequency 
of rollbacks due to certification failure is demonstrably less than when 
using SELECT FOR UPDATE and relying on the certification timeout error 
to signal a deadlock.



I am not entirely certain that is true though, as I think what will
happen in sequential order is:

writer1: UPDATE books SET genre = 'Scifi' WHERE genre = 'sciencefiction';
writer1: -- send in-progress update to cluster
writer2: SELECT FOR UPDATE books WHERE id=3;
writer1: COMMIT
writer1: -- try to certify commit in cluster
** Here is where I stop knowing for sure what happens **
writer2: certifies writer1's transaction or blocks?


It will certify writer1's transaction. It will only block another thread 
hitting writer2 requesting write locks or write-intent read locks on the 
same records.



writer2: UPDATE books SET genre = 'sciencefiction' WHERE id=3;
writer2: COMMIT -- One of them is rolled back.

So, at that point where I'm not sure (please some Galera expert tell
me):

If what happens is as I suggest, writer1's transaction is certified,
then that just means the lock sticks around blocking stuff on writer2,
but that the data is updated and it is certain that writer2's commit will
be rolled back. However, if it blocks waiting on the lock to resolve,
then I'm at a loss to determine which transaction would be rolled back,
but I am thinking that it makes sense that the transaction from writer2
would be rolled back, because the commit is later.


That is correct. writer2's transaction would be rolled back. The 
difference is that the CAS method would NOT trigger a ROLLBACK. It would 
instead return 0 rows affected, because the UPDATE statement would 
instead look like this:


UPDATE books SET genre = 'sciencefiction' WHERE id = 3 AND genre = 'SciFi';

And the return of 0 rows affected would trigger a simple retry of the 
read and then update attempt on writer2 instead of dealing with ROLLBACK 
semantics on the transaction.


Note that in the CAS method, the SELECT statement and the UPDATE are in 
completely different transactions. This is a very important thing to 
keep in mind.



All this to say that usually the reason for SELECT FOR UPDATE is not
to only do an update (the transactional semantics handle that), but
also to prevent the old row from being seen again, which, as Jay says,
it cannot do.  So I believe you are both correct:

* Attila, yes I think you're right that CAS is not any more efficient
at replacing SELECT FOR UPDATE from a blocking standpoint.


It is more efficient because there are far fewer ROLLBACKs of 
transactions occurring in the system.


If you look at a slow query log (with a 0 slow query time) for a MySQL 
Galera server in a multi-write cluster during a run of Tempest or Rally, 
you will notice that the number of ROLLBACK statements is extraordinary. 
AFAICR, when Peter Boros and I benchmarked a Rally launch and delete 10K 
VM run, we saw that nearly 11% of *total* queries executed against the server 
were ROLLBACKs. This, in my opinion, is the main reason that the CAS 
method will show as more efficient.



* Jay, yes I think you're right that SELECT FOR UPDATE is not the right
thing to use to do such reads, because one is relying on locks that are
meaningless on a Galera cluster.

Where I think the CAS ends up being the preferred method for this sort
of thing is where one considers that it won't hold a meaningless lock
while the transaction is completed and then rolled back.


CAS is preferred because it is measurably faster and more 
obstruction-free than SELECT FOR UPDATE. A colleague of mine is almost 
ready to publish documentation showing a benchmark of this that shows 
nearly a 100% decrease in total amount of lock/wait time using CAS 
versus waiting for the coarser-level certification timeout to retry the 
transactions. As mentioned above, I believe this is due to the dramatic
decrease in ROLLBACKs.

Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-09 Thread Clint Byrum
Excerpts from Jay Pipes's message of 2015-02-09 12:36:45 -0800:
 CAS is preferred because it is measurably faster and more 
 obstruction-free than SELECT FOR UPDATE. A colleague of mine is almost 
 ready to publish documentation showing a benchmark of this that shows 
 nearly a 100% decrease in total amount of lock/wait time using CAS 
 versus waiting for the coarser-level certification timeout to retry the 
 transactions. As mentioned above, I believe this is due to the dramatic 
 decrease in ROLLBACKs.
 

I think the missing piece of the puzzle for me was that each ROLLBACK is
an expensive operation. I figured it was like a non-local return (i.e.
'raise' in python or 'throw' in java) and thus not measurably different.
But now that I think of it, there is likely quite a bit of optimization
around the query path, and not so much around the rollback path.

The bottom of this rabbit hole is simply exquisite, isn't it? :)



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-09 Thread Clint Byrum
Excerpts from Jay Pipes's message of 2015-02-09 10:15:10 -0800:
 On 02/09/2015 01:02 PM, Attila Fazekas wrote:
  I do not see why not to use `FOR UPDATE` even with multi-writer, or
  does the retry/swap way really solve anything here?
 snip
  Have I missed something?
 
 Yes. Galera does not replicate the (internal to InnoDB) row-level locks 
 that are needed to support SELECT FOR UPDATE statements across multiple 
 cluster nodes.
 
 https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ
 

Attila acknowledged that. What Attila was saying was that by using it
with Galera, the box that is doing the FOR UPDATE locks will simply fail
upon commit because a conflicting commit has already happened and arrived
from the node that accepted the write. Further what Attila is saying is
that this means there is not such an obvious advantage to the CAS method,
since the rollback and the # updated rows == 0 are effectively equivalent
at this point, seeing as the prior commit has already arrived and thus
will not need to wait to fail certification and be rolled back.

I am not entirely certain that is true though, as I think what will
happen in sequential order is:

writer1: UPDATE books SET genre = 'Scifi' WHERE genre = 'sciencefiction';
writer1: -- send in-progress update to cluster
writer2: SELECT FOR UPDATE books WHERE id=3;
writer1: COMMIT
writer1: -- try to certify commit in cluster
** Here is where I stop knowing for sure what happens **
writer2: certifies writer1's transaction or blocks?
writer2: UPDATE books SET genre = 'sciencefiction' WHERE id=3;
writer2: COMMIT -- One of them is rolled back.

So, at that point where I'm not sure (please some Galera expert tell
me):

If what happens is as I suggest, writer1's transaction is certified,
then that just means the lock sticks around blocking stuff on writer2,
but that the data is updated and it is certain that writer2's commit will
be rolled back. However, if it blocks waiting on the lock to resolve,
then I'm at a loss to determine which transaction would be rolled back,
but I am thinking that it makes sense that the transaction from writer2
would be rolled back, because the commit is later.

All this to say that usually the reason for SELECT FOR UPDATE is not
to only do an update (the transactional semantics handle that), but
also to prevent the old row from being seen again, which, as Jay says,
it cannot do.  So I believe you are both correct:

* Attila, yes I think you're right that CAS is not any more efficient
at replacing SELECT FOR UPDATE from a blocking standpoint.

* Jay, yes I think you're right that SELECT FOR UPDATE is not the right
thing to use to do such reads, because one is relying on locks that are
meaningless on a Galera cluster.

Where I think the CAS ends up being the preferred method for this sort
of thing is where one considers that it won't hold a meaningless lock
while the transaction is completed and then rolled back.



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-09 Thread Jay Pipes

On 02/09/2015 05:02 PM, Clint Byrum wrote:

Excerpts from Jay Pipes's message of 2015-02-09 12:36:45 -0800:

CAS is preferred because it is measurably faster and more
obstruction-free than SELECT FOR UPDATE. A colleague of mine is almost
ready to publish documentation showing a benchmark of this that shows
nearly a 100% decrease in total amount of lock/wait time using CAS
versus waiting for the coarser-level certification timeout to retry the
transactions. As mentioned above, I believe this is due to the dramatic
decrease in ROLLBACKs.



I think the missing piece of the puzzle for me was that each ROLLBACK is
an expensive operation. I figured it was like a non-local return (i.e.
'raise' in python or 'throw' in java) and thus not measurably different.
But now that I think of it, there is likely quite a bit of optimization
around the query path, and not so much around the rollback path.

The bottom of this rabbit hole is simply exquisite, isn't it? :)


It is indeed. :) As soon as I think I understand it fully, a new problem 
area exposes itself.


Best,
-jay



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-07 Thread Peter Boros
Hi Angus,

If causal reads is set in a session, it won't delay all reads, just
that specific read that you set it for. Let's say you have 4 sessions,
and in one of them you set causal reads; the other 3 won't wait on
anything. The read in the one session that you set this in will be
delayed; in the other 3, it won't be. Also, this delay is usually
small. Since the replication itself is synchronous, if a node is not
able to keep up with the rest of the cluster in terms of writes, it
will send flow control messages to the other nodes. Flow control means
that it has its receive queue full, and the other nodes have to wait
until they can do more writes (in case of flow control, writes on the
other nodes are blocked until the given node catches up with writes).
So the delay imposed here can't be arbitrarily large.
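
A minimal sketch of using this per-session (PyMySQL; host and credentials are
illustrative):

import pymysql

conn = pymysql.connect(host='node2', user='test', password='test',
                       database='test')
cur = conn.cursor()

# Only this session pays the latency cost; other sessions are unaffected.
cur.execute("SET SESSION wsrep_causal_reads = ON")

# This read now waits until the local replication queue has been applied
# up to the marker, so a write committed on another node just before is
# guaranteed to be visible.
cur.execute("SELECT genre FROM books WHERE id = 3")
print(cur.fetchone())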


On Sat, Feb 7, 2015 at 3:00 AM, Angus Lees g...@inodes.org wrote:
 Thanks for the additional details Peter.  This confirms the parts I'd
 deduced from the docs I could find, and is useful knowledge.

 On Sat Feb 07 2015 at 2:24:23 AM Peter Boros peter.bo...@percona.com
 wrote:

 - Like many others said it before me, consistent reads can be achieved
 with wsrep_causal_reads set on in the session.


 So the example was two dependent command-line invocations (write followed by
 read) that have no way to re-use the same DB session (without introducing
 lots of affinity issues that we'd also like to avoid).

 Enabling wsrep_causal_reads makes sure the latter read sees the effects of
 the earlier write, but comes at the cost of delaying all reads by some
 amount depending on the write-load of the galera cluster (if I understand
 correctly).  This additional delay was raised as a concern severe enough not
 to just go down this path.

 Really we don't care about other writes that may have occurred (we always
 need to deal with races against other actors), we just want to ensure our
 earlier write has taken effect on the galera server where we sent the second
 read request.  If we had some way to say wsrep_delay_until $first_txid
 then we could be sure of read-after-write from a different DB session and
 also (in the vast majority of cases) suffer no additional delay.  An opaque
 sequencer is a generic concept across many of the distributed consensus
 stores I'm familiar with, so this needn't be exposed as a Galera-only quirk.


 Meh, I gather people are bored with the topic at this point.  As I suggested
 much earlier, I'd just enable wsrep_causal_reads on the first request for
 the session and then move on to some other problem ;)

  - Gus





-- 
Peter Boros, Principal Architect, Percona
Telephone: +1 888 401 3401 ext 546
Emergency: +1 888 401 3401 ext 911
Skype: percona.pboros



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-06 Thread Angus Lees
Thanks for the additional details Peter.  This confirms the parts I'd
deduced from the docs I could find, and is useful knowledge.

On Sat Feb 07 2015 at 2:24:23 AM Peter Boros peter.bo...@percona.com
wrote:

 - Like many others said it before me, consistent reads can be achieved
 with wsrep_causal_reads set on in the session.


So the example was two dependent command-line invocations (write followed
by read) that have no way to re-use the same DB session (without
introducing lots of affinity issues that we'd also like to avoid).

Enabling wsrep_causal_reads makes sure the latter read sees the effects of
the earlier write, but comes at the cost of delaying all reads by some
amount depending on the write-load of the galera cluster (if I understand
correctly).  This additional delay was raised as a concern severe enough
not to just go down this path.

Really we don't care about other writes that may have occurred (we always
need to deal with races against other actors), we just want to ensure our
earlier write has taken effect on the galera server where we sent the
second read request.  If we had some way to say wsrep_delay_until
$first_txid then we we could be sure of read-after-write from a different
DB session and also (in the vast majority of cases) suffer no additional
delay.  An opaque sequencer is a generic concept across many of the
distributed consensus stores I'm familiar with, so this needn't be exposed
as a Galera-only quirk.


Meh, I gather people are bored with the topic at this point.  As I
suggested much earlier, I'd just enable wsrep_causal_reads on the first
request for the session and then move on to some other problem ;)

 - Gus


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-06 Thread Peter Boros
Hi Angus and everyone,

I would like to reply for a couple of things:
- The behavior of overlapping transactions depends on the
transaction isolation level, even in the case of a single server,
for any database. This was pointed out by others earlier as well.

- The deadlock error from Galera can be confusing, but the point is
that the application can actually treat this as a deadlock (or apply
any kind of retry logic that it would apply to a failed transaction);
I don't know whether it would be even more confusing from the
developer's point of view if it said brute force error instead.
Transactions can fail in a database; in the initial example the
transaction will fail with a duplicate key error. The result is pretty
much the same from the application's perspective: the transaction was
not successful (it failed as a block), and the application should
handle the failure. There can be a lot more reasons for a transaction
to fail regardless of the database engine; some of these failures are
persistent (for example the disk is full underneath the database), and
some of these are intermittent in nature, like the case above. A good
retry mechanism can be good for handling the intermittent failures,
depending on the application logic.

- As many others have said before me, consistent reads can be achieved
with wsrep_causal_reads set on in the session. I can shed some light
on how this works. Nodes in Galera participate in a group
communication. A global order of the transactions is established as
part of this. Since the global order of the transactions is known, a
session with wsrep_causal_reads on will put a marker in the local
replication queue. Because transaction ordering is global, the session
will simply be blocked until all the other transactions before that
marker are processed in the replication queue. So, setting
wsrep_causal_reads imposes additional latency only for the given
select we are using it on (it literally just waits for the queue to be
processed up to the current transaction). Because of this, manual
checking of the global transaction ids is not necessary.

- On synchronous replication: Galera only transmits the data
synchronously; it doesn't do synchronous apply. A transaction is sent
in parallel to the rest of the cluster nodes (to be accurate, it's
only sent to the nodes that are in the same group segment, but it
waits until all the group segments get the data). Once the other nodes
have received it, the transaction commits locally; the others will
apply it later. The cluster can do this because of certification, and
because certification is deterministic (the result of the
certification will be the same on all nodes; otherwise the nodes have
a different state, for example one of them was written locally). The
replication uses write sets, which are practically row-based MySQL
binary log events plus some metadata. The metadata is good for 2
things: you can take a look at 2 write sets and tell whether they are
conflicting or not, and you can decide whether a write set is
applicable to a database. Because this is checked at certification
time, the apply part can be parallel (because of the certification,
it's guaranteed that the transactions are not conflicting). When it
comes to consistency and replication speed, there are no miracles;
there are tradeoffs to make. Two-phase commit is relatively slow,
distributed locking is relatively slow; this is a lot faster, but the
application should handle transaction failures (which it should
probably handle anyway).

Here is the xtradb cluster documentation (Percona Server with galera):
http://www.percona.com/doc/percona-xtradb-cluster/5.6/#user-s-manual

Here is the multi-master replication part of the documentation:
http://www.percona.com/doc/percona-xtradb-cluster/5.6/features/multimaster-replication.html


On Fri, Feb 6, 2015 at 3:36 AM, Angus Lees g...@inodes.org wrote:
 On Fri Feb 06 2015 at 12:59:13 PM Gregory Haynes g...@greghaynes.net
 wrote:

 Excerpts from Joshua Harlow's message of 2015-02-06 01:26:25 +:
  Angus Lees wrote:
   On Fri Feb 06 2015 at 4:25:43 AM Clint Byrum cl...@fewbar.com
   mailto:cl...@fewbar.com wrote:
   I'd also like to see consideration given to systems that handle
   distributed consistency in a more active manner. etcd and
   Zookeeper are
   both such systems, and might serve as efficient guards for
   critical
   sections without raising latency.
  
  
   +1 for moving to such systems.  Then we can have a repeat of the above
   conversation without the added complications of SQL semantics ;)
  
 
  So just an fyi:
 
  http://docs.openstack.org/developer/tooz/ exists.
 
  Specifically:
 
 
  http://docs.openstack.org/developer/tooz/developers.html#tooz.coordination.CoordinationDriver.get_lock
 
  It has a locking api that it provides (that plugs into the various
  backends); there is also a WIP https://review.openstack.org/#/c/151463/
   driver that is being worked on for etcd.
 

 An interesting note about the etcd implementation is that you can
 select per-request whether you want to wait for quorum on a read or not.
 This means that in theory you could obtain higher throughput for most
 operations which do not require this and then only gain quorum for
 operations which require it (e.g. locks).

Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Rushi Agrawal
On 5 February 2015 at 23:07, Clint Byrum cl...@fewbar.com wrote:

 Excerpts from Avishay Traeger's message of 2015-02-04 22:19:53 -0800:
  On Wed, Feb 4, 2015 at 11:00 PM, Robert Collins 
 robe...@robertcollins.net
  wrote:
 
   On 5 February 2015 at 10:24, Joshua Harlow harlo...@outlook.com
 wrote:
How interesting,
   
Why are people using galera if it behaves like this? :-/
  
   Because it's actually fairly normal. In fact it's an instance of point 7
   on https://wiki.openstack.org/wiki/BasicDesignTenets - one of our
   oldest wiki pages :).
  
 
  When I hear MySQL I don't exactly think of eventual consistency (#7),
  scalability (#1), horizontal scalability (#4), etc.
  For the past few months I have been advocating implementing an
 alternative
  to db/sqlalchemy, but of course it's a huge undertaking.  NoSQL (or even
  distributed key-value stores) should be considered IMO.  Just some food
 for
  thought :)
 

 I know it is popular to think that MySQL* == old slow and low-scale, but
 that is only popular with those who have not actually tried to scale
 MySQL. You may want to have a chat with the people running MySQL at
 Google, Facebook, and a long tail of not quite as big sites but still
 massively bigger than most clouds. Note that many of the people who
 helped those companies scale up are involved directly with OpenStack.

 Just an aside: Youtube relies completely on MySQL for all of its database
traffic, but uses a layer on top of it called Vitess [1] to allow it to
scale.

[1]: https://github.com/youtube/vitess


 The NoSQL bits that are popular out there make the easy part easy. There
 is no magic bullet for the hard part, which is when you need to do both
 synchronous and asynchronous. Factor in its maturity and the breadth of
 talent available, and I'll choose MySQL for this task every time.

 * Please let's also give a nod to our friends working on MariaDB, a
   MySQL-compatible fork that many find preferable and for the purposes
   of this discussion, equivalent.



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Attila Fazekas




- Original Message -
 From: Matthew Booth mbo...@redhat.com
 To: openstack-dev@lists.openstack.org
 Sent: Thursday, February 5, 2015 12:32:33 PM
 Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody 
 should know about Galera
 
 On 05/02/15 11:01, Attila Fazekas wrote:
  I have a question related to deadlock handling as well.
  
  Why is the DBDeadlock exception not caught generally for all api/rpc
  requests?
  
  The mysql recommendation regarding Deadlocks [1]:
  Normally, you must write your applications so that they are always
   prepared to re-issue a transaction if it gets rolled back because of a
   deadlock.
 
 This is evil imho, although it may well be pragmatic. A deadlock (a real
 deadlock, that is) occurs because of a preventable bug in code. It
 occurs because 2 transactions have attempted to take multiple locks in a
 different order. Getting this right is hard, but it is achievable. The
 solution to real deadlocks is to fix the bugs.

 
 Galera 'deadlocks' on the other hand are not deadlocks, despite being
 reported as such (sounds as though this is due to an implementation
 quirk?). They don't involve 2 transactions holding mutual locks, and
 there is never any doubt about how to proceed. They involve 2
 transactions holding the same lock, and 1 of them committed first. In a
 real deadlock they wouldn't get as far as commit. This isn't any kind of
 bug: it's normal behaviour in this environment and you just have to
 handle it.

  Now the services are just handling the DBDeadlock in several places.
  We have some logstash hits for other places even without galera.
 
 I haven't had much success with logstash. Could you post a query which
 would return these? This would be extremely interesting.

Just use this:
message: DBDeadlock

If you would like to exclude the lock wait timeout ones:
message: Deadlock found when trying to get lock


  Instead of throwing 503 to the end user, the request could be repeated
  `silently`.
  
  The users would be able to repeat the request themselves,
  so the automated repeat should not cause unexpected new problems.
 
 Good point: we could argue 'no worse than now', even if it's buggy.
 
  The retry limit might be configurable; the exception needs to be watched
  before anything is sent to the db on behalf of the transaction or request.
  
  Considering every request handler as a potential deadlock thrower seems
  much easier than deciding case by case.
 
 Well this happens at the transaction level, and we don't quite have a
 1:1 request:transaction relationship. We're moving towards it, but
 potentially long running requests will always have to use multiple
 transactions.
 
 However, I take your point. I think retry on transaction failure is
 something which would benefit from standard handling in a library.
 
 Matt
 --
 Matthew Booth
 Red Hat Engineering, Virtualisation Team
 
 Phone: +442070094448 (UK)
 GPG ID:  D33C3490
 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
 


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Matthew Booth
On 04/02/15 19:04, Jay Pipes wrote:
 On 02/04/2015 12:05 PM, Sahid Orentino Ferdjaoui wrote:
 On Wed, Feb 04, 2015 at 04:30:32PM +, Matthew Booth wrote:
 I've spent a few hours today reading about Galera, a clustering solution
 for MySQL. Galera provides multi-master 'virtually synchronous'
 replication between multiple mysql nodes. i.e. I can create a cluster of
 3 mysql dbs and read and write from any of them with certain consistency
 guarantees.

 I am no expert[1], but this is a TL;DR of a couple of things which I
 didn't know, but feel I should have done. The semantics are important to
 application design, which is why we should all be aware of them.


 * Commit will fail if there is a replication conflict

 foo is a table with a single field, which is its primary key.

 A: start transaction;
 B: start transaction;
 A: insert into foo values(1);
 B: insert into foo values(1); -- 'regular' DB would block here, and
report an error on A's commit
 A: commit; -- success
 B: commit; -- KABOOM

 Confusingly, Galera will report a 'deadlock' to node B, despite this not
 being a deadlock by any definition I'm familiar with.
 
 It is a failure to certify the writeset, which bubbles up as an InnoDB
 deadlock error. See my article here:
 
 http://www.joinfu.com/2015/01/understanding-reservations-concurrency-locking-in-nova/
 
 
 Which explains this.
 
  Yes ! and if I can add more information, and I hope I do not make a
  mistake, I think it's a known issue which comes from MySQL, that is why
  we have a decorator to do a retry and so handle this case here:

   
 http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177

 
 It's not an issue with MySQL. It's an issue with any database code that
 is highly contentious.
 
 Almost all highly distributed or concurrent applications need to handle
 deadlock issues, and the most common way to handle deadlock issues on
 database records is using a retry technique. There's nothing new about
 that with Galera.
 
 The issue with our use of the @_retry_on_deadlock decorator is *not*
 that the retry decorator is not needed, but rather it is used too
 frequently. The compare-and-swap technique I describe in the article
 above dramatically* reduces the number of deadlocks that occur (and need
 to be handled by the @_retry_on_deadlock decorator) and dramatically
 reduces the contention over critical database sections.

I'm still confused as to how this code got there, though. We shouldn't
be hitting Galera lock contention (reported as deadlocks) if we're using
a single master, which I thought we were. Does this mean either:

A. There are deployments using multi-master?
B. These are really deadlocks?

If A, is this something we need to continue to support?

Thanks,

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Matthew Booth
On 05/02/15 11:11, Sahid Orentino Ferdjaoui wrote:
 I'm still confused as to how this code got there, though. We shouldn't
 be hitting Galera lock contention (reported as deadlocks) if we're using
 a single master, which I thought we were. Does this mean either:
 
 I guess we can hit a lock contention even in single master.

I don't think so, but you can certainly still have real deadlocks.
They're bugs, though.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Clint Byrum
Excerpts from Angus Lees's message of 2015-02-04 16:59:31 -0800:
 On Thu Feb 05 2015 at 9:02:49 AM Robert Collins robe...@robertcollins.net
 wrote:
 
  On 5 February 2015 at 10:24, Joshua Harlow harlo...@outlook.com wrote:
   How interesting,
  
   Why are people using galera if it behaves like this? :-/
 
  Because it's actually fairly normal. In fact it's an instance of point 7
  on https://wiki.openstack.org/wiki/BasicDesignTenets - one of our
  oldest wiki pages :).
 
  In more detail, consider what happens in full isolation when you have
  the A and B example given, but B starts its transaction before A.
 
  B BEGIN
  A BEGIN
  A INSERT foo
  A COMMIT
  B SELECT foo - NULL
 
 
 Note that this still makes sense from each of A and B's individual view of
 the world.
 
 If I understood correctly, the big change with Galera that Matthew is
 highlighting is that read-after-write may not be consistent from the pov of
 a single thread.
 

No that's not a complete picture.

What Matthew is highlighting is that after a commit, a new transaction
may not see the write if it is done on a separate node in the cluster.

In a single thread, using a single database session, then a read after
successful commit is guaranteed to read a version of the database
that existed after that commit. What it may not be consistent with is
subsequent writes which may have happened after the commit on other
servers, unless you use the sync wait.

 Not having read-after-write is *really* hard to code against (see for
 example x86 SMP cache coherency, C++ threading semantics, etc, which all
 provide read-after-write for this reason).  This is particularly true when
 the affected operations are hidden behind an ORM - it isn't clear what
 might involve a database call, and sequencers (or logical clocks, etc)
 aren't made explicit in the API.
 
 I strongly suggest just enabling wsrep_causal_reads on all galera sessions,
 unless you can guarantee that the high-level task is purely read-only, and
 then moving on to something else ;)  If we choose performance over
 correctness here then we're just signing up for lots of debugging of hard
 to reproduce race conditions, and the fixes are going to look like what
 wsrep_causal_reads does anyway.
 
 (Mind you, exposing sequencers at every API interaction would be awesome,
 and I look forward to a future framework and toolchain that makes that easy
 to do correctly)
 

I'd like to see actual examples where that will matter. Meanwhile making
all selects wait for the cluster will basically just ruin responsiveness
and waste tons of time, so we should be careful to think this through
before making any blanket policy.

I'd also like to see consideration given to systems that handle
distributed consistency in a more active manner. etcd and Zookeeper are
both such systems, and might serve as efficient guards for critical
sections without raising latency.



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Clint Byrum
Excerpts from Avishay Traeger's message of 2015-02-04 22:19:53 -0800:
 On Wed, Feb 4, 2015 at 11:00 PM, Robert Collins robe...@robertcollins.net
 wrote:
 
  On 5 February 2015 at 10:24, Joshua Harlow harlo...@outlook.com wrote:
   How interesting,
  
   Why are people using galera if it behaves like this? :-/
 
  Because it's actually fairly normal. In fact it's an instance of point 7
  on https://wiki.openstack.org/wiki/BasicDesignTenets - one of our
  oldest wiki pages :).
 
 
 When I hear MySQL I don't exactly think of eventual consistency (#7),
 scalability (#1), horizontal scalability (#4), etc.
 For the past few months I have been advocating implementing an alternative
 to db/sqlalchemy, but of course it's a huge undertaking.  NoSQL (or even
 distributed key-value stores) should be considered IMO.  Just some food for
 thought :)
 

I know it is popular to think that MySQL* == old slow and low-scale, but
that is only popular with those who have not actually tried to scale
MySQL. You may want to have a chat with the people running MySQL at
Google, Facebook, and a long tail of not quite as big sites but still
massively bigger than most clouds. Note that many of the people who
helped those companies scale up are involved directly with OpenStack.

The NoSQL bits that are popular out there make the easy part easy. There
is no magic bullet for the hard part, which is when you need to do both
synchronous and asynchronous. Factor in its maturity and the breadth of
talent available, and I'll choose MySQL for this task every time.

* Please let's also give a nod to our friends working on MariaDB, a
  MySQL-compatible fork that many find preferable and for the purposes
  of this discussion, equivalent.



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Mike Bayer


Attila Fazekas afaze...@redhat.com wrote:

 I have a question related to deadlock handling as well.
 
 Why is the DBDeadlock exception not caught generally for all api/rpc requests?
 
 The mysql recommendation regarding Deadlocks [1]:
 Normally, you must write your applications so that they are always 
 prepared to re-issue a transaction if it gets rolled back because of a 
 deadlock.
 
 Now the services are just handling the DBDeadlock in several places.
 We have some logstash hits for other places even without galera.
 
 Instead of throwing 503 to the end user, the request could be repeated 
 `silently`.
 
 The users would be able to repeat the request themselves,
 so the automated repeat should not cause unexpected new problems.
 
 The retry limit might be configurable; the exception needs to be watched
 before anything is sent to the db on behalf of the transaction or request.
 
 Considering every request handler as a potential deadlock thrower seems
 much easier than deciding case by case.

typically, deadlocks in “normal” applications are very unusual, except in
well-known “hot-spots” where they are known to occur. The deadlock-retry can
be applied to all methods as a whole, but this generally adds a lot more
weight to the app, in that methods need to be written with the assumption
that a retry can occur. It also raises the question of what happens when one
method that is already wrapped in a retry calls upon another method that is
also wrapped - should the wrappers organize themselves into a single “wrap”
for the whole thing? It’s not like this is a bad idea, but it does have
potential implications.

Part of the promise of enginefacade [1] is that, if applications used the
decorator version (which unfortunately not all apps seem to want to), we
could build this “smart retry” functionality right into the decorator and we
would in fact gain the ability to do this pretty easily.

[1] https://review.openstack.org/#/c/125181/
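
As a sketch of that idea (illustrative names only, not the actual
enginefacade API; it assumes oslo.db's DBDeadlock exception and a
SQLAlchemy session on the context):

    import functools

    from oslo_db import exception as db_exc

    def writer_with_retry(max_retries=3):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(context, *args, **kwargs):
                for attempt in range(max_retries):
                    try:
                        # The decorator owns the transaction, so a retry
                        # re-runs the method body from a clean state.
                        with context.session.begin():
                            return func(context, *args, **kwargs)
                    except db_exc.DBDeadlock:
                        if attempt == max_retries - 1:
                            raise
            return wrapper
        return decorator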




 [1] http://dev.mysql.com/doc/refman/5.0/en/innodb-deadlocks.html
 
 - Original Message -
 From: Matthew Booth mbo...@redhat.com
 To: openstack-dev@lists.openstack.org
 Sent: Thursday, February 5, 2015 10:36:55 AM
 Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody 
 should know about Galera
 
 On 04/02/15 17:05, Sahid Orentino Ferdjaoui wrote:
 * Commit will fail if there is a replication conflict
 
 foo is a table with a single field, which is its primary key.
 
 A: start transaction;
 B: start transaction;
 A: insert into foo values(1);
 B: insert into foo values(1); -- 'regular' DB would block here, and
  report an error on A's commit
 A: commit; -- success
 B: commit; -- KABOOM
 
 Confusingly, Galera will report a 'deadlock' to node B, despite this not
 being a deadlock by any definition I'm familiar with.
 
  Yes ! and if I can add more information, and I hope I do not make a
  mistake, I think it's a known issue which comes from MySQL, that is why
  we have a decorator to do a retry and so handle this case here:
 
  
 http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177
 
 Right, and that remains a significant source of confusion and
 obfuscation in the db api. Our db code is littered with races and
 potential actual deadlocks, but only some functions are decorated. Are
 they decorated because of real deadlocks, or because of Galera lock
 contention? The solutions to those 2 problems are very different! Also,
 hunting deadlocks is hard enough work. Adding the possibility that they
 might not even be there is just evil.
 
 Incidentally, we're currently looking to replace this stuff with some
 new code in oslo.db, which is why I'm looking at it.
 
 Matt
 --
 Matthew Booth
 Red Hat Engineering, Virtualisation Team
 
 Phone: +442070094448 (UK)
 GPG ID:  D33C3490
 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
 


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Angus Lees
On Fri Feb 06 2015 at 4:25:43 AM Clint Byrum cl...@fewbar.com wrote:

 In a single thread, using a single database session, then a read after
 successful commit is guaranteed to read a version of the database
 that existed after that commit.


Ah, I'm relieved to hear this clarification - thanks.

I'd like to see actual examples where that will matter. Meanwhile making
 all selects wait for the cluster will basically just ruin responsiveness
 and waste tons of time, so we should be careful to think this through
 before making any blanket policy.


Matthew's example earlier in the thread is simply a user issuing two
related commands in succession:

$ nova aggregate-create
$ nova aggregate-details

Once that fails a few times, the user will put a poorly commented sleep 2
in between the two statements, and this will fix the problem most of the
time.  A better fix would repeat the aggregate-details query multiple
times until it looks like it has found the previous create.
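
The poll variant might look roughly like this on the client side (a
hedged sketch; the novaclient calls are placeholders):

    import time

    def wait_for_aggregate(client, name, timeout=10.0, interval=0.5):
        # Repeat the read until the earlier create becomes visible.
        deadline = time.time() + timeout
        while time.time() < deadline:
            if any(agg.name == name for agg in client.aggregates.list()):
                return True
            time.sleep(interval)
        return False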

Now, that sleep or poll is of course a poor version of something you could
do at a lower level, by waiting for reads+writes to propagate to a majority
quorum.

I'd also like to see consideration given to systems that handle
 distributed consistency in a more active manner. etcd and Zookeeper are
 both such systems, and might serve as efficient guards for critical
 sections without raising latency.


+1 for moving to such systems.  Then we can have a repeat of the above
conversation without the added complications of SQL semantics ;)

 - Gus


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Joshua Harlow

Angus Lees wrote:

On Fri Feb 06 2015 at 4:25:43 AM Clint Byrum cl...@fewbar.com
mailto:cl...@fewbar.com wrote:

In a single thread, using a single database session, then a read after
successful commit is guaranteed to read a version of the database
that existed after that commit.


Ah, I'm relieved to hear this clarification - thanks.

I'd like to see actual examples where that will matter. Meanwhile making
all selects wait for the cluster will basically just ruin responsiveness
and waste tons of time, so we should be careful to think this through
before making any blanket policy.


Matthew's example earlier in the thread is simply a user issuing two
related commands in succession:

$ nova aggregate-create
$ nova aggregate-details

Once that fails a few times, the user will put a poorly commented sleep
2 in between the two statements, and this will fix the problem most
of the time.  A better fix would repeat the aggregate-details query
multiple times until it looks like it has found the previous create.

Now, that sleep or poll is of course a poor version of something you
could do at a lower level, by waiting for reads+writes to propagate to a
majority quorum.

I'd also like to see consideration given to systems that handle
distributed consistency in a more active manner. etcd and Zookeeper are
both such systems, and might serve as efficient guards for critical
sections without raising latency.


+1 for moving to such systems.  Then we can have a repeat of the above
conversation without the added complications of SQL semantics ;)



So just an fyi:

http://docs.openstack.org/developer/tooz/ exists.

Specifically:

http://docs.openstack.org/developer/tooz/developers.html#tooz.coordination.CoordinationDriver.get_lock

It has a locking api that it provides (that plugs into the various 
backends); there is also a WIP https://review.openstack.org/#/c/151463/ 
driver that is being worked on for etcd.
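
For anyone unfamiliar with it, usage looks roughly like this (a hedged
sketch; the backend URL and names are placeholders):

    from tooz import coordination

    coordinator = coordination.get_coordinator("zake://", b"node-1")
    coordinator.start()

    lock = coordinator.get_lock(b"critical-section")
    with lock:  # tooz locks are context managers
        pass    # the guarded critical section goes here

    coordinator.stop()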


-Josh


  - Gus



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Gregory Haynes
Excerpts from Joshua Harlow's message of 2015-02-06 01:26:25 +0000:
 Angus Lees wrote:
  On Fri Feb 06 2015 at 4:25:43 AM Clint Byrum cl...@fewbar.com
  mailto:cl...@fewbar.com wrote:
  I'd also like to see consideration given to systems that handle
  distributed consistency in a more active manner. etcd and Zookeeper are
  both such systems, and might serve as efficient guards for critical
  sections without raising latency.
 
 
  +1 for moving to such systems.  Then we can have a repeat of the above
  conversation without the added complications of SQL semantics ;)
 
 
 So just an fyi:
 
 http://docs.openstack.org/developer/tooz/ exists.
 
 Specifically:
 
 http://docs.openstack.org/developer/tooz/developers.html#tooz.coordination.CoordinationDriver.get_lock
 
 It has a locking api that it provides (that plugs into the various 
 backends); there is also a WIP https://review.openstack.org/#/c/151463/ 
  driver that is being worked on for etcd.
 

An interesting note about the etcd implementation is that you can
select per-request whether you want to wait for quorum on a read or not.
This means that in theory you could obtain higher throughput for most
operations which do not require this and then only gain quorum for
operations which require it (e.g. locks).
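
Roughly, against the etcd v2 HTTP API (a hedged sketch; endpoint and key
are placeholders):

    import requests

    url = "http://127.0.0.1:2379/v2/keys/nova/instance-1"

    fast = requests.get(url)  # served from the local member's state
    safe = requests.get(url, params={"quorum": "true"})  # linearized read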



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Sahid Orentino Ferdjaoui
On Thu, Feb 05, 2015 at 09:56:21AM +, Matthew Booth wrote:
 On 04/02/15 19:04, Jay Pipes wrote:
  On 02/04/2015 12:05 PM, Sahid Orentino Ferdjaoui wrote:
  On Wed, Feb 04, 2015 at 04:30:32PM +, Matthew Booth wrote:
  I've spent a few hours today reading about Galera, a clustering solution
  for MySQL. Galera provides multi-master 'virtually synchronous'
  replication between multiple mysql nodes. i.e. I can create a cluster of
  3 mysql dbs and read and write from any of them with certain consistency
  guarantees.
 
  I am no expert[1], but this is a TL;DR of a couple of things which I
  didn't know, but feel I should have done. The semantics are important to
  application design, which is why we should all be aware of them.
 
 
  * Commit will fail if there is a replication conflict
 
  foo is a table with a single field, which is its primary key.
 
  A: start transaction;
  B: start transaction;
  A: insert into foo values(1);
  B: insert into foo values(1); -- 'regular' DB would block here, and
 report an error on A's commit
  A: commit; -- success
  B: commit; -- KABOOM
 
  Confusingly, Galera will report a 'deadlock' to node B, despite this not
  being a deadlock by any definition I'm familiar with.
  
  It is a failure to certify the writeset, which bubbles up as an InnoDB
  deadlock error. See my article here:
  
  http://www.joinfu.com/2015/01/understanding-reservations-concurrency-locking-in-nova/
  
  
  Which explains this.
  
   Yes ! and if I can add more information, and I hope I do not make a
   mistake, I think it's a known issue which comes from MySQL, that is why
   we have a decorator to do a retry and so handle this case here:
 

  http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177
 
  
  It's not an issue with MySQL. It's an issue with any database code that
  is highly contentious.

I wanted to speak about the term deadlock (which also seems to
surprise Matthew); I thought it comes from MySQL. In our situation
it's not really a deadlock, just a locked session from A, and so B
needs to retry?

I believe a deadlock would be when session A tries to read something
in table x.foo in order to update y.bar, while B tries to read
something in y.bar in order to update x.foo - so A acquires a lock to
read x.foo, B acquires a lock to read y.bar, and then when A needs to
acquire a lock to update y.bar it cannot, and the same happens to B
with x.foo.
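
In the A/B notation from Matthew's original mail, a genuine lock-order
deadlock would look like this (tables x and y are illustrative):

    A: start transaction;
    B: start transaction;
    A: select * from x where id = 1 for update; -- A locks the row in x
    B: select * from y where id = 1 for update; -- B locks the row in y
    A: update y set bar = 1 where id = 1; -- blocks, waiting for B
    B: update x set foo = 1 where id = 1; -- deadlock: each waits on the other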

  Almost all highly distributed or concurrent applications need to handle
  deadlock issues, and the most common way to handle deadlock issues on
  database records is using a retry technique. There's nothing new about
  that with Galera.
  
  The issue with our use of the @_retry_on_deadlock decorator is *not*
  that the retry decorator is not needed, but rather it is used too
  frequently. The compare-and-swap technique I describe in the article
  above dramatically* reduces the number of deadlocks that occur (and need
  to be handled by the @_retry_on_deadlock decorator) and dramatically
  reduces the contention over critical database sections.

Thanks for these informations.

 I'm still confused as to how this code got there, though. We shouldn't
 be hitting Galera lock contention (reported as deadlocks) if we're using
 a single master, which I thought we were. Does this mean either:

I guess we can hit a lock contention even in single master.

 A. There are deployments using multi-master?
 B. These are really deadlocks?
 
 If A, is this something we need to continue to support?
 
 Thanks,
 
 Matt
 -- 
 Matthew Booth
 Red Hat Engineering, Virtualisation Team
 
 Phone: +442070094448 (UK)
 GPG ID:  D33C3490
 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
 


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Matthew Booth
On 05/02/15 04:30, Mike Bayer wrote:
 Galera doesn't change anything here. I'm really not sure what the
 fuss is about, frankly.
 
 because we’re trying to get Galera to actually work as a load
 balanced cluster to some degree, at least for reads.

Yeah, the use case of concern here is consecutive RPC transactions from
a single remote client, which can't reasonably be in the same
transaction. This affects semantics visible to the end-user.

In Nova, they might do:

$ nova aggregate-create ...
$ nova aggregate-details ...

Should they expect that the second command might fail if they don't
pause long enough between the 2? Should they retry until it succeeds?
This example is a toy, but I would expect to find many other more subtle
examples.

 Otherwise I’m not really sure why we have to bother with Galera at
 all.  If we just want a single MySQL server that has a warm standby
 for failover, why aren’t we just using that capability straight from
 MySQL.  Then we get “SELECT FOR UPDATE” and everything else back.

Actually I think this is a misconception. If I have understood
correctly[1], Galera *does* work with select for update. Use of select
for update on a single node will work exactly as normal with blocking
behaviour. Use of select for update across 2 nodes will not block, but
fail on commit if there was lock contention.
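
Concretely, in the A/B notation from the original mail (an illustrative
sketch; the same caveats apply):

    A (node 1): start transaction;
    B (node 2): start transaction;
    A: select * from foo where id = 1 for update; -- locks the row on node 1
    B: select * from foo where id = 1 for update; -- no block: A's lock is
                                                  -- local to node 1
    A: delete from foo where id = 1;
    A: commit; -- success
    B: delete from foo where id = 1;
    B: commit; -- KABOOM: certification fails, reported as a deadlock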

 Galera’s “multi master” capability is already in the trash for us,
 and it seems like “multi-slave” is only marginally useful either, the
 vast majority of openstack has to be 100% pointed at just one node to
 work correctly.

It's not necessarily in the trash, but given that the semantics are
different (fail on commit rather than block) we'd need to do more work
to support them. It sounds to me like we want to defer that rather than
try to fix it now, i.e. multi-master is currently unsupport(ed|able).

We could add an additional decorator to enginefacade which would
re-execute a @writer block if it detected Galera lock contention.
However, given that we'd have to audit that code for other side-effects,
for the moment it sounds like it's safer to fail.

Matt

[1] Standard caveats apply.
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Attila Fazekas
I have a question related to deadlock handling as well.

Why is the DBDeadlock exception not caught generally for all api/rpc requests?

The mysql recommendation regarding Deadlocks [1]:
Normally, you must write your applications so that they are always 
 prepared to re-issue a transaction if it gets rolled back because of a 
deadlock.

Now the services are just handling the DBDeadlock in several places.
We have some logstash hits for other places even without galera.

Instead of throwing 503 to the end user, the request could be repeated 
`silently`.

The users would be able to repeat the request themselves,
so the automated repeat should not cause unexpected new problems.

The retry limit might be configurable; the exception needs to be watched
before anything is sent to the db on behalf of the transaction or request.

Considering every request handler as a potential deadlock thrower seems
much easier than deciding case by case.

[1] http://dev.mysql.com/doc/refman/5.0/en/innodb-deadlocks.html

- Original Message -
 From: Matthew Booth mbo...@redhat.com
 To: openstack-dev@lists.openstack.org
 Sent: Thursday, February 5, 2015 10:36:55 AM
 Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody 
 should know about Galera
 
 On 04/02/15 17:05, Sahid Orentino Ferdjaoui wrote:
  * Commit will fail if there is a replication conflict
 
  foo is a table with a single field, which is its primary key.
 
  A: start transaction;
  B: start transaction;
  A: insert into foo values(1);
  B: insert into foo values(1); -- 'regular' DB would block here, and
report an error on A's commit
  A: commit; -- success
  B: commit; -- KABOOM
 
  Confusingly, Galera will report a 'deadlock' to node B, despite this not
  being a deadlock by any definition I'm familiar with.
  
  Yes ! and if I can add more information, and I hope I do not make a
  mistake, I think it's a known issue which comes from MySQL, that is why
  we have a decorator to do a retry and so handle this case here:
  

  http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177
 
 Right, and that remains a significant source of confusion and
 obfuscation in the db api. Our db code is littered with races and
 potential actual deadlocks, but only some functions are decorated. Are
 they decorated because of real deadlocks, or because of Galera lock
 contention? The solutions to those 2 problems are very different! Also,
 hunting deadlocks is hard enough work. Adding the possibility that they
 might not even be there is just evil.
 
 Incidentally, we're currently looking to replace this stuff with some
 new code in oslo.db, which is why I'm looking at it.
 
 Matt
 --
 Matthew Booth
 Red Hat Engineering, Virtualisation Team
 
 Phone: +442070094448 (UK)
 GPG ID:  D33C3490
 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
 


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Matthew Booth
On 05/02/15 11:01, Attila Fazekas wrote:
 I have a question related to deadlock handling as well.
 
 Why is the DBDeadlock exception not caught generally for all api/rpc requests?
 
 The mysql recommendation regarding Deadlocks [1]:
 Normally, you must write your applications so that they are always 
  prepared to re-issue a transaction if it gets rolled back because of a 
 deadlock.

This is evil imho, although it may well be pragmatic. A deadlock (a real
deadlock, that is) occurs because of a preventable bug in code. It
occurs because 2 transactions have attempted to take multiple locks in a
different order. Getting this right is hard, but it is achievable. The
solution to real deadlocks is to fix the bugs.

Galera 'deadlocks' on the other hand are not deadlocks, despite being
reported as such (sounds as though this is due to an implementation
quirk?). They don't involve 2 transactions holding mutual locks, and
there is never any doubt about how to proceed. They involve 2
transactions holding the same lock, and 1 of them committed first. In a
real deadlock they wouldn't get as far as commit. This isn't any kind of
bug: it's normal behaviour in this environment and you just have to
handle it.

 Now the services are just handling the DBDeadlock in several places.
 We have some logstash hits for other places even without galera.

I haven't had much success with logstash. Could you post a query which
would return these? This would be extremely interesting.

 Instead of throwing 503 to the end user, the request could be repeated 
 `silently`.
 
 The users would be able to repeat the request themselves,
 so the automated repeat should not cause unexpected new problems.

Good point: we could argue 'no worse than now', even if it's buggy.

 The retry limit might be configurable; the exception needs to be watched
 before anything is sent to the db on behalf of the transaction or request.
 
 Considering every request handler as a potential deadlock thrower seems
 much easier than deciding case by case.

Well this happens at the transaction level, and we don't quite have a
1:1 request:transaction relationship. We're moving towards it, but
potentially long running requests will always have to use multiple
transactions.

However, I take your point. I think retry on transaction failure is
something which would benefit from standard handling in a library.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Matthew Booth
On 04/02/15 17:05, Sahid Orentino Ferdjaoui wrote:
 * Commit will fail if there is a replication conflict

 foo is a table with a single field, which is its primary key.

 A: start transaction;
 B: start transaction;
 A: insert into foo values(1);
 B: insert into foo values(1); -- 'regular' DB would block here, and
   report an error on A's commit
 A: commit; -- success
 B: commit; -- KABOOM

 Confusingly, Galera will report a 'deadlock' to node B, despite this not
 being a deadlock by any definition I'm familiar with.
 
 Yes ! and if I can add more information, and I hope I do not make a
 mistake, I think it's a known issue which comes from MySQL, that is why
 we have a decorator to do a retry and so handle this case here:
 
   
 http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177

Right, and that remains a significant source of confusion and
obfuscation in the db api. Our db code is littered with races and
potential actual deadlocks, but only some functions are decorated. Are
they decorated because of real deadlocks, or because of Galera lock
contention? The solutions to those 2 problems are very different! Also,
hunting deadlocks is hard enough work. Adding the possibility that they
might not even be there is just evil.

Incidentally, we're currently looking to replace this stuff with some
new code in oslo.db, which is why I'm looking at it.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Angus Lees
On Fri Feb 06 2015 at 12:59:13 PM Gregory Haynes g...@greghaynes.net
wrote:

 Excerpts from Joshua Harlow's message of 2015-02-06 01:26:25 +0000:
  Angus Lees wrote:
   On Fri Feb 06 2015 at 4:25:43 AM Clint Byrum cl...@fewbar.com
   mailto:cl...@fewbar.com wrote:
   I'd also like to see consideration given to systems that handle
   distributed consistency in a more active manner. etcd and
 Zookeeper are
   both such systems, and might serve as efficient guards for critical
   sections without raising latency.
  
  
   +1 for moving to such systems.  Then we can have a repeat of the above
   conversation without the added complications of SQL semantics ;)
  
 
  So just an fyi:
 
  http://docs.openstack.org/developer/tooz/ exists.
 
  Specifically:
 
  http://docs.openstack.org/developer/tooz/developers.
 html#tooz.coordination.CoordinationDriver.get_lock
 
  It has a locking api that it provides (that plugs into the various
  backends); there is also a WIP https://review.openstack.org/#/c/151463/
   driver that is being worked on for etcd.
 

 An interesting note about the etcd implementation is that you can
 select per-request whether you want to wait for quorum on a read or not.
 This means that in theory you could obtain higher throughput for most
 operations which do not require this and then only gain quorum for
 operations which require it (e.g. locks).


Along those lines and in an effort to be a bit less doom-and-gloom, I spent
my lunch break trying to find non-marketing documentation on the Galera
replication protocol and how it is exposed. (It was surprisingly difficult
to find such information *)

It's easy to get the transaction ID of the last commit
(wsrep_last_committed), but I can't find a way to wait until at least a
particular transaction ID has been synced.  If we can find that latter
functionality, then we can expose that sequencer all the way through (HTTP
header?) and then any follow-on commands can mention the sequencer of the
previous write command that they really need to see the effects of.

In practice, this should lead to zero additional wait time, since the
Galera replication has almost certainly already caught up by the time the
second command comes in - and we can just read from the local server with
no additional delay.

See the various *Index variables in the etcd API, for how the same idea
gets used there.
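
A poor man's version of that sequencer can be sketched today by polling
wsrep_last_committed (a hedged sketch; the polling loop is a stand-in
for the native wait primitive discussed above):

    import time

    import pymysql

    def last_committed(conn):
        with conn.cursor() as cur:
            cur.execute("SHOW STATUS LIKE 'wsrep_last_committed'")
            return int(cur.fetchone()[1])

    def wait_for_seqno(conn, seqno, interval=0.05):
        # Spin until the local node has applied up to the given seqno.
        while last_committed(conn) < seqno:
            time.sleep(interval)

    # Writer side: seq = last_committed(write_conn), returned to the
    # client (e.g. in an HTTP header); reader side: wait_for_seqno(
    # read_conn, seq) before issuing the dependent SELECT.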

 - Gus

(*) In case you're also curious, the only doc I found with any details was
http://galeracluster.com/documentation-webpages/certificationbasedreplication.html
and its sibling pages.


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Gregory Haynes
Excerpts from Angus Lees's message of 2015-02-06 02:36:32 +0000:
 On Fri Feb 06 2015 at 12:59:13 PM Gregory Haynes g...@greghaynes.net
 wrote:
 
  Excerpts from Joshua Harlow's message of 2015-02-06 01:26:25 +0000:
   Angus Lees wrote:
On Fri Feb 06 2015 at 4:25:43 AM Clint Byrum cl...@fewbar.com
mailto:cl...@fewbar.com wrote:
I'd also like to see consideration given to systems that handle
distributed consistency in a more active manner. etcd and
  Zookeeper are
both such systems, and might serve as efficient guards for critical
sections without raising latency.
   
   
+1 for moving to such systems.  Then we can have a repeat of the above
conversation without the added complications of SQL semantics ;)
   
  
   So just an fyi:
  
   http://docs.openstack.org/developer/tooz/ exists.
  
   Specifically:
  
   http://docs.openstack.org/developer/tooz/developers.
  html#tooz.coordination.CoordinationDriver.get_lock
  
   It has a locking api that it provides (that plugs into the various
   backends); there is also a WIP https://review.openstack.org/#/c/151463/
    driver that is being worked on for etcd.
  
 
  An interesting note about the etcd implementation is that you can
  select per-request whether you want to wait for quorum on a read or not.
  This means that in theory you could obtain higher throughput for most
  operations which do not require this and then only gain quorum for
  operations which require it (e.g. locks).
 
 
 Along those lines and in an effort to be a bit less doom-and-gloom, I spent
 my lunch break trying to find non-marketing documentation on the Galera
 replication protocol and how it is exposed. (It was surprisingly difficult
 to find such information *)
 
 It's easy to get the transaction ID of the last commit
 (wsrep_last_committed), but I can't find a way to wait until at least a
 particular transaction ID has been synced.  If we can find that latter
 functionality, then we can expose that sequencer all the way through (HTTP
 header?) and then any follow-on commands can mention the sequencer of the
 previous write command that they really need to see the effects of.
 
 In practice, this should lead to zero additional wait time, since the
 Galera replication has almost certainly already caught up by the time the
 second command comes in - and we can just read from the local server with
 no additional delay.
 
 See the various *Index variables in the etcd API, for how the same idea
 gets used there.
 
  - Gus
 
 (*) In case you're also curious, the only doc I found with any details was
 http://galeracluster.com/documentation-webpages/certificationbasedreplication.html
 and its sibling pages.

My fear with something like this is that this is already a very hard
problem to get correct, and this would be adding a fair amount of
complexity client side to achieve it. There is also an issue in that
this would be a galera-specific solution, which means we'll be adding
another dimension to our feature testing matrix if we really wanted to
support it.

IMO we *really* do not want to be in the business of writing distributed
locking systems, but rather should be finding a way to either not
require them or rely on existing solutions.



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Mathieu Gagné

On 2015-02-05 9:36 PM, Angus Lees wrote:

On Fri Feb 06 2015 at 12:59:13 PM Gregory Haynes g...@greghaynes.net
mailto:g...@greghaynes.net wrote:

Along those lines and in an effort to be a bit less doom-and-gloom, I
spent my lunch break trying to find non-marketing documentation on the
Galera replication protocol and how it is exposed. (It was surprisingly
difficult to find such information *)

It's easy to get the transaction ID of the last commit
(wsrep_last_committed), but I can't find a way to wait until at least a
particular transaction ID has been synced.  If we can find that latter
functionality, then we can expose that sequencer all the way through
(HTTP header?) and then any follow-on commands can mention the sequencer
of the previous write command that they really need to see the effects of.

In practice, this should lead to zero additional wait time, since the
Galera replication has almost certainly already caught up by the time
the second command comes in - and we can just read from the local server
with no additional delay.

See the various *Index variables in the etcd API, for how the same idea
gets used there.



I don't use Galera but have managed to understand that you don't need 
all of this complex machinery; it's already built into Galera. Matthew 
Booth already mentioned it in his first post.


The wsrep_sync_wait [1][2][3] variable can be scoped to the session to 
force a synchronous/committed read if you *really* need it, but it will 
result in larger read latencies.


[1] 
http://galeracluster.com/documentation-webpages/mysqlwsrepoptions.html#wsrep-sync-wait
[2] 
http://www.percona.com/doc/percona-xtradb-cluster/5.5/wsrep-system-index.html#wsrep_sync_wait
[3] 
https://mariadb.com/kb/en/mariadb/galera-cluster-system-variables/#wsrep_sync_wait


--
Mathieu



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Joshua Harlow

Hey now you forgot a site in that list ;-)

-Josh

Clint Byrum wrote:

You may want to have a chat with the people running MySQL at
Google, Facebook, and a long tail of not quite as big sites but still
massively bigger than most clouds.




[openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Matthew Booth
I've spent a few hours today reading about Galera, a clustering solution
for MySQL. Galera provides multi-master 'virtually synchronous'
replication between multiple mysql nodes. i.e. I can create a cluster of
3 mysql dbs and read and write from any of them with certain consistency
guarantees.

I am no expert[1], but this is a TL;DR of a couple of things which I
didn't know, but feel I should have done. The semantics are important to
application design, which is why we should all be aware of them.


* Commit will fail if there is a replication conflict

foo is a table with a single field, which is its primary key.

A: start transaction;
B: start transaction;
A: insert into foo values(1);
B: insert into foo values(1); -- 'regular' DB would block here, and
  report an error on A's commit
A: commit; -- success
B: commit; -- KABOOM

Confusingly, Galera will report a 'deadlock' to node B, despite this not
being a deadlock by any definition I'm familiar with.

Essentially, anywhere that a regular DB would block, Galera will not
block transactions on different nodes. Instead, it will cause one of the
transactions to fail on commit. This is still ACID, but the semantics
are quite different.

The impact of this is that code which makes correct use of locking may
still fail with a 'deadlock'. The solution to this is to either fail the
entire operation, or to re-execute the transaction and all its
associated code in the expectation that it won't fail next time.

As I understand it, these can be eliminated by sending all writes to a
single node, although that obviously makes less efficient use of your
cluster.


* Write followed by read on a different node can return stale data

During a commit, Galera replicates a transaction out to all other db
nodes. Due to its design, Galera knows these transactions will be
successfully committed to the remote node eventually[2], but it doesn't
commit them straight away. The remote node will check these outstanding
replication transactions for write conflicts on commit, but not for
read. This means that you can do:

A: start transaction;
A: insert into foo values(1)
A: commit;
B: select * from foo; -- May not contain the value we inserted above[3]

This means that even for 'synchronous' slaves, if a client makes an RPC
call which writes a row to write master A, then another RPC call which
expects to read that row from synchronous slave node B, there's no
default guarantee that it'll be there.

Galera exposes a session variable which will fix this: wsrep_sync_wait
(or wsrep_causal_reads on older mysql). However, this isn't the default.
It presumably has a performance cost, but I don't know what it is, or
how it scales with various workloads.


Because these are semantic issues, they aren't things which can be
easily guarded with an if statement. We can't say:

if galera:
  try:
commit
  except:
rewind time

If we are to support this DB at all, we have to structure code in the
first place to allow for its semantics.

Matt

[1] No, really: I just read a bunch of docs and blogs today. If anybody
who is an expert would like to validate/correct that would be great.

[2]
http://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/

[3]
http://www.percona.com/blog/2013/03/03/investigating-replication-latency-in-percona-xtradb-cluster/
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Sahid Orentino Ferdjaoui
On Wed, Feb 04, 2015 at 04:30:32PM +, Matthew Booth wrote:
 I've spent a few hours today reading about Galera, a clustering solution
 for MySQL. Galera provides multi-master 'virtually synchronous'
 replication between multiple mysql nodes. i.e. I can create a cluster of
 3 mysql dbs and read and write from any of them with certain consistency
 guarantees.
 
 I am no expert[1], but this is a TL;DR of a couple of things which I
 didn't know, but feel I should have done. The semantics are important to
 application design, which is why we should all be aware of them.
 
 
 * Commit will fail if there is a replication conflict
 
 foo is a table with a single field, which is its primary key.
 
 A: start transaction;
 B: start transaction;
 A: insert into foo values(1);
 B: insert into foo values(1); -- 'regular' DB would block here, and
   report an error on A's commit
 A: commit; -- success
 B: commit; -- KABOOM
 
 Confusingly, Galera will report a 'deadlock' to node B, despite this not
 being a deadlock by any definition I'm familiar with.

Yes ! and if I can add more information, and I hope I do not make a
mistake, I think it's a known issue which comes from MySQL, that is why
we have a decorator to do a retry and so handle this case here:

  
http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Mike Bayer


Matthew Booth mbo...@redhat.com wrote:

 This means that even for 'synchronous' slaves, if a client makes an RPC
 call which writes a row to write master A, then another RPC call which
 expects to read that row from synchronous slave node B, there's no
 default guarantee that it'll be there.


Can I get some kind of clue as to how common this use case is? This is 
where we get into things like how nova.objects works and stuff, which is not my 
domain. We are putting a huge amount of thought into handling this use case, 
but I'm not versed in where / how it exactly happens 
and how widespread it is.





Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Jay Pipes

On 02/04/2015 12:05 PM, Sahid Orentino Ferdjaoui wrote:

On Wed, Feb 04, 2015 at 04:30:32PM +0000, Matthew Booth wrote:

I've spent a few hours today reading about Galera, a clustering solution
for MySQL. Galera provides multi-master 'virtually synchronous'
replication between multiple mysql nodes. i.e. I can create a cluster of
3 mysql dbs and read and write from any of them with certain consistency
guarantees.

I am no expert[1], but this is a TL;DR of a couple of things which I
didn't know, but feel I should have done. The semantics are important to
application design, which is why we should all be aware of them.


* Commit will fail if there is a replication conflict

foo is a table with a single field, which is its primary key.

A: start transaction;
B: start transaction;
A: insert into foo values(1);
B: insert into foo values(1); -- 'regular' DB would block here, and
   report an error on A's commit
A: commit; -- success
B: commit; -- KABOOM

Confusingly, Galera will report a 'deadlock' to node B, despite this not
being a deadlock by any definition I'm familiar with.


It is a failure to certify the writeset, which bubbles up as an InnoDB 
deadlock error. See my article here, which explains this:

http://www.joinfu.com/2015/01/understanding-reservations-concurrency-locking-in-nova/
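
In code, the failure mode looks something like this (a sketch only; the
engine URLs are illustrative, and the certification failure surfaces
through SQLAlchemy as an OperationalError wrapping MySQL error 1213):

    import sqlalchemy as sa
    from sqlalchemy import exc

    # Two connections deliberately pointed at two different cluster nodes.
    node_a = sa.create_engine("mysql+pymysql://test:test@node-a/test")
    node_b = sa.create_engine("mysql+pymysql://test:test@node-b/test")

    conn_a, conn_b = node_a.connect(), node_b.connect()
    txn_a, txn_b = conn_a.begin(), conn_b.begin()

    conn_a.execute(sa.text("INSERT INTO foo VALUES (1)"))
    conn_b.execute(sa.text("INSERT INTO foo VALUES (1)"))  # no blocking here

    txn_a.commit()  # wins certification
    try:
        txn_b.commit()  # loses certification: reported as a 'deadlock'
    except exc.OperationalError:
        pass  # retry the whole transaction, or fail the operation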


Yes! And if I can add more information (I hope I am not making a
mistake), I think it's a known issue which comes from MySQL; that is why
we have a decorator to do a retry and so handle this case here:

   
http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177


It's not an issue with MySQL. It's an issue with any database code that 
is highly contentious.


Almost all highly distributed or concurrent applications need to handle 
deadlock issues, and the most common way to handle deadlock issues on 
database records is using a retry technique. There's nothing new about 
that with Galera.


The issue with our use of the @_retry_on_deadlock decorator is *not* 
that the retry decorator is not needed, but rather it is used too 
frequently. The compare-and-swap technique I describe in the article 
above dramatically* reduces the number of deadlocks that occur (and need 
to be handled by the @_retry_on_deadlock decorator) and dramatically 
reduces the contention over critical database sections.
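
For reference, the compare-and-swap pattern from the article reduces to a
guarded UPDATE. A hedged sketch (table and column names are illustrative,
not nova's real schema; each statement is assumed to run in its own
autocommit transaction):

    import sqlalchemy as sa

    def consume_vcpus(conn, node_id, want):
        while True:
            row = conn.execute(sa.text(
                "SELECT vcpus_used FROM compute_nodes WHERE id = :id"),
                {"id": node_id}).fetchone()
            # Guarded write: only succeeds if nobody changed the row since
            # we read it. No SELECT ... FOR UPDATE, no held row locks.
            result = conn.execute(sa.text(
                "UPDATE compute_nodes SET vcpus_used = :new "
                "WHERE id = :id AND vcpus_used = :old"),
                {"new": row.vcpus_used + want, "id": node_id,
                 "old": row.vcpus_used})
            if result.rowcount:
                return  # our view was current; the swap succeeded
            # Lost the race: re-read and try again.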


Best,
-jay

* My colleague Pavel Kholkin is putting together the results of a 
benchmark run that compares the compare-and-swap method with the raw 
@_retry_on_deadlock decorator method. Spoiler: the compare-and-swap 
method cuts the runtime of the benchmark by almost *half*.




Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Mike Bayer


Matthew Booth mbo...@redhat.com wrote:

 A: start transaction;
 B: start transaction;
 A: insert into foo values(1);
 B: insert into foo values(1); -- 'regular' DB would block here, and
  report an error on A's commit
 A: commit; -- success
 B: commit; -- KABOOM
 
 Confusingly, Galera will report a 'deadlock' to node B, despite this not
 being a deadlock by any definition I'm familiar with.

So, one of the entire points of the enginefacade work is that we will ensure 
that writes will continue to be made to exactly one node in the cluster.  
Openstack does not have the problem defined above, because we only communicate 
with one node, even today.  The work that we are trying to proceed with is to 
at least have *reads* make full use of the cluster.

The above phenomenon is not a problem for openstack today except for the 
reduced efficiency, which enginefacade will partially solve.

 
 As I understand it, these can be eliminated by sending all writes to a
 single node, although that obviously makes less efficient use of your
 cluster.

This is what we do right now and it continues to be the plan going forward.   
Having single-master is in fact the traditional form of clustering.  In the 
Openstack case, this issue isn’t as bad as it seems, because openstack runs 
many different applications against the same database simultaneously.  
Different applications should refer to different nodes in the cluster as their 
“master”.   There’s no conflict here because each app talks only to its own 
tables.
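
Conceptually (hostnames purely illustrative), that partitioning is just
per-service connection URLs:

    nova.conf:     [database] connection = mysql://nova@galera-node-1/nova
    neutron.conf:  [database] connection = mysql://neutron@galera-node-2/neutron
    cinder.conf:   [database] connection = mysql://cinder@galera-node-3/cinder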

 During a commit, Galera replicates a transaction out to all other db
 nodes. Due to its design, Galera knows these transactions will be
 successfully committed to the remote node eventually[2], but it doesn't
 commit them straight away. The remote node will check these outstanding
 replication transactions for write conflicts on commit, but not for
 read. This means that you can do:
 
 A: start transaction;
 A: insert into foo values(1)
 A: commit;
 B: select * from foo; -- May not contain the value we inserted above[3]

Will need to get more detail on this. This would mean that Galera is not in 
fact synchronous.


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Clint Byrum
Excerpts from Matthew Booth's message of 2015-02-04 08:30:32 -0800:
 * Write followed by read on a different node can return stale data
 
 During a commit, Galera replicates a transaction out to all other db
 nodes. Due to its design, Galera knows these transactions will be
 successfully committed to the remote node eventually[2], but it doesn't
 commit them straight away. The remote node will check these outstanding
 replication transactions for write conflicts on commit, but not for
 read. This means that you can do:
 
 A: start transaction;
 A: insert into foo values(1)
 A: commit;
 B: select * from foo; -- May not contain the value we inserted above[3]
 
 This means that even for 'synchronous' slaves, if a client makes an RPC
 call which writes a row to write master A, then another RPC call which
 expects to read that row from synchronous slave node B, there's no
 default guarantee that it'll be there.
 
 Galera exposes a session variable which will fix this: wsrep_sync_wait
 (or wsrep_causal_reads on older mysql). However, this isn't the default.
 It presumably has a performance cost, but I don't know what it is, or
 how it scales with various workloads.
 

wsrep_sync_wait/wsrep_causal_reads doesn't actually hit the cluster
any harder; it simply tells the local Galera node: if you're not caught
up with the highest known sync point, don't answer queries yet. So it
will slow down that particular query as it waits for an update from the
leader about the sync point and, if necessary, waits for the local engine
to catch up to that point. However, it isn't going to push that query
off to all the other boxes or anything like that.
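
A toy model of that local wait (pure illustration of the idea, not wsrep
internals; all names are made up):

    import threading

    class ToyNode:
        """Stand-in for one Galera node's replication applier state."""

        def __init__(self):
            self.applied_seqno = 0
            self._cv = threading.Condition()

        def apply(self, seqno):
            # The applier thread advances the local sync point.
            with self._cv:
                self.applied_seqno = seqno
                self._cv.notify_all()

        def read(self, query, cluster_seqno, sync_wait=False):
            # With sync_wait on, block until this node has applied
            # everything the cluster has certified; the query itself
            # never leaves this node.
            if sync_wait:
                with self._cv:
                    self._cv.wait_for(
                        lambda: self.applied_seqno >= cluster_seqno)
            return query()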



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Joshua Harlow

How interesting,

Why are people using galera if it behaves like this? :-/

Are the people that are using it aware that this happens? :-/

Scary

Mike Bayer wrote:


Matthew Booth mbo...@redhat.com wrote:


A: start transaction;
A: insert into foo values(1)
A: commit;
B: select * from foo; -- May not contain the value we inserted above[3]


I’ve confirmed in my own testing that this is accurate. The
wsrep_causal_reads flag does resolve this, and it is settable on a
per-session basis.  The attached script, adapted from the script
given in the blog post, illustrates this.




Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Clint Byrum
Excerpts from Joshua Harlow's message of 2015-02-04 13:24:20 -0800:
 How interesting,
 
 Why are people using galera if it behaves like this? :-/
 

Note that any true MVCC database will roll back transactions on
conflicts. One must always have a deadlock detection algorithm of
some kind.

Galera behaves like this because it is enormously costly to be synchronous
at all times for everything. So it is synchronous when you want it to be,
and async when you don't.

Note that it's likely NDB (aka MySQL Cluster) would work fairly well
for OpenStack's workloads, and does not suffer from this. However, it
requires low-latency, high-bandwidth links between all nodes (InfiniBand
recommended) or it will just plain suck. So Galera is a cheaper option
that is easier to tune and reason about.

 Are the people that are using it aware that this happens? :-/
 

I think the problem really is that it is somewhat de facto, and used
without being tested. The gate doesn't set up a three-node Galera db and
test that OpenStack works right. Also it is inherently a race condition,
and thus will be a hard one to test.

That's where having knowledge of it and taking time to engineer a
solution that makes sense is really the best course I can think of.



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Robert Collins
On 5 February 2015 at 10:24, Joshua Harlow harlo...@outlook.com wrote:
 How interesting,

 Why are people using galera if it behaves like this? :-/

Because it's actually fairly normal. In fact it's an instance of point 7
on https://wiki.openstack.org/wiki/BasicDesignTenets - one of our
oldest wiki pages :).

In more detail, consider what happens in full isolation when you have
the A and B example given, but B starts its transaction before A.

B BEGIN
A BEGIN
A INSERT foo
A COMMIT
B SELECT foo -> NULL

- data inserted by a transaction with a higher transaction id isn't
visible to the older transaction (in an MVCC-style engine - there are
other engines, but this is common).

When you add clustering in, many cluster DBs are not synchronous:
postgresql replication is asynchronous (both log shipping and slony),
and neither is Galera. So reads will see older data than has been
committed to the cluster. Writes will conflict *if* the write was
dependent on data that was changed.

If rather than clustering you add multiple DB's, you get the same sort
of thing unless you explicitly wire in 2PC and a distributed lock
manager and oh my... and we have multiple DB's (cinder, nova etc) but
no such coordination between them.

Now, if we say that we can't accept eventual consistency, that we have
to have atomic visibility of changes, then we've a *lot* of work,
because of the multiple DB's thing.

However, eventual consistency can cause confusion if it's not applied
well, and it may be that this layer is the wrong layer to apply it at
- that's certainly a possibility.

 Are the people that are using it aware that this happens? :-/

I hope so :)

-Rob


-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Mike Bayer


Matthew Booth mbo...@redhat.com wrote:

 A: start transaction;
 A: insert into foo values(1)
 A: commit;
 B: select * from foo; -- May not contain the value we inserted above[3]

I’ve confirmed in my own testing that this is accurate. The
wsrep_causal_reads flag does resolve this, and it is settable on a
per-session basis.  The attached script, adapted from the script 
given in the blog post, illustrates this.


 
 Galera exposes a session variable which will fix this: wsrep_sync_wait
 (or wsrep_causal_reads on older mysql). However, this isn't the default.
 It presumably has a performance cost, but I don't know what it is, or
 how it scales with various workloads.

Well, consider our application is doing some @writer, then later it does
some @reader. @reader has the contract that reads must be synchronous with
any writes. Easy enough, @reader ensures that the connection it uses issues
“set wsrep_causal_reads=1”. The attached test case confirms this is feasible
on a per-session (that is, a connection attached to the database) basis, so 
that the setting will not impact the cluster as a whole, and we can 
forego using it on those @async_reader calls where we don’t need it.

 Because these are semantic issues, they aren't things which can be
 easily guarded with an if statement. We can't say:
 
 if galera:
  try:
commit
  except:
rewind time
 
 If we are to support this DB at all, we have to structure code in the
 first place to allow for its semantics.

I think the above example is referring to the “deadlock” issue, which we have
already solved with the “only write to one master” strategy.

But overall, as you’re aware, we will no longer have the words “begin” or
“commit” in our code. This takes place all within enginefacade. With this
pattern, we will permanently end the need for any kind of repeated special
patterns or boilerplate which occurs per-transaction on a
backend-configurable basis. The enginefacade is where any such special
patterns can take place, and for extended patterns such as setting up
wsrep_causal_reads on @reader nodes or similar, we can implement a
rudimentary plugin system for it such that we can have a “galera” backend to
set up what’s needed.
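
For instance, such a backend hook could be as small as a pool event that
flips the flag on connections handed to @reader blocks. A sketch only; the
engine name and URL are illustrative, and this is not enginefacade's
actual API:

    import sqlalchemy as sa
    from sqlalchemy import event

    # Hypothetical engine used by @reader blocks.
    reader_engine = sa.create_engine(
        "mysql+pymysql://nova:secret@galera-lb/nova")

    @event.listens_for(reader_engine, "checkout")
    def _set_causal_reads(dbapi_conn, conn_record, conn_proxy):
        # Runs each time the pool hands out a connection: enable causal
        # reads for this session only; other sessions are unaffected.
        cursor = dbapi_conn.cursor()
        cursor.execute("SET SESSION wsrep_causal_reads = 1")
        cursor.close()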

The attached script does essentially what the one associated with
http://www.percona.com/blog/2013/03/03/investigating-replication-latency-in-percona-xtradb-cluster/
does. It’s valid because without wsrep_causal_reads turned on for the
connection, I get plenty of reads that lag behind the writes, so I’ve
confirmed this is easily reproducible, and that with causal_reads turned on,
it vanishes. The script demonstrates that a single application can set up
“wsrep_causal_reads” on a per-session basis (remember, by “session” we mean
“a mysql session”), where it takes effect for that connection alone, not
affecting the performance of other concurrent connections even in the same
application. With the flag turned on, the script never reads a stale row.
The script alternates randomly between the causal-reads connection and the
non-causal-reads connection. I’m running it against a
cluster of two virtual nodes on a laptop, so performance is very slow, but
some sample output:

2015-02-04 15:49:27,131 100 runs
2015-02-04 15:49:27,754 w/ non-causal reads, got row 763 val is 9499, retries 0
2015-02-04 15:49:27,760 w/ non-causal reads, got row 763 val is 9499, retries 1
2015-02-04 15:49:27,764 w/ non-causal reads, got row 763 val is 9499, retries 2
2015-02-04 15:49:27,772 w/ non-causal reads, got row 763 val is 9499, retries 3
2015-02-04 15:49:27,777 w/ non-causal reads, got row 763 val is 9499, retries 4
2015-02-04 15:49:30,985 200 runs
2015-02-04 15:49:37,579 300 runs
2015-02-04 15:49:42,396 400 runs
2015-02-04 15:49:48,240 w/ non-causal reads, got row 6544 val is 6766, retries 0
2015-02-04 15:49:48,255 w/ non-causal reads, got row 6544 val is 6766, retries 1
2015-02-04 15:49:48,276 w/ non-causal reads, got row 6544 val is 6766, retries 2
2015-02-04 15:49:49,336 500 runs
2015-02-04 15:49:56,433 600 runs
2015-02-04 15:50:05,801 700 runs
2015-02-04 15:50:08,802 w/ non-causal reads, got row 533 val is 834, retries 0
2015-02-04 15:50:10,849 800 runs
2015-02-04 15:50:14,834 900 runs
2015-02-04 15:50:15,445 w/ non-causal reads, got row 124 val is 3850, retries 0
2015-02-04 15:50:15,448 w/ non-causal reads, got row 124 val is 3850, retries 1
2015-02-04 15:50:18,515 1000 runs
2015-02-04 15:50:22,130 1100 runs
2015-02-04 15:50:26,301 1200 runs
2015-02-04 15:50:28,898 w/ non-causal reads, got row 1493 val is 8358, retries 0
2015-02-04 15:50:29,988 1300 runs
2015-02-04 15:50:33,736 1400 runs
2015-02-04 15:50:34,219 w/ non-causal reads, got row 9661 val is 2877, retries 0
2015-02-04 15:50:38,796 1500 runs
2015-02-04 15:50:42,844 1600 runs
2015-02-04 15:50:46,838 1700 runs
2015-02-04 15:50:51,049 1800 runs
2015-02-04 15:50:55,139 1900 runs
2015-02-04 15:50:59,632 2000 runs
2015-02-04 15:51:04,721 2100 runs
2015-02-04 15:51:10,670 2200 runs
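
(The attachment itself is not reproduced here; a rough reconstruction of
the script's shape, illustrative only, assuming pymysql, two cluster nodes
and a pre-created test(id INT PRIMARY KEY, data INT) table:)

    import random

    import pymysql

    # Hosts and credentials are illustrative.
    writer = pymysql.connect(host="node-a", user="test", password="test",
                             database="test", autocommit=True)
    causal = pymysql.connect(host="node-b", user="test", password="test",
                             database="test", autocommit=True)
    plain = pymysql.connect(host="node-b", user="test", password="test",
                            database="test", autocommit=True)

    with causal.cursor() as cur:
        # Causal reads on one session only; the other keeps the default.
        cur.execute("SET SESSION wsrep_causal_reads = 1")

    for run in range(1, 2201):
        row_id = random.randint(1, 10000)
        val = random.randint(1, 10000)
        with writer.cursor() as cur:
            cur.execute("REPLACE INTO test (id, data) VALUES (%s, %s)",
                        (row_id, val))

        reader = random.choice([causal, plain])  # alternate randomly
        retries = 0
        while True:
            with reader.cursor() as cur:
                cur.execute("SELECT data FROM test WHERE id = %s", (row_id,))
                fetched = cur.fetchone()
            if fetched and fetched[0] == val:
                break  # caught up; only the non-causal session ever lags
            print("w/ non-causal reads, got row %s val is %s, retries %s"
                  % (row_id, fetched[0] if fetched else None, retries))
            retries += 1
        if run % 100 == 0:
            print("%s runs" % run)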

Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Jay Pipes



On 02/04/2015 07:59 PM, Angus Lees wrote:

On Thu Feb 05 2015 at 9:02:49 AM Robert Collins
robe...@robertcollins.net wrote:

On 5 February 2015 at 10:24, Joshua Harlow harlo...@outlook.com wrote:
  How interesting,
 
  Why are people using galera if it behaves like this? :-/

Because it's actually fairly normal. In fact it's an instance of point 7
on https://wiki.openstack.org/wiki/BasicDesignTenets - one of our
oldest wiki pages :).

In more detail, consider what happens in full isolation when you have
the A and B example given, but B starts its transaction before A.

B BEGIN
A BEGIN
A INSERT foo
A COMMIT
B SELECT foo -> NULL


Note that this still makes sense from each of A and B's individual view
of the world.

If I understood correctly, the big change with Galera that Matthew is
highlighting is that read-after-write may not be consistent from the pov
of a single thread.


No, this is not correct. There is nothing different about Galera here 
versus any asynchronously replicated database. A single thread, issuing 
statements in two entirely *separate sessions*, load-balanced across an 
entire set of database cluster nodes, may indeed see older data if the 
second session gets balanced to a slave node.


Nothing has changed about this with Galera. The exact same patterns that 
you would use to ensure that you are able to read the data that you 
previously wrote can be used with Galera. Just have the thread start a 
transactional session and ensure all queries are executed in the context 
of that session. Done. Nothing about Galera changes anything here.
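
In SQLAlchemy terms, something like (a sketch; the URL is illustrative):

    import sqlalchemy as sa

    engine = sa.create_engine("mysql+pymysql://nova:secret@galera-lb/nova")

    with engine.connect() as conn:
        with conn.begin():
            conn.execute(sa.text("INSERT INTO foo VALUES (1)"))
            # Same connection, same transaction: we always see our own
            # write, whatever the rest of the cluster has applied so far.
            rows = conn.execute(sa.text("SELECT * FROM foo")).fetchall()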



Not having read-after-write is *really* hard to code to (see for example
x86 SMP cache coherency, C++ threading semantics, etc which all provide
read-after-write for this reason).  This is particularly true when the
affected operations are hidden behind an ORM - it isn't clear what might
involve a database call and sequencers (or logical clocks, etc) aren't
made explicit in the API.

I strongly suggest just enabling wsrep_causal_reads on all galera
sessions, unless you can guarantee that the high-level task is purely
read-only, and then moving on to something else ;)  If we choose
performance over correctness here then we're just signing up for lots of
debugging of hard to reproduce race conditions, and the fixes are going
to look like what wsrep_causal_reads does anyway.

(Mind you, exposing sequencers at every API interaction would be
awesome, and I look forward to a future framework and toolchain that
makes that easy to do correctly)


IMHO, you all are reading WAY too much into this. The behaviour that 
Matthew is describing is the kind of thing that has been around for 
decades now with asynchronous slave replication. Applications have 
traditionally handled it by sending reads that can tolerate slave lag to 
a slave machine, and reads that cannot to the same machine that was 
written to.


Galera doesn't change anything here. I'm really not sure what the fuss 
is about, frankly.


I don't recommend mucking with wsrep_causal_reads if we don't have to. 
And, IMO, we don't have to muck with it at all.


Best,
-jay



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Avishay Traeger
On Wed, Feb 4, 2015 at 11:00 PM, Robert Collins robe...@robertcollins.net
wrote:

 On 5 February 2015 at 10:24, Joshua Harlow harlo...@outlook.com wrote:
  How interesting,
 
  Why are people using galera if it behaves like this? :-/

 Because it's actually fairly normal. In fact it's an instance of point 7
 on https://wiki.openstack.org/wiki/BasicDesignTenets - one of our
 oldest wiki pages :).


When I hear MySQL I don't exactly think of eventual consistency (#7),
scalability (#1), horizontal scalability (#4), etc.
For the past few months I have been advocating implementing an alternative
to db/sqlalchemy, but of course it's a huge undertaking.  NoSQL (or even
distributed key-value stores) should be considered IMO.  Just some food for
thought :)


-- 
*Avishay Traeger*
*Storage R&D*

Mobile: +972 54 447 1475
E-mail: avis...@stratoscale.com



Web http://www.stratoscale.com/ | Blog http://www.stratoscale.com/blog/
 | Twitter https://twitter.com/Stratoscale | Google+
https://plus.google.com/u/1/b/108421603458396133912/108421603458396133912/posts
 | Linkedin https://www.linkedin.com/company/stratoscale


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Angus Lees
On Thu Feb 05 2015 at 9:02:49 AM Robert Collins robe...@robertcollins.net
wrote:

 On 5 February 2015 at 10:24, Joshua Harlow harlo...@outlook.com wrote:
  How interesting,
 
  Why are people using galera if it behaves like this? :-/

 Because it's actually fairly normal. In fact it's an instance of point 7
 on https://wiki.openstack.org/wiki/BasicDesignTenets - one of our
 oldest wiki pages :).

 In more detail, consider what happens in full isolation when you have
 the A and B example given, but B starts its transaction before A.

 B BEGIN
 A BEGIN
 A INSERT foo
 A COMMIT
 B SELECT foo -> NULL


Note that this still makes sense from each of A and B's individual view of
the world.

If I understood correctly, the big change with Galera that Matthew is
highlighting is that read-after-write may not be consistent from the pov of
a single thread.

Not having read-after-write is *really* hard to code to (see for example x86
SMP cache coherency, C++ threading semantics, etc which all provide
read-after-write for this reason).  This is particularly true when the
affected operations are hidden behind an ORM - it isn't clear what might
involve a database call and sequencers (or logical clocks, etc) aren't made
explicit in the API.

I strongly suggest just enabling wsrep_causal_reads on all galera sessions,
unless you can guarantee that the high-level task is purely read-only, and
then moving on to something else ;)  If we choose performance over
correctness here then we're just signing up for lots of debugging of hard
to reproduce race conditions, and the fixes are going to look like what
wsrep_causal_reads does anyway.

(Mind you, exposing sequencers at every API interaction would be awesome,
and I look forward to a future framework and toolchain that makes that easy
to do correctly)

 - Gus


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Mike Bayer


Jay Pipes jaypi...@gmail.com wrote:

 No, this is not correct. There is nothing different about Galera here versus 
 any asynchronously replicated database. A single thread, issuing statements 
 in two entirely *separate sessions*, load-balanced across an entire set of 
 database cluster nodes, may indeed see older data if the second session gets 
 balanced to a slave node.

That’s what we’re actually talking about.   We’re talking about “reader” 
methods that aren’t enclosed in a “writer” potentially being pointed at the 
cluster as a whole.

 
 Nothing has changed about this with Galera. The exact same patterns that you 
 would use to ensure that you are able to read the data that you previously 
 wrote can be used with Galera. Just have the thread start a transactional 
 session and ensure all queries are executed in the context of that session. 
 Done. Nothing about Galera changes anything here.

Right but, what I’m trying to get a handle on is, how often do we make a series 
of RPC calls at an openstack service, where each one (because they are separate 
calls) runs in its own transaction, and then how many of those “read-only” RPC 
calls (which we’d therefore like to point at the cluster as a whole) are 
dependent on a “writer” RPC call that happened immediately before?

 
 IMHO, you all are reading WAY too much into this. The behaviour that Matthew 
 is describing is the kind of thing that has been around for decades now with 
 asynchronous slave replication. Applications have traditionally handled it by 
 sending reads that can tolerate slave lag to a slave machine, and reads that 
 cannot to the same machine that was written to.

Can we identify methods in Openstack, and particularly Nova, that are reads 
that can tolerate slave lag?  Or is the thing architected such that “no, pretty 
much 95% of reader calls, we have no idea if they occur right after a write 
that they are definitely dependent on”?  Matthew found a small handful in 
one little corner of Nova, some kind of background thread thing, which makes 
use of the “use_slave” flag.  But the rest of it, nope.  
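
For context, the use_slave pattern rides on oslo.db's split between a
writer URL and an optional reader URL; a configuration sketch (hostnames
illustrative):

    [database]
    # All writes, and reads that must see them, go to a single node.
    connection = mysql://nova:secret@galera-writer/nova
    # Reads flagged use_slave may go to a load-balanced, possibly
    # lagging endpoint.
    slave_connection = mysql://nova:secret@galera-lb/nova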


 Galera doesn't change anything here. I'm really not sure what the fuss is 
 about, frankly.

because we’re trying to get Galera to actually work as a load-balanced cluster 
to some degree, at least for reads.

Otherwise I’m not really sure why we have to bother with Galera at all.  If we 
just want a single MySQL server that has a warm standby for failover, why 
aren’t we just using that capability straight from MySQL?  Then we get “SELECT 
FOR UPDATE” and everything else back.  Galera’s “multi master” capability is 
already in the trash for us, and it seems like “multi-slave” is only marginally 
useful too; the vast majority of openstack has to be 100% pointed at just 
one node to work correctly.

I’m coming here with the disadvantage that I don’t have a clear picture of the 
actual use patterns we really need.  The picture I have right now is of a 
Nova / Neutron etc. that receive dozens/hundreds of tiny RPC calls, each of 
which does some small thing in its own transaction, yet most are dependent on 
each other as they are all part of a single larger operation, and the 
whole thing runs too slowly.  But this is the fuzziest picture ever.


