Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
- Original Message -
From: Attila Fazekas afaze...@redhat.com
To: Jay Pipes jaypi...@gmail.com
Cc: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org, Pavel Kholkin pkhol...@mirantis.com
Sent: Thursday, February 12, 2015 11:52:39 AM
Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

- Original Message -
From: Jay Pipes jaypi...@gmail.com
To: Attila Fazekas afaze...@redhat.com
Cc: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org, Pavel Kholkin pkhol...@mirantis.com
Sent: Wednesday, February 11, 2015 9:52:55 PM
Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

On 02/11/2015 06:34 AM, Attila Fazekas wrote:

- Original Message -
From: Jay Pipes jaypi...@gmail.com
To: Attila Fazekas afaze...@redhat.com
Cc: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org, Pavel Kholkin pkhol...@mirantis.com
Sent: Tuesday, February 10, 2015 7:32:11 PM
Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

On 02/10/2015 06:28 AM, Attila Fazekas wrote:

- Original Message -
From: Jay Pipes jaypi...@gmail.com
To: Attila Fazekas afaze...@redhat.com, OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org
Cc: Pavel Kholkin pkhol...@mirantis.com
Sent: Monday, February 9, 2015 7:15:10 PM
Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

On 02/09/2015 01:02 PM, Attila Fazekas wrote:

I do not see why not to use `FOR UPDATE` even with multi-writer, or how the retry/swap way really solves anything here.

snip

Have I missed something?

Yes. Galera does not replicate the (internal to InnoDB) row-level locks that are needed to support SELECT FOR UPDATE statements across multiple cluster nodes.
Galera does not replicate the row-level locks created by UPDATE/INSERT either... So what to do with the UPDATE?

No, Galera replicates the write sets (binary log segments) for UPDATE/INSERT/DELETE statements -- the things that actually change/add/remove records in DB tables. No locks are replicated, ever.

Galera does not do any replication at UPDATE/INSERT/DELETE time.

$ mysql
use test;
CREATE TABLE test (id integer PRIMARY KEY AUTO_INCREMENT, data CHAR(64));

$ (echo 'use test; BEGIN;'; while true; do echo "INSERT INTO test(data) VALUES ('test');"; done) | mysql

Writer1 is busy, but the other nodes have not noticed anything about the above pending transaction; for them this transaction does not exist as long as you do not call a COMMIT. Any kind of DML/DQL you issue without a COMMIT has not happened from the other nodes' perspective. Replication happens at COMMIT time, if the write set is not empty.

We're going in circles here. I was just pointing out that SELECT ... FOR UPDATE will never replicate anything. INSERT/UPDATE/DELETE statements will cause a write set to be replicated (yes, upon COMMIT of the containing transaction). Please see my repeated statements in this thread and others that the compare-and-swap technique is dependent on issuing *separate* transactions for each SELECT and UPDATE statement...

When a transaction wins the voting, the other nodes roll back all transactions that held a conflicting local row lock.

A SELECT statement in a separate transaction does not ever trigger a ROLLBACK, nor will an UPDATE statement that does not match any rows. That is IMO how increased throughput is achieved in the compare-and-swap technique versus the SELECT FOR UPDATE technique.

Yes, I mentioned this way in one bug [0]. But the related changes on review actually work as I said [1][2][3], and the SELECT is not in a separate, dedicated transaction.
[0] https://bugs.launchpad.net/neutron/+bug/1410854 [sorry I sent a wrong link before]
[1] https://review.openstack.org/#/c/143837/
[2] https://review.openstack.org/#/c/153558/
[3] https://review.openstack.org/#/c/149261/

-jay

__ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 02/11/2015 07:58 AM, Matthew Booth wrote:

On 10/02/15 18:29, Jay Pipes wrote:

On 02/10/2015 09:47 AM, Matthew Booth wrote:

On 09/02/15 18:15, Jay Pipes wrote:

On 02/09/2015 01:02 PM, Attila Fazekas wrote:

I do not see why not to use `FOR UPDATE` even with multi-writer, or how the retry/swap way really solves anything here.

snip

Have I missed something?

Yes. Galera does not replicate the (internal to InnoDB) row-level locks that are needed to support SELECT FOR UPDATE statements across multiple cluster nodes.

https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ

Is that the right link, Jay? I'm taking your word on the write-intent locks not being replicated, but that link seems to say the opposite.

This link is better: http://www.percona.com/blog/2014/09/11/openstack-users-shed-light-on-percona-xtradb-cluster-deadlock-issues/

Specifically the line: "The local record lock held by the started transaction on pxc1 didn't play any part in replication or certification (replication happens at commit time, there was no commit there yet)."

Thanks, Jay, that's a great article. Based on that, I think I may have misunderstood what you were saying before. I currently understand that the behaviour of SELECT ... FOR UPDATE is correct on Galera, it's just not very efficient. Correct in this case meaning it aborts the transaction due to a correctly detected lock conflict. FWIW, that was pretty much my original understanding, but without the detail.

To expand: Galera doesn't replicate write-intent locks, but it turns out it doesn't have to for correctness. The reason is that the conflict between a local write-intent lock and a remote write, which is replicated, will always be detected during or before local certification.

Exactly correct.

Best, -jay
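The first-committer-wins behaviour the exchange above converges on can be illustrated with a toy model. To be clear, this is not Galera's actual algorithm: `Node`, `Txn`, and the row-key sets below are invented for the sketch. It only captures the rule under discussion -- a replicated write set aborts any open local transaction holding a conflicting lock, while locks themselves are never replicated and an empty write set replicates nothing.

```python
# Toy model of certification-based conflict detection.
# Not Galera's real implementation -- an illustration of
# "first committer wins; locks are never replicated".

class Txn:
    def __init__(self, node):
        self.node = node
        self.locked = set()    # rows locked locally (SELECT ... FOR UPDATE)
        self.writes = set()    # rows modified (the eventual write set)
        self.aborted = False

class Node:
    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster
        self.open_txns = []
        cluster.append(self)

    def begin(self):
        txn = Txn(self)
        self.open_txns.append(txn)
        return txn

    def commit(self, txn):
        if txn.aborted:
            raise RuntimeError("deadlock")   # surfaces like InnoDB error 1213
        self.open_txns.remove(txn)
        if txn.writes:                       # empty write set -> nothing replicated
            for node in self.cluster:
                node.certify(txn.writes)

    def certify(self, write_set):
        # A replicated write set conflicts with any *local* open
        # transaction holding locks or pending writes on the same rows.
        for txn in self.open_txns:
            if write_set & (txn.locked | txn.writes):
                txn.aborted = True           # local transaction loses

cluster = []
n1, n2 = Node("writer1", cluster), Node("writer2", cluster)

t1 = n1.begin()
t2 = n2.begin()
t2.locked.add(("books", 3))   # SELECT ... FOR UPDATE on writer2
t1.writes.add(("books", 3))   # UPDATE of the same row on writer1
n1.commit(t1)                 # write set replicates and certifies everywhere
try:
    t2.writes.add(("books", 3))
    n2.commit(t2)             # writer2's transaction was already doomed
except RuntimeError as e:
    print("writer2:", e)      # writer2: deadlock
```

Note that if t2 had only issued plain SELECTs (no lock, no writes), nothing in this model would ever abort it -- which is the throughput argument for the compare-and-swap approach.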
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
- Original Message -
From: Jay Pipes jaypi...@gmail.com
To: Attila Fazekas afaze...@redhat.com
Cc: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org, Pavel Kholkin pkhol...@mirantis.com
Sent: Tuesday, February 10, 2015 7:32:11 PM
Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

On 02/10/2015 06:28 AM, Attila Fazekas wrote:

- Original Message -
From: Jay Pipes jaypi...@gmail.com
To: Attila Fazekas afaze...@redhat.com, OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org
Cc: Pavel Kholkin pkhol...@mirantis.com
Sent: Monday, February 9, 2015 7:15:10 PM
Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

On 02/09/2015 01:02 PM, Attila Fazekas wrote:

I do not see why not to use `FOR UPDATE` even with multi-writer, or how the retry/swap way really solves anything here.

snip

Have I missed something?

Yes. Galera does not replicate the (internal to InnoDB) row-level locks that are needed to support SELECT FOR UPDATE statements across multiple cluster nodes.

Galera does not replicate the row-level locks created by UPDATE/INSERT either... So what to do with the UPDATE?

No, Galera replicates the write sets (binary log segments) for UPDATE/INSERT/DELETE statements -- the things that actually change/add/remove records in DB tables. No locks are replicated, ever.

Galera does not do any replication at UPDATE/INSERT/DELETE time.

$ mysql
use test;
CREATE TABLE test (id integer PRIMARY KEY AUTO_INCREMENT, data CHAR(64));

$ (echo 'use test; BEGIN;'; while true; do echo "INSERT INTO test(data) VALUES ('test');"; done) | mysql

Writer1 is busy, but the other nodes have not noticed anything about the above pending transaction; for them this transaction does not exist as long as you do not call a COMMIT. Any kind of DML/DQL you issue without a COMMIT has not happened from the other nodes' perspective.
Replication happens at COMMIT time, if the write set is not empty. When a transaction wins the voting, the other nodes roll back all transactions that held a conflicting local row lock.

Why should I handle the FOR UPDATE differently?

Because SELECT FOR UPDATE doesn't change any rows, and therefore does not trigger any replication event in Galera.

What matters is whether the full transaction has changed any row by COMMIT time or not. The DML statements themselves do not start replication, just as `SELECT FOR UPDATE` does not.

See here: http://www.percona.com/blog/2014/09/11/openstack-users-shed-light-on-percona-xtradb-cluster-deadlock-issues/

-jay

https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ

Best, -jay
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
- Original Message -
From: Jay Pipes jaypi...@gmail.com
To: openstack-dev@lists.openstack.org
Sent: Monday, February 9, 2015 9:36:45 PM
Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

On 02/09/2015 03:10 PM, Clint Byrum wrote:

Excerpts from Jay Pipes's message of 2015-02-09 10:15:10 -0800:

On 02/09/2015 01:02 PM, Attila Fazekas wrote:

I do not see why not to use `FOR UPDATE` even with multi-writer, or how the retry/swap way really solves anything here.

snip

Have I missed something?

Yes. Galera does not replicate the (internal to InnoDB) row-level locks that are needed to support SELECT FOR UPDATE statements across multiple cluster nodes.

https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ

Attila acknowledged that. What Attila was saying was that by using it with Galera, the box that is doing the FOR UPDATE locks will simply fail upon commit, because a conflicting commit has already happened and arrived from the node that accepted the write. Further, what Attila is saying is that this means there is not such an obvious advantage to the CAS method, since the rollback and the "# updated rows == 0" are effectively equivalent at this point, seeing as the prior commit has already arrived and thus will not need to wait to fail certification and be rolled back.

No, that is not correct. In the case of the CAS technique, the frequency of rollbacks due to certification failure is demonstrably less than when using SELECT FOR UPDATE and relying on the certification timeout error to signal a deadlock.
I am not entirely certain that is true, though, as I think what will happen in sequential order is:

writer1: UPDATE books SET genre = 'Scifi' WHERE genre = 'sciencefiction';
writer1: -- send in-progress update to cluster
writer2: SELECT FOR UPDATE books WHERE id=3;
writer1: COMMIT
writer1: -- try to certify commit in cluster

** Here is where I stop knowing for sure what happens **

writer2: certifies writer1's transaction or blocks?

It will certify writer1's transaction. It will only block another thread hitting writer2 requesting write locks or write-intent read locks on the same records.

writer2: UPDATE books SET genre = 'sciencefiction' WHERE id=3;
writer2: COMMIT -- One of them is rolled back.

The other transaction can be rolled back before you issue an actual COMMIT:

writer1: BEGIN
writer2: BEGIN
writer1: update test set val=42 where id=1;
writer2: update test set val=42 where id=1;
writer1: COMMIT
writer2: show variables;
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

As you can see, the second transaction failed without issuing a COMMIT, after the first one committed. You could write anything to MySQL on writer2 at this point; even invalid statements return `Deadlock`.

So, at the point where I'm not sure (please, some Galera expert, tell me): if what happens is as I suggest, writer1's transaction is certified, then that just means the lock sticks around blocking stuff on writer2, but the data is updated and it is certain that writer2's commit will be rolled back. However, if it blocks waiting on the lock to resolve, then I'm at a loss to determine which transaction would be rolled back, but I am thinking that it makes sense that the transaction from writer2 would be rolled back, because its commit is later.

That is correct. writer2's transaction would be rolled back. The difference is that the CAS method would NOT trigger a ROLLBACK.
It would instead return 0 rows affected, because the UPDATE statement would instead look like this:

UPDATE books SET genre = 'sciencefiction' WHERE id = 3 AND genre = 'SciFi';

And the return of 0 rows affected would trigger a simple retry of the read and then update attempt on writer2, instead of dealing with ROLLBACK semantics on the transaction.

Note that in the CAS method, the SELECT statement and the UPDATE are in completely different transactions. This is a very important thing to keep in mind.

All this to say that usually the reason for SELECT FOR UPDATE is not to only do an update (the transactional semantics handle that), but also to prevent the old row from being seen again, which, as Jay says, it cannot do. So I believe you are both correct:

* Attila, yes I think you're right that CAS is not any more efficient at replacing SELECT FOR UPDATE from a blocking standpoint.

It is more efficient because there are far fewer ROLLBACKs of transactions occurring in the system. If you look at a slow query log (with a 0 slow query time) for a MySQL Galera server in a multi-write cluster during a run of Tempest or Rally, you will notice that the number of ROLLBACK statements is extraordinary. AFAICR, when Peter Boros and I benchmarked a Rally launch and delete 10K VM run, we saw nearly 11% of *total* queries executed
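The read-then-guarded-UPDATE retry described above can be sketched as follows. This is an illustrative sketch, not the nova implementation: sqlite3 stands in for MySQL purely so the example is runnable, and `set_genre` is an invented helper. The essential points from the thread are preserved -- the SELECT and the guarded UPDATE run in separate transactions, and a row count of 0 triggers a re-read and retry rather than a ROLLBACK.

```python
import sqlite3

# Compare-and-swap update: read in one transaction, then issue an
# UPDATE guarded by the previously read value in a *separate*
# transaction. 0 rows affected means a concurrent writer won the
# race -- re-read and retry; no ROLLBACK is involved.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, genre TEXT)")
conn.execute("INSERT INTO books (id, genre) VALUES (3, 'SciFi')")
conn.commit()

def set_genre(conn, book_id, new_genre, max_retries=5):
    for _ in range(max_retries):
        # Transaction 1: read the current value.
        cur = conn.execute("SELECT genre FROM books WHERE id = ?", (book_id,))
        (seen_genre,) = cur.fetchone()
        conn.commit()

        # Transaction 2: update only if nobody changed the row since our read.
        cur = conn.execute(
            "UPDATE books SET genre = ? WHERE id = ? AND genre = ?",
            (new_genre, book_id, seen_genre),
        )
        conn.commit()
        if cur.rowcount == 1:    # the swap succeeded
            return seen_genre
        # rowcount == 0: lost the race; loop and re-read.
    raise RuntimeError("still conflicting after %d attempts" % max_retries)

old = set_genre(conn, 3, "sciencefiction")
print(old)    # SciFi
```

On a Galera cluster the same guarded UPDATE either certifies (it changed the row) or matches zero rows (someone else already changed it); neither outcome requires relying on the deadlock error path.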
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
- Original Message -
From: Jay Pipes jaypi...@gmail.com
To: openstack-dev@lists.openstack.org, Pavel Kholkin pkhol...@mirantis.com
Sent: Wednesday, February 4, 2015 8:04:10 PM
Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

On 02/04/2015 12:05 PM, Sahid Orentino Ferdjaoui wrote:

On Wed, Feb 04, 2015 at 04:30:32PM +, Matthew Booth wrote:

I've spent a few hours today reading about Galera, a clustering solution for MySQL. Galera provides multi-master 'virtually synchronous' replication between multiple mysql nodes, i.e. I can create a cluster of 3 mysql dbs and read and write from any of them with certain consistency guarantees.

I am no expert[1], but this is a TL;DR of a couple of things which I didn't know, but feel I should have done. The semantics are important to application design, which is why we should all be aware of them.

* Commit will fail if there is a replication conflict

foo is a table with a single field, which is its primary key.

A: start transaction;
B: start transaction;
A: insert into foo values(1);
B: insert into foo values(1); -- a 'regular' DB would block here, and report an error on A's commit
A: commit; -- success
B: commit; -- KABOOM

Confusingly, Galera will report a 'deadlock' to node B, despite this not being a deadlock by any definition I'm familiar with.

It is a failure to certify the writeset, which bubbles up as an InnoDB deadlock error. See my article here: http://www.joinfu.com/2015/01/understanding-reservations-concurrency-locking-in-nova/ which explains this.

I do not see why not to use `FOR UPDATE` even with multi-writer, or how the retry/swap way really solves anything here.

Using 'FOR UPDATE' with the 'repeatable read' isolation level seems more efficient and has several advantages:

* The SELECT with 'FOR UPDATE' will read the committed version, so you do not really need to worry about when the transaction actually started.
You will get fresh data before reaching the actual UPDATE.

* In the article, the example query will not return the new version of the data in the same transaction even if you are retrying, so you need to restart the transaction anyway. When you are using the 'FOR UPDATE' way, if any other transaction successfully commits a conflicting row on any other Galera writer, your pending transaction will be rolled back at your next statement, WITHOUT spending any time certifying that transaction. From this perspective, checking the affected-row count after the UPDATE (`compare and swap`) versus handling an exception does not make any difference.

* Using FOR UPDATE in a Galera transaction (multi-writer) is no more evil than using UPDATE; a concurrent commit invalidates both of them in the same way (DBDeadlock).

* The 'FOR UPDATE', if you are using just a single writer, does not let other threads do useless work while wasting resources.

* The swap way can also be rolled back by Galera almost anywhere (DBDeadlock).

In the end, the swap way looks like it just replaced the exception handling with a return-code check + manual transaction restart.

Have I missed something?

Yes! And if I can add more information (I hope I do not make a mistake): I think it's a known issue which comes from MySQL; that is why we have a decorator to do a retry and so handle this case here: http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177

It's not an issue with MySQL. It's an issue with any database code that is highly contentious. Almost all highly distributed or concurrent applications need to handle deadlock issues, and the most common way to handle deadlock issues on database records is using a retry technique. There's nothing new about that with Galera. The issue with our use of the @_retry_on_deadlock decorator is *not* that the retry decorator is not needed, but rather that it is used too frequently.
The compare-and-swap technique I describe in the article above dramatically* reduces the number of deadlocks that occur (and need to be handled by the @_retry_on_deadlock decorator) and dramatically reduces the contention over critical database sections. Best, -jay * My colleague Pavel Kholkin is putting together the results of a benchmark run that compares the compare-and-swap method with the raw @_retry_on_deadlock decorator method. Spoiler: the compare-and-swap method cuts the runtime of the benchmark by almost *half*. Essentially, anywhere that a regular DB would block, Galera will not block transactions on different nodes. Instead, it will cause one of the transactions to fail on commit. This is still ACID, but the semantics are quite different. The impact of this is that code which makes correct use of locking may still fail with a 'deadlock'. The solution to this is to either fail
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 02/09/2015 01:02 PM, Attila Fazekas wrote: I do not see why not to use `FOR UPDATE` even with multi-writer, or whether the retry/swap way really solves anything here. snip Am I missing something? Yes. Galera does not replicate the (internal to InnoDB) row-level locks that are needed to support SELECT FOR UPDATE statements across multiple cluster nodes. https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ Best, -jay __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 02/09/2015 03:10 PM, Clint Byrum wrote: Excerpts from Jay Pipes's message of 2015-02-09 10:15:10 -0800: On 02/09/2015 01:02 PM, Attila Fazekas wrote: I do not see why not to use `FOR UPDATE` even with multi-writer, or whether the retry/swap way really solves anything here. snip Am I missing something? Yes. Galera does not replicate the (internal to InnoDB) row-level locks that are needed to support SELECT FOR UPDATE statements across multiple cluster nodes. https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ Attila acknowledged that. What Attila was saying was that by using it with Galera, the box that is doing the FOR UPDATE locks will simply fail upon commit because a conflicting commit has already happened and arrived from the node that accepted the write. Further, what Attila is saying is that this means there is not such an obvious advantage to the CAS method, since the rollback and the # updated rows == 0 are effectively equivalent at this point, seeing as the prior commit has already arrived and thus will not need to wait to fail certification and be rolled back. No, that is not correct. In the case of the CAS technique, the frequency of rollbacks due to certification failure is demonstrably less than when using SELECT FOR UPDATE and relying on the certification timeout error to signal a deadlock. I am not entirely certain that is true though, as I think what will happen in sequential order is: writer1: UPDATE books SET genre = 'Scifi' WHERE genre = 'sciencefiction'; writer1: -- send in-progress update to cluster writer2: SELECT FOR UPDATE books WHERE id=3; writer1: COMMIT writer1: -- try to certify commit in cluster ** Here is where I stop knowing for sure what happens ** writer2: certifies writer1's transaction or blocks? It will certify writer1's transaction. It will only block another thread hitting writer2 requesting write locks or write-intent read locks on the same records.
writer2: UPDATE books SET genre = 'sciencefiction' WHERE id=3; writer2: COMMIT -- One of them is rolled back. So, at that point where I'm not sure (please some Galera expert tell me): If what happens is as I suggest, writer1's transaction is certified, then that just means the lock sticks around blocking stuff on writer2, but that the data is updated and it is certain that writer2's commit will be rolled back. However, if it blocks waiting on the lock to resolve, then I'm at a loss to determine which transaction would be rolled back, but I am thinking that it makes sense that the transaction from writer2 would be rolled back, because the commit is later. That is correct. writer2's transaction would be rolled back. The difference is that the CAS method would NOT trigger a ROLLBACK. It would instead return 0 rows affected, because the UPDATE statement would instead look like this: UPDATE books SET genre = 'sciencefiction' WHERE id = 3 AND genre = 'SciFi'; And the return of 0 rows affected would trigger a simple retry of the read and then update attempt on writer2 instead of dealing with ROLLBACK semantics on the transaction. Note that in the CAS method, the SELECT statement and the UPDATE are in completely different transactions. This is a very important thing to keep in mind. All this to say that usually the reason for SELECT FOR UPDATE is not to only do an update (the transactional semantics handle that), but also to prevent the old row from being seen again, which, as Jay says, it cannot do. So I believe you are both correct: * Attila, yes I think you're right that CAS is not any more efficient at replacing SELECT FOR UPDATE from a blocking standpoint. It is more efficient because there are far fewer ROLLBACKs of transactions occurring in the system. If you look at a slow query log (with a 0 slow query time) for a MySQL Galera server in a multi-write cluster during a run of Tempest or Rally, you will notice that the number of ROLLBACK statements is extraordinary. 
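The compare-and-swap flow Jay describes can be sketched as follows. This is illustrative only (it uses SQLite for brevity, not Galera); the table and column names follow the example in this thread:

```python
import sqlite3


def cas_update(conn, book_id, expected_genre, new_genre):
    """UPDATE only if the row still holds the value we read earlier.
    Each attempt is its own short transaction; 0 rows affected means
    a concurrent writer won the race -- no ROLLBACK is involved."""
    cur = conn.execute(
        "UPDATE books SET genre = ? WHERE id = ? AND genre = ?",
        (new_genre, book_id, expected_genre))
    conn.commit()
    return cur.rowcount


def set_genre(conn, book_id, new_genre, max_attempts=10):
    # The read and the update are separate transactions: re-read the
    # current value and retry the swap until it sticks.
    for _ in range(max_attempts):
        row = conn.execute(
            "SELECT genre FROM books WHERE id = ?", (book_id,)).fetchone()
        if row and cas_update(conn, book_id, row[0], new_genre):
            return True
    return False
```

The key property is the one Jay calls out: a stale read produces "0 rows affected" and a cheap re-read, rather than a rolled-back transaction.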
AFAICR, when Peter Boros and I benchmarked a Rally launch and delete 10K VM run, we saw nearly 11% of *total* queries executed against the server were ROLLBACKs. This, in my opinion, is the main reason that the CAS method will show as more efficient. * Jay, yes I think you're right that SELECT FOR UPDATE is not the right thing to use to do such reads, because one is relying on locks that are meaningless on a Galera cluster. Where I think the CAS ends up being the preferred method for this sort of thing is where one considers that it won't hold a meaningless lock while the transaction is completed and then rolled back. CAS is preferred because it is measurably faster and more obstruction-free than SELECT FOR UPDATE. A colleague of mine is almost ready to publish documentation showing a benchmark of this that shows nearly a 100% decrease in total amount of lock/wait time using CAS versus waiting for the coarser-level certification timeout to retry the transactions. As
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
Excerpts from Jay Pipes's message of 2015-02-09 12:36:45 -0800: CAS is preferred because it is measurably faster and more obstruction-free than SELECT FOR UPDATE. A colleague of mine is almost ready to publish documentation showing a benchmark of this that shows nearly a 100% decrease in total amount of lock/wait time using CAS versus waiting for the coarser-level certification timeout to retry the transactions. As mentioned above, I believe this is due to the dramatic decrease in ROLLBACKs. I think the missing piece of the puzzle for me was that each ROLLBACK is an expensive operation. I figured it was like a non-local return (i.e. 'raise' in python or 'throw' in java) and thus not measurably different. But now that I think of it, there is likely quite a bit of optimization around the query path, and not so much around the rollback path. The bottom of this rabbit hole is simply exquisite, isn't it? :)
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
Excerpts from Jay Pipes's message of 2015-02-09 10:15:10 -0800: On 02/09/2015 01:02 PM, Attila Fazekas wrote: I do not see why not to use `FOR UPDATE` even with multi-writer, or whether the retry/swap way really solves anything here. snip Am I missing something? Yes. Galera does not replicate the (internal to InnoDB) row-level locks that are needed to support SELECT FOR UPDATE statements across multiple cluster nodes. https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ Attila acknowledged that. What Attila was saying was that by using it with Galera, the box that is doing the FOR UPDATE locks will simply fail upon commit because a conflicting commit has already happened and arrived from the node that accepted the write. Further, what Attila is saying is that this means there is not such an obvious advantage to the CAS method, since the rollback and the # updated rows == 0 are effectively equivalent at this point, seeing as the prior commit has already arrived and thus will not need to wait to fail certification and be rolled back. I am not entirely certain that is true though, as I think what will happen in sequential order is: writer1: UPDATE books SET genre = 'Scifi' WHERE genre = 'sciencefiction'; writer1: -- send in-progress update to cluster writer2: SELECT FOR UPDATE books WHERE id=3; writer1: COMMIT writer1: -- try to certify commit in cluster ** Here is where I stop knowing for sure what happens ** writer2: certifies writer1's transaction or blocks? writer2: UPDATE books SET genre = 'sciencefiction' WHERE id=3; writer2: COMMIT -- One of them is rolled back. So, at that point where I'm not sure (please some Galera expert tell me): If what happens is as I suggest, writer1's transaction is certified, then that just means the lock sticks around blocking stuff on writer2, but that the data is updated and it is certain that writer2's commit will be rolled back.
However, if it blocks waiting on the lock to resolve, then I'm at a loss to determine which transaction would be rolled back, but I am thinking that it makes sense that the transaction from writer2 would be rolled back, because the commit is later. All this to say that usually the reason for SELECT FOR UPDATE is not to only do an update (the transactional semantics handle that), but also to prevent the old row from being seen again, which, as Jay says, it cannot do. So I believe you are both correct: * Attila, yes I think you're right that CAS is not any more efficient at replacing SELECT FOR UPDATE from a blocking standpoint. * Jay, yes I think you're right that SELECT FOR UPDATE is not the right thing to use to do such reads, because one is relying on locks that are meaningless on a Galera cluster. Where I think the CAS ends up being the preferred method for this sort of thing is where one considers that it won't hold a meaningless lock while the transaction is completed and then rolled back.
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 02/09/2015 05:02 PM, Clint Byrum wrote: Excerpts from Jay Pipes's message of 2015-02-09 12:36:45 -0800: CAS is preferred because it is measurably faster and more obstruction-free than SELECT FOR UPDATE. A colleague of mine is almost ready to publish documentation showing a benchmark of this that shows nearly a 100% decrease in total amount of lock/wait time using CAS versus waiting for the coarser-level certification timeout to retry the transactions. As mentioned above, I believe this is due to the dramatic decrease in ROLLBACKs. I think the missing piece of the puzzle for me was that each ROLLBACK is an expensive operation. I figured it was like a non-local return (i.e. 'raise' in python or 'throw' in java) and thus not measurably different. But now that I think of it, there is likely quite a bit of optimization around the query path, and not so much around the rollback path. The bottom of this rabbit hole is simply exquisite, isn't it? :) It is indeed. :) As soon as I think I understand it fully, a new problem area exposes itself. Best, -jay
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
Hi Angus, If causal reads is set in a session, it won't delay all reads, just the specific read that you set it for. Let's say you have 4 sessions; in one of them you set causal reads, and the other 3 won't wait on anything. The read in the one session that you set this in will be delayed; in the other 3, it won't be. Also this delay is usually small. Since the replication itself is synchronous, if a node is not able to keep up with the rest of the cluster in terms of writes, it will send flow control messages to the other nodes. Flow control means that its receive queue is full, and the other nodes have to wait until they can do more writes (in the case of flow control, writes on the other nodes are blocked until the given node catches up with writes). So the delay imposed here can't be arbitrarily large. On Sat, Feb 7, 2015 at 3:00 AM, Angus Lees g...@inodes.org wrote: Thanks for the additional details Peter. This confirms the parts I'd deduced from the docs I could find, and is useful knowledge. On Sat Feb 07 2015 at 2:24:23 AM Peter Boros peter.bo...@percona.com wrote: - Like many others said before me, consistent reads can be achieved with wsrep_causal_reads set on in the session. So the example was two dependent command-line invocations (write followed by read) that have no way to re-use the same DB session (without introducing lots of affinity issues that we'd also like to avoid). Enabling wsrep_causal_reads makes sure the latter read sees the effects of the earlier write, but comes at the cost of delaying all reads by some amount depending on the write-load of the galera cluster (if I understand correctly). This additional delay was raised as a concern severe enough not to just go down this path. Really we don't care about other writes that may have occurred (we always need to deal with races against other actors), we just want to ensure our earlier write has taken effect on the galera server where we sent the second read request.
If we had some way to say wsrep_delay_until $first_txid then we could be sure of read-after-write from a different DB session and also (in the vast majority of cases) suffer no additional delay. An opaque sequencer is a generic concept across many of the distributed consensus stores I'm familiar with, so this needn't be exposed as a Galera-only quirk. Meh, I gather people are bored with the topic at this point. As I suggested much earlier, I'd just enable wsrep_causal_reads on the first request for the session and then move on to some other problem ;) - Gus -- Peter Boros, Principal Architect, Percona Telephone: +1 888 401 3401 ext 546 Emergency: +1 888 401 3401 ext 911 Skype: percona.pboros
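The per-read behaviour Peter describes can be wrapped up in application code like this. A hedged sketch against a plain DB-API connection: `wsrep_causal_reads` is the real Galera session variable, but the helper itself is illustrative, not from any OpenStack project:

```python
def causal_read(conn, query, params=()):
    """Run one read with Galera causal reads enabled, so this session
    waits until the node has applied everything committed cluster-wide
    before the read began. Other sessions are unaffected."""
    cur = conn.cursor()
    cur.execute("SET SESSION wsrep_causal_reads = ON")
    try:
        cur.execute(query, params)
        return cur.fetchall()
    finally:
        # Turn it back off so later reads on this session pay no delay.
        cur.execute("SET SESSION wsrep_causal_reads = OFF")
```

This matches Peter's point: only the session (and only the statements) that opt in pay the replication-queue wait.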
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
Thanks for the additional details Peter. This confirms the parts I'd deduced from the docs I could find, and is useful knowledge. On Sat Feb 07 2015 at 2:24:23 AM Peter Boros peter.bo...@percona.com wrote: - Like many others said before me, consistent reads can be achieved with wsrep_causal_reads set on in the session. So the example was two dependent command-line invocations (write followed by read) that have no way to re-use the same DB session (without introducing lots of affinity issues that we'd also like to avoid). Enabling wsrep_causal_reads makes sure the latter read sees the effects of the earlier write, but comes at the cost of delaying all reads by some amount depending on the write-load of the galera cluster (if I understand correctly). This additional delay was raised as a concern severe enough not to just go down this path. Really we don't care about other writes that may have occurred (we always need to deal with races against other actors), we just want to ensure our earlier write has taken effect on the galera server where we sent the second read request. If we had some way to say wsrep_delay_until $first_txid then we could be sure of read-after-write from a different DB session and also (in the vast majority of cases) suffer no additional delay. An opaque sequencer is a generic concept across many of the distributed consensus stores I'm familiar with, so this needn't be exposed as a Galera-only quirk. Meh, I gather people are bored with the topic at this point. As I suggested much earlier, I'd just enable wsrep_causal_reads on the first request for the session and then move on to some other problem ;) - Gus
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
Hi Angus and everyone, I would like to reply to a couple of things: - The behavior of overlapping transactions is dependent on the transaction isolation level, even in the case of a single server, for any database. This was pointed out by others earlier as well. - The deadlock error from Galera can be confusing, but the point is that the application can actually treat this as a deadlock (or apply any kind of retry logic which it would apply to a failed transaction). I don't know if it would be any less confusing from the developer's point of view if it said 'brute force error' instead. Transactions can fail in a database; in the initial example the transaction will fail with a duplicate key error. The result is pretty much the same from the application's perspective: the transaction was not successful (it failed as a block), and the application should handle the failure. There can be many more reasons for a transaction to fail regardless of the database engine; some of these failures are persistent (for example the disk is full underneath the database), and some of these are intermittent in nature like the case above. A good retry mechanism can be good for handling the intermittent failures, depending on the application logic. - Like many others said before me, consistent reads can be achieved with wsrep_causal_reads set on in the session. I can shed some light on how this works. Nodes in galera participate in a group communication. A global order of the transactions is established as part of this. Since the global order of the transactions is known, a session with wsrep_causal_reads on will put a marker in the local replication queue. Because transaction ordering is global, the session will simply be blocked until all the transactions before that marker in the replication queue are processed.
So, setting wsrep_causal_reads imposes additional latency only for the given select we are using it on (it literally just waits for the queue to be processed up to the current transaction). Because of this, manual checking of the global transaction ids is not necessary. - On synchronous replication: galera only transmits the data synchronously, it doesn't do synchronous apply. A transaction is sent in parallel to the rest of the cluster nodes (to be accurate, it's only sent to the nodes that are in the same group segment, but it waits until all the group segments get the data). Once the other nodes have received it, the transaction commits locally, and the others will apply it later. The cluster can do this because of certification and because certification is deterministic (the result of the certification will be the same on all nodes; otherwise, the nodes have a different state, for example one of them was written locally). The replication uses write sets, which are essentially row-based MySQL binary log events plus some metadata. The metadata is good for two things: you can take a look at two write sets and tell if they are conflicting or not, and you can decide if a write set is applicable to a database. Because this is checked at certification time, the apply part can be parallel (because of the certification, it's guaranteed that the transactions are not conflicting). When it comes to consistency and replication speed, there are no wonders, there are tradeoffs to make. Two phase commit is relatively slow, distributed locking is relatively slow; this is a lot faster, but the application should handle transaction failures (which it should probably handle anyway).
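As a toy model of the deterministic certification described above (purely illustrative; real Galera certification compares write-set key sets against a window of concurrently committed seqnos, not Python sets):

```python
def conflicts(ws_a, ws_b):
    """Two write sets conflict when they touch at least one common row.
    Rows are modelled here as (table, primary_key) pairs."""
    return bool(ws_a & ws_b)


def certify(new_ws, concurrent_committed):
    """Accept a write set only if it conflicts with no write set that
    committed concurrently. Every node sees the same global commit
    order, so every node reaches the same accept/reject decision
    without exchanging any further messages."""
    return all(not conflicts(new_ws, ws) for ws in concurrent_committed)
```

Determinism is the whole trick: since each node runs the same check against the same globally ordered input, commit/rollback never requires a round-trip vote the way two-phase commit does.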
Here is the xtradb cluster documentation (Percona Server with galera): http://www.percona.com/doc/percona-xtradb-cluster/5.6/#user-s-manual Here is the multi-master replication part of the documentation: http://www.percona.com/doc/percona-xtradb-cluster/5.6/features/multimaster-replication.html On Fri, Feb 6, 2015 at 3:36 AM, Angus Lees g...@inodes.org wrote: On Fri Feb 06 2015 at 12:59:13 PM Gregory Haynes g...@greghaynes.net wrote: Excerpts from Joshua Harlow's message of 2015-02-06 01:26:25 +0000: Angus Lees wrote: On Fri Feb 06 2015 at 4:25:43 AM Clint Byrum cl...@fewbar.com mailto:cl...@fewbar.com wrote: I'd also like to see consideration given to systems that handle distributed consistency in a more active manner. etcd and Zookeeper are both such systems, and might serve as efficient guards for critical sections without raising latency. +1 for moving to such systems. Then we can have a repeat of the above conversation without the added complications of SQL semantics ;) So just an fyi: http://docs.openstack.org/developer/tooz/ exists. Specifically: http://docs.openstack.org/developer/tooz/developers.html#tooz.coordination.CoordinationDriver.get_lock It has a locking api that it provides (that plugs into the various backends); there is also a WIP https://review.openstack.org/#/c/151463/ driver that is being worked on for etcd. An interesting note
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 5 February 2015 at 23:07, Clint Byrum cl...@fewbar.com wrote: Excerpts from Avishay Traeger's message of 2015-02-04 22:19:53 -0800: On Wed, Feb 4, 2015 at 11:00 PM, Robert Collins robe...@robertcollins.net wrote: On 5 February 2015 at 10:24, Joshua Harlow harlo...@outlook.com wrote: How interesting. Why are people using galera if it behaves like this? :-/ Because it's actually fairly normal. In fact it's an instance of point 7 on https://wiki.openstack.org/wiki/BasicDesignTenets - one of our oldest wiki pages :). When I hear MySQL I don't exactly think of eventual consistency (#7), scalability (#1), horizontal scalability (#4), etc. For the past few months I have been advocating implementing an alternative to db/sqlalchemy, but of course it's a huge undertaking. NoSQL (or even distributed key-value stores) should be considered IMO. Just some food for thought :) I know it is popular to think that MySQL* == old, slow, and low-scale, but that is only popular with those who have not actually tried to scale MySQL. You may want to have a chat with the people running MySQL at Google, Facebook, and a long tail of not-quite-as-big sites that are still massively bigger than most clouds. Note that many of the people who helped those companies scale up are involved directly with OpenStack. Just an aside: Youtube relies completely on MySQL for all of its database traffic, but uses a layer on top of it called Vitess [1] to allow it to scale. [1]: https://github.com/youtube/vitess The NoSQL bits that are popular out there make the easy part easy. There is no magic bullet for the hard part, which is when you need to do both synchronous and asynchronous. Factor in its maturity and the breadth of talent available, and I'll choose MySQL for this task every time. * Please let's also give a nod to our friends working on MariaDB, a MySQL-compatible fork that many find preferable and, for the purposes of this discussion, equivalent.
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
- Original Message - From: Matthew Booth mbo...@redhat.com To: openstack-dev@lists.openstack.org Sent: Thursday, February 5, 2015 12:32:33 PM Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera On 05/02/15 11:01, Attila Fazekas wrote: I have a question related to deadlock handling as well. Why is the DBDeadlock exception not caught generally for all api/rpc requests? The MySQL recommendation regarding deadlocks [1]: Normally, you must write your applications so that they are always prepared to re-issue a transaction if it gets rolled back because of a deadlock. This is evil imho, although it may well be pragmatic. A deadlock (a real deadlock, that is) occurs because of a preventable bug in code. It occurs because 2 transactions have attempted to take multiple locks in a different order. Getting this right is hard, but it is achievable. The solution to real deadlocks is to fix the bugs. Galera 'deadlocks' on the other hand are not deadlocks, despite being reported as such (sounds as though this is due to an implementation quirk?). They don't involve 2 transactions holding mutual locks, and there is never any doubt about how to proceed. They involve 2 transactions holding the same lock, and 1 of them committed first. In a real deadlock they wouldn't get as far as commit. This isn't any kind of bug: it's normal behaviour in this environment and you just have to handle it. Right now the services just handle DBDeadlock in several places. We have some logstash hits for other places even without galera. I haven't had much success with logstash. Could you post a query which would return these? This would be extremely interesting. Just use this: message: DBDeadlock If you would like to exclude the lock wait timeout ones: message: Deadlock found when trying to get lock Instead of throwing 503 to the end user, the request could be repeated `silently`.
The user would be able to repeat the request himself, so an automated repeat should not cause unexpected new problems. Good point: we could argue 'no worse than now', even if it's buggy. The retry limit might be configurable; the exception needs to be caught before anything is sent to the db on behalf of the transaction or request. Considering every request handler as a potential deadlock thrower seems much easier than deciding case by case. Well, this happens at the transaction level, and we don't quite have a 1:1 request:transaction relationship. We're moving towards it, but potentially long running requests will always have to use multiple transactions. However, I take your point. I think retry on transaction failure is something which would benefit from standard handling in a library. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 04/02/15 19:04, Jay Pipes wrote: On 02/04/2015 12:05 PM, Sahid Orentino Ferdjaoui wrote: On Wed, Feb 04, 2015 at 04:30:32PM +0000, Matthew Booth wrote: I've spent a few hours today reading about Galera, a clustering solution for MySQL. Galera provides multi-master 'virtually synchronous' replication between multiple mysql nodes, i.e. I can create a cluster of 3 mysql dbs and read and write from any of them with certain consistency guarantees. I am no expert[1], but this is a TL;DR of a couple of things which I didn't know, but feel I should have done. The semantics are important to application design, which is why we should all be aware of them. * Commit will fail if there is a replication conflict foo is a table with a single field, which is its primary key. A: start transaction; B: start transaction; A: insert into foo values(1); B: insert into foo values(1); -- 'regular' DB would block here, and report an error on A's commit A: commit; -- success B: commit; -- KABOOM Confusingly, Galera will report a 'deadlock' to node B, despite this not being a deadlock by any definition I'm familiar with. It is a failure to certify the writeset, which bubbles up as an InnoDB deadlock error. See my article here: http://www.joinfu.com/2015/01/understanding-reservations-concurrency-locking-in-nova/ which explains this. Yes! And if I can add more information, and I hope I do not make a mistake: I think it's a known issue which comes from MySQL, which is why we have a decorator to do a retry and so handle this case here: http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177 It's not an issue with MySQL. It's an issue with any database code that is highly contentious. Almost all highly distributed or concurrent applications need to handle deadlock issues, and the most common way to handle deadlock issues on database records is using a retry technique. There's nothing new about that with Galera.
The issue with our use of the @_retry_on_deadlock decorator is *not* that the retry decorator is not needed, but rather that it is used too frequently. The compare-and-swap technique I describe in the article above dramatically* reduces the number of deadlocks that occur (and need to be handled by the @_retry_on_deadlock decorator) and dramatically reduces the contention over critical database sections. I'm still confused as to how this code got there, though. We shouldn't be hitting Galera lock contention (reported as deadlocks) if we're using a single master, which I thought we were. Does this mean either: A. There are deployments using multi-master? B. These are really deadlocks? If A, is this something we need to continue to support? Thanks, Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 05/02/15 11:11, Sahid Orentino Ferdjaoui wrote: I'm still confused as to how this code got there, though. We shouldn't be hitting Galera lock contention (reported as deadlocks) if we're using a single master, which I thought we were. Does this mean either: I guess we can hit lock contention even with a single master. I don't think so, but you can certainly still have real deadlocks. They're bugs, though. Matt -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
Excerpts from Angus Lees's message of 2015-02-04 16:59:31 -0800: On Thu Feb 05 2015 at 9:02:49 AM Robert Collins robe...@robertcollins.net wrote: On 5 February 2015 at 10:24, Joshua Harlow harlo...@outlook.com wrote: How interesting, why are people using Galera if it behaves like this? :-/

Because it's actually fairly normal. In fact it's an instance of point 7 on https://wiki.openstack.org/wiki/BasicDesignTenets - one of our oldest wiki pages :). In more detail, consider what happens in full isolation when you have the A and B example given, but B starts its transaction before A.

B: BEGIN
A: BEGIN
A: INSERT foo
A: COMMIT
B: SELECT foo -> NULL

Note that this still makes sense from each of A and B's individual view of the world.

If I understood correctly, the big change with Galera that Matthew is highlighting is that read-after-write may not be consistent from the pov of a single thread.

No, that's not a complete picture. What Matthew is highlighting is that after a commit, a new transaction may not see the write if it is done on a separate node in the cluster. In a single thread, using a single database session, a read after a successful commit is guaranteed to read a version of the database that existed after that commit. What it may not be consistent with is subsequent writes which may have happened after the commit on other servers, unless you use the sync wait.

Not having read-after-write is *really* hard to code to (see for example x86 SMP cache coherency, C++ threading semantics, etc., which all provide read-after-write for this reason). This is particularly true when the affected operations are hidden behind an ORM - it isn't clear what might involve a database call, and sequencers (or logical clocks, etc.) aren't made explicit in the API.
I strongly suggest just enabling wsrep_causal_reads on all Galera sessions, unless you can guarantee that the high-level task is purely read-only, and then moving on to something else ;) If we choose performance over correctness here then we're just signing up for lots of debugging of hard-to-reproduce race conditions, and the fixes are going to look like what wsrep_causal_reads does anyway. (Mind you, exposing sequencers at every API interaction would be awesome, and I look forward to a future framework and toolchain that makes that easy to do correctly.)

I'd like to see actual examples where that will matter. Meanwhile, making all selects wait for the cluster will basically just ruin responsiveness and waste tons of time, so we should be careful to think this through before making any blanket policy.

I'd also like to see consideration given to systems that handle distributed consistency in a more active manner. etcd and Zookeeper are both such systems, and might serve as efficient guards for critical sections without raising latency.
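For reference, the knob being discussed here (spelled wsrep_causal_reads) is set per session, and later Galera releases deprecate it in favour of the wsrep_sync_wait bitmask. A sketch of both forms; exactly which variable is available depends on your Galera version:

```sql
-- Make this session wait until the local node has applied all writesets
-- known at the start of the read, giving read-your-writes semantics.
SET SESSION wsrep_causal_reads = ON;  -- older Galera releases
SET SESSION wsrep_sync_wait = 1;      -- newer releases (bit 1 = reads)
```

Because it is per-session, a deployment could enable it only for request paths known to read their own recent writes, which is a middle ground between Angus's blanket-on position and Clint's responsiveness concern.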
Excerpts from Avishay Traeger's message of 2015-02-04 22:19:53 -0800: On Wed, Feb 4, 2015 at 11:00 PM, Robert Collins robe...@robertcollins.net wrote: On 5 February 2015 at 10:24, Joshua Harlow harlo...@outlook.com wrote: How interesting, why are people using Galera if it behaves like this? :-/ Because it's actually fairly normal. In fact it's an instance of point 7 on https://wiki.openstack.org/wiki/BasicDesignTenets - one of our oldest wiki pages :).

When I hear MySQL I don't exactly think of eventual consistency (#7), scalability (#1), horizontal scalability (#4), etc. For the past few months I have been advocating implementing an alternative to db/sqlalchemy, but of course it's a huge undertaking. NoSQL (or even distributed key-value stores) should be considered IMO. Just some food for thought :)

I know it is popular to think that MySQL* == old, slow and low-scale, but that is only popular with those who have not actually tried to scale MySQL. You may want to have a chat with the people running MySQL at Google, Facebook, and a long tail of not-quite-as-big sites that are still massively bigger than most clouds. Note that many of the people who helped those companies scale up are involved directly with OpenStack.

The NoSQL bits that are popular out there make the easy part easy. There is no magic bullet for the hard part, which is when you need to do both synchronous and asynchronous. Factor in its maturity and the breadth of talent available, and I'll choose MySQL for this task every time.

* Please let's also give a nod to our friends working on MariaDB, a MySQL-compatible fork that many find preferable and, for the purposes of this discussion, equivalent.
Attila Fazekas afaze...@redhat.com wrote: I have a question related to deadlock handling as well. Why is the DBDeadlock exception not caught generally for all api/rpc requests? The MySQL recommendation regarding deadlocks [1]: Normally, you must write your applications so that they are always prepared to re-issue a transaction if it gets rolled back because of a deadlock. Now the services are just handling DBDeadlock in several places. We have some logstash hits for other places even without Galera. Instead of throwing a 503 to the end user, the request could be repeated `silently`. The users would be able to repeat the request themselves, so the automated repeat should not cause unexpected new problems. The retry limit might be configurable; the exception needs to be watched before anything is sent to the db on behalf of the transaction or request. Considering all request handlers as potential deadlock throwers seems much easier than deciding case by case.

Typically, deadlocks in “normal” applications are very unusual, except in well-known “hot-spots” where they are known to occur. The deadlock-retry can be applied to all methods as a whole, but this generally adds a lot more weight to the app, in that methods need to be written with the assumption that this is to occur. It complicates the potential that perhaps one method that is already wrapped in a retry needs to call upon another method that is also wrapped - should the wrappers organize themselves into a single “wrap” for the whole thing? It’s not like this is a bad idea, but it does have potential implications.

Part of the promise of enginefacade [1] is that, if applications used the decorator version (which unfortunately not all apps seem to want to), we could build this “smart retry” functionality right into the decorator, and we would in fact gain the ability to do this pretty easily.
[1] https://review.openstack.org/#/c/125181/ [1] http://dev.mysql.com/doc/refman/5.0/en/innodb-deadlocks.html

- Original Message - From: Matthew Booth mbo...@redhat.com To: openstack-dev@lists.openstack.org Sent: Thursday, February 5, 2015 10:36:55 AM Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

On 04/02/15 17:05, Sahid Orentino Ferdjaoui wrote: * Commit will fail if there is a replication conflict

foo is a table with a single field, which is its primary key.

A: start transaction;
B: start transaction;
A: insert into foo values(1);
B: insert into foo values(1); -- a 'regular' DB would block here, and report an error on A's commit
A: commit; -- success
B: commit; -- KABOOM

Confusingly, Galera will report a 'deadlock' to node B, despite this not being a deadlock by any definition I'm familiar with.

Yes! And if I can add more information (I hope I do not make a mistake): I think it's a known issue which comes from MySQL; that is why we have a decorator to do a retry and so handle this case here: http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177

Right, and that remains a significant source of confusion and obfuscation in the db api. Our db code is littered with races and potential actual deadlocks, but only some functions are decorated. Are they decorated because of real deadlocks, or because of Galera lock contention? The solutions to those 2 problems are very different! Also, hunting deadlocks is hard enough work. Adding the possibility that they might not even be there is just evil. Incidentally, we're currently looking to replace this stuff with some new code in oslo.db, which is why I'm looking at it.
Matt
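The retry-on-deadlock decorator being discussed can be sketched in a few lines. This is a simplified, hypothetical version (the exception class and backoff values are invented for illustration; Nova's real decorator lives at the api.py link above):

```python
import functools
import time

class DBDeadlock(Exception):
    """Stand-in for the real deadlock exception (hypothetical)."""

def retry_on_deadlock(max_retries=5, delay=0.05):
    # Wrap a DB-API method so a deadlock (or a Galera certification
    # failure reported as one) transparently re-issues the transaction.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except DBDeadlock:
                    if attempt == max_retries - 1:
                        raise
                    time.sleep(delay * (attempt + 1))  # linear backoff
        return wrapper
    return decorator

calls = []

@retry_on_deadlock(max_retries=3, delay=0)
def flaky_update():
    calls.append(1)
    if len(calls) < 3:
        raise DBDeadlock()
    return "committed"

assert flaky_update() == "committed"
assert len(calls) == 3  # two deadlocks were retried silently
```

Note how this makes Mike's nesting concern concrete: if a wrapped method calls another wrapped method, the retry counts multiply, which is exactly why he suggests organizing the wrappers into a single "wrap" at the transaction boundary.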
On Fri Feb 06 2015 at 4:25:43 AM Clint Byrum cl...@fewbar.com wrote: In a single thread, using a single database session, then a read after successful commit is guaranteed to read a version of the database that existed after that commit.

Ah, I'm relieved to hear this clarification - thanks.

I'd like to see actual examples where that will matter. Meanwhile making all selects wait for the cluster will basically just ruin responsiveness and waste tons of time, so we should be careful to think this through before making any blanket policy.

Matthew's example earlier in the thread is simply a user issuing two related commands in succession:

$ nova aggregate-create
$ nova aggregate-details

Once that fails a few times, the user will put a poorly commented sleep 2 in between the two statements, and this will fix the problem most of the time. A better fix would repeat the aggregate-details query multiple times until it looks like it has found the previous create. Now, that sleep or poll is of course a poor version of something you could do at a lower level, by waiting for reads+writes to propagate to a majority quorum.

I'd also like to see consideration given to systems that handle distributed consistency in a more active manner. etcd and Zookeeper are both such systems, and might serve as efficient guards for critical sections without raising latency.

+1 for moving to such systems. Then we can have a repeat of the above conversation without the added complications of SQL semantics ;)

- Gus
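The "poll until the previous write is visible" fix Angus describes can be sketched generically. A hypothetical helper (all names invented) that re-issues a read until a predicate holds, which is what a client would do instead of the poorly commented `sleep 2`:

```python
import time

def wait_until_visible(read, predicate, timeout=5.0, interval=0.1):
    # Re-issue the read until the write we just made is visible on the
    # node we happen to be talking to, or give up after `timeout`.
    deadline = time.monotonic() + timeout
    while True:
        result = read()
        if predicate(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("write not visible within timeout")
        time.sleep(interval)

# Simulate an eventually consistent replica that lags two reads behind.
reads = iter([None, None, {"name": "agg1"}])
result = wait_until_visible(lambda: next(reads),
                            lambda r: r is not None,
                            interval=0)
assert result == {"name": "agg1"}
```

As the thread notes, this is a client-side workaround for what a sequencer (or quorum read) would solve at a lower level.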
Angus Lees wrote: On Fri Feb 06 2015 at 4:25:43 AM Clint Byrum cl...@fewbar.com wrote: In a single thread, using a single database session, then a read after successful commit is guaranteed to read a version of the database that existed after that commit. Ah, I'm relieved to hear this clarification - thanks. I'd like to see actual examples where that will matter. Meanwhile making all selects wait for the cluster will basically just ruin responsiveness and waste tons of time, so we should be careful to think this through before making any blanket policy. Matthew's example earlier in the thread is simply a user issuing two related commands in succession: $ nova aggregate-create $ nova aggregate-details Once that fails a few times, the user will put a poorly commented sleep 2 in between the two statements, and this will fix the problem most of the time. A better fix would repeat the aggregate-details query multiple times until it looks like it has found the previous create. Now, that sleep or poll is of course a poor version of something you could do at a lower level, by waiting for reads+writes to propagate to a majority quorum. I'd also like to see consideration given to systems that handle distributed consistency in a more active manner. etcd and Zookeeper are both such systems, and might serve as efficient guards for critical sections without raising latency. +1 for moving to such systems. Then we can have a repeat of the above conversation without the added complications of SQL semantics ;)

So just an FYI: http://docs.openstack.org/developer/tooz/ exists. Specifically: http://docs.openstack.org/developer/tooz/developers.html#tooz.coordination.CoordinationDriver.get_lock

It has a locking API that it provides (which plugs into the various backends); there is also a WIP driver for etcd being worked on: https://review.openstack.org/#/c/151463/
-Josh
Excerpts from Joshua Harlow's message of 2015-02-06 01:26:25 +: Angus Lees wrote: On Fri Feb 06 2015 at 4:25:43 AM Clint Byrum cl...@fewbar.com wrote: I'd also like to see consideration given to systems that handle distributed consistency in a more active manner. etcd and Zookeeper are both such systems, and might serve as efficient guards for critical sections without raising latency. +1 for moving to such systems. Then we can have a repeat of the above conversation without the added complications of SQL semantics ;) So just an fyi: http://docs.openstack.org/developer/tooz/ exists. Specifically: http://docs.openstack.org/developer/tooz/developers.html#tooz.coordination.CoordinationDriver.get_lock It has a locking api that it provides (that plugs into the various backends); there is also a WIP driver for etcd being worked on: https://review.openstack.org/#/c/151463/

An interesting note about the etcd implementation is that you can select per-request whether you want to wait for quorum on a read or not. This means that in theory you could obtain higher throughput for most operations which do not require this, and then only gain quorum for operations which require it (e.g. locks).
On Thu, Feb 05, 2015 at 09:56:21AM +, Matthew Booth wrote: On 04/02/15 19:04, Jay Pipes wrote: On 02/04/2015 12:05 PM, Sahid Orentino Ferdjaoui wrote: On Wed, Feb 04, 2015 at 04:30:32PM +, Matthew Booth wrote:

I've spent a few hours today reading about Galera, a clustering solution for MySQL. Galera provides multi-master 'virtually synchronous' replication between multiple mysql nodes, i.e. I can create a cluster of 3 mysql DBs and read and write from any of them with certain consistency guarantees. I am no expert[1], but this is a TL;DR of a couple of things which I didn't know, but feel I should have done. The semantics are important to application design, which is why we should all be aware of them.

* Commit will fail if there is a replication conflict

foo is a table with a single field, which is its primary key.

A: start transaction;
B: start transaction;
A: insert into foo values(1);
B: insert into foo values(1); -- a 'regular' DB would block here, and report an error on A's commit
A: commit; -- success
B: commit; -- KABOOM

Confusingly, Galera will report a 'deadlock' to node B, despite this not being a deadlock by any definition I'm familiar with.

It is a failure to certify the writeset, which bubbles up as an InnoDB deadlock error. See my article here, which explains this: http://www.joinfu.com/2015/01/understanding-reservations-concurrency-locking-in-nova/

Yes! And if I can add more information (I hope I do not make a mistake): I think it's a known issue which comes from MySQL; that is why we have a decorator to do a retry and so handle this case here: http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177

It's not an issue with MySQL. It's an issue with any database code that is highly contentious.

I wanted to talk about the term 'deadlock' used here (which also seems to surprise Matthew); I thought it came from MySQL. In our situation it's not really a deadlock, just a locked session from A, and so B needs to retry?
I believe a deadlock would be when session A tries to read something from table x.foo in order to update y.bar, while B tries to read something from y.bar in order to update x.foo - so A acquires a lock to read x.foo and B acquires a lock to read y.bar; then when A needs to acquire a lock to update y.bar it cannot, and the same for B with x.foo.

Almost all highly distributed or concurrent applications need to handle deadlock issues, and the most common way to handle deadlock issues on database records is using a retry technique. There's nothing new about that with Galera.

The issue with our use of the @_retry_on_deadlock decorator is *not* that the retry decorator is not needed, but rather that it is used too frequently. The compare-and-swap technique I describe in the article above dramatically* reduces the number of deadlocks that occur (and need to be handled by the @_retry_on_deadlock decorator) and dramatically reduces the contention over critical database sections.

Thanks for this information.

I'm still confused as to how this code got there, though. We shouldn't be hitting Galera lock contention (reported as deadlocks) if we're using a single master, which I thought we were. Does this mean either: I guess we can hit lock contention even with a single master. A. There are deployments using multi-master? B. These are really deadlocks? If A, is this something we need to continue to support?

Thanks, Matt
On 05/02/15 04:30, Mike Bayer wrote: Galera doesn't change anything here. I'm really not sure what the fuss is about, frankly. because we’re trying to get Galera to actually work as a load balanced cluster to some degree, at least for reads. Yeah, the use case of concern here is consecutive RPC transactions from a single remote client, which can't reasonably be in the same transaction. This affects semantics visible to the end-user. In Nova, they might do: $ nova aggregate-create ... $ nova aggregate-details ... Should they expect that the second command might fail if they don't pause long enough between the 2? Should they retry until it succeeds? This example is a toy, but I would expect to find many other more subtle examples. Otherwise I’m not really sure why we have to bother with Galera at all. If we just want a single MySQL server that has a warm standby for failover, why aren’t we just using that capability straight from MySQL. Then we get “SELECT FOR UPDATE” and everything else back. Actually I think this is a misconception. If I have understood correctly[1], Galera *does* work with select for update. Use of select for update on a single node will work exactly as normal with blocking behaviour. Use of select for update across 2 nodes will not block, but fail on commit if there was lock contention. Galera’s “multi master” capability is already in the trash for us, and it seems like “multi-slave” is only marginally useful either, the vast majority of openstack has to be 100% pointed at just one node to work correctly. It's not necessarily in the trash, but given that the semantics are different (fail on commit rather than block) we'd need to do more work to support them. It sounds to me that we want to defer that rather than try to fix it now. i.e. multi-master is currently unsupport(ed|able). We could add an additional decorator to enginefacade which would re-execute a @writer block if it detected Galera lock contention. 
However, given that we'd have to audit that code for other side-effects, for the moment it sounds like it's safer to fail.

Matt

[1] Standard caveats apply.
I have a question related to deadlock handling as well. Why is the DBDeadlock exception not caught generally for all api/rpc requests? The MySQL recommendation regarding deadlocks [1]: Normally, you must write your applications so that they are always prepared to re-issue a transaction if it gets rolled back because of a deadlock. Now the services are just handling DBDeadlock in several places. We have some logstash hits for other places even without Galera. Instead of throwing a 503 to the end user, the request could be repeated `silently`. The users would be able to repeat the request themselves, so the automated repeat should not cause unexpected new problems. The retry limit might be configurable; the exception needs to be watched before anything is sent to the db on behalf of the transaction or request. Considering all request handlers as potential deadlock throwers seems much easier than deciding case by case.

[1] http://dev.mysql.com/doc/refman/5.0/en/innodb-deadlocks.html

- Original Message - From: Matthew Booth mbo...@redhat.com To: openstack-dev@lists.openstack.org Sent: Thursday, February 5, 2015 10:36:55 AM Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

On 04/02/15 17:05, Sahid Orentino Ferdjaoui wrote: * Commit will fail if there is a replication conflict

foo is a table with a single field, which is its primary key.

A: start transaction;
B: start transaction;
A: insert into foo values(1);
B: insert into foo values(1); -- a 'regular' DB would block here, and report an error on A's commit
A: commit; -- success
B: commit; -- KABOOM

Confusingly, Galera will report a 'deadlock' to node B, despite this not being a deadlock by any definition I'm familiar with.

Yes!
And if I can add more information (I hope I do not make a mistake): I think it's a known issue which comes from MySQL; that is why we have a decorator to do a retry and so handle this case here: http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177

Right, and that remains a significant source of confusion and obfuscation in the db api. Our db code is littered with races and potential actual deadlocks, but only some functions are decorated. Are they decorated because of real deadlocks, or because of Galera lock contention? The solutions to those 2 problems are very different! Also, hunting deadlocks is hard enough work. Adding the possibility that they might not even be there is just evil. Incidentally, we're currently looking to replace this stuff with some new code in oslo.db, which is why I'm looking at it.

Matt
On 05/02/15 11:01, Attila Fazekas wrote: I have a question related to deadlock handling as well. Why is the DBDeadlock exception not caught generally for all api/rpc requests? The MySQL recommendation regarding deadlocks [1]: Normally, you must write your applications so that they are always prepared to re-issue a transaction if it gets rolled back because of a deadlock.

This is evil imho, although it may well be pragmatic. A deadlock (a real deadlock, that is) occurs because of a preventable bug in code. It occurs because 2 transactions have attempted to take multiple locks in a different order. Getting this right is hard, but it is achievable. The solution to real deadlocks is to fix the bugs.

Galera 'deadlocks', on the other hand, are not deadlocks, despite being reported as such (it sounds as though this is due to an implementation quirk?). They don't involve 2 transactions holding mutual locks, and there is never any doubt about how to proceed. They involve 2 transactions holding the same lock, and 1 of them committed first. In a real deadlock they wouldn't get as far as commit. This isn't any kind of bug: it's normal behaviour in this environment and you just have to handle it.

Now the services are just handling DBDeadlock in several places. We have some logstash hits for other places even without Galera.

I haven't had much success with logstash. Could you post a query which would return these? This would be extremely interesting.

Instead of throwing a 503 to the end user, the request could be repeated `silently`. The users would be able to repeat the request themselves, so the automated repeat should not cause unexpected new problems.

Good point: we could argue 'no worse than now', even if it's buggy.

The retry limit might be configurable; the exception needs to be watched before anything is sent to the db on behalf of the transaction or request. Considering all request handlers as potential deadlock throwers seems much easier than deciding case by case.
Well this happens at the transaction level, and we don't quite have a 1:1 request:transaction relationship. We're moving towards it, but potentially long-running requests will always have to use multiple transactions. However, I take your point. I think retry on transaction failure is something which would benefit from standard handling in a library.

Matt
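Matthew's distinction matters in practice: a real deadlock comes from taking locks in inconsistent order, and the fix is to impose an order, not to retry. A toy illustration with Python threading locks rather than database rows (the principle is the same):

```python
import threading

lock_x = threading.Lock()
lock_y = threading.Lock()
completed = []

def transfer(a, b):
    # Always acquire locks in one global order (here: by id()) so two
    # concurrent calls can never each hold one lock while waiting for
    # the other -- the cycle that constitutes a real deadlock.
    first, second = sorted((a, b), key=id)
    with first:
        with second:
            completed.append(True)  # critical section on both resources

t1 = threading.Thread(target=transfer, args=(lock_x, lock_y))
t2 = threading.Thread(target=transfer, args=(lock_y, lock_x))
t1.start(); t2.start()
t1.join(); t2.join()
assert len(completed) == 2  # both finish; without ordering this could hang
```

A Galera 'deadlock', by contrast, can strike even perfectly ordered code, because it is a certification failure at commit time; that is why it must be handled with a retry rather than fixed as a bug.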
On Fri Feb 06 2015 at 12:59:13 PM Gregory Haynes g...@greghaynes.net wrote: Excerpts from Joshua Harlow's message of 2015-02-06 01:26:25 +: Angus Lees wrote: On Fri Feb 06 2015 at 4:25:43 AM Clint Byrum cl...@fewbar.com wrote: I'd also like to see consideration given to systems that handle distributed consistency in a more active manner. etcd and Zookeeper are both such systems, and might serve as efficient guards for critical sections without raising latency. +1 for moving to such systems. Then we can have a repeat of the above conversation without the added complications of SQL semantics ;) So just an fyi: http://docs.openstack.org/developer/tooz/ exists. Specifically: http://docs.openstack.org/developer/tooz/developers.html#tooz.coordination.CoordinationDriver.get_lock It has a locking api that it provides (that plugs into the various backends); there is also a WIP driver for etcd being worked on: https://review.openstack.org/#/c/151463/ An interesting note about the etcd implementation is that you can select per-request whether you want to wait for quorum on a read or not. This means that in theory you could obtain higher throughput for most operations which do not require this and then only gain quorum for operations which require it (e.g. locks).

Along those lines, and in an effort to be a bit less doom-and-gloom, I spent my lunch break trying to find non-marketing documentation on the Galera replication protocol and how it is exposed. (It was surprisingly difficult to find such information.*)

It's easy to get the transaction ID of the last commit (wsrep_last_committed), but I can't find a way to wait until at least a particular transaction ID has been synced. If we can find that latter functionality, then we can expose that sequencer all the way through (HTTP header?), and then any follow-on commands can mention the sequencer of the previous write command that they really need to see the effects of.
In practice, this should lead to zero additional wait time, since the Galera replication has almost certainly already caught up by the time the second command comes in - and we can just read from the local server with no additional delay. See the various *Index variables in the etcd API for how the same idea gets used there.

- Gus

(*) In case you're also curious, the only doc I found with any details was http://galeracluster.com/documentation-webpages/certificationbasedreplication.html and its sibling pages.
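The "wait until at least transaction N has been applied" primitive Angus is looking for can be sketched independently of Galera. A hypothetical helper (all names invented) that polls a sequence counter, standing in for wsrep_last_committed, until it reaches the sequencer returned by the earlier write:

```python
import threading
import time

def wait_for_seqno(get_applied, target, timeout=5.0, interval=0.01):
    # Block until the local node reports it has applied transaction
    # `target`; after that, a purely local read is safe with no extra
    # cluster round-trip.
    deadline = time.monotonic() + timeout
    while get_applied() < target:
        if time.monotonic() >= deadline:
            raise TimeoutError("replication did not catch up in time")
        time.sleep(interval)

# Simulate a replica applying writesets in the background.
applied = [40]

def apply_writesets():
    while applied[0] < 42:
        time.sleep(0.005)
        applied[0] += 1

threading.Thread(target=apply_writesets).start()
wait_for_seqno(lambda: applied[0], 42)  # returns once seqno 42 is applied
assert applied[0] >= 42
```

As the email predicts, when replication has already caught up the loop exits immediately, so the common case costs nothing; this is the same pattern the etcd *Index variables support natively.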
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
Excerpts from Angus Lees's message of 2015-02-06 02:36:32 +: snip It's easy to get the transaction ID of the last commit (wsrep_last_committed), but I can't find a way to wait until at least a particular transaction ID has been synced. If we can find that latter functionality, then we can expose that sequencer all the way through (HTTP header?)
and then any follow-on commands can mention the sequencer of the previous write command that they really need to see the effects of. In practice, this should lead to zero additional wait time, since the Galera replication has almost certainly already caught up by the time the second command comes in - and we can just read from the local server with no additional delay. See the various *Index variables in the etcd API, for how the same idea gets used there. - Gus (*) In case you're also curious, the only doc I found with any details was http://galeracluster.com/documentation-webpages/certificationbasedreplication.html and its sibling pages. My fear with something like this is that this is already a very hard problem to get correct, and this would be adding a fair amount of complexity client side to achieve it. There is also an issue in that this would be a Galera-specific solution, which means we'll be adding another dimension to our feature testing matrix if we really wanted to support it. IMO we *really* do not want to be in the business of writing distributed locking systems, but rather should be finding a way to either not require them or rely on existing solutions.
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 2015-02-05 9:36 PM, Angus Lees wrote: On Fri Feb 06 2015 at 12:59:13 PM Gregory Haynes g...@greghaynes.net mailto:g...@greghaynes.net wrote: Along those lines and in an effort to be a bit less doom-and-gloom, I spent my lunch break trying to find non-marketing documentation on the Galera replication protocol and how it is exposed. (It was surprisingly difficult to find such information *) It's easy to get the transaction ID of the last commit (wsrep_last_committed), but I can't find a way to wait until at least a particular transaction ID has been synced. If we can find that latter functionality, then we can expose that sequencer all the way through (HTTP header?) and then any follow-on commands can mention the sequencer of the previous write command that they really need to see the effects of. In practice, this should lead to zero additional wait time, since the Galera replication has almost certainly already caught up by the time the second command comes in - and we can just read from the local server with no additional delay. See the various *Index variables in the etcd API, for how the same idea gets used there. I don't use Galera, but I understand that you don't need all this complex machinery; it's already built into Galera. Matthew Booth already mentioned it in his first post. The wsrep_sync_wait [1][2][3] variable can be scoped to the session and forces a synchronous/committed read if you *really* need it, but it will result in larger read latencies. [1] http://galeracluster.com/documentation-webpages/mysqlwsrepoptions.html#wsrep-sync-wait [2] http://www.percona.com/doc/percona-xtradb-cluster/5.5/wsrep-system-index.html#wsrep_sync_wait [3] https://mariadb.com/kb/en/mariadb/galera-cluster-system-variables/#wsrep_sync_wait -- Mathieu
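Since wsrep_sync_wait is just a session variable, an application can scope it to exactly the connections that need read-your-writes, as Mathieu describes. A minimal sketch of that scoping, assuming any DB-API-style connection (the fake connection classes below are illustrative stand-ins, not a real driver):

```python
from contextlib import contextmanager

@contextmanager
def causal_reads(conn):
    """Turn on synchronous reads for this one session only.

    Newer Galera calls the variable wsrep_sync_wait (bit 1 covers
    SELECTs); older releases use wsrep_causal_reads. Other sessions on
    the cluster are unaffected; only this connection pays the extra
    read latency while the flag is set.
    """
    cur = conn.cursor()
    cur.execute("SET SESSION wsrep_sync_wait = 1")
    try:
        yield conn
    finally:
        cur.execute("SET SESSION wsrep_sync_wait = 0")

# Demonstration with a fake connection that just records statements:
class FakeCursor:
    def __init__(self, log):
        self.log = log
    def execute(self, sql):
        self.log.append(sql)

class FakeConnection:
    def __init__(self):
        self.executed = []
    def cursor(self):
        return FakeCursor(self.executed)

conn = FakeConnection()
with causal_reads(conn):
    conn.cursor().execute("SELECT * FROM foo")
print(conn.executed[0])  # SET SESSION wsrep_sync_wait = 1
```

Reads issued outside the context manager skip the flag entirely, which is the "only pay for it when you need it" property discussed in this thread.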
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
Hey now you forgot a site in that list ;-) -Josh Clint Byrum wrote: You may want to have a chat with the people running MySQL at Google, Facebook, and a long tail of not quite as big sites but still massively bigger than most clouds.
[openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
I've spent a few hours today reading about Galera, a clustering solution for MySQL. Galera provides multi-master 'virtually synchronous' replication between multiple mysql nodes. i.e. I can create a cluster of 3 mysql dbs and read and write from any of them with certain consistency guarantees. I am no expert[1], but this is a TL;DR of a couple of things which I didn't know, but feel I should have done. The semantics are important to application design, which is why we should all be aware of them.

* Commit will fail if there is a replication conflict

foo is a table with a single field, which is its primary key.

A: start transaction;
B: start transaction;
A: insert into foo values(1);
B: insert into foo values(1); -- 'regular' DB would block here, and report an error on A's commit
A: commit; -- success
B: commit; -- KABOOM

Confusingly, Galera will report a 'deadlock' to node B, despite this not being a deadlock by any definition I'm familiar with. Essentially, anywhere that a regular DB would block, Galera will not block transactions on different nodes. Instead, it will cause one of the transactions to fail on commit. This is still ACID, but the semantics are quite different. The impact of this is that code which makes correct use of locking may still fail with a 'deadlock'. The solution to this is to either fail the entire operation, or to re-execute the transaction and all its associated code in the expectation that it won't fail next time. As I understand it, these can be eliminated by sending all writes to a single node, although that obviously makes less efficient use of your cluster.

* Write followed by read on a different node can return stale data

During a commit, Galera replicates a transaction out to all other db nodes. Due to its design, Galera knows these transactions will be successfully committed to the remote node eventually[2], but it doesn't commit them straight away.
The remote node will check these outstanding replication transactions for write conflicts on commit, but not for read. This means that you can do:

A: start transaction;
A: insert into foo values(1);
A: commit;
B: select * from foo; -- May not contain the value we inserted above[3]

This means that even for 'synchronous' slaves, if a client makes an RPC call which writes a row to write master A, then another RPC call which expects to read that row from synchronous slave node B, there's no default guarantee that it'll be there. Galera exposes a session variable which will fix this: wsrep_sync_wait (or wsrep_causal_reads on older mysql). However, this isn't the default. It presumably has a performance cost, but I don't know what it is, or how it scales with various workloads. Because these are semantic issues, they aren't things which can be easily guarded with an if statement. We can't say:

if galera:
    try:
        commit
    except:
        rewind time

If we are to support this DB at all, we have to structure code in the first place to allow for its semantics. Matt [1] No, really: I just read a bunch of docs and blogs today. If anybody who is an expert would like to validate/correct that would be great. [2] http://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/ [3] http://www.percona.com/blog/2013/03/03/investigating-replication-latency-in-percona-xtradb-cluster/ -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
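The "re-execute the transaction and all its associated code" option Matthew describes is usually packaged as a retry decorator. A minimal sketch of the pattern (the exception class here is a made-up stand-in - real code would catch the driver's deadlock error):

```python
import functools

class FakeDeadlock(Exception):
    """Stand-in for the 'deadlock' error Galera reports when a
    transaction loses certification; real code would catch the DB
    driver's exception instead."""

def retry_on_deadlock(max_retries=3):
    """Re-run the whole decorated function on a deadlock error."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    # The *entire* transaction body runs again, not just
                    # the commit: a certification failure means the data
                    # the function read may be stale.
                    return fn(*args, **kwargs)
                except FakeDeadlock:
                    if attempt == max_retries:
                        raise
        return wrapper
    return decorator

attempts = []

@retry_on_deadlock()
def transfer():
    attempts.append(1)
    if len(attempts) < 3:
        raise FakeDeadlock()   # first two tries lose certification
    return "committed"

print(transfer())  # committed
```

Note the trade-off the thread keeps coming back to: the decorated function must be safe to re-run from the top, which is exactly why the retry has to wrap "the transaction and all its associated code" rather than the commit alone.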
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On Wed, Feb 04, 2015 at 04:30:32PM +, Matthew Booth wrote: I've spent a few hours today reading about Galera, a clustering solution for MySQL. Galera provides multi-master 'virtually synchronous' replication between multiple mysql nodes. i.e. I can create a cluster of 3 mysql dbs and read and write from any of them with certain consistency guarantees. I am no expert[1], but this is a TL;DR of a couple of things which I didn't know, but feel I should have done. The semantics are important to application design, which is why we should all be aware of them. * Commit will fail if there is a replication conflict foo is a table with a single field, which is its primary key. A: start transaction; B: start transaction; A: insert into foo values(1); B: insert into foo values(1); -- 'regular' DB would block here, and report an error on A's commit A: commit; -- success B: commit; -- KABOOM Confusingly, Galera will report a 'deadlock' to node B, despite this not being a deadlock by any definition I'm familiar with. Yes! And if I can add more information (I hope I do not make a mistake), I think it's a known issue which comes from MySQL; that is why we have a decorator to do a retry and so handle this case here: http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177 Essentially, anywhere that a regular DB would block, Galera will not block transactions on different nodes. Instead, it will cause one of the transactions to fail on commit. This is still ACID, but the semantics are quite different. The impact of this is that code which makes correct use of locking may still fail with a 'deadlock'. The solution to this is to either fail the entire operation, or to re-execute the transaction and all its associated code in the expectation that it won't fail next time. As I understand it, these can be eliminated by sending all writes to a single node, although that obviously makes less efficient use of your cluster.
* Write followed by read on a different node can return stale data During a commit, Galera replicates a transaction out to all other db nodes. Due to its design, Galera knows these transactions will be successfully committed to the remote node eventually[2], but it doesn't commit them straight away. The remote node will check these outstanding replication transactions for write conflicts on commit, but not for read. This means that you can do: A: start transaction; A: insert into foo values(1) A: commit; B: select * from foo; -- May not contain the value we inserted above[3] This means that even for 'synchronous' slaves, if a client makes an RPC call which writes a row to write master A, then another RPC call which expects to read that row from synchronous slave node B, there's no default guarantee that it'll be there. Galera exposes a session variable which will fix this: wsrep_sync_wait (or wsrep_causal_reads on older mysql). However, this isn't the default. It presumably has a performance cost, but I don't know what it is, or how it scales with various workloads. Because these are semantic issues, they aren't things which can be easily guarded with an if statement. We can't say: if galera: try: commit except: rewind time If we are to support this DB at all, we have to structure code in the first place to allow for its semantics. Matt [1] No, really: I just read a bunch of docs and blogs today. If anybody who is an expert would like to validate/correct that would be great. 
[2] http://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/ [3] http://www.percona.com/blog/2013/03/03/investigating-replication-latency-in-percona-xtradb-cluster/ -- Matthew Booth Red Hat Engineering, Virtualisation Team Phone: +442070094448 (UK) GPG ID: D33C3490 GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
Matthew Booth mbo...@redhat.com wrote: This means that even for 'synchronous' slaves, if a client makes an RPC call which writes a row to write master A, then another RPC call which expects to read that row from synchronous slave node B, there's no default guarantee that it'll be there. Can I get some kind of clue as to how common this use case is? This is where we get into things like how nova.objects works and stuff, which is not my domain. We are going through a huge amount of thought in order to handle this use case, but I'm not versed in where/how this use case exactly happens and how widespread it is.
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 02/04/2015 12:05 PM, Sahid Orentino Ferdjaoui wrote: On Wed, Feb 04, 2015 at 04:30:32PM +, Matthew Booth wrote: I've spent a few hours today reading about Galera, a clustering solution for MySQL. Galera provides multi-master 'virtually synchronous' replication between multiple mysql nodes. i.e. I can create a cluster of 3 mysql dbs and read and write from any of them with certain consistency guarantees. I am no expert[1], but this is a TL;DR of a couple of things which I didn't know, but feel I should have done. The semantics are important to application design, which is why we should all be aware of them. * Commit will fail if there is a replication conflict foo is a table with a single field, which is its primary key. A: start transaction; B: start transaction; A: insert into foo values(1); B: insert into foo values(1); -- 'regular' DB would block here, and report an error on A's commit A: commit; -- success B: commit; -- KABOOM Confusingly, Galera will report a 'deadlock' to node B, despite this not being a deadlock by any definition I'm familiar with. It is a failure to certify the writeset, which bubbles up as an InnoDB deadlock error. See my article here, which explains this: http://www.joinfu.com/2015/01/understanding-reservations-concurrency-locking-in-nova/ Yes! And if I can add more information (I hope I do not make a mistake), I think it's a known issue which comes from MySQL; that is why we have a decorator to do a retry and so handle this case here: http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177 It's not an issue with MySQL. It's an issue with any database code that is highly contentious. Almost all highly distributed or concurrent applications need to handle deadlock issues, and the most common way to handle deadlock issues on database records is using a retry technique. There's nothing new about that with Galera.
The issue with our use of the @_retry_on_deadlock decorator is *not* that the retry decorator is not needed, but rather it is used too frequently. The compare-and-swap technique I describe in the article above dramatically* reduces the number of deadlocks that occur (and need to be handled by the @_retry_on_deadlock decorator) and dramatically reduces the contention over critical database sections. Best, -jay * My colleague Pavel Kholkin is putting together the results of a benchmark run that compares the compare-and-swap method with the raw @_retry_on_deadlock decorator method. Spoiler: the compare-and-swap method cuts the runtime of the benchmark by almost *half*. Essentially, anywhere that a regular DB would block, Galera will not block transactions on different nodes. Instead, it will cause one of the transactions to fail on commit. This is still ACID, but the semantics are quite different. The impact of this is that code which makes correct use of locking may still fail with a 'deadlock'. The solution to this is to either fail the entire operation, or to re-execute the transaction and all its associated code in the expectation that it won't fail next time. As I understand it, these can be eliminated by sending all writes to a single node, although that obviously makes less efficient use of your cluster. * Write followed by read on a different node can return stale data During a commit, Galera replicates a transaction out to all other db nodes. Due to its design, Galera knows these transactions will be successfully committed to the remote node eventually[2], but it doesn't commit them straight away. The remote node will check these outstanding replication transactions for write conflicts on commit, but not for read. 
This means that you can do: A: start transaction; A: insert into foo values(1) A: commit; B: select * from foo; -- May not contain the value we inserted above[3] This means that even for 'synchronous' slaves, if a client makes an RPC call which writes a row to write master A, then another RPC call which expects to read that row from synchronous slave node B, there's no default guarantee that it'll be there. Galera exposes a session variable which will fix this: wsrep_sync_wait (or wsrep_causal_reads on older mysql). However, this isn't the default. It presumably has a performance cost, but I don't know what it is, or how it scales with various workloads. Because these are semantic issues, they aren't things which can be easily guarded with an if statement. We can't say: if galera: try: commit except: rewind time If we are to support this DB at all, we have to structure code in the first place to allow for its semantics. Matt [1] No, really: I just read a bunch of docs and blogs today. If anybody who is an expert would like to validate/correct that would be great. [2] http://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/ [3]
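Jay's compare-and-swap technique replaces SELECT ... FOR UPDATE with an UPDATE whose WHERE clause asserts that the value we previously read is still current, then checks the affected row count and retries if it lost the race. A sketch against an in-memory stand-in for a table (the table, key names, and the claim_vcpus helper are all illustrative; the SQL shape in the docstring is the actual point):

```python
# Sketch of the compare-and-swap pattern: read the row without locking,
# then issue an UPDATE that only matches if nobody changed it in the
# meantime, and loop on rowcount 0.

table = {"host-1": {"vcpus_used": 2}}   # fake row store

def compare_and_swap(key, field, expected, new):
    """Equivalent of: UPDATE t SET field = new
                      WHERE id = key AND field = expected.

    Returns the affected row count (1 on success, 0 if we lost the
    race), i.e. what cursor.rowcount reports for the real UPDATE.
    """
    row = table.get(key)
    if row is not None and row[field] == expected:
        row[field] = new
        return 1
    return 0

def claim_vcpus(key, want, max_tries=5):
    for _ in range(max_tries):               # bounded retry loop
        current = table[key]["vcpus_used"]   # plain read, no lock taken
        if compare_and_swap(key, "vcpus_used", current, current + want):
            return True
    return False

print(claim_vcpus("host-1", 2))           # True
print(table["host-1"]["vcpus_used"])      # 4
```

Because no row lock is ever held, there is nothing for Galera certification to turn into a spurious 'deadlock'; a lost race shows up as rowcount 0, which the loop handles locally and cheaply.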
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
Matthew Booth mbo...@redhat.com wrote: A: start transaction; B: start transaction; A: insert into foo values(1); B: insert into foo values(1); -- 'regular' DB would block here, and report an error on A's commit A: commit; -- success B: commit; -- KABOOM Confusingly, Galera will report a 'deadlock' to node B, despite this not being a deadlock by any definition I'm familiar with. So, one of the entire points of the enginefacade work is that we will ensure that writes will continue to be made to exactly one node in the cluster. OpenStack does not have the problem defined above, because we only communicate with one node, even today. The work that we are trying to proceed with is to at least have *reads* make full use of the cluster. The above phenomenon is not a problem for OpenStack today except for the reduced efficiency, which enginefacade will partially solve. As I understand it, these can be eliminated by sending all writes to a single node, although that obviously makes less efficient use of your cluster. This is what we do right now, and it continues to be the plan going forward. Having a single master is in fact the traditional form of clustering. In the OpenStack case, this issue isn't as bad as it seems, because OpenStack runs many different applications against the same database simultaneously. Different applications should refer to different nodes in the cluster as their "master". There's no conflict here because each app talks only to its own tables. During a commit, Galera replicates a transaction out to all other db nodes. Due to its design, Galera knows these transactions will be successfully committed to the remote node eventually[2], but it doesn't commit them straight away. The remote node will check these outstanding replication transactions for write conflicts on commit, but not for read.
This means that you can do:

A: start transaction;
A: insert into foo values(1);
A: commit;
B: select * from foo; -- May not contain the value we inserted above[3]

Will need to get more detail on this; this would mean that Galera is not in fact synchronous.
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
Excerpts from Matthew Booth's message of 2015-02-04 08:30:32 -0800: * Write followed by read on a different node can return stale data During a commit, Galera replicates a transaction out to all other db nodes. Due to its design, Galera knows these transactions will be successfully committed to the remote node eventually[2], but it doesn't commit them straight away. The remote node will check these outstanding replication transactions for write conflicts on commit, but not for read. This means that you can do: A: start transaction; A: insert into foo values(1) A: commit; B: select * from foo; -- May not contain the value we inserted above[3] This means that even for 'synchronous' slaves, if a client makes an RPC call which writes a row to write master A, then another RPC call which expects to read that row from synchronous slave node B, there's no default guarantee that it'll be there. Galera exposes a session variable which will fix this: wsrep_sync_wait (or wsrep_causal_reads on older mysql). However, this isn't the default. It presumably has a performance cost, but I don't know what it is, or how it scales with various workloads. wsrep_sync_wait/wsrep_causal_reads doesn't actually hit the cluster any harder; it simply tells the local Galera node: if you're not caught up with the highest known sync point, don't answer queries yet. So it will slow down that particular query as it waits for an update from the leader about the sync point and, if necessary, waits for the local engine to catch up to that point. However, it isn't going to push that query off to all the other boxes or anything like that.
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
How interesting. Why are people using Galera if it behaves like this? :-/ Do the people that are using it know that this happens? :-/ Scary Mike Bayer wrote: Matthew Booth mbo...@redhat.com wrote: A: start transaction; A: insert into foo values(1) A: commit; B: select * from foo; -- May not contain the value we inserted above[3] I’ve confirmed in my own testing that this is accurate. The wsrep_causal_reads flag does resolve this, and it is settable on a per-session basis. The attached script, adapted from the script given in the blog post, illustrates this. Galera exposes a session variable which will fix this: wsrep_sync_wait (or wsrep_causal_reads on older mysql). However, this isn't the default. It presumably has a performance cost, but I don't know what it is, or how it scales with various workloads. Well, consider our application is doing some @writer, then later it does some @reader. @reader has the contract that reads must be synchronous with any writes. Easy enough, @reader ensures that the connection it uses sets "wsrep_causal_reads=1". The attached test case confirms this is feasible on a per-session (that is, a connection attached to the database) basis, so that the setting will not impact the cluster as a whole, and we can forego using it on those @async_reader calls where we don’t need it. Because these are semantic issues, they aren't things which can be easily guarded with an if statement. We can't say: if galera: try: commit except: rewind time If we are to support this DB at all, we have to structure code in the first place to allow for its semantics. I think the above example is referring to the “deadlock” issue, which we have solved both with the “only write to one master” strategy. But overall, as you’re aware, we will no longer have the words “begin” or “commit” in our code. This takes place all within enginefacade.
With this pattern, we will permanently end the need for any kind of repeated special patterns or boilerplate which occurs per-transaction on a backend-configurable basis. The enginefacade is where any such special patterns can take place, and for extended patterns such as setting up wsrep_causal_reads on @reader nodes or similar, we can implement a rudimentary plugin system for it such that we can have a “galera” backend to set up what’s needed. The attached script does essentially what the one associated with http://www.percona.com/blog/2013/03/03/investigating-replication-latency-in-percona-xtradb-cluster/ does. It’s valid because without wsrep_causal_reads turned on for the connection, I get plenty of reads that lag behind the writes, so I’ve confirmed this is easily reproducible, and that with causal_reads turned on, it vanishes. The script demonstrates that a single application can set up “wsrep_causal_reads” on a per-session basis (remember, by “session” we mean “a mysql session”), where it takes effect for that connection alone, not affecting the performance of other concurrent connections even in the same application. With the flag turned on, the script never reads a stale row. The script illustrates calls upon both the causal reads connection and the non-causal reads connection in a randomly alternating fashion.
I’m running it against a cluster of two virtual nodes on a laptop, so performance is very slow, but some sample output:

2015-02-04 15:49:27,131 100 runs
2015-02-04 15:49:27,754 w/ non-causal reads, got row 763 val is 9499, retries 0
2015-02-04 15:49:27,760 w/ non-causal reads, got row 763 val is 9499, retries 1
2015-02-04 15:49:27,764 w/ non-causal reads, got row 763 val is 9499, retries 2
2015-02-04 15:49:27,772 w/ non-causal reads, got row 763 val is 9499, retries 3
2015-02-04 15:49:27,777 w/ non-causal reads, got row 763 val is 9499, retries 4
2015-02-04 15:49:30,985 200 runs
2015-02-04 15:49:37,579 300 runs
2015-02-04 15:49:42,396 400 runs
2015-02-04 15:49:48,240 w/ non-causal reads, got row 6544 val is 6766, retries 0
2015-02-04 15:49:48,255 w/ non-causal reads, got row 6544 val is 6766, retries 1
2015-02-04 15:49:48,276 w/ non-causal reads, got row 6544 val is 6766, retries 2
2015-02-04 15:49:49,336 500 runs
2015-02-04 15:49:56,433 600 runs
2015-02-04 15:50:05,801 700 runs
2015-02-04 15:50:08,802 w/ non-causal reads, got row 533 val is 834, retries 0
2015-02-04 15:50:10,849 800 runs
2015-02-04 15:50:14,834 900 runs
2015-02-04 15:50:15,445 w/ non-causal reads, got row 124 val is 3850, retries 0
2015-02-04 15:50:15,448 w/ non-causal reads, got row 124 val is 3850, retries 1
2015-02-04 15:50:18,515 1000 runs
2015-02-04 15:50:22,130 1100 runs
2015-02-04 15:50:26,301 1200 runs
2015-02-04 15:50:28,898 w/ non-causal reads, got row 1493 val is 8358, retries 0
2015-02-04 15:50:29,988 1300 runs
2015-02-04 15:50:33,736 1400 runs
2015-02-04 15:50:34,219 w/ non-causal reads, got row 9661 val is 2877, retries 0
2015-02-04 15:50:38,796 1500 runs
2015-02-04 15:50:42,844 1600 runs
2015-02-04 15:50:46,838 1700 runs
2015-02-04
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
Excerpts from Joshua Harlow's message of 2015-02-04 13:24:20 -0800: How interesting, why are people using Galera if it behaves like this? :-/ Note that any true MVCC database will roll back transactions on conflicts. One must always have a deadlock detection algorithm of some kind. Galera behaves like this because it is enormously costly to be synchronous at all times for everything. So it is synchronous when you want it to be, and async when you don't. Note that it's likely NDB (aka MySQL Cluster) would work fairly well for OpenStack's workloads, and does not suffer from this. However, it requires low-latency, high-bandwidth links between all nodes (infiniband recommended) or it will just plain suck. So Galera is a cheaper, easier to tune and reason about option. Are the people that are using it aware that this happens? :-/ I think the problem really is that it is somewhat de facto, and used without being tested. The gate doesn't set up a three-node Galera DB and test that OpenStack works right. Also it is inherently a race condition, and thus will be a hard one to test. That's where having knowledge of it and taking time to engineer a solution that makes sense is really the best course I can think of.
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 5 February 2015 at 10:24, Joshua Harlow harlo...@outlook.com wrote: How interesting, why are people using Galera if it behaves like this? :-/ Because it's actually fairly normal. In fact it's an instance of point 7 on https://wiki.openstack.org/wiki/BasicDesignTenets - one of our oldest wiki pages :). In more detail, consider what happens in full isolation when you have the A and B example given, but B starts its transaction before A.

B: BEGIN
A: BEGIN
A: INSERT foo
A: COMMIT
B: SELECT foo -- NULL

Data inserted by a transaction with a higher transaction id isn't visible to the older transaction (in an MVCC-style engine - there are other engines, but this is common). When you add clustering in, many cluster DBs are not synchronous: postgresql replication is asynchronous - both log shipping and slony. Neither is Galera. So reads will see older data than has been committed to the cluster. Writes will conflict *if* the write was dependent on data that was changed. If rather than clustering you add multiple DBs, you get the same sort of thing unless you explicitly wire in 2PC and a distributed lock manager and oh my... and we have multiple DBs (cinder, nova etc) but no such coordination between them. Now, if we say that we can't accept eventual consistency, that we have to have atomic visibility of changes, then we've a -lot- of work - because of the multiple DBs thing. However, eventual consistency can cause confusion if it's not applied well, and it may be that this layer is the wrong layer to apply it at - that's certainly a possibility. Are the people that are using it know/aware that this happens? :-/ I hope so :) -Rob -- Robert Collins rbtcoll...@hp.com Distinguished Technologist HP Converged Cloud
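Rob's B-begins-first example is just the MVCC snapshot visibility rule. A toy model of it in Python (this is an illustration of the rule only, not how any real engine stores versions): rows are stamped with the transaction id that committed them, and a transaction only sees rows committed before its own snapshot.

```python
class ToyMVCC:
    """Toy snapshot isolation: enough to show the visibility rule."""

    def __init__(self):
        self.next_txid = 1
        self.rows = []            # list of (value, commit_txid)

    def begin(self):
        txid = self.next_txid
        self.next_txid += 1
        # The snapshot is frozen at BEGIN time.
        return {"txid": txid, "snapshot": txid, "pending": []}

    def insert(self, txn, value):
        txn["pending"].append(value)   # invisible until commit

    def commit(self, txn):
        commit_id = self.next_txid     # commit gets a fresh stamp
        self.next_txid += 1
        for value in txn["pending"]:
            self.rows.append((value, commit_id))

    def select(self, txn):
        # Only rows committed before this transaction's snapshot show up.
        return [v for v, cid in self.rows if cid < txn["snapshot"]]

db = ToyMVCC()
b = db.begin()          # B BEGIN (older snapshot)
a = db.begin()          # A BEGIN
db.insert(a, "foo")     # A INSERT foo
db.commit(a)            # A COMMIT
print(db.select(b))     # [] - B's older snapshot cannot see A's row
```

A transaction begun after A's commit does see the row, which is the single-node analogue of the cluster-wide "reads will see older data" behaviour Rob describes.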
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
Matthew Booth mbo...@redhat.com wrote:
> A: start transaction;
> A: insert into foo values(1);
> A: commit;
> B: select * from foo; -- May not contain the value we inserted above[3]

I've confirmed in my own testing that this is accurate. The wsrep_causal_reads flag does resolve this, and it is settable on a per-session basis. The attached script, adapted from the script given in the blog post, illustrates this.

> Galera exposes a session variable which will fix this: wsrep_sync_wait (or wsrep_causal_reads on older mysql). However, this isn't the default. It presumably has a performance cost, but I don't know what it is, or how it scales with various workloads.

Well, consider our application is doing some @writer, then later it does some @reader. @reader has the contract that reads must be synchronous with any writes. Easy enough: @reader ensures that the connection it uses sets up "SET wsrep_causal_reads=1". The attached test case confirms this is feasible on a per-session (that is, a connection attached to the database) basis, so that the setting will not impact the cluster as a whole, and we can forego using it on those @async_reader calls where we don't need it.

> Because these are semantic issues, they aren't things which can be easily guarded with an if statement. We can't say:
>
>     if galera:
>         try:
>             commit
>         except:
>             rewind time
>
> If we are to support this DB at all, we have to structure code in the first place to allow for its semantics.

I think the above example is referring to the "deadlock" issue, which we have solved with the "only write to one master" strategy. But overall, as you're aware, we will no longer have the words "begin" or "commit" in our code. This takes place all within enginefacade. With this pattern, we will permanently end the need for any kind of repeated special pattern or boilerplate which occurs per-transaction on a backend-configurable basis.
The enginefacade is where any such special patterns can take place, and for extended patterns such as setting up wsrep_causal_reads on @reader nodes or similar, we can implement a rudimentary plugin system for it such that we can have a "galera" backend to set up what's needed.

The attached script does essentially what the one associated with http://www.percona.com/blog/2013/03/03/investigating-replication-latency-in-percona-xtradb-cluster/ does. It's valid because without wsrep_causal_reads turned on for the connection, I get plenty of reads that lag behind the writes, so I've confirmed this is easily reproducible, and that with causal_reads turned on, it vanishes.

The script demonstrates that a single application can set up "wsrep_causal_reads" on a per-session basis (remember, by "session" we mean "a MySQL session"), where it takes effect for that connection alone, not affecting the performance of other concurrent connections even in the same application. With the flag turned on, the script never reads a stale row. The script alternates randomly between the causal-reads connection and the non-causal-reads connection.
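A sketch of what such a backend hook could look like. The function names and the plugin shape here are hypothetical illustrations of the idea, not the actual enginefacade API; only the wsrep_sync_wait / wsrep_causal_reads variable names come from the discussion above:

```python
# Hypothetical backend plugin hook: the facade asks the backend for
# per-session setup statements when a connection is checked out, so a
# "galera" backend can enable causal reads for @reader sessions only,
# leaving @async_reader and other backends untouched.
def session_setup_statements(backend, mode):
    """Return SQL to run on session checkout for the given mode
    ('reader', 'async_reader', or 'writer')."""
    if backend == "galera" and mode == "reader":
        # wsrep_sync_wait supersedes wsrep_causal_reads on newer Galera
        return ["SET SESSION wsrep_sync_wait = 1"]
    return []


def on_checkout(connection, backend, mode):
    # Called once per session; affects that connection alone.
    for stmt in session_setup_statements(backend, mode):
        connection.execute(stmt)


print(session_setup_statements("galera", "reader"))
# ['SET SESSION wsrep_sync_wait = 1']
print(session_setup_statements("galera", "async_reader"))
# []
```

The point of routing this through one hook is the same as the enginefacade argument above: the causal-reads decision is made once, per backend, rather than as boilerplate repeated at every transaction site.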
I'm running it against a cluster of two virtual nodes on a laptop, so performance is very slow, but some sample output:

2015-02-04 15:49:27,131 100 runs
2015-02-04 15:49:27,754 w/ non-causal reads, got row 763 val is 9499, retries 0
2015-02-04 15:49:27,760 w/ non-causal reads, got row 763 val is 9499, retries 1
2015-02-04 15:49:27,764 w/ non-causal reads, got row 763 val is 9499, retries 2
2015-02-04 15:49:27,772 w/ non-causal reads, got row 763 val is 9499, retries 3
2015-02-04 15:49:27,777 w/ non-causal reads, got row 763 val is 9499, retries 4
2015-02-04 15:49:30,985 200 runs
2015-02-04 15:49:37,579 300 runs
2015-02-04 15:49:42,396 400 runs
2015-02-04 15:49:48,240 w/ non-causal reads, got row 6544 val is 6766, retries 0
2015-02-04 15:49:48,255 w/ non-causal reads, got row 6544 val is 6766, retries 1
2015-02-04 15:49:48,276 w/ non-causal reads, got row 6544 val is 6766, retries 2
2015-02-04 15:49:49,336 500 runs
2015-02-04 15:49:56,433 600 runs
2015-02-04 15:50:05,801 700 runs
2015-02-04 15:50:08,802 w/ non-causal reads, got row 533 val is 834, retries 0
2015-02-04 15:50:10,849 800 runs
2015-02-04 15:50:14,834 900 runs
2015-02-04 15:50:15,445 w/ non-causal reads, got row 124 val is 3850, retries 0
2015-02-04 15:50:15,448 w/ non-causal reads, got row 124 val is 3850, retries 1
2015-02-04 15:50:18,515 1000 runs
2015-02-04 15:50:22,130 1100 runs
2015-02-04 15:50:26,301 1200 runs
2015-02-04 15:50:28,898 w/ non-causal reads, got row 1493 val is 8358, retries 0
2015-02-04 15:50:29,988 1300 runs
2015-02-04 15:50:33,736 1400 runs
2015-02-04 15:50:34,219 w/ non-causal reads, got row 9661 val is 2877, retries 0
2015-02-04 15:50:38,796 1500 runs
2015-02-04 15:50:42,844 1600 runs
2015-02-04 15:50:46,838 1700 runs
2015-02-04 15:50:51,049 1800 runs
2015-02-04 15:50:55,139 1900 runs
2015-02-04 15:50:59,632 2000 runs
2015-02-04 15:51:04,721 2100 runs
2015-02-04 15:51:10,670 2200 runs
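The "retries" counts in that output come from a stale-read retry loop. The loop can be sketched in-process, simulating replication lag with a delayed "replica" write rather than a real cluster (everything here is a stand-in of my own; the actual attached script talks to Galera):

```python
# Simulates the stale-read retry loop: writes hit the "primary" at once
# and the "replica" after a lag, and a non-causal reader polls the
# replica, counting retries until the value catches up.
import threading
import time

primary = {}
replica = {}


def lagged_write(key, value, lag=0.05):
    """Write to the primary immediately, to the replica after `lag` seconds."""
    primary[key] = value
    t = threading.Timer(lag, replica.__setitem__, (key, value))
    t.start()
    return t


def read_with_retries(key, expected):
    """Poll the replica until it shows `expected`; return the retry count."""
    retries = 0
    while replica.get(key) != expected:  # stale read: wait and try again
        retries += 1
        time.sleep(0.01)
    return retries


timer = lagged_write("row-763", 9499)
retries = read_with_retries("row-763", 9499)
timer.join()
print(f"w/ non-causal reads, got row-763 val is {replica['row-763']}, retries {retries}")
```

With causal reads enabled the loop would never iterate, which is exactly what the script's causal-reads connection shows.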
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On 02/04/2015 07:59 PM, Angus Lees wrote:
> On Thu Feb 05 2015 at 9:02:49 AM Robert Collins robe...@robertcollins.net wrote:
>> On 5 February 2015 at 10:24, Joshua Harlow harlo...@outlook.com wrote:
>>> How interesting. Why are people using galera if it behaves like this? :-/
>>
>> Because it's actually fairly normal. In fact it's an instance of point 7 on https://wiki.openstack.org/wiki/BasicDesignTenets - one of our oldest wiki pages :).
>>
>> In more detail, consider what happens in full isolation when you have the A and B example given, but B starts its transaction before A:
>>
>> B: BEGIN
>> A: BEGIN
>> A: INSERT foo
>> A: COMMIT
>> B: SELECT foo -> NULL
>
> Note that this still makes sense from each of A and B's individual view of the world. If I understood correctly, the big change with Galera that Matthew is highlighting is that read-after-write may not be consistent from the pov of a single thread.

No, this is not correct. There is nothing different about Galera here versus any asynchronously replicated database. A single thread, issuing statements in two entirely *separate sessions*, load-balanced across an entire set of database cluster nodes, may indeed see older data if the second session gets balanced to a slave node. Nothing has changed about this with Galera.

The exact same patterns that you would use to ensure that you are able to read the data that you previously wrote can be used with Galera. Just have the thread start a transactional session and ensure all queries are executed in the context of that session. Done. Nothing about Galera changes anything here.

> Not having read-after-write is *really* hard to code to (see for example x86 SMP cache coherency, C++ threading semantics, etc., which all provide read-after-write for this reason).
> This is particularly true when the affected operations are hidden behind an ORM - it isn't clear what might involve a database call, and sequencers (or logical clocks, etc.) aren't made explicit in the API.
>
> I strongly suggest just enabling wsrep_causal_reads on all galera sessions, unless you can guarantee that the high-level task is purely read-only, and then moving on to something else ;) If we choose performance over correctness here then we're just signing up for lots of debugging of hard-to-reproduce race conditions, and the fixes are going to look like what wsrep_causal_reads does anyway.
>
> (Mind you, exposing sequencers at every API interaction would be awesome, and I look forward to a future framework and toolchain that makes that easy to do correctly)

IMHO, you all are reading WAY too much into this. The behaviour that Matthew is describing is the kind of thing that has been around for decades now with asynchronous slave replication. Applications have traditionally handled it by sending reads that can tolerate slave lag to a slave machine, and reads that cannot to the same machine that was written to. Galera doesn't change anything here. I'm really not sure what the fuss is about, frankly.

I don't recommend mucking with wsrep_causal_reads if we don't have to. And, IMO, we don't have to muck with it at all.

Best,
-jay
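The traditional routing pattern Jay describes can be sketched as a tiny dispatcher. The `Router` class and the engine names are hypothetical, purely to illustrate sending lag-tolerant reads to slaves and everything else back to the node that was written to:

```python
# Routes lag-tolerant reads round-robin across slaves; reads that must
# observe the caller's own writes stay on the master. Engine objects are
# stand-ins for real database connections.
class Router:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        self._i = 0

    def engine_for(self, tolerate_lag):
        if not tolerate_lag or not self.slaves:
            # read-your-writes: go to the node the writes went to
            return self.master
        # slave lag is acceptable: spread these reads across the slaves
        self._i = (self._i + 1) % len(self.slaves)
        return self.slaves[self._i]


r = Router("master", ["slave1", "slave2"])
print(r.engine_for(tolerate_lag=False))  # master
print(r.engine_for(tolerate_lag=True))   # one of the slaves
```

This is the same decision the @reader / @async_reader split elsewhere in the thread encodes declaratively: the hard part is not the dispatcher but classifying which calls can actually tolerate lag.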
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On Wed, Feb 4, 2015 at 11:00 PM, Robert Collins robe...@robertcollins.net wrote:
> On 5 February 2015 at 10:24, Joshua Harlow harlo...@outlook.com wrote:
>> How interesting. Why are people using galera if it behaves like this? :-/
>
> Because it's actually fairly normal. In fact it's an instance of point 7 on https://wiki.openstack.org/wiki/BasicDesignTenets - one of our oldest wiki pages :).

When I hear MySQL I don't exactly think of eventual consistency (#7), scalability (#1), horizontal scalability (#4), etc. For the past few months I have been advocating implementing an alternative to db/sqlalchemy, but of course it's a huge undertaking. NoSQL (or even distributed key-value stores) should be considered IMO. Just some food for thought :)

--
*Avishay Traeger*
*Storage R&D*
Mobile: +972 54 447 1475
E-mail: avis...@stratoscale.com
Web http://www.stratoscale.com/ | Blog http://www.stratoscale.com/blog/ | Twitter https://twitter.com/Stratoscale | Linkedin https://www.linkedin.com/company/stratoscale
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
On Thu Feb 05 2015 at 9:02:49 AM Robert Collins robe...@robertcollins.net wrote:
> On 5 February 2015 at 10:24, Joshua Harlow harlo...@outlook.com wrote:
>> How interesting. Why are people using galera if it behaves like this? :-/
>
> Because it's actually fairly normal. In fact it's an instance of point 7 on https://wiki.openstack.org/wiki/BasicDesignTenets - one of our oldest wiki pages :).
>
> In more detail, consider what happens in full isolation when you have the A and B example given, but B starts its transaction before A:
>
> B: BEGIN
> A: BEGIN
> A: INSERT foo
> A: COMMIT
> B: SELECT foo -> NULL

Note that this still makes sense from each of A and B's individual view of the world. If I understood correctly, the big change with Galera that Matthew is highlighting is that read-after-write may not be consistent from the pov of a single thread.

Not having read-after-write is *really* hard to code to (see for example x86 SMP cache coherency, C++ threading semantics, etc., which all provide read-after-write for this reason). This is particularly true when the affected operations are hidden behind an ORM - it isn't clear what might involve a database call, and sequencers (or logical clocks, etc.) aren't made explicit in the API.

I strongly suggest just enabling wsrep_causal_reads on all galera sessions, unless you can guarantee that the high-level task is purely read-only, and then moving on to something else ;) If we choose performance over correctness here then we're just signing up for lots of debugging of hard-to-reproduce race conditions, and the fixes are going to look like what wsrep_causal_reads does anyway.

(Mind you, exposing sequencers at every API interaction would be awesome, and I look forward to a future framework and toolchain that makes that easy to do correctly)

 - Gus
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
Jay Pipes jaypi...@gmail.com wrote:
> No, this is not correct. There is nothing different about Galera here versus any asynchronously replicated database. A single thread, issuing statements in two entirely *separate sessions*, load-balanced across an entire set of database cluster nodes, may indeed see older data if the second session gets balanced to a slave node.

That's what we're actually talking about. We're talking about "reader" methods that aren't enclosed in a "writer" potentially being pointed at the cluster as a whole.

> Nothing has changed about this with Galera. The exact same patterns that you would use to ensure that you are able to read the data that you previously wrote can be used with Galera. Just have the thread start a transactional session and ensure all queries are executed in the context of that session. Done. Nothing about Galera changes anything here.

Right, but what I'm trying to get a handle on is: how often do we make a series of RPC calls at an openstack service, where each one (because they are separate calls) is in a different transaction, and how many of those are RPC calls that are "read-only" (and therefore we'd like to point at the cluster as a whole) but dependent on a "writer" RPC call that just happened immediately preceding?

> IMHO, you all are reading WAY too much into this. The behaviour that Matthew is describing is the kind of thing that has been around for decades now with asynchronous slave replication. Applications have traditionally handled it by sending reads that can tolerate slave lag to a slave machine, and reads that cannot to the same machine that was written to.

Can we identify methods in Openstack, and particularly Nova, that are reads that can tolerate slave lag?
Or is the thing architected such that "no, pretty much 95% of reader calls, we have no idea if they occur right after a write that they are definitely dependent on"? Matthew found a small handful in one little corner of Nova, some kind of background thread thing, which make use of the "use_slave" flag. But the rest of it, nope.

> Galera doesn't change anything here. I'm really not sure what the fuss is about, frankly.

Because we're trying to get Galera to actually work as a load-balanced cluster to some degree, at least for reads. Otherwise I'm not really sure why we have to bother with Galera at all. If we just want a single MySQL server that has a warm standby for failover, why aren't we just using that capability straight from MySQL? Then we get "SELECT FOR UPDATE" and everything else back. Galera's "multi master" capability is already in the trash for us, and it seems like "multi-slave" is only marginally useful either; the vast majority of openstack has to be 100% pointed at just one node to work correctly.

I'm coming here with the disadvantage that I don't have a clear picture of the actual use patterns we really need. The picture I have right now is of a Nova / Neutron etc. that receive dozens/hundreds of tiny RPC calls, each of which does some small thing in its own transaction, yet most are dependent on each other as they are all part of a single larger operation, and the whole thing runs too slowly. But this is the fuzziest picture ever.