Re: [openstack-dev] [nova][neutron][mysql] IMPORTANT: MySQL Galera does *not* support SELECT ... FOR UPDATE
that provides interesting features, such as PostgreSQL. Sorry, I had to say it, but it's pissing me off to see the low quality of the work that is done around SQL in OpenStack.

-- Julien Danjou /* Free Software hacker http://julien.danjou.info */

-- Peter Boros, Consultant, Percona
Telephone: +1 888 401 3401 ext 546 | Emergency: +1 888 401 3401 ext 911 | Skype: percona.pboros

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova][neutron][mysql] IMPORTANT: MySQL Galera does *not* support SELECT ... FOR UPDATE
Hi,

Also, it would be nice to confirm that SELECT ... FOR UPDATE statements really are what is causing the deadlocks. Since these are row lock waits in the single-node case, pt-query-digest on a slow log from a single node can help determine this:

pt-query-digest /path/to/slow.log --order-by InnoDB_rec_lock_wait:sum > digest-rec_lock_waits.txt

It will show which statements waited on locks the most; these will most likely be the ones causing the deadlock issues in the multi-node writing case.

On Tue, May 20, 2014 at 2:27 PM, Peter Boros peter.bo...@percona.com wrote:

Hi,

I would like to shed some additional light on this for those who were not there. So, SELECT ... FOR UPDATE does lock on a single node. As was pointed out earlier in this thread, a simple solution is to write to only one node at a time. HAProxy can be set up with both backends; see this blog post for an example: http://www.mysqlperformanceblog.com/2012/06/20/percona-xtradb-cluster-reference-architecture-with-haproxy/

In a nutshell, and with a bit of oversimplification, Galera replicates in write sets. A write set is practically a row-based binary log event plus some metadata, which is good for two things: you can look at two write sets and tell whether they conflict, and you can look at a write set and a database and tell whether the write set is applicable to that database.

At commit time, the transaction is transferred to all the other cluster nodes in parallel. On each remote node, the new transaction is compared to every other transaction waiting in the queue to be applied, and it is checked whether it is applicable to the database. If the transaction is not conflicting and it is applicable, it is queued, and the node signals back that the commit can proceed. There is a nice drawing of this here: http://www.percona.com/doc/percona-xtradb-cluster/5.6/features/multimaster-replication.html

So, because of this, the locks taken by SELECT ... FOR UPDATE are not replicated. Between nodes, Galera uses optimistic locking.
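The conflict test at the heart of certification can be sketched in a few lines: two write sets conflict when they touch overlapping row keys. This is a toy model under stated assumptions, not Galera's actual implementation; the row key names are invented for illustration.

```python
# Toy model of Galera-style certification: a write set records which row
# keys a transaction modified; two write sets conflict when their key
# sets overlap. Real write sets carry row images and more metadata; this
# only illustrates the conflict check.

def conflicts(write_set_a, write_set_b):
    """Return True if the two write sets touch any common row key."""
    return bool(set(write_set_a) & set(write_set_b))

def certify(incoming, apply_queue):
    """Certify an incoming write set against those queued for apply."""
    return all(not conflicts(incoming, queued) for queued in apply_queue)

txn1 = ["instances.42", "quotas.7"]   # rows modified by transaction 1
txn2 = ["quotas.7"]                   # touches the same quota row as txn1
txn3 = ["instances.99"]               # disjoint from both

assert conflicts(txn1, txn2)          # overlap -> certification fails, rollback
assert certify(txn3, [txn1, txn2])    # disjoint keys -> certification passes
assert not certify(txn2, [txn1])      # conflicts with a queued transaction
```

A transaction that fails this check on any node is rolled back on its originating node, which is the error the thread is discussing.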
This means that we assume that during the certification process (described above) there will be no conflicts. If there are conflicts, the transaction is rolled back on the originating node, and that is when you receive the error message in question. A failed transaction is something that can happen at any time with any database engine with any interesting feature, and when a transaction fails, the application should know what to do with it. What would have been a wait on row locks in the single-node case becomes a rollback under Galera replication. A rollback is a much more expensive operation (data has to be copied back from undo), so if there are lots of failures like this, performance will suffer.

So, this is not a deadlock in the classical sense. Yet InnoDB can roll back a transaction at any time because of a deadlock (any database engine can do that, including PostgreSQL), and the application should be able to handle this. As was noted earlier, writing to only a single node at a time is a good way to avoid this. Even with multiple nodes written, storage-engine-level writes still happen on every node, because every node has the whole data set. Writing to multiple nodes can nevertheless be beneficial, because parsing SQL is much more expensive than just applying a row-based binary log event, so you can see some performance improvement if all nodes are written.

I would discourage using any type of multi-master replication without understanding how conflict resolution works in the chosen solution. In the case of Galera, if row locks were replicated over the network, it would act the same way as a single server, but it would be really slow. If SELECT ... FOR UPDATE is only used to achieve consistent reads (read your own writes), that can be achieved with wsrep_causal_reads. I am happy to help avoid SELECT ... FOR UPDATE if somebody can tell me the use cases.
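Since any engine can roll a transaction back (a classical deadlock on one node, a certification failure under Galera), the usual application-side answer is a retry loop with backoff. A minimal sketch; the `DeadlockError` class stands in for whatever driver-specific exception signals the rollback, which varies by client library.

```python
import random
import time

class DeadlockError(Exception):
    """Stand-in for the driver-specific deadlock/certification error."""

def run_with_retries(txn_fn, attempts=5, base_delay=0.01):
    """Run txn_fn, retrying with jittered backoff when it is rolled back."""
    for attempt in range(attempts):
        try:
            return txn_fn()
        except DeadlockError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            # jittered exponential backoff so concurrent retries
            # don't immediately collide on the same rows again
            time.sleep(base_delay * (2 ** attempt) * random.random())

calls = {"n": 0}
def flaky_txn():
    calls["n"] += 1
    if calls["n"] < 3:            # first two attempts are "rolled back"
        raise DeadlockError()
    return "committed"

assert run_with_retries(flaky_txn) == "committed"
assert calls["n"] == 3            # succeeded on the third attempt
```

The important property is that `txn_fn` re-reads its inputs on every attempt, so a retry observes the state left behind by whichever transaction won the conflict.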
On Tue, May 20, 2014 at 10:53 AM, Julien Danjou jul...@danjou.info wrote:

On Mon, May 19 2014, Jay Pipes wrote:

I think at that point I mentioned that there were a number of places that were using the SELECT ... FOR UPDATE construct in Nova (in SQLAlchemy, it's the with_lockmode('update') modification of the query object). Peter promptly said that was a problem. MySQL Galera does not support SELECT ... FOR UPDATE, since it has no concept of cross-node locking of records, and results are non-deterministic.

So you send a command that's not supported and the whole software deadlocks? Is there a bug number for that or something? I cannot understand how this can be possible and considered normal (that's the feeling I get reading your mail; I may be wrong).

We have a number of options:

1) Stop using MySQL Galera for databases of projects that contain with_lockmode('update')

2) Put a big old warning in the docs somewhere about the problem of potential deadlocks or odd behaviour with Galera
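One further option, not spelled out above, is to drop with_lockmode('update') in favor of a compare-and-swap style UPDATE: read without locking, make the UPDATE's WHERE clause assert the value you read, and treat zero affected rows as a lost race (then retry). A sketch with sqlite3 standing in for MySQL; the table and column names are invented for illustration.

```python
import sqlite3

# An in-memory table standing in for a Nova-style state machine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO instances VALUES (1, 'BUILDING')")
conn.commit()

def transition(conn, instance_id, expected, new):
    """Compare-and-swap state change; returns True iff we won the race.

    The WHERE clause asserts the state we read earlier, so a concurrent
    writer surfaces as rowcount == 0 instead of an unreplicated row lock.
    """
    cur = conn.execute(
        "UPDATE instances SET state = ? WHERE id = ? AND state = ?",
        (new, instance_id, expected))
    conn.commit()
    return cur.rowcount == 1

assert transition(conn, 1, "BUILDING", "ACTIVE")      # first writer wins
assert not transition(conn, 1, "BUILDING", "ERROR")   # state moved on: lost race
```

Under Galera this pattern degrades gracefully: a conflicting write on another node shows up either as zero affected rows or as a certification rollback, both of which the application can detect and retry.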
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
Hi Angus,

If causal reads is set in a session, it won't delay all reads, just the reads in the session you set it for. Let's say you have 4 sessions and you set causal reads in one of them: the other 3 won't wait on anything; only reads in the session with the setting are delayed. Also, this delay is usually small.

Since the replication itself is synchronous, if a node is not able to keep up with the rest of the cluster in terms of writes, it will send flow control messages to the other nodes. Flow control means that its receive queue is full, and the other nodes have to wait until they can do more writes (during flow control, writes on the other nodes are blocked until the given node catches up). So the delay imposed here can't be arbitrarily large.

On Sat, Feb 7, 2015 at 3:00 AM, Angus Lees g...@inodes.org wrote:

Thanks for the additional details Peter. This confirms the parts I'd deduced from the docs I could find, and is useful knowledge.

On Sat Feb 07 2015 at 2:24:23 AM Peter Boros peter.bo...@percona.com wrote:

- Like many others said before me, consistent reads can be achieved with wsrep_causal_reads set on in the session.

So the example was two dependent command-line invocations (a write followed by a read) that have no way to re-use the same DB session (without introducing lots of affinity issues that we'd also like to avoid). Enabling wsrep_causal_reads makes sure the latter read sees the effects of the earlier write, but comes at the cost of delaying all reads by some amount depending on the write load of the Galera cluster (if I understand correctly). This additional delay was raised as a concern severe enough not to just go down this path.

Really we don't care about other writes that may have occurred (we always need to deal with races against other actors); we just want to ensure our earlier write has taken effect on the Galera server where we sent the second read request.
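The per-session semantics can be modelled in miniature: a causal read waits until the local node has applied everything committed cluster-wide, while a plain read returns whatever the node has applied so far. A toy model only, not the wsrep implementation; the class and its internals are invented for illustration.

```python
class ToyReplica:
    """Toy model of one Galera node: writes commit cluster-wide first
    (cluster_log), and the node applies them locally with some lag."""

    def __init__(self):
        self.cluster_log = []   # (key, value) pairs committed cluster-wide
        self.applied = {}       # what this node has applied locally
        self.applied_upto = 0   # index into cluster_log

    def cluster_write(self, key, value):
        """A commit that certified cluster-wide (possibly on another node)."""
        self.cluster_log.append((key, value))

    def _apply_pending(self):
        for key, value in self.cluster_log[self.applied_upto:]:
            self.applied[key] = value
        self.applied_upto = len(self.cluster_log)

    def read(self, key, causal_reads=False):
        if causal_reads:           # only this read waits for replication
            self._apply_pending()  # stand-in for blocking until caught up
        return self.applied.get(key)

node = ToyReplica()
node.cluster_write("vm1", "ACTIVE")   # committed elsewhere, not yet applied here
assert node.read("vm1") is None       # plain read: may return stale data
assert node.read("vm1", causal_reads=True) == "ACTIVE"  # causal read waits
```

The point of the model is that the wait is a property of the individual read, not of the node: other sessions reading concurrently pay no extra cost.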
If we had some way to say wsrep_delay_until $first_txid, then we could be sure of read-after-write from a different DB session and also (in the vast majority of cases) suffer no additional delay. An opaque sequencer is a generic concept across many of the distributed consensus stores I'm familiar with, so this needn't be exposed as a Galera-only quirk.

Meh, I gather people are bored with the topic at this point. As I suggested much earlier, I'd just enable wsrep_causal_reads on the first request for the session and then move on to some other problem ;)

- Gus

-- Peter Boros, Principal Architect, Percona
Telephone: +1 888 401 3401 ext 546 | Emergency: +1 888 401 3401 ext 911 | Skype: percona.pboros
Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
A nice thing about the etcd implementation is that you can select per-request whether you want to wait for quorum on a read or not. This means that in theory you could obtain higher throughput for most operations, which do not require this, and then only gain quorum for operations which require it (e.g. locks).

Along those lines, and in an effort to be a bit less doom-and-gloom, I spent my lunch break trying to find non-marketing documentation on the Galera replication protocol and how it is exposed. (It was surprisingly difficult to find such information. *)

It's easy to get the transaction ID of the last commit (wsrep_last_committed), but I can't find a way to wait until at least a particular transaction ID has been synced. If we can find that latter functionality, then we can expose that sequencer all the way through (HTTP header?), and then any follow-on commands can mention the sequencer of the previous write command whose effects they really need to see. In practice, this should lead to zero additional wait time, since the Galera replication has almost certainly already caught up by the time the second command comes in - and we can just read from the local server with no additional delay. See the various *Index variables in the etcd API for how the same idea gets used there.

- Gus

(*) In case you're also curious, the only doc I found with any details was http://galeracluster.com/documentation-webpages/certificationbasedreplication.html and its sibling pages.
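The "wait until a particular transaction ID has been synced" primitive amounts to: capture a sequence number after the write (e.g. wsrep_last_committed), pass it along with the follow-on request, and have the reader block until the local node has reached it. A sketch of that wait, with a threading-based stand-in for the replicated store; the names echo the Galera status variable but the wait API itself is the invented part (it is exactly what the mail says Galera doesn't expose).

```python
import threading

class SequencedStore:
    """Stand-in for a node exposing a last-committed sequence number."""

    def __init__(self):
        self.last_committed = 0
        self._cond = threading.Condition()

    def apply_upto(self, seqno):
        """Called by the (simulated) replication applier as it catches up."""
        with self._cond:
            self.last_committed = max(self.last_committed, seqno)
            self._cond.notify_all()

    def wait_for_seqno(self, seqno, timeout=5.0):
        """Block until this node has applied at least `seqno`.

        Returns False on timeout, so callers can fall back or retry
        against another node."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self.last_committed >= seqno, timeout=timeout)

store = SequencedStore()
# Simulate replication applying seqno 42 on this node a little later.
applier = threading.Timer(0.05, store.apply_upto, args=(42,))
applier.start()

assert store.wait_for_seqno(42)                      # returns once 42 is applied
assert not store.wait_for_seqno(99, timeout=0.05)    # 99 never arrives: times out
applier.join()
```

In the common case the node has already caught up by the time the second request arrives, so `wait_for_seqno` returns immediately with no added latency, which is the attraction over a blanket causal-reads setting.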