Re: [openstack-dev] [nova][neutron][mysql] IMPORTANT: MySQL Galera does *not* support SELECT ... FOR UPDATE

2014-05-20 Thread Peter Boros
 that provides interesting features, such as PostgreSQL.

 Sorry, had to say it, but it's pissing me off to see the low quality of
 the work that is done around SQL in OpenStack.

 --
 Julien Danjou
 /* Free Software hacker
http://julien.danjou.info */



-- 
Peter Boros, Consultant, Percona
Telephone: +1 888 401 3401 ext 546
Emergency: +1 888 401 3401 ext 911
Skype: percona.pboros



Re: [openstack-dev] [nova][neutron][mysql] IMPORTANT: MySQL Galera does *not* support SELECT ... FOR UPDATE

2014-05-20 Thread Peter Boros
Hi,

Also, it would be nice to confirm that SELECT ... FOR UPDATE statements
really are causing the deadlocks. Since these show up as row lock waits
in the single-node case, pt-query-digest run on a slow log from a single
node can help determine this.

pt-query-digest /path/to/slow.log --order-by InnoDB_rec_lock_wait:sum > digest-rec_lock_waits.txt

It will show which statements waited for locks the most; these will most
likely be the ones causing the deadlock issues in the case of multi-node
writing.

On Tue, May 20, 2014 at 2:27 PM, Peter Boros peter.bo...@percona.com wrote:
 Hi,

 I would like to shed some additional light on this for those who were
 not there. SELECT ... FOR UPDATE does lock on a single node; as was
 pointed out earlier in this thread, a simple solution is to write to
 only one node at a time. HAProxy can be set up with both backends; see
 this blog post for example:
 http://www.mysqlperformanceblog.com/2012/06/20/percona-xtradb-cluster-reference-architecture-with-haproxy/

 In a nutshell, and with a bit of oversimplification, Galera replicates
 in write sets. A write set is essentially a row-based binary log event
 plus some metadata, which is good for two things: you can look at two
 write sets and tell whether they conflict, and you can look at a write
 set and a database and tell whether the write set is applicable to that
 database. At commit time, the transaction is transferred to all the
 other cluster nodes in parallel. On each remote node, the new
 transaction is compared against every other transaction waiting in the
 queue to be applied, and it is checked whether it is applicable to the
 database. If the transaction is not conflicting and it is applicable,
 it is queued, and the node signals back that the commit can proceed.
 There is a nice drawing about this here:

 http://www.percona.com/doc/percona-xtradb-cluster/5.6/features/multimaster-replication.html
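
 As an aside, you can watch how often certification fails on a node via
 the wsrep status counters. A small sketch, assuming PyMySQL and the
 standard Galera status variable names; host and credentials are
 placeholders:

    import pymysql

    conn = pymysql.connect(host="node1", user="monitor", password="secret")
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_local_%'")
        status = dict(cur.fetchall())
    # Transactions that failed certification on this node:
    print(status.get("wsrep_local_cert_failures"))
    # Local transactions aborted in favour of replicated write sets:
    print(status.get("wsrep_local_bf_aborts"))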

 So, because of this, the locks taken by SELECT ... FOR UPDATE won't
 replicate. Between nodes, Galera uses optimistic locking: we assume
 that during the certification process (described above) there will be
 no conflicts. If there is a conflict, the transaction is rolled back
 on the originating node, and this is when you receive the error
 message in question. A failed transaction is something that can happen
 at any time with any database engine that has any interesting feature,
 and when a transaction fails, the application should know what to do
 with it. What would have been a wait on row locks in the single-node
 case becomes a rollback under Galera replication. A rollback is a much
 more expensive operation (data has to be copied back from undo), so if
 there are lots of failures like this, performance will suffer.
 So, this is not a deadlock in the classical sense. Yet InnoDB can roll
 back a transaction at any time because of a deadlock (any database
 engine can do that, including PostgreSQL), and the application should
 be able to handle this.
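
 What "handling this" can look like in application code: a minimal retry
 sketch, assuming PyMySQL; the helper and retry policy are illustrative.
 A Galera certification failure is reported to the client as MySQL error
 1213, the same "deadlock" error InnoDB itself uses:

    import time
    import pymysql

    ER_LOCK_DEADLOCK = 1213   # "Deadlock found when trying to get lock"

    def run_with_retry(conn, work, max_retries=5):
        """Run work(cursor) inside a transaction, retrying on deadlock."""
        for attempt in range(max_retries):
            try:
                with conn.cursor() as cur:
                    work(cur)            # all statements of the transaction
                conn.commit()
                return
            except pymysql.MySQLError as exc:
                conn.rollback()
                code = exc.args[0] if exc.args else None
                if code != ER_LOCK_DEADLOCK or attempt == max_retries - 1:
                    raise
                time.sleep(0.05 * (attempt + 1))   # short backoff, then retry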

 As it was noted earlier, writing to only a single node at a time is a
 good way to avoid this. Even with multiple nodes being written to,
 storage engine level writes will still happen on every node, because
 every node has the whole data set. Writing to multiple nodes can still
 be beneficial, because parsing SQL is much more expensive than just
 applying a row-based binary log event, so you can see some performance
 improvement if all nodes are written to.

 I would discourage using any type of multi-master replication without
 understanding how conflict resolution works in the chosen solution.
 With Galera, if row locks were replicated over the network, it would
 behave the same way as a single server, but it would be really slow.
 If SELECT ... FOR UPDATE is only used to achieve consistent reads
 (read your own writes), that can be achieved with wsrep_causal_reads.
 I am happy to help avoid SELECT ... FOR UPDATE if somebody can tell me
 the use cases.
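
 To make the wsrep_causal_reads alternative concrete, a small sketch
 assuming PyMySQL; host, table and column names are placeholders.
 Instead of SELECT ... FOR UPDATE, the reading session turns on causal
 reads so its SELECT waits until the node has applied everything
 committed cluster-wide before answering:

    import pymysql

    conn = pymysql.connect(host="node2", user="nova", password="secret",
                           database="nova")
    with conn.cursor() as cur:
        # Only this session pays the causal-read wait; other sessions on
        # the same node are unaffected.
        cur.execute("SET SESSION wsrep_causal_reads = ON")
        cur.execute("SELECT host, vm_state FROM instances WHERE uuid = %s",
                    ("some-uuid",))
        row = cur.fetchone()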

 On Tue, May 20, 2014 at 10:53 AM, Julien Danjou jul...@danjou.info wrote:
 On Mon, May 19 2014, Jay Pipes wrote:

 I think at that point I mentioned that there were a number of places that
 were using the SELECT ... FOR UPDATE construct in Nova (in SQLAlchemy, it's
 the with_lockmode('update') modification of the query object). Peter
 promptly said that was a problem. MySQL Galera does not support SELECT ...
 FOR UPDATE, since it has no concept of cross-node locking of records and
 results are non-deterministic.

 So you send a command that's not supported and the whole software
 deadlocks? Is there a bug number about that or something? I cannot
 understand how this can be possible and considered normal (that's the
 feeling I get reading your mail; I may be wrong).

 We have a number of options:

 1) Stop using MySQL Galera for databases of projects that contain
 with_lockmode('update')

 2) Put a big old warning in the docs somewhere about the problem of
 potential deadlocks or odd behaviour with Galera
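
 For readers who haven't seen the construct mentioned above: roughly,
 the SQLAlchemy 0.9-era with_lockmode('update') call renders the query
 as SELECT ... FOR UPDATE. A self-contained sketch with a stand-in
 model, not Nova's actual schema; newer SQLAlchemy spells this
 with_for_update():

    from sqlalchemy import Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class Instance(Base):
        __tablename__ = "instances"
        id = Column(Integer, primary_key=True)
        uuid = Column(String(36))

    session = sessionmaker()()   # unbound; we only render the SQL here

    query = (session.query(Instance)
             .filter_by(uuid="some-uuid")
             .with_lockmode("update"))   # emits SELECT ... FOR UPDATE
    print(query)                         # shows the generated statement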

Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-07 Thread Peter Boros
Hi Angus,

If causal reads is set in a session, it won't delay all reads, just
the reads in the specific session you set it for. Let's say you have 4
sessions and you set causal reads in one of them: the other 3 won't
wait on anything. Reads in the one session where you set it will be
delayed; in the other 3, they won't be. Also, this delay is usually
small. Since the replication itself is synchronous, if a node is not
able to keep up with the rest of the cluster in terms of writes, it
will send flow control messages to the other nodes. Flow control means
that its receive queue is full and the other nodes have to wait until
they can do more writes (during flow control, writes on the other
nodes are blocked until the given node catches up). So the delay
imposed here can't be arbitrarily large.


On Sat, Feb 7, 2015 at 3:00 AM, Angus Lees g...@inodes.org wrote:
 Thanks for the additional details Peter.  This confirms the parts I'd
 deduced from the docs I could find, and is useful knowledge.

 On Sat Feb 07 2015 at 2:24:23 AM Peter Boros peter.bo...@percona.com
 wrote:

 - Like many others have said before me, consistent reads can be achieved
 with wsrep_causal_reads set on in the session.


 So the example was two dependent command-line invocations (write followed by
 read) that have no way to re-use the same DB session (without introducing
 lots of affinity issues that we'd also like to avoid).

 Enabling wsrep_causal_reads makes sure the latter read sees the effects of
 the earlier write, but comes at the cost of delaying all reads by some
 amount depending on the write-load of the Galera cluster (if I understand
 correctly).  This additional delay was raised as a concern severe enough not
 to just go down this path.

 Really we don't care about other writes that may have occurred (we always
 need to deal with races against other actors), we just want to ensure our
 earlier write has taken effect on the galera server where we sent the second
 read request.  If we had some way to say wsrep_delay_until $first_txid
 then we could be sure of read-after-write from a different DB session and
 also (in the vast majority of cases) suffer no additional delay.  An opaque
 sequencer is a generic concept across many of the distributed consensus
 stores I'm familiar with, so this needn't be exposed as a Galera-only quirk.


 Meh, I gather people are bored with the topic at this point.  As I suggested
 much earlier, I'd just enable wsrep_causal_reads on the first request for
 the session and then move on to some other problem ;)

  - Gus





-- 
Peter Boros, Principal Architect, Percona
Telephone: +1 888 401 3401 ext 546
Emergency: +1 888 401 3401 ext 911
Skype: percona.pboros



Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-06 Thread Peter Boros
 about the etcd implementation is that you can
 select per-request whether you want to wait for quorum on a read or not.
 This means that in theory you could obtain higher throughput for most
 operations which do not require this and then only gain quorum for
 operations which require it (e.g. locks).


 Along those lines and in an effort to be a bit less doom-and-gloom, I spent
 my lunch break trying to find non-marketing documentation on the Galera
 replication protocol and how it is exposed. (It was surprisingly difficult
 to find such information *)

 It's easy to get the transaction ID of the last commit
 (wsrep_last_committed), but I can't find a way to wait until at least a
 particular transaction ID has been synced.  If we can find that latter
 functionality, then we can expose that sequencer all the way through (HTTP
 header?) and then any follow-on commands can mention the sequencer of the
 previous write command that they really need to see the effects of.

 In practice, this should lead to zero additional wait time, since the Galera
 replication has almost certainly already caught up by the time the second
 command comes in - and we can just read from the local server with no
 additional delay.

 See the various *Index variables in the etcd API, for how the same idea gets
 used there.
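
 To make the sequencer idea concrete: wsrep_last_committed can be read on
 both sides today, but the "wait for seqno X" half can only be approximated
 by polling, which is exactly the missing piece. A rough sketch assuming
 PyMySQL; hosts and credentials are placeholders, and the polling loop is
 illustrative, not an existing Galera facility:

    import time
    import pymysql

    def last_committed(conn):
        """Read the node's last applied Galera sequence number."""
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_last_committed'")
            return int(cur.fetchone()[1])

    # After committing the write on node A, remember the sequencer and hand
    # it to the follow-on command (e.g. via an HTTP header, as suggested).
    write_conn = pymysql.connect(host="node-a", user="nova",
                                 password="secret", database="nova")
    seqno = last_committed(write_conn)

    # Before the follow-up read on node B, wait until that node has applied
    # at least the remembered sequencer; in practice it almost always has.
    read_conn = pymysql.connect(host="node-b", user="nova",
                                password="secret", database="nova")
    while last_committed(read_conn) < seqno:
        time.sleep(0.01)
    # ... now the read is guaranteed to see the earlier write ...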

  - Gus

 (*) In case you're also curious, the only doc I found with any details was
 http://galeracluster.com/documentation-webpages/certificationbasedreplication.html
 and its sibling pages.





-- 
Peter Boros, Principal Architect, Percona
Telephone: +1 888 401 3401 ext 546
Emergency: +1 888 401 3401 ext 911
Skype: percona.pboros
