Re: [openstack-dev] [oslo.db][nova] Use of asynchronous slaves in Nova (was: Deprecating use_slave in Nova)

2015-02-02 Thread Mike Bayer


Matthew Booth mbo...@redhat.com wrote:

 
 Based on my current (and still sketchy) understanding, I think we can
 define 3 classes of database node:
 
 1. Read/write
 2. Synchronous read-only
 3. Asynchronous read-only
 
 and 3 code annotations:
 
 * Writer (must use class 1)
 * Reader (prefer class 2, can use 1)
 * Async reader (prefer class 3, can use 2 or 1)
 
 The use cases for async would presumably be limited. Perhaps certain
 periodic tasks? Would it even be worth it?
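
To make that concrete, here is a rough sketch of what those three annotations
might look like as enginefacade-style decorators. The names, in particular
allow_async, are illustrative only and not a settled oslo.db API:

    from oslo_db.sqlalchemy import enginefacade

    @enginefacade.writer
    def set_instance_state(context, instance_uuid, state):
        # Writer: must run against the read/write node (class 1).
        ...

    @enginefacade.reader
    def get_instance(context, instance_uuid):
        # Reader: prefer a synchronous read-only node (class 2); fall
        # back to the writer node if no slave is configured.
        ...

    @enginefacade.reader.allow_async
    def count_instances_for_periodic_task(context):
        # Async reader: tolerates replication lag, so an asynchronous
        # read-only node (class 3) is acceptable.
        ...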

Let’s suppose someone runs an OpenStack setup using a database with async 
replication.

Can OpenStack even make use of this outside of these periodic tasks, or is it 
the case that a stateless call into OpenStack (e.g. a web service call) can’t 
be expected to know that it relies upon a previous web service call whose 
writes may not yet have been replicated?

Let’s suppose that an app has a web service call, and within that scope it 
calls a function that does @writer, and then calls a function that does 
@reader.   Even in that situation, enginefacade could detect that the context 
being passed into the new @reader call was just used in a @writer - so even 
then, we could have the @reader upgrade to @writer if we know that the reader 
slaves are asynchronous in a given configuration.
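
As a sketch of that scenario (function names are made up, and the routing
described in the comments is the behaviour being proposed here, not something
that exists today):

    from oslo_db.sqlalchemy import enginefacade

    @enginefacade.writer
    def create_record(context, values):
        # After this block, the context has participated in a write.
        ...

    @enginefacade.reader
    def fetch_record(context, record_id):
        # If slaves are configured as asynchronous, enginefacade could
        # notice that this same context was just used under @writer and
        # route the read to the writer node instead of a possibly-stale
        # slave.
        ...

    def handle_web_service_call(context, values):
        record_id = create_record(context, values)
        # Same request scope, same context object.
        return fetch_record(context, record_id)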

But is that enough?   Or is it the case that a common operation spans multiple 
web service calls that depend on each other, with no indication between them 
that would let us detect this, so that all of these calls have to assume “I can 
only read from a slave if it’s synchronous”?

I think we really need to know what deployment styles we are targeting here.  
If most people run Galera with synchronous replication, that can be the primary 
platform, and other setups simply won’t be able to promise very good 
utilization of asynchronous read slaves.

I hope that all makes sense.  If I read this a week from now I won’t understand 
what I’m talking about.





Re: [openstack-dev] [oslo.db][nova] Use of asynchronous slaves in Nova (was: Deprecating use_slave in Nova)

2015-02-02 Thread Matthew Booth
On 30/01/15 19:06, Mike Bayer wrote:
 
 
 Matthew Booth mbo...@redhat.com wrote:
 
 At some point in the near future, hopefully early in L, we're intending
 to update Nova to use the new database transaction management in
 oslo.db's enginefacade.

 Spec:
 http://git.openstack.org/cgit/openstack/oslo-specs/plain/specs/kilo/make-enginefacade-a-facade.rst

 Implementation:
 https://review.openstack.org/#/c/138215/

 One of the effects of this is that we will always know when we are in a
 read-only transaction, or a transaction which includes writes. We intend
 to use this new contextual information to make greater use of read-only
 slave databases. We are currently proposing that if an admin has
 configured a slave database, we will use the slave for *all* read-only
 transactions. This would make the use_slave parameter passed to some
 Nova apis redundant, as we would always use the slave where the context
 allows.
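
Roughly, the difference would be the following (illustrative signatures only,
not actual Nova or oslo.db code):

    from oslo_db.sqlalchemy import enginefacade

    # Today: the caller has to thread use_slave through the API
    # (illustrative signature, not the actual Nova code).
    def instance_get_all_by_host_legacy(context, host, use_slave=False):
        ...

    # With enginefacade: the decorator already marks the block as
    # read-only, so the slave can be chosen automatically and the
    # use_slave parameter becomes unnecessary.
    @enginefacade.reader
    def instance_get_all_by_host(context, host):
        ...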

 However, using a slave database has a potential pitfall when mixed with
 separate write transactions. A caller might currently:

 1. start a write transaction
 2. update the database
 3. commit the transaction
 4. start a read transaction
 5. read from the database

 The client might expect data written in step 2 to be reflected in data
 read in step 5. I can think of 3 cases here:

 1. A short-lived RPC call is using multiple transactions

 This is a bug which the new enginefacade will help us eliminate. We
 should not be using multiple transactions in this case. If the reads are
 in the same transaction as the write: they will be on the master, they
 will be consistent, and there is no problem. As a bonus, lots of these
 will be race conditions, and we'll fix at least some.
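
A sketch of what that fix looks like (names are illustrative, and it assumes
the session is exposed on the context as the enginefacade spec proposes):

    from oslo_db.sqlalchemy import enginefacade
    from nova.db.sqlalchemy import models

    @enginefacade.writer
    def update_and_get_instance(context, instance_uuid, values):
        # The UPDATE and the subsequent read share one transaction on
        # the master, so the read is guaranteed to see the write and
        # there is no window for a lagging slave to return stale data.
        instance = (context.session.query(models.Instance)
                    .filter_by(uuid=instance_uuid)
                    .one())
        instance.update(values)
        return instance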

 2. A long-lived task is using multiple transactions between long-running
 sub-tasks

 In this case, for example creating a new instance, we genuinely want
 multiple transactions: we don't want to hold a database transaction open
 while we copy images around. However, I can't immediately think of a
 situation where we'd write data, then subsequently want to read it back
 from the db in a read-only transaction. I think we will typically be
 updating state, meaning it's going to be a succession of write transactions.

 3. Separate RPC calls from a remote client

 This seems potentially problematic to me. A client makes an RPC call to
 create a new object. The client subsequently tries to retrieve the
 created object, and gets a 404.

 Summary: 1 is a class of bugs which we should be able to find fairly
 mechanically through unit testing. 2 probably isn't a problem in
 practice? 3 seems like a problem, unless consumers of cloud services are
 supposed to expect that sort of thing.

 I understand that slave databases can occasionally get very far behind. How
 far behind is this in practice?

 How do we use use_slave currently? Why do we need a use_slave parameter
 passed in via RPC, when it should be apparent to the developer whether a
 particular task is safe for out-of-date data?

 Any chance they have some kind of barrier mechanism? e.g. block until
 the current state contains transaction X.
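
(MySQL's asynchronous replication does offer one such barrier:
MASTER_POS_WAIT() blocks on a slave until it has applied the master's binlog
up to given coordinates, and newer GTID-based equivalents exist. A rough
sketch, with made-up connection details and coordinates; nothing in oslo.db
exposes this today:)

    from sqlalchemy import create_engine, text

    slave = create_engine("mysql+pymysql://nova:secret@slave-host/nova")
    with slave.connect() as conn:
        # Coordinates captured on the master (SHOW MASTER STATUS) just
        # after the write we need to see; returns once the slave has
        # caught up, or after the timeout (in seconds).
        conn.execute(
            text("SELECT MASTER_POS_WAIT(:f, :p, :t)"),
            {"f": "mysql-bin.000123", "p": 4711, "t": 5},
        )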

 General comments on the usefulness of slave databases, and the
 desirability of making maximum use of them?
 
 keep in mind that the big win we get from writer()/ reader() is that
 writer() can remain pointing to one node in a Galera cluster, and
 reader() can point to the cluster as a whole. reader() by default should
 definitely refer to the cluster as a whole, that is, “use slave”.
 
 As for issue #3, galera cluster is synchronous replication. Slaves
 don’t get “behind” at all. So to the degree that we need to
 transparently support some other kind of master/slave where slaves do
 get behind, perhaps there would be a reader(synchronous_required=True)
 kind of thing; based on configuration, it would be known that
 “synchronous” either means we don’t care (using galera) or that we
 should use the writer (an asynchronous replication scheme).
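
In other words, something like the following dispatch, where the meaning of
the flag collapses depending on configuration (hypothetical names throughout;
none of this is a real oslo.db API):

    def choose_engine_for_reader(synchronous_required, conf):
        if not synchronous_required:
            # The caller tolerates lag: any configured slave will do.
            return conf.slave_engine or conf.writer_engine
        if conf.replication_is_synchronous:
            # e.g. Galera: slaves are never behind, so requiring
            # "synchronous" costs nothing and a slave can still serve it.
            return conf.slave_engine or conf.writer_engine
        # Asynchronous replication: only the writer is guaranteed
        # to be up to date.
        return conf.writer_engine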

This sounds like the crux of the matter to me. After some (admittedly
cursory) reading, it seems that galera can use both synchronous and
asynchronous replication. Up until Friday I had only ever considered
synchronous replication, which would not be a problem.

I think opportunistically using synchronous slaves whenever possible
could only be a win. Are there any unpleasant practicalities which might
mean this isn't the case?

However, it sounds to me like there is at least some OpenStack
deployment in production using asynchronous slaves, otherwise the issue
of 'getting behind' wouldn't have come up. We need to understand:

* Are people actually using asynchronous slaves?
* If so, why did they choose to do that?
* What are they using them for?

 
 All of this points to the fact that I really don’t think the
 directives / flags should say anything about which specific database to
 use; using a “slave” or not due to various concerns is dependent on
 backend implementation and configuration. The purpose of