On Sunday, March 11, 2012 10:56:44 PM UTC-7, Michael Bayer wrote: > > > On Mar 11, 2012, at 8:03 PM, George V. Reilly wrote: > > > We ran into a nasty problem with sharding a while back. We came up with > an effective workaround, but we don't fully understand the problem. Perhaps > someone here can provide more insight. > > > > We shard requests to our production MySQL databases using > sqlalchemy.ext.horizontal_shard.ShardedSession. Our query_chooser uses a > consistent hashing scheme on our account_ids to compute shard_ids. In > almost all cases, we know the account_id before performing a query, > therefore a query goes directly to the correct shard and only to that > shard. In the other cases, we're typically trying to find the account_id(s) > given some other key. > > > > The nasty problem arose thus. We made a cross-shard query, which > apparently opened a database connection to more than one of the sharded > databases. The query returned an account_id and our associated SQLAlchemy > Account object. We then added data to the account and committed it — all on > the same scoped_session. Result: the thread (process?) blocks on the > connections to the other databases. > > > > The workaround is to call commit() on the session after performing the > cross-shard query, and then call scoped_session() to make a fresh session > before adding data to the account. > > > > Anyone understand what the underlying problem was? Is it in SQLAlchemy, > MySQL-python, or some lower level? > > > > The pattern that we're moving towards with these cross-shard queries is > to perform them in a distinct session using a contextmanager which commits > the session. Account_ids and other keys are returned from the > contextmanager, but SQLAlchemy objects are not. One (or more) of our > SQLAlchemy Account objects are then created in the main session. I'm > worried about the >1 case, since those accounts may be distributed across > several shards. > > > the Session works like this: each time you execute() on a particular > Connection, that Connection is added to the state of the > SessionTransaction, and remains open in a transaction until the Session > itself is committed, rolled back, or closed. It's in this way that a > Session coordinates multiple operations across many databases, and if > twophase=True it will also call prepare() on all those connections before > the commit. > > So when using ShardedSession, if you hit three databases, they're all in > that session's state. > > Let's say you wanted to close out the first two. That may be fine, but > how do you know that those connections aren't in the Session's state due to > previous operations on those databases ? So it's something of a tricky > problem, to say that you'd like to scan multiple database nodes with a > certain SELECT statement, but then once you find the node you care about, > close out the previous nodes. You'd have to check first that they weren't > already there. > > This is all doable though is not public API, to optimize this operation > would mean you'd need to peek inside of Session.transaction._connections > before and after the operation, then rollback() those connections you know > you don't want anymore and remove them from > Session.transaction._connections. > > So that's what's going on, it's just I think if you try yo experiment with > closing out individual connections from Session.transaction._connections it > may become apparent that this is a bit of an intricate case. > > Overall, I'd likely be using distinct Session objects explicitly in any > case within a horizontal sharding situation; assuming I can make sure that > the scope of a certain request is going to be focused entirely on just one > of those shards, not unlike the solution you came up with. > Thanks, Michael.
-- You received this message because you are subscribed to the Google Groups "sqlalchemy" group. To view this discussion on the web visit https://groups.google.com/d/msg/sqlalchemy/-/xvrlAHxrN3AJ. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/sqlalchemy?hl=en.
