On Mar 11, 2012, at 8:03 PM, George V. Reilly wrote:

> We ran into a nasty problem with sharding a while back. We came up with an 
> effective workaround, but we don't fully understand the problem. Perhaps 
> someone here can provide more insight.
> 
> We shard requests to our production MySQL databases using 
> sqlalchemy.ext.horizontal_shard.ShardedSession. Our query_chooser uses  a 
> consistent hashing scheme on our account_ids to compute shard_ids. In almost 
> all cases, we know the account_id before performing a query, therefore a 
> query goes directly to the correct shard and only to that shard. In the other 
> cases, we're typically trying to find the account_id(s) given some other key.
> 
> The nasty problem arose thus. We made a cross-shard query, which apparently 
> opened a database connection to more than one of the sharded databases. The 
> query returned an account_id and our associated SQLAlchemy Account object. We 
> then added data to the account and committed it — all on the same 
> scoped_session. Result: the thread (process?) blocks on the connections to 
> the other databases.
> 
> The workaround is to call commit() on the session after performing the 
> cross-shard query, and then call scoped_session() to make a fresh session 
> before adding data to the account.
> 
> Anyone understand what the underlying problem was? Is it in SQLAlchemy, 
> MySQL-python, or some lower level?
> 
> The pattern that we're moving towards with these cross-shard queries is to 
> perform them in a distinct session using a contextmanager which commits the 
> session. Account_ids and other keys are returned from the contextmanager, but 
> SQLAlchemy objects are not. One (or more) of our SQLAlchemy Account objects 
> are then created in the main session. I'm worried about the >1 case, since 
> those accounts may be distributed across several shards.


the Session works like this:   each time you execute() on a particular 
Connection, that Connection is added to the state of the SessionTransaction, 
and remains open in a transaction until the Session itself is committed, rolled 
back, or closed.    It's in this way that a Session coordinates multiple 
operations across many databases, and if twophase=True it will also call 
prepare() on all those connections before the commit.

So when using ShardedSession, if you hit three databases, they're all in that 
session's state.

Let's say you wanted to close out the first two.  That may be fine, but how do 
you know that those connections aren't in the Session's state due to previous 
operations on those databases ?   So it's something of a tricky problem, to say 
that you'd like to scan multiple database nodes with a certain SELECT 
statement, but then once you find the node you care about, close out the 
previous nodes.   You'd have to check first that they weren't already there.    
 

This is all doable though is not public API, to optimize this operation would 
mean you'd need to peek inside of Session.transaction._connections before and 
after the operation, then rollback() those connections you know you don't want 
anymore and remove them from Session.transaction._connections.

So that's what's going on, it's just I think if you try yo experiment with 
closing out individual connections from Session.transaction._connections it may 
become apparent that this is a bit of an intricate case.

Overall, I'd likely be using distinct Session objects explicitly in any case 
within a horizontal sharding situation; assuming I can make sure that the scope 
of a certain request is going to be focused entirely on just one of those 
shards, not unlike the solution you came up with.    


-- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/sqlalchemy?hl=en.

Reply via email to