Is this a sane approach to sharding?

Myron Marston Fri, 24 Feb 2012 11:38:01 -0800

I'm working on a project where we plan to do some extensive, dynamic
sharding.  Previously, I've only used ActiveRecord as my SQL ORM for
all ruby projects, but this time around we're considering Sequel since
we're doing stuff that's fairly non-conventional and Sequel's advanced
features seem to be a good fit.  However, I couldn't find a way to use
Sequel's built-in sharding APIs to do what I want, and I came up with
a slightly different approach that is working so far.  I wanted to get
some feedback to see if my approach is sane or if I'm missing some
APIs that would give us what we need in a simpler manner.


First, to explain a bit about our project: we have lots of back-end
services that store gobs of data (typically terabytes, sometimes
petabytes).  Each uses its own datastore that's appropriate for its
data access patterns and needs (for example, one uses Cassandra and
one uses Riak).  We want to enable some new integrated views that
combine data from multiple back end services, and make it possible to
query that data in lots of ways, sort it, filter it, paginate it,
etc.  The data for each user is logically disjoint and updated once a
week, so we're thinking of scaling this middle-tier horizontally by
sharding at the user level, and putting the aggregated data into SQL
databases.  Note that these shards will be created and destroyed
dynamically.  There isn't a static list of shards that will be
available at app environment boot time.  The plan is to have a master
"shards" database that maps users to shards.  The middle-tier API
(likely to be written using Sinatra) will pick the appropriate shard
based on which user it is querying data for.
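
To make the user-to-shard lookup concrete, here is a minimal sketch in Ruby. Every name in it (`ShardDirectory`, `ShardLocation`, `UnknownUserError`, the example host) is hypothetical, and a plain Hash stands in for the master "shards" database, so this only illustrates the shape of the lookup, not an actual schema or implementation.

```ruby
# Hypothetical names throughout; a Hash stands in for the master "shards" database.
ShardLocation = Struct.new(:host, :db_name)

class UnknownUserError < StandardError; end

class ShardDirectory
  def initialize(mapping = {})
    @mapping = mapping # user_id => ShardLocation
  end

  # Record (or replace) the shard that holds a user's data.
  def assign(user_id, host, db_name)
    @mapping[user_id] = ShardLocation.new(host, db_name)
  end

  # Forget a user's shard, e.g. after the shard has been destroyed.
  def unassign(user_id)
    @mapping.delete(user_id)
  end

  # Resolve the shard at request time; fail loudly for unmapped users.
  def locate(user_id)
    @mapping.fetch(user_id) { raise UnknownUserError, "no shard for user #{user_id}" }
  end
end

directory = ShardDirectory.new
directory.assign(42, "db1.example.internal", "user_42")
location = directory.locate(42)
puts "#{location.host}/#{location.db_name}"
```

Because the lookup happens per request, shards can come and go without the API processes having to be told about them in advance.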

I couldn't find a way to use Sequel's standard sharding interface
because you need to know what the shards are at app environment boot
time.  I see methods to dynamically add and remove shards, but given
how frequently new shards will be created and destroyed, it gets messy
real quick (i.e. we'd have to constantly notify our processes to
update their list of shards).

Instead, I want the sharding to be lazy.  At request time, the API can
pick an appropriate shard based on the user.  Here's what I came up
with:

https://gist.github.com/1903020

Essentially, the main API here is `DB.use_shard(host, db_name) { ... }`.
Within the block, the appropriate shard is used for all queries;
outside the block, a `NoCurrentShardError` will be raised.  It
accomplishes this by putting a delegate object in place of the
assigned DB, and defining __getobj__ to return the real
Sequel::Database object so that all method calls are delegated to
that.
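
For readers who can't open the gist, the delegate idea can be sketched roughly as follows. This is not the code from the gist: it is a minimal illustration built on Ruby's stdlib `Delegator`, with a hypothetical `FakeDatabase` standing in for `Sequel::Database` so it runs without the gem, and a hypothetical `ShardProxy` class name. It tracks the current shard in a plain instance variable, so it is not thread-safe as written.

```ruby
require 'delegate'

class NoCurrentShardError < StandardError; end

# Hypothetical stand-in for Sequel::Database so the sketch runs without the gem.
FakeDatabase = Struct.new(:host, :db_name) do
  def run(sql)
    "#{host}/#{db_name}: #{sql}"
  end
end

class ShardProxy < Delegator
  # connector: block that builds a database object for (host, db_name);
  # a real application might call Sequel.connect inside it.
  def initialize(&connector)
    @connector = connector
    @connections = {} # cache of connections, keyed by [host, db_name]
    @current = nil    # plain ivar: NOT thread-safe, illustration only
  end

  # Select a shard for the duration of the block, then restore the previous one.
  def use_shard(host, db_name)
    previous = @current
    @current = (@connections[[host, db_name]] ||= @connector.call(host, db_name))
    yield self
  ensure
    @current = previous
  end

  # Delegator forwards every unknown method to whatever this returns.
  # ::Kernel.raise is used because Delegator hides Kernel's private methods.
  def __getobj__
    @current || ::Kernel.raise(::NoCurrentShardError, "no shard selected; call use_shard first")
  end

  def __setobj__(obj)
    @current = obj
  end
end

DB = ShardProxy.new { |host, db_name| FakeDatabase.new(host, db_name) }
puts DB.use_shard("db1", "user_42") { DB.run("SELECT 1") }
```

Calling `DB.run` outside a `use_shard` block raises `NoCurrentShardError`, because `__getobj__` has no shard to hand back.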

Is this a sane approach to our sharding needs?  Is there a better way
to do this?

Thanks in advance for the help!

-- Myron

-- 
You received this message because you are subscribed to the Google Groups 
"sequel-talk" group.
