I'm working on a project where we plan to do some extensive, dynamic sharding. Previously, I've only used ActiveRecord as my SQL ORM for all ruby projects, but this time around we're considering Sequel since we're doing stuff that's fairly non-conventional and Sequel's advanced features seem to be a good fit. However, I couldn't find a way to use Sequel's built-in sharding APIs to do what I want, and I came up with a slightly different approach that is working so far. I wanted to get some feedback to see if my approach is sane or if I'm missing some APIs that would give us what we need in a simpler manner.
First, to explain a bit about our project: we have lots of back-end services that store gobs of data (typically terabytes, sometime petabytes). Each uses its own datastore that's appropriate for its data access patterns and needs (for example, one uses Cassandra and one uses Riak). We want to enable some new integrated views that combine data from multiple back end services, and make it possible to query that data in lots of ways, sort it, filter it, paginate it, etc. The data for each user is logically disjoint and updated once a a week, so we're thinking of scaling this middle-tier horizontally by sharding at the user level, and putting the aggregated data into SQL databases. Note that these shards will be created and destroyed dynamically. There isn't a static list of shards that will be available at app environment boot time. The plan is to have a master "shards" database that maps users to shards. The middle-tier API (likely to be written using Sinatra) will pick the appropriate shard based on which user it is querying data for. I couldn't find a way to use Sequel's standard sharding interface because you need to know what the shards are at app environment boot time. I see methods to dynamically add and remove shards, but given how frequently new shards will be created and destroyed, it gets messy real quick (i.e. we'd have to constantly notify our processes to update their list of shards). Instead, I want the sharding to be lazy. At request time, the API can pick an appropriate shard based on the user. Here's what I came up with: https://gist.github.com/1903020 Essentially, the main API here is `DB.use_shard(host, db_name) { ... } `. Within the block, the appropriate shard is used for all queries; outside the block, a `NoCurrentShardError` will be raised. It accomplishes this by putting a delegate object in place of the assigned DB, and defining __getobj__ to return the real Sequel::Database object so that all method calls are delegated to that. Is this a sane approach to our sharding needs? Is there a better way to do this? Thanks in advance for the help! -- Myron -- You received this message because you are subscribed to the Google Groups "sequel-talk" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/sequel-talk?hl=en.
