On Tue, Feb 22, 2011 at 2:49 PM, Aaron Morton <aa...@thelastpickle.com> wrote:
>> The single partitioner is "baked in"
> That was my point.
>
> You could perhaps write a partitioner that considers the CF when deciding
> what nodes to put data on. Off the top of my head the partitioner is not told
> about the CF the key is storing in.
>
> Aaron
>
> On 23/02/2011, at 6:01 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> On Mon, Feb 21, 2011 at 5:14 PM, David Boxenhorn <da...@lookin2.com> wrote:
>>> No, that's not what I mean at all.
>>>
>>> That message is about the ability to use different partitioners for
>>> different CFs, say, RandomPartitioner for one, OPP for another.
>>>
>>> I'm talking about defining how many nodes a CF should be distributed over,
>>> which would be useful if you have a lot of nodes and a lot of small CFs
>>> (small relative to the total amount of data).
>>>
>>>
>>> On Mon, Feb 21, 2011 at 9:58 PM, Aaron Morton <aa...@thelastpickle.com>
>>> wrote:
>>>>
>>>> Sounds a bit like this idea
>>>> http://www.mail-archive.com/dev@cassandra.apache.org/msg01799.html
>>>>
>>>> Aaron
>>>>
>>>> On 22/02/2011, at 1:28 AM, David Boxenhorn <da...@lookin2.com> wrote:
>>>>
>>>>> Cassandra is both distributed and replicated. We have Replication Factor
>>>>> but no Distribution Factor!
>>>>>
>>>>> Distribution Factor would define over how many nodes a CF should be
>>>>> distributed.
>>>>>
>>>>> Say you want to support millions of multi-tenant users in clusters with
>>>>> thousands of nodes, where you don't know the user's schema in advance, so
>>>>> you can't have users share CFs.
>>>>>
>>>>> In this case you wouldn't want to spread out each user's Column Families
>>>>> over thousands of nodes! You would want something like: RF=3, DF=10 i.e.
>>>>> distribute each CF over 10 nodes, within those nodes replicate 3 times.
>>>>>
>>>>> One implementation of DF would be to hash the CF name, and use the same
>>>>> strategies defined for RF to choose the N nodes in DF=N.
>>>>>
>>>
>>>
>>
>> The single partitioner is "baked in"
>>
>> Here is a possible solution. Use OOP, but md5 hash your keys client side.
>>
>> This solves that, but when you have keyspaces using OOP but with
>> different key distributions this falls apart.
>
Not to say that this is a bad idea, but it breaks the #1 law of Cassandra: "keep everything balanced". The routine that calculates natural endpoints does not take the CF into account.

Regarding multi-tenancy, I do not think there is a line in the sand between "running N clusters" and multi-tenancy. "Multi-tenancy" is also ambiguous, like "real time". Does multi-tenancy mean efficiently supporting 10-20 CFs, or 20,000? I do not see the Cassandra code base supporting a very large number of CFs, since it was designed around a low number of CFs.

Some may have moved from an RDBMS background, where a "table" looks and works like a "columnfamily". But that is probably not denormalized enough. Many in fact advocate "You only need 1 CF!"
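
To make the DF proposal concrete, here is a rough sketch in Python of "hash the CF name, pick N nodes, then replicate within them". It is not Cassandra code and nothing like it exists in the code base as far as I know. The names md5_token, walk_ring, nodes_for_cf and replicas_for_key, the 100-node ring and the CF name are all made up for illustration, and the clockwise ring walk is only a simplified stand-in for a real replication strategy.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2 ** 128  # md5 yields a 128-bit token


def md5_token(value: str) -> int:
    """Map a string to a position on the ring via md5."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def walk_ring(ring, token, count):
    """Return up to `count` (token, node) pairs, starting at the first node
    whose token is >= `token` and walking clockwise with wrap-around.
    `ring` must be a list of (token, node) pairs sorted by token."""
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token) % len(ring)
    count = min(count, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


def nodes_for_cf(ring, cf_name, df):
    """Hypothetical DF step: hash the CF name and take the next `df` nodes."""
    return walk_ring(ring, md5_token(cf_name), df)


def replicas_for_key(ring, cf_name, row_key, rf, df):
    """Pick `rf` replicas for a row, restricted to the CF's `df` nodes."""
    sub_ring = sorted(nodes_for_cf(ring, cf_name, df))
    return [node for _, node in walk_ring(sub_ring, md5_token(row_key), rf)]


# Illustrative cluster: 100 nodes with evenly spaced tokens.
ring = [(i * RING_SIZE // 100, "node%03d" % i) for i in range(100)]

cf_nodes = nodes_for_cf(ring, "tenant42_users", df=10)
replicas = replicas_for_key(ring, "tenant42_users", "row1", rf=3, df=10)
```

Note that the node set depends only on the CF name, so one large CF puts all of its load on its 10 nodes while the rest of the ring sees none of it. That is the balance problem I mean above.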