Re: Distribution Factor: part of the solution to many-CF problem?

Edward Capriolo Tue, 22 Feb 2011 14:06:29 -0800

On Tue, Feb 22, 2011 at 2:49 PM, Aaron Morton <aa...@thelastpickle.com> wrote:
>> The single partitioner is "baked in"
> That was my point.
>
> You could perhaps write a partitioner that considers the CF when deciding 
> what nodes to put data on. Off the top of my head the partitioner is not told 
> about the CF the key is stored in.
>
> Aaron
>
> On 23/02/2011, at 6:01 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> On Mon, Feb 21, 2011 at 5:14 PM, David Boxenhorn <da...@lookin2.com> wrote:
>>> No, that's not what I mean at all.
>>>
>>> That message is about the ability to use different partitioners for
>>> different CFs, say, RandomPartitioner for one, OPP for another.
>>>
>>> I'm talking about defining how many nodes a CF should be distributed over,
>>> which would be useful if you have a lot of nodes and a lot of small CFs
>>> (small relative to the total amount of data).
>>>
>>>
>>> On Mon, Feb 21, 2011 at 9:58 PM, Aaron Morton <aa...@thelastpickle.com>
>>> wrote:
>>>>
>>>> Sounds a bit like this idea
>>>> http://www.mail-archive.com/dev@cassandra.apache.org/msg01799.html
>>>>
>>>> Aaron
>>>>
>>>> On 22/02/2011, at 1:28 AM, David Boxenhorn <da...@lookin2.com> wrote:
>>>>
>>>>> Cassandra is both distributed and replicated. We have Replication Factor
>>>>> but no Distribution Factor!
>>>>>
>>>>> Distribution Factor would define over how many nodes a CF should be
>>>>> distributed.
>>>>>
>>>>> Say you want to support millions of multi-tenant users in clusters with
>>>>> thousands of nodes, where you don't know the user's schema in advance, so
>>>>> you can't have users share CFs.
>>>>>
>>>>> In this case you wouldn't want to spread out each user's Column Families
>>>>> over thousands of nodes! You would want something like: RF=3, DF=10 i.e.
>>>>> distribute each CF over 10 nodes, within those nodes replicate 3 times.
>>>>>
>>>>> One implementation of DF would be to hash the CF name, and use the same
>>>>> strategies defined for RF to choose the N nodes in DF=N.
>>>>>
>>>
>>>
>>
>> The single partitioner is "baked in"
>>
>> Here is a possible solution. Use OPP, but md5 hash your keys client side.
>>
>> This solves that, but when you have keyspaces using OPP but with
>> different key distributions this falls apart.
>



Not to say that this is a bad idea, but it breaks the #1 law of
Cassandra: "keep everything balanced". The routine that calculates
natural endpoints does not take the CF into account.
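
To make the balance concern concrete, here is a rough sketch in Python of
what a CF-aware endpoint calculation could look like next to a key-only one.
This is not Cassandra code: the ring layout, the function names (token,
build_ring, walk, endpoints_today, endpoints_with_df) and the "hash the CF
name, then walk the ring" rule are all assumptions made for illustration,
loosely following the DF proposal quoted above.

```python
import hashlib
from bisect import bisect_left


def token(value: str) -> int:
    """Hypothetical token function: md5 of the string as an integer."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def build_ring(nodes):
    """Place each node on the ring at the token of its name (an assumption)."""
    return sorted((token(n), n) for n in nodes)


def walk(ring, start_token, count):
    """Return up to `count` nodes clockwise from `start_token`."""
    tokens = [t for t, _ in ring]
    i = bisect_left(tokens, start_token) % len(ring)
    return [ring[(i + k) % len(ring)][1] for k in range(min(count, len(ring)))]


def endpoints_today(ring, key, rf):
    """Key-only placement: the CF plays no part in choosing nodes."""
    return walk(ring, token(key), rf)


def endpoints_with_df(ring, cf_name, key, rf, df):
    """Sketch of the DF idea: hash the CF name to pick `df` nodes,
    then place the key on `rf` replicas inside that subset."""
    subset = walk(ring, token(cf_name), df)
    return walk(build_ring(subset), token(key), rf)


if __name__ == "__main__":
    ring = build_ring([f"node{i}" for i in range(100)])
    print(endpoints_today(ring, "some-key", 3))
    print(endpoints_with_df(ring, "tenant42_users", "some-key", 3, 10))
```

With key-only placement every CF spreads over the whole ring. With the DF
variant all of a CF's rows land on the same df nodes, so a few large or hot
CFs can load those nodes far more than the rest of the cluster.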

Regarding multi-tenancy, I do not think there is a line in the sand
between "running N clusters" and multi-tenancy.

"Multi-tenancy" is also ambiguous, like "real time". Does multi-tenancy
mean efficiently supporting 10-20 CFs or 20,000? I do not see the
Cassandra code base supporting a very large number of CFs, since it
was designed around a low number of CFs!

Some may have moved from an RDBMS background, where a "table"
looks/works like a "columnfamily". But that is probably not
denormalized enough. Many in fact advocate "You only need 1 CF!"
