Re: Practical limit on number of column families

Brian Sam-Bodden Tue, 01 Mar 2016 15:27:18 -0800

Eric,
  Is keyspace-per-tenant as a multitenancy solution as bad as the
many-tables pattern? Is the memory overhead of keyspaces as heavy as that
of tables?


Cheers,
Brian

On Tuesday, March 1, 2016, Eric Stevens <migh...@gmail.com> wrote:

> It's definitely not true for every use case of a large number of tables,
> but in many cases where you'd be tempted to do that, you can instead take
> whatever would have driven your table naming and add it as a column in the
> partition key of a smaller number of tables. This is especially true
> if you're looking to solve multi-tenancy, unless you let your tenants
> dynamically drive your schema (which is a separate can of worms).
>
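
A minimal sketch of the suggestion quoted above, contrasting a table per tenant with a single table whose partition key carries the tenant. It only builds CQL strings and does not connect to a cluster. The keyspace `app`, the table `events`, and the columns `tenant_id`, `entity_id`, `event_time`, and `payload` are hypothetical names chosen for illustration; none of them come from this thread.

```python
# Hypothetical schema: keyspace "app", table "events", and all column names
# are illustrative assumptions, not taken from the thread.

def per_tenant_ddl(tenants):
    """Many-tables pattern: one CREATE TABLE statement per tenant."""
    return [
        f"CREATE TABLE app.events_{t} ("
        "entity_id text, event_time timestamp, payload text, "
        "PRIMARY KEY (entity_id, event_time))"
        for t in tenants
    ]


def shared_ddl():
    """Single-table pattern: the tenant becomes part of the partition key."""
    return (
        "CREATE TABLE app.events ("
        "tenant_id text, entity_id text, event_time timestamp, payload text, "
        "PRIMARY KEY ((tenant_id, entity_id), event_time))"
    )


def shared_query():
    """Reads stay scoped to one tenant by supplying the full partition key."""
    return (
        "SELECT event_time, payload FROM app.events "
        "WHERE tenant_id = ? AND entity_id = ?"
    )


tenants = [f"t{i}" for i in range(1000)]
many = per_tenant_ddl(tenants)

print(len(many), "CREATE TABLE statements in the per-tenant layout")
print("1 CREATE TABLE statement in the shared layout:")
print(shared_ddl())
print(shared_query())
```

In the shared layout the number of tables stays constant as tenants are added; what grows is the number of partitions, which is the dimension the rest of this thread describes Cassandra as scaling in.
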
> On Tue, Mar 1, 2016 at 9:08 AM Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>> I don't think Cassandra was "purposefully developed" for some target
>> number of tables - there is no evidence of any such explicit intent.
>> Instead, it would be fair to say that Cassandra was "not purposefully
>> developed" with a goal of supporting "large numbers of tables." Sometimes
>> features and capabilities come for free or as a side effect of the
>> technologies used, but usually specific features and specific capabilities
>> (such as large numbers of tables) require explicit intent and explicit
>> effort.
>>
>> One could indeed endeavor to design a data store (I'm not even sure it
>> would still be considered a database per se) that supported either large
>> numbers of tables or an additional level of storage model in between table
>> and row (call it "group" maybe or "sub-table".) But obviously Cassandra was
>> not designed with that goal in mind.
>>
>> Traditionally, a "table" is a defined relation over a set of data.
>> Relation and data are distinct concepts. And a relation name is not simply
>> a Java-style "object". A relation (table) name is supposed to represent an
>> abstraction or entity type, while essentially all of the cases I have heard
>> of for wanting thousands (or even hundreds) of tables are trying to use
>> table as more of a container for a group of rows for a specific entity
>> instance rather than a distinct entity type. Granted, Cassandra is not
>> obligated to be limited to the relational model, but Cassandra, especially
>> CQL, is intentionally modeled reasonably closely on the relational model
>> in terms of the data modeling abstractions even though the storage engine
>> is designed to scale across nodes.
>>
>> You could file a Jira requesting such a feature improvement. And then we
>> would see if sentiment has shifted over the years.
>>
>> The key thing is to offer up a use case that warrants support for large
>> numbers of tables. So far, it has usually been the case that the perceived
>> need for separate tables could easily be met using clustering columns of a
>> single table.
>>
>> Seriously, if you guys can define a legitimate use case that can't easily
>> be handled by a single table, that could get the discussion started.
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez
>> <fernando.jime...@wealth-port.com> wrote:
>>
>>> Hi Jack
>>>
>>> Being purposefully developed to only handle up to “a few hundred” tables
>>> is reason enough. I accept that, and likely a use case with many tables was
>>> never really considered. But I would still like to understand the design
>>> choices made so perhaps we gain some confidence level in this upper limit
>>> in the number of tables. The best estimate we have so far is “a few
>>> hundred” which is a bit vague.
>>>
>>> Regarding scaling, I’m not talking about scaling in terms of data
>>> volume, but in terms of how the data is structured. One thousand tables
>>> with one row each is the same data volume as one table with one thousand
>>> rows, excluding any data structures required to maintain the extra tables.
>>> But whereas the first seems likely to bring a Cassandra cluster to its
>>> knees, the second will run happily on a single-node cluster on a low-end
>>> machine.
>>>
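
The comparison above can be put into rough arithmetic. This is a sketch under one loud assumption: `ASSUMED_OVERHEAD_MB_PER_TABLE` is a placeholder, not a measured or documented figure, and the real per-table cost depends on the Cassandra version and configuration. The point is only that the fixed cost follows the table count while the row count stays the same.

```python
# ASSUMPTION: placeholder value for per-table memory overhead on each node.
# It is not taken from the thread or from Cassandra documentation; measure
# the real figure on your own cluster before relying on any estimate.
ASSUMED_OVERHEAD_MB_PER_TABLE = 1.0


def fixed_overhead_mb(num_tables, per_table_mb=ASSUMED_OVERHEAD_MB_PER_TABLE):
    """Memory spent per node on table bookkeeping, independent of row count."""
    return num_tables * per_table_mb


# Same total data volume (1000 rows) in two different layouts.
for tables, rows_per_table in [(1, 1000), (1000, 1)]:
    total_rows = tables * rows_per_table
    print(
        f"{tables:>5} tables x {rows_per_table:>5} rows each = {total_rows} rows, "
        f"~{fixed_overhead_mb(tables):.0f} MB assumed fixed overhead per node"
    )
```

Because every node holds every table's definition, this cost is paid on each node, so adding nodes does not reduce it.
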
>>> We will design our code to use a single table to avoid having nightmares
>>> with this issue. But if there is any authoritative documentation on this
>>> characteristic of Cassandra, I would love to know more.
>>>
>>> FJ
>>>
>>>
>>> On 01 Mar 2016, at 14:23, Jack Krupansky <jack.krupan...@gmail.com>
>>> wrote:
>>>
>>> I don't think there are any "reasons behind it." It is simply empirical
>>> experience - as reported here.
>>>
>>> Cassandra scales in two dimensions - number of rows per node and number
>>> of nodes. If some source of information led you to believe otherwise,
>>> please point out the source so that we can endeavor to correct it.
>>>
>>> The exact number of rows per node and tables per node will always have
>>> to be evaluated empirically, with a proof-of-concept implementation, since
>>> it all depends on the mix of capabilities of your hardware combined with
>>> your specific data model, your specific data values, your specific access
>>> patterns, and your specific load. It also depends on your own personal
>>> tolerance for degradation of latency and throughput - some people might
>>> find a given set of performance metrics acceptable while others might not.
>>>
>>> -- Jack Krupansky
>>>
>>> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez
>>> <fernando.jime...@wealth-port.com> wrote:
>>>
>>>> Hi Tommaso
>>>>
>>>> It’s not that I _need_ a large number of tables. This approach maps
>>>> easily to the problem we are trying to solve, but it’s becoming clear it’s
>>>> not the right approach.
>>>>
>>>> At the moment I’m trying to understand the limitations in Cassandra
>>>> regarding number of Tables and the reasons behind it. I’ve come to the
>>>> email list as my Google-fu is not giving me what I’m looking for :(
>>>>
>>>> FJ
>>>>
>>>>
>>>>
>>>> On 01 Mar 2016, at 09:36, tommaso barbugli <tbarbu...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Fernando,
>>>>
>>>> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, and it
>>>> was a real pain in terms of operations. Repairs were terribly slow, C*
>>>> startup slowed down, and in general tracking table metrics became a bit
>>>> more work.
>>>> Why do you need this high number of tables?
>>>>
>>>> Tommaso
>>>>
>>>> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez
>>>> <fernando.jime...@wealth-port.com> wrote:
>>>>
>>>>> Hi Jack
>>>>>
>>>>> By entry I mean row
>>>>>
>>>>> Apologies for the “obsolete terminology”. When I first looked at
>>>>> Cassandra it was still on CQL2, and now that I’m looking at it again I’ve
>>>>> defaulted to the terms I already knew. I will bear it in mind and call
>>>>> them tables from now on.
>>>>>
>>>>> Is there any documentation about this limit? For example, I’d be keen
>>>>> to know how much memory is consumed per table, and I’m also curious about
>>>>> the reasons for keeping this in memory. I’m trying to understand the
>>>>> limitations here, rather than challenge them.
>>>>>
>>>>> So far I found nothing in my search, hence why I had to resort to some
>>>>> “load testing” to see what happens when you push the table count high
>>>>>
>>>>> Thanks
>>>>> FJ
>>>>>
>>>>>
>>>>> On 01 Mar 2016, at 06:23, Jack Krupansky <jack.krupan...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>>>>
>>>>> You are using the obsolete terminology of CQL2 and Thrift - column
>>>>> family. With CQL3 you should be creating "tables". The practical
>>>>> recommendation of an upper limit of a few hundred tables across all
>>>>> keyspaces remains.
>>>>>
>>>>> Technically you can go higher and technically you can reduce the
>>>>> overhead per table (an undocumented Jira - intentionally undocumented
>>>>> since it is strongly not recommended), but... it is unlikely that you
>>>>> will be happy with the results.
>>>>>
>>>>> What is the nature of the use case?
>>>>>
>>>>> You basically have two choices: an additional clustering column to
>>>>> distinguish categories of table, or separate clusters for every few
>>>>> hundred tables.
>>>>>
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez
>>>>> <fernando.jime...@wealth-port.com> wrote:
>>>>>
>>>>>> Hi all
>>>>>>
>>>>>> I have a use case for Cassandra that would require creating a large
>>>>>> number of column families. I have found references to early versions of
>>>>>> Cassandra where each column family would require a fixed amount of memory
>>>>>> on all nodes, effectively imposing an upper limit on the total number of
>>>>>> CFs. I have also seen rumblings that this may have been fixed in later
>>>>>> versions.
>>>>>>
>>>>>> To put the question to rest, I have set up a DSE sandbox and created
>>>>>> some code to generate column families populated with 3,000 entries each.
>>>>>>
>>>>>> Unfortunately I have now hit this issue:
>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-9291
>>>>>>
>>>>>> So I will have to retest against Cassandra 3.0 instead.
>>>>>>
>>>>>> However, I would like to understand the limitations regarding
>>>>>> creation of column families.
>>>>>>
>>>>>> * Is there a practical upper limit?
>>>>>> * Is this a fixed limit, or does it scale as more nodes are added
>>>>>> into the cluster?
>>>>>> * Is there a difference between one keyspace with thousands of column
>>>>>> families, vs thousands of keyspaces with only a few column families each?
>>>>>>
>>>>>> I haven’t found any hard evidence/documentation to help me here, but
>>>>>> if you can point me in the right direction, I will oblige and RTFM away.
>>>>>>
>>>>>> Many thanks for your help!
>>>>>>
>>>>>> Cheers
>>>>>> FJ

-- 
Cheers,
Brian
http://www.integrallis.com
