Note, as I mentioned mid-post, Thrift also supports async nowadays (there was a 
recent discussion on cassandra-dev, and the choice was not to move to it).

I think the binary protocol is the way forward; CQL3 needs some new features, 
or there need to be some other types of requests you can make over the binary 
protocol.

On Jun 13, 2014, at 5:51 PM, Peter Lin <wool...@gmail.com> wrote:

> 
> without a doubt there are nice features of CQL3, like notifications and async. I 
> want to see CQL3 mature and handle all the use cases that Thrift handles 
> easily today. It's to everyone's benefit to work together and improve CQL3.
> 
> Another benefit of Thrift drivers today is being able to use an object API with 
> generics. For tool builders, this is especially useful. Not everyone wants to 
> write tools, but I do, so it matters to me.
> 
> 
> On Fri, Jun 13, 2014 at 6:39 PM, Laing, Michael <michael.la...@nytimes.com> 
> wrote:
> Just to add 2 more cents... :)
> 
> The CQL3 protocol is asynchronous. This can provide a substantial throughput 
> increase, according to my benchmarking, when one uses non-blocking techniques.
> 
> It is also peer-to-peer. Hence the server can generate events to send to the 
> client, e.g. schema changes - in general, 'triggers' become possible.
> 
> ml
> 
> 
> On Fri, Jun 13, 2014 at 6:21 PM, graham sanderson <gra...@vast.com> wrote:
> My 2 cents…
> 
> A motivation for CQL3, AFAIK, was to make Cassandra more familiar to SQL users. 
> This is a valid goal, and it works well in many cases.
> Equally, there are use cases (that some might find ugly) where Cassandra is 
> chosen explicitly because of the sorts of things you can do at the thrift 
> level, which aren’t (currently) exposed via CQL3.
> 
> To Robert’s point earlier - "Rational people should presume that Thrift 
> support must eventually disappear”… he is probably right (though frankly I’d 
> rather the non-blocking thrift version were added instead). However, if we do 
> get rid of the thrift interface, it needs to be at a time when CQLn is 
> capable of expressing all the things you could do via the thrift API. Note, I 
> need to go look and see whether the non-blocking thrift version also requires 
> materializing the entire thrift object in memory.
> 
> On Jun 13, 2014, at 4:55 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
> 
>> There are always the pros and the cons with a querying language, as always.
>> 
>> But as far as I can see, the advantages of Thrift I can see over CQL3 are:
>> 
>>  1) Thrift requires a little less decoding server-side (a difference of 
>> around 10% in CPU usage).
>> 
>>  2) Thrift uses more "compact" storage, because CQL3 needs to add extra 
>> "marker" columns to guarantee the existence of the primary key. It is worse 
>> when you use clustering columns, because for each distinct clustering group 
>> you have a related "marker" column.
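>> To make point 2) concrete, here is a sketch (table and column names are 
>> hypothetical, and the cell layout is a simplified illustration) of how CQL3 
>> stores one row internally, with the extra "marker" cell per clustering group:

```cql
CREATE TABLE events (
    id      bigint,
    bucket  int,
    payload text,
    PRIMARY KEY (id, bucket)
);

-- After: INSERT INTO events (id, bucket, payload) VALUES (1, 10, 'x');
-- the underlying storage row looks roughly like:
--   RowKey: 1
--     => (name=10:,        value=)      -- empty "marker" cell for clustering group 10
--     => (name=10:payload, value='x')
```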
>> 
>>  That being said, point 1) is not really an issue, since most of the time 
>> nodes are more I/O-bound than CPU-bound. Only in extreme cases, where you 
>> have a very high read rate with data that fits entirely in memory, will you 
>> notice the difference.
>> 
>>  For point 2), this is a small trade-off for having access to a query language 
>> and being able to do slice queries using the WHERE clause. Some like it, 
>> others hate it; it's just a question of taste.  Please note that the "waste" 
>> in disk space is somewhat mitigated by compression.
>> 
>>  Long story short, I think Thrift may have appropriate usage, but only in very 
>> few use cases. Recently a lot of improvements and features have been added to 
>> CQL3, so it should be considered the first choice for most users; if they 
>> fall into those few use cases, they can switch back to Thrift.
>> 
>> My 2 cents
>> 
>> 
>> On Fri, Jun 13, 2014 at 11:43 PM, Peter Lin <wool...@gmail.com> wrote:
>> 
>> With a text-based query approach like CQL, you lose the type with dynamic 
>> columns. Yes, we're storing it as bytes, but it is simpler and easier with 
>> Thrift to do these types of things.
>> 
>> I like CQL3 and what it does, but text-based query languages make certain 
>> dynamic-schema use cases painful. Having used and built ORMs, I can say they are 
>> poorly suited to dynamic schemas. If you've never had to write an ORM to 
>> handle dynamic user-defined schemas at runtime, it's tough to see where the 
>> problems arise and how that makes life painful.
>> 
>> Just to be clear, I'm not saying "don't use CQL3" or "CQL3 is bad". I'm 
>> saying CQL3 is good for certain kinds of use cases and Thrift is good at 
>> certain use cases. People need to look at what and how they're storing data 
>> and do what makes the most sense to them. Slavishly following CQL3 doesn't 
>> make any sense to me.
>>  
>> 
>> 
>> On Fri, Jun 13, 2014 at 5:30 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>> "the validation type is set to bytes, and my code is type safe, so it knows 
>> which serializers to use. Those dynamic columns are driven off the types in 
>> Java."  --> Correct. However, you are still bound by the column comparator 
>> type, which must be fixed (unless, again, you set it to bytes, in which case 
>> you lose the ordering and sorting features).
>> 
>>  Basically, what you are doing is telling Cassandra to save data in the cells 
>> as raw bytes; the serialization is taken care of client-side using the 
>> appropriate serializer. This is a perfectly valid strategy.
>> 
>>  But how is it different from using CQL3, setting the value type to "blob" 
>> (equivalent to bytes), and taking care of the serialization client-side as 
>> well? You can even imagine saving the value in JSON format and setting the 
>> type to "text".
>> 
>>  Really, I don't see why CQL3 cannot achieve the scenario you describe.
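>> To make it concrete, here is a sketch of the blob-based equivalent in CQL3 
>> (table and column names are hypothetical):

```cql
-- Dynamic "columns" as clustering rows; values serialized client-side.
CREATE TABLE dynamic_data (
    partition_key bigint,
    column_name   text,
    column_value  blob,   -- raw bytes; the client picks the serializer per column
    PRIMARY KEY (partition_key, column_name)
);
```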
>> 
>>  For the record, when you create a table in CQL3 as follow:
>> 
>>  CREATE TABLE user (
>>      id bigint PRIMARY KEY,
>>      firstname text,
>>      lastname text,
>>      last_connection timestamp,
>>      ....);
>> 
>>  C* will create a column family with validation type = bytes to accommodate 
>> the timestamp and text types of the firstname, lastname and last_connection 
>> columns. Basically, the CQL3 engine is doing the serialization server-side 
>> for you.
>> 
>> 
>> On Fri, Jun 13, 2014 at 11:19 PM, Peter Lin <wool...@gmail.com> wrote:
>> 
>> the validation type is set to bytes, and my code is type safe, so it knows 
>> which serializers to use. Those dynamic columns are driven off the types in 
>> Java.
>> 
>> Having said that, CQL3 does have a new custom type feature, but the 
>> documentation on how it actually works is basically non-existent. One 
>> could also modify CQL so that insert statements give Cassandra hints 
>> about the type, but I'm not aware of anyone enhancing CQL3 to do that.
>> 
>> I realize my kind of use case is a bit unique, but I do know of others that 
>> are doing similar kinds of things.
>> 
>> 
>> On Fri, Jun 13, 2014 at 5:11 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>> In thrift, when creating a column family, you need to define
>> 
>> 1) the row/partition key type
>> 2) the column comparator type
>> 3) the validation type for the actual value (cell in CQL3 terminology)
>> 
>> Unless you use the "dynamic composites" feature, which does not exist (and 
>> probably won't) in CQL3, I don't see how you can have columns with 
>> "different types" in the same row/partition.
>> 
>> 
>> On Fri, Jun 13, 2014 at 11:06 PM, Peter Lin <wool...@gmail.com> wrote:
>> 
>> when I say dynamic column, I mean non-static columns of different types 
>> within the same row. Some could be an object or one of the defined datatypes.
>> 
>> with thrift I use the appropriate serializer to handle these dynamic columns.
>> 
>> 
>> On Fri, Jun 13, 2014 at 4:55 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>> Well, before discussing "dynamic columns", we should first define the term 
>> clearly. What do people mean by "dynamic columns" exactly? Is it 
>> the ability to add many columns "of the same type" to an existing physical row? 
>> If yes, then CQL3 does support it, with clustering columns. 
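>> For instance (hypothetical names), a wide row with "dynamic columns" modeled 
>> via a clustering column:

```cql
CREATE TABLE wide_row (
    row_key   bigint,
    col_name  text,   -- each distinct col_name acts as one "dynamic column"
    col_value text,
    PRIMARY KEY (row_key, col_name)
);

-- Adding a "column" is just an insert, with no schema change:
-- INSERT INTO wide_row (row_key, col_name, col_value) VALUES (1, 'age', '42');
```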
>> 
>> 
>> On Fri, Jun 13, 2014 at 10:36 PM, Mark Greene <green...@gmail.com> wrote:
>> Yeah, I don't anticipate more than 1000 properties, well under that in fact. I 
>> guess the trade-off of using the clustered columns is that I'd have a table 
>> that would be tall and skinny, which also has its challenges w/r/t memory. 
>> 
>> I'll look into your suggestion a bit more and consider some others around a 
>> hybrid of CQL and Thrift (where necessary). But from a newb's perspective, I 
>> sense the community is unsettled around this concept of truly dynamic 
>> columns. Coming from an HBase background, it's a consideration I didn't 
>> anticipate having to evaluate.
>> 
>> 
>> --
>> about.me
>> 
>> 
>> On Fri, Jun 13, 2014 at 4:19 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>> Hi Mark
>> 
>>  I believe that in your table you want to have some "common" fields that 
>> will be there whatever the customer is, and other fields that are entirely 
>> customer-dependent, isn't it?
>> 
>>  In this case, creating a table with static columns for the common fields 
>> and a clustering column representing all custom fields defined by a customer 
>> could be a solution (see here for static column: 
>> https://issues.apache.org/jira/browse/CASSANDRA-6561 )
>> 
>> CREATE TABLE user_data (
>>    user_id bigint,
>>    user_firstname text static,
>>    user_lastname text static,
>>    ...
>>    custom_property_name text,
>>    custom_property_value text,
>>    PRIMARY KEY(user_id, custom_property_name, custom_property_value));
>> 
>>  Please note that with this solution you need to have "at least one" custom 
>> property per customer to make it work.
>> 
>>  The only thing to take care of is the type of custom_property_value. You 
>> need to define it once and for all. To accommodate dynamic types, you can 
>> either save the value as blob, or as text (JSON), and take care of the 
>> serialization/deserialization yourself on the client side.
>> 
>>  As an alternative, you can save custom properties in a map, provided that 
>> their number is not too large. But considering the business case of a CRM, I 
>> believe it's quite rare that a user has more than 1000 custom properties, 
>> isn't it?
>> 
>> 
>> 
>> On Fri, Jun 13, 2014 at 10:03 PM, Mark Greene <green...@gmail.com> wrote:
>> My use case requires the support of arbitrary columns much like a CRM. My 
>> users can define 'custom' fields within the application. Ideally I wouldn't 
>> have to change the schema at all, which is why I like the old thrift 
>> approach rather than the CQL approach. 
>> 
>> Having said all that, I'd be willing to adapt my API to make explicit schema 
>> changes to Cassandra whenever my user makes a change to their custom fields 
>> if that's an accepted practice. 
>> 
>> Ultimately, I'm trying to figure out whether the Cassandra community intends to 
>> support true schemaless use cases in the future.
>> 
>> 
>> 
>> On Fri, Jun 13, 2014 at 3:47 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>> "This strikes me as bad practice in the world of multi-tenant systems. I 
>> don't want to create a table per customer. So I'm wondering if dynamically 
>> modifying the table is an accepted practice?"  --> Can you give some details 
>> about your use case? How would you "alter" a table structure to adapt it to 
>> a new customer?
>> 
>> Wouldn't it be better to model your table so that it supports the 
>> addition/removal of customers?
>> 
>> 
>> 
>> On Fri, Jun 13, 2014 at 9:00 PM, Mark Greene <green...@gmail.com> wrote:
>> Thanks DuyHai,
>> 
>> I have a follow up question to #2. You mentioned ideally I would create a 
>> new table instead of mutating an existing one. 
>> 
>> This strikes me as bad practice in the world of multi-tenant systems. I 
>> don't want to create a table per customer. So I'm wondering if dynamically 
>> modifying the table is an accepted practice?
>> 
>> 
>> 
>> On Fri, Jun 13, 2014 at 2:54 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>> Hello Mark
>> 
>>  Dynamic columns, as you said, are perfectly supported by CQL3 via 
>> clustering columns. And no, using collections for storing dynamic data is a 
>> very bad idea if the cardinality is very high (>> 1000 elements).
>> 
>> 1)  Is using Thrift a valid approach in the era of CQL?  --> Less and less. 
>> Unless you are looking for extreme performance, you're better off choosing 
>> CQL3. The ease of programming and querying with CQL3 is worth the small 
>> CPU overhead.
>> 
>> 2) If CQL is the best practice, should I alter the schema at runtime when I 
>> detect I need to do a schema mutation?  --> Ideally you should not alter the 
>> schema, but create a new table to adapt to your changing requirements. 
>> 
>> 3) If I utilize CQL collections, will Cassandra page the entire thing into 
>> the heap?  --> Of course. All collections and maps in Cassandra are eagerly 
>> loaded entirely into memory on the server side. That's why it is recommended 
>> to limit their cardinality to ~1000 elements.
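>> As a sketch of the map alternative (table and column names are hypothetical), 
>> keeping in mind that the whole map is read into memory server-side:

```cql
CREATE TABLE user_profile (
    user_id           bigint PRIMARY KEY,
    custom_properties map<text, text>   -- keep cardinality well under ~1000 entries
);

-- Adding or changing a property updates a single map entry:
-- UPDATE user_profile SET custom_properties['color'] = 'blue' WHERE user_id = 1;
```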
>> 
>> 
>> On Fri, Jun 13, 2014 at 8:33 PM, Mark Greene <green...@gmail.com> wrote:
>> I'm looking for some best practices w/r/t supporting arbitrary columns. It 
>> seems from the docs I've read around CQL that they are supported in some 
>> capacity via collections but you can't exceed 64K in size. For my 
>> requirements that would cause problems. 
>> 
>> So my questions are:
>> 
>> 1)  Is using Thrift a valid approach in the era of CQL? 
>> 
>> 2) If CQL is the best practice, should I alter the schema at runtime when I 
>> detect I need to do a schema mutation?
>> 
>> 3) If I utilize CQL collections, will Cassandra page the entire thing into 
>> the heap?
>> 
>> My data model is akin to a CRM, arbitrary column definitions per customer.
>> 
>> 
>> Cheers,
>> Mark
>> 
> 
> 
> 
