Edward Capriolo created CASSANDRA-4815:
------------------------------------------

             Summary: Make CQL3 work naturally with wide rows
                 Key: CASSANDRA-4815
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4815
             Project: Cassandra
          Issue Type: Wish
            Reporter: Edward Capriolo


I find that CQL3 is quite obtuse and does not give me a language that is useful
for accessing my data. First, let's point out how we should design Cassandra
data.

1) Denormalize
2) Eliminate seeks
3) Design for read
4) Optimize for blind writes

Here is a schema that abides by these tried-and-tested rules, which large
production users are employing today.
Say we have a table of movie objects:

Movie
Name
Description
-< tags   (string)
-< credits composite(role string, name string )
-1 likesToday
-1 blacklisted

The above structure is a movie. Notice that it holds a mix of static and
dynamic columns, but the overall number of columns is not very large (even if
it were larger, that would be OK as well). Notice also that this table is not
just a single one-to-many relationship: it has 1-to-1 data and two sets of
1-to-many data.

The schema today is declared something like this:

create column family movies
with comparator = UTF8Type and
  column_metadata =
  [
    {column_name: blacklisted, validation_class: Int32Type},
    {column_name: likesToday, validation_class: LongType},
    {column_name: description, validation_class: UTF8Type}
  ];

We should be able to insert data like this:
set movies['Cassandra Database, not looking for a seQL']['blacklisted']=1;
set movies['Cassandra Database, not looking for a seQL']['likesToday']=34;
set movies['Cassandra Database, not looking for a seQL']['credits-dir']='director:asf';
set movies['Cassandra Database, not looking for a seQL']['credits-jir']='jiraguy:bob';
set movies['Cassandra Database, not looking for a seQL']['tags-action']='';
set movies['Cassandra Database, not looking for a seQL']['tags-adventure']='';
set movies['Cassandra Database, not looking for a seQL']['tags-romance']='';
set movies['Cassandra Database, not looking for a seQL']['tags-programming']='';

This is the correct way to do it: one seek finds all the information related to
a movie. As long as this row does not get "large", there is no reason to
optimize by breaking the data into other column families. (Notice that you
cannot transpose this, because movies holds two 1-to-many relationships of
potentially different types.)
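
To make the layout concrete, here is a rough Python sketch (a toy model, not
Cassandra code) of the single row above as a map of column name to value. The
names `row` and `columns_with_prefix` are made up for illustration. It only
shows that one row read is enough to recover the 1-to-1 fields and both
1-to-many sets.

```python
# Toy model of one wide row: column name -> value.
# Cassandra would keep these sorted by the comparator; here we sort on read.
row = {
    "blacklisted": 1,
    "likesToday": 34,
    "credits-dir": "director:asf",
    "credits-jir": "jiraguy:bob",
    "tags-action": "",
    "tags-adventure": "",
    "tags-romance": "",
    "tags-programming": "",
}


def columns_with_prefix(row, prefix):
    """Return (name, value) pairs whose column name starts with prefix, sorted by name."""
    return [(name, row[name]) for name in sorted(row) if name.startswith(prefix)]


# Both 1-to-many sets come out of the same row, with no second lookup.
tags = [name[len("tags-"):] for name, _ in columns_with_prefix(row, "tags-")]
credits = [value for _, value in columns_with_prefix(row, "credits-")]

print(tags)
print(credits)
```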

Let's look at the CQL3 way to do this design:

First, contrary to the original design of Cassandra, CQL does not like wide
rows. It also does not have a good way of dealing with dynamic columns together
with static columns.

You have two options:

Option 1: lose all schema
create table movies ( name text, column blob, value blob,
primary key (name, column) ) with compact storage;

This method is not so hot: we have now lost all our validators, and by the way
you have to physically shut down everything, rename files, and recreate your
schema if you want to inform Cassandra that a current table should be compact.
This could at the very least be just a metadata change. You also cannot add
column schema.

Option 2: Normalize (even worse)

create table movie ( name text, description text, likestoday int,
blacklisted int, primary key (name) );
create table moviecredits ( name text, role text, personname text,
primary key (name, role) );
create table movietags ( name text, tag text, primary key (name, tag) );

This is a terrible design. Of the 4 key characteristics of how Cassandra data
should be designed, it fails 3.
It does not:
1) Denormalize
2) Eliminate seeks
3) Design for read

Why is Cassandra steering toward this course, by making a language that does 
not understand wide rows?

So what can be done? My suggestions: 

Cassandra needs to lose the COMPACT STORAGE conversions. Each table needs a
"virtual view" that is compact storage, with no work to migrate data or
recreate schemas. Every table should have a compact view for schemaless access,
or a simple query hint like /*transposed*/ should make this change.

Metadata should be definable by regex. For example, all columns named "tag*"
are of type string.
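
As a rough illustration of what regex-defined metadata could look like, here is
a small Python sketch. It is not an existing Cassandra feature or API: the
`VALIDATORS` table, the `validator_for` helper, the first-match-wins rule, and
the fallback to raw bytes are all assumptions made for the example.

```python
import re

# Hypothetical pattern -> type table; the first matching pattern wins.
VALIDATORS = [
    (re.compile(r"tag.*"), str),                  # all columns named "tag*" are strings
    (re.compile(r"credits-.*"), str),
    (re.compile(r"blacklisted|likesToday"), int),
]


def validator_for(column_name):
    """Pick the type for a column by matching its full name against each pattern."""
    for pattern, validator in VALIDATORS:
        if pattern.fullmatch(column_name):
            return validator
    return bytes  # no pattern matched: treat the value as an untyped blob


print(validator_for("tags-action"))
print(validator_for("likesToday"))
print(validator_for("something-else"))
```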

CQL should have the column[slice_start] .. column[slice_end] operator from 
cql2. 
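
For readers who have not used that operator, here is a toy Python sketch of the
idea: a range read over the sorted column names of one row. It assumes
inclusive bounds and plain string ordering; `column_slice` and the sample data
are hypothetical and only model the behaviour, they are not how Cassandra
implements it.

```python
from bisect import bisect_left, bisect_right


def column_slice(columns, slice_start, slice_end):
    """columns: list of (name, value) sorted by name.
    Return the pairs with slice_start <= name <= slice_end."""
    names = [name for name, _ in columns]
    lo = bisect_left(names, slice_start)   # first name >= slice_start
    hi = bisect_right(names, slice_end)    # one past the last name <= slice_end
    return columns[lo:hi]


columns = sorted({
    "blacklisted": 1,
    "likesToday": 34,
    "credits-dir": "director:asf",
    "tags-action": "",
    "tags-romance": "",
}.items())

# Read only the tag columns; "~" sorts after the lowercase ASCII letters.
tag_columns = column_slice(columns, "tags-", "tags-~")
print([name for name, _ in tag_columns])
```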

CQL should support current users. Users should not have to switch between CQL
versions, and possibly Thrift, to work with wide rows. The language should work
for them even if it is not expressly designed for them. Some of these features
are already part of CQL2, so they should be carried over.

Also, what must not happen is for someone to make a hand-waving statement
like "Once we have collection types we will not need wide rows". This request
is to satisfy current users of Cassandra, not future or theoretical ones.
Solutions should not involve physically migrating data in any way, and they
should not involve telling someone to do something they are already doing in a
much different way. The suggestions should revolve around making the query
language work well with existing data.
