Re: Is SuperColumn necessary?
Hi,

Can we do a range search on keys in ID:ID format, or would the API treat such a key as a single ID? Can it split on ':'? If not, how can we avoid using supercolumns where we need to associate 'n' rows with a single ID? For example:

CatID1 - articleID1
CatID1 - articleID2
CatID1 - articleID3
CatID1 - articleID4

How can we map such scenarios with simple column families?

Rgds.

On Tue, May 11, 2010 at 2:11 PM, Torsten Curdt tcu...@vafer.org wrote:

Exactly.

On Tue, May 11, 2010 at 10:20, David Boxenhorn da...@lookin2.com wrote:

Don't think of it as getting rid of supercolumns. Think of it as adding superdupercolumns, supertriplecolumns, etc. Or, in sparse array terminology:

array[dim1][dim2][dim3]...[dimN] = value

Or, as said above:

<Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
  <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
    <Column Name="ThingThatsNowSuperColumnName" Type="Long">
      <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
        <Column Name="ThingThatCantCurrentlyBeRepresented"/>
      </Column>
    </Column>
  </Column>
</Column>
Re: Is SuperColumn necessary?
This is one of the sticking points with the key concatenation argument. You can't simply access subpartitions of data along an aggregate name using a concatenated key unless you can efficiently address a range of the keys according to a property of a subset. I'm hoping this will bear out with more of this discussion.

Another facet of this issue is performance with respect to storage layout. Presently, columns within a row are inherently organized for efficient range operations. The key space is not generally optimal in this way. I'm hoping to see some discussion of this, as well.

On Tue, May 11, 2010 at 6:17 AM, vd vineetdan...@gmail.com wrote:

Hi,

Can we do a range search on keys in ID:ID format, or would the API treat such a key as a single ID? Can it split on ':'? If not, how can we avoid using supercolumns where we need to associate 'n' rows with a single ID? For example:

CatID1 - articleID1
CatID1 - articleID2
CatID1 - articleID3
CatID1 - articleID4

How can we map such scenarios with simple column families?

Rgds.
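To make the point about concatenated keys concrete, here is a minimal sketch in Python. It is not Cassandra code: it only assumes keys of the hypothetical form "CatID:articleID" that are kept in sorted order, roughly what an order-preserving partitioner would give you. With a hash-based partitioner the keys are not laid out in this order, so the trick would not apply. The function name `prefix_range` and the sample keys are made up for illustration.

```python
from bisect import bisect_left


def prefix_range(sorted_keys, prefix):
    """Return every key starting with `prefix`, using two binary searches.

    Only valid when `sorted_keys` is in lexicographic order.
    """
    if not prefix:
        raise ValueError("prefix must be non-empty")
    lo = bisect_left(sorted_keys, prefix)
    # Smallest string that sorts after every string starting with `prefix`:
    # bump the last character of the prefix by one code point.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect_left(sorted_keys, upper)
    return sorted_keys[lo:hi]


keys = sorted([
    "CatID1:articleID1", "CatID1:articleID2", "CatID1:articleID3",
    "CatID1:articleID4", "CatID10:articleID9", "CatID2:articleID1",
])

# Including the separator in the prefix keeps "CatID10:..." out of the result.
cat1 = prefix_range(keys, "CatID1:")
print(cat1)

# Without the separator, the range also picks up the unrelated CatID10 key.
loose = prefix_range(keys, "CatID1")
print(loose)
```

The second call shows why the separator and the sort order both matter: the range is defined purely by string ordering, so the key layout has to be designed around the queries you expect.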
Re: Is SuperColumn necessary?
Hi Stu,

Thanks for your hard work; this is not easy work. After days of reading the code with my partners, we are convinced that the current implementation of the storage layer should be rewritten for clarity.

On Tue, May 11, 2010 at 12:44 AM, Stu Hood stu.h...@rackspace.com wrote:

I think that it is 100% ideal: it's what I've been working on implementing in #674, #847 and #998. I'm hoping to post a large patchset and docs this week, and I'm aiming to get it committed for 0.8.

The work I've been doing doesn't touch the user interface: it only deals with the internal changes necessary to make this type of storage possible.
Re: Is SuperColumn necessary?
On Tue, May 11, 2010 at 7:46 AM, David Boxenhorn da...@lookin2.com wrote:

I would like an API with a variable number of arguments. Using Java varargs, something like

value = keyspace.get("articles", "cars", "John Smith", "2010-05-01", "comment-25");

or

valueArray = keyspace.get("articles", predicate1, predicate2, predicate3, predicate4);

Hrm. I haven't dug that deeply into the joys of predicate logic, propositional DAGs, etc., but couldn't this also be represented as a nested tree of predicates / other primitives? So it would be something like:

SubColumns = a transformation that takes a predicate, applies it to a Column, then gets its SubColumns

keyspace.get("articles", SubColumns(predicate1, SubColumns(predicate2, SubColumns(predicate3, predicate4))));

It's more functional-programming-ish, I suppose, but I think that model might apply more cleanly here. FP does tend to result in nice clean algorithms for manipulating large data sets.

Mike

The storage layout would be determined by the configuration, as below:

<Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" ...

On Tue, May 11, 2010 at 5:26 PM, Jonathan Shook jsh...@gmail.com wrote:

This is one of the sticking points with the key concatenation argument. You can't simply access subpartitions of data along an aggregate name using a concatenated key unless you can efficiently address a range of the keys according to a property of a subset. I'm hoping this will bear out with more of this discussion.

Another facet of this issue is performance with respect to storage layout. Presently, columns within a row are inherently organized for efficient range operations. The key space is not generally optimal in this way. I'm hoping to see some discussion of this, as well.
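The nested-predicate idea in the message above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is modelled as nested dictionaries, and `name_in`, `name_slice`, `SubColumns` and `query` are hypothetical names chosen to mirror the proposal, not an existing Cassandra or Thrift API.

```python
def name_in(*names):
    """Predicate matching any of the given column names."""
    wanted = set(names)
    return lambda name: name in wanted


def name_slice(start, end):
    """Predicate matching names in the inclusive range [start, end]."""
    return lambda name: start <= name <= end


class SubColumns:
    """Apply `columns` at this level, then `subcolumns` one level down."""

    def __init__(self, columns, subcolumns):
        self.columns = columns
        self.subcolumns = subcolumns


def query(node, predicate):
    """Walk a nested dict, keeping only what the predicate tree selects."""
    if isinstance(predicate, SubColumns):
        return {
            name: query(child, predicate.subcolumns)
            for name, child in sorted(node.items())
            if predicate.columns(name) and isinstance(child, dict)
        }
    # A plain predicate selects entries at the current level unchanged.
    return {name: child for name, child in sorted(node.items()) if predicate(name)}


# Made-up sample data: category -> author -> date -> comment id -> text.
articles = {
    "cars": {
        "John Smith": {
            "2010-05-01": {"comment-25": "Nice post", "comment-26": "Agreed"},
            "2010-06-15": {"comment-40": "Late reply"},
        },
        "Jane Doe": {"2010-05-02": {"comment-27": "Thanks"}},
    },
    "boats": {"John Smith": {"2010-05-03": {"comment-30": "Ahoy"}}},
}

tree = SubColumns(
    name_in("cars"),
    SubColumns(
        name_in("John Smith"),
        SubColumns(name_slice("2010-05-01", "2010-05-31"), name_in("comment-25")),
    ),
)
result = query(articles, tree)
print(result)
```

Each level of the tree is evaluated independently of its siblings, which is one way to read the remark that predicates are easy to distribute. The varargs form from the quoted message would be a thin wrapper that builds such a tree from its arguments.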
Re: Is SuperColumn necessary?
Yes, the term "column" is not appropriate here. Maybe we need not create new terms; in Google's Bigtable, the term "qualifier" is a good one.

On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn da...@lookin2.com wrote:

That would be a good time to get rid of the confusing "column" term, which incorrectly suggests a two-dimensional tabular structure.

Suggestions:

1. A hypercube (or hypocube, if only two dimensions): replace "key" and "column" with "1st dimension", "2nd dimension", etc.
2. A file system: replace "key" and "column" with "directory" and "subdirectory".
3. A tuple tree: "column family" replaced by top-level tuple, whose value is the set of keys, whose value is the set of supercolumns of the key, whose value is the set of columns for the supercolumn, etc.
4. Etc.

On Thu, May 6, 2010 at 2:28 AM, Mike Malone m...@simplegeo.com wrote:

Nice, Ed, we're doing something very similar but less generic.

Now replace all of the various methods for querying with a simple query interface that takes a Predicate, allow the user to specify (in storage-conf) which levels of the nested Columns should be indexed, and completely remove Comparators and have people subclass Column / implement IColumn and we'd really be on to something ;).

Mock storage-conf.xml:

<Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
  <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
    <Column Name="ThingThatsNowSuperColumnName" Type="Long">
      <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
        <Column Name="ThingThatCantCurrentlyBeRepresented"/>
      </Column>
    </Column>
  </Column>
</Column>

Thrift:

struct NamePredicate {
  1: required list<binary> column_names,
}
struct SlicePredicate {
  1: required binary start,
  2: required binary end,
}
struct CountPredicate {
  1: required struct predicate,
  2: required i32 count=100,
}
struct AndPredicate {
  1: required Predicate left,
  2: required Predicate right,
}
struct SubColumnsPredicate {
  1: required Predicate columns,
  2: required Predicate subcolumns,
}

... OrPredicate, OtherUsefulPredicates ...

query(predicate, count, consistency_level) # Count here would be total count of leaf values returned, whereas CountPredicate specifies a column count for a particular sub-slice.

Not fully baked... but I think this could really simplify stuff and make it more flexible. Downside is it may give people enough rope to hang themselves, but at least the predicate stuff is easily distributable.

I'm thinking I'll play around with implementing some of this stuff myself if I have any free time in the near future.

Mike

On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.com wrote:

Very interesting, thanks!

On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote:

Follow-up from last week's discussion: I've been playing around with a simple column comparator for composite column names that I put up on github. I'd be interested to hear what people think of this approach.

http://github.com/edanuff/CassandraCompositeType

Ed

On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote:

It might make sense to create a CompositeType subclass of AbstractType for the purpose of constructing and comparing these types of composite column names, so that you could more easily do that sort of thing rather than having to concatenate into one big string.

On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com wrote:

The only thing SuperColumns appear to buy you (as someone pointed out to me at the Cassandra meetup - I think it was Eric Florenzano) is that you can use different comparator types for the Super/SubColumns, I guess..? But you should be able to do the same thing by creating your own Column comparator. I guess my point is that SuperColumns are mostly a convenience mechanism, as far as I can tell.

Mike

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
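The difference between a composite column name and "one big string" shows up as soon as one component is numeric, as with the Type="Long" level in the mock config. The Python sketch below illustrates the general idea of component-wise comparison only; it is not taken from the linked repository, and the helper names and sample values are assumptions made for the example.

```python
def concat_name(number, label):
    """Naive approach: join the parts into one string with a separator."""
    return f"{number}:{label}"


def composite_name(number, label):
    """Composite approach: keep the parts separate and typed."""
    return (number, label)


parts = [(9, "b"), (10, "a"), (100, "c")]

# Sorting concatenated strings compares character by character,
# so the numeric component no longer sorts numerically.
by_string = sorted(concat_name(n, s) for n, s in parts)

# Sorting tuples compares component by component, each with its own type.
by_components = sorted(composite_name(n, s) for n, s in parts)

print(by_string)
print(by_components)
```

A custom comparator on the server side would have to do the equivalent of the tuple comparison: decode each component and compare it according to its declared type, rather than comparing the raw concatenated bytes.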
Re: Is SuperColumn necessary?
Maybe... but honestly, it doesn't affect the architecture or interface at all. I'm more interested in thinking about how the system should work than what things are called. Naming things is important, but that can happen later.

Does anyone have any thoughts or comments on the architecture I suggested earlier?

Mike

On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang zson...@gmail.com wrote:

Yes, the term "column" is not appropriate here. Maybe we need not create new terms; in Google's Bigtable, the term "qualifier" is a good one.
Re: Is SuperColumn necessary?
I think that it is 100% ideal: it's what I've been working on implementing in #674, #847 and #998. I'm hoping to post a large patchset and docs this week, and I'm aiming to get it committed for 0.8.

The work I've been doing doesn't touch the user interface: it only deals with the internal changes necessary to make this type of storage possible.

-Original Message-
From: Mike Malone m...@simplegeo.com
Sent: Monday, May 10, 2010 11:37am
To: user@cassandra.apache.org
Subject: Re: Is SuperColumn necessary?

Maybe... but honestly, it doesn't affect the architecture or interface at all. I'm more interested in thinking about how the system should work than what things are called. Naming things are important, but that can happen later.

Does anyone have any thoughts or comments on the architecture I suggested earlier?

Mike
Re: Is SuperColumn necessary?
On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook jsh...@gmail.com wrote:

I have to disagree about the naming of things. The name of something isn't just a literal identifier. It affects the way people think about it. For new users, the whole naming thing has been a persistent barrier.

I'm saying we shouldn't be worried too much about coming up with names and analogies until we've decided what it is we're naming.

As for your suggestions, I'm all for simplifying or generalizing the "how it works" part down to a more generalized set of operations. I'm not sure it's a good idea to require users to think in terms of building up a fluffy query structure just to thread it through a needle of an API, even for the simplest of queries. At some point, the level of generic boilerplate takes away from the semantic hand rails that developers like. So I guess I'm suggesting that "how it works" and "how we use it" are not always exactly the same. At least they should both hinge on a common conceptual model, which is where the naming becomes an important anchoring point.

If things are done properly, client libraries could expose simplified query interfaces without much effort. Most ORMs these days work by building a propositional directed acyclic graph that's serialized to SQL. This would work the same way, but it wouldn't be converted into a 4GL.

Mike

Jonathan
Re: Is SuperColumn necessary?
Agreed On Mon, May 10, 2010 at 12:01 PM, Mike Malone m...@simplegeo.com wrote: On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook jsh...@gmail.com wrote: I have to disagree about the naming of things. The name of something isn't just a literal identifier. It affects the way people think about it. For new users, the whole naming thing has been a persistent barrier. I'm saying we shouldn't be worried too much about coming up with names and analogies until we've decided what it is we're naming. As for your suggestions, I'm all for simplifying or generalizing the how it works part down to a more generalized set of operations. I'm not sure it's a good idea to require users to think in terms building up a fluffy query structure just to thread it through a needle of an API, even for the simplest of queries. At some point, the level of generic boilerplate takes away from the semantic hand rails that developers like. So I guess I'm suggesting that how it works and how we use it are not always exactly the same. At least they should both hinge on a common conceptual model, which is where the naming becomes an important anchoring point. If things are done properly, client libraries could expose simplified query interfaces without much effort. Most ORMs these days work by building a propositional directed acyclic graph that's serialized to SQL. This would work the same way, but it wouldn't be converted into a 4GL. Mike Jonathan On Mon, May 10, 2010 at 11:37 AM, Mike Malone m...@simplegeo.com wrote: Maybe... but honestly, it doesn't affect the architecture or interface at all. I'm more interested in thinking about how the system should work than what things are called. Naming things are important, but that can happen later. Does anyone have any thoughts or comments on the architecture I suggested earlier? Mike On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang zson...@gmail.com wrote: Yes, the column here is not appropriate. 
Maybe we need not create new terms; in Google's Bigtable, the term "qualifier" is a good one.

On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn da...@lookin2.com wrote:

That would be a good time to get rid of the confusing "column" term, which incorrectly suggests a two-dimensional tabular structure. Suggestions:

1. A hypercube (or hypocube, if only two dimensions): replace "key" and "column" with "1st dimension", "2nd dimension", etc.
2. A file system: replace "key" and "column" with "directory" and "subdirectory".
3. A tuple tree: "column family" replaced by top-level tuple, whose value is the set of keys, whose value is the set of supercolumns of the key, whose value is the set of columns for the supercolumn, etc.
4. Etc.

On Thu, May 6, 2010 at 2:28 AM, Mike Malone m...@simplegeo.com wrote:

Nice, Ed, we're doing something very similar but less generic. Now replace all of the various methods for querying with a simple query interface that takes a Predicate, allow the user to specify (in storage-conf) which levels of the nested Columns should be indexed, and completely remove Comparators and have people subclass Column / implement IColumn and we'd really be on to something ;).

Mock storage-conf.xml:

    <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
      <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
        <Column Name="ThingThatsNowSuperColumnName" Type="Long">
          <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
            <Column Name="ThingThatCantCurrentlyBeRepresented"/>
          </Column>
        </Column>
      </Column>
    </Column>

Thrift:

    struct NamePredicate {
      1: required list<binary> column_names,
    }
    struct SlicePredicate {
      1: required binary start,
      2: required binary end,
    }
    struct CountPredicate {
      1: required struct predicate,
      2: required i32 count=100,
    }
    struct AndPredicate {
      1: required Predicate left,
      2: required Predicate right,
    }
    struct SubColumnsPredicate {
      1: required Predicate columns,
      2: required Predicate subcolumns,
    }
    ... OrPredicate, OtherUsefulPredicates ...

    query(predicate, count, consistency_level)
    # Count here would be total count of leaf values returned, whereas
    # CountPredicate specifies a column count for a particular sub-slice.

Not fully baked... but I think this could really simplify stuff and make it more flexible. Downside is it may give people enough rope to hang themselves, but at least the predicate stuff is easily distributable. I'm thinking I'll play around with implementing some of this stuff myself if I have any free time in the near future.

Mike

On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.com wrote:

Very interesting, thanks!

On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote:

Follow-up from last week's discussion,
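For readers who want to see how the predicate idea in Mike's mock Thrift definitions might compose, here is a minimal Python sketch. It is not Cassandra code: the class names mirror the proposed structs, but `matching_names`, `query`, the plain-dict `Row` type, and the convention that an empty string means an open slice bound are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

Row = Dict[str, str]  # column name -> value; a hypothetical stand-in for one row


@dataclass
class NamePredicate:
    column_names: List[str]


@dataclass
class SlicePredicate:
    start: str  # inclusive; "" is assumed to mean "from the first column"
    end: str    # inclusive; "" is assumed to mean "through the last column"


@dataclass
class CountPredicate:
    predicate: "Predicate"
    count: int = 100


@dataclass
class AndPredicate:
    left: "Predicate"
    right: "Predicate"


Predicate = Union[NamePredicate, SlicePredicate, CountPredicate, AndPredicate]


def matching_names(pred: Predicate, row: Row) -> List[str]:
    """Return the sorted column names of `row` that `pred` selects."""
    names = sorted(row)
    if isinstance(pred, NamePredicate):
        wanted = set(pred.column_names)
        return [n for n in names if n in wanted]
    if isinstance(pred, SlicePredicate):
        return [n for n in names
                if (pred.start == "" or n >= pred.start)
                and (pred.end == "" or n <= pred.end)]
    if isinstance(pred, CountPredicate):
        # Limit the number of columns returned by the wrapped predicate.
        return matching_names(pred.predicate, row)[:pred.count]
    if isinstance(pred, AndPredicate):
        right = set(matching_names(pred.right, row))
        return [n for n in matching_names(pred.left, row) if n in right]
    raise TypeError(f"unsupported predicate: {pred!r}")


def query(pred: Predicate, row: Row) -> Row:
    """Evaluate a predicate tree against a single row."""
    return {n: row[n] for n in matching_names(pred, row)}
```

Because each predicate is a small value object, a tree of them could be serialized and shipped to whichever node holds the data, which is roughly the "easily distributable" property mentioned in the message.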
Re: Is SuperColumn necessary?
Supercolumns are good for modeling profile-type data. A simple example is a blog:

    blog {
      blog {author, title, ...}
      comments {time: commenter} // sort by TimeUUID
    }

When retrieving a blog, you get all the comments already sorted by time. Without supercolumns, you would need to concatenate multiple comment times together, as you suggested. Requiring the user to concatenate data fields is not only an extra burden on the user but also a less clean design. There will be cases where the list property of a profile is a long list (say, a million items). In such cases, the user wants to insert or delete an item in that list directly, because it's more efficient. Retrieving the whole list, updating it, concatenating again, and then putting it back in the datastore is awkward and less efficient.

-aj

On Mon, May 10, 2010 at 2:20 PM, Mike Malone m...@simplegeo.com wrote:

On Mon, May 10, 2010 at 1:38 PM, AJ Chen ajc...@web2express.org wrote:

Could someone confirm this discussion is not about abandoning supercolumn families? I have found that modeling data with supercolumn families is actually an advantage of Cassandra compared to relational databases. Hope you are not going to drop this important concept. How it's implemented internally is a different matter.

SuperColumns are useful as a convenience mechanism. That's pretty much it. There's _nothing_ (as far as I can tell) that you can do with SuperColumns that you can't do by manually concatenating key names with a separator on the client side and implementing a custom comparator on the server (as ugly as that is). This discussion is about getting rid of SuperColumns and adding a more generic mechanism that will actually be useful and interesting and will continue to be convenient for the types of use cases for which people use SuperColumns. If there's a particular use case that you feel you can only implement with SuperColumns, please share! I honestly can't think of any.

Mike
Re: Is SuperColumn necessary?
If you're storing your super column under a fixed name, you could just concatenate that name with the row key and use normal columns. Then you get your paging and sorting the way you want it.
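As a rough illustration of the row-key approach in this reply, the sketch below keeps one row per (blog id, fixed name) pair and pages through its sorted columns. The in-memory `store`, the `row_key` format, and the `page` helper are assumptions for the example, not part of any Cassandra client API.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# row key -> {column name -> value}; a hypothetical stand-in for one column family
store: Dict[str, Dict[str, str]] = {}


def row_key(blog_id: str, fixed_name: str) -> str:
    """Concatenate the former supercolumn name onto the row key."""
    return f"{blog_id}:{fixed_name}"


def insert(blog_id: str, fixed_name: str, column: str, value: str) -> None:
    store.setdefault(row_key(blog_id, fixed_name), {})[column] = value


def page(blog_id: str, fixed_name: str, after: str = "",
         limit: int = 2) -> List[Tuple[str, str]]:
    """Return up to `limit` columns that sort strictly after `after`."""
    row = store.get(row_key(blog_id, fixed_name), {})
    names = sorted(row)
    start = bisect_right(names, after) if after else 0
    return [(name, row[name]) for name in names[start:start + limit]]


insert("b1", "blog", "author", "ba1")
insert("b1", "comments", "2010-05-10T16:31:00Z", "ca1")
insert("b1", "comments", "2010-05-10T16:39:00Z", "ca2")
insert("b1", "comments", "2010-05-10T17:02:00Z", "ca3")

first_page = page("b1", "comments")
second_page = page("b1", "comments", after=first_page[-1][0])
```

Paging works by passing the last column name of one page as `after` for the next, which relies only on the columns of a row being kept in sorted order.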
Re: Is SuperColumn necessary?
On Mon, May 10, 2010 at 4:31 PM, AJ Chen ajc...@web2express.org wrote:

Supercolumns are good for modeling profile-type data. A simple example is a blog:

    blog {
      blog {author, title, ...}
      comments {time: commenter} // sort by TimeUUID
    }

When retrieving a blog, you get all the comments already sorted by time. Without supercolumns, you would need to concatenate multiple comment times together, as you suggested. Requiring the user to concatenate data fields is not only an extra burden on the user but also a less clean design. There will be cases where the list property of a profile is a long list (say, a million items). In such cases, the user wants to insert or delete an item in that list directly, because it's more efficient. Retrieving the whole list, updating it, concatenating again, and then putting it back in the datastore is awkward and less efficient.

There's nothing you said here that can't be implemented efficiently using columns. You can slice rows and get a subset of Columns. In fact, this example is particularly easy to implement. If you have a Blog with Entries and Comments you'd do:

    <ColumnFamily Name="Blog" CompareWith="UTF8Type" />

Insert blog post:

    batch_mutate(key=<blog post id>, [{name="~post:author", value=<author>}, {name="~post:title", value=<title>}, ...])

Insert comment:

    batch_mutate(key=<blog post id>, [{name=<TimeUUID> + ":author", ...}])

Then you can get the post only (slice for ["~", ""]), the comments only (slice for ["", "~"]), or the post _and_ comments (slice for ["", ""]). Inserting a comment does _not_ require a get/concatenate/insert.

Yes, concatenating the names on the client side is hacky, clunky, and inconvenient. That's why we _should_ build an interface that doesn't require the client to concatenate names. But SuperColumns aren't the right way to do it. They add no value. They could be implemented in client libraries, for example, and nobody would know the difference.

To really understand the problem with SuperColumns, though, you need to look at the Cassandra source. Removing SuperColumns would make the code base much cleaner and tighter, and would probably reduce SLOC by 20%. I think a replacement that assumed nested Columns (or Entries, or Thingies) would be much cleaner. That's what Stu is working on.

Mike
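The slicing scheme in this message depends on `~` (0x7E) sorting after the ASCII letters, digits, and punctuation that start the comment column names. The sketch below simulates that with a sorted list; `get_slice` is a hypothetical helper, the empty string is assumed to mean an open bound, and the comment names are made-up placeholders rather than real TimeUUIDs.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Tuple


def get_slice(row: Dict[str, str], start: str = "",
              end: str = "") -> List[Tuple[str, str]]:
    """Return columns with start <= name <= end; "" means an open bound."""
    # Python orders str by code point, which matches byte order for ASCII names.
    names = sorted(row)
    lo = bisect_left(names, start) if start else 0
    hi = bisect_right(names, end) if end else len(names)
    return [(name, row[name]) for name in names[lo:hi]]


row = {
    "~post:author": "ba1",
    "~post:title": "bt1",
    "0001:author": "ca1",  # placeholder for <TimeUUID> + ":author"
    "0001:text": "text 1",
    "0002:author": "ca2",
}

post_only = get_slice(row, "~", "")
comments_only = get_slice(row, "", "~")
everything = get_slice(row)
```

One caveat of this layout: a comment column whose name begins with a character that sorts at or after `~` (for example, non-ASCII text) would land in the post range, so the prefix choice constrains what the other names may start with.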
Re: Is SuperColumn necessary?
I'm having a difficult time understanding your syntax. Could you provide an example with actual data?

On May 10, 2010, at 5:25 PM, AJ Chen wrote:

Your suggestion works for a fixed supercolumn name. The blog example now becomes:

    {
      blog-id {name, title, ...}
      blog-id-comments {time: commenter}
    }

What about supercolumn names that are not fixed? For example, I want to store each comment's details with the blog like this:

    {
      blog-id {
        blog {name, title, ...}
        comments {comment-id: commenter}
        comment-id {commenter, time, text, ...}
      }
    }

A comment-id is generated on the fly when the comment is made. How do you flatten the comment-id supercolumn to a normal column? This is just a brain exercise, not meant to pick on you.

thanks,
-aj
Re: Is SuperColumn necessary?
In your implementation, are the comments still sorted by time? Will UTF8Type sort TimeUUID:author by time?

thanks,
-aj
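The question of whether a plain string ordering sorts TimeUUIDs by time can be checked with Python's standard `uuid` module. In the canonical text form of a version-1 UUID the low 32 bits of the timestamp come first, so text order and time order can disagree. The `time_uuid` helper and the node value below are assumptions made for the demonstration.

```python
import uuid


def time_uuid(timestamp: int, node: int = 0x0123456789AB) -> uuid.UUID:
    """Build a version-1 UUID from a 60-bit timestamp (100 ns ticks)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # set version 1
    # 0x80 sets the RFC 4122 variant bits; the clock sequence is left at zero.
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0x00, node))


earlier = time_uuid(0x0FFFFFFFF)  # time_low is all ones
later = time_uuid(0x100000000)    # one tick later; time_low wraps to zero

by_text = sorted([earlier, later], key=str)
by_time = sorted([earlier, later], key=lambda u: u.time)
```

Here `by_text` puts `later` first because its text form starts with `00000000`, while `by_time` puts `earlier` first. A string comparator over the canonical text therefore does not give chronological order for these values.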
Re: Is SuperColumn necessary?
    {
      b1 {
        blog-id: b1
        author: ba1
        title: bt1
        comment-timeuuid-1: {author: ca1, id: comment-timeuuid-1, text: text 1}
        comment-timeuuid-2: {author: ca2, id: comment-timeuuid-2, text: text 2}
      }
    }

Mike just suggested concatenating the comment id with each of the comment field names so that the above data can be stored in a normal column family. It looks fine, except that I'm not sure whether the time sorting on comments still works.

-aj
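To make the flattened form of this example concrete, the sketch below stores the sample data in one row with composite column names and regroups the comment columns on read. The `~post:` prefix follows the earlier suggestion in the thread; the `comments` helper and the `:` separator are assumptions, and the comment ids are the placeholder names from the example, not real TimeUUIDs.

```python
from itertools import groupby
from typing import Dict

row_b1 = {
    "~post:author": "ba1",
    "~post:title": "bt1",
    "comment-timeuuid-1:author": "ca1",
    "comment-timeuuid-1:text": "text 1",
    "comment-timeuuid-2:author": "ca2",
    "comment-timeuuid-2:text": "text 2",
}


def comments(row: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Group comment columns by the comment id that prefixes each name."""
    names = sorted(n for n in row if not n.startswith("~post:"))
    grouped: Dict[str, Dict[str, str]] = {}
    # Sorted names keep all fields of one comment adjacent, so groupby suffices.
    for comment_id, group in groupby(names, key=lambda n: n.split(":", 1)[0]):
        grouped[comment_id] = {n.split(":", 1)[1]: row[n] for n in group}
    return grouped


by_comment = comments(row_b1)
```

The order in which comments come back is whatever the column comparator produces for the id prefix, which is exactly the sorting concern raised in the message.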
Re: Is SuperColumn necessary?
Mike just suggested concatenating the comment id with each of the comment field names so that the above data can be stored in a normal column family. It looks fine, except that I'm not sure whether the time sorting on comments still works.

In the case of time you can just use lexicographically sortable strings that represent your timestamp (e.g., RFC 3339). You're right, I don't think TimeUUID does that. For more complicated things (e.g., TimeUUIDs or packed numerics that you don't want to zero pad) you'd have to implement a custom comparator. So the convenience mechanisms that would have to be implemented (and, in fact, Stu and Ed have pretty much already implemented) would take care of concatenating the column names and doing the chained comparisons for you.

Mike
Re: Is SuperColumn necessary?
Guys, this is beginning to sound like MUMPS! http://en.wikipedia.org/wiki/MUMPS In MUMPS, all variables are sparse, multidimensional arrays, which can be stored to disk. It is an arcane, and archaic, language (does anyone but me remember it?), but it has been used successfully for years. Maybe we can learn something from it. I like the terminology of sparse multidimensional arrays very much - it really clarifies my thinking. A column family would just be a variable. On Fri, May 7, 2010 at 7:06 PM, Ed Anuff e...@anuff.com wrote: On Thu, May 6, 2010 at 11:10 PM, Mike Malone m...@simplegeo.com wrote: The upshot is, the Cassandra data model would go from being "it's a nested dictionary, just kidding no it's not!" to being "it's a nested dictionary, for serious." Again, these are all just ideas... but I think this simplified data model would allow you to express pretty much any query in a graph of simple primitives like Predicates, Filters, Aggregations, Transformations, etc. The indexes would allow you to cheat when evaluating certain types of queries - if you get a SlicePredicate on an indexed thingy you don't have to enumerate the entire set of sub-thingies, for example. This would be my dream implementation. I'm working on an application that needs that sort of capability. SuperColumns lead you to think that this should be done in the Cassandra tier, but then they fall short, so my thought was that I would do everything in Cassandra as regular column families and columns, using composite keys and composite column names a la the code I shared above, and then implement the n-level hierarchy in the app tier. It looks like your suggestion is to take it in the other direction and make it part of the fundamental data model, which would be very useful if it could be made to work without big tradeoffs.
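To make the "sparse multidimensional array" picture concrete, here is a minimal sketch in Python. It is only an illustration, not anything from Cassandra or from Ed's code: the names `sparse_array`, `flatten`, and the blog data are all made up for the example. It shows the same data seen two ways, as a nested dictionary and as a flat list of composite keys of the kind Ed describes handling in the app tier.

```python
from collections import defaultdict


def sparse_array():
    """Return a nested mapping that creates missing dimensions on demand."""
    return defaultdict(sparse_array)


def flatten(node, prefix=()):
    """Yield (composite_key, value) pairs for every leaf, in sorted name order."""
    for name in sorted(node):
        child = node[name]
        if isinstance(child, defaultdict):
            # Inner dimension: descend and extend the composite key.
            yield from flatten(child, prefix + (name,))
        else:
            # Leaf: the full path of names is the composite key.
            yield prefix + (name,), child


# Hypothetical data: column family -> row key -> super column -> column.
store = sparse_array()
store["Blog"]["blog-1"]["post"]["title"] = "Hello"
store["Blog"]["blog-1"]["comments"]["2010-05-10T17:25:00Z"] = "aj"
store["Blog"]["blog-1"]["comments"]["2010-05-10T16:39:00Z"] = "william"

flat = list(flatten(store))
for key, value in flat:
    print(key, "=", value)
```

The nested form is array[dim1][dim2]...[dimN] = value; the flattened form is what you would store if every level were folded into one composite name.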
Re: Is SuperColumn necessary?
I'm not sure this is much of an improvement. It does illustrate, however, the desire to couch the concepts in terms that each of us is already comfortable with. Nearly every set of terms that comes from an existing system will have baggage which doesn't map appropriately. Not that a sparse multidimensional array is an unfamiliar construct. It's more that "sparse" may or may not apply depending on the part of your data you are describing. "Multidimensional" implies uniformity of structure, which is not to be taken for granted. Arrays are just one way to think of the structures. They also serve well as maps and sets (which can be modeled using arrays as well). There are certain semantics of sets, lists, and maps which people have wired into their brains, and reducing it all to arrays is likely to create more confusion. I think if we want to borrow terms from another system, it shouldn't be a computing system, or at least it should be so different or fundamental that the terms have to be re-understood free of baggage. On Sun, May 9, 2010 at 1:30 AM, David Boxenhorn da...@lookin2.com wrote: Guys, this is beginning to sound like MUMPS! http://en.wikipedia.org/wiki/MUMPS In MUMPS, all variables are sparse, multidimensional arrays, which can be stored to disk. It is an arcane, and archaic, language (does anyone but me remember it?), but it has been used successfully for years. Maybe we can learn something from it. I like the terminology of sparse multidimensional arrays very much - it really clarifies my thinking. A column family would just be a variable. On Fri, May 7, 2010 at 7:06 PM, Ed Anuff e...@anuff.com wrote: On Thu, May 6, 2010 at 11:10 PM, Mike Malone m...@simplegeo.com wrote: The upshot is, the Cassandra data model would go from being "it's a nested dictionary, just kidding no it's not!" to being "it's a nested dictionary, for serious." Again, these are all just ideas... 
but I think this simplified data model would allow you to express pretty much any query in a graph of simple primitives like Predicates, Filters, Aggregations, Transformations, etc. The indexes would allow you to cheat when evaluating certain types of queries - if you get a SlicePredicate on an indexed thingy you don't have to enumerate the entire set of sub-thingies, for example. This would be my dream implementation. I'm working on an application that needs that sort of capability. SuperColumns lead you to think that this should be done in the Cassandra tier, but then they fall short, so my thought was that I would do everything in Cassandra as regular column families and columns, using composite keys and composite column names a la the code I shared above, and then implement the n-level hierarchy in the app tier. It looks like your suggestion is to take it in the other direction and make it part of the fundamental data model, which would be very useful if it could be made to work without big tradeoffs.
Re: Is SuperColumn necessary?
On Thu, May 6, 2010 at 5:38 PM, Vijay vijay2...@gmail.com wrote: I would rather be interested in Tree type structure where supercolumns have supercolumns in it. you dont need to compare all the columns to find a set of columns and will also reduce the bytes transfered for separator, at least string concatenation (Or something like that) for read and write column name generation. it is more logically stored and structured by this way and also we can make caching work better by selectively caching the tree (User defined if you will) But nothing wrong in supporting both :) I'm 99% sure we're talking about the same thing and we don't need to support both. How names/values are separated is pretty irrelevant. It has to happen somewhere. I agree that it'd be nice if it happened on the server, but doing it in the client makes it easier to explore ideas. On Thu, May 6, 2010 at 5:27 PM, philip andrew philip14...@gmail.com wrote: Please create a new term word if the existing terms are misleading, if its not a file system then its not good to call it a file system. While it's seriously bikesheddy, I guess you're right. Let's call them thingies for now, then. So you can have a top-level thingy and it can have an arbitrarily nested tree of sub-thingies. Each thingy has a thingy type [1]. You can also tell Cassandra if you want a particular level of thingy to be indexed. At one (or maybe more) levels you can tell Cassandra you want your thingies to be split onto separate nodes in your cluster. At one (or maybe more) levels you could also tell Cassandra that you want your thingies split into separate files [2]. The upshot is, the Cassandra data model would go from being it's a nested dictionary, just kidding no it's not! to being it's a nested dictionary, for serious. Again, these are all just ideas... but I think this simplified data model would allow you to express pretty much any query in a graph of simple primitives like Predicates, Filters, Aggregations, Transformations, etc. 
The indexes would allow you to cheat when evaluating certain types of queries - if you get a SlicePredicate on an indexed thingy you don't have to enumerate the entire set of sub-thingies for example. So, you'd query your thingies by building out a predicate, transformations, filters, etc., serializing the graph of primitives, and sending it over the wire to Cassandra. Cassandra would rebuild the graph and run it over your dataset. So instead of: Cassandra.get_range_slices( keyspace=AwesomeApp, column_parent=ColumnParent(column_family=user), slice_predicate=SlicePredicate(column_names=['username', 'dob']), range=KeyRange(start_key='a', end_key='m'), consistency_level=ONE ) You'd do something like: Cassandra.query( SubThingyTransformer( NamePredicate(names=[AwesomeApp], SubThingyTransformer( NamePredicate(names=[user]), SubThingyTransformer( SlicePredicate(start=a, end=m), NamePredicate(names=[username, dob]) ) ) ), consistency_level=ONE ) Which seems complicated, but it's basically just [(user['username'], user['dob']) for user in Cassandra['AwesomeApp']['user'].slice('a', 'm')] and could probably be expressed that way in a client library. I think batch_mutate is awesome the way it is and should be the only way to insert/update data. I'd rename it mutate. So our interface becomes: Cassandra.query(query, consistency_level) Cassandra.mutate(mutation, consistency_level) Ta-da. Anyways, I was trying to avoid writing all of this out in prose and try mocking some of it up in code instead. I guess this this works too. Either way, I do think something like this would simplify the codebase, simplify the data model, simplify the interface, make the entire system more flexible, and be generally awesome. Mike [1] These can be subclasses of Thingy in Java... or maybe they'd implement IThingy. But either way they'd handle serialization and probably implement compareTo to define natural ordering. 
So you'd have classes like ASCIIThingy, UTF8Thingy, and LongThingy (ahem) - these would replace comparators. [2] I think there's another simplification here. Splitting into separate files is really very similar to splitting onto separate nodes. There might be a way around some of the row size limitations with this sort of concept. And we may be able to get better utilization of multiple disks by giving each disk (or data directory) a subset of the node's token range. Caveat: thought not fully baked.
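For anyone who wants to see the "graph of primitives" idea run, here is a rough sketch in Python. None of this is a real Cassandra API: `NamePredicate`, `SlicePredicate`, and `SubThingyTransformer` borrow their names from the mock query above, but the `select` methods, the inclusive slice bounds, and the sample data are assumptions made for the example. The dataset is a plain nested dict standing in for the thingy tree.

```python
from dataclasses import dataclass


@dataclass
class NamePredicate:
    names: list

    def select(self, thingy):
        # Keep only the named sub-thingies that actually exist.
        return {n: thingy[n] for n in self.names if n in thingy}


@dataclass
class SlicePredicate:
    start: str
    end: str

    def select(self, thingy):
        # Range over sub-thingy names in sorted order (inclusive here, by assumption).
        return {n: thingy[n] for n in sorted(thingy) if self.start <= n <= self.end}


@dataclass
class SubThingyTransformer:
    predicate: object  # picks sub-thingies at this level
    child: object      # predicate or transformer applied one level down

    def select(self, thingy):
        picked = self.predicate.select(thingy)
        return {n: self.child.select(sub) for n, sub in picked.items()}


# Hypothetical dataset: keyspace -> column family -> row key -> columns.
cassandra = {
    "AwesomeApp": {
        "user": {
            "alice": {"username": "alice", "dob": "1980-01-01", "email": "a@example.com"},
            "zed": {"username": "zed", "dob": "1975-06-30", "email": "z@example.com"},
        }
    }
}

query = SubThingyTransformer(
    NamePredicate(["AwesomeApp"]),
    SubThingyTransformer(
        NamePredicate(["user"]),
        SubThingyTransformer(
            SlicePredicate("a", "m"),
            NamePredicate(["username", "dob"]),
        ),
    ),
)

result = query.select(cassandra)
print(result)
```

In a real system the query graph would be serialized and evaluated on the server; here it is simply walked in-process to show that the nesting of transformers mirrors the nesting of the data.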
Re: Is SuperColumn necessary?
On Wed, 2010-05-05 at 11:31 -0700, Ed Anuff wrote: Follow-up from last week's discussion, I've been playing around with a simple column comparator for composite column names that I put up on github. I'd be interested to hear what people think of this approach. http://github.com/edanuff/CassandraCompositeType Clever. I wonder what a useful abstraction in Hector or one of the other idiomatic clients would look like. -- Eric Evans eev...@rackspace.com
Re: Is SuperColumn necessary?
That would be a good time to get rid of the confusing column term, which incorrectly suggests a two-dimensional tabular structure. Suggestions: 1. A hypercube (or hypocube, if only two dimensions): replace key and column with 1st dimension, 2nd dimension, etc. 2. A file system: replace key and column with directory and subdirectory 3. A tuple tree: Column family replaced by top-level tuple, whose value is the set of keys, whose value is the set of supercolumns of the key, whose value is the set of columns for the supercolumn, etc. 4. Etc. On Thu, May 6, 2010 at 2:28 AM, Mike Malone m...@simplegeo.com wrote: Nice, Ed, we're doing something very similar but less generic. Now replace all of the various methods for querying with a simple query interface that takes a Predicate, allow the user to specify (in storage-conf) which levels of the nested Columns should be indexed, and completely remove Comparators and have people subclass Column / implement IColumn and we'd really be on to something ;). Mock storage-conf.xml: Column Name=ThingThatsNowKey Indexed=True ClusterPartitioned=True Type=UTF8 Column Name=ThingThatsNowColumnFamily DiskPartitioned=True Type=UTF8 Column Name=ThingThatsNowSuperColumnName Type=Long Column Name=ThingThatsNowColumnName Indexed=True Type=ASCII Column Name=ThingThatCantCurrentlyBeRepresented/ /Column /Column /Column /Column Thrift: struct NamePredicate { 1: required listbinary column_names, } struct SlicePredicate { 1: required binary start, 2: required binary end, } struct CountPredicate { 1: required struct predicate, 2: required i32 count=100, } struct AndPredicate { 1: required Predicate left, 2: required Predicate right, } struct SubColumnsPredicate { 1: required Predicate columns, 2: required Predicate subcolumns, } ... OrPredicate, OtherUsefulPredicates ... query(predicate, count, consistency_level) # Count here would be total count of leaf values returned, whereas CountPredicate specifies a column count for a particular sub-slice. 
Not fully baked... but I think this could really simplify stuff and make it more flexible. Downside is it may give people enough rope to hang themselves, but at least the predicate stuff is easily distributable. I'm thinking I'll play around with implementing some of this stuff myself if I have any free time in the near future. Mike On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.com wrote: Very interesting, thanks! On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote: Follow-up from last weeks discussion, I've been playing around with a simple column comparator for composite column names that I put up on github. I'd be interested to hear what people think of this approach. http://github.com/edanuff/CassandraCompositeType Ed On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote: It might make sense to create a CompositeType subclass of AbstractType for the purpose of constructing and comparing these types of composite column names so that if you could more easily do that sort of thing rather than having to concatenate into one big string. On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com wrote: The only thing SuperColumns appear to buy you (as someone pointed out to me at the Cassandra meetup - I think it was Eric Florenzano) is that you can use different comparator types for the Super/SubColumns, I guess..? But you should be able to do the same thing by creating your own Column comparator. I guess my point is that SuperColumns are mostly a convenience mechanism, as far as I can tell. Mike -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Is SuperColumn necessary?
+1 on all of that On Thu, May 6, 2010 at 09:04, David Boxenhorn da...@lookin2.com wrote: That would be a good time to get rid of the confusing column term, which incorrectly suggests a two-dimensional tabular structure. Suggestions: 1. A hypercube (or hypocube, if only two dimensions): replace key and column with 1st dimension, 2nd dimension, etc. 2. A file system: replace key and column with directory and subdirectory 3. A tuple tree: Column family replaced by top-level tuple, whose value is the set of keys, whose value is the set of supercolumns of the key, whose value is the set of columns for the supercolumn, etc. 4. Etc. On Thu, May 6, 2010 at 2:28 AM, Mike Malone m...@simplegeo.com wrote: Nice, Ed, we're doing something very similar but less generic. Now replace all of the various methods for querying with a simple query interface that takes a Predicate, allow the user to specify (in storage-conf) which levels of the nested Columns should be indexed, and completely remove Comparators and have people subclass Column / implement IColumn and we'd really be on to something ;). Mock storage-conf.xml: Column Name=ThingThatsNowKey Indexed=True ClusterPartitioned=True Type=UTF8 Column Name=ThingThatsNowColumnFamily DiskPartitioned=True Type=UTF8 Column Name=ThingThatsNowSuperColumnName Type=Long Column Name=ThingThatsNowColumnName Indexed=True Type=ASCII Column Name=ThingThatCantCurrentlyBeRepresented/ /Column /Column /Column /Column Thrift: struct NamePredicate { 1: required listbinary column_names, } struct SlicePredicate { 1: required binary start, 2: required binary end, } struct CountPredicate { 1: required struct predicate, 2: required i32 count=100, } struct AndPredicate { 1: required Predicate left, 2: required Predicate right, } struct SubColumnsPredicate { 1: required Predicate columns, 2: required Predicate subcolumns, } ... OrPredicate, OtherUsefulPredicates ... 
query(predicate, count, consistency_level) # Count here would be total count of leaf values returned, whereas CountPredicate specifies a column count for a particular sub-slice. Not fully baked... but I think this could really simplify stuff and make it more flexible. Downside is it may give people enough rope to hang themselves, but at least the predicate stuff is easily distributable. I'm thinking I'll play around with implementing some of this stuff myself if I have any free time in the near future. Mike On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.com wrote: Very interesting, thanks! On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote: Follow-up from last weeks discussion, I've been playing around with a simple column comparator for composite column names that I put up on github. I'd be interested to hear what people think of this approach. http://github.com/edanuff/CassandraCompositeType Ed On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote: It might make sense to create a CompositeType subclass of AbstractType for the purpose of constructing and comparing these types of composite column names so that if you could more easily do that sort of thing rather than having to concatenate into one big string. On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com wrote: The only thing SuperColumns appear to buy you (as someone pointed out to me at the Cassandra meetup - I think it was Eric Florenzano) is that you can use different comparator types for the Super/SubColumns, I guess..? But you should be able to do the same thing by creating your own Column comparator. I guess my point is that SuperColumns are mostly a convenience mechanism, as far as I can tell. Mike -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Is SuperColumn necessary?
Please create a new term word if the existing terms are misleading, if its not a file system then its not good to call it a file system. On Thu, May 6, 2010 at 3:50 PM, Torsten Curdt tcu...@vafer.org wrote: +1 on all of that On Thu, May 6, 2010 at 09:04, David Boxenhorn da...@lookin2.com wrote: That would be a good time to get rid of the confusing column term, which incorrectly suggests a two-dimensional tabular structure. Suggestions: 1. A hypercube (or hypocube, if only two dimensions): replace key and column with 1st dimension, 2nd dimension, etc. 2. A file system: replace key and column with directory and subdirectory 3. A tuple tree: Column family replaced by top-level tuple, whose value is the set of keys, whose value is the set of supercolumns of the key, whose value is the set of columns for the supercolumn, etc. 4. Etc. On Thu, May 6, 2010 at 2:28 AM, Mike Malone m...@simplegeo.com wrote: Nice, Ed, we're doing something very similar but less generic. Now replace all of the various methods for querying with a simple query interface that takes a Predicate, allow the user to specify (in storage-conf) which levels of the nested Columns should be indexed, and completely remove Comparators and have people subclass Column / implement IColumn and we'd really be on to something ;). 
Mock storage-conf.xml: Column Name=ThingThatsNowKey Indexed=True ClusterPartitioned=True Type=UTF8 Column Name=ThingThatsNowColumnFamily DiskPartitioned=True Type=UTF8 Column Name=ThingThatsNowSuperColumnName Type=Long Column Name=ThingThatsNowColumnName Indexed=True Type=ASCII Column Name=ThingThatCantCurrentlyBeRepresented/ /Column /Column /Column /Column Thrift: struct NamePredicate { 1: required listbinary column_names, } struct SlicePredicate { 1: required binary start, 2: required binary end, } struct CountPredicate { 1: required struct predicate, 2: required i32 count=100, } struct AndPredicate { 1: required Predicate left, 2: required Predicate right, } struct SubColumnsPredicate { 1: required Predicate columns, 2: required Predicate subcolumns, } ... OrPredicate, OtherUsefulPredicates ... query(predicate, count, consistency_level) # Count here would be total count of leaf values returned, whereas CountPredicate specifies a column count for a particular sub-slice. Not fully baked... but I think this could really simplify stuff and make it more flexible. Downside is it may give people enough rope to hang themselves, but at least the predicate stuff is easily distributable. I'm thinking I'll play around with implementing some of this stuff myself if I have any free time in the near future. Mike On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.com wrote: Very interesting, thanks! On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote: Follow-up from last weeks discussion, I've been playing around with a simple column comparator for composite column names that I put up on github. I'd be interested to hear what people think of this approach. 
http://github.com/edanuff/CassandraCompositeType Ed On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote: It might make sense to create a CompositeType subclass of AbstractType for the purpose of constructing and comparing these types of composite column names so that if you could more easily do that sort of thing rather than having to concatenate into one big string. On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com wrote: The only thing SuperColumns appear to buy you (as someone pointed out to me at the Cassandra meetup - I think it was Eric Florenzano) is that you can use different comparator types for the Super/SubColumns, I guess..? But you should be able to do the same thing by creating your own Column comparator. I guess my point is that SuperColumns are mostly a convenience mechanism, as far as I can tell. Mike -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Is SuperColumn necessary?
I would be more interested in a tree-type structure where supercolumns can contain supercolumns. You don't need to compare all the columns to find a set of columns, and it will also reduce the bytes transferred for separators, or at least avoid the string concatenation (or something like that) needed to generate column names for reads and writes. The data is stored and structured more logically this way, and we can also make caching work better by selectively caching the tree (user-defined, if you will). But there is nothing wrong with supporting both :) Regards, /VJ On Wed, May 5, 2010 at 11:31 AM, Ed Anuff e...@anuff.com wrote: Follow-up from last week's discussion, I've been playing around with a simple column comparator for composite column names that I put up on github. I'd be interested to hear what people think of this approach. http://github.com/edanuff/CassandraCompositeType Ed On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote: It might make sense to create a CompositeType subclass of AbstractType for the purpose of constructing and comparing these types of composite column names, so that you could more easily do that sort of thing rather than having to concatenate into one big string. On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com wrote: The only thing SuperColumns appear to buy you (as someone pointed out to me at the Cassandra meetup - I think it was Eric Florenzano) is that you can use different comparator types for the Super/SubColumns, I guess..? But you should be able to do the same thing by creating your own Column comparator. I guess my point is that SuperColumns are mostly a convenience mechanism, as far as I can tell. Mike
Re: Is SuperColumn necessary?
Follow-up from last week's discussion, I've been playing around with a simple column comparator for composite column names that I put up on github. I'd be interested to hear what people think of this approach. http://github.com/edanuff/CassandraCompositeType Ed On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote: It might make sense to create a CompositeType subclass of AbstractType for the purpose of constructing and comparing these types of composite column names, so that you could more easily do that sort of thing rather than having to concatenate into one big string. On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com wrote: The only thing SuperColumns appear to buy you (as someone pointed out to me at the Cassandra meetup - I think it was Eric Florenzano) is that you can use different comparator types for the Super/SubColumns, I guess..? But you should be able to do the same thing by creating your own Column comparator. I guess my point is that SuperColumns are mostly a convenience mechanism, as far as I can tell. Mike
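As a rough illustration of why a component-wise comparator can be preferable to "one big string", here is a small Python sketch. It is not the code from the CassandraCompositeType repository; `compare_composite` and the (comment id, field name) column names are hypothetical. Each composite name is a tuple, and the comparator walks the components in order, which is the chained comparison Mike mentions.

```python
from functools import cmp_to_key


def compare_composite(a, b):
    """Compare two composite names component by component."""
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    # All shared components are equal: the shorter name sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


# Hypothetical column names: (comment id, field name).
names = [(10, "text"), (9, "time"), (9, "commenter"), (10, "commenter")]

# Typed, component-wise ordering: comment 9 sorts before comment 10.
by_components = sorted(names, key=cmp_to_key(compare_composite))

# Naive concatenation with a separator: "10:..." sorts before "9:...".
by_string = sorted(names, key=lambda n: ":".join(str(part) for part in n))

print(by_components)
print(by_string)
```

The string version only behaves if every component is already lexicographically sortable (zero-padded numbers, RFC 3339 timestamps, and so on), which is the constraint raised earlier in the thread.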
Re: Is SuperColumn necessary?
Hey Ed, I've been working on a similar approach for arbitrarily nested/compound column names in #998. See: http://github.com/stuhood/cassandra/blob/998/src/java/org/apache/cassandra/db/ColumnKey.java The goal is to provide native support and potentially (in the very long term), API support for nested/compound names. The difference between our approaches boils down to needing to define a comparator for every level in #998, versus having dynamic types per name in your approach. Thanks, Stu -Original Message- From: Ed Anuff e...@anuff.com Sent: Wednesday, May 5, 2010 1:31pm To: user@cassandra.apache.org Subject: Re: Is SuperColumn necessary? Follow-up from last week's discussion, I've been playing around with a simple column comparator for composite column names that I put up on github. I'd be interested to hear what people think of this approach. http://github.com/edanuff/CassandraCompositeType Ed On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote: It might make sense to create a CompositeType subclass of AbstractType for the purpose of constructing and comparing these types of composite column names, so that you could more easily do that sort of thing rather than having to concatenate into one big string. On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com wrote: The only thing SuperColumns appear to buy you (as someone pointed out to me at the Cassandra meetup - I think it was Eric Florenzano) is that you can use different comparator types for the Super/SubColumns, I guess..? But you should be able to do the same thing by creating your own Column comparator. I guess my point is that SuperColumns are mostly a convenience mechanism, as far as I can tell. Mike
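One way to read "a comparator for every level" is that the ordering for each position in a compound name is declared up front, rather than carried with each name. The Python sketch below illustrates only that reading; it is not taken from #998 or from ColumnKey.java, and `make_sort_key` and the two-level schema are invented for the example.

```python
def make_sort_key(level_keys):
    """Build a sort key that applies one ordering function per nesting level."""
    def sort_key(name):
        if len(name) > len(level_keys):
            raise ValueError("name has more levels than declared comparators")
        return tuple(key(part) for key, part in zip(level_keys, name))
    return sort_key


# Hypothetical schema: level 0 is ordered as integers,
# level 1 as case-insensitive text.
sort_key = make_sort_key([int, str.lower])

names = [("10", "Text"), ("9", "time"), ("9", "Commenter")]
ordered = sorted(names, key=sort_key)
print(ordered)
```

With dynamic types per name, by contrast, each component would carry its own type tag and the comparator would dispatch on that tag at comparison time.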
Re: Is SuperColumn necessary?
Very interesting, thanks! On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote: Follow-up from last weeks discussion, I've been playing around with a simple column comparator for composite column names that I put up on github. I'd be interested to hear what people think of this approach. http://github.com/edanuff/CassandraCompositeType Ed On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote: It might make sense to create a CompositeType subclass of AbstractType for the purpose of constructing and comparing these types of composite column names so that if you could more easily do that sort of thing rather than having to concatenate into one big string. On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com wrote: The only thing SuperColumns appear to buy you (as someone pointed out to me at the Cassandra meetup - I think it was Eric Florenzano) is that you can use different comparator types for the Super/SubColumns, I guess..? But you should be able to do the same thing by creating your own Column comparator. I guess my point is that SuperColumns are mostly a convenience mechanism, as far as I can tell. Mike -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Is SuperColumn necessary?
Nice, Ed, we're doing something very similar but less generic. Now replace all of the various methods for querying with a simple query interface that takes a Predicate, allow the user to specify (in storage-conf) which levels of the nested Columns should be indexed, and completely remove Comparators and have people subclass Column / implement IColumn and we'd really be on to something ;).

Mock storage-conf.xml:

    <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
      <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
        <Column Name="ThingThatsNowSuperColumnName" Type="Long">
          <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
            <Column Name="ThingThatCantCurrentlyBeRepresented"/>
          </Column>
        </Column>
      </Column>
    </Column>

Thrift:

    struct NamePredicate {
      1: required list<binary> column_names,
    }
    struct SlicePredicate {
      1: required binary start,
      2: required binary end,
    }
    struct CountPredicate {
      1: required struct predicate,
      2: required i32 count=100,
    }
    struct AndPredicate {
      1: required Predicate left,
      2: required Predicate right,
    }
    struct SubColumnsPredicate {
      1: required Predicate columns,
      2: required Predicate subcolumns,
    }
    ... OrPredicate, OtherUsefulPredicates ...

    query(predicate, count, consistency_level)
    # Count here would be total count of leaf values returned, whereas
    # CountPredicate specifies a column count for a particular sub-slice.

Not fully baked... but I think this could really simplify stuff and make it more flexible. Downside is it may give people enough rope to hang themselves, but at least the predicate stuff is easily distributable. I'm thinking I'll play around with implementing some of this stuff myself if I have any free time in the near future.

Mike

On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.com wrote: Very interesting, thanks!
On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote: Follow-up from last weeks discussion, I've been playing around with a simple column comparator for composite column names that I put up on github. I'd be interested to hear what people think of this approach. http://github.com/edanuff/CassandraCompositeType Ed On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote: It might make sense to create a CompositeType subclass of AbstractType for the purpose of constructing and comparing these types of composite column names so that if you could more easily do that sort of thing rather than having to concatenate into one big string. On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com wrote: The only thing SuperColumns appear to buy you (as someone pointed out to me at the Cassandra meetup - I think it was Eric Florenzano) is that you can use different comparator types for the Super/SubColumns, I guess..? But you should be able to do the same thing by creating your own Column comparator. I guess my point is that SuperColumns are mostly a convenience mechanism, as far as I can tell. Mike -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
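The predicate structs in Mike's mock Thrift definitions compose naturally, and a small Python sketch may help show how. This is a guess at the intended semantics, not an implementation of any Cassandra interface: the `matches` and `apply` methods, the inclusive slice bounds, and the sample column names are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NamePredicate:
    column_names: List[bytes]

    def matches(self, name: bytes) -> bool:
        return name in self.column_names


@dataclass
class SlicePredicate:
    start: bytes
    end: bytes

    def matches(self, name: bytes) -> bool:
        # Inclusive on both ends, by assumption.
        return self.start <= name <= self.end


@dataclass
class AndPredicate:
    left: object
    right: object

    def matches(self, name: bytes) -> bool:
        return self.left.matches(name) and self.right.matches(name)


@dataclass
class CountPredicate:
    predicate: object
    count: int = 100

    def apply(self, names):
        """Return at most `count` matching names, in sorted order."""
        hits = [n for n in sorted(names) if self.predicate.matches(n)]
        return hits[: self.count]


# Hypothetical column names in one row.
columns = [b"dob", b"email", b"username", b"zip"]

query = CountPredicate(
    AndPredicate(
        SlicePredicate(b"a", b"v"),
        NamePredicate([b"username", b"dob", b"zip"]),
    ),
    count=1,
)
selected = query.apply(columns)
print(selected)
```

Because each predicate only answers "does this name match?", the same tree could in principle be shipped to every node and evaluated locally, which is the "easily distributable" property mentioned above.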
Re: Is SuperColumn necessary?
I don't think a secondary index is necessary in the Cassandra core; at least, it is not urgent. I think the most urgent improvements to Cassandra right now are: 1. Re-clarify the data model. 2. Re-implement the storage and index layers; the current SSTable implementation in particular is not good. In fact, the current storage/index implementation is the weakest point. On Tue, Apr 27, 2010 at 12:11 AM, Jonathan Ellis jbel...@gmail.com wrote: I think that once we have built-in indexing (CASSANDRA-749) you can make a good case for dropping supercolumns (at least, dropping them from the public API and reserving them for internal use). On Mon, Apr 26, 2010 at 11:05 AM, Schubert Zhang zson...@gmail.com wrote: I don't think the SuperColumn is so necessary. I think this level of logic can be left to the application. Do you think so? If SuperColumn is needed, as in https://issues.apache.org/jira/browse/CASSANDRA-598, we should build indexes at both the SuperColumn level and the SubColumn level. Thus, there would be too many levels of index.
Re: Is SuperColumn necessary?
I think, at least for now, we should leave the logic of the current SuperColumn and additional indexing features to the application layer, outside Cassandra core. On Wed, Apr 28, 2010 at 6:44 PM, Schubert Zhang zson...@gmail.com wrote: I don't think a secondary index is necessary for Cassandra core; at least, it is not urgent. I think the most urgent improvements to Cassandra right now are: 1. re-clarify the data model. 2. re-implement the storage and index; the current SSTable implementation in particular is not good. In fact, the current storage/index implementation is the weakest point. On Tue, Apr 27, 2010 at 12:11 AM, Jonathan Ellis jbel...@gmail.com wrote: I think that once we have built-in indexing (CASSANDRA-749) you can make a good case for dropping supercolumns (at least, dropping them from the public API and reserving them for internal use). On Mon, Apr 26, 2010 at 11:05 AM, Schubert Zhang zson...@gmail.com wrote: I don't think SuperColumn is really necessary. I think this level of logic can be left to the application. Do you think so? If SuperColumn is needed, then as in https://issues.apache.org/jira/browse/CASSANDRA-598, we should build indexes at both the SuperColumn level and the SubColumn level. That makes for too many levels of index.
Re: Is SuperColumn necessary?
If I understand correctly, the distinction between supercolumns and subcolumns is critical to good database design if you want to use random partitioning: you can do range queries on subcolumns but not on supercolumns. Is this correct? On Mon, Apr 26, 2010 at 7:11 PM, Jonathan Ellis jbel...@gmail.com wrote: I think that once we have built-in indexing (CASSANDRA-749) you can make a good case for dropping supercolumns (at least, dropping them from the public API and reserving them for internal use). On Mon, Apr 26, 2010 at 11:05 AM, Schubert Zhang zson...@gmail.com wrote: I don't think SuperColumn is really necessary. I think this level of logic can be left to the application. Do you think so? If SuperColumn is needed, then as in https://issues.apache.org/jira/browse/CASSANDRA-598, we should build indexes at both the SuperColumn level and the SubColumn level. That makes for too many levels of index.
Re: Is SuperColumn necessary?
On Wed, Apr 28, 2010 at 5:24 AM, David Boxenhorn da...@lookin2.com wrote: If I understand correctly, the distinction between supercolumns and subcolumns is critical to good database design if you want to use random partitioning: you can do range queries on subcolumns but not on supercolumns. Is this correct?

You can do efficient range queries of normal (not super) columns in a ColumnFamily. I think SuperColumns are not indexed, so it's less efficient to do a slice of subcolumns from a SuperColumn if there are lots of subcolumns.

I agree that SuperColumns are technically unnecessary. There aren't any use cases I can come up with that a SuperColumn satisfies that normal Columns can't. You can simulate SuperColumn behavior by concatenating key parts with a separator and using the concatenated key as your column name, then doing a slice. So if you had a SuperColumn that stored usernames, and sub-columns that stored document IDs, you could instead have a normal CF that stores username:document-id.

The only thing SuperColumns appear to buy you (as someone pointed out to me at the Cassandra meetup - I think it was Eric Florenzano) is that you can use different comparator types for the Super/SubColumns, I guess..? But you should be able to do the same thing by creating your own Column comparator. I guess my point is that SuperColumns are mostly a convenience mechanism, as far as I can tell.

Mike
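A minimal sketch of the concatenated-name approach described above, using a sorted Python list to stand in for the sorted columns of a single row. No Cassandra client API is used here; `column_name` and `slice_prefix` are hypothetical helpers, and the sketch assumes the separator never appears inside a username and that names are compared as plain strings.

```python
import bisect

SEP = ":"  # assumed separator; must not occur inside a username

def column_name(username, doc_id):
    """Build a concatenated column name of the form username:document-id."""
    return f"{username}{SEP}{doc_id}"

def slice_prefix(sorted_names, username):
    """Return every column name belonging to one username.

    Works like a column slice: the start bound is 'username:' and the
    end bound is the first string that sorts after all names with that
    prefix (the separator replaced by the next character).
    """
    start = username + SEP
    end = username + chr(ord(SEP) + 1)
    lo = bisect.bisect_left(sorted_names, start)
    hi = bisect.bisect_left(sorted_names, end)
    return sorted_names[lo:hi]

# Columns of one row, kept in sorted order as a comparator would keep them.
names = sorted(
    column_name(user, doc)
    for user, doc in [
        ("alice", "doc1"),
        ("bob", "doc7"),
        ("alice", "doc2"),
        ("alicia", "doc9"),
    ]
)

alice_docs = slice_prefix(names, "alice")
print(alice_docs)
```

Including the separator in the start bound is what keeps "alicia" out of the slice for "alice"; a bare prefix match on "alice" would not be enough.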
Is SuperColumn necessary?
I don't think SuperColumn is really necessary. I think this level of logic can be left to the application. Do you think so? If SuperColumn is needed, then as in https://issues.apache.org/jira/browse/CASSANDRA-598, we should build indexes at both the SuperColumn level and the SubColumn level. That makes for too many levels of index.
Re: Is SuperColumn necessary?
I think that once we have built-in indexing (CASSANDRA-749) you can make a good case for dropping supercolumns (at least, dropping them from the public API and reserving them for internal use). On Mon, Apr 26, 2010 at 11:05 AM, Schubert Zhang zson...@gmail.com wrote: I don't think SuperColumn is really necessary. I think this level of logic can be left to the application. Do you think so? If SuperColumn is needed, then as in https://issues.apache.org/jira/browse/CASSANDRA-598, we should build indexes at both the SuperColumn level and the SubColumn level. That makes for too many levels of index.