Re: Is SuperColumn necessary?

2010-05-11 Thread vd
Hi

Can we do a range search on keys in ID:ID format? Would such a key be treated
as a single ID by the API, or can the API split it on ':'? If not, how can we
avoid supercolumns when we need to associate 'n' rows with a single ID?
For example:
  CatID1- articleID1
  CatID1- articleID2
  CatID1- articleID3
  CatID1- articleID4
How can we map such scenarios onto simple column families?

Rgds.
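
One possible reading of the CatID/articleID scenario, sketched in Python: use the category ID as the row key and store each article ID as a column name inside that row, so that a column slice returns the articles of one category. The `ColumnFamily` class and its `insert` and `get_slice` methods are hypothetical, in-memory stand-ins written for illustration; they are not the Cassandra client API.

```python
import bisect
from collections import defaultdict


class ColumnFamily:
    """Toy in-memory stand-in for a simple (non-super) column family."""

    def __init__(self):
        # Per row key: a sorted list of column names and a name -> value map.
        self._names = defaultdict(list)
        self._values = defaultdict(dict)

    def insert(self, row_key, column_name, value):
        if column_name not in self._values[row_key]:
            bisect.insort(self._names[row_key], column_name)
        self._values[row_key][column_name] = value

    def get_slice(self, row_key, start="", end=None, count=100):
        """Return up to `count` (name, value) pairs with start <= name <= end."""
        names = self._names.get(row_key, [])
        lo = bisect.bisect_left(names, start)
        hi = len(names) if end is None else bisect.bisect_right(names, end)
        return [(n, self._values[row_key][n]) for n in names[lo:hi][:count]]


articles_by_category = ColumnFamily()
for article_id in ["articleID3", "articleID1", "articleID4", "articleID2"]:
    articles_by_category.insert("CatID1", article_id, "")

# All articles of the category, then a sub-range of them.
all_articles = articles_by_category.get_slice("CatID1")
some_articles = articles_by_category.get_slice("CatID1", "articleID2", "articleID3")
```

In this layout the "n rows per ID" become n columns in one row, and the range search happens over column names rather than over row keys.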

On Tue, May 11, 2010 at 2:11 PM, Torsten Curdt tcu...@vafer.org wrote:
 Exactly.

 On Tue, May 11, 2010 at 10:20, David Boxenhorn da...@lookin2.com wrote:
 Don't think of it as getting rid of supercolumns. Think of it as adding
 superdupercolumns, supertriplecolumns, etc. Or, in sparse array terminology:
 array[dim1][dim2][dim3]...[dimN] = value

 Or, as said above:

 <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
   <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
     <Column Name="ThingThatsNowSuperColumnName" Type="Long">
       <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
         <Column Name="ThingThatCantCurrentlyBeRepresented"/>
       </Column>
     </Column>
   </Column>
 </Column>



Re: Is SuperColumn necessary?

2010-05-11 Thread Jonathan Shook
This is one of the sticking points with the key concatenation
argument. You can't simply access subpartitions of data along an
aggregate name using a concatenated key unless you can efficiently
address a range of the keys according to a property of a subset. I'm
hoping this will bear out with more of this discussion.

Another facet of this issue is performance with respect to storage
layout. Presently columns within a row are inherently organized for
efficient range operations. The key space is not generally optimal in
this way. I'm hoping to see some discussion of this, as well.
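
To make the key-concatenation point concrete, here is a minimal Python sketch. It assumes row keys are plain strings of the form `category:article` and that they can be scanned in sorted order, which in Cassandra depends on the partitioner in use. The `prefix_range` helper and the sample keys are hypothetical and only illustrate the idea.

```python
import bisect

# Hypothetical row keys built by concatenating category and article IDs.
row_keys = [
    "CatID2:articleID1",
    "CatID1:articleID3",
    "CatID1:articleID1",
    "CatID10:articleID1",
    "CatID1:articleID2",
]


def prefix_range(sorted_keys, prefix):
    """Return all keys starting with the non-empty `prefix`.

    Only valid when `sorted_keys` is already in sorted order.
    """
    lo = bisect.bisect_left(sorted_keys, prefix)
    # Exclusive upper bound: bump the last character of the prefix so the
    # bound sorts after every key that starts with the prefix.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect.bisect_left(sorted_keys, upper)
    return sorted_keys[lo:hi]


# The range lookup only works because the keys are ordered here.
ordered = sorted(row_keys)
cat1 = prefix_range(ordered, "CatID1:")
```

If keys are placed by hash instead, keys sharing a prefix are scattered and no contiguous range exists to address, which is the sticking point described above. Columns within a single row are always kept sorted, so the same slice is cheap there.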

On Tue, May 11, 2010 at 6:17 AM, vd vineetdan...@gmail.com wrote:
 Hi

 Can we do a range search on keys in ID:ID format? Would such a key be treated
 as a single ID by the API, or can the API split it on ':'? If not, how can we
 avoid supercolumns when we need to associate 'n' rows with a single ID?
 For example:
          CatID1- articleID1
          CatID1- articleID2
          CatID1- articleID3
          CatID1- articleID4
 How can we map such scenarios onto simple column families?

 Rgds.

 On Tue, May 11, 2010 at 2:11 PM, Torsten Curdt tcu...@vafer.org wrote:
 Exactly.

 On Tue, May 11, 2010 at 10:20, David Boxenhorn da...@lookin2.com wrote:
 Don't think of it as getting rid of supercolumns. Think of it as adding
 superdupercolumns, supertriplecolumns, etc. Or, in sparse array terminology:
 array[dim1][dim2][dim3]...[dimN] = value

 Or, as said above:

 <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
   <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
     <Column Name="ThingThatsNowSuperColumnName" Type="Long">
       <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
         <Column Name="ThingThatCantCurrentlyBeRepresented"/>
       </Column>
     </Column>
   </Column>
 </Column>




Re: Is SuperColumn necessary?

2010-05-11 Thread Schubert Zhang
Hi Stu,
Thanks for your hard work. It is not an easy job.

After days of reading the code with my partners, we are convinced that the
current storage-layer code should be rewritten for a clearer implementation.


On Tue, May 11, 2010 at 12:44 AM, Stu Hood stu.h...@rackspace.com wrote:

 I think that it is 100% ideal: it's what I've been working on implementing
 in #674, #847 and #998. I'm hoping to post a large patchset and docs this
 week, and I'm aiming to get it committed for 0.8.

 The work I've been doing doesn't touch the user interface: it only deals
 with the internal changes necessary to make this type of storage possible.


 -----Original Message-----
 From: Mike Malone m...@simplegeo.com
 Sent: Monday, May 10, 2010 11:37am
 To: user@cassandra.apache.org
 Subject: Re: Is SuperColumn necessary?

 Maybe... but honestly, it doesn't affect the architecture or interface at
 all. I'm more interested in thinking about how the system should work than
 what things are called. Naming things is important, but that can happen
 later.

 Does anyone have any thoughts or comments on the architecture I suggested
 earlier?

 Mike

 On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang zson...@gmail.com wrote:

  Yes, the term "column" is not appropriate here.
  Maybe we need not create new terms; in Google's Bigtable, the term
  "qualifier" is a good one.
 
 
  On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn da...@lookin2.com
 wrote:
 
  That would be a good time to get rid of the confusing column term,
 which
  incorrectly suggests a two-dimensional tabular structure.
 
  Suggestions:
 
  1. A hypercube (or hypocube, if only two dimensions): replace key and
  column with 1st dimension, 2nd dimension, etc.
 
  2. A file system: replace key and column with directory and
  subdirectory
 
  3. A tuple tree: Column family replaced by top-level tuple, whose
 value
  is the set of keys, whose value is the set of supercolumns of the key,
 whose
  value is the set of columns for the supercolumn, etc.
 
  4. Etc.
 
  On Thu, May 6, 2010 at 2:28 AM, Mike Malone m...@simplegeo.com wrote:
 
  Nice, Ed, we're doing something very similar but less generic.
 
  Now replace all of the various methods for querying with a simple query
  interface that takes a Predicate, allow the user to specify (in
  storage-conf) which levels of the nested Columns should be indexed, and
  completely remove Comparators and have people subclass Column /
 implement
  IColumn and we'd really be on to something ;).
 
  Mock storage-conf.xml:
  <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
    <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
      <Column Name="ThingThatsNowSuperColumnName" Type="Long">
        <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
          <Column Name="ThingThatCantCurrentlyBeRepresented"/>
        </Column>
      </Column>
    </Column>
  </Column>
 
  Thrift:
struct NamePredicate {
  1: required list<binary> column_names,
}
struct SlicePredicate {
  1: required binary start,
  2: required binary end,
}
struct CountPredicate {
  1: required struct predicate,
  2: required i32 count=100,
}
struct AndPredicate {
  1: required Predicate left,
  2: required Predicate right,
}
struct SubColumnsPredicate {
  1: required Predicate columns,
  2: required Predicate subcolumns,
}
... OrPredicate, OtherUsefulPredicates ...
query(predicate, count, consistency_level) # Count here would be
 total
  count of leaf values returned, whereas CountPredicate specifies a
 column
  count for a particular sub-slice.
 
  Not fully baked... but I think this could really simplify stuff and
 make
  it more flexible. Downside is it may give people enough rope to hang
  themselves, but at least the predicate stuff is easily distributable.
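
To illustrate how such predicates could compose, here is a small Python sketch that mirrors four of the proposed Thrift structs as dataclasses and evaluates them against a sorted list of column names. The `select` function is hypothetical, and two behaviors are assumptions rather than anything stated in the proposal: slice bounds are treated as inclusive, and `AndPredicate` is treated as an intersection.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class NamePredicate:
    column_names: List[bytes]


@dataclass
class SlicePredicate:
    start: bytes
    end: bytes


@dataclass
class CountPredicate:
    predicate: "Predicate"
    count: int = 100


@dataclass
class AndPredicate:
    left: "Predicate"
    right: "Predicate"


Predicate = Union[NamePredicate, SlicePredicate, CountPredicate, AndPredicate]


def select(predicate, names):
    """Return the column names from sorted `names` that `predicate` selects."""
    if isinstance(predicate, NamePredicate):
        wanted = set(predicate.column_names)
        return [n for n in names if n in wanted]
    if isinstance(predicate, SlicePredicate):
        # Assumption: both bounds are inclusive.
        return [n for n in names if predicate.start <= n <= predicate.end]
    if isinstance(predicate, CountPredicate):
        return select(predicate.predicate, names)[: predicate.count]
    if isinstance(predicate, AndPredicate):
        # Assumption: "and" means both sides must select the name.
        right = set(select(predicate.right, names))
        return [n for n in select(predicate.left, names) if n in right]
    raise TypeError(f"unsupported predicate: {predicate!r}")


columns = sorted([b"a", b"b", b"c", b"d", b"e"])
first_two_in_slice = select(CountPredicate(SlicePredicate(b"b", b"e"), count=2), columns)
```

Because each predicate is plain data, a tree like this could be serialized and shipped to whichever node holds the columns, which is one way to read "easily distributable".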
 
  I'm thinking I'll play around with implementing some of this stuff
 myself
  if I have any free time in the near future.
 
  Mike
 
 
  On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.com
 wrote:
 
  Very interesting, thanks!
 
  On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote:
   Follow-up from last week's discussion, I've been playing around with
 a
  simple
   column comparator for composite column names that I put up on
 github.
  I'd
   be interested to hear what people think of this approach.
  
   http://github.com/edanuff/CassandraCompositeType
  
   Ed
  
   On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote:
  
   It might make sense to create a CompositeType subclass of AbstractType
   for the purpose of constructing and comparing these types of composite
   column names, so that you could more easily do that sort of thing rather
   than having to concatenate into one big string.
  
   On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com
  wrote:
  
   The only thing SuperColumns appear

Re: Is SuperColumn necessary?

2010-05-11 Thread Mike Malone
On Tue, May 11, 2010 at 7:46 AM, David Boxenhorn da...@lookin2.com wrote:

 I would like an API with a variable number of arguments. Using Java
 varargs, something like

 value = keyspace.get("articles", "cars", "John Smith", "2010-05-01",
 "comment-25");

 or

 valueArray = keyspace.get("articles", predicate1, predicate2, predicate3,
 predicate4);


Hrm. I haven't dug that deeply into the joys of predicate logic,
propositional DAGs, etc., but couldn't this also be represented as a nested
tree of predicates / other primitives? So it would be something like:

   SubColumns = Transformation that takes a predicate, applies it to a
Column, then gets its SubColumns
   keyspace.get("articles", SubColumns(predicate1, SubColumns(predicate2,
SubColumns(predicate3, predicate4))));

It's more functional-programming-ish, I suppose, but I think that model
might apply more cleanly here. FP does tend to result in nice clean
algorithms for manipulating large data sets.

Mike
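
A minimal Python sketch of the nested-transformation idea, using nested dicts as a stand-in for the multi-level column structure. Everything here is hypothetical: `sub_columns` plays the role of the SubColumns transformation, `leaves` is an extra helper that applies the innermost predicate to leaf values, and the sample data is invented for illustration.

```python
def sub_columns(predicate, child):
    """Select columns matching `predicate`, then apply `child` to each one's subcolumns."""
    def apply(node):
        return {
            name: child(sub)
            for name, sub in node.items()
            if predicate(name)
        }
    return apply


def leaves(predicate):
    """Innermost step: keep the leaf values whose names match `predicate`."""
    def apply(node):
        return {name: value for name, value in node.items() if predicate(name)}
    return apply


def equals(expected):
    return lambda name: name == expected


def starts_with(prefix):
    return lambda name: name.startswith(prefix)


# Hypothetical data: category -> author -> date -> comment id -> text
articles = {
    "cars": {
        "John Smith": {
            "2010-05-01": {"comment-25": "Nice car", "comment-26": "Agreed"},
            "2010-04-30": {"comment-7": "Older"},
        },
        "Jane Doe": {"2010-05-01": {"comment-3": "Hello"}},
    },
    "boats": {},
}

query = sub_columns(
    equals("cars"),
    sub_columns(equals("John Smith"),
                sub_columns(starts_with("2010-05"), leaves(equals("comment-25")))),
)
result = query(articles)
```

Each level is just a function from a subtree to a filtered subtree, so exact-name lookups and range-like predicates can be mixed freely at any depth.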




 The storage layout would be determined by the configuration, as below:

 Column Name=ThingThatsNowKey Indexed=True ClusterPartitioned=True
 ...




 On Tue, May 11, 2010 at 5:26 PM, Jonathan Shook jsh...@gmail.com wrote:

 This is one of the sticking points with the key concatenation
 argument. You can't simply access subpartitions of data along an
 aggregate name using a concatenated key unless you can efficiently
 address a range of the keys according to a property of a subset. I'm
 hoping this will bear out with more of this discussion.

 Another facet of this issue is performance with respect to storage
 layout. Presently columns within a row are inherently organized for
 efficient range operations. The key space is not generally optimal in
 this way. I'm hoping to see some discussion of this, as well.

 On Tue, May 11, 2010 at 6:17 AM, vd vineetdan...@gmail.com wrote:
  Hi
 
  Can we do a range search on keys in ID:ID format? Would such a key be
  treated as a single ID by the API, or can the API split it on ':'? If not,
  how can we avoid supercolumns when we need to associate 'n' rows with a
  single ID?
  For example:
   CatID1- articleID1
   CatID1- articleID2
   CatID1- articleID3
   CatID1- articleID4
  How can we map such scenarios onto simple column families?
 
  Rgds.
 
  On Tue, May 11, 2010 at 2:11 PM, Torsten Curdt tcu...@vafer.org
 wrote:
  Exactly.
 
  On Tue, May 11, 2010 at 10:20, David Boxenhorn da...@lookin2.com
 wrote:
  Don't think of it as getting rid of supercolumns. Think of it as adding
  superdupercolumns, supertriplecolumns, etc. Or, in sparse array
  terminology:
  array[dim1][dim2][dim3]...[dimN] = value
 
  Or, as said above:
 
  <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
    <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
      <Column Name="ThingThatsNowSuperColumnName" Type="Long">
        <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
          <Column Name="ThingThatCantCurrentlyBeRepresented"/>
        </Column>
      </Column>
    </Column>
  </Column>
 
 





Re: Is SuperColumn necessary?

2010-05-10 Thread Schubert Zhang
Yes, the term "column" is not appropriate here.
Maybe we need not create new terms; in Google's Bigtable, the term
"qualifier" is a good one.

On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn da...@lookin2.com wrote:

 That would be a good time to get rid of the confusing column term, which
 incorrectly suggests a two-dimensional tabular structure.

 Suggestions:

 1. A hypercube (or hypocube, if only two dimensions): replace key and
 column with 1st dimension, 2nd dimension, etc.

 2. A file system: replace key and column with directory and
 subdirectory

 3. A tuple tree: Column family replaced by top-level tuple, whose value
 is the set of keys, whose value is the set of supercolumns of the key, whose
 value is the set of columns for the supercolumn, etc.

 4. Etc.

 On Thu, May 6, 2010 at 2:28 AM, Mike Malone m...@simplegeo.com wrote:

 Nice, Ed, we're doing something very similar but less generic.

 Now replace all of the various methods for querying with a simple query
 interface that takes a Predicate, allow the user to specify (in
 storage-conf) which levels of the nested Columns should be indexed, and
 completely remove Comparators and have people subclass Column / implement
 IColumn and we'd really be on to something ;).

 Mock storage-conf.xml:
 <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
   <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
     <Column Name="ThingThatsNowSuperColumnName" Type="Long">
       <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
         <Column Name="ThingThatCantCurrentlyBeRepresented"/>
       </Column>
     </Column>
   </Column>
 </Column>

 Thrift:
   struct NamePredicate {
 1: required list<binary> column_names,
   }
   struct SlicePredicate {
 1: required binary start,
 2: required binary end,
   }
   struct CountPredicate {
 1: required struct predicate,
 2: required i32 count=100,
   }
   struct AndPredicate {
 1: required Predicate left,
 2: required Predicate right,
   }
   struct SubColumnsPredicate {
 1: required Predicate columns,
 2: required Predicate subcolumns,
   }
   ... OrPredicate, OtherUsefulPredicates ...
   query(predicate, count, consistency_level) # Count here would be total
 count of leaf values returned, whereas CountPredicate specifies a column
 count for a particular sub-slice.

 Not fully baked... but I think this could really simplify stuff and make
 it more flexible. Downside is it may give people enough rope to hang
 themselves, but at least the predicate stuff is easily distributable.

 I'm thinking I'll play around with implementing some of this stuff myself
 if I have any free time in the near future.

 Mike


 On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.com wrote:

 Very interesting, thanks!

 On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote:
  Follow-up from last week's discussion, I've been playing around with a
 simple
  column comparator for composite column names that I put up on github.
 I'd
  be interested to hear what people think of this approach.
 
  http://github.com/edanuff/CassandraCompositeType
 
  Ed
 
  On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote:
 
  It might make sense to create a CompositeType subclass of AbstractType
  for the purpose of constructing and comparing these types of composite
  column names, so that you could more easily do that sort of thing rather
  than having to concatenate into one big string.
 
  On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com
 wrote:
 
  The only thing SuperColumns appear to buy you (as someone pointed out
 to
  me at the Cassandra meetup - I think it was Eric Florenzano) is that
 you can
  use different comparator types for the Super/SubColumns, I guess..?
 But you
  should be able to do the same thing by creating your own Column
 comparator.
  I guess my point is that SuperColumns are mostly a convenience
 mechanism, as
  far as I can tell.
  Mike
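
The comparator argument in the quoted messages can be sketched in Python. This is an assumption-laden illustration and not the code from the CassandraCompositeType repository: composite column names are modeled as tuples, and `composite_comparator` is a hypothetical helper that applies one comparator per component, which is the per-level comparator choice that SuperColumns are said to provide.

```python
from functools import cmp_to_key


def compare_long(a, b):
    """Three-way comparison for integer components."""
    return (a > b) - (a < b)


def compare_utf8(a, b):
    """Three-way comparison for text components."""
    return (a > b) - (a < b)


def composite_comparator(*component_comparators):
    """Build a comparator for tuple-valued column names, one comparator per component."""
    def compare(left, right):
        for component_cmp, a, b in zip(component_comparators, left, right):
            result = component_cmp(a, b)
            if result != 0:
                return result
        # A shorter name that is a prefix of a longer one sorts first.
        return compare_long(len(left), len(right))
    return compare


# Hypothetical composite names: (integer component, text component).
names = [(10, "b"), (9, "z"), (10, "a"), (100, "a")]
by_composite = sorted(names, key=cmp_to_key(composite_comparator(compare_long, compare_utf8)))

# The same data concatenated into single strings sorts differently.
by_string = sorted(f"{n}:{s}" for n, s in names)
```

With typed components, 9 sorts before 10 and 100. With one concatenated string, "100:a" sorts first and "9:z" last, which is the kind of ordering problem a composite comparator avoids.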
 
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com






Re: Is SuperColumn necessary?

2010-05-10 Thread Mike Malone
Maybe... but honestly, it doesn't affect the architecture or interface at
all. I'm more interested in thinking about how the system should work than
what things are called. Naming things is important, but that can happen
later.

Does anyone have any thoughts or comments on the architecture I suggested
earlier?

Mike

On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang zson...@gmail.com wrote:

 Yes, the term "column" is not appropriate here.
 Maybe we need not create new terms; in Google's Bigtable, the term
 "qualifier" is a good one.


 On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn da...@lookin2.com wrote:

 That would be a good time to get rid of the confusing column term, which
 incorrectly suggests a two-dimensional tabular structure.

 Suggestions:

 1. A hypercube (or hypocube, if only two dimensions): replace key and
 column with 1st dimension, 2nd dimension, etc.

 2. A file system: replace key and column with directory and
 subdirectory

 3. A tuple tree: Column family replaced by top-level tuple, whose value
 is the set of keys, whose value is the set of supercolumns of the key, whose
 value is the set of columns for the supercolumn, etc.

 4. Etc.

 On Thu, May 6, 2010 at 2:28 AM, Mike Malone m...@simplegeo.com wrote:

 Nice, Ed, we're doing something very similar but less generic.

 Now replace all of the various methods for querying with a simple query
 interface that takes a Predicate, allow the user to specify (in
 storage-conf) which levels of the nested Columns should be indexed, and
 completely remove Comparators and have people subclass Column / implement
 IColumn and we'd really be on to something ;).

 Mock storage-conf.xml:
 <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
   <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
     <Column Name="ThingThatsNowSuperColumnName" Type="Long">
       <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
         <Column Name="ThingThatCantCurrentlyBeRepresented"/>
       </Column>
     </Column>
   </Column>
 </Column>

 Thrift:
   struct NamePredicate {
 1: required list<binary> column_names,
   }
   struct SlicePredicate {
 1: required binary start,
 2: required binary end,
   }
   struct CountPredicate {
 1: required struct predicate,
 2: required i32 count=100,
   }
   struct AndPredicate {
 1: required Predicate left,
 2: required Predicate right,
   }
   struct SubColumnsPredicate {
 1: required Predicate columns,
 2: required Predicate subcolumns,
   }
   ... OrPredicate, OtherUsefulPredicates ...
   query(predicate, count, consistency_level) # Count here would be total
 count of leaf values returned, whereas CountPredicate specifies a column
 count for a particular sub-slice.

 Not fully baked... but I think this could really simplify stuff and make
 it more flexible. Downside is it may give people enough rope to hang
 themselves, but at least the predicate stuff is easily distributable.

 I'm thinking I'll play around with implementing some of this stuff myself
 if I have any free time in the near future.

 Mike


 On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.com wrote:

 Very interesting, thanks!

 On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote:
  Follow-up from last week's discussion, I've been playing around with a
 simple
  column comparator for composite column names that I put up on github.
 I'd
  be interested to hear what people think of this approach.
 
  http://github.com/edanuff/CassandraCompositeType
 
  Ed
 
  On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote:
 
  It might make sense to create a CompositeType subclass of AbstractType
  for the purpose of constructing and comparing these types of composite
  column names, so that you could more easily do that sort of thing rather
  than having to concatenate into one big string.
 
  On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com
 wrote:
 
  The only thing SuperColumns appear to buy you (as someone pointed
 out to
  me at the Cassandra meetup - I think it was Eric Florenzano) is that
 you can
  use different comparator types for the Super/SubColumns, I guess..?
 But you
  should be able to do the same thing by creating your own Column
 comparator.
  I guess my point is that SuperColumns are mostly a convenience
 mechanism, as
  far as I can tell.
  Mike
 
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com







Re: Is SuperColumn necessary?

2010-05-10 Thread Stu Hood
I think that it is 100% ideal: it's what I've been working on implementing in 
#674, #847 and #998. I'm hoping to post a large patchset and docs this week, 
and I'm aiming to get it committed for 0.8.

The work I've been doing doesn't touch the user interface: it only deals with 
the internal changes necessary to make this type of storage possible.


-----Original Message-----
From: Mike Malone m...@simplegeo.com
Sent: Monday, May 10, 2010 11:37am
To: user@cassandra.apache.org
Subject: Re: Is SuperColumn necessary?

Maybe... but honestly, it doesn't affect the architecture or interface at
all. I'm more interested in thinking about how the system should work than
what things are called. Naming things is important, but that can happen
later.

Does anyone have any thoughts or comments on the architecture I suggested
earlier?

Mike

On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang zson...@gmail.com wrote:

 Yes, the term "column" is not appropriate here.
 Maybe we need not create new terms; in Google's Bigtable, the term
 "qualifier" is a good one.


 On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn da...@lookin2.com wrote:

 That would be a good time to get rid of the confusing column term, which
 incorrectly suggests a two-dimensional tabular structure.

 Suggestions:

 1. A hypercube (or hypocube, if only two dimensions): replace key and
 column with 1st dimension, 2nd dimension, etc.

 2. A file system: replace key and column with directory and
 subdirectory

 3. A tuple tree: Column family replaced by top-level tuple, whose value
 is the set of keys, whose value is the set of supercolumns of the key, whose
 value is the set of columns for the supercolumn, etc.

 4. Etc.

 On Thu, May 6, 2010 at 2:28 AM, Mike Malone m...@simplegeo.com wrote:

 Nice, Ed, we're doing something very similar but less generic.

 Now replace all of the various methods for querying with a simple query
 interface that takes a Predicate, allow the user to specify (in
 storage-conf) which levels of the nested Columns should be indexed, and
 completely remove Comparators and have people subclass Column / implement
 IColumn and we'd really be on to something ;).

 Mock storage-conf.xml:
 <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
   <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
     <Column Name="ThingThatsNowSuperColumnName" Type="Long">
       <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
         <Column Name="ThingThatCantCurrentlyBeRepresented"/>
       </Column>
     </Column>
   </Column>
 </Column>

 Thrift:
   struct NamePredicate {
 1: required list<binary> column_names,
   }
   struct SlicePredicate {
 1: required binary start,
 2: required binary end,
   }
   struct CountPredicate {
 1: required struct predicate,
 2: required i32 count=100,
   }
   struct AndPredicate {
 1: required Predicate left,
 2: required Predicate right,
   }
   struct SubColumnsPredicate {
 1: required Predicate columns,
 2: required Predicate subcolumns,
   }
   ... OrPredicate, OtherUsefulPredicates ...
   query(predicate, count, consistency_level) # Count here would be total
 count of leaf values returned, whereas CountPredicate specifies a column
 count for a particular sub-slice.

 Not fully baked... but I think this could really simplify stuff and make
 it more flexible. Downside is it may give people enough rope to hang
 themselves, but at least the predicate stuff is easily distributable.

 I'm thinking I'll play around with implementing some of this stuff myself
 if I have any free time in the near future.

 Mike


 On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.com wrote:

 Very interesting, thanks!

 On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote:
  Follow-up from last week's discussion, I've been playing around with a
 simple
  column comparator for composite column names that I put up on github.
 I'd
  be interested to hear what people think of this approach.
 
  http://github.com/edanuff/CassandraCompositeType
 
  Ed
 
  On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote:
 
  It might make sense to create a CompositeType subclass of AbstractType
  for the purpose of constructing and comparing these types of composite
  column names, so that you could more easily do that sort of thing rather
  than having to concatenate into one big string.
 
  On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com
 wrote:
 
  The only thing SuperColumns appear to buy you (as someone pointed
 out to
  me at the Cassandra meetup - I think it was Eric Florenzano) is that
 you can
  use different comparator types for the Super/SubColumns, I guess..?
 But you
  should be able to do the same thing by creating your own Column
 comparator.
  I guess my point is that SuperColumns are mostly a convenience
 mechanism, as
  far as I can tell.
  Mike
 
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra

Re: Is SuperColumn necessary?

2010-05-10 Thread Mike Malone
On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook jsh...@gmail.com wrote:

 I have to disagree about the naming of things. The name of something
 isn't just a literal identifier. It affects the way people think about
 it. For new users, the whole naming thing has been a persistent
 barrier.


I'm saying we shouldn't be worried too much about coming up with names and
analogies until we've decided what it is we're naming.


 As for your suggestions, I'm all for simplifying or generalizing the
 "how it works" part down to a more generalized set of operations. I'm
 not sure it's a good idea to require users to think in terms of building
 up a fluffy query structure just to thread it through a needle of an
 API, even for the simplest of queries. At some point, the level of
 generic boilerplate takes away from the semantic hand rails that
 developers like. So I guess I'm suggesting that "how it works" and
 "how we use it" are not always exactly the same. At least they should
 both hinge on a common conceptual model, which is where the naming
 becomes an important anchoring point.


If things are done properly, client libraries could expose simplified query
interfaces without much effort. Most ORMs these days work by building a
propositional directed acyclic graph that's serialized to SQL. This would
work the same way, but it wouldn't be converted into a 4GL.

Mike
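
As a rough illustration of how a client library might hide the query structure, the Python sketch below turns a simple path-style call into a nested query tree and then evaluates that tree directly against nested dicts, without rendering it to a query language. All names (`name_is`, `sub_columns`, `get`, `execute`) and the tuple-based tree encoding are hypothetical.

```python
def name_is(value):
    """Leaf node of the query tree: match one exact name."""
    return ("name", value)


def sub_columns(predicate, child):
    """Interior node: apply `predicate` at this level, then `child` one level down."""
    return ("sub_columns", predicate, child)


def get(*path):
    """Convenience wrapper: turn get("a", "b", "c") into a nested query tree."""
    if not path:
        raise ValueError("path must contain at least one name")
    tree = name_is(path[-1])
    for name in reversed(path[:-1]):
        tree = sub_columns(name_is(name), tree)
    return tree


def execute(tree, node):
    """Evaluate a query tree against nested dicts; returns None if nothing matches."""
    if tree[0] == "name":
        return node.get(tree[1])
    _, (_, name), child = tree
    return execute(child, node.get(name, {}))


data = {"cars": {"John Smith": {"2010-05-01": {"comment-25": "Nice car"}}}}
tree = get("cars", "John Smith", "2010-05-01", "comment-25")
value = execute(tree, data)
```

The caller only ever writes the short `get(...)` form; the generic tree exists underneath for the cases where the simple form is not enough.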



 Jonathan

 On Mon, May 10, 2010 at 11:37 AM, Mike Malone m...@simplegeo.com wrote:
  Maybe... but honestly, it doesn't affect the architecture or interface at
  all. I'm more interested in thinking about how the system should work
 than
  what things are called. Naming things is important, but that can happen
  later.
  Does anyone have any thoughts or comments on the architecture I suggested
  earlier?
 
  Mike
 
  On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang zson...@gmail.com
 wrote:
 
  Yes, the term "column" is not appropriate here.
  Maybe we need not create new terms; in Google's Bigtable, the term
  "qualifier" is a good one.
 
  On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn da...@lookin2.com
 wrote:
 
  That would be a good time to get rid of the confusing column term,
  which incorrectly suggests a two-dimensional tabular structure.
 
  Suggestions:
 
  1. A hypercube (or hypocube, if only two dimensions): replace key and
  column with 1st dimension, 2nd dimension, etc.
 
  2. A file system: replace key and column with directory and
  subdirectory
 
  3. A tuple tree: Column family replaced by top-level tuple, whose
 value
  is the set of keys, whose value is the set of supercolumns of the key,
 whose
  value is the set of columns for the supercolumn, etc.
 
  4. Etc.
 
  On Thu, May 6, 2010 at 2:28 AM, Mike Malone m...@simplegeo.com
 wrote:
 
  Nice, Ed, we're doing something very similar but less generic.
  Now replace all of the various methods for querying with a simple
 query
  interface that takes a Predicate, allow the user to specify (in
  storage-conf) which levels of the nested Columns should be indexed,
 and
  completely remove Comparators and have people subclass Column /
 implement
  IColumn and we'd really be on to something ;).
  Mock storage-conf.xml:
  <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
    <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
      <Column Name="ThingThatsNowSuperColumnName" Type="Long">
        <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
          <Column Name="ThingThatCantCurrentlyBeRepresented"/>
        </Column>
      </Column>
    </Column>
  </Column>
  Thrift:
struct NamePredicate {
  1: required list<binary> column_names,
}
struct SlicePredicate {
  1: required binary start,
  2: required binary end,
}
struct CountPredicate {
  1: required struct predicate,
  2: required i32 count=100,
}
struct AndPredicate {
  1: required Predicate left,
  2: required Predicate right,
}
struct SubColumnsPredicate {
  1: required Predicate columns,
  2: required Predicate subcolumns,
}
... OrPredicate, OtherUsefulPredicates ...
query(predicate, count, consistency_level) # Count here would be
 total
  count of leaf values returned, whereas CountPredicate specifies a
 column
  count for a particular sub-slice.
  Not fully baked... but I think this could really simplify stuff and
 make
  it more flexible. Downside is it may give people enough rope to hang
  themselves, but at least the predicate stuff is easily distributable.
  I'm thinking I'll play around with implementing some of this stuff
  myself if I have any free time in the near future.
  Mike
 
  On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.com
  wrote:
 
  Very interesting, thanks!
 
  On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote:
   Follow-up from last week's discussion, I've been playing around with
 a
   simple
   column comparator for composite column names that I put up 

Re: Is SuperColumn necessary?

2010-05-10 Thread Jonathan Shook
Agreed

On Mon, May 10, 2010 at 12:01 PM, Mike Malone m...@simplegeo.com wrote:
 On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook jsh...@gmail.com wrote:

 I have to disagree about the naming of things. The name of something
 isn't just a literal identifier. It affects the way people think about
 it. For new users, the whole naming thing has been a persistent
 barrier.

 I'm saying we shouldn't be worried too much about coming up with names and
 analogies until we've decided what it is we're naming.


 As for your suggestions, I'm all for simplifying or generalizing the
 "how it works" part down to a more generalized set of operations. I'm
 not sure it's a good idea to require users to think in terms of building
 up a fluffy query structure just to thread it through a needle of an
 API, even for the simplest of queries. At some point, the level of
 generic boilerplate takes away from the semantic hand rails that
 developers like. So I guess I'm suggesting that "how it works" and
 "how we use it" are not always exactly the same. At least they should
 both hinge on a common conceptual model, which is where the naming
 becomes an important anchoring point.

 If things are done properly, client libraries could expose simplified query
 interfaces without much effort. Most ORMs these days work by building a
 propositional directed acyclic graph that's serialized to SQL. This would
 work the same way, but it wouldn't be converted into a 4GL.
 Mike


 Jonathan

 On Mon, May 10, 2010 at 11:37 AM, Mike Malone m...@simplegeo.com wrote:
  Maybe... but honestly, it doesn't affect the architecture or interface
  at
  all. I'm more interested in thinking about how the system should work
  than
  what things are called. Naming things is important, but that can happen
  later.
  Does anyone have any thoughts or comments on the architecture I
  suggested
  earlier?
 
  Mike
 
  On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang zson...@gmail.com
  wrote:
 
  Yes, the term "column" is not appropriate here.
  Maybe we need not create new terms; in Google's Bigtable, the term
  "qualifier" is a good one.
 
  On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn da...@lookin2.com
  wrote:
 
  That would be a good time to get rid of the confusing "column" term,
  which incorrectly suggests a two-dimensional tabular structure.
 
  Suggestions:
 
  1. A hypercube (or hypocube, if only two dimensions): replace "key" and
  "column" with "1st dimension", "2nd dimension", etc.
 
  2. A file system: replace "key" and "column" with "directory" and
  "subdirectory".
 
  3. A tuple tree: "column family" is replaced by a top-level tuple, whose
  value is the set of keys, whose value is the set of supercolumns of the
  key, whose value is the set of columns for the supercolumn, etc.
 
  4. Etc.
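
One way to picture the hypercube and tuple-tree analogies is as a nested map in which every level is just another dimension. The following is a minimal Python sketch of that idea only; the `tree` helper and the sample keys are hypothetical and say nothing about how Cassandra stores data.

```python
from collections import defaultdict


def tree():
    """A nested mapping that creates missing levels on first access."""
    return defaultdict(tree)


store = tree()

# column family -> key -> supercolumn -> column = value
store["Blog"]["b1"]["comments"]["c1"] = "text 1"
store["Blog"]["b1"]["post"]["title"] = "bt1"

# Each level is just another dimension; nothing limits the depth to four.
store["Blog"]["b1"]["comments"]["c2"]["replies"]["r1"] = "text 2"
```

In this picture a "supercolumn" is simply the third level of the tree, which is why the naming suggestions above generalize it away.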
 
  On Thu, May 6, 2010 at 2:28 AM, Mike Malone m...@simplegeo.com
  wrote:
 
  Nice, Ed, we're doing something very similar but less generic.
  Now replace all of the various methods for querying with a simple query
  interface that takes a Predicate, allow the user to specify (in
  storage-conf) which levels of the nested Columns should be indexed, and
  completely remove Comparators and have people subclass Column / implement
  IColumn and we'd really be on to something ;).
  Mock storage-conf.xml:
    <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
      <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
        <Column Name="ThingThatsNowSuperColumnName" Type="Long">
          <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
            <Column Name="ThingThatCantCurrentlyBeRepresented"/>
          </Column>
        </Column>
      </Column>
    </Column>
  Thrift:
    struct NamePredicate {
      1: required list<binary> column_names,
    }
    struct SlicePredicate {
      1: required binary start,
      2: required binary end,
    }
    struct CountPredicate {
      1: required struct predicate,
      2: required i32 count=100,
    }
    struct AndPredicate {
      1: required Predicate left,
      2: required Predicate right,
    }
    struct SubColumnsPredicate {
      1: required Predicate columns,
      2: required Predicate subcolumns,
    }
    ... OrPredicate, OtherUsefulPredicates ...
    query(predicate, count, consistency_level)
    # Count here would be the total count of leaf values returned, whereas
    # CountPredicate specifies a column count for a particular sub-slice.
  Not fully baked... but I think this could really simplify stuff and make
  it more flexible. The downside is it may give people enough rope to hang
  themselves, but at least the predicate stuff is easily distributable.
  I'm thinking I'll play around with implementing some of this stuff
  myself if I have any free time in the near future.
  Mike
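
The predicate proposal above can be sketched in a few lines of Python. This is only an illustration of how composable predicates might be evaluated against one row's column names. It assumes string names, inclusive slice bounds, and an empty string meaning "unbounded"; none of that is specified in the Thrift mock-up, and the `matches` and `query` helpers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class NamePredicate:
    column_names: List[str]


@dataclass
class SlicePredicate:
    start: str  # "" is assumed to mean "unbounded"
    end: str    # "" is assumed to mean "unbounded"


@dataclass
class AndPredicate:
    left: "Predicate"
    right: "Predicate"


Predicate = Union[NamePredicate, SlicePredicate, AndPredicate]


def matches(pred, name: str) -> bool:
    """Return True if a column name satisfies the predicate."""
    if isinstance(pred, NamePredicate):
        return name in pred.column_names
    if isinstance(pred, SlicePredicate):
        after_start = pred.start == "" or name >= pred.start
        before_end = pred.end == "" or name <= pred.end
        return after_start and before_end
    if isinstance(pred, AndPredicate):
        return matches(pred.left, name) and matches(pred.right, name)
    raise TypeError(f"unknown predicate: {pred!r}")


def query(columns: dict, pred, count: int = 100) -> list:
    """Return up to `count` (name, value) pairs in sorted name order."""
    hits = [(n, v) for n, v in sorted(columns.items()) if matches(pred, n)]
    return hits[:count]


row = {"a": 1, "b": 2, "c": 3, "d": 4}
sliced = query(row, SlicePredicate("b", "c"))
combined = query(row, AndPredicate(SlicePredicate("b", ""), NamePredicate(["a", "d"])))
```

Because each predicate is plain data, a client library could build the tree and hand it to the server in one call, which is the property the proposal relies on.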
 
  On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.com
  wrote:
 
  Very interesting, thanks!
 
  On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote:
   Follow-up from last weeks discussion, 

Re: Is SuperColumn necessary?

2010-05-10 Thread AJ Chen
supercolumn is good for modeling profile type of data. simple example is
blog:
blog { blog {author,  title, ...}
 comments   {time: commenter}  //sort by TimeUUID
}
when retrieving a blog, you get all the comments sorted by time already.
without supercolumn, you would need to concatenate multiple comment times
together as you suggested.

requiring the user to concatenate data fields together is not only an extra
burden on the user but also a less clean design.  there will be cases where
a list property of the profile data is a long list (say a million items). in
such cases, the user wants to be able to directly insert/delete an item in
that list because it's more efficient.  Retrieving the whole list, updating
it, concatenating again, and then putting it back into the datastore is
awkward and less efficient.

-aj


On Mon, May 10, 2010 at 2:20 PM, Mike Malone m...@simplegeo.com wrote:

 On Mon, May 10, 2010 at 1:38 PM, AJ Chen ajc...@web2express.org wrote:

 Could someone confirm this discussion is not about abandoning the supercolumn
 family? I have found that modeling data with supercolumn families is actually
 an advantage of Cassandra compared to relational databases. Hope you are not
 going to drop this important concept.  How it's implemented internally is a
 different matter.


 SuperColumns are useful as a convenience mechanism. That's pretty much it.
 There's _nothing_ (as far as I can tell) that you can do with SuperColumns
 that you can't do by manually concatenating key names with a separator on
 the client side and implementing a custom comparator on the server (as ugly
 as that is).

 This discussion is about getting rid of SuperColumns and adding a more
 generic mechanism that will actually be useful and interesting and will
 continue to be convenient for the types of use cases for which people use
 SuperColumns.

 If there's a particular use case that you feel you can only implement with
 SuperColumns, please share! I honestly can't think of any.

 Mike



Re: Is SuperColumn necessary?

2010-05-10 Thread William Ashley
If you're storing your super column under a fixed name, you could just 
concatenate that name with the row key and use normal columns. Then you get 
your paging and sorting the way you want it.
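
A minimal Python sketch of this suggestion, assuming a ":" separator that never occurs in row keys and using plain dictionaries as a stand-in for the datastore (the helper name and the sample data are hypothetical):

```python
def flatten_fixed_super_column(row_key, super_name, columns, sep=":"):
    """Return (new_row_key, columns) for a super column with a fixed name.

    `sep` is a hypothetical separator; it must not occur in row keys.
    """
    return f"{row_key}{sep}{super_name}", dict(columns)


rows = {}
key, cols = flatten_fixed_super_column(
    "blog1", "comments",
    {"2010-05-10T16:31:00Z": "aj", "2010-05-10T16:39:00Z": "william"},
)
rows[key] = cols

# Columns in the new row can be read back in sorted order and paged through.
page = sorted(rows["blog1:comments"].items())[:1]
```

The trade-off is that the post and its comments now live in separate rows, so fetching both takes two reads.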


On May 10, 2010, at 4:31 PM, AJ Chen wrote:

 supercolumn is good for modeling profile type of data. simple example is blog:
 blog { blog {author,  title, ...}
  comments   {time: commenter}  //sort by TimeUUID
 }
 when retrieving a blog, you get all the comments sorted by time already.
 without supercolumn, you would need to concatenate multiple comment times 
 together as you suggested. 
 
 requiring the user to concatenate data fields together is not only an extra
 burden on the user but also a less clean design.  there will be cases where
 a list property of the profile data is a long list (say a million items). in
 such cases, the user wants to be able to directly insert/delete an item in
 that list because it's more efficient.  Retrieving the whole list, updating
 it, concatenating again, and then putting it back into the datastore is
 awkward and less efficient.
 
 -aj
 
 

Re: Is SuperColumn necessary?

2010-05-10 Thread Mike Malone
On Mon, May 10, 2010 at 4:31 PM, AJ Chen ajc...@web2express.org wrote:

 supercolumn is good for modeling profile type of data. simple example is
 blog:
 blog { blog {author,  title, ...}
  comments   {time: commenter}  //sort by TimeUUID
 }
 when retrieving a blog, you get all the comments sorted by time already.
 without supercolumn, you would need to concatenate multiple comment times
 together as you suggested.

 requiring the user to concatenate data fields together is not only an extra
 burden on the user but also a less clean design.  there will be cases where
 a list property of the profile data is a long list (say a million items). in
 such cases, the user wants to be able to directly insert/delete an item in
 that list because it's more efficient.  Retrieving the whole list, updating
 it, concatenating again, and then putting it back into the datastore is
 awkward and less efficient.


There's nothing you said here that can't be implemented efficiently using
columns. You can slice rows and get a subset of Columns. In fact, this
example is particularly easy to implement. If you have a Blog with Entries
and Comments you'd do:

  <ColumnFamily Name="Blog" CompareWith="UTF8Type" />

  Insert blog post:
    batch_mutate(key=<blog post id>, [{name="~post:author", value=<author>},
                 {name="~post:title", value=<title>}, ...])
  Insert comment:
    batch_mutate(key=<blog post id>, [{name=<TimeUUID> + ":author", ... }])

Then you can get the Post only (slice for ["~", ""]), the comments only
(slice for ["", "~"]), or the post _and_ comments (slice for ["", ""]).
Inserting a comment does _not_ require a get/concatenate/insert.
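
The slicing trick relies on "~" sorting after digits and letters in byte order, so every "~post:" column lands after every comment column. Here is a minimal Python sketch of that ordering, using a dictionary as a stand-in for one row and RFC 3339 timestamps in place of TimeUUIDs. Both substitutions are assumptions made for illustration, and `slice_columns` is a hypothetical helper rather than a Cassandra call.

```python
from bisect import bisect_left, bisect_right


def slice_columns(row, start="", end=""):
    """Return (name, value) pairs with start <= name <= end; "" means unbounded."""
    names = sorted(row)  # code-point order, which matches UTF-8 byte order
    lo = bisect_left(names, start) if start else 0
    hi = bisect_right(names, end) if end else len(names)
    return [(n, row[n]) for n in names[lo:hi]]


row = {
    "~post:author": "mike",
    "~post:title": "Hello",
    "2010-05-10T16:31:00Z:author": "aj",
    "2010-05-10T17:02:00Z:author": "william",
}

post_only = slice_columns(row, start="~")      # everything from "~" onward
comments_only = slice_columns(row, end="~")    # everything before "~"
everything = slice_columns(row)                # open on both ends

# Adding a comment is a single write of new columns; nothing is read back first.
row["2010-05-10T17:25:00Z:author"] = "aj"
```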

Yes, concatenating the names on the client side is hacky, clunky, and
inconvenient. That's why we _should_ build an interface that doesn't require
the client to concatenate names. But SuperColumns aren't the right way to do
it. They add no value. They could be implemented in client libraries, for
example, and nobody would know the difference.

To really understand the problem with SuperColumns, though, you need to look
at the Cassandra source. Removing SuperColumns would make the code-base much
cleaner and tighter, and would probably reduce SLOC by 20%. I think a
replacement that assumed nested Columns (or Entries, or Thingies) would be
much cleaner. That's what Stu is working on.

Mike


Re: Is SuperColumn necessary?

2010-05-10 Thread William Ashley
I'm having a difficult time understanding your syntax. Could you provide an 
example with actual data?

On May 10, 2010, at 5:25 PM, AJ Chen wrote:

 your suggestion works for fixed supercolumn name. the blog example now 
 becomes:
 { blog-id {name, title, ...}
   blog-id-comments {time:commenter}
 }
 
 what about supercolumn names that are not fixed? for example, I want to store 
 comment's details with the blog like this:
 { blog-id { blog { name, title, ...}
   comments {comment-id:commenter}
   comment-id {commenter, time, text, ...}
 }}
 
 a comment-id is generated on-the-fly when the comment is made.  how do you 
 flatten the comment-id supercolumn to normal column?  just for brain 
 exercise, not meant to pick on you.
 
 thanks,
 -aj
   
 
 
 On Mon, May 10, 2010 at 4:39 PM, William Ashley wash...@gmail.com wrote:
 If you're storing your super column under a fixed name, you could just 
 concatenate that name with the row key and use normal columns. Then you get 
 your paging and sorting the way you want it.
 
 

Re: Is SuperColumn necessary?

2010-05-10 Thread AJ Chen
in your implementation, are the comments still sorted by time?  Will UTF8Type
sort TimeUUID:author by time?
thanks,
-aj


Re: Is SuperColumn necessary?

2010-05-10 Thread AJ Chen
{
b1  { blog-id: b1
      author: ba1
      title: bt1
      comment-timeuuid-1: {author: ca1
                           id: comment-timeuuid-1
                           text: text 1}
      comment-timeuuid-2: {author: ca2
                           id: comment-timeuuid-2
                           text: text 2}
    }
}

Mike just suggested concatenating the comment id with each of the comment
field names so that the above data can be stored in a normal column family.
It looks fine, except that I'm not sure whether the time sorting on comments
still works.
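
The flattening described here can be sketched in Python as follows. The `flatten` helper and the ":" separator are hypothetical, and whether the flattened names then sort by time depends entirely on the comparator, which this sketch does not address.

```python
def flatten(row, sep=":"):
    """Flatten one level of nesting: {"c1": {"author": "x"}} -> {"c1:author": "x"}."""
    flat = {}
    for name, value in row.items():
        if isinstance(value, dict):
            for sub_name, sub_value in value.items():
                flat[f"{name}{sep}{sub_name}"] = sub_value
        else:
            flat[name] = value
    return flat


b1 = {
    "blog-id": "b1",
    "author": "ba1",
    "title": "bt1",
    "comment-timeuuid-1": {"author": "ca1", "id": "comment-timeuuid-1", "text": "text 1"},
    "comment-timeuuid-2": {"author": "ca2", "id": "comment-timeuuid-2", "text": "text 2"},
}

flat = flatten(b1)
```

Every comment field becomes its own column, so adding or removing one comment touches only that comment's columns.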

-aj

On Mon, May 10, 2010 at 5:36 PM, William Ashley wash...@gmail.com wrote:

 I'm having a difficult time understanding your syntax. Could you provide an
 example with actual data?


Re: Is SuperColumn necessary?

2010-05-10 Thread Mike Malone

 Mike just suggested concatenating the comment id with each of the comment
 field names so that the above data can be stored in a normal column family.
 It looks fine, except that I'm not sure whether time sorting on comments
 still works.


In the case of time you can just use lexicographically sortable strings that
represent your timestamp (e.g., RFC 3339). You're right, I don't think
TimeUUID does that. For more complicated things (e.g., TimeUUIDs or packed
numerics that you don't want to zero pad) you'd have to implement a custom
comparator. So the convenience mechanisms that would have to be
implemented (and, in fact, Stu and Ed have pretty much already implemented)
would take care of concatenating the column names and doing the chained
comparisons for you.
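
A minimal sketch of the timestamp idea, assuming UTC timestamps formatted to a fixed width and a "|" separator (both are illustrative choices, not anything Cassandra prescribes). The helper names are hypothetical. Because every timestamp has the same length, a plain string comparator orders the concatenated names by time first and by field name second:

```python
from datetime import datetime, timezone

SEP = "|"  # hypothetical separator; any character absent from the parts works


def comment_column(ts: datetime, field: str) -> str:
    """Build a column name of the form <RFC 3339 UTC timestamp>|<field>."""
    stamp = ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{stamp}{SEP}{field}"


def example_columns() -> list:
    later = datetime(2010, 5, 10, 17, 25, 0, tzinfo=timezone.utc)
    earlier = datetime(2010, 5, 9, 8, 0, 0, tzinfo=timezone.utc)
    names = [
        comment_column(later, "text"),
        comment_column(earlier, "text"),
        comment_column(later, "commenter"),
        comment_column(earlier, "commenter"),
    ]
    # Plain string ordering is enough: fixed-width timestamps sort
    # chronologically, then fields sort within each timestamp.
    return sorted(names)
```

This only holds while all timestamps share one offset and one precision; mixed offsets or variable-length fractions would need the custom comparator mentioned above.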

Mike




 On Mon, May 10, 2010 at 5:36 PM, William Ashley wash...@gmail.com wrote:

 I'm having a difficult time understanding your syntax. Could you provide
 an example with actual data?

 On May 10, 2010, at 5:25 PM, AJ Chen wrote:

 your suggestion works for fixed supercolumn name. the blog example now
 becomes:
 { blog-id {name, title, ...}
   blog-id-comments {time:commenter}
 }

 what about supercolumn names that are not fixed? for example, I want to
 store comment's details with the blog like this:
 { blog-id { blog { name, title, ...}
   comments {comment-id:commenter}
   comment-id {commenter, time, text, ...}
 }

 a comment-id is generated on-the-fly when the comment is made.  how do you
 flatten the comment-id supercolumn to normal column?  just for brain
 exercise, not meant to pick on you.

 thanks,
 -aj



 On Mon, May 10, 2010 at 4:39 PM, William Ashley wash...@gmail.com wrote:

 If you're storing your super column under a fixed name, you could just
 concatenate that name with the row key and use normal columns. Then you get
 your paging and sorting the way you want it.


 On May 10, 2010, at 4:31 PM, AJ Chen wrote:

 supercolumn is good for modeling profile type of data. simple example is
 blog:
 blog { blog {author,  title, ...}
  comments   {time: commenter}  //sort by TimeUUID
 }
 when retrieving a blog, you get all the comments sorted by time already.
 without supercolumn, you would need to concatenate multiple comment times
 together as you suggested.

 requiring users to concatenate data fields together is not only an extra
 burden on the user but also a less clean design.  there will be cases where the
 list property of a profile data is a long list (say a million items). in
 such cases, user wants to be able to directly insert/delete an item in that
 list because it's more efficient.  Retrieving the whole list, updating it,
 concatenating again, and then putting it back to datastore is awkward and
 less efficient.

 -aj



Re: Is SuperColumn necessary?

2010-05-09 Thread David Boxenhorn
Guys, this is beginning to sound like MUMPS!
http://en.wikipedia.org/wiki/MUMPS

In MUMPS, all variables are sparse, multidimensional arrays, which can be
stored to disk.

It is an arcane, and archaic, language (does anyone but me remember it?),
but it has been used successfully for years. Maybe we can learn something
from it.

I like the terminology of sparse multidimensional arrays very much - it
really clarifies my thinking. A column family would just be a variable.
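
To make the "column family as a sparse multidimensional array variable" picture concrete, here is a small sketch in Python rather than MUMPS. The `sparse_array` helper and the `users` data are hypothetical; the point is only that levels come into existence when a value is assigned, and nothing is stored for coordinates that were never written:

```python
from collections import defaultdict


def sparse_array():
    """Return a dict that creates nested levels on first access."""
    return defaultdict(sparse_array)


def to_plain(node):
    """Convert nested defaultdicts to plain dicts for display."""
    if isinstance(node, defaultdict):
        return {k: to_plain(v) for k, v in node.items()}
    return node


# Hypothetical column family "users" treated as one sparse array variable.
users = sparse_array()
users["alice"]["profile"]["email"] = "alice@example.com"
users["alice"]["friends"]["bob"] = 1
users["bob"]["profile"]["email"] = "bob@example.com"
```

Note that the rows need not have the same shape: "bob" has no "friends" dimension at all.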

On Fri, May 7, 2010 at 7:06 PM, Ed Anuff e...@anuff.com wrote:

 On Thu, May 6, 2010 at 11:10 PM, Mike Malone m...@simplegeo.com wrote:


 The upshot is, the Cassandra data model would go from being "it's a nested
 dictionary, just kidding no it's not!" to being "it's a nested dictionary,
 for serious." Again, these are all just ideas... but I think this simplified
 data model would allow you to express pretty much any query in a graph of
 simple primitives like Predicates, Filters, Aggregations, Transformations,
 etc. The indexes would allow you to cheat when evaluating certain types of
 queries - if you get a SlicePredicate on an indexed thingy you don't have
 to enumerate the entire set of sub-thingies for example.


 This would be my dream implementation. I'm working on an application that
 needs that sort of capability.  SuperColumns lead you to thinking that
 should be done in the cassandra tier but then fall short, so my thought was
 that I was just going to do everything that was in Cassandra as regular
 columnfamilies and columns using composite keys and composite column names
 ala the code I shared above, and then implement the n-level hierarchy in the
 app tier.  It looks like your suggestion is to take it in the other
 direction and make it part of the fundamental data model, which would be
 very useful if it could be made to work without big tradeoffs.





Re: Is SuperColumn necessary?

2010-05-09 Thread Jonathan Shook
I'm not sure this is much of an improvement. It does illustrate,
however, the desire to couch the concepts in terms that each is
already comfortable with. Nearly every set of terms which come from an
existing system will have baggage which doesn't map appropriately. Not
that sparse multidimensional arrays are an unfamiliar construct.
It's more that sparse may or may not apply depending on the part of
your data you are describing. Multidimensional implies uniformity of
structure, which is not to be taken for granted. Arrays are just one
way to think of the structures. They also serve well as maps and sets
(Which can be modeled using arrays as well). There are certain
semantics of sets, lists, and maps which people have wired into their
brains, and reducing it all to arrays is likely to create more
confusion.

I think if we want to borrow terms from another system, it shouldn't
be a computing system, or at least should be so different or
fundamental that the terms have to be re-understood free of baggage.

On Sun, May 9, 2010 at 1:30 AM, David Boxenhorn da...@lookin2.com wrote:
 Guys, this is beginning to sound like MUMPS!
 http://en.wikipedia.org/wiki/MUMPS

 In MUMPS, all variables are sparse, multidimensional arrays, which can be
 stored to disk.

 It is an arcane, and archaic, language (does anyone but me remember it?),
 but it has been used successfully for years. Maybe we can learn something
 from it.

 I like the terminology of sparse multidimensional arrays very much - it
 really clarifies my thinking. A column family would just be a variable.






Re: Is SuperColumn necessary?

2010-05-07 Thread Mike Malone
On Thu, May 6, 2010 at 5:38 PM, Vijay vijay2...@gmail.com wrote:

 I would rather be interested in a tree-type structure where supercolumns can
 contain supercolumns. You don't need to compare all the columns to find a
 set of columns, and it also reduces the bytes transferred for separators, or
 at least the string concatenation (or something like that) needed to generate
 column names on read and write. Data is stored and structured more logically
 this way, and we can also make caching work better by selectively caching the
 tree (user defined, if you will).

 But nothing wrong in supporting both :)


I'm 99% sure we're talking about the same thing and we don't need to support
both. How names/values are separated is pretty irrelevant. It has to happen
somewhere. I agree that it'd be nice if it happened on the server, but doing
it in the client makes it easier to explore ideas.
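
A rough sketch of why the two views can be equivalent, assuming names are kept sorted and the separator (here a hypothetical "\x00") never occurs inside a name. A binary search over flat concatenated names finds the same children as walking one level of a tree, without comparing every column. Function names and data are illustrative only:

```python
from bisect import bisect_left

SEP = "\x00"  # hypothetical separator byte, assumed absent from names


def flat_slice(sorted_names, prefix):
    """Return names under `prefix` from a sorted flat list via binary search."""
    lo = bisect_left(sorted_names, prefix + SEP)
    hi = bisect_left(sorted_names, prefix + chr(ord(SEP) + 1))
    return sorted_names[lo:hi]


def tree_slice(tree, prefix):
    """Return the same names by walking one level of a nested dict."""
    return [prefix + SEP + child for child in sorted(tree.get(prefix, {}))]


tree = {"comments": {"c1": "hi", "c2": "yo"}, "post": {"title": "t"}}
flat = sorted(parent + SEP + child
              for parent, kids in tree.items() for child in kids)
```

The separator bytes do cost space on the wire and on disk, which is the overhead Vijay points at; the sketch says nothing about that trade-off.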

On Thu, May 6, 2010 at 5:27 PM, philip andrew philip14...@gmail.com wrote:

 Please create a new term if the existing terms are misleading. If it's
 not a file system, then it's not good to call it a file system.


While it's seriously bikesheddy, I guess you're right.

Let's call them thingies for now, then. So you can have a top-level
thingy and it can have an arbitrarily nested tree of sub-thingies. Each
thingy has a thingy type [1]. You can also tell Cassandra if you want a
particular level of thingy to be indexed. At one (or maybe more) levels
you can tell Cassandra you want your thingies to be split onto separate
nodes in your cluster. At one (or maybe more) levels you could also tell
Cassandra that you want your thingies split into separate files [2].

The upshot is, the Cassandra data model would go from being "it's a nested
dictionary, just kidding no it's not!" to being "it's a nested dictionary,
for serious." Again, these are all just ideas... but I think this simplified
data model would allow you to express pretty much any query in a graph of
simple primitives like Predicates, Filters, Aggregations, Transformations,
etc. The indexes would allow you to cheat when evaluating certain types of
queries - if you get a SlicePredicate on an indexed thingy you don't have
to enumerate the entire set of sub-thingies for example.

So, you'd query your thingies by building out a predicate,
transformations, filters, etc., serializing the graph of primitives, and
sending it over the wire to Cassandra. Cassandra would rebuild the graph and
run it over your dataset.

So instead of:

  Cassandra.get_range_slices(
    keyspace='AwesomeApp',
    column_parent=ColumnParent(column_family='user'),
    slice_predicate=SlicePredicate(column_names=['username', 'dob']),
    range=KeyRange(start_key='a', end_key='m'),
    consistency_level=ONE
  )

You'd do something like:

  Cassandra.query(
    SubThingyTransformer(
      NamePredicate(names=['AwesomeApp']),
      SubThingyTransformer(
        NamePredicate(names=['user']),
        SubThingyTransformer(
          SlicePredicate(start='a', end='m'),
          NamePredicate(names=['username', 'dob'])
        )
      )
    ),
    consistency_level=ONE
  )

Which seems complicated, but it's basically just [(user['username'],
user['dob']) for user in Cassandra['AwesomeApp']['user'].slice('a', 'm')]
and could probably be expressed that way in a client library.
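
For illustration, here is a toy in-memory stand-in that makes that comprehension runnable. The `Thingy` class, its `slice` method and the sample users are all hypothetical; a real client would build and send the predicate graph instead of holding data locally, and the ordered slice assumes keys are stored in sorted order:

```python
from bisect import bisect_left, bisect_right


class Thingy(dict):
    """Hypothetical in-memory stand-in for a nested, ordered 'thingy'."""

    def slice(self, start, end):
        """Return child thingies whose names fall in [start, end], in order."""
        names = sorted(self)
        lo = bisect_left(names, start)
        hi = bisect_right(names, end)
        return [self[name] for name in names[lo:hi]]


Cassandra = Thingy({
    "AwesomeApp": Thingy({
        "user": Thingy({
            "alice": {"username": "alice", "dob": "1980-01-01"},
            "bob": {"username": "bob", "dob": "1975-06-15"},
            "zed": {"username": "zed", "dob": "1990-12-31"},
        })
    })
})

result = [(user["username"], user["dob"])
          for user in Cassandra["AwesomeApp"]["user"].slice("a", "m")]
```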

I think batch_mutate is awesome the way it is and should be the only way to
insert/update data. I'd rename it mutate. So our interface becomes:

  Cassandra.query(query, consistency_level)
  Cassandra.mutate(mutation, consistency_level)

Ta-da.

Anyways, I was trying to avoid writing all of this out in prose and to try
mocking some of it up in code instead. I guess this works too. Either
way, I do think something like this would simplify the codebase, simplify
the data model, simplify the interface, make the entire system more
flexible, and be generally awesome.

Mike

[1] These can be subclasses of Thingy in Java... or maybe they'd implement
IThingy. But either way they'd handle serialization and probably implement
compareTo to define natural ordering. So you'd have classes like
ASCIIThingy, UTF8Thingy, and LongThingy (ahem) - these would replace
comparators.
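
A sketch of footnote [1] in Python instead of Java, so `compareTo` becomes `__lt__`/`__eq__`. The class names follow the footnote, but the `sort_key` hook and the 8-byte big-endian encoding for longs are assumptions made for the example:

```python
from functools import total_ordering


@total_ordering
class Thingy:
    """Hypothetical base type: wraps raw bytes and defines natural ordering."""

    def __init__(self, raw: bytes):
        self.raw = raw

    def sort_key(self):
        return self.raw  # default: plain byte order

    def __eq__(self, other):
        return self.sort_key() == other.sort_key()

    def __lt__(self, other):
        return self.sort_key() < other.sort_key()


class UTF8Thingy(Thingy):
    def sort_key(self):
        return self.raw.decode("utf-8")


class LongThingy(Thingy):
    def sort_key(self):
        # Interpret the 8 raw bytes as a signed big-endian integer.
        return int.from_bytes(self.raw, "big", signed=True)


def long_bytes(n: int) -> bytes:
    """Encode an integer the way LongThingy expects to read it."""
    return n.to_bytes(8, "big", signed=True)
```

Sorting `LongThingy` values puts -1 before 2 and 10, whereas sorting the raw bytes would put -1 last; that difference is what the type (formerly the comparator) is there to capture.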

[2] I think there's another simplification here. Splitting into separate
files is really very similar to splitting onto separate nodes. There might
be a way around some of the row size limitations with this sort of concept.
And we may be able to get better utilization of multiple disks by giving
each disk (or data directory) a subset of the node's token range. Caveat:
thought not fully baked.
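
One way the per-disk idea in [2] could look, purely as a sketch: split the node's token range evenly and route each token to a data directory by binary search. The even split, the integer tokens and the directory names are all assumptions; nothing here reflects how Cassandra actually lays out files:

```python
from bisect import bisect_right


def split_points(range_start: int, range_end: int, n_dirs: int) -> list:
    """Evenly divide [range_start, range_end) into n_dirs sub-ranges."""
    width = (range_end - range_start) // n_dirs
    return [range_start + width * i for i in range(1, n_dirs)]


def directory_for(token: int, points: list, dirs: list) -> str:
    """Pick the data directory whose sub-range contains `token`."""
    return dirs[bisect_right(points, token)]


# Hypothetical node owning tokens 0..399 with four data directories.
dirs = ["/data1", "/data2", "/data3", "/data4"]
points = split_points(0, 400, len(dirs))
```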


Re: Is SuperColumn necessary?

2010-05-07 Thread Eric Evans
On Wed, 2010-05-05 at 11:31 -0700, Ed Anuff wrote:
 Follow-up from last week's discussion, I've been playing around with a
 simple column comparator for composite column names that I put up on
 github. I'd be interested to hear what people think of this approach.
 
 http://github.com/edanuff/CassandraCompositeType 

Clever. I wonder what a useful abstraction in Hector or one of the other
idiomatic clients would look like.

-- 
Eric Evans
eev...@rackspace.com



Re: Is SuperColumn necessary?

2010-05-06 Thread David Boxenhorn
That would be a good time to get rid of the confusing "column" term, which
incorrectly suggests a two-dimensional tabular structure.

Suggestions:

1. A hypercube (or hypocube, if only two dimensions): replace "key" and
"column" with "1st dimension", "2nd dimension", etc.

2. A file system: replace "key" and "column" with "directory" and
"subdirectory"

3. A tuple tree: "Column family" replaced by top-level tuple, whose value is
the set of keys, whose value is the set of supercolumns of the key, whose
value is the set of columns for the supercolumn, etc.

4. Etc.
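
As a small illustration of how suggestions 1 and 3 describe the same data, the sketch below flattens a nested tree into coordinate tuples, one per leaf value. The `flatten` helper and the sample blog data are hypothetical:

```python
def flatten(tree, path=()):
    """Yield (coordinates, value) pairs for every leaf of a nested dict."""
    for name, child in sorted(tree.items()):
        if isinstance(child, dict):
            yield from flatten(child, path + (name,))
        else:
            yield path + (name,), child


# Hypothetical data: key -> supercolumn -> column -> value.
column_family = {
    "blog-1": {
        "post": {"title": "Hello", "author": "aj"},
        "comments": {"c1": "First!"},
    },
}

cells = dict(flatten(column_family))
```

Each key of `cells` is a point such as ("blog-1", "post", "title"): a path in the tuple tree and, equally, a coordinate in the (sparse) hypercube.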

On Thu, May 6, 2010 at 2:28 AM, Mike Malone m...@simplegeo.com wrote:

 Nice, Ed, we're doing something very similar but less generic.

 Now replace all of the various methods for querying with a simple query
 interface that takes a Predicate, allow the user to specify (in
 storage-conf) which levels of the nested Columns should be indexed, and
 completely remove Comparators and have people subclass Column / implement
 IColumn and we'd really be on to something ;).

 Mock storage-conf.xml:
   <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
     <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
       <Column Name="ThingThatsNowSuperColumnName" Type="Long">
         <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
           <Column Name="ThingThatCantCurrentlyBeRepresented"/>
         </Column>
       </Column>
     </Column>
   </Column>

 Thrift:
   struct NamePredicate {
     1: required list<binary> column_names,
   }
   struct SlicePredicate {
     1: required binary start,
     2: required binary end,
   }
   struct CountPredicate {
     1: required struct predicate,
     2: required i32 count=100,
   }
   struct AndPredicate {
     1: required Predicate left,
     2: required Predicate right,
   }
   struct SubColumnsPredicate {
     1: required Predicate columns,
     2: required Predicate subcolumns,
   }
   ... OrPredicate, OtherUsefulPredicates ...
   query(predicate, count, consistency_level) # Count here would be total
   count of leaf values returned, whereas CountPredicate specifies a column
   count for a particular sub-slice.

 Not fully baked... but I think this could really simplify stuff and make it
 more flexible. Downside is it may give people enough rope to hang
 themselves, but at least the predicate stuff is easily distributable.
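
 To show how such predicates might compose, here is a toy evaluator over a single level of columns. The class names mirror the mock Thrift structs, but the `matches` method, the inclusive slice bounds and the in-memory `query` function are assumptions made for the sketch; consistency level and nested sub-columns are left out:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NamePredicate:
    column_names: List[bytes]

    def matches(self, name: bytes) -> bool:
        return name in self.column_names


@dataclass
class SlicePredicate:
    start: bytes
    end: bytes

    def matches(self, name: bytes) -> bool:
        # Bounds are treated as inclusive here; that is an assumption.
        return self.start <= name <= self.end


@dataclass
class AndPredicate:
    left: object
    right: object

    def matches(self, name: bytes) -> bool:
        return self.left.matches(name) and self.right.matches(name)


def query(columns: dict, predicate, count: int) -> list:
    """Return up to `count` (name, value) pairs, in name order, that match."""
    hits = [(n, v) for n, v in sorted(columns.items()) if predicate.matches(n)]
    return hits[:count]
```

 Because every predicate exposes the same `matches` method, an OrPredicate or other additions would slot in without touching `query`.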

 I'm thinking I'll play around with implementing some of this stuff myself
 if I have any free time in the near future.

 Mike







Re: Is SuperColumn necessary?

2010-05-06 Thread Torsten Curdt
+1 on all of that

On Thu, May 6, 2010 at 09:04, David Boxenhorn da...@lookin2.com wrote:
 That would be a good time to get rid of the confusing column term, which
 incorrectly suggests a two-dimensional tabular structure.

 Suggestions:

 1. A hypercube (or hypocube, if only two dimensions): replace key and
 column with 1st dimension, 2nd dimension, etc.

 2. A file system: replace key and column with directory and
 subdirectory

 3. A tuple tree: Column family replaced by top-level tuple, whose value is
 the set of keys, whose value is the set of supercolumns of the key, whose
 value is the set of columns for the supercolumn, etc.

 4. Etc.






Re: Is SuperColumn necessary?

2010-05-06 Thread philip andrew
Please create a new term if the existing terms are misleading. If it's
not a file system, then it's not good to call it a file system.

On Thu, May 6, 2010 at 3:50 PM, Torsten Curdt tcu...@vafer.org wrote:

 +1 on all of that

 On Thu, May 6, 2010 at 09:04, David Boxenhorn da...@lookin2.com wrote:
  That would be a good time to get rid of the confusing column term,
 which
  incorrectly suggests a two-dimensional tabular structure.
 
  Suggestions:
 
  1. A hypercube (or hypocube, if only two dimensions): replace key and
  column with 1st dimension, 2nd dimension, etc.
 
  2. A file system: replace key and column with directory and
  subdirectory
 
  3. A tuple tree: Column family replaced by top-level tuple, whose value
 is
  the set of keys, whose value is the set of supercolumns of the key, whose
  value is the set of columns for the supercolumn, etc.
 
  4. Etc.
 



Re: Is SuperColumn necessary?

2010-05-06 Thread Vijay
I would rather be interested in a tree-type structure where supercolumns can
contain supercolumns. You don't need to compare all the columns to find a
set of columns, and it also reduces the bytes transferred for separators, or
at least the string concatenation (or something like that) needed to generate
column names on read and write. Data is stored and structured more logically
this way, and we can also make caching work better by selectively caching the
tree (user defined, if you will).

But nothing wrong in supporting both :)

Regards,
/VJ



On Wed, May 5, 2010 at 11:31 AM, Ed Anuff e...@anuff.com wrote:

 Follow-up from last week's discussion, I've been playing around with a
 simple column comparator for composite column names that I put up on
 github.  I'd be interested to hear what people think of this approach.

 http://github.com/edanuff/CassandraCompositeType

 Ed

 On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote:

 It might make sense to create a CompositeType subclass of AbstractType for
 the purpose of constructing and comparing these types of composite column
 names so that if you could more easily do that sort of thing rather than
 having to concatenate into one big string.


 On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com wrote:

 The only thing SuperColumns appear to buy you (as someone pointed out to
 me at the Cassandra meetup - I think it was Eric Florenzano) is that you can
 use different comparator types for the Super/SubColumns, I guess..? But you
 should be able to do the same thing by creating your own Column comparator.
 I guess my point is that SuperColumns are mostly a convenience mechanism, as
 far as I can tell.

 Mike






Re: Is SuperColumn necessary?

2010-05-05 Thread Ed Anuff
Follow-up from last week's discussion, I've been playing around with a simple
column comparator for composite column names that I put up on github.  I'd
be interested to hear what people think of this approach.

http://github.com/edanuff/CassandraCompositeType

Ed

On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote:

 It might make sense to create a CompositeType subclass of AbstractType for
 the purpose of constructing and comparing these types of composite column
 names, so that you could more easily do that sort of thing rather than
 having to concatenate into one big string.


 On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com wrote:

 The only thing SuperColumns appear to buy you (as someone pointed out to
 me at the Cassandra meetup - I think it was Eric Florenzano) is that you can
 use different comparator types for the Super/SubColumns, I guess..? But you
 should be able to do the same thing by creating your own Column comparator.
 I guess my point is that SuperColumns are mostly a convenience mechanism, as
 far as I can tell.

 Mike





Re: Is SuperColumn necessary?

2010-05-05 Thread Stu Hood
Hey Ed,

I've been working on a similar approach for arbitrarily nested/compound column
names in #998. See:
http://github.com/stuhood/cassandra/blob/998/src/java/org/apache/cassandra/db/ColumnKey.java

The goal is to provide native support and potentially (in the very long term), 
API support for nested/compound names. The difference between our approaches 
boils down to needing to define a comparator for every level in #998, versus 
having dynamic types per name in your approach.
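
The distinction can be sketched roughly as follows. This is not the code from #998 or from CassandraCompositeType; the function names, the tuple representation and the single-character type tags are invented for the example. In the first style the schema fixes one comparison per level; in the second each component carries its own type tag:

```python
from functools import cmp_to_key


def cmp(a, b):
    """Three-way comparison: negative, zero or positive."""
    return (a > b) - (a < b)


def per_level_comparator(level_keys):
    """Compare names level by level, using one fixed key function per level."""
    def compare(left, right):
        for key, a, b in zip(level_keys, left, right):
            result = cmp(key(a), key(b))
            if result:
                return result
        return cmp(len(left), len(right))  # shorter names sort first
    return compare


def dynamic_compare(left, right):
    """Compare names whose components carry their own type tag."""
    for (tag_a, a), (tag_b, b) in zip(left, right):
        # Different types order by tag; equal types order by value.
        result = cmp(tag_a, tag_b) or cmp(a, b)
        if result:
            return result
    return cmp(len(left), len(right))


fixed = per_level_comparator([str, int])
by_level = sorted([("b", 2), ("a", 10), ("a", 9)], key=cmp_to_key(fixed))

tagged = [(("s", "x"), ("i", 2)), (("i", 5),), (("s", "x"), ("i", 1))]
by_tag = sorted(tagged, key=cmp_to_key(dynamic_compare))
```

The per-level version needs the list of comparators up front, while the tagged version lets names of different shapes coexist in one row at the cost of storing a tag with every component.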

Thanks,
Stu









Re: Is SuperColumn necessary?

2010-05-05 Thread Jonathan Ellis
Very interesting, thanks!

On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote:
 Follow-up from last weeks discussion, I've been playing around with a simple
 column comparator for composite column names that I put up on github.  I'd
 be interested to hear what people think of this approach.

 http://github.com/edanuff/CassandraCompositeType

 Ed

 On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote:

 It might make sense to create a CompositeType subclass of AbstractType for
 the purpose of constructing and comparing these types of composite column
 names so that if you could more easily do that sort of thing rather than
 having to concatenate into one big string.

 On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com wrote:

 The only thing SuperColumns appear to buy you (as someone pointed out to
 me at the Cassandra meetup - I think it was Eric Florenzano) is that you can
 use different comparator types for the Super/SubColumns, I guess..? But you
 should be able to do the same thing by creating your own Column comparator.
 I guess my point is that SuperColumns are mostly a convenience mechanism, as
 far as I can tell.
 Mike





-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Is SuperColumn necessary?

2010-05-05 Thread Mike Malone
Nice, Ed, we're doing something very similar but less generic.

Now replace all of the various methods for querying with a simple query
interface that takes a Predicate, allow the user to specify (in
storage-conf) which levels of the nested Columns should be indexed, and
completely remove Comparators and have people subclass Column / implement
IColumn and we'd really be on to something ;).

Mock storage-conf.xml:
  <Column Name="ThingThatsNowKey" Indexed="True" ClusterPartitioned="True" Type="UTF8">
    <Column Name="ThingThatsNowColumnFamily" DiskPartitioned="True" Type="UTF8">
      <Column Name="ThingThatsNowSuperColumnName" Type="Long">
        <Column Name="ThingThatsNowColumnName" Indexed="True" Type="ASCII">
          <Column Name="ThingThatCantCurrentlyBeRepresented"/>
        </Column>
      </Column>
    </Column>
  </Column>

Thrift:
  struct NamePredicate {
    1: required list<binary> column_names,
  }
  struct SlicePredicate {
    1: required binary start,
    2: required binary end,
  }
  struct CountPredicate {
    1: required struct predicate,
    2: required i32 count=100,
  }
  struct AndPredicate {
    1: required Predicate left,
    2: required Predicate right,
  }
  struct SubColumnsPredicate {
    1: required Predicate columns,
    2: required Predicate subcolumns,
  }
  ... OrPredicate, OtherUsefulPredicates ...
  query(predicate, count, consistency_level) # Count here would be total
  # count of leaf values returned, whereas CountPredicate specifies a column
  # count for a particular sub-slice.
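
As a rough illustration of how such predicates might compose, here is a small
Python model evaluated over one row's sorted column names. It is not Thrift and
not an existing Cassandra API; the class names mirror the mock structs above,
and the evaluation rules (inclusive slice bounds, AND as intersection, a
select method on each predicate) are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NamePredicate:
    column_names: List[bytes]
    def select(self, names):
        wanted = set(self.column_names)
        return [n for n in names if n in wanted]

@dataclass
class SlicePredicate:
    start: bytes
    end: bytes
    def select(self, names):
        # Assumed: both bounds are inclusive.
        return [n for n in names if self.start <= n <= self.end]

@dataclass
class CountPredicate:
    predicate: object
    count: int = 100
    def select(self, names):
        # Cap the number of columns returned by the wrapped predicate.
        return self.predicate.select(names)[: self.count]

@dataclass
class AndPredicate:
    left: object
    right: object
    def select(self, names):
        # Assumed: AND keeps names matched by both sides, in row order.
        right = set(self.right.select(names))
        return [n for n in self.left.select(names) if n in right]

row = sorted([b"a", b"b", b"c", b"d", b"e"])
query = CountPredicate(
    AndPredicate(SlicePredicate(b"b", b"e"), NamePredicate([b"a", b"c", b"d", b"e"])),
    count=2,
)
result = query.select(row)
print(result)  # [b'c', b'd']
```

Because each predicate is a plain value with no server-side state, a tree like
this could be serialized and shipped to whichever nodes hold the data.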

Not fully baked... but I think this could really simplify stuff and make it
more flexible. Downside is it may give people enough rope to hang
themselves, but at least the predicate stuff is easily distributable.

I'm thinking I'll play around with implementing some of this stuff myself if
I have any free time in the near future.

Mike

On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.com wrote:

 Very interesting, thanks!

 On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote:
  Follow-up from last weeks discussion, I've been playing around with a simple
  column comparator for composite column names that I put up on github.  I'd
  be interested to hear what people think of this approach.
 
  http://github.com/edanuff/CassandraCompositeType
 
  Ed
 
  On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote:
 
  It might make sense to create a CompositeType subclass of AbstractType for
  the purpose of constructing and comparing these types of composite column
  names so that if you could more easily do that sort of thing rather than
  having to concatenate into one big string.
 
  On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com wrote:
 
  The only thing SuperColumns appear to buy you (as someone pointed out to
  me at the Cassandra meetup - I think it was Eric Florenzano) is that you can
  use different comparator types for the Super/SubColumns, I guess..? But you
  should be able to do the same thing by creating your own Column comparator.
  I guess my point is that SuperColumns are mostly a convenience mechanism, as
  far as I can tell.
  Mike
 
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com



Re: Is SuperColumn necessary?

2010-04-28 Thread Schubert Zhang
I don't think a secondary index is necessary in Cassandra core; at least, it
is not urgent.
I think the most urgent improvements to Cassandra right now are:
1. Re-clarify the data model.
2. Re-implement storage and indexing; the current SSTable implementation in
particular is not good.

In fact, the current storage/index implementation is the weakest point.


On Tue, Apr 27, 2010 at 12:11 AM, Jonathan Ellis jbel...@gmail.com wrote:

 I think that once we have built-in indexing (CASSANDRA-749) you can
 make a good case for dropping supercolumns (at least, dropping them
 from the public API and reserving them for internal use).

 On Mon, Apr 26, 2010 at 11:05 AM, Schubert Zhang zson...@gmail.com wrote:
  I don't think the SuperColumn is so necessary.
  I think this level of logic can be leaved to application.
 
  Do you think so?
 
  If SuperColumn is needed,  as
  https://issues.apache.org/jira/browse/CASSANDRA-598, we should build index
  in SuperColumns level and SubColumns level.
  Thus, the levels of index is too many.
 
 



Re: Is SuperColumn necessary?

2010-04-28 Thread Schubert Zhang
I think that, at least for now, we should leave the logic of the current
SuperColumn and additional indexing features to the application layer of
Cassandra core.

On Wed, Apr 28, 2010 at 6:44 PM, Schubert Zhang zson...@gmail.com wrote:

 I don't think secondary index is necessary for cassandra core, at least it
 is not urgent.
 I think currently, the first urgent improvements of cassandra are:
 1. re-clarify the data-model.
 2. re-implement the storage and index, especially the current SSTable
 implement is not good.

 In fact, the current storage/index implement is the most poor point.



 On Tue, Apr 27, 2010 at 12:11 AM, Jonathan Ellis jbel...@gmail.com wrote:

 I think that once we have built-in indexing (CASSANDRA-749) you can
 make a good case for dropping supercolumns (at least, dropping them
 from the public API and reserving them for internal use).

 On Mon, Apr 26, 2010 at 11:05 AM, Schubert Zhang zson...@gmail.com wrote:
  I don't think the SuperColumn is so necessary.
  I think this level of logic can be leaved to application.
 
  Do you think so?
 
  If SuperColumn is needed,  as
  https://issues.apache.org/jira/browse/CASSANDRA-598, we should build index
  in SuperColumns level and SubColumns level.
  Thus, the levels of index is too many.
 
 





Re: Is SuperColumn necessary?

2010-04-28 Thread David Boxenhorn
If I understand correctly, the distinction between supercolumns and
subcolumns is critical to good database design if you want to use random
partitioning: you can do range queries on subcolumns but not on
supercolumns.

Is this correct?

On Mon, Apr 26, 2010 at 7:11 PM, Jonathan Ellis jbel...@gmail.com wrote:

 I think that once we have built-in indexing (CASSANDRA-749) you can
 make a good case for dropping supercolumns (at least, dropping them
 from the public API and reserving them for internal use).

 On Mon, Apr 26, 2010 at 11:05 AM, Schubert Zhang zson...@gmail.com wrote:
  I don't think the SuperColumn is so necessary.
  I think this level of logic can be leaved to application.
 
  Do you think so?
 
  If SuperColumn is needed,  as
  https://issues.apache.org/jira/browse/CASSANDRA-598, we should build index
  in SuperColumns level and SubColumns level.
  Thus, the levels of index is too many.
 
 



Re: Is SuperColumn necessary?

2010-04-28 Thread Mike Malone
On Wed, Apr 28, 2010 at 5:24 AM, David Boxenhorn da...@lookin2.com wrote:

 If I understand correctly, the distinction between supercolumns and
 subcolumns is critical to good database design if you want to use random
 partitioning: you can do range queries on subcolumns but not on
 supercolumns.

 Is this correct?


You can do efficient range queries of normal (not super) columns in a
ColumnFamily. I think SuperColumns are not indexed, so it's less efficient
to slice subcolumns out of a SuperColumn if there are lots of subcolumns.

I agree that SuperColumns are technically unnecessary. There aren't any use
cases I can come up with that a SuperColumn satisfies that normal Columns
can't. You can simulate SuperColumn behavior by concatenating key parts with
a separator and using the concatenated key as your column name, then doing a
slice. So if you had a SuperColumn that stored usernames, and sub-columns
that stored document IDs, you could instead have a normal CF that stores
username:document-id.
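
A minimal Python sketch of that concatenation trick, modelling one row as a
sorted list of column names. The sample names and the slice_by_prefix helper
are hypothetical, and the sketch assumes the separator never appears inside a
username.

```python
from bisect import bisect_left

# Hypothetical column names in one row of a normal column family,
# kept sorted the way a string comparator would keep them.
columns = sorted([
    "alice:doc-1", "alice:doc-2", "alicia:doc-1", "bob:doc-1", "bob:doc-7",
])

def slice_by_prefix(sorted_names, username, sep=":"):
    """Return every column whose name starts with '<username><sep>'."""
    start = username + sep
    # The character right after the separator bounds the prefix range.
    end = username + chr(ord(sep) + 1)
    lo = bisect_left(sorted_names, start)
    hi = bisect_left(sorted_names, end)
    return sorted_names[lo:hi]

alice_docs = slice_by_prefix(columns, "alice")
bob_docs = slice_by_prefix(columns, "bob")
print(alice_docs)  # ['alice:doc-1', 'alice:doc-2']
print(bob_docs)    # ['bob:doc-1', 'bob:doc-7']
```

The slice for "alice" does not pick up "alicia" because the separator is part
of the start bound, which is the role the SuperColumn name played before.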

The only thing SuperColumns appear to buy you (as someone pointed out to me
at the Cassandra meetup - I think it was Eric Florenzano) is that you can
use different comparator types for the Super/SubColumns, I guess..? But you
should be able to do the same thing by creating your own Column comparator.
I guess my point is that SuperColumns are mostly a convenience mechanism, as
far as I can tell.

Mike


Is SuperColumn necessary?

2010-04-26 Thread Schubert Zhang
I don't think the SuperColumn is really necessary.
I think this level of logic can be left to the application.

Do you agree?

If SuperColumn is needed, then as in
https://issues.apache.org/jira/browse/CASSANDRA-598 we should build indexes
at both the SuperColumn level and the SubColumn level.
That makes too many levels of index.


Re: Is SuperColumn necessary?

2010-04-26 Thread Jonathan Ellis
I think that once we have built-in indexing (CASSANDRA-749) you can
make a good case for dropping supercolumns (at least, dropping them
from the public API and reserving them for internal use).

On Mon, Apr 26, 2010 at 11:05 AM, Schubert Zhang zson...@gmail.com wrote:
 I don't think the SuperColumn is so necessary.
 I think this level of logic can be leaved to application.

 Do you think so?

 If SuperColumn is needed,  as
 https://issues.apache.org/jira/browse/CASSANDRA-598, we should build index
 in SuperColumns level and SubColumns level.
 Thus, the levels of index is too many.