The composite-key approach with counters would work very well in this case. It would also obviate the concern of not knowing the exact column names a priori... although for efficiency, you might want to look at maintaining a secondary cache-like CF for lookups.
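The composite-key-plus-counters layout suggested here can be sketched as a small in-memory model. This is only an illustration of the row/column shape (one row per Z value, a composite "attribute:value" counter column per observed pair), not real Cassandra client code; the class and method names are invented for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory model of the suggested layout: one row per Z value,
// composite counter columns named "<attribute>:<value>", incremented per event.
// In Cassandra this would be a counter CF and increment() would be an add().
public class CompositeCounterSketch {
    // rowKey (Z value) -> composite column name -> counter value
    private final Map<String, Map<String, Long>> rows = new HashMap<>();

    // Build the composite column name from an attribute and its value.
    static String compositeColumn(String attr, String value) {
        return attr + ":" + value;
    }

    // Stand-in for a counter-column increment; no need to know column
    // names a priori -- they are derived from the data as it arrives.
    public void increment(String zValue, String attr, String value) {
        rows.computeIfAbsent(zValue, k -> new HashMap<>())
            .merge(compositeColumn(attr, value), 1L, Long::sum);
    }

    public long count(String zValue, String attr, String value) {
        return rows.getOrDefault(zValue, Map.of())
                   .getOrDefault(compositeColumn(attr, value), 0L);
    }

    public static void main(String[] args) {
        CompositeCounterSketch s = new CompositeCounterSketch();
        s.increment("z1", "color", "red");
        s.increment("z1", "color", "red");
        s.increment("z1", "size", "large");
        System.out.println(s.count("z1", "color", "red"));  // 2
        System.out.println(s.count("z1", "size", "large")); // 1
    }
}
```

The secondary cache-like CF mentioned above would simply index which composite column names exist per row, so lookups don't have to scan.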
Depending on your data patterns (so as not to hit the 2B-column limit) and your actual queries, you could store each Z as one row, with a composite key of Z-value + X:X-value, and then use counter columns. Other optimizations may be possible.

If you're using AOP, as I read it, there's really no need to intercept your own writes at the C* level; instead do it (use AOP) at the client level. Your migration also needs to be attended to, and might need an MR job first plus AOP-intercepted writes.

Hth,
Milind

/***********************
sent from my android... please pardon occasional typos as I respond @ the speed of thought
************************/

On Jan 22, 2012 4:42 AM, "Brian O'Neill" <boneil...@gmail.com> wrote:

Thanks for all the ideas...

Since we can't predict all the values, we actually cut over to Oracle today via a map/reduce job. Oracle is able to support all the ad hoc queries the users want (via indexes), but the extract job takes a long time (hours). The users need more "real-time", which is driving us to look at other alternatives, or better extract methods (HDFS -> BulkLoad, JDBC, etc.).

We also have SOLR in place, which is indexing all the information. That can satisfy 40-50% of the queries, especially with field grouping and some other features available in 4.0:
http://wiki.apache.org/solr/FieldCollapsing

But there are still cases that SOLR can't handle, because it has a flat document structure and we need to query on multiple dimensions.

Eric, we were just about to head down the path you suggested when we started seeing how heavy the client-side code was going to get for inserts (something we wanted to keep simple). Also, as I said, we aren't sure what attributes we'll be storing/querying, so some of the queries we'll never be able to accommodate. Regardless, based on your comments, I'm going to take another look at using composite keys and counters.

Another approach may be REAL-TIME data replication...
We started looking at a "real-time" solution that would keep Oracle up to date with Cassandra using triggers. Effectively, we would use Cassandra as our transactional system (OLTP) and leave Oracle in place for OLAP.

Looks like others have looked at exactly this model:
http://maxgrinev.com/2010/07/23/extending-cassandra-with-asynchronous-triggers/

And there's been lots of discussion:
https://issues.apache.org/jira/browse/CASSANDRA-1311

And mention that the crew was going to start working on it after 1.0:
http://www.readwriteweb.com/cloud/2011/10/cassandra-reaches-10-whats-nex.php

But I didn't see anything in trunk, and I didn't get any response from the dev list. Alas, we may pick it up this week and implement it (maybe as part of Virgil <http://code.google.com/a/apache-extras.org/p/virgil/>).

If we use a column family to keep a distributed commit log of mutations, it should be a fairly easy thing to get triggers in place. Really the only question is where to code it. We could implement it in the Cassandra code as a patch, or implement it on top. I think we might be able to do it using AOP, which would allow anyone to get the functionality just by dropping another jar onto the classpath. I'll see what we can come up with.

thanks again,
brian

On Jan 21, 2012, at 8:35 AM, Eric Czech wrote:

> Hi Brian,
>
> We're trying to do the exact same...
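The asynchronous-trigger idea described above (every mutation is recorded in a commit-log column family, and a consumer replays unseen mutations into the downstream OLAP store) can be sketched as follows. All names here are hypothetical, and the in-memory maps merely stand in for Cassandra and Oracle; a real implementation would hook the write path (e.g. via AOP interception, as discussed) rather than calling `write()` directly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the asynchronous-trigger model: every write is appended to a
// commit-log stand-in, and a separate consumer replays the log into the
// OLAP store (Oracle in the thread). All names are hypothetical.
public class TriggerSketch {
    record Mutation(String columnFamily, String rowKey, String column, String value) {}

    private final Map<String, String> oltpStore = new HashMap<>(); // stands in for Cassandra
    private final List<Mutation> commitLog = new ArrayList<>();    // stands in for a commit-log CF
    private int replayed = 0;                                      // consumer's position in the log

    // The intercepted write path: persist the mutation AND record it
    // so triggers can fire asynchronously later.
    public void write(String cf, String row, String col, String value) {
        oltpStore.put(cf + "/" + row + "/" + col, value);
        commitLog.add(new Mutation(cf, row, col, value));
    }

    // Asynchronous trigger: drain mutations not yet seen by this consumer
    // into the downstream store, keeping it eventually consistent.
    public void replayInto(Map<String, String> olapStore) {
        while (replayed < commitLog.size()) {
            Mutation m = commitLog.get(replayed++);
            olapStore.put(m.rowKey() + "." + m.column(), m.value());
        }
    }

    public static void main(String[] args) {
        TriggerSketch t = new TriggerSketch();
        Map<String, String> olap = new HashMap<>();
        t.write("users", "u1", "email", "brian@example.com");
        t.replayInto(olap);
        System.out.println(olap); // {u1.email=brian@example.com}
    }
}
```

Keeping the consumer's replay position separate from the log is what makes the trigger asynchronous: writers never block on the OLAP side, at the cost of a replication lag.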