The composite-key approach with counters would work very well in this case.
It will also obviate the concern of not knowing the exact column names
a priori... although for efficiency, you might want to look at maintaining a
secondary, cache-like CF for lookups....

Depending on your data patterns (so as not to hit the 2-billion-column limit)
and actual queries, you could store each Z as one row, with a composite key of
Z-value + X:X-value, and then use counter columns. Other optimizations may be
possible.
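As I read the layout sketched above, a toy in-memory version (plain Python dicts standing in for a counter CF; the names `record`, `z1`, `color`, etc. are all invented for illustration, not anything from the thread) might look like:

```python
# In-memory sketch of the proposed layout (NOT a real Cassandra client):
# one row per Z-value, composite column names of the form "X:X-value",
# with counter values. Attribute names need not be known a priori.
from collections import defaultdict

# {row_key: {composite_column_name: counter}}
counter_cf = defaultdict(lambda: defaultdict(int))

def record(z_value, attr_name, attr_value):
    """Increment the counter for one (Z, attribute, value) combination."""
    composite = f"{attr_name}:{attr_value}"   # e.g. "color:red"
    counter_cf[z_value][composite] += 1

# Writes arrive with arbitrary, previously unseen attribute names.
record("z1", "color", "red")
record("z1", "color", "red")
record("z1", "size", "large")

# Query: how many "color=red" observations under row z1?
print(counter_cf["z1"]["color:red"])   # 2
```

The point of the composite column name is that a single row read answers "give me all attribute:value counts for this Z" without a secondary index.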

If you're using AOP, as I read it, there's really no need to intercept your
own writes at the C* level; instead, do it (use AOP) at the client level.
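To make the client-level interception idea concrete, here is a minimal sketch using a Python decorator as a stand-in for an AOP around-advice; `raw_insert`, `data_cf`, and `counter_cf` are all hypothetical names, and plain dicts stand in for real column families:

```python
# Sketch: intercept writes at the CLIENT level rather than inside Cassandra.
# A wrapper around the client's write path also updates the counter/index CF.
import functools

data_cf = {}      # stand-in for the primary column family
counter_cf = {}   # stand-in for the secondary counter/lookup CF

def with_counter_index(write_fn):
    """Decorator (AOP-style advice): after each write, bump the counters."""
    @functools.wraps(write_fn)
    def wrapper(row_key, column, value):
        write_fn(row_key, column, value)
        composite = f"{column}:{value}"
        counter_cf[composite] = counter_cf.get(composite, 0) + 1
    return wrapper

@with_counter_index
def raw_insert(row_key, column, value):
    """Stand-in for the real client insert call."""
    data_cf.setdefault(row_key, {})[column] = value

raw_insert("user1", "color", "red")
raw_insert("user2", "color", "red")
print(counter_cf["color:red"])   # 2
```

In Java-land the same shape would be an AspectJ around-advice on the client's mutator method; the advantage over intercepting inside C* is that no server-side patching is required.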

Your migration also needs attention; it might need a MapReduce job first,
plus AOP-intercepted writes.

Hth
Milind

/***********************
sent from my android...please pardon occasional typos as I respond @ the
speed of thought
************************/

On Jan 22, 2012 4:42 AM, "Brian O'Neill" <boneil...@gmail.com> wrote:


Thanks for all the ideas...

Since we can't predict all the values, we currently extract to Oracle via
a map/reduce job.  Oracle is able to support all the ad hoc queries the
users want (via indexes), but the extract job takes a long time (hours).
The users need something more "real-time", which is driving us to look at
other alternatives, or better extract methods. (HDFS -> BulkLoad, JDBC, etc.)

We also have SOLR in place, which is indexing all the information.  That
can satisfy > 40-50% of the queries, especially with FieldCollapsing and
some other features available in 4.0:
http://wiki.apache.org/solr/FieldCollapsing

But there are still cases that SOLR can't handle, because it has a flat
document structure and we need to query on multiple dimensions.

Eric, we were just about to head down the path you suggested, when we
started seeing how heavy the client-side code was going to get for inserts
(something we wanted to keep simple).  Also, as I said, we aren't sure what
attributes we'll be storing/querying, so there are some queries we'll never
be able to accommodate.  Regardless, based on your comments, I'm going to
take another look at using composite keys and counters.

Another approach may be REAL-TIME data replication...

We started looking at a "real-time" solution that would keep Oracle up to
date with Cassandra using Triggers.  Effectively we would use Cassandra as
our transactional system (OLTP) and leave Oracle in place for OLAP.  Looks
like others have looked at exactly this model:
http://maxgrinev.com/2010/07/23/extending-cassandra-with-asynchronous-triggers/

And there's been lots of discussion...
https://issues.apache.org/jira/browse/CASSANDRA-1311

And a mention that the team was going to start working on it after 1.0:
http://www.readwriteweb.com/cloud/2011/10/cassandra-reaches-10-whats-nex.php

But I didn't see anything in trunk, and I didn't get any response from the
dev list.

Alas, we may pick it up this week and implement it. (maybe as part of
Virgil: http://code.google.com/a/apache-extras.org/p/virgil/)

If we use a column family to keep a distributed commit log of mutations, it
should be fairly easy to get triggers in place.  Really the only question
is where to code it.  We could implement it in Cassandra itself as a patch,
or build it on top.  I think we might be able to do it using AOP, which
would let anyone get the functionality just by dropping another jar onto
the classpath.
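A toy sketch of that commit-log-plus-triggers shape (everything here is invented for illustration: a deque standing in for the commit-log CF, a dict standing in for the OLAP copy, and a synchronous `drain()` where the real thing would be an async consumer):

```python
# Toy sketch of async triggers: every mutation is appended to a
# "commit log" column family, and a separate consumer drains it,
# replaying each mutation through registered trigger callbacks.
from collections import deque

commit_log = deque()   # stand-in for the distributed commit-log CF
triggers = []          # callbacks to fire per mutation
replica = {}           # stand-in for the OLAP copy (e.g. Oracle)

def on_mutation(fn):
    """Register a trigger to run for every logged mutation."""
    triggers.append(fn)
    return fn

def write(row_key, column, value):
    """Intercepted write path: log the mutation for async processing."""
    commit_log.append((row_key, column, value))

def drain():
    """Consumer loop: replay logged mutations through every trigger."""
    while commit_log:
        mutation = commit_log.popleft()
        for fn in triggers:
            fn(*mutation)

@on_mutation
def replicate(row_key, column, value):
    replica.setdefault(row_key, {})[column] = value

write("order42", "status", "shipped")
drain()
print(replica["order42"]["status"])   # shipped
```

Keeping the log in a CF (rather than in memory) is what makes the trigger processing durable and restartable; the consumer can track how far it has drained and resume after a crash.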

I'll see what we can come up with.

thanks again,
brian




On Jan 21, 2012, at 8:35 AM, Eric Czech wrote:

> Hi Brian,
>
> We're trying to do the exact same...
