Re: Use Case scenario: Keeping a window of data + online analytics

2010-03-11 Thread Bill Au
Daniel,
 Can you provide more information (an example would be very nice) on
using batch_mutate deletes to build a time-series store in Cassandra?  I
have been reading up on batch_mutate from the Wiki:

http://wiki.apache.org/cassandra/API

It seems to me that since the outer map of mutation_map maps each key to
the inner map, it would only remove old data associated with the keys
provided.  Is it possible to remove old data based on timestamp for all
keys?  Is it also possible to remove old keys when no new data has been
associated with them?
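
To make my question concrete, here is roughly the structure I have in
mind (a sketch only: plain Python dicts standing in for the generated
Thrift structs, and the helper name is my own):

```python
# Sketch of the 0.6 batch_mutate argument shape, with plain dicts in
# place of the Thrift structs: {row_key: {column_family: [Mutation]}}.
# A Mutation carrying a Deletion removes columns; its timestamp is the
# tombstone time, so columns written at or before it become eligible
# for removal.

def purge_mutations(row_keys, column_family, column_names, purge_ts):
    """Build a mutation_map deleting the named columns for every key."""
    deletion = {
        'timestamp': purge_ts,                       # tombstone timestamp (usecs)
        'predicate': {'column_names': column_names}  # which columns to drop
    }
    return {key: {column_family: [{'deletion': deletion}]}
            for key in row_keys}

mm = purge_mutations(['sensor-1', 'sensor-2'], 'TimeSeries',
                     ['2010-03-01', '2010-03-02'], 1268000000000000)
# The outer map is keyed by row key, so every key has to be listed
# explicitly; I don't see a single call that deletes "older than T"
# across all keys at once.
```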

Bill


On Mon, Mar 8, 2010 at 8:44 AM, Daniel Lundin d...@eintr.org wrote:

 A few comments on building a time-series store in Cassandra...

 Reusing columns, and making use of their timestamp dimension, could
 prove quite useful. This allows simple use of batch_mutate deletes (new
 in 0.6) to purge old data outside the active time window.




Use Case scenario: Keeping a window of data + online analytics

2010-03-08 Thread Aníbal Rojas
Hello,

We have been testing alternatives to a MySQL/Postgres-based app with
the following characteristics:

- A high rate of inserts. Heavy bursts are expected.
- A high rate of deletes to remove old data. We keep a window, as
old data is not relevant.
- Online analytics based on _aging_ and other variables, currently
calculated mostly client side, which means a lot of data transfer.
- Cloud-based deployment à la Amazon, with slow EBS disks.

I have been reading the Wiki, blogs, and mailing list discussions
related to provisioning and performance and I would like to know your
opinion in relation to:

- Can we keep this data-window approach, or will a high rate of
deletes pose a problem?
- Also regarding the data window, what about replication?
- Will slow disk performance affect read speed?
- We need read speed; I understand writes won't be a problem, but
there will be a lot of reads, some of them over large sets of values.
- What role does RAM play in Cassandra under this scenario?

Of course we are looking at Cassandra as a possible solution, or
part of one, versus or combined with an in-memory DB.

Thanks in advance for sharing your experience, and opinions.

--
Aníbal Rojas
Ruby on Rails Web Developer
http://www.google.com/profiles/anibalrojas


Re: Use Case scenario: Keeping a window of data + online analytics

2010-03-08 Thread Daniel Lundin
A few comments on building a time-series store in Cassandra...

Reusing columns, and making use of their timestamp dimension, could
prove quite useful. This allows simple use of batch_mutate deletes (new
in 0.6) to purge old data outside the active time window.
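
For instance, naming columns by time bucket makes the purge a straight
batch_mutate (a sketch: the column family name, bucket sizes, and
helpers are illustrative assumptions, and plain dicts stand in for the
Thrift structs):

```python
# Sketch: one row per series, columns named by bucket start time
# (epoch seconds). Purging = batch_mutate with a Deletion listing the
# buckets that have slid out of the window.

def expired_buckets(now, window=3600, bucket=60, lookback=600):
    """Bucket-aligned column names older than `now - window`.

    `lookback` bounds how far past the cutoff we generate names; a
    real purger would remember the oldest live bucket instead."""
    cutoff = now - window
    first = (cutoff - lookback) - (cutoff - lookback) % bucket
    return [str(t) for t in range(first, cutoff - cutoff % bucket, bucket)]

def purge_map(series_ids, now):
    """mutation_map for batch_mutate (dicts in place of Thrift structs)."""
    deletion = {'timestamp': int(now * 1_000_000),  # tombstone time, usecs
                'predicate': {'column_names': expired_buckets(now)}}
    return {sid: {'TimeSeries': [{'deletion': deletion}]}
            for sid in series_ids}
```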

Otherwise, performance-wise, deletes and updates are the same in
Cassandra (see
http://spyced.blogspot.com/2010/02/distributed-deletes-in-cassandra.html).

Data should be spread out over the ring, so load distribution is
constant regardless of time or burst peaks.

A separate location cache, using a counting/timestamped Bloom filter,
might be useful too, depending on your app, data structures, and
throughput requirements. This should be kept outside Cassandra, in
RAM (redis or even memcache would fit nicely, but a simple RPC service
would be faster). Something like that would let you build a tuned
sliding-window cache that keeps reads to a minimum.
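
A minimal counting Bloom filter sketch (sizing is arbitrary here;
counters instead of bits so entries can be removed as the window
slides, at the cost of more memory):

```python
import hashlib

class CountingBloom:
    """Counting Bloom filter: ask it whether a key *might* have data in
    the current window before touching Cassandra. False positives are
    possible; false negatives are not (absent counter underflow)."""

    def __init__(self, size=8192, hashes=4):
        self.size, self.hashes = size, hashes
        self.counts = [0] * size

    def _slots(self, key):
        # Derive `hashes` slot indices per key from salted SHA-1 digests.
        for i in range(self.hashes):
            h = hashlib.sha1(f'{i}:{key}'.encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for s in self._slots(key):
            self.counts[s] += 1

    def remove(self, key):
        # Only safe for keys previously added, else counters underflow.
        for s in self._slots(key):
            if self.counts[s] > 0:
                self.counts[s] -= 1

    def might_contain(self, key):
        return all(self.counts[s] > 0 for s in self._slots(key))

cache = CountingBloom()
cache.add('sensor-42')         # a write lands for this key
cache.might_contain('sensor-42')   # True -> worth reading Cassandra
cache.remove('sensor-42')      # its last column slides out of the window
```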

Rinse, refactor, repeat, until fast enough and/or job is done ...

 - Can we keep this data window approach, or will a high rate of
 delete pose a problem?

Delete and insert are both mutations, so if you can do one, you can do
the other in ~ the same time. IOW, your rate of mutations in a
one-in-one-out scenario is simply 2 * insert-rate.

Due to the nature of deletes, though, you need to plan for storing
deleted data until compaction. The compaction phase itself will
probably need accounting for as well, but that too is predictable.
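
To put rough numbers on that (all figures are assumptions for
illustration; GCGraceSeconds defaulted to ten days in this era):

```python
# Back-of-envelope accounting for a one-in-one-out window.
insert_rate = 5_000            # inserts/sec (assumption)
value_size  = 512              # bytes per column written (assumption)
gc_grace    = 10 * 24 * 3600   # GCGraceSeconds; ~10 days, the 0.6-era default

# Every insert is eventually paired with a delete, so the cluster sees
# roughly double the insert rate in mutations.
mutation_rate = 2 * insert_rate

# Tombstoned data occupies disk until GC grace expires and compaction runs.
tombstoned = insert_rate * value_size * gc_grace

print(f"{mutation_rate} mutations/sec, "
      f"{tombstoned / 2**40:.1f} TiB held awaiting compaction")
```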

 - We need read speed, I understand writes won't be a problem, but
 there will be a lot of reads, some of them with large sets of values.
 - What role plays RAM in Cassandra under this scenario?

0.6 has improved caching for reads, but if your app truly needs high
performance reads, some kind of application-tuned cache frontend (as
mentioned above) is not a bad thing. For sliding-window time series,
it's hard to beat a simple bloom-filter based cache without reaching for
complexity.

 Of course we are looking at Cassandra as a possible solution
 and/or part of the solution, against / or combined with a in memory
 DB.

It's certainly possible to decouple purging from insertion in Cassandra,
but there's no generic "this is how you do it" answer.

This, IMHO, is a good thing though.

/d