Re: Use Case scenario: Keeping a window of data + online analytics
Daniel,

Can you provide more information (an example would be very nice) on using batch_mutate deletes to build a time-series store in Cassandra? I have been reading up on batch_mutate on the wiki: http://wiki.apache.org/cassandra/API

It seems to me that since the outer map of mutation_map maps a row key to the inner map, it would remove old data associated with the keys provided. Is it possible to remove old data based on timestamp for all keys? Is it also possible to remove old keys if no new data has been associated with them?

Bill

On Mon, Mar 8, 2010 at 8:44 AM, Daniel Lundin d...@eintr.org wrote:
> A few comments on building a time-series store in Cassandra...
>
> Using the timestamp dimension of columns, reusing columns, could prove
> quite useful. This allows simple use of batch_mutate deletes (new in
> 0.6) to purge old data outside the active time window.
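For reference, the mutation_map shape being asked about can be sketched as below. This is an illustrative sketch using plain Python dicts rather than the real Thrift structs, and `make_window_deletes` is a hypothetical helper name; the nesting (row key -> column family -> list of mutations, each carrying a deletion with a timestamp) mirrors the 0.6 batch_mutate API, where a deletion removes columns whose timestamps are older than the deletion's timestamp:

```python
def make_window_deletes(keys, column_family, cutoff_ts):
    """Build a mutation_map (sketched as plain dicts) that issues a
    timestamped deletion for every row key in `keys`.

    Columns in those rows with timestamps <= cutoff_ts are purged,
    which is how a sliding time window can be trimmed in one call.
    """
    mutation_map = {}
    for key in keys:
        mutation_map[key] = {
            column_family: [
                # One mutation per row: a deletion with no column
                # predicate, i.e. it applies to the whole row, but only
                # to columns at or before the deletion timestamp.
                {"deletion": {"timestamp": cutoff_ts}}
            ]
        }
    return mutation_map

# Trim two rows of a hypothetical "Readings" column family:
mm = make_window_deletes(["sensor-1", "sensor-2"], "Readings", 1268000000000000)
```

So, to the first question: yes, the timestamp on the deletion is what bounds which data is removed, per key. Removing keys that have received no new data is a separate step, since you must know which keys have gone idle (e.g. from a cache like the one Daniel describes).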
Use Case scenario: Keeping a window of data + online analytics
Hello,

We have been testing alternatives for a MySQL / Postgres based app with the following characteristics:

- A high rate of inserts. Heavy bursts are expected.
- A high rate of deletes to remove old data. We keep a window, as old data is not relevant.
- Online analytics based on _aging_ and other variables, right now mostly calculated client side, which means a lot of data transfer.
- Cloud-based deployment a la Amazon. EBS slow disks.

I have been reading the wiki, blogs, and mailing list discussions related to provisioning and performance, and I would like to know your opinion on the following:

- Can we keep this data window approach, or will a high rate of deletes pose a problem?
- Also regarding the data window, what about replication?
- Will slow disk performance affect read speed? We need read speed; I understand writes won't be a problem, but there will be a lot of reads, some of them with large sets of values.
- What role does RAM play in Cassandra under this scenario?

Of course we are looking at Cassandra as a possible solution and/or part of the solution, against or combined with an in-memory DB.

Thanks in advance for sharing your experience and opinions.

--
Aníbal Rojas
Ruby on Rails Web Developer
http://www.google.com/profiles/anibalrojas
Re: Use Case scenario: Keeping a window of data + online analytics
A few comments on building a time-series store in Cassandra...

Using the timestamp dimension of columns, reusing columns, could prove quite useful. This allows simple use of batch_mutate deletes (new in 0.6) to purge old data outside the active time window. Otherwise, performance-wise, deletes and updates are the same in Cassandra (see http://spyced.blogspot.com/2010/02/distributed-deletes-in-cassandra.html). Data should be spread out over the ring, so load distribution is constant regardless of time or burst peaks.

A separate location cache, using a counting/timestamped bloom filter, might be useful too, depending on your app, data structures, and throughput requirements. This should be kept outside Cassandra and in RAM (redis or even memcache would fit nicely, but a simple RPC service would be faster). Something like that would allow you to build a tuned sliding-window cache to ensure reads are minimized.

Rinse, refactor, repeat, until fast enough and/or the job is done...

> - Can we keep this data window approach, or will a high rate of
>   deletes pose a problem?

Delete and insert are both mutations, so if you can do one, you can do the other in ~ the same time. IOW, your rate of mutations in a one-in-one-out scenario is simply 2 * insert-rate. Due to the nature of deletes, you need to plan for storing deleted data until compaction, though. The compaction phase itself will probably need accounting for, but that too is predictable.

> - We need read speed, I understand writes won't be a problem, but
>   there will be a lot of reads, some of them with large sets of
>   values.
> - What role does RAM play in Cassandra under this scenario?

0.6 has improved caching for reads, but if your app truly needs high-performance reads, some kind of application-tuned cache frontend (as mentioned above) is not a bad thing. For sliding-window time series, it's hard to beat a simple bloom-filter based cache without reaching for complexity.
> Of course we are looking at Cassandra as a possible solution and/or
> part of the solution, against or combined with an in-memory DB.

It's certainly possible to decouple purging from insertion in Cassandra, but there's no generic "this is how you do it" answer. This, IMHO, is a good thing, though.

/d