And DateTieredCompactionStrategy can be used to efficiently remove whole sstables when the TTL expires, but this implies knowing what TTL to set in advance.
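For illustration, the TTL-plus-DateTieredCompactionStrategy approach could look roughly like this against the computation_results table quoted further down the thread. This is a sketch only: the 90-day TTL and the compaction options are assumptions, not a tested configuration.

```sql
-- Sketch only: Jonathan's table, with time-based compaction so that
-- fully expired SSTables can eventually be dropped wholesale.
CREATE TABLE computation_results (
    batch_id int,
    id1 int,
    id2 int,
    value double,
    PRIMARY KEY ((batch_id, id1), id2)
) WITH CLUSTERING ORDER BY (id2 ASC)
  AND compaction = {'class': 'DateTieredCompactionStrategy'}
  AND default_time_to_live = 7776000;  -- 90 days, an assumed retention

-- Alternatively, set the TTL per write instead of per table:
INSERT INTO computation_results (batch_id, id1, id2, value)
VALUES (42, 1, 1, 0.5)
USING TTL 7776000;
```

As noted above, the catch is that the TTL has to be chosen at write time, before you know whether a given batch will ever need early removal.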
I don't know if there are any tools to bulk delete data older than a specific age when DateTieredCompactionStrategy is used, but it might be a nice feature.

-- Jack Krupansky

On Tue, Nov 24, 2015 at 12:53 PM, Saladi Naidu <[email protected]> wrote:

> I can think of the following features to solve this:
>
> 1. If you know the time period after which data should be removed, use
> the TTL feature.
> 2. Model the data as a time series and use an inverted index to query
> the data by time period.
>
> Naidu Saladi
>
> On Tuesday, November 24, 2015 6:49 AM, Jack Krupansky
> <[email protected]> wrote:
>
> How often is sometimes - closer to 20% of the batches or 2%?
>
> How are you querying batches, both current and older ones? As always,
> your queries should drive your data models.
>
> If deleting a batch is very infrequent, it may be best not to do it at
> all and simply have logic in the app to ignore deleted batches - if your
> queries would reference them at all.
>
> What reasons would you have to delete a batch? Depending on the nature
> of the reason, there may be an alternative.
>
> Make sure your cluster is adequately provisioned so that these expensive
> operations can occur in parallel, to reduce their time and resources per
> node.
>
> Do all batches eventually get aged out and deleted, or are you expecting
> that most batches will live for many years to come? Have you planned for
> how you will grow the cluster over time?
>
> Maybe bite the bullet and use a background process to delete a batch if
> deletion is competing too heavily with query access - if batches really
> need to be deleted at all.
>
> The number of keyspaces - and/or tables - should be limited to the "low
> hundreds", and even then you are limited by the RAM and CPU of each
> node. If a keyspace has 14 tables, then 250/14 ≈ 18 would be a
> recommended upper limit for the number of keyspaces.
> Even if your total number of tables was under 300, or even 200, you
> would need to do a proof-of-concept implementation to verify that your
> specific data works well on your specific hardware.
>
> -- Jack Krupansky
>
> On Tue, Nov 24, 2015 at 5:05 AM, Jonathan Ballet <[email protected]>
> wrote:
>
> Hi,
>
> we are running an application which produces a batch with several
> hundred gigabytes of data every night. Once a batch has been computed,
> it is never modified (no updates, no deletes); we just keep producing
> new batches every day.
>
> Now, we are *sometimes* interested in removing a complete specific batch
> altogether. At the moment, we are accumulating all these data into a
> single keyspace, with a batch ID column in all our tables which is also
> part of the primary key. A sample table looks similar to this:
>
> CREATE TABLE computation_results (
>     batch_id int,
>     id1 int,
>     id2 int,
>     value double,
>     PRIMARY KEY ((batch_id, id1), id2)
> ) WITH CLUSTERING ORDER BY (id2 ASC);
>
> But we found out it is very difficult to remove a specific batch, as we
> need to know all the IDs to delete the entries, and it's both time- and
> resource-consuming (i.e. it takes a long time, and I'm not sure it's
> going to scale at all).
>
> So, we are currently looking into having each of our batches in a
> keyspace of its own, so that removing a batch is merely equivalent to
> deleting a keyspace. Potentially, this means we will end up having
> several hundred keyspaces in one cluster, although most of the time only
> the very latest one will be used (we might still want to access the
> older ones, but that would be a very seldom use case). At the moment,
> the keyspace has about 14 tables and is probably not going to evolve
> much.
>
> Are there any counter-indications to using a lot of keyspaces (300+) in
> one Cassandra cluster? Are there any good practices that we should
> follow?
> After reading the "Anti-patterns in Cassandra > Too many keyspaces or
> tables" documentation, does it mean we should plan ahead to split our
> keyspaces among several clusters?
>
> Finally, would there be any other way to achieve what we want to do?
>
> Thanks for your help!
>
> Jonathan
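To make the trade-off discussed in this thread concrete: with the single-keyspace model, removing a batch means issuing one DELETE per (batch_id, id1) partition, which requires knowing every id1 in the batch; with a keyspace per batch, removal collapses into a single DDL statement. A minimal sketch, where the keyspace name and replication settings are illustrative assumptions:

```sql
-- Single-keyspace model: one DELETE per partition, so every id1 in the
-- batch must be known in advance (e.g. tracked in a separate table).
DELETE FROM computation_results WHERE batch_id = 42 AND id1 = 1;
DELETE FROM computation_results WHERE batch_id = 42 AND id1 = 2;
-- ... one statement per (batch_id, id1) partition ...

-- Keyspace-per-batch model: batch removal is a single drop.
-- (Keyspace name and replication are illustrative only.)
CREATE KEYSPACE IF NOT EXISTS batch_20151124
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

DROP KEYSPACE batch_20151124;
```

The drop is cheap per se, but as noted above, the schema and memory overhead of hundreds of keyspaces (each with its ~14 tables) is what bounds this design.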
