Hi,

We are running an application that produces a nightly batch of several hundred gigabytes of data. Once a batch has been computed, it is never modified (no updates, no deletes); we just keep producing new batches every day.

Now, we *sometimes* want to remove a specific batch entirely. At the moment, we accumulate all the data in a single keyspace, where every table has a batch ID column that is part of the primary key. A sample table looks like this:

  CREATE TABLE computation_results (
      batch_id int,
      id1 int,
      id2 int,
      value double,
      PRIMARY KEY ((batch_id, id1), id2)
  ) WITH CLUSTERING ORDER BY (id2 ASC);

But we found that removing a specific batch is very difficult: we need to know all the IDs to delete the entries, and it is both time- and resource-consuming (i.e. it takes a long time, and I'm not sure it will scale at all).
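
To illustrate the problem, here is a minimal Python sketch of what removing a batch amounts to today. The helper name delete_batch_statements is made up for this mail, and it only builds CQL strings rather than talking to a cluster. It assumes we already have the full list of id1 values for the batch, which is exactly the part that is hard to obtain.

```python
def delete_batch_statements(batch_id, id1_values, table="computation_results"):
    """Build one partition-level DELETE per (batch_id, id1) pair.

    The partition key of the sample table is (batch_id, id1), so a batch
    cannot be removed with a single statement on batch_id alone.
    """
    return [
        f"DELETE FROM {table} WHERE batch_id = {int(batch_id)} AND id1 = {int(id1)};"
        for id1 in id1_values
    ]


if __name__ == "__main__":
    # Hypothetical batch 42 with three known id1 values.
    for statement in delete_batch_statements(42, [1, 2, 3]):
        print(statement)
```

The number of statements grows with the number of distinct id1 values in the batch, and this has to be repeated for every table.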

So, we are currently looking into putting each batch in a keyspace of its own, so that removing a batch simply amounts to dropping a keyspace. Potentially, this means we will end up with several hundred keyspaces in one cluster, although most of the time only the most recent one will be used (we might still want to access the older ones, but that would be a very rare use case). At the moment, the keyspace has about 14 tables and is probably not going to evolve much.
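
A rough Python sketch of the per-batch keyspace idea, for illustration only. The batch_ prefix and the helper names are hypothetical; the functions only build the keyspace name and the CQL string, they do not connect to Cassandra.

```python
def batch_keyspace(batch_id, prefix="batch_"):
    """Return the (hypothetical) keyspace name for a given batch ID."""
    batch_id = int(batch_id)
    if batch_id < 0:
        # A minus sign would not be valid in an unquoted keyspace name.
        raise ValueError("batch_id must be non-negative")
    return f"{prefix}{batch_id}"


def drop_batch_statement(batch_id):
    """Removing a whole batch becomes a single DROP KEYSPACE statement."""
    return f"DROP KEYSPACE IF EXISTS {batch_keyspace(batch_id)};"


if __name__ == "__main__":
    print(drop_batch_statement(42))
```

With this layout, the 14 or so tables would be created once per batch keyspace, and no batch_id column would be needed in the tables themselves.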


Are there any drawbacks to using a lot of keyspaces (300+) in one Cassandra cluster?
Are there any good practices that we should follow?
Having read "Anti-patterns in Cassandra > Too many keyspaces or tables": does it mean we should plan ahead and split our keyspaces across several clusters?

Finally, is there any other way to achieve what we want to do?

Thanks for your help!

 Jonathan
