Re: Many keyspaces pattern

2015-11-24 Thread Jack Krupansky
And DateTieredCompactionStrategy can be used to efficiently remove whole
sstables when the TTL expires, but this implies knowing what TTL to set in
advance.
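
For illustration, a minimal sketch of that setup (the table name and the
30-day TTL are assumptions, not from the thread): a table-level default TTL
combined with DateTieredCompactionStrategy, so whole SSTables can be dropped
once everything in them has expired:

   CREATE TABLE computation_results_ttl (
       batch_id int,
       id1 int,
       id2 int,
       value double,
       PRIMARY KEY ((batch_id, id1), id2)
   ) WITH CLUSTERING ORDER BY (id2 ASC)
     AND default_time_to_live = 2592000  -- 30 days, must be chosen in advance
     AND compaction = {'class': 'DateTieredCompactionStrategy'};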

I don't know if there are any tools to bulk delete data older than a specific
age when DateTieredCompactionStrategy is used, but that might be a nice
feature.

-- Jack Krupansky

On Tue, Nov 24, 2015 at 12:53 PM, Saladi Naidu wrote:

> I can think of the following features to solve this:
>
> 1. If you know the time period after which data should be removed, then
> use the TTL feature.
> 2. Model the data as a time series and use an inverted index to query the
> data by time period.
>
> Naidu Saladi
>
>
>
> On Tuesday, November 24, 2015 6:49 AM, Jack Krupansky <
> jack.krupan...@gmail.com> wrote:
>
>
> How often is sometimes - closer to 20% of the batches or 2%?
>
> How are you querying batches, both current and older ones?
>
> As always, your queries should drive your data models.
>
> If deleting a batch is very infrequent, maybe best to not do it and simply
> have logic in the app to ignore deleted batches - if your queries would
> reference them at all.
>
> What reasons would you have to delete a batch? Depending on the nature of
> the reason there may be an alternative.
>
> Make sure your cluster is adequately provisioned so that these expensive
> operations can occur in parallel to reduce their time and resources per
> node.
>
> Do all batches eventually get aged and deleted or are you expecting that
> most batches will live for many years to come? Have you planned for how you
> will grow the cluster over time?
>
> Maybe bite the bullet and use a background process to delete a batch if
> deletion is competing too heavily with query access - if they really need
> to be deleted at all.
>
> The number of keyspaces - and/or tables - should be limited to the "low
> hundreds", and even then you are limited by the RAM and CPU of each node.
> If a keyspace has 14 tables, then 250/14 ≈ 18 would be a recommended upper
> limit for the number of keyspaces. Even if your total number of tables was
> under 300 or even 200, you would need to do a proof-of-concept
> implementation to verify that your specific data works well on your
> specific hardware.
>
>
> -- Jack Krupansky
>
> On Tue, Nov 24, 2015 at 5:05 AM, Jonathan Ballet wrote:
>
> Hi,
>
> We are running an application which produces a batch with several hundred
> gigabytes of data every night. Once a batch has been computed, it is never
> modified (no updates, no deletes); we just keep producing new batches every
> day.
>
> Now, we are *sometimes* interested in removing a specific batch
> altogether. At the moment, we are accumulating all this data into a single
> keyspace in which every table has a batch ID column that is also part of
> the primary key. A sample table looks similar to this:
>
>   CREATE TABLE computation_results (
>       batch_id int,
>       id1 int,
>       id2 int,
>       value double,
>       PRIMARY KEY ((batch_id, id1), id2)
>   ) WITH CLUSTERING ORDER BY (id2 ASC);
>
> But we found out that it is very difficult to remove a specific batch, as
> we need to know all the IDs to delete the entries, and it's both time- and
> resource-consuming (i.e. it takes a long time and I'm not sure it's going
> to scale at all).
>
> So, we are currently looking into having each of our batches in a keyspace
> of its own, so that removing a batch is merely equivalent to deleting a
> keyspace. Potentially, it means we will end up with several hundred
> keyspaces in one cluster, although most of the time only the most recent
> one will be used (we might still want to access the older ones, but that
> would be a very rare use case). At the moment, the keyspace has about 14
> tables and is probably not going to evolve much.
>
>
> Are there any counter-indications to using a lot of keyspaces (300+) in
> one Cassandra cluster?
> Are there any good practices that we should follow?
> After reading "Anti-patterns in Cassandra > Too many keyspaces or tables",
> does it mean we should plan ahead to split our keyspaces among several
> clusters?
>
> Finally, would there be any other way to achieve what we want to do?
>
> Thanks for your help!
>
>  Jonathan
>
>
>
>
>


Re: Many keyspaces pattern

2015-11-24 Thread Saladi Naidu
I can think of the following features to solve this:

1. If you know the time period after which data should be removed, then use
the TTL feature.
2. Model the data as a time series and use an inverted index to query the
data by time period (a rough sketch of both ideas follows below).

Naidu Saladi
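
A minimal, hypothetical sketch of those two ideas combined - a time-bucketed
table plus a per-row TTL (the table name, the day-based bucket, and the
30-day TTL are assumptions, not from the thread):

   CREATE TABLE results_by_day (
       day text,          -- time bucket, e.g. '2015-11-24'
       batch_id int,
       id1 int,
       id2 int,
       value double,
       PRIMARY KEY ((day), batch_id, id1, id2)
   );

   -- each row expires 30 days (2592000 s) after it is written
   INSERT INTO results_by_day (day, batch_id, id1, id2, value)
   VALUES ('2015-11-24', 42, 1, 2, 0.5) USING TTL 2592000;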
 


On Tuesday, November 24, 2015 6:49 AM, Jack Krupansky wrote:
 

How often is sometimes - closer to 20% of the batches or 2%?

How are you querying batches, both current and older ones?

As always, your queries should drive your data models.

If deleting a batch is very infrequent, maybe best to not do it and simply
have logic in the app to ignore deleted batches - if your queries would
reference them at all.

What reasons would you have to delete a batch? Depending on the nature of
the reason there may be an alternative.

Make sure your cluster is adequately provisioned so that these expensive
operations can occur in parallel to reduce their time and resources per
node.

Do all batches eventually get aged and deleted or are you expecting that
most batches will live for many years to come? Have you planned for how you
will grow the cluster over time?

Maybe bite the bullet and use a background process to delete a batch if
deletion is competing too heavily with query access - if they really need
to be deleted at all.

The number of keyspaces - and/or tables - should be limited to the "low
hundreds", and even then you are limited by the RAM and CPU of each node.
If a keyspace has 14 tables, then 250/14 ≈ 18 would be a recommended upper
limit for the number of keyspaces. Even if your total number of tables was
under 300 or even 200, you would need to do a proof-of-concept
implementation to verify that your specific data works well on your
specific hardware.

-- Jack Krupansky
On Tue, Nov 24, 2015 at 5:05 AM, Jonathan Ballet  wrote:

Hi,

We are running an application which produces a batch with several hundred
gigabytes of data every night. Once a batch has been computed, it is never
modified (no updates, no deletes); we just keep producing new batches every
day.

Now, we are *sometimes* interested in removing a specific batch altogether.
At the moment, we are accumulating all this data into a single keyspace in
which every table has a batch ID column that is also part of the primary
key. A sample table looks similar to this:

  CREATE TABLE computation_results (
      batch_id int,
      id1 int,
      id2 int,
      value double,
      PRIMARY KEY ((batch_id, id1), id2)
  ) WITH CLUSTERING ORDER BY (id2 ASC);

But we found out that it is very difficult to remove a specific batch, as we
need to know all the IDs to delete the entries, and it's both time- and
resource-consuming (i.e. it takes a long time and I'm not sure it's going to
scale at all).

So, we are currently looking into having each of our batches in a keyspace of
its own, so that removing a batch is merely equivalent to deleting a keyspace.
Potentially, it means we will end up with several hundred keyspaces in one
cluster, although most of the time only the most recent one will be used (we
might still want to access the older ones, but that would be a very rare use
case). At the moment, the keyspace has about 14 tables and is probably not
going to evolve much.


Are there any counter-indications to using a lot of keyspaces (300+) in one
Cassandra cluster?
Are there any good practices that we should follow?
After reading "Anti-patterns in Cassandra > Too many keyspaces or tables",
does it mean we should plan ahead to split our keyspaces among several
clusters?

Finally, would there be any other way to achieve what we want to do?

Thanks for your help!

 Jonathan




  

Re: Many keyspaces pattern

2015-11-24 Thread Jack Krupansky
How often is sometimes - closer to 20% of the batches or 2%?

How are you querying batches, both current and older ones?

As always, your queries should drive your data models.

If deleting a batch is very infrequent, maybe best to not do it and simply
have logic in the app to ignore deleted batches - if your queries would
reference them at all.

What reasons would you have to delete a batch? Depending on the nature of
the reason there may be an alternative.

Make sure your cluster is adequately provisioned so that these expensive
operations can occur in parallel to reduce their time and resources per
node.

Do all batches eventually get aged and deleted or are you expecting that
most batches will live for many years to come? Have you planned for how you
will grow the cluster over time?

Maybe bite the bullet and use a background process to delete a batch if
deletion is competing too heavily with query access - if they really need
to be deleted at all.

The number of keyspaces - and/or tables - should be limited to the "low
hundreds", and even then you are limited by the RAM and CPU of each node.
If a keyspace has 14 tables, then 250/14 ≈ 18 would be a recommended upper
limit for the number of keyspaces. Even if your total number of tables was
under 300 or even 200, you would need to do a proof-of-concept
implementation to verify that your specific data works well on your
specific hardware.
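
As a quick sanity check of how many tables a cluster already carries
(assuming Cassandra 2.x, where schema metadata lives in the system keyspace;
3.0+ uses system_schema.tables instead), something like:

   -- list every table per keyspace, then count them per keyspace client-side
   SELECT keyspace_name, columnfamily_name
   FROM system.schema_columnfamilies;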


-- Jack Krupansky

On Tue, Nov 24, 2015 at 5:05 AM, Jonathan Ballet  wrote:

> Hi,
>
> We are running an application which produces a batch with several hundred
> gigabytes of data every night. Once a batch has been computed, it is never
> modified (no updates, no deletes); we just keep producing new batches every
> day.
>
> Now, we are *sometimes* interested in removing a specific batch
> altogether. At the moment, we are accumulating all this data into a single
> keyspace in which every table has a batch ID column that is also part of
> the primary key. A sample table looks similar to this:
>
>   CREATE TABLE computation_results (
>       batch_id int,
>       id1 int,
>       id2 int,
>       value double,
>       PRIMARY KEY ((batch_id, id1), id2)
>   ) WITH CLUSTERING ORDER BY (id2 ASC);
>
> But we found out that it is very difficult to remove a specific batch, as
> we need to know all the IDs to delete the entries, and it's both time- and
> resource-consuming (i.e. it takes a long time and I'm not sure it's going
> to scale at all).
>
> So, we are currently looking into having each of our batches in a keyspace
> of its own, so that removing a batch is merely equivalent to deleting a
> keyspace. Potentially, it means we will end up with several hundred
> keyspaces in one cluster, although most of the time only the most recent
> one will be used (we might still want to access the older ones, but that
> would be a very rare use case). At the moment, the keyspace has about 14
> tables and is probably not going to evolve much.
>
>
> Are there any counter-indications to using a lot of keyspaces (300+) in
> one Cassandra cluster?
> Are there any good practices that we should follow?
> After reading "Anti-patterns in Cassandra > Too many keyspaces or tables",
> does it mean we should plan ahead to split our keyspaces among several
> clusters?
>
> Finally, would there be any other way to achieve what we want to do?
>
> Thanks for your help!
>
>  Jonathan
>