Re: Right sizing Cassandra data nodes

2018-02-28 Thread kurt greaves
The problem with higher densities is operations, not querying. When you
need to add nodes, repair, or do any streaming operation, having more than
3TB per node makes it more difficult. It's certainly doable, but you'll
probably run into issues. Having said that, an insert-only workload is the
best candidate for higher densities.

I'll note that you don't really need to bucket by partition. If you can use
clustering keys (e.g. a timestamp), Cassandra will be smart enough to read
only from the SSTables that contain the relevant rows.

But to answer your question, all data is active data. There is no inactive
data. If all you query is the past two months, that's the only data that
will be read by Cassandra. It won't go and read old data unless you tell it
to.

On 24 February 2018 at 07:02, onmstester onmstester wrote:

> Another question on node density, in this scenario:
> 1. We need to keep several years of time series data for a write-heavy
> system in Cassandra (> 10K ops per second).
> 2. The system is insert-only, and inserted data is never updated.
> 3. In the partition key we use the number of months since 1970, so the
> data for each month lands in separate partitions.
> 4. Because of point 2, once a month ends its partitions are never
> accessed for writes again.
> 5. More than 90% of read requests concern the current month's partitions,
> so we rarely access old data; we just keep it for the remaining 10% of
> reports!
> 6. Reads are tiny compared to writes overall (around 0.0001% of
> overall time).
>
> So, finally, the question:
> Even in this scenario, would the active data be the whole data set (this
> month + all previous months), or only the data accessed by most reads
> and writes (only the past two months)?
> Could I use more than 3TB per node for this scenario?
>
> On Tue, 20 Feb 2018 14:58:39 +0330, Rahul Singh wrote:
>
> Node density is the active data managed in the cluster divided by the
> number of active nodes. E.g., if you have 500TB of active data under
> management, then you would need 250-500 nodes (that is, 1-2TB per node) to
> get beast-like, optimum performance. It also depends on how much memory is
> on the boxes and whether you are using SSD drives. SSD doesn’t replace
> memory, but it doesn’t hurt.
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Feb 19, 2018, 5:55 PM -0500, Charulata Sharma (charshar) <
> chars...@cisco.com>, wrote:
>
> Thanks for the response Rahul. I did not understand the “node density”
> point.
>
>
>
> Charu
>
>
>
> *From:* Rahul Singh 
> *Reply-To:* "user@cassandra.apache.org" 
> *Date:* Monday, February 19, 2018 at 12:32 PM
> *To:* "user@cassandra.apache.org" 
> *Subject:* Re: Right sizing Cassandra data nodes
>
>
>
> 1. I would keep OpsCenter on a different cluster. Why unnecessarily put
> the traffic and compute load for OpsCenter data on a real business data
> cluster?
> 2. Don’t put more than 1-2TB per node. Maybe 3TB. As node density
> increases, it creates more replication, read repairs, etc., and more
> memory usage for doing the compactions.
> 3. You can have as much space as you want for snapshots as long as you
> keep them on another disk, or even move them to a SAN / NAS. All you may
> care about is the most recent snapshot on the physical machine / disks of
> a live node.
>
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
>
> On Feb 19, 2018, 3:08 PM -0500, Charulata Sharma (charshar) <
> chars...@cisco.com>, wrote:
>
> Hi All,
>
>
>
> Looking for some insight into how application data archive and purge is
> carried out for a C* database. Are there standard guidelines for
> calculating the amount of space that can be used for storing data on a
> specific node?
>
> Some pointers that I got while researching are:
>
> -  Allocate 50% space for compaction, e.g. if data size is 50GB
> then allocate 25GB for compaction.
> -  Snapshot strategy. If old snapshots are present, then they
> occupy disk space.
> -  Allocate some percentage of storage (  ) for system tables
> and OpsCenter tables?
>
> We have a scenario where certain transaction data needs to be archived
> based on business rules and some purged, so before deciding on an A&P
> strategy, I am trying to analyze how much transactional data can be
> stored given the current node capacity. I also found out that the space
> available metric shown in OpsCenter is not very reliable because it
> doesn’t show the snapshot space. In our case, we have a huge snapshot
> size. For some unexplained reason, we seem to be taking snapshots of our
> data every hour and purging them only after 7 days.
>
>
>
>
>
> Thanks,
>
> Charu
>
> Cisco Systems.


Re: Right sizing Cassandra data nodes

2018-02-23 Thread onmstester onmstester
Another question on node density, in this scenario:

1. We need to keep several years of time series data for a write-heavy system
in Cassandra (> 10K ops per second).

2. The system is insert-only, and inserted data is never updated.

3. In the partition key we use the number of months since 1970, so the data
for each month lands in separate partitions.

4. Because of point 2, once a month ends its partitions are never accessed
for writes again.

5. More than 90% of read requests concern the current month's partitions, so
we rarely access old data; we just keep it for the remaining 10% of reports!

6. Reads are tiny compared to writes overall (around 0.0001% of overall
time).
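
For readers following along, the month bucket from point 3 can be sketched
roughly as below. This is only an illustration: the function name
months_since_epoch is made up here, and it assumes the bucket is counted in
whole calendar months from January 1970 in UTC, which the post does not
spell out.

```python
from datetime import datetime, timezone


def months_since_epoch(ts: datetime) -> int:
    """Whole calendar months between 1970-01 and ts, evaluated in UTC.

    Hypothetical helper: the thread only says "number of months since 1970".
    """
    ts = ts.astimezone(timezone.utc)
    return (ts.year - 1970) * 12 + (ts.month - 1)


# Every write in the same calendar month maps to the same bucket value,
# so a finished month is never written to again.
print(months_since_epoch(datetime(2018, 2, 23, tzinfo=timezone.utc)))  # 577
print(months_since_epoch(datetime(2018, 3, 1, tzinfo=timezone.utc)))   # 578
```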



So, finally, the question:

Even in this scenario, would the active data be the whole data set (this
month + all previous months), or only the data accessed by most reads and
writes (only the past two months)?

Could I use more than 3TB per node for this scenario?



On Tue, 20 Feb 2018 14:58:39 +0330, Rahul Singh <rahul.xavier.si...@gmail.com> wrote:




Node density is the active data managed in the cluster divided by the number
of active nodes. E.g., if you have 500TB of active data under management, then
you would need 250-500 nodes (that is, 1-2TB per node) to get beast-like,
optimum performance. It also depends on how much memory is on the boxes and
whether you are using SSD drives. SSD doesn’t replace memory, but it doesn’t
hurt.



--
Rahul Singh
rahul.si...@anant.us

Anant Corporation




On Feb 19, 2018, 5:55 PM -0500, Charulata Sharma (charshar) <chars...@cisco.com>, wrote:





Thanks for the response Rahul. I did not understand the “node density” point.

 

Charu

 

From: Rahul Singh <rahul.xavier.si...@gmail.com>
 Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
 Date: Monday, February 19, 2018 at 12:32 PM
 To: "user@cassandra.apache.org" <user@cassandra.apache.org>
 Subject: Re: Right sizing Cassandra data nodes

 


1. I would keep OpsCenter on a different cluster. Why unnecessarily put the
traffic and compute load for OpsCenter data on a real business data cluster?

2. Don’t put more than 1-2TB per node. Maybe 3TB. As node density increases,
it creates more replication, read repairs, etc., and more memory usage for
doing the compactions.

3. You can have as much space as you want for snapshots as long as you keep
them on another disk, or even move them to a SAN / NAS. All you may care about
is the most recent snapshot on the physical machine / disks of a live node.





--
Rahul Singh
rahul.si...@anant.us

Anant Corporation





On Feb 19, 2018, 3:08 PM -0500, Charulata Sharma (charshar) 
<chars...@cisco.com>, wrote:

 


Hi All,

 

Looking for some insight into how application data archive and purge is
carried out for a C* database. Are there standard guidelines for calculating
the amount of space that can be used for storing data on a specific node?

Some pointers that I got while researching are:

-  Allocate 50% space for compaction, e.g. if data size is 50GB then
allocate 25GB for compaction.

-  Snapshot strategy. If old snapshots are present, then they occupy
disk space.

-  Allocate some percentage of storage (  ) for system tables and
OpsCenter tables?

We have a scenario where certain transaction data needs to be archived based
on business rules and some purged, so before deciding on an A&P strategy, I am
trying to analyze how much transactional data can be stored given the current
node capacity. I also found out that the space available metric shown in
OpsCenter is not very reliable because it doesn’t show the snapshot space. In
our case, we have a huge snapshot size. For some unexplained reason, we seem
to be taking snapshots of our data every hour and purging them only after 7
days.

 

 

Thanks,

Charu

Cisco Systems.


Re: Right sizing Cassandra data nodes

2018-02-20 Thread Rahul Singh
Node density is the active data managed in the cluster divided by the number
of active nodes. E.g., if you have 500TB of active data under management, then
you would need 250-500 nodes (that is, 1-2TB per node) to get beast-like,
optimum performance. It also depends on how much memory is on the boxes and
whether you are using SSD drives. SSD doesn’t replace memory, but it doesn’t
hurt.
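
To make the arithmetic above concrete, here is a rough sketch. The function
name nodes_needed is made up for illustration, and it assumes "active data"
is whatever total the cluster has to hold; the message does not say whether
that figure already includes replication.

```python
import math


def nodes_needed(active_data_tb: float, density_tb_per_node: float) -> int:
    """Nodes required to hold active_data_tb at a target per-node density.

    Hypothetical helper: node density = active data / number of active nodes,
    so the node count is the data divided by the density, rounded up.
    """
    return math.ceil(active_data_tb / density_tb_per_node)


# 500TB at 2TB, 1TB and 3TB per node.
for density in (2, 1, 3):
    print(f"{density}TB per node -> {nodes_needed(500, density)} nodes")
```

At 2TB and 1TB per node this gives the 250-500 node range quoted above; the
"maybe 3TB" upper limit from earlier in the thread would bring it down to
167 nodes.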

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Feb 19, 2018, 5:55 PM -0500, Charulata Sharma (charshar) wrote:
> Thanks for the response Rahul. I did not understand the “node density” point.
>
> Charu
>
> From: Rahul Singh 
> Reply-To: "user@cassandra.apache.org" 
> Date: Monday, February 19, 2018 at 12:32 PM
> To: "user@cassandra.apache.org" 
> Subject: Re: Right sizing Cassandra data nodes
>
> 1. I would keep OpsCenter on a different cluster. Why unnecessarily put the
> traffic and compute load for OpsCenter data on a real business data cluster?
> 2. Don’t put more than 1-2TB per node. Maybe 3TB. As node density
> increases, it creates more replication, read repairs, etc., and more memory
> usage for doing the compactions.
> 3. You can have as much space as you want for snapshots as long as you keep
> them on another disk, or even move them to a SAN / NAS. All you may care
> about is the most recent snapshot on the physical machine / disks of a live
> node.
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Feb 19, 2018, 3:08 PM -0500, Charulata Sharma (charshar) wrote:
>
> > Hi All,
> >
> > Looking for some insight into how application data archive and purge is
> > carried out for a C* database. Are there standard guidelines for
> > calculating the amount of space that can be used for storing data on a
> > specific node?
> >
> > Some pointers that I got while researching are:
> >
> > -  Allocate 50% space for compaction, e.g. if data size is 50GB
> > then allocate 25GB for compaction.
> > -  Snapshot strategy. If old snapshots are present, then they
> > occupy disk space.
> > -  Allocate some percentage of storage (  ) for system tables
> > and OpsCenter tables?
> >
> > We have a scenario where certain transaction data needs to be archived
> > based on business rules and some purged, so before deciding on an A&P
> > strategy, I am trying to analyze how much transactional data can be
> > stored given the current node capacity. I also found out that the space
> > available metric shown in OpsCenter is not very reliable because it
> > doesn’t show the snapshot space. In our case, we have a huge snapshot
> > size. For some unexplained reason, we seem to be taking snapshots of our
> > data every hour and purging them only after 7 days.
> >
> >
> > Thanks,
> > Charu
> > Cisco Systems.


Re: Right sizing Cassandra data nodes

2018-02-19 Thread Charulata Sharma (charshar)
Thanks for the response Rahul. I did not understand the “node density” point.

Charu

From: Rahul Singh 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, February 19, 2018 at 12:32 PM
To: "user@cassandra.apache.org" 
Subject: Re: Right sizing Cassandra data nodes

1. I would keep OpsCenter on a different cluster. Why unnecessarily put the
traffic and compute load for OpsCenter data on a real business data cluster?
2. Don’t put more than 1-2TB per node. Maybe 3TB. As node density increases,
it creates more replication, read repairs, etc., and more memory usage for
doing the compactions.
3. You can have as much space as you want for snapshots as long as you keep
them on another disk, or even move them to a SAN / NAS. All you may care about
is the most recent snapshot on the physical machine / disks of a live node.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Feb 19, 2018, 3:08 PM -0500, Charulata Sharma (charshar) wrote:

Hi All,

Looking for some insight into how application data archive and purge is
carried out for a C* database. Are there standard guidelines for calculating
the amount of space that can be used for storing data on a specific node?

Some pointers that I got while researching are:

-  Allocate 50% space for compaction, e.g. if data size is 50GB then
allocate 25GB for compaction.

-  Snapshot strategy. If old snapshots are present, then they occupy
disk space.

-  Allocate some percentage of storage (  ) for system tables and
OpsCenter tables?
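
The first pointer can be written out as a small calculation. This is a
sketch of the rule exactly as stated in the list (headroom = 50% of the data
size), not a recommendation: the function names and the reserved_gb
parameter are made up for illustration, and the fraction that is actually
safe depends on the compaction strategy in use.

```python
def compaction_headroom_gb(data_gb: float, fraction: float = 0.5) -> float:
    """Space to set aside for compaction, as a fraction of the data size."""
    return data_gb * fraction


def usable_data_gb(disk_gb: float, fraction: float = 0.5,
                   reserved_gb: float = 0.0) -> float:
    """Largest data size whose data + compaction headroom fits on the disk.

    reserved_gb is a hypothetical placeholder for anything else kept on the
    same disk (snapshots, system tables); the thread gives no figure for it.
    """
    return (disk_gb - reserved_gb) / (1 + fraction)


print(compaction_headroom_gb(50))  # 25.0, the 50GB -> 25GB example
print(usable_data_gb(75))          # 50.0
```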

We have a scenario where certain transaction data needs to be archived based
on business rules and some purged, so before deciding on an A&P strategy, I am
trying to analyze how much transactional data can be stored given the current
node capacity. I also found out that the space available metric shown in
OpsCenter is not very reliable because it doesn’t show the snapshot space. In
our case, we have a huge snapshot size. For some unexplained reason, we seem
to be taking snapshots of our data every hour and purging them only after 7
days.
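
As a rough way to see how much the snapshots weigh, the sketch below counts
how many snapshots an hourly schedule with 7-day retention keeps, and sums
the files under any "snapshots" directory below a data directory. Both
function names are made up for illustration, and the directory layout
(<data_dir>/<keyspace>/<table>/snapshots/<name>/) is assumed. Because
snapshots are hard links to SSTable files, the same file can be counted once
per snapshot that references it, so the total is an upper bound rather than
the real extra disk usage; nodetool listsnapshots reports sizes directly.

```python
import os


def retained_snapshots(interval_hours: int, retention_days: int) -> int:
    """Number of snapshots on disk at steady state for a given schedule."""
    return retention_days * 24 // interval_hours


def snapshot_bytes(data_dir: str) -> int:
    """Sum file sizes under every 'snapshots' directory below data_dir.

    Hard-linked files shared between snapshots are counted each time they
    appear, so treat the result as an upper bound.
    """
    total = 0
    for root, _dirs, files in os.walk(data_dir):
        # Only look at path components below data_dir itself.
        parts = os.path.relpath(root, data_dir).split(os.sep)
        if "snapshots" in parts:
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
    return total


print(retained_snapshots(1, 7))  # 168 snapshots for hourly / 7 days
```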


Thanks,
Charu
Cisco Systems.





Re: Right sizing Cassandra data nodes

2018-02-19 Thread Rahul Singh
1. I would keep OpsCenter on a different cluster. Why unnecessarily put the
traffic and compute load for OpsCenter data on a real business data cluster?
2. Don’t put more than 1-2TB per node. Maybe 3TB. As node density increases,
it creates more replication, read repairs, etc., and more memory usage for
doing the compactions.
3. You can have as much space as you want for snapshots as long as you keep
them on another disk, or even move them to a SAN / NAS. All you may care about
is the most recent snapshot on the physical machine / disks of a live node.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Feb 19, 2018, 3:08 PM -0500, Charulata Sharma (charshar) wrote:
> Hi All,
>
> Looking for some insight into how application data archive and purge is
> carried out for a C* database. Are there standard guidelines for calculating
> the amount of space that can be used for storing data on a specific node?
>
> Some pointers that I got while researching are:
>
> -  Allocate 50% space for compaction, e.g. if data size is 50GB then
> allocate 25GB for compaction.
> -  Snapshot strategy. If old snapshots are present, then they occupy
> disk space.
> -  Allocate some percentage of storage (  ) for system tables and
> OpsCenter tables?
>
> We have a scenario where certain transaction data needs to be archived based
> on business rules and some purged, so before deciding on an A&P strategy, I
> am trying to analyze how much transactional data can be stored given the
> current node capacity. I also found out that the space available metric
> shown in OpsCenter is not very reliable because it doesn’t show the snapshot
> space. In our case, we have a huge snapshot size. For some unexplained
> reason, we seem to be taking snapshots of our data every hour and purging
> them only after 7 days.
>
>
> Thanks,
> Charu
> Cisco Systems.