Re: data types storage saving

2018-03-09 Thread onmstester onmstester
I've found out that blobs give no gain in storage savings!

I had some 16-digit numbers which were previously saved as bigint, but after saving 
them as blob the storage usage per record is still the same.
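(For reference, a rough way to compare the two layouts on disk. The keyspace/table 
names below are made up and the data directory is the default path; cell overhead 
and compression usually dwarf the one or two bytes a smaller encoding saves, which 
may explain why there is no visible gain:)

nodetool flush my_ks
nodetool tablestats my_ks.readings_bigint my_ks.readings_blob | grep -i "space used"
du -sh /var/lib/cassandra/data/my_ks/readings_*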


Sent using Zoho Mail






 On Tue, 06 Mar 2018 19:18:31 +0330 Carl Mueller 
carl.muel...@smartthings.com wrote 




If you're willing to do the data type conversion on insert and retrieval, then 
you could use blobs as a sort of "adaptive length int", AFAIK.



On Tue, Mar 6, 2018 at 6:02 AM, onmstester onmstester 
onmstes...@zoho.com wrote:








I'm using the int data type for one of my columns, but for 99.99...% of rows its 
value would never exceed 65K. Should I change it to smallint (it would save some 
gigabytes of disk in a few months), or would Cassandra compression take care of it 
in storage?

What about the blob data type? Isn't it better to use it in such cases? Could I 
alter the column type from smallint to int in the future, if needed?



Sent using Zoho Mail












Re: uneven data movement in one of the disk in Cassandra

2018-03-09 Thread Jeff Jirsa
The version here really matters. If it’s higher than 3.2, it’s probably related 
to this issue which places sstables for a given range in the same directory to 
avoid data loss on single drive failure:

https://issues.apache.org/jira/browse/CASSANDRA-6696



-- 
Jeff Jirsa


> On Mar 9, 2018, at 9:38 PM, Madhu B  wrote:
> 
> Yes, that helps. Thanks, James, for correcting me.
> 
>> On Mar 9, 2018, at 9:52 PM, James Shaw  wrote:
>> 
>> Per my testing, repair does not help.
>> Repair builds a Merkle tree to compare data; it only writes a new file where there 
>> is a difference, and those files are very small in the end (which of course means 
>> most data are already in sync).
>> 


Re: uneven data movement in one of the disk in Cassandra

2018-03-09 Thread Madhu B
Yes, that helps. Thanks, James, for correcting me.

> On Mar 9, 2018, at 9:52 PM, James Shaw  wrote:
> 
> Per my testing, repair does not help.
> Repair builds a Merkle tree to compare data; it only writes a new file where there 
> is a difference, and those files are very small in the end (which of course means 
> most data are already in sync).
> 


Re: uneven data movement in one of the disk in Cassandra

2018-03-09 Thread James Shaw
Per my testing, repair does not help.
Repair builds a Merkle tree to compare data; it only writes a new file where there
is a difference, and those files are very small in the end (which of course means
most data are already in sync).

On Fri, Mar 9, 2018 at 10:31 PM, Madhu B  wrote:

> Yasir,
> I think you need to run full repair in off-peak hours
>
> Thanks,
> Madhu
>
>


Re: uneven data movement in one of the disk in Cassandra

2018-03-09 Thread James Shaw
Ours has a similar issue and I am working to solve it this weekend.
In our case it is because STCS makes one huge sstable for that table grow bigger and
bigger after each compaction (this is the nature of STCS, nothing wrong with it):
even though almost all data has a 30-day TTL, the tombstones are not evicted, since
the largest file is waiting for three other files of similar size before it is
compacted again. The largest file is 99.99% tombstones.

Use the command:  nodetool upgradesstables -a keyspace table
It will rewrite all existing sstables and evict the droppable tombstones.

In your case, first do a few checks:
1. cd /data/disk03/cassandra/data_prod/data
   du -ks * | sort -n
   to find which tables use the most space.

2. Check the snapshots for the bigger tables found above;
   it is possible that old snapshots are the cause.

3. cd into the table directory and run
   sstablemetadata sstablefile
   to see whether the sstables have a lot of droppable tombstones.
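(A small sketch of check 3, run from inside one table's directory; the exact
wording of the sstablemetadata output varies between versions, so adjust the grep:)

for f in *-Data.db; do
  echo "== $f"
  sstablemetadata "$f" | grep -i droppable
done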

4. ls -lhS /data/disk*/cassandra/data_prod/data/"that_keyspace"/"that_table"*/*Data.db
   Look at all the sstable files and you will see what the next compaction will be.

Per my observation, small compactions seem to go to random disks, but when the
size is large, the result goes to the disk which has more free space.

5. If the biggest file is too big, it will wait a long time for its next compaction.
   You may test (sorry, this is not my case, so I am not 100% sure):
   1) on newer Cassandra (3.0+), you may try nodetool compact -s (it will split the output)
   2) on older Cassandra versions, stop Cassandra and use sstablesplit


Hope it helps

Thanks,

James


On Fri, Mar 9, 2018 at 7:14 AM, Kyrylo Lebediev 
wrote:

> Not sure where I heard this, but AFAIK data imbalance when multiple
> data_directories are in use is a known issue for older versions of
> Cassandra. This might be the root-cause of your issue.
>
> Which version of C* are you using?
>
> Unfortunately, don't remember in which version this imbalance issue was
> fixed.
>
>
> -- Kyrill
> --
> *From:* Yasir Saleem 
> *Sent:* Friday, March 9, 2018 1:34:08 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: uneven data movement in one of the disk in Cassandra
>
> Hi Alex,
>
> no active compaction, right now.
>
>
>
>
> On Fri, Mar 9, 2018 at 3:47 PM, Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
>
> On Fri, Mar 9, 2018 at 11:40 AM, Yasir Saleem 
> wrote:
>
> Thanks, Nicolas Guyomar
>
> I am new to cassandra, here is the properties which I can see in yaml
> file:
>
> # of compaction, including validation compaction.
> compaction_throughput_mb_per_sec: 16
> compaction_large_partition_warning_threshold_mb: 100
>
>
> To check currently active compaction please use this command:
>
> nodetool compactionstats -H
>
> on the host which shows the problem.
>
> --
> Alex
>
>
>


Re: Cassandra storage: Some thoughts

2018-03-09 Thread Oleksandr Shulgin
On 9 Mar 2018 16:56, "Vangelis Koukis"  wrote:

Hello all,

My name is Vangelis Koukis and I am a Founder and the CTO of Arrikto.

I'm writing to share our thoughts on how people run distributed,
stateful applications such as Cassandra on modern infrastructure,
and would love to get the community's feedback and comments.


Thanks, that sounds interesting.

At Arrikto we are building decentralized storage to tackle this problem
for cloud-native apps. Our software, Rok


Do I understand correctly that there is only a white paper available, but not
any source code?

 In this case,
  Cassandra only has to recover the changed parts, which is just a
  small fraction of the node data, and does not cause CPU load on
  the whole cluster.


How, if not by running a repair? And if it is a repair, why would it not put CPU
load on the other nodes?

Cheers,
--
Alex


Re: uneven data movement in one of the disk in Cassandra

2018-03-09 Thread Madhu B
Yasir,
I think you need to run full repair in off-peak hours

Thanks,
Madhu




Re: Cassandra storage: Some thoughts

2018-03-09 Thread Rahul Singh
Interesting. Can this be used in conjunction with bare metal? As in, does it 
present containers in place of the "real" node until that node is up and running?


--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Mar 9, 2018, 10:56 AM -0500, Vangelis Koukis , wrote:
> Hello all,
>
> My name is Vangelis Koukis and I am a Founder and the CTO of Arrikto.
>
> I'm writing to share our thoughts on how people run distributed,
> stateful applications such as Cassandra on modern infrastructure,
> and would love to get the community's feedback and comments.
>
> The fundamental question is: Where does a Cassandra node find its data?
> Does it run over local storage, e.g., a super-fast NVMe device, or does
> it run over some sort of external, managed storage, e.g., EBS on AWS?
>
> Going in one of the two directions is a tradeoff between flexibility on
> one hand, and performance/cost on the other.
>
> * External storage, e.g., EBS:
>
> Easy backups as thin/instant EBS snapshots, and easy node recovery
> in the case of instance failure by re-attaching the EBS data volume
> to a newly-created instance. But then, I/O bandwidth, I/O latency,
> and cost suffer.
>
> * Local NVMe:
>
> Blazing fast, with very low latency, excellent bandwidth, a
> fraction of the cost, but then it is not obvious how one backs up
> their data, or recovers from node failure.
>
> At Arrikto we are building decentralized storage to tackle this problem
> for cloud-native apps. Our software, Rok, allows you to run stateful
> apps directly over fast, local NVMe storage on-prem or on the cloud, and
> still be able to snapshot the containers and distribute them
> efficiently: across machines of the same cluster, or across distinct
> locations and administrative domains over a decentralized network.
>
> Rok runs on the side of Cassandra, which accesses local storage
> directly. It only has to intervene during snapshot-based node recovery,
> which is transparent to the application. It does not invoke an
> application-wide data recovery and rebalancing operation, which would
> put load on the whole cluster and impact application responsiveness.
> Instead, it performs block-level recovery of this specific node from the
> Rok snapshot store, e.g., S3, with predictable performance.
>
> This solves four important issues we have seen people running Cassandra
> at scale face today:
>
> * Node recovery / node migration:
>
> If you lose an entire Cassandra node, then your database will
> continue operating normally, as Rok in combination with your
> Container Orchestrator (e.g., Kubernetes) will present another
> Cassandra node. This node will have the data of the latest
> snapshot that resides on the Rok snapshot store. In this case,
> Cassandra only has to recover the changed parts, which is just a
> small fraction of the node data, and does not cause CPU load on
> the whole cluster. Similarly, you can migrate a Cassandra node
> from one physical host to another, without depending on external,
> EBS-like storage.
>
> * Backup and recovery:
>
> You can use Rok to take a full backup of your whole application,
> along with the DB, as a group-consistent snapshot of its VMs or
> containers, and store it externally. This does not depend on app-
> or Cassandra-specific functionality.
>
> * Data mobility:
>
> You can synchronize these snapshots to different locations, e.g.,
> across regions or cloud providers, and across administrative
> domains, i.e., share them with others without giving them direct
> access to your Cassandra DB. You can then spawn your entire
> application stack in the new location.
>
> * Testing / analytics:
>
> Being able to spawn a copy of your Cassandra DB as a thin clone
> means you can have test & dev workflows running in parallel, on
> independent, mutable clones, with real data underneath. Similarly,
> your analytics team can run their lengthy reporting and analytics
> workloads on an independent clone of your transactional DB, on
> completely distinct hardware, or even on a different location.
>
> So far, initial validation of our solution with early adopters shows
> significant performance gains at a fraction of the cost of external
> storage, while enabling a multi-region setup.
>
> Here are some numbers and a whitepaper to support this:
> https://journal.arrikto.com/why-your-cassandra-needs-local-nvme-and-rok-1787b9fc286d
> http://arrikto.com/wp-content/uploads/2018/03/20180206-rok_decentralized_storage_for_the_cloud_native_world.pdf
>
> If the above sounds interesting, we are eager to hear from you, learn
> about your potential use cases, and include you in our beta test
> program.
>
> Thank you,
> Vangelis.
>
> --
> Vangelis Koukis
> CTO, Arrikto Inc.
> 3505 El Camino Real, Palo Alto, CA 94306
> www.arrikto.com


Consistency level for the COPY command

2018-03-09 Thread Jai Bheemsen Rao Dhanwada
Hello,

What is the consistency level used when running the COPY command from the CQL
interface?

I don't see anything about it in the documentation:

https://docs.datastax.com/en/cql/3.1/cql/cql_reference/copy_r.html

I am setting the CONSISTENCY level in cqlsh and then running a COPY command;
does COPY honor that consistency level?
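(For reference, the kind of invocation I mean; the keyspace/table/file names are
made up. CONSISTENCY is a cqlsh session setting, so it has to be issued in the same
cqlsh invocation as the COPY:)

cqlsh 127.0.0.1 -e "CONSISTENCY LOCAL_QUORUM; COPY my_ks.my_table TO 'my_table.csv' WITH HEADER = true;"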

thanks in advance.


TWCS enabling tombstone compaction

2018-03-09 Thread Lucas Benevides
Dear community,

I have been using TWCS in my lab, with TTL'd data.
In the debug log there is always the sentence:
"TimeWindowCompactionStrategy.java:65 Disabling tombstone compactions for
TWCS". Indeed, the line is always repeated.

What does it actually mean? If my data expires, TWCS is already working and purging
the SSTables once they become fully expired. It surely sounds strange to me to
disable tombstone compaction.

Among the compaction subproperties there are only two TWCS-specific ones,
compaction_window_unit and compaction_window_size. Jeff already told us
that the STCS properties also apply to TWCS, although that is not in the
documentation.
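(For context, a minimal sketch of the kind of TWCS table I mean. The keyspace,
table and column names are made up, with a 30-day default TTL as an example; the
STCS-style tombstone options are shown commented out, since whether setting them
changes that "Disabling tombstone compactions" behaviour is exactly what I am
unsure about:)

cqlsh -e "
CREATE TABLE my_ks.sensor_data (
    sensor_id text,
    ts timestamp,
    value double,
    PRIMARY KEY (sensor_id, ts)
) WITH default_time_to_live = 2592000
  AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1
    -- STCS-style options such as 'tombstone_threshold' or
    -- 'unchecked_tombstone_compaction' can reportedly be added here as well
  };"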

Thanks in advance,
Lucas Benevides Dias


Re: Adding disk to operating C*

2018-03-09 Thread Jon Haddad
I agree with Jeff - I usually advise teams to cap their density around 3TB, 
especially with TWCS.  Read heavy workloads tend to use smaller datasets and 
ring size ends up being a function of performance tuning.

Since 2.2 bootstrap can now be resumed, which helps quite a bit with the 
streaming problem, see CASSANDRA-8838.
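(If a bootstrap dies partway through streaming, it can be picked up on the joining
node rather than started over:)

nodetool bootstrap resume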

Jon


> On Mar 9, 2018, at 7:39 AM, Jeff Jirsa  wrote:
> 
> 1.5 TB sounds very very conservative - 3-4T is where I set the limit at past 
> jobs. Have heard of people doing twice that (6-8T). 
> 
> -- 
> Jeff Jirsa
> 
> 
> On Mar 8, 2018, at 11:09 PM, Niclas Hedhman  > wrote:
> 
>> I am curious about the side comment; "Depending on your usecase you may not
>> want to have a data density over 1.5 TB per node."
>> 
>> Why is that? I am planning much bigger than that, and now you give me
>> pause...
>> 
>> 
>> Cheers
>> Niclas
>> 
>> On Wed, Mar 7, 2018 at 6:59 PM, Rahul Singh > > wrote:
>> Are you putting both the commitlogs and the Sstables on the adds? Consider 
>> moving your snapshots often if that’s also taking up space. Maybe able to 
>> save some space before you add drives.
>> 
>> You should be able to add these new drives and mount them without an issue. 
>> Try to avoid different number of data dirs across nodes. It makes automation 
>> of operational processes a little harder.
>> 
>> As an aside, Depending on your usecase you may not want to have a data 
>> density over 1.5 TB per node.
>> 
>> --
>> Rahul Singh
>> rahul.si...@anant.us 
>> 
>> Anant Corporation
>> 
>> On Mar 7, 2018, 1:26 AM -0500, Eunsu Kim > >, wrote:
>>> Hello,
>>> 
>>> I use 5 nodes to create a cluster of Cassandra. (SSD 1TB)
>>> 
>>> I'm trying to mount an additional disk(SSD 1TB) on each node because each 
>>> disk usage growth rate is higher than I expected. Then I will add the the 
>>> directory to data_file_directories in cassanra.yaml
>>> 
>>> Can I get advice from who have experienced this situation?
>>> If we go through the above steps one by one, will we be able to complete 
>>> the upgrade without losing data?
>>> The replication strategy is SimpleStrategy, RF 2.
>>> 
>>> Thank you in advance
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org 
>>> 
>>> For additional commands, e-mail: user-h...@cassandra.apache.org 
>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> Niclas Hedhman, Software Developer
>> http://zest.apache.org  - New Energy for Java



Cassandra storage: Some thoughts

2018-03-09 Thread Vangelis Koukis
Hello all,

My name is Vangelis Koukis and I am a Founder and the CTO of Arrikto.

I'm writing to share our thoughts on how people run distributed,
stateful applications such as Cassandra on modern infrastructure,
and would love to get the community's feedback and comments.

The fundamental question is: Where does a Cassandra node find its data?
Does it run over local storage, e.g., a super-fast NVMe device, or does
it run over some sort of external, managed storage, e.g., EBS on AWS?

Going in one of the two directions is a tradeoff between flexibility on
one hand, and performance/cost on the other.

   * External storage, e.g., EBS:

 Easy backups as thin/instant EBS snapshots, and easy node recovery
 in the case of instance failure by re-attaching the EBS data volume
 to a newly-created instance. But then, I/O bandwidth, I/O latency,
 and cost suffer.

   * Local NVMe:

 Blazing fast, with very low latency, excellent bandwidth, a
 fraction of the cost, but then it is not obvious how one backs up
 their data, or recovers from node failure.

At Arrikto we are building decentralized storage to tackle this problem
for cloud-native apps. Our software, Rok, allows you to run stateful
apps directly over fast, local NVMe storage on-prem or on the cloud, and
still be able to snapshot the containers and distribute them
efficiently: across machines of the same cluster, or across distinct
locations and administrative domains over a decentralized network.

Rok runs on the side of Cassandra, which accesses local storage
directly. It only has to intervene during snapshot-based node recovery,
which is transparent to the application. It does not invoke an
application-wide data recovery and rebalancing operation, which would
put load on the whole cluster and impact application responsiveness.
Instead, it performs block-level recovery of this specific node from the
Rok snapshot store, e.g., S3, with predictable performance.

This solves four important issues we have seen people running Cassandra
at scale face today:

* Node recovery / node migration:

  If you lose an entire Cassandra node, then your database will
  continue operating normally, as Rok in combination with your
  Container Orchestrator (e.g., Kubernetes) will present another
  Cassandra node. This node will have the data of the latest
  snapshot that resides on the Rok snapshot store. In this case,
  Cassandra only has to recover the changed parts, which is just a
  small fraction of the node data, and does not cause CPU load on
  the whole cluster. Similarly, you can migrate a Cassandra node
  from one physical host to another, without depending on external,
  EBS-like storage.

* Backup and recovery:

  You can use Rok to take a full backup of your whole application,
  along with the DB, as a group-consistent snapshot of its VMs or
  containers, and store it externally. This does not depend on app-
  or Cassandra-specific functionality.

* Data mobility:

  You can synchronize these snapshots to different locations, e.g.,
  across regions or cloud providers, and across administrative
  domains, i.e., share them with others without giving them direct
  access to your Cassandra DB. You can then spawn your entire
  application stack in the new location.

* Testing / analytics:

  Being able to spawn a copy of your Cassandra DB as a thin clone
  means you can have test & dev workflows running in parallel, on
  independent, mutable clones, with real data underneath. Similarly,
  your analytics team can run their lengthy reporting and analytics
  workloads on an independent clone of your transactional DB, on
  completely distinct hardware, or even on a different location.

So far, initial validation of our solution with early adopters shows
significant performance gains at a fraction of the cost of external
storage, while enabling a multi-region setup.

Here are some numbers and a whitepaper to support this:
https://journal.arrikto.com/why-your-cassandra-needs-local-nvme-and-rok-1787b9fc286d
http://arrikto.com/wp-content/uploads/2018/03/20180206-rok_decentralized_storage_for_the_cloud_native_world.pdf

If the above sounds interesting, we are eager to hear from you, learn
about your potential use cases, and include you in our beta test
program.

Thank you,
Vangelis.

-- 
Vangelis Koukis
CTO, Arrikto Inc.
3505 El Camino Real, Palo Alto, CA 94306
www.arrikto.com




Re: Adding disk to operating C*

2018-03-09 Thread Jeff Jirsa
1.5 TB sounds very very conservative - 3-4T is where I set the limit at past 
jobs. Have heard of people doing twice that (6-8T). 

-- 
Jeff Jirsa


> On Mar 8, 2018, at 11:09 PM, Niclas Hedhman  wrote:
> 
> I am curious about the side comment; "Depending on your usecase you may not
> want to have a data density over 1.5 TB per node."
> 
> Why is that? I am planning much bigger than that, and now you give me
> pause...
> 
> 
> Cheers
> Niclas
> 
>> On Wed, Mar 7, 2018 at 6:59 PM, Rahul Singh  
>> wrote:
>> Are you putting both the commitlogs and the Sstables on the adds? Consider 
>> moving your snapshots often if that’s also taking up space. Maybe able to 
>> save some space before you add drives.
>> 
>> You should be able to add these new drives and mount them without an issue. 
>> Try to avoid different number of data dirs across nodes. It makes automation 
>> of operational processes a little harder.
>> 
>> As an aside, Depending on your usecase you may not want to have a data 
>> density over 1.5 TB per node.
>> 
>> --
>> Rahul Singh
>> rahul.si...@anant.us
>> 
>> Anant Corporation
>> 
>>> On Mar 7, 2018, 1:26 AM -0500, Eunsu Kim , wrote:
>>> Hello,
>>> 
>>> I use 5 nodes to create a cluster of Cassandra. (SSD 1TB)
>>> 
>>> I'm trying to mount an additional disk(SSD 1TB) on each node because each 
>>> disk usage growth rate is higher than I expected. Then I will add the the 
>>> directory to data_file_directories in cassanra.yaml
>>> 
>>> Can I get advice from who have experienced this situation?
>>> If we go through the above steps one by one, will we be able to complete 
>>> the upgrade without losing data?
>>> The replication strategy is SimpleStrategy, RF 2.
>>> 
>>> Thank you in advance
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>> 
> 
> 
> 
> -- 
> Niclas Hedhman, Software Developer
> http://zest.apache.org - New Energy for Java


Re: Adding disk to operating C*

2018-03-09 Thread Rahul Singh

Yep. Most of my arguments are the same from seeing it in production. Cassandra
is used for fast writes and generally fast reads, with redundancy and failover,
for OLTP and OLAP. It's not just a bunch of dumb disks. You can throw crap into
S3 or HDFS and analyze / report with Hive or Spark.

You can always have more data density for use cases that are not critical, in a
different cluster or DC.

Rahul

On Mar 9, 2018, 7:25 AM -0500, Kyrylo Lebediev , 
wrote:
> Niclas,
> Here is Jeff's comment regarding this: https://stackoverflow.com/a/31690279
> From: Niclas Hedhman 
> Sent: Friday, March 9, 2018 9:09:53 AM
> To: user@cassandra.apache.org; Rahul Singh
> Subject: Re: Adding disk to operating C*
>
> I am curious about the side comment; "Depending on your usecase you may not
> want to have a data density over 1.5 TB per node."
>
> Why is that? I am planning much bigger than that, and now you give me
> pause...
>
>
> Cheers
> Niclas
>
> On Wed, Mar 7, 2018 at 6:59 PM, Rahul Singh  
> wrote:
> > Are you putting both the commitlogs and the Sstables on the adds? Consider 
> > moving your snapshots often if that’s also taking up space. Maybe able to 
> > save some space before you add drives.
> >
> > You should be able to add these new drives and mount them without an issue. 
> > Try to avoid different number of data dirs across nodes. It makes 
> > automation of operational processes a little harder.
> >
> > As an aside, Depending on your usecase you may not want to have a data 
> > density over 1.5 TB per node.
> >
> > --
> > Rahul Singh
> > rahul.si...@anant.us
> >
> > Anant Corporation
> >
> > On Mar 7, 2018, 1:26 AM -0500, Eunsu Kim , wrote:
> > > Hello,
> > >
> > > I use 5 nodes to create a cluster of Cassandra. (SSD 1TB)
> > >
> > > I'm trying to mount an additional disk(SSD 1TB) on each node because each 
> > > disk usage growth rate is higher than I expected. Then I will add the the 
> > > directory to data_file_directories in cassanra.yaml
> > >
> > > Can I get advice from who have experienced this situation?
> > > If we go through the above steps one by one, will we be able to complete 
> > > the upgrade without losing data?
> > > The replication strategy is SimpleStrategy, RF 2.
> > >
> > > Thank you in advance
> > > -
> > > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > > For additional commands, e-mail: user-h...@cassandra.apache.org
> > >
>
>
>
> --
> Niclas Hedhman, Software Developer
> http://zest.apache.org - New Energy for Java


RE: uneven data movement in one of the disk in Cassandra

2018-03-09 Thread Kenneth Brotman
Yasir,

 

How many nodes are in the cluster?  

What is num_tokens set to in the Cassandra.yaml file?  

Is it just this one node doing this?  

What replication factor do you use that affects the ranges on that disk?

 

Kenneth Brotman

 

From: Kyrylo Lebediev [mailto:kyrylo_lebed...@epam.com] 
Sent: Friday, March 09, 2018 4:14 AM
To: user@cassandra.apache.org
Subject: Re: uneven data movement in one of the disk in Cassandra

 

Not sure where I heard this, but AFAIK data imbalance when multiple
data_directories are in use is a known issue for older versions of
Cassandra. This might be the root-cause of your issue. 

Which version of C* are you using?

Unfortunately, don't remember in which version this imbalance issue was
fixed.

 

-- Kyrill



Re: Adding disk to operating C*

2018-03-09 Thread Kyrylo Lebediev
Niclas,

Here is Jeff's comment regarding this: https://stackoverflow.com/a/31690279


From: Niclas Hedhman 
Sent: Friday, March 9, 2018 9:09:53 AM
To: user@cassandra.apache.org; Rahul Singh
Subject: Re: Adding disk to operating C*

I am curious about the side comment; "Depending on your usecase you may not
want to have a data density over 1.5 TB per node."

Why is that? I am planning much bigger than that, and now you give me
pause...


Cheers
Niclas

On Wed, Mar 7, 2018 at 6:59 PM, Rahul Singh 
> wrote:
Are you putting both the commitlogs and the Sstables on the adds? Consider 
moving your snapshots often if that’s also taking up space. Maybe able to save 
some space before you add drives.

You should be able to add these new drives and mount them without an issue. Try 
to avoid different number of data dirs across nodes. It makes automation of 
operational processes a little harder.

As an aside, Depending on your usecase you may not want to have a data density 
over 1.5 TB per node.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Mar 7, 2018, 1:26 AM -0500, Eunsu Kim 
>, wrote:
Hello,

I use 5 nodes to create a cluster of Cassandra. (SSD 1TB)

I'm trying to mount an additional disk(SSD 1TB) on each node because each disk 
usage growth rate is higher than I expected. Then I will add the the directory 
to data_file_directories in cassanra.yaml

Can I get advice from who have experienced this situation?
If we go through the above steps one by one, will we be able to complete the 
upgrade without losing data?
The replication strategy is SimpleStrategy, RF 2.

Thank you in advance
-
To unsubscribe, e-mail: 
user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: 
user-h...@cassandra.apache.org




--
Niclas Hedhman, Software Developer
http://zest.apache.org - New Energy for Java


Re: uneven data movement in one of the disk in Cassandra

2018-03-09 Thread Kyrylo Lebediev
Not sure where I heard this, but AFAIK data imbalance when multiple 
data_directories are in use is a known issue for older versions of Cassandra. 
This might be the root-cause of your issue.

Which version of C* are you using?

Unfortunately, don't remember in which version this imbalance issue was fixed.


-- Kyrill


From: Yasir Saleem 
Sent: Friday, March 9, 2018 1:34:08 PM
To: user@cassandra.apache.org
Subject: Re: uneven data movement in one of the disk in Cassandra

Hi Alex,

no active compaction, right now.



On Fri, Mar 9, 2018 at 3:47 PM, Oleksandr Shulgin 
> wrote:
On Fri, Mar 9, 2018 at 11:40 AM, Yasir Saleem 
> wrote:
Thanks, Nicolas Guyomar

I am new to cassandra, here is the properties which I can see in yaml file:

# of compaction, including validation compaction.
compaction_throughput_mb_per_sec: 16
compaction_large_partition_warning_threshold_mb: 100

To check currently active compaction please use this command:

nodetool compactionstats -H

on the host which shows the problem.

--
Alex




Re: Amazon Time Sync Service + ntpd vs chrony

2018-03-09 Thread Kyrylo Lebediev
Thank you to all who replied so far,  thank you Ben for the links you provided!
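(For reference, a minimal sketch of pointing chrony at the Amazon Time Sync
Service link-local endpoint on an EC2 node and then verifying it. Package names,
service names and config paths vary by distro, so treat this as an outline:)

echo "server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4" | sudo tee -a /etc/chrony.conf
sudo systemctl restart chronyd
chronyc sources -v      # 169.254.169.123 should end up as the selected (^*) source
chronyc tracking        # current offset and drift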


From: Ben Slater 
Sent: Friday, March 9, 2018 12:12:09 AM
To: user@cassandra.apache.org
Subject: Re: Amazon Time Sync Service + ntpd vs chrony

It is important to make sure you are using the same NTP servers across your 
cluster - we used to see relatively frequent NTP issues across our fleet using 
default/public NTP servers until (back in 2015) we implemented our own NTP pool 
(see https://www.instaclustr.com/apache-cassandra-synchronization/ which 
references some really good and detailed posts from 
logentries.com on the potential issues).

Cheers
Ben

On Fri, 9 Mar 2018 at 02:07 Michael Shuler 
> wrote:
As long as your nodes are syncing time using the same method, that
should be good. Don't mix daemons, however, since they may sync from
different sources. Whether you use ntpd, openntp, ntpsec, chrony isn't
really important, since they are all just background daemons to sync the
system clock. There is nothing Cassandra-specific.

--
Kind regards,
Michael

On 03/08/2018 04:15 AM, Kyrylo Lebediev wrote:
> Hi!
>
> Recently Amazon announced launch of Amazon Time Sync Service
> (https://aws.amazon.com/blogs/aws/keeping-time-with-amazon-time-sync-service/)
> and now it's AWS-recommended way for time sync on EC2 instances
> (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html).
> It's stated there that chrony is faster / more precise than ntpd.
>
> Needless to say, a correct time sync configuration is very important for any
> C* setup.
>
> Does anybody have positive experience using chrony, the Amazon Time Sync
> Service with Cassandra, and/or the combination of them?
> Any concerns regarding chrony + Amazon Time Sync Service + Cassandra?
> Are there any chrony best-practices/custom settings for C* setups?
>
> Thanks,
> Kyrill
>


-
To unsubscribe, e-mail: 
user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: 
user-h...@cassandra.apache.org

--

Ben Slater
Chief Product Officer

Read our latest technical blog posts here.

This email has been sent on behalf of Instaclustr Pty. Limited (Australia) and 
Instaclustr Inc (USA).

This email and any attachments may contain confidential and legally privileged 
information.  If you are not the intended recipient, do not copy or disclose 
its content, but please reply to this email immediately and highlight the error 
to the sender and then immediately delete the message.


Re: uneven data movement in one of the disk in Cassandra

2018-03-09 Thread Yasir Saleem
Hi Alex,

no active compaction, right now.




On Fri, Mar 9, 2018 at 3:47 PM, Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Fri, Mar 9, 2018 at 11:40 AM, Yasir Saleem 
> wrote:
>
>> Thanks, Nicolas Guyomar
>>
>> I am new to cassandra, here is the properties which I can see in yaml
>> file:
>>
>> # of compaction, including validation compaction.
>> compaction_throughput_mb_per_sec: 16
>> compaction_large_partition_warning_threshold_mb: 100
>>
>
> To check currently active compaction please use this command:
>
> nodetool compactionstats -H
>
> on the host which shows the problem.
>
> --
> Alex
>
>


Re: uneven data movement in one of the disk in Cassandra

2018-03-09 Thread Oleksandr Shulgin
On Fri, Mar 9, 2018 at 11:40 AM, Yasir Saleem 
wrote:

> Thanks, Nicolas Guyomar
>
> I am new to cassandra, here is the properties which I can see in yaml file:
>
> # of compaction, including validation compaction.
> compaction_throughput_mb_per_sec: 16
> compaction_large_partition_warning_threshold_mb: 100
>

To check currently active compaction please use this command:

nodetool compactionstats -H

on the host which shows the problem.

--
Alex


Re: uneven data movement in one of the disk in Cassandra

2018-03-09 Thread Yasir Saleem
Thanks, Nicolas Guyomar

I am new to cassandra, here is the properties which I can see in yaml file:

# of compaction, including validation compaction.
compaction_throughput_mb_per_sec: 16
compaction_large_partition_warning_threshold_mb: 100



On Fri, Mar 9, 2018 at 3:33 PM, Nicolas Guyomar 
wrote:

> Hi,
>
> This might be a compaction which is running, have you check that ?
>
> On 9 March 2018 at 11:29, Yasir Saleem  wrote:
>
>> Hi Team,
>>
>>   we are facing issue of uneven data movement in cassandra disk for
>> specific which disk03 in our case, however all the disk are consuming
>> around 60% of space but disk03 is taking 87% space. Here is configuration
>> in yaml and current disk space:
>>
>> data_file_directories:
>> - /data/disk01/cassandra/data_prod/data
>> - /data/disk02/cassandra/data_prod/data
>> - /data/disk03/cassandra/data_prod/data
>> - /data/disk04/cassandra/data_prod/data
>> - /data/disk05/cassandra/data_prod/data
>>
>> disk space:
>>
>> 734G  417G  280G  60% /data/disk02
>> 734G  342G  355G  50% /data/disk05
>> 734G  383G  314G  55% /data/disk04
>> *734G  599G   98G  87% /data/disk03*
>> 734G  499G  198G  60% /data/disk01
>>
>> Please note that we have tried to delete data several times but still
>> space is continuously increasing in disk03. Please let me know if there is
>> any workaround to resolve this issue.
>>
>> Regards,
>>
>> Yasir.
>>
>>
>


Re: uneven data movement in one of the disk in Cassandra

2018-03-09 Thread Nicolas Guyomar
Hi,

This might be a compaction which is running, have you check that ?

On 9 March 2018 at 11:29, Yasir Saleem  wrote:

> Hi Team,
>
>   we are facing issue of uneven data movement in cassandra disk for
> specific which disk03 in our case, however all the disk are consuming
> around 60% of space but disk03 is taking 87% space. Here is configuration
> in yaml and current disk space:
>
> data_file_directories:
> - /data/disk01/cassandra/data_prod/data
> - /data/disk02/cassandra/data_prod/data
> - /data/disk03/cassandra/data_prod/data
> - /data/disk04/cassandra/data_prod/data
> - /data/disk05/cassandra/data_prod/data
>
> disk space:
>
> 734G  417G  280G  60% /data/disk02
> 734G  342G  355G  50% /data/disk05
> 734G  383G  314G  55% /data/disk04
> *734G  599G   98G  87% /data/disk03*
> 734G  499G  198G  60% /data/disk01
>
> Please note that we have tried to delete data several times but still
> space is continuously increasing in disk03. Please let me know if there is
> any workaround to resolve this issue.
>
> Regards,
>
> Yasir.
>
>


uneven data movement in one of the disk in Cassandra

2018-03-09 Thread Yasir Saleem
Hi Team,

  We are facing an issue of uneven data distribution across the Cassandra disks,
specifically disk03 in our case: while the other disks are around 50-60% full,
disk03 is at 87%. Here is the configuration in the yaml and the current disk
space:

data_file_directories:
- /data/disk01/cassandra/data_prod/data
- /data/disk02/cassandra/data_prod/data
- /data/disk03/cassandra/data_prod/data
- /data/disk04/cassandra/data_prod/data
- /data/disk05/cassandra/data_prod/data

disk space:

734G  417G  280G  60% /data/disk02
734G  342G  355G  50% /data/disk05
734G  383G  314G  55% /data/disk04
*734G  599G   98G  87% /data/disk03*
734G  499G  198G  60% /data/disk01

Please note that we have tried to delete data several times, but the used space
on disk03 still keeps increasing. Please let me know if there is any workaround
to resolve this issue.
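(For reference, a rough first set of checks for which table directories and
snapshots hold the space on disk03, and whether a compaction is currently writing
there; paths follow the layout above:)

du -sh /data/disk03/cassandra/data_prod/data/*/* | sort -h | tail -20
du -sh /data/disk03/cassandra/data_prod/data/*/*/snapshots 2>/dev/null | sort -h | tail
nodetool compactionstats -H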

Regards,

Yasir.


Re: Removing initial_token parameter

2018-03-09 Thread kurt greaves
Correct: the tokens will be stored in the node's system tables after the first
boot, so feel free to remove them (although it's not really necessary).
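(A quick way to confirm a node's tokens are already persisted locally before
dropping initial_token from cassandra.yaml; both commands read what the node has
stored:)

cqlsh -e "SELECT tokens FROM system.local;"
nodetool info --tokens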

On 9 Mar. 2018 20:16, "Mikhail Tsaplin"  wrote:

> Is it safe to remove initial_token parameter on a cluster created by
> snapshot restore procedure presented here
> https://docs.datastax.com/en/cassandra/latest/cassandra/operations/opsSnapshotRestoreNewCluster.html ?
>
> For me, it seems that the initial_token parameter is used only when a node is
> started for the first time; on later restarts Cassandra obtains its tokens from
> its internal tables, so the absence of the initial_token parameter would not
> affect it.
>
>


Removing initial_token parameter

2018-03-09 Thread Mikhail Tsaplin
Is it safe to remove the initial_token parameter on a cluster created by the
snapshot restore procedure presented here:
https://docs.datastax.com/en/cassandra/latest/cassandra/operations/opsSnapshotRestoreNewCluster.html
?

For me, it seems that the initial_token parameter is used only when a node is
started for the first time; on later restarts Cassandra obtains its tokens from
its internal tables, so the absence of the initial_token parameter would not
affect it.


Re: Joining a cluster of nodes having multi valued initial_token parameters.

2018-03-09 Thread Mikhail Tsaplin
I suspect that cluster was created by recovering from a snapshot.

PS.
I asked a related question on this mailing list. Please check
subject: Removing initial_token parameter.

2018-03-08 20:02 GMT+07:00 Oleksandr Shulgin :

> On Thu, Mar 8, 2018 at 1:41 PM, Mikhail Tsaplin 
> wrote:
>
>> Thank you for the answer; are you sure that it is at least safe?
>>
>
> I would test in a lab first of course, but I don't see why it should be a
> problem.  I wonder more why did you have tokens listed explicitly on the
> existing nodes if they are randomly generated?
>
>
>> As I understand I will have to specify auto_bootstrap=true too?
>>
>
> Sure.  Set it to true or remove from configuration file altogether.
>
> --
> Alex
>
>