Re: Problem with restoring a snapshot using sstableloader

2018-12-02 Thread Alex Ott
It's a bug in sstableloader introduced many years ago - before that, it
worked as described in the documentation...
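The effect is that sstableloader derives the keyspace and table names from the
directory layout of the given path rather than from the SSTable metadata, so a
.../snapshots/snap1/ path confuses it. A workaround sketch, reusing the paths
from the report below (the /tmp staging directory is made up): copy the
snapshot files into a <keyspace>/<table> directory layout and load from there:

$ mkdir -p /tmp/restore/cass_testapp/table3
$ cp /var/lib/cassandra/data/cass_testapp/table3-7227e480f3b411e8941285913bce94cb/snapshots/snap1/mc-* /tmp/restore/cass_testapp/table3/
$ sstableloader -d localhost /tmp/restore/cass_testapp/table3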

Oliver Herrmann  at "Fri, 30 Nov 2018 17:05:43 +0100" wrote:
 OH> Hi,

 OH> I'm having some problems to restore a snapshot using sstableloader. I'm 
using cassandra 3.11.1 and followed the instructions for
 OH> a creating and restoring from this page:
 OH> 
https://docs.datastax.com/en/dse/6.0/dse-admin/datastax_enterprise/tools/toolsSStables/toolsBulkloader.html
 

 OH> 1. Called nodetool cleanup on each node
 OH> $ nodetool cleanup cass_testapp

 OH> 2. Called nodetool snapshot on each node
 OH> $ nodetool snapshot -t snap1 -kt cass_testapp.table3 

 OH> 3. Checked the data and snapshot folders:
 OH> $ ll 
/var/lib/cassandra/data/cass_testapp/table3-7227e480f3b411e8941285913bce94cb
 OH> drwxr-xr-x 2 cassandra cassandra    6 Nov 29 03:54 backups
 OH> -rw-r--r-- 2 cassandra cassandra   43 Nov 30 10:21 
mc-11-big-CompressionInfo.db
 OH> -rw-r--r-- 2 cassandra cassandra  241 Nov 30 10:21 mc-11-big-Data.db
 OH> -rw-r--r-- 2 cassandra cassandra    9 Nov 30 10:21 mc-11-big-Digest.crc32
 OH> -rw-r--r-- 2 cassandra cassandra   16 Nov 30 10:21 mc-11-big-Filter.db
 OH> -rw-r--r-- 2 cassandra cassandra   21 Nov 30 10:21 mc-11-big-Index.db
 OH> -rw-r--r-- 2 cassandra cassandra 4938 Nov 30 10:21 mc-11-big-Statistics.db
 OH> -rw-r--r-- 2 cassandra cassandra   95 Nov 30 10:21 mc-11-big-Summary.db
 OH> -rw-r--r-- 2 cassandra cassandra   92 Nov 30 10:21 mc-11-big-TOC.txt
 OH> drwxr-xr-x 3 cassandra cassandra   18 Nov 30 10:30 snapshots

 OH> and 

 OH> $ ll 
/var/lib/cassandra/data/cass_testapp/table3-7227e480f3b411e8941285913bce94cb/snapshots/snap1/
 OH> total 44
 OH> -rw-r--r-- 1 cassandra cassandra   32 Nov 30 10:30 manifest.json
 OH> -rw-r--r-- 2 cassandra cassandra   43 Nov 30 10:21 
mc-11-big-CompressionInfo.db
 OH> -rw-r--r-- 2 cassandra cassandra  241 Nov 30 10:21 mc-11-big-Data.db
 OH> -rw-r--r-- 2 cassandra cassandra    9 Nov 30 10:21 mc-11-big-Digest.crc32
 OH> -rw-r--r-- 2 cassandra cassandra   16 Nov 30 10:21 mc-11-big-Filter.db
 OH> -rw-r--r-- 2 cassandra cassandra   21 Nov 30 10:21 mc-11-big-Index.db
 OH> -rw-r--r-- 2 cassandra cassandra 4938 Nov 30 10:21 mc-11-big-Statistics.db
 OH> -rw-r--r-- 2 cassandra cassandra   95 Nov 30 10:21 mc-11-big-Summary.db
 OH> -rw-r--r-- 2 cassandra cassandra   92 Nov 30 10:21 mc-11-big-TOC.txt
 OH> -rw-r--r-- 1 cassandra cassandra 1043 Nov 30 10:30 schema.cql

 OH> 4. Truncated the table
 OH> cqlsh:cass_testapp> TRUNCATE table3 ;

 OH> 5. Tried to restore table3 on one cassandra node
 OH> $ sstableloader -d localhost 
/var/lib/cassandra/data/cass_testapp/table3-7227e480f3b411e8941285913bce94cb/snapshots/snap1/
 OH> Established connection to initial hosts
 OH> Opening sstables and calculating sections to stream
 OH> Skipping file mc-11-big-Data.db: table snapshots.table3 doesn't exist

 OH> Summary statistics: 
 OH>    Connections per host    : 1         
 OH>    Total files transferred : 0         
 OH>    Total bytes transferred : 0.000KiB  
 OH>    Total duration          : 2652 ms   
 OH>    Average transfer rate   : 0.000KiB/s
 OH>    Peak transfer rate      : 0.000KiB/s

 OH> I'm always getting the message "Skipping file mc-11-big-Data.db: table 
snapshots.table3 doesn't exist". I also tried to rename
 OH> the snapshots folder into the keyspace name (cass_testapp) but then I get 
the message "Skipping file mc-11-big-Data.db: table
 OH> snap1.snap1. doesn't exist".

 OH> What am I doing wrong?

 OH> Thanks
 OH> Oliver



-- 
With best wishes,
Alex Ott
Solutions Architect EMEA, DataStax
http://datastax.com/




Re: Understanding output of read/write histogram using opscenter API

2019-06-09 Thread Alex Ott
You can also ask in the #opscenter channel on DataStax Academy Slack:
http://academy.datastax.com/slack

Bhardwaj, Rahul  at "Tue, 4 Jun 2019 11:24:44 +" wrote:
 BR> Hi All,

 BR> Do we have any document to understand output of read/write histogram using 
opscenter API. We need them to ingest it to create
 BR> one of our dashboards. We are facing difficulty in understanding its 
output if we relate it with 5 values like
 BR> max,min,median,90th percentile,etc. Attaching one sample output. we could 
not find the way to get different percentile’s data
 BR> using API for read-histogram and write-histogram. Kindly help or provide 
some doc link related to its explanation.

 BR> Thanks and Regards,

 BR> Rahul Bhardwaj






-- 
With best wishes,
Alex Ott
Solutions Architect EMEA, DataStax
http://datastax.com/




Re: Performance impact with ALLOW FILTERING clause.

2019-08-17 Thread Alex Ott
The Spark connector doesn't do a "select * from table;" - it reads the data by
token ranges
(see
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/partitioner/CassandraPartition.scala#L14)
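Each Spark partition is then fetched with token-range scans of roughly this
shape (illustrative CQL; column list and table name are placeholders):

SELECT ... FROM ks.table
WHERE token(partition_key) > ? AND token(partition_key) <= ?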
 


Jacques-Henri Berthemet  at "Thu, 25 Jul 2019 14:18:57 +" wrote:
 JB> Hi Asad,

 JB> That’s because of the way Spark works. Essentially, when you execute a 
Spark job, it pulls the full content of the datastore (Cassandra
 JB> in your case) in it RDDs and works with it “in memory”. While Spark uses 
“data locality” to read data from the nodes that have the
 JB> required data on its local disks, it’s still reading all data from 
Cassandra tables. To do so it’s sending ‘select * from Table ALLOW
 JB> FILTERING’ query to Cassandra.

 JB> From Spark you don’t have much control on the initial query to fill the 
RDDs, sometimes you’ll read the whole table even if you only
 JB> need one row.

 JB> Regards,

 JB> Jacques-Henri Berthemet

 JB> From: "ZAIDI, ASAD A" 
 JB> Reply to: "user@cassandra.apache.org" 
 JB> Date: Thursday 25 July 2019 at 15:49
 JB> To: "user@cassandra.apache.org" 
 JB> Subject: Performance impact with ALLOW FILTERING clause.

 JB> Hello Folks,

 JB> I was going thru documentation and saw at many places saying ALLOW 
FILTERING causes performance unpredictability.  Our developers says
 JB> ALLOW FILTERING clause is implicitly added on bunch of queries by 
spark-Cassandra  connector and they cannot control it; however at the
 JB> same time we see unpredictability in application performance – just as 
documentation says.  

 JB> I’m trying to understand why would a connector add a clause in query when 
this can cause negative impact on database/application
 JB> performance. Is that data model that is driving connector make its 
decision and add allow filtering to query automatically or if there
 JB> are other reason this clause is added to the code. I’m not a developer 
though I want to know why developer don’t have any control on
 JB> this to happen.

 JB> I’ll appreciate your guidance here.

 JB> Thanks

 JB> Asad



-- 
With best wishes,
Alex Ott
Solutions Architect EMEA, DataStax
http://datastax.com/




Re: Keyspace Clone in Existing Cluster

2019-10-29 Thread Alex Ott
You can create all the tables in the new keyspace, copy the SSTables from the
1.0 tables to the 2.0 tables, & run nodetool refresh on the tables in KS 2.0
to tell Cassandra about them.
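A rough per-table sketch of that flow (keyspace/table names and the table IDs
are illustrative; it has to be repeated on every node, since nodetool refresh
only picks up files placed in the local data directory):

$ cqlsh -e "CREATE TABLE ks_v2.table1 (id uuid PRIMARY KEY, value text);"  # same schema as ks_v1.table1
$ cp /var/lib/cassandra/data/ks_v1/table1-<old-id>/*.db \
     /var/lib/cassandra/data/ks_v2/table1-<new-id>/
$ nodetool refresh ks_v2 table1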

On Tue, Oct 29, 2019 at 4:10 PM Ankit Gadhiya 
wrote:

> Hello Folks,
>
> Greetings!.
>
> I've a requirement in my project to setup Blue-Green deployment for
> Cassandra. E.x. Say My current active schema (application pointing to) is
> Keyspace V1.0 and for my next release I want to setup Keysapce 2.0 (with
> some structural changes) and all testing/validation would happen on it and
> once successful , App would switch connection to keyspace 2.0 - This would
> be generic release deployment for our project.
>
> One of the approach we thought of would be to Create keyspace 2.0 as clone
> from Keyspace 1.0 including data using sstableloader but this would be time
> consuming, also being a multi-node cluster (6+6 in each DC) - it wouldn't
> be very feasible to do this manually on all the nodes for multiple tables
> part of that keyspace. Was wondering if we have any other creative way to
> suffice this requirement.
>
> Appreciate your time on this.
>
>
> *Thanks & Regards,*
> *Ankit Gadhiya*
>
>

-- 
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: COPY command with where condition

2020-01-20 Thread Alex Ott
 don't need to export...
>>
>>
>>
>> --
>> *De :* adrien ruffie 
>> *Envoyé :* vendredi 17 janvier 2020 11:39
>> *À :* Erick Ramirez ; user@cassandra.apache.org <
>> user@cassandra.apache.org>
>> *Objet :* RE: COPY command with where condition
>>
>> Thank a lot !
>> It's a good news for DSBulk ! I will take a look around this solution.
>>
>> best regards,
>> Adrian
>> --
>> *De :* Erick Ramirez 
>> *Envoyé :* vendredi 17 janvier 2020 10:02
>> *À :* user@cassandra.apache.org 
>> *Objet :* Re: COPY command with where condition
>>
>> The COPY command doesn't support filtering and it doesn't perform well
>> for large tables.
>>
>> Have you considered the DSBulk tool from DataStax? Previously, it only
>> worked with DataStax Enterprise but a few weeks ago, it was made free and
>> works with open-source Apache Cassandra. For details, see this blogpost
>> <https://www.datastax.com/blog/2019/12/tools-for-apache-cassandra>.
>> Cheers!
>>
>> On Fri, Jan 17, 2020 at 6:57 PM adrien ruffie 
>> wrote:
>>
>> Hello all,
>>
>> In my company we want to export a big dataset of our cassandra's ring.
>> We search to use COPY command but I don't find if and how can a WHERE
>> condition can be use ?
>>
>> Because we need to export only several data which must be return by a
>> WHERE closure, specially
>> and unfortunately with ALLOW FILTERING due to several old tables which
>> were poorly conceptualized...
>>
>> Do you know a means to do that please ?
>>
>> Thank all and best regards
>>
>> Adrian
>>
>>
>>

-- 
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: How to know execute_async correctly?

2020-01-05 Thread Alex Ott
Hi

There are several things here:

1. When you execute a query via execute_async, you get a ResponseFuture
   instead of a ResultSet.  You can then call .result() on it to obtain the
   results of the execution, or the error.  Another possibility is to
   attach callbacks to that future object (see the sketch below).  See
   https://docs.datastax.com/en/developer/python-driver/3.20/getting_started/#asynchronous-queries
   for more details & examples.

2. You need to be very careful with batches - if you just put unrelated
   queries into them, you make Cassandra slower, not faster. Please read
   https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/useBatch.html to
   understand when you can use them effectively.


I recommend first going through the "Developing applications with DataStax
drivers" guide
(https://docs.datastax.com/en/devapp/doc/devapp/aboutDrivers.html) to get an
understanding of how to work with Cassandra using the drivers.
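A minimal callback sketch, assuming an existing `session` (the query text is
just an example):

def on_success(rows):
    print("query succeeded:", rows)

def on_error(exc):
    print("query failed:", exc)

future = session.execute_async("SELECT release_version FROM system.local")
future.add_callbacks(on_success, on_error)
# ... do other work; the callbacks fire when the response arrives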

lampahome  at "Thu, 12 Dec 2019 12:17:42 +0800" wrote:
 l> I tried to execute async by batch in python-driver. But I don't know how to 
check query executing correctly.

 l> Code is like below:
 l> B = BatchStatement()
 l> for x in xxx:
 l>     B.add(query, (args))
 l> res = session.execute_async(B)
 l> B.clear() # for reusing
 l> r = res.result()
 l> ## Then how to know my query works correctly? print(r)?

 l> I found no doc about my question in the page of ResultSet.
 l> Can anyone explain?

 l> thx



-- 
With best wishes,
Alex Ott
Principal Architect, DataStax
http://datastax.com/




Re: Disabling Swap for Cassandra

2020-04-17 Thread Alex Ott
I usually recommend the following document:
https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/config/configRecommendedSettings.html
- it's about DSE, but applicable to OSS Cassandra as well...
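For reference, the swap part of those recommendations boils down to:

$ sudo swapoff --all
# and remove any swap entries from /etc/fstab so it stays disabled after reboot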

Kunal  at "Thu, 16 Apr 2020 15:49:35 -0700" wrote:
 K> Hello,

 K>  

 K> I need some suggestion from you all. I am new to Cassandra and was reading 
Cassandra best practices. On one document, it was
 K> mentioned that Cassandra should not be using swap, it degrades the 
performance.

 K> My question is instead of disabling swap system wide, can we force 
Cassandra not to use swap? Some documentation suggests to use
 K> memory_locking_policy in cassandra.yaml.

 K> How do I check if our Cassandra already has this parameter and still uses 
swap ? Is there any way i can check this. I already
 K> checked cassandra.yaml and dont see this parameter. Is there any other 
place i can check and confirm?

 K> Also, Can I set memlock parameter to unlimited (64kB default), so entire 
Heap (Xms = Xmx) can be locked at node startup ? Will that
 K> help?

 K> Or if you have any other suggestions, please let me know.

 K>  

 K>  

 K> Regards,

 K> Kunal

 K>  



-- 
With best wishes,
Alex Ott
Principal Architect, DataStax
http://datastax.com/




Re: How to find which table partitions having the more reads per sstables ?

2020-03-16 Thread Alex Ott
There is also nodetool toppartitions: 
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/nodetool/toolsToppartitions.html
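For example, to sample the hottest partitions of one table for 10 seconds
(keyspace/table are placeholders, the duration is in milliseconds):

$ nodetool toppartitions <keyspace> <table> 10000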

Erick Ramirez  at "Mon, 16 Mar 2020 22:44:44 +1100" wrote:
 ER> How to find which table partitions having the more reads per sstables 
in Cassandra?

 ER> Your question is unclear. Do you want to know which tables are read the 
most? If so, you'll need to run nodetool tablestats and parse/sort
 ER> the output to get the top tables based on read count.

 ER> But if you want to know which are the "hottest" partitions, you'll need to 
have audit logging enabled to catch the incoming CQL. See Audit
 ER> Logging if your cluster is running OSS C*. If your cluster is running 
DataStax Enterprise, see Database auditing for details. Cheers!

 ER> GOT QUESTIONS? Apache Cassandra experts from the community and DataStax 
have answers! Share your expertise on 
 ER> https://community.datastax.com/.



-- 
With best wishes,
Alex Ott
Principal Architect, DataStax
http://datastax.com/




Re: OOM only on one datacenter nodes

2020-04-05 Thread Alex Ott
Have you set -Xmx32g? In that case you may get significantly less usable
memory because of the switch to 64-bit object references.  See
http://java-performance.info/over-32g-heap-java/ for details, and set the
heap slightly below 32GB.
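I.e., in conf/jvm.options (31g being the "slightly below 32GB" choice):

-Xms31g
-Xmx31g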

Reid Pinchback  at "Sun, 5 Apr 2020 00:50:43 +" wrote:
 RP> Surbi:

 RP> If you aren’t seeing connection activity in DC2, I’d check to see if the 
operations hitting DC1 are quorum ops instead of local quorum.  That
 RP> still wouldn’t explain DC2 nodes going down, but would at least explain 
them doing more work than might be on your radar right now.

 RP> The hint replay being slow to me sounds like you could be fighting GC.

 RP> You mentioned bumping the DC2 nodes to 32gb.  You might have already been 
doing this, but if not, be sure to be under 32gb, like 31gb. 
 RP> Otherwise you’re using larger object pointers and could actually have less 
effective ability to allocate memory.

 RP> As the problem is only happening in DC2, then there has to be a thing that 
is true in DC2 that isn’t true in DC1.  A difference in hardware, a
 RP> difference in O/S version, a difference in networking config or physical 
infrastructure, a difference in client-triggered activity, or a
 RP> difference in how repairs are handled. Somewhere, there is a difference.  
I’d start with focusing on that.

 RP> From: Erick Ramirez 
 RP> Reply-To: "user@cassandra.apache.org" 
 RP> Date: Saturday, April 4, 2020 at 8:28 PM
 RP> To: "user@cassandra.apache.org" 
 RP> Subject: Re: OOM only on one datacenter nodes

 RP> Message from External Sender

 RP> With a lack of heapdump for you to analyse, my hypothesis is that your DC2 
nodes are taking on traffic (from some client somewhere) but you're
 RP> just not aware of it. The hints replay is just a side-effect of the nodes 
getting overloaded.

 RP> To rule out my hypothesis in the first instance, my recommendation is to 
monitor the incoming connections to the nodes in DC2. If you don't
 RP> have monitoring in place, you could simply run netstat at regular 
intervals and go from there. Cheers!

 RP> GOT QUESTIONS? Apache Cassandra experts from the community and DataStax 
have answers! Share your expertise on https://community.datastax.com/.



-- 
With best wishes,
Alex Ott
Principal Architect, DataStax
http://datastax.com/




Re: Query data through python using IN clause

2020-04-02 Thread Alex Ott
Hi

Working code is below, but first a warning: prefer not to use IN on partition
keys. With multiple partition key values, the coordinator node has to query
the other hosts that hold those partitions, which slows down the operation and
adds load to the coordinator. If you instead execute one query per combination
of pk1 & pk2 in parallel (using async) and consolidate the data application
side, this can be faster than a query with IN - see the sketch after the
answer below.

Answer:

You need to pass list as value of temp - IN expects list there...

query = session.prepare("select * from test.table1 where pk1 IN ? and pk2=0 and ck1 > ? AND ck1 < ?;")
temp = [1, 2, 3]

import dateutil.parser

ck1 = dateutil.parser.parse('2020-01-01T00:00:00Z')
ck2 = dateutil.parser.parse('2021-01-01T00:00:00Z')

rows = session.execute(query, (temp, ck1, ck2))
for row in rows:
    print(row)
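And a minimal sketch of the parallel alternative mentioned above, reusing
temp, ck1 & ck2 from the answer:

futures = [session.execute_async(
               "select * from test.table1 where pk1 = %s and pk2 = 0 and ck1 > %s and ck1 < %s",
               (pk1, ck1, ck2))
           for pk1 in temp]
# consolidate on the application side
rows = [row for f in futures for row in f.result()]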




Nitan Kainth  at "Wed, 1 Apr 2020 18:21:54 -0500" wrote:
 NK> Hi There,

 NK> I am trying to read data from table as below structure:

 NK> table1(
 NK> pk1 bigint,
 NK> pk2 bigint,
 NK> ck1 timestamp,
 NK> value text,
 NK> primary key((pk1,pk2),ck1);

 NK> query = session.prepare("select * from table1 where pk IN ? and pk2=0 and 
ck1 > ? AND ck1 < ?;")

 NK> temp = 1,2,3

 NK> runq = session.execute(query2, (temp,ck1, ck1))

 NK> TypeError: Received an argument of invalid type for column "in(bam_user)". 
Expected: , Got:
 NK> ; (cannot convert argument to 
integer)

 NK> I found examples for prepared statements for inserts but couldn't find any 
for select and not able to make it to work. 

 NK> Any suggestions?



-- 
With best wishes,
Alex Ott
Principal Architect, DataStax
http://datastax.com/




Re: Issues, understanding how CQL works

2020-04-22 Thread Alex Ott
;
 >>>>    SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime ASC
 >>>> And you can do
 >>>>
 >>>>    SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime DESC
 >>>>
 >>>> And you can do ranges:
 >>>>
 >>>>    SELECT ... WHERE signalid=? and monthyear=? AND fromtime >= ? ORDER BY
 >>>> fromtime DESC
 >>>>
 >>>> But you have to work within the boundaries of how the data is stored.
 >>>> It's stored grouped by signalid+monthyear, and then sorted by fromtime,
 >>>> and then sorted by totime.
 >>>>
 >>>>
 >>>>
 >>>>  So, after some trial and error and a lot of Googling, I learned that 
 >>>> I
 >>>>  must include all rows from the PRIMARY KEY from left to right in my
 >>>>  query. Thus, this is the "best" I can get to work:
 >>>>
 >>>>
 >>>>  SELECT
 >>>>   *
 >>>>  FROM
 >>>>   "tagdata.central"
 >>>>  WHERE
 >>>>   "signalid" = 4002
 >>>>   AND "monthyear" = 201908
 >>>>  ORDER BY
 >>>>   "fromtime" DESC
 >>>>  LIMIT 10;
 >>>>
 >>>>
 >>>>  The "monthyear" column, I crafted like a fool by incrementing the 
 >>>> date
 >>>>  one month after another until no results could be found anymore.
 >>>>  The "signalid" I grabbed from one of the unrestricted "SELECT * 
 >>>> FROM" -
 >>>>  query results. But these can't be as easily guessed as the 
 >>>> "monthyear"
 >>>>  values could.
 >>>>
 >>>>  This is where I'm stuck!
 >>>>
 >>>>  1. This does not really feel like the ideal way to go. I think there 
 >>>> is
 >>>>  something more mature in modern IT systems. Can anyone tell me what
 >>>>  is a
 >>>>  better way to get these informations?
 >>>>
 >>>>
 >>>> You can denormalize. Because cassandra allows you to have very large
 >>>> clusters, you can make multiple tables sorted in different ways to
 >>>> enable the queries you need to run. Normal data modeling is to build
 >>>> tables based on the SELECT statements you need to do (unless you're very
 >>>> advanced, in which case you do it based on the transaction semantics of
 >>>> the INSERT/UPDATE statements, but that's probably not you).
 >>>>
 >>>> Or you can use a more flexible database.
 >>>>
 >>>>
 >>>>  2. I need a way to learn all values that are in the "monthyear" and
 >>>>  "signalid" columns in order to be able to craft that query.
 >>>>  How can I achieve that in a reasonable way? As I said: The DB is 
 >>>> round
 >>>>  about 260 GB which makes it next to impossible to just "have a look" 
 >>>> at
 >>>>  the output of "SELECT *"..
 >>>>
 >>>>
 >>>> You probably want to keep another table of monthyear + signalid pairs.
 >>>



-- 
With best wishes,
Alex Ott
Principal Architect, DataStax
http://datastax.com/




Re: Issues, understanding how CQL works

2020-04-22 Thread Alex Ott
> What your primary key REALLY MEANS is:
> >
> > The database on reads and writes will hash(signalid+monthyear) to find
> > which hosts have the data, then
> >
> > In each data file, the data for a given (signalid,monthyear) is stored
> > sorted by fromtime and totime
> >
> > The database is already of round about 260 GB in size.
> > I now need to know what is the most recent entry in it; the correct
> > column to learn this would be "insertdate".
> >
> > In SQL I would do something like this:
> >
> > SELECT insertdate FROM tagdata.central
> > ORDER BY insertdate DESC LIMIT 1;
> >
> > In CQL, however, I just can't get it to work.
> >
> > What I have tried already is this:
> >
> > SELECT insertdate FROM "tagdata.central"
> > ORDER BY insertdate DESC LIMIT 1;
> >
> >
> > Because you didnt provide a signalid and monthyear, it doesn't know
> > which machine in your cluster to use to start the query.
> >
> >
> > But this gives me an error:
> > ERROR: ORDER BY is only supported when the partition key is
> restricted
> > by an EQ or an IN.
> >
> >
> > Because it's designed for potentially petabytes of data per cluster, it
> > doesn't believe you really want to walk all the data and order ALL of
> > it. Instead, it assumes that when you need to use an ORDER BY, you're
> > going to have some very small piece of data - confined to a single
> > signalid/monthyear pair. And even then, the ORDER is going to assume
> > that you're ordering it by the ordering keys you've defined - fromtime
> > first, and then totime.
> >
> > So you can do
> >
> >   SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime ASC
> > And you can do
> >
> >   SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime DESC
> >
> > And you can do ranges:
> >
> >   SELECT ... WHERE signalid=? and monthyear=? AND fromtime >= ? ORDER BY
> > fromtime DESC
> >
> > But you have to work within the boundaries of how the data is stored.
> > It's stored grouped by signalid+monthyear, and then sorted by fromtime,
> > and then sorted by totime.
> >
> >
> >
> > So, after some trial and error and a lot of Googling, I learned that
> I
> > must include all rows from the PRIMARY KEY from left to right in my
> > query. Thus, this is the "best" I can get to work:
> >
> >
> > SELECT
> >  *
> > FROM
> >  "tagdata.central"
> > WHERE
> >  "signalid" = 4002
> >  AND "monthyear" = 201908
> > ORDER BY
> >  "fromtime" DESC
> > LIMIT 10;
> >
> >
> > The "monthyear" column, I crafted like a fool by incrementing the
> date
> > one month after another until no results could be found anymore.
> > The "signalid" I grabbed from one of the unrestricted "SELECT *
> FROM" -
> > query results. But these can't be as easily guessed as the
> "monthyear"
> > values could.
> >
> > This is where I'm stuck!
> >
> > 1. This does not really feel like the ideal way to go. I think there
> is
> > something more mature in modern IT systems. Can anyone tell me what
> > is a
> > better way to get these informations?
> >
> >
> > You can denormalize. Because cassandra allows you to have very large
> > clusters, you can make multiple tables sorted in different ways to
> > enable the queries you need to run. Normal data modeling is to build
> > tables based on the SELECT statements you need to do (unless you're very
> > advanced, in which case you do it based on the transaction semantics of
> > the INSERT/UPDATE statements, but that's probably not you).
> >
> > Or you can use a more flexible database.
> >
> >
> > 2. I need a way to learn all values that are in the "monthyear" and
> > "signalid" columns in order to be able to craft that query.
> > How can I achieve that in a reasonable way? As I said: The DB is
> round
> > about 260 GB which makes it next to impossible to just "have a look"
> at
> > the output of "SELECT *"..
> >
> >
> > You probably want to keep another table of monthyear + signalid pairs.
>


-- 
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: Tool for schema upgrades

2020-10-08 Thread Alex Ott
Hi

Look at https://github.com/patka/cassandra-migration - it should be good.

P.S. Here is the list of tools that I assembled over the years:

   - [ ] https://github.com/hhandoko/cassandra-migration
   - [ ] https://github.com/Contrast-Security-OSS/cassandra-migration
   - [ ] https://github.com/juxt/joplin
   - [ ] https://github.com/o19s/trireme
   - [ ] https://github.com/golang-migrate/migrate
   - [ ] https://github.com/Cobliteam/cassandra-migrate
   - [ ] https://github.com/patka/cassandra-migration
   - [ ] https://github.com/comeara/pillar

On Thu, Oct 8, 2020 at 5:45 PM Paul Chandler  wrote:

> Hi all,
>
> Can anyone recommend a tool to perform schema DDL upgrades, that follows
> best practice to ensure you don’t get schema mismatches if running multiple
> upgrade statements in one migration ?
>
> Thanks
>
> Paul
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>

-- 
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: tombstones - however there are no deletes

2020-08-21 Thread Alex Ott
Btw, if you see a number of tombstones that is a multiple of the number of
scanned rows, like in your case - that's an explicit signal of either null
inserts or non-frozen collections...

On Fri 21. Aug 2020 at 20:21, Attila Wind  wrote:

> right! silly me (regarding "can't have null for clustering column") :-)
>
> OK, code is modified, we stopped using NULL on that column. In a few days
> we will see if this was the cause.
>
> Thanks for the useful info everyone! Helped a lot!
>
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
>
>
> On 21.08.2020 11:04, Alex Ott wrote:
>
> inserting null for any column will generate the tombstone (and you can't
> have null for clustering column, except case when it's an empty partition
> with static column).
>
> if you're really inserting the new data, not overwriting existing one -
> use UNSET instead of null
>
>
> On Fri, Aug 21, 2020 at 10:45 AM Attila Wind  wrote:
>
>> Thanks a lot! I will process all the pointers you gave - appreciated!
>>
>> 1. we do have a collection column in that table but that is (we have only
>> 1 column) a frozen Map - so I guess "Tombstones are also implicitly
>> created any time you insert or update a row which has an (unfrozen)
>> collection column: list<>, map<> or set<>.  This has to be done in order
>> to ensure the new write replaces any existing collection entries." does
>> not really apply here
>>
>> 2. "Isn’t it so that explicitly setting a column to NULL also result in a
>> tombstone"
>> Is this true for all columns? or just clustering key cols?
>> Because if for all cols (which would make sense maybe to me more) then we
>> found the possible reason.. :-)
>> As we do have an Integer column there which is actually NULL often (and
>> so far in all cases)
>>
>> Attila Wind
>>
>> http://www.linkedin.com/in/attilaw
>> Mobile: +49 176 43556932
>>
>>
>> On 21.08.2020 09:49, Oleksandr Shulgin wrote:
>>
>> On Fri, Aug 21, 2020 at 9:43 AM Tobias Eriksson  wrote:
>>
>>> Isn’t it so that explicitly setting a column to NULL also result in a
>>> tombstone
>>>
>>
>> True, thanks for pointing that out!
>>
>>> Then as mentioned the use of list,set,map can also result in tombstones
>>> See
>>> https://www.instaclustr.com/cassandra-collections-hidden-tombstones-and-how-to-avoid-them/
>>>
>>
>> And A. Ott has already mentioned both these possible reasons :-)
>>
>> --
>> Alex
>>

-- 
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: tombstones - however there are no deletes

2020-08-21 Thread Alex Ott
Inserting null for any column will generate a tombstone (and you can't
have null for a clustering column, except the case when it's an empty
partition with a static column).
If you're really inserting new data, not overwriting existing data - use
UNSET instead of null
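A minimal sketch with the Python driver (table and column names are invented;
other drivers have an equivalent "unset" concept):

from cassandra.query import UNSET_VALUE

pstmt = session.prepare("INSERT INTO ks.tbl (pk, c1, c2) VALUES (?, ?, ?)")
# c2 is left untouched - no tombstone, unlike binding None here
session.execute(pstmt, (1, 'some value', UNSET_VALUE))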

On Fri, Aug 21, 2020 at 10:45 AM Attila Wind  wrote:

> Thanks a lot! I will process all the pointers you gave - appreciated!
>
> 1. we do have collection column in that table but that is (we have only 1
> column) a frozen Map - so I guess "Tombstones are also implicitly created
> any time you insert or update a row which has an (unfrozen) collection
> column: list<>, map<> or set<>.  This has to be done in order to ensure the
> new write replaces any existing collection entries." does not really apply
> here
>
> 2. "Isn’t it so that explicitly setting a column to NULL also result in a
> tombstone"
> Is this true for all columns? or just clustering key cols?
> Because if for all cols (which would make sense maybe to me more) then we
> found the possible reason.. :-)
> As we do have an Integer column there which is actually NULL often (and so
> far in all cases)
>
>
>  Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
>
>
> On 21.08.2020 09:49, Oleksandr Shulgin wrote:
>
> On Fri, Aug 21, 2020 at 9:43 AM Tobias Eriksson <
> tobias.eriks...@qvantel.com> wrote:
>
>> Isn’t it so that explicitly setting a column to NULL also result in a
>> tombstone
>>
>
> True, thanks for pointing that out!
>
> Then as mentioned the use of list,set,map can also result in tombstones
>>
>> See
>> https://www.instaclustr.com/cassandra-collections-hidden-tombstones-and-how-to-avoid-them/
>>
>
> And A. Ott has already mentioned both these possible reasons :-)
>
> --
> Alex
>
>

-- 
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: tombstones - however there are no deletes

2020-08-21 Thread Alex Ott
Tombstones are not generated only by deletes; they also appear when (see the
example below):

   - An insert or full update of a non-frozen collection occurs, such as
   replacing the value of the column with another value in an UPDATE table
   SET field = new_value …: Cassandra inserts a tombstone marker to prevent
   possible overlap with previous data, even if the data did not previously
   exist. A large number of tombstones can significantly affect read
   performance.
   - You insert null explicitly, instead of using UNSET for missing data.
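The first case is easy to demonstrate (table and column names are invented):
a full update of a non-frozen map writes a range tombstone before the new
value, while updating individual elements does not:

cqlsh> CREATE TABLE ks.t (pk int PRIMARY KEY, m map<text, text>);
cqlsh> UPDATE ks.t SET m = {'a': '1'} WHERE pk = 1;   -- implicit tombstone + new map
cqlsh> UPDATE ks.t SET m['a'] = '1' WHERE pk = 1;     -- element write, no tombstone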


On Fri, Aug 21, 2020 at 7:57 AM Attila Wind  wrote:

> Hi Cassandra Gurus,
>
> Recently I captured a very interesting warning in the logs saying
>
> 2020-08-19 08:08:32.492
> [cassandra-client-keytiles_data_webhits-nio-worker-2] WARN
> com.datastax.driver.core.RequestHandler - Query '[3 bound values] select *
> from visit_sess
> ion_by_start_time_v4 where container_id=? and first_action_time_frame_id
> >= ? and first_action_time_frame_id <= ?;' generated server side
> warning(s):
> *Read 6628 live rows and 6628 tombstone cells* for query SELECT * FROM
> keytiles_data_webhits.visit_session_by_start_time_v4 WHERE container_id =
> 5YzsPfE2Gcu8sd-76626 AND first_action_time_frame_id > 4
> 43837 AND first_action_time_frame_id <= 443670 AND user_agent_type >
> browser-mobile AND unique_webclient_id >
> 045d1683-c702-48bd-9d2b-dcf1ca87ac7c AND first_action_ts > 15978
> 15766 LIMIT 6628 (see tombstone_warn_threshold)
>
> What makes this interesting to me is the fact we never issue not even row
> level deletes but any kind of deletes against this table for now
> So I'm wondering what can result in tombstone creation in Cassandra -
> apart from explicit DELETE queries and TTL setup...
>
> My suspicion is (but I'm not sure) that as we are going with "select *"
> read strategy, then calculate everything in-memory, eventually writing back
> with kinda "update *" queries to Cassandra in this table (so not updating
> just a few columns but everything) can lead to these... Can it?
> I tried to search around this sympthom but was not successful - so decided
> to ask you guys maybe someone can give us a pointer...
>
> Some more info:
>
>- the table does not have TTL set - this mechanism is turned off
>- the LIMIT param in upper query comes from paging size
>- we are using Cassandra4 alpha3
>- we also have a few similarly built tables where we follow the above
>described "update *" policy on write path - however those tables are
>counter tables... when we mass-read them into memory we also go with
>"select *" logic reading up tons of rows. The point is we never saw such a
>warning for these counter tables however we are handling them same
>fashion... ok counter tables work differently but still interesting to me
>    why those never generated things like this
>
> thanks!
> --
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
>
>
>

-- 
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: Understanding replication

2020-09-20 Thread Alex Ott
Data is always written to all replicas in the cluster.
Here is a good diagram of how this happens in a multi-DC cluster:
https://docs.datastax.com/en/dse/6.7/dse-arch/datastax_enterprise/dbInternals/dbIntClientRequestsMultiDCWrites.html

Regarding the second question - theoretically yes, it's possible, but if the
writes were for the same primary key, they will be resolved via the data
timestamp (last write wins).

On Sun, Sep 20, 2020 at 5:31 PM Jai Bheemsen Rao Dhanwada <
jaibheem...@gmail.com> wrote:

> Hello,
>
> I have a question regarding multi Datacenter replication.
> In a multi datacenter(dc-1, dc-2) if two records are written into dc-1 is
> there a guarantee that these two records replicate to dc-2 in the same
> order or is there a possibility that second insert replicate faster than
> first insert? (Assuming first record is bigger than seconds record in terms
> of packet size).
>


-- 
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: reverse paging state

2020-10-23 Thread Alex Ott
Hi

For that version of the driver there is no built-in functionality for
backward paging, although it's doable:
https://stackoverflow.com/questions/50168236/cassandra-pagination-inside-partition/50172052#50172052

For driver 4.9.0 there is a wrapper class that emulates random paging, with a
performance trade-off:
https://docs.datastax.com/en/developer/java-driver/4.9/manual/core/paging/#offset-queries
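The usual approach from the first link, sketched against the 3.x API you're
using (the page-state bookkeeping is up to the application; pageStates and
currentPage are assumed names):

// pageStates.get(n) holds the PagingState that produced page n (null for page 0)
Statement stmt = whereClause.setFetchSize(fetchSize)
        .setPagingState(pageStates.get(currentPage - 1)); // go one page back
ResultSet rs = session.execute(stmt);
pageStates.put(currentPage, rs.getExecutionInfo().getPagingState());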

On Fri, Oct 23, 2020 at 10:00 AM Manu Chadha 
wrote:

> In Java driver 3.4.0, how does one revert the order of paging? I want to
> implement a “Back” button but I can’t figure out from the API docs if/how I
> can make Cassandra (via the Java driver) search backwards.
>
>
>
> https://docs.datastax.com/en/drivers/java/3.4/
>
>
>
> The code I have written currently is
>
>
>
> session.execute(whereClause
>   .setFetchSize(fetchSize)
>
>   .setPagingState(pagingState))
>
>
>
> Thanks
>
> Manu
>
> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
> Windows 10
>
>
>


-- 
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: Cqlsh copy command on a larger data set

2020-07-14 Thread Alex Ott
CQLSH definitely won't work for that amount of data, so you need to use
other tools.

But before selecting them, you need to define requirements. For example:

   1. Are you copying the data into tables with exactly the same structure?
   2. Do you need to preserve metadata, like writetime & TTL?

Depending on that, you may have the following choices:

   - use sstableloader - it will preserve all metadata, like TTL and
   writetime. You just need to copy the SSTable files, or stream directly from
   the source cluster.  But this requires copying the data into tables with
   exactly the same structure (and in the case of UDTs, the keyspace names
   should be the same)
   - use DSBulk - it's a very effective tool for unloading & loading data
   from/to Cassandra/DSE (see the sketch after the links below). Use zstd
   compression for the offloaded data to save disk space (see the blog links
   below for more details).  But preserving metadata could be a problem.
   - use Spark + the Spark Cassandra Connector. But here too, preserving the
   metadata is not an easy task, and requires programming to handle all edge
   cases (see https://datastax-oss.atlassian.net/browse/SPARKC-596 for
   details)


blog series on DSBulk:

   -
   
https://www.datastax.com/blog/2019/03/datastax-bulk-loader-introduction-and-loading
   - https://www.datastax.com/blog/2019/04/datastax-bulk-loader-more-loading
   -
   https://www.datastax.com/blog/2019/04/datastax-bulk-loader-common-settings
   - https://www.datastax.com/blog/2019/06/datastax-bulk-loader-unloading
   - https://www.datastax.com/blog/2019/07/datastax-bulk-loader-counting
   -
   
https://www.datastax.com/blog/2019/12/datastax-bulk-loader-examples-loading-other-locations
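As a taste of the DSBulk route (hosts, keyspace/table and paths are invented;
the exact flags are covered in the posts above):

$ dsbulk unload -h source_host -k ks1 -t table1 -url /data/ks1_table1
$ dsbulk load   -h target_host -k ks1 -t table1 -url /data/ks1_table1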


On Tue, Jul 14, 2020 at 1:47 AM Jai Bheemsen Rao Dhanwada <
jaibheem...@gmail.com> wrote:

> Hello,
>
> I would like to copy some data from one cassandra cluster to another
> cassandra cluster using the CQLSH copy command. Is this the good approach
> if the dataset size on the source cluster is very high(500G - 1TB)? If not
> what is the safe approach? and are there any limitations/known issues to
> keep in mind before attempting this?
>


-- 
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: Impact of enabling authentication on performance

2020-06-03 Thread Alex Ott
You can decrease the time to pick up such changes by setting lower values
for credentials_update_interval_in_ms, roles_update_interval_in_ms &
permissions_update_interval_in_ms.
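E.g., in cassandra.yaml (the numbers are illustrative - here entries are
refreshed roughly every minute, while stale entries stay usable for 10
minutes):

roles_validity_in_ms: 600000
roles_update_interval_in_ms: 60000
permissions_validity_in_ms: 600000
permissions_update_interval_in_ms: 60000
credentials_validity_in_ms: 600000
credentials_update_interval_in_ms: 60000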

Durity, Sean R  at "Tue, 2 Jun 2020 14:48:28 +" wrote:
 DSR> To flesh this out a bit, I set roles_validity_in_ms and 
permissions_validity_in_ms to
 DSR> 360 (10 minutes). The default of 2000 is far too often for my use 
cases. Usually I set
 DSR> the RF for system_auth to 3 per DC. On a larger, busier cluster I have 
set it to 6 per
 DSR> DC. NOTE: if you set the validity higher, it may take that amount of time 
before a change
 DSR> in password or table permissions is picked up (usually less).


 DSR> Sean Durity

 DSR> -Original Message-
 DSR> From: Jeff Jirsa 
 DSR> Sent: Tuesday, June 2, 2020 2:39 AM
 DSR> To: user@cassandra.apache.org
 DSR> Subject: [EXTERNAL] Re: Impact of enabling authentication on performance

 DSR> Set the Auth cache to a long validity

 DSR> Don’t go crazy with RF of system auth

 DSR> Drop bcrypt rounds if you see massive cpu spikes on reconnect storms


 >> On Jun 1, 2020, at 11:26 PM, Gil Ganz  wrote:
 >>
 >> 
 >> Hi
 >> I have a production 3.11.6 cluster which I'm might want to enable 
 >> authentication in, I'm trying to understand what will be the performance 
 >> impact, if any.
 >> I understand each use case might be different, trying to understand if 
 >> there is a common % people usually see their performance hit, or if someone 
 >> has looked into this.
 >> Gil



-- 
With best wishes,
Alex Ott
Principal Architect, DataStax
http://datastax.com/




Re: Partition size, limits, recommendations for tables where all columns are part of the primary key

2020-06-09 Thread Alex Ott
Hi

Yes - basically the rows have no cells, as everything is in the partition
key/clustering columns.

You can always look into the data using sstabledump (this is from the DSE
6.7 that I have running):

 sstabledump ac-1-bti-Data.db
[
  {
"partition" : {
  "key" : [ "977eb1f1-aa5b-11ea-b91a-db426f6f892c",
"977ed900-aa5b-11ea-b91a-db426f6f892c" ],
  "position" : 0
},
"rows" : [
  {
"type" : "row",
"position" : 78,
"clustering" : [ "test", "977ed901-aa5b-11ea-b91a-db426f6f892c" ],
"liveness_info" : { "tstamp" : "2020-06-09T14:14:54.863249Z" },
"cells" : [ ]
  }
]
  }
]

P.S. You can play with your schema, and do some performance tests using the
https://github.com/nosqlbench/


On Tue, Jun 9, 2020 at 3:51 PM Benjamin Christenson <
ben.christen...@kineticdata.com> wrote:

> Hello all, I am doing some data modeling and want to make sure that I
> understand some nuances to cell counts, partition sizes, and related
> recommendations.  Am I correct in my understanding that tables for which
> every column is in the primary key will always have 0 cells?
>
> For example, using https://cql-calculator.herokuapp.com/, I tested the
> following table definition with 100 (1 million) rows per partition and
> an average value size of 255 bytes, and it returned that there were 0 cells
> and the partition took up 32 bytes total:
>   CREATE TABLE IF NOT EXISTS widgets (
> id timeuuid,
> key_id timeuuid,
> parent_id timeuuid,
> value text,
> PRIMARY KEY ((parent_id, key_id), value, id)
>   )
>
> Obviously the total amount of disk space for this table must be more than
> 32 bytes.  In this situation, how should I be reasoning about partition
> sizes (in terms of the 2B cell limit, and 100MB-400MB partition size
> limit)?  Additionally, are there other limits / potential performance
> issues I should be concerned about?
>
> Ben Christenson
> Developer
>
> Kinetic Data, Inc.
> Your business. Your process.
> 651-556-0937  |  ben.christen...@kineticdata.com
> www.kineticdata.com  |  community.kineticdata.com
>
>

-- 
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: what is allowed and not allowed w.r.t altering cassandra table schema

2020-07-16 Thread Alex Ott
Hi

This is quite a big topic - maybe it deserves a blog post of its own.
I've spent some time working with customers on this, so here is my TL;DR:


   - You can add regular columns (not part of the primary key) to a table.
   You must not add a column with the same name as a previously dropped column
   - this leads to errors in commit log replay, and to problems with data in
   existing SSTables.
   - You can drop regular columns, with some limitations:
      - you can't drop columns that are part of the primary key (if you need
      to do this, see the primary key point below);
      - it's not possible to drop columns on tables that have materialized
      views, secondary or search indexes - if you still need to drop one, you
      need to drop the view or index, and re-create it after the change;
      - if you drop a column and then re-add it, Cassandra does not restore
      the values written before the column was dropped;
   - It's not possible to change a column's type - this functionality did
   exist in some versions, but only for "compatible" data types, and it was
   removed from Cassandra as part of CASSANDRA-12443, due to scalability
   problems. Don't try to drop the column & re-add it with a new type - you'll
   get corrupt data. An actual change of a column type should be done by (see
   the sketch after this list):
      - adding a new column with the desired type
      - running migration code that copies data from the existing column
      - dropping the old column
      - To support process continuity, applications may work with both
      columns (read & write) during the migration, and switch to the new one
      after the migration has finished.
   - You can rename a column that is part of the primary key, but not a
   regular column.
   - You can't change the primary key - you need to create a new table with
   the desired primary key, and migrate the data into it.
   - You can add a new field to a UDT, and you can rename a field in a UDT,
   but not drop one.
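A sketch of the type-change migration described above (table and column names
are invented):

cqlsh> ALTER TABLE ks.users ADD age_int int;
-- application backfills: read age_text, write age_int for every row
cqlsh> ALTER TABLE ks.users DROP age_text;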


there are some tools to support schema evolution, for example:
https://github.com/hhandoko/cassandra-migration


On Wed, Jul 15, 2020 at 8:39 PM Manu Chadha  wrote:

> Hi
>
>
>
> What is allowed and not allowed w.r.t altering Cassandra table schema?
>
>
>
> Creating the right schema seems like the most step w.r.t using Cassandra.
> Coming from relational background, I still struggle to create schema which
> leverages duplication and per-query table (I end up creating relationships
> between tables).
>
>
>
> Even if I am able to create a schema which doesn’t have relationship
> between tables for now, my application will evolve in future and I might
> then have to change the schema to avoid creating relationships.
>
>
>
> In that respect, what would I be able to change and not change in a
> schema? If I add a new field (non-key), then for existing values I suppose
> the value of that new field will be null/empty. But if I am allowed to make
> the new field an additional key then what will be the value of this key for
> existing data?
>
>
>
> Thanks
>
> Manu
>
>
>
> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
> Windows 10
>
>
>


-- 
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: Cqlsh copy command on a larger data set

2020-07-16 Thread Alex Ott
if you didn't export TTL explicitly, and didn't load it back, then you'll
get not expirable data.

On Thu, Jul 16, 2020 at 7:48 PM Jai Bheemsen Rao Dhanwada <
jaibheem...@gmail.com> wrote:

> In tried verify metadata, In case of writetime it is setting it as insert
> time but the TTL value is showing as null. Is this expected? Does this mean
> this record will never expire after the insert?
> Is there any alternative to preserve the TTL ?
>
> In the new Table inserted with Cqlsh and Dsbulk
> cqlsh > SELECT ttl(secret) from ks_blah.cf_blah ;
>
>  ttl(secret)
> --
>  null
>  null
>
> (2 rows)
>
> In the old table where the data was written from application
>
> cqlsh > SELECT ttl(secret) from ks_old.cf_old ;
>
>  ttl(secret)
> 
>  4517461
>  4525958
>
> (2 rows)
>
> On Wed, Jul 15, 2020 at 1:17 PM Jai Bheemsen Rao Dhanwada <
> jaibheem...@gmail.com> wrote:
>
>> thank you
>>
>> On Wed, Jul 15, 2020 at 1:11 PM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> Alex is referring to the "writetime" and "tttl" values for each cell.
>>> Most tools copy via CQL writes and don't by default copy those previous
>>> writetime and ttl values and instead just give a new writetime value which
>>> matches the copy time rather than initial insert time.
>>>
>>> On Wed, Jul 15, 2020 at 3:01 PM Jai Bheemsen Rao Dhanwada <
>>> jaibheem...@gmail.com> wrote:
>>>
>>>> Hello Alex,
>>>>
>>>>
>>>>- use DSBulk - it's a very effective tool for unloading & loading
>>>>data from/to Cassandra/DSE. Use zstd compression for offloaded data to 
>>>> save
>>>>disk space (see blog links below for more details).  But the *preserving
>>>>metadata* could be a problem.
>>>>
>>>> Here what exactly do you mean by "preserving metadata" ? would you
>>>> mind explaining?
>>>>
>>>> On Tue, Jul 14, 2020 at 8:50 AM Jai Bheemsen Rao Dhanwada <
>>>> jaibheem...@gmail.com> wrote:
>>>>
>>>>> Thank you for the suggestions
>>>>>
>>>>> On Tue, Jul 14, 2020 at 1:42 AM Alex Ott  wrote:
>>>>>
>>>>>> CQLSH definitely won't work for that amount of data, so you need to
>>>>>> use other tools.
>>>>>>
>>>>>> But before selecting them, you need to define requirements. For
>>>>>> example:
>>>>>>
>>>>>>1. Are you copying the data into tables with exactly the same
>>>>>>structure?
>>>>>>2. Do you need to preserve metadata, like, writetime & TTL?
>>>>>>
>>>>>> Depending on that, you may have following choices:
>>>>>>
>>>>>>- use sstableloader - it will preserve all metadata, like, ttl
>>>>>>and writetime. You just need to copy SSTable files, or stream 
>>>>>> directly from
>>>>>>the source cluster.  But this will require copying of data into 
>>>>>> tables with
>>>>>>exactly same structure (and in case of UDTs, the keyspace names 
>>>>>> should be
>>>>>>the same)
>>>>>>- use DSBulk - it's a very effective tool for unloading & loading
>>>>>>data from/to Cassandra/DSE. Use zstd compression for offloaded data 
>>>>>> to save
>>>>>>disk space (see blog links below for more details).  But the 
>>>>>> preserving
>>>>>>metadata could be a problem.
>>>>>>- use Spark + Spark Cassandra Connector. But also, preserving the
>>>>>>metadata is not an easy task, and requires programming to handle all 
>>>>>> edge
>>>>>>cases (see https://datastax-oss.atlassian.net/browse/SPARKC-596
>>>>>>for details)
>>>>>>
>>>>>>
>>>>>> blog series on DSBulk:
>>>>>>
>>>>>>-
>>>>>>
>>>>>> https://www.datastax.com/blog/2019/03/datastax-bulk-loader-introduction-and-loading
>>>>>>-
>>>>>>
>>>>>> https://www.datastax.com/blog/2019/04/datastax-bulk-loader-more-loading
>>>>>>-
>>>>>>
>>>>>> https://www.datastax.com/blog/2019/04/datastax-bulk-loader-common-settings
>>>>>>-
>>>>>>https://www.datastax.com/blog/2019/06/datastax-bulk-loader-unloading
>>>>>>-
>>>>>>https://www.datastax.com/blog/2019/07/datastax-bulk-loader-counting
>>>>>>-
>>>>>>
>>>>>> https://www.datastax.com/blog/2019/12/datastax-bulk-loader-examples-loading-other-locations
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 14, 2020 at 1:47 AM Jai Bheemsen Rao Dhanwada <
>>>>>> jaibheem...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I would like to copy some data from one cassandra cluster to another
>>>>>>> cassandra cluster using the CQLSH copy command. Is this the good 
>>>>>>> approach
>>>>>>> if the dataset size on the source cluster is very high(500G - 1TB)? If 
>>>>>>> not
>>>>>>> what is the safe approach? and are there any limitations/known issues to
>>>>>>> keep in mind before attempting this?
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> With best wishes,Alex Ott
>>>>>> http://alexott.net/
>>>>>> Twitter: alexott_en (English), alexott (Russian)
>>>>>>
>>>>>

-- 
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: Cqlsh copy command on a larger data set

2020-07-16 Thread Alex Ott
Look at the series of blog posts that I sent - I think it's covered in the
4th post (a rough sketch follows).
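Roughly like this, from memory - check that post for the exact syntax (the
"pk" column is invented; "secret" is from your example):

$ dsbulk unload -url /tmp/dump -query \
    "SELECT pk, secret, ttl(secret) AS t, writetime(secret) AS w FROM ks_old.cf_old"
$ dsbulk load -url /tmp/dump -query \
    "INSERT INTO ks_blah.cf_blah (pk, secret) VALUES (:pk, :secret) USING TTL :t AND TIMESTAMP :w"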

On Thu, Jul 16, 2020 at 8:27 PM Jai Bheemsen Rao Dhanwada <
jaibheem...@gmail.com> wrote:

> okay, is there a way to export the TTL using CQLsh or DSBulk?
>
> On Thu, Jul 16, 2020 at 11:20 AM Alex Ott  wrote:
>
>> If you didn't export the TTL explicitly, and didn't load it back, then
>> you'll get non-expiring data.
>>
>> On Thu, Jul 16, 2020 at 7:48 PM Jai Bheemsen Rao Dhanwada <
>> jaibheem...@gmail.com> wrote:
>>
>>> I tried to verify the metadata. In the case of writetime it is set to the
>>> insert time, but the TTL value shows as null. Is this expected? Does
>>> this mean this record will never expire after the insert?
>>> Is there any alternative to preserve the TTL?
>>>
>>> In the new table, inserted with cqlsh and DSBulk:
>>> cqlsh > SELECT ttl(secret) from ks_blah.cf_blah ;
>>>
>>>  ttl(secret)
>>> -------------
>>>  null
>>>  null
>>>
>>> (2 rows)
>>>
>>> In the old table, where the data was written by the application:
>>>
>>> cqlsh > SELECT ttl(secret) from ks_old.cf_old ;
>>>
>>>  ttl(secret)
>>> -------------
>>>  4517461
>>>  4525958
>>>
>>> (2 rows)
>>>
>>> On Wed, Jul 15, 2020 at 1:17 PM Jai Bheemsen Rao Dhanwada <
>>> jaibheem...@gmail.com> wrote:
>>>
>>>> thank you
>>>>
>>>> On Wed, Jul 15, 2020 at 1:11 PM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>>> Alex is referring to the "writetime" and "ttl" values for each cell.
>>>>> Most tools copy via CQL writes and don't, by default, copy the previous
>>>>> writetime and ttl values; instead they assign a new writetime which
>>>>> matches the copy time rather than the initial insert time.
>>>>>
>>>>> On Wed, Jul 15, 2020 at 3:01 PM Jai Bheemsen Rao Dhanwada <
>>>>> jaibheem...@gmail.com> wrote:
>>>>>
>>>>>> Hello Alex,
>>>>>>
>>>>>>
>>>>>>- use DSBulk - it's a very effective tool for unloading & loading
>>>>>>data from/to Cassandra/DSE. Use zstd compression for offloaded data 
>>>>>> to save
>>>>>>disk space (see blog links below for more details).  But the 
>>>>>> *preserving
>>>>>>metadata* could be a problem.
>>>>>>
>>>>>> Here, what exactly do you mean by "preserving metadata"? Would you
>>>>>> mind explaining?
>>>>>>
>>>>>> On Tue, Jul 14, 2020 at 8:50 AM Jai Bheemsen Rao Dhanwada <
>>>>>> jaibheem...@gmail.com> wrote:
>>>>>>
>>>>>>> Thank you for the suggestions
>>>>>>>
>>>>>>> On Tue, Jul 14, 2020 at 1:42 AM Alex Ott  wrote:
>>>>>>>
>>>>>>>> CQLSH definitely won't work for that amount of data, so you need to
>>>>>>>> use other tools.
>>>>>>>>
>>>>>>>> But before selecting them, you need to define requirements. For
>>>>>>>> example:
>>>>>>>>
>>>>>>>>1. Are you copying the data into tables with exactly the same
>>>>>>>>structure?
>>>>>>>>2. Do you need to preserve metadata, like, writetime & TTL?
>>>>>>>>
>>>>>>>> Depending on that, you may have following choices:
>>>>>>>>
>>>>>>>>- use sstableloader - it will preserve all metadata, like, ttl
>>>>>>>>and writetime. You just need to copy SSTable files, or stream 
>>>>>>>> directly from
>>>>>>>>the source cluster.  But this will require copying of data into 
>>>>>>>> tables with
>>>>>>>>exactly same structure (and in case of UDTs, the keyspace names 
>>>>>>>> should be
>>>>>>>>the same)
>>>>>>>>- use DSBulk - it's a very effective tool for unloading &
>>>>>>>>loading data from/to Cassandra/DSE. Use zstd compression for 
>>>

Re: Use NetworkTopologyStrategy for single data center and add data centers later

2020-12-19 Thread Alex Ott
If you're planning to have another DC, then it's better to start with
NetworkTopologyStrategy from the beginning - just specify the one DC, and when
you get another, it will be simple to expand to it (see the documentation:
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsAddDCToCluster.html).
When adding the new DC, you can use the following script to adjust the system
keyspaces:
https://github.com/DataStax-Toolkit/cassandra-dse-helper-scripts/tree/master/adjust-keyspaces
(it can be used for non-system keyspaces as well)
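
For example (note that ALTER KEYSPACE replaces the whole replication map, so
the existing DC has to be listed again when the new one is added):

  CREATE KEYSPACE "Excalibur"
    WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};

  -- later, once dc2 exists; keep dc1 in the map, or its replicas are dropped:
  ALTER KEYSPACE "Excalibur"
    WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};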



On Sat, Dec 19, 2020 at 10:21 AM Manu Chadha 
wrote:

> Is it possible to use NetworkTopologyStrategy when creating a keyspace and
> add data centers later?
>
> I am just starting with an MVP application and I don't expect much
> traffic or data. Thus I have created only one data center. However, I'd
> like to add more data centers later if needed
>
> I notice that the replication factor for each data center needs to be
> specified at the time of keyspace creation
>
> CREATE KEYSPACE "Excalibur"
>
>   WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};
>
> As I only have dc1 at the moment, could I just do
>
> CREATE KEYSPACE "Excalibur"
>
>   WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};
>
> and when I have another datacenter say dc2, could I edit the Excalibur
>  keyspace?
>
> ALTER KEYSPACE "Excalibur"
>
>   WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc2' : 2};
>
>
>
> or can I start with SimpleStrategy now and change to
> NetworkTopologyStrategy later? I suspect this might not work as I think
> this needs changing snitch etc.
>
>
>
>
>
>
>
>


-- 
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: Last stored value metadata table

2020-11-10 Thread Alex Ott
What about using "PER PARTITION LIMIT 1" on that table?
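
Assuming a schema close to what you describe (a sketch; the table and column
names are placeholders, PER PARTITION LIMIT requires Cassandra 3.6+, and the
query still scans all partitions):

  CREATE TABLE measurement (
      name  text,
      ts    timestamp,
      value double,
      PRIMARY KEY (name, ts)
  ) WITH CLUSTERING ORDER BY (ts DESC);

  -- newest row of every partition, i.e. the last value per name:
  SELECT name, ts, value FROM measurement PER PARTITION LIMIT 1;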

On Tue, Nov 10, 2020 at 8:39 AM Gábor Auth  wrote:

> Hi,
>
> Short story: storing time series of measurements (key(name, timestamp),
> value).
>
> The problem: get the list of the last `value` of every `name`.
>
> Is there a Cassandra friendly solution to store the last value of every
> `name` in a separate metadata table? It will come with a lot of
> tombstones... any other solution? :)
>
> --
> Bye,
> Auth Gábor
>


-- 
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: local read from coordinator

2020-11-10 Thread Alex Ott
The token-aware policy doesn't work for token range queries (at least in the
Java driver 3.x).  You need to force the driver to do the reading using a
specific token as a routing key.  Here is a Java implementation of the token
range scanning algorithm that Spark uses:
https://github.com/alexott/cassandra-dse-playground/blob/master/driver-1.x/src/main/java/com/datastax/alexott/demos/TokenRangesScan.java

I'm not sure whether the Python driver is able to set the routing key
explicitly, but a whitelist policy should help.
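
The queries of such a scan look like this (a sketch; the table and column
names are placeholders) - the WHERE clause alone doesn't tell the driver
which replica owns the range, which is why the routing key/token has to be
set explicitly:

  -- executed once per token (sub-)range of the ring:
  SELECT pk, value FROM ks.tbl
   WHERE token(pk) > :range_start AND token(pk) <= :range_end;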



On Wed, Nov 11, 2020 at 7:03 AM Erick Ramirez 
wrote:

> Yes, use a token-aware policy so the driver will pick a coordinator where
> the token (partition) exists. Cheers!
>


-- 
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: Re: local read from coordinator

2020-11-11 Thread Alex Ott
If you force the routing key, then the replica that owns the data will be
selected as the coordinator.

On Wed, Nov 11, 2020 at 12:35 PM onmstester onmstester
 wrote:

> Thanx,
>
> But i'm OK with coordinator part, actually i was looking for kind of read
> CL to force to read from the coordinator only with no other connections to
> other nodes!
>
>
>
>
>  Forwarded message 
> From: Alex Ott 
> To: "user"
> Date: Wed, 11 Nov 2020 11:28:56 +0330
> Subject: Re: local read from coordinator
>  Forwarded message 
>
> token-aware policy doesn't work for token range queries (at least in the
> Java driver 3.x).  You need to force the driver to do the reading using a
> specific token as a routing key.  Here is Java implementation of the token
> range scanning algorithm that Spark uses:
> https://github.com/alexott/cassandra-dse-playground/blob/master/driver-1.x/src/main/java/com/datastax/alexott/demos/TokenRangesScan.java
>
> I'm not aware if Python driver is able to set routing key explicitly, but
> whitelist policy should help
>
>
>
> On Wed, Nov 11, 2020 at 7:03 AM Erick Ramirez 
> wrote:
>
> Yes, use a token-aware policy so the driver will pick a coordinator where
> the token (partition) exists. Cheers!
>
>
>
> --
> With best wishes, Alex Ott
> http://alexott.net/
> Twitter: alexott_en (English), alexott (Russian)
>
>
>
>

-- 
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: local read from coordinator

2020-11-11 Thread Alex Ott
Jeff, I was talking about the driver -> coordinator communication, not about
where the data will be read from.

On Wed, Nov 11, 2020 at 3:24 PM Jeff Jirsa  wrote:

>
> This isn’t necessarily true and cassandra has no coordinator-only
> consistency level to force this behavior
>
> (The snitch is going to pick the best option for local_one reads and any
> compactions or latency deviations from load will make it likely that
> another replica is chosen in practice)
>
> On Nov 11, 2020, at 3:46 AM, Alex Ott  wrote:
>
> 
> if you force routing key, then the replica that owns the data will be
> selected as coordinator
>
> On Wed, Nov 11, 2020 at 12:35 PM onmstester onmstester
>  wrote:
>
>> Thanx,
>>
>> But i'm OK with coordinator part, actually i was looking for kind of read
>> CL to force to read from the coordinator only with no other connections to
>> other nodes!
>>
>>
>>
>>
>>  Forwarded message 
>> From: Alex Ott 
>> To: "user"
>> Date: Wed, 11 Nov 2020 11:28:56 +0330
>> Subject: Re: local read from coordinator
>>  Forwarded message 
>>
>> token-aware policy doesn't work for token range queries (at least in the
>> Java driver 3.x).  You need to force the driver to do the reading using a
>> specific token as a routing key.  Here is Java implementation of the token
>> range scanning algorithm that Spark uses:
>> https://github.com/alexott/cassandra-dse-playground/blob/master/driver-1.x/src/main/java/com/datastax/alexott/demos/TokenRangesScan.java
>>
>> I'm not aware if Python driver is able to set routing key explicitly, but
>> whitelist policy should help
>>
>>
>>
>> On Wed, Nov 11, 2020 at 7:03 AM Erick Ramirez 
>> wrote:
>>
>> Yes, use a token-aware policy so the driver will pick a coordinator where
>> the token (partition) exists. Cheers!
>>
>>
>>
>> --
>> With best wishes, Alex Ott
>> http://alexott.net/
>> Twitter: alexott_en (English), alexott (Russian)
>>
>>
>>
>>
>
> --
> With best wishes, Alex Ott
> http://alexott.net/
> Twitter: alexott_en (English), alexott (Russian)
>
>

-- 
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: Which open source or free tool do you use to monitor cassandra clusters?

2021-06-16 Thread Alex Ott
Take a look at https://github.com/datastax/metric-collector-for-apache-cassandra
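
It installs as a JVM agent (an illustrative sketch; the paths are
placeholders - check the repository README for the exact steps):

  $ mkdir -p /opt/mcac && tar xzf datastax-mcac-agent-*.tar.gz -C /opt/mcac
  $ echo 'JVM_OPTS="$JVM_OPTS -javaagent:/opt/mcac/lib/datastax-mcac-agent.jar"' \
      >> /etc/cassandra/cassandra-env.sh
  # restart the node afterwards; the agent exposes the metrics consumed by
  # the project's bundled Prometheus/Grafana dashboards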

On Wed, Jun 16, 2021 at 5:21 PM Surbhi Gupta 
wrote:

> Hi,
>
> Which open source or free tool do you use to monitor cassandra clusters
> which has features similar to Opscenter?
>
> Thanks
> Surbhi
>
>

-- 
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: Changing num_tokens and migrating to 4.0

2021-03-20 Thread Alex Ott
If the nodes are almost the same except for the disk space, then giving them
more tokens may make the situation worse - they will get more requests than
the other nodes, and won't have the resources to process them.
In Cassandra the disk size isn't the main "success" factor - it's memory,
CPU, disk type (SSD), etc.

On Sat, Mar 20, 2021 at 5:26 PM Lapo Luchini  wrote:

> Hi, thanks for suggestions!
> I'll definitely migrate to 4.0 after all this is done, then.
>
> I fear the old prod DC can't afford to lose a node right now (a few nodes
> have their disks 70% full), but I can maybe find a third node for the new
> DC right away.
>
> BTW the new nodes have got 3× the disk space, but are not so much
> different regarding CPU and RAM: does it make any sense to give them a
> bit more num_tokens (maybe 20-30 instead of 16) than the rest of the old
> DC hosts, or do "asymmetrical" clusters lead to problems?
>
> No real need to do that anyways, moving from 6 nodes to (eventually) 8
> should be enough to lessen the load on the disks, and before more space is
> needed I will probably have more nodes.
>
> Lapo
>
> On 2021-03-20 16:23, Alex Ott wrote:
> > I personally maybe would go following way (need to calculate how many
> > joins/decommissions will be at the end):
> >
> >   * Decommission one node from prod DC
> >   * Form new DC from two new machines and decommissioned one.
> >   * Rebuild DC from existing one, make sure that repair finished, etc.
> >   * Switch traffic
> >   * Remove old DC
> >   * Add nodes from old DC one by one into new DC
> >
>
>
>
>
>

-- 
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


Re: Changing num_tokens and migrating to 4.0

2021-03-20 Thread Alex Ott
There are several things to consider here:

   - You can't have a DC of two nodes with RF=3...
   - Are you sure that the new DC will handle all production traffic?
   - If the new nodes are much more powerful than the others (memory/CPU/disk
   type), that could also cause unpredictable spikes when a request hits the
   "smaller" node.


I personally would maybe go the following way (you need to calculate how many
joins/decommissions there will be at the end):

   - Decommission one node from the prod DC
   - Form a new DC from the two new machines and the decommissioned one.
   - Rebuild the new DC from the existing one, make sure that repair finished, etc.
   - Switch the traffic
   - Remove the old DC
   - Add the nodes from the old DC one by one into the new DC

The upgrade to Cassandra 4.0 should be done either prior to that, or after -
you shouldn't do it while bootstrapping/decommissioning...
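
In nodetool terms, the middle steps of that list look roughly like this (a
sketch; the keyspace and DC names are placeholders):

  # on the node leaving the old DC:
  $ nodetool decommission

  # once the new DC is up, make the keyspaces replicate to it, e.g.:
  #   ALTER KEYSPACE my_ks WITH REPLICATION =
  #     {'class' : 'NetworkTopologyStrategy', 'old_dc' : 3, 'new_dc' : 3};

  # on each node of the new DC, stream the existing data from the old DC:
  $ nodetool rebuild -- old_dc

  # verify with a repair before switching the traffic:
  $ nodetool repair -full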



On Sat, Mar 20, 2021 at 4:09 PM Lapo Luchini  wrote:

> I have a 6-node production cluster running 3.11.9 with the default
> num_tokens=256… which is fine, but I later discovered it is a bit of a
> hassle for repairs, and it's probably better to lower that to 16.
>
> I'm adding two new nodes with much higher space storage and I was
> wondering which migration strategy is better.
>
> If I got it correct I was thinking about this:
> 1. add the 2 new nodes as a new "temporary DC", with num_tokens=16 RF=3
> 2. repair it all, then test it a bit
> 3. switch production applications to "DC-temp"
> 4. drop the old 6-node DC
> 5. re-create it from scratch with num_tokens=16 RF=3
> 6. switch production applications to "main DC" again
> 7. drop "DC-temp", eventually integrate nodes into "main DC"
>
> I'd also like to migrate from 3.11.9 to 4.0-beta2 (I'm running on
> FreeBSD so those are the options), does it make sense to do it during
> the mentioned "num_tokens migration" (at step 1, or 5) or does it make
> more sense to do it at step 8, as a in-place rolling upgrade of each of
> the 6 (or 8) nodes?
>
> Did I get it correctly?
> Can it be done "better"?
>
> Thanks in advance for any suggestion or correction!
>
> --
> Lapo Luchini
> l...@lapo.it
>
>
>
>

-- 
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)