Re: Problem with restoring a snapshot using sstableloader
It's a bug in sstableloader introduced many years ago - before that, it worked as described in the documentation...

Oliver Herrmann at "Fri, 30 Nov 2018 17:05:43 +0100" wrote:

OH> Hi,
OH> I'm having some problems restoring a snapshot using sstableloader. I'm using Cassandra 3.11.1 and followed the instructions
OH> for creating and restoring from this page:
OH> https://docs.datastax.com/en/dse/6.0/dse-admin/datastax_enterprise/tools/toolsSStables/toolsBulkloader.html
OH>
OH> 1. Called nodetool cleanup on each node
OH>    $ nodetool cleanup cass_testapp
OH>
OH> 2. Called nodetool snapshot on each node
OH>    $ nodetool snapshot -t snap1 -kt cass_testapp.table3
OH>
OH> 3. Checked the data and snapshot folders:
OH>    $ ll /var/lib/cassandra/data/cass_testapp/table3-7227e480f3b411e8941285913bce94cb
OH>    drwxr-xr-x 2 cassandra cassandra    6 Nov 29 03:54 backups
OH>    -rw-r--r-- 2 cassandra cassandra   43 Nov 30 10:21 mc-11-big-CompressionInfo.db
OH>    -rw-r--r-- 2 cassandra cassandra  241 Nov 30 10:21 mc-11-big-Data.db
OH>    -rw-r--r-- 2 cassandra cassandra    9 Nov 30 10:21 mc-11-big-Digest.crc32
OH>    -rw-r--r-- 2 cassandra cassandra   16 Nov 30 10:21 mc-11-big-Filter.db
OH>    -rw-r--r-- 2 cassandra cassandra   21 Nov 30 10:21 mc-11-big-Index.db
OH>    -rw-r--r-- 2 cassandra cassandra 4938 Nov 30 10:21 mc-11-big-Statistics.db
OH>    -rw-r--r-- 2 cassandra cassandra   95 Nov 30 10:21 mc-11-big-Summary.db
OH>    -rw-r--r-- 2 cassandra cassandra   92 Nov 30 10:21 mc-11-big-TOC.txt
OH>    drwxr-xr-x 3 cassandra cassandra   18 Nov 30 10:30 snapshots
OH>
OH>    and
OH>
OH>    $ ll /var/lib/cassandra/data/cass_testapp/table3-7227e480f3b411e8941285913bce94cb/snapshots/snap1/
OH>    total 44
OH>    -rw-r--r-- 1 cassandra cassandra   32 Nov 30 10:30 manifest.json
OH>    -rw-r--r-- 2 cassandra cassandra   43 Nov 30 10:21 mc-11-big-CompressionInfo.db
OH>    -rw-r--r-- 2 cassandra cassandra  241 Nov 30 10:21 mc-11-big-Data.db
OH>    -rw-r--r-- 2 cassandra cassandra    9 Nov 30 10:21 mc-11-big-Digest.crc32
OH>    -rw-r--r-- 2 cassandra cassandra   16 Nov 30 10:21 mc-11-big-Filter.db
OH>    -rw-r--r-- 2 cassandra cassandra   21 Nov 30 10:21 mc-11-big-Index.db
OH>    -rw-r--r-- 2 cassandra cassandra 4938 Nov 30 10:21 mc-11-big-Statistics.db
OH>    -rw-r--r-- 2 cassandra cassandra   95 Nov 30 10:21 mc-11-big-Summary.db
OH>    -rw-r--r-- 2 cassandra cassandra   92 Nov 30 10:21 mc-11-big-TOC.txt
OH>    -rw-r--r-- 1 cassandra cassandra 1043 Nov 30 10:30 schema.cql
OH>
OH> 4. Truncated the table
OH>    cqlsh:cass_testapp> TRUNCATE table3 ;
OH>
OH> 5. Tried to restore table3 on one Cassandra node
OH>    $ sstableloader -d localhost /var/lib/cassandra/data/cass_testapp/table3-7227e480f3b411e8941285913bce94cb/snapshots/snap1/
OH>    Established connection to initial hosts
OH>    Opening sstables and calculating sections to stream
OH>    Skipping file mc-11-big-Data.db: table snapshots.table3 doesn't exist
OH>    Summary statistics:
OH>       Connections per host    : 1
OH>       Total files transferred : 0
OH>       Total bytes transferred : 0.000KiB
OH>       Total duration          : 2652 ms
OH>       Average transfer rate   : 0.000KiB/s
OH>       Peak transfer rate      : 0.000KiB/s
OH>
OH> I'm always getting the message "Skipping file mc-11-big-Data.db: table snapshots.table3 doesn't exist". I also tried to rename
OH> the snapshots folder into the keyspace name (cass_testapp), but then I get the message "Skipping file mc-11-big-Data.db: table
OH> snap1.snap1. doesn't exist".
OH>
OH> What am I doing wrong?
OH> Thanks
OH> Oliver

--
With best wishes,
Alex Ott
Solutions Architect EMEA, DataStax
http://datastax.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org
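The error messages in the question suggest sstableloader derives the keyspace and table names from the last two path components. A common workaround (my assumption based on that behavior, not stated in the reply above) is to re-stage the snapshot files into a `<keyspace>/<table>` directory layout and load from there:

```shell
# sstableloader takes the table name from the last path component and the
# keyspace from its parent, so a snapshot directory must be re-staged first.
# Paths below reuse the ones from the original post.
mkdir -p /tmp/restore/cass_testapp/table3
cp /var/lib/cassandra/data/cass_testapp/table3-7227e480f3b411e8941285913bce94cb/snapshots/snap1/mc-* \
   /tmp/restore/cass_testapp/table3/
sstableloader -d localhost /tmp/restore/cass_testapp/table3
```

With this layout the tool reports `cass_testapp.table3` instead of `snapshots.table3`, which matches how it parsed the paths in the error messages above.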
Re: Understanding output of read/write histogram using opscenter API
You can also ask in the #opscenter channel at DataStax Academy Slack: http://academy.datastax.com/slack

Bhardwaj, Rahul at "Tue, 4 Jun 2019 11:24:44 +" wrote:

BR> Hi All,
BR> Do we have any document to understand the output of the read/write histogram using the OpsCenter API? We need it to ingest into
BR> one of our dashboards. We are facing difficulty in understanding its output if we relate it with 5 values like
BR> max, min, median, 90th percentile, etc. Attaching one sample output. We could not find a way to get different percentiles' data
BR> using the API for read-histogram and write-histogram. Kindly help or provide some doc link related to its explanation.
BR>
BR> Thanks and Regards,
BR> Rahul Bhardwaj

--
With best wishes,
Alex Ott
Solutions Architect EMEA, DataStax
http://datastax.com/
Re: Performance impact with ALLOW FILTERING clause.
Spark connector doesn't do "select * from table;" - it reads by token ranges (see https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/partitioner/CassandraPartition.scala#L14)

Jacques-Henri Berthemet at "Thu, 25 Jul 2019 14:18:57 +" wrote:

JB> Hi Asad,
JB> That's because of the way Spark works. Essentially, when you execute a Spark job, it pulls the full content of the datastore (Cassandra
JB> in your case) into its RDDs and works with it "in memory". While Spark uses "data locality" to read data from the nodes that have the
JB> required data on their local disks, it's still reading all data from Cassandra tables. To do so it sends a 'select * from Table ALLOW
JB> FILTERING' query to Cassandra.
JB> From Spark you don't have much control over the initial query that fills the RDDs; sometimes you'll read the whole table even if you only
JB> need one row.
JB> Regards,
JB> Jacques-Henri Berthemet
JB>
JB> From: "ZAIDI, ASAD A"
JB> Reply to: "user@cassandra.apache.org"
JB> Date: Thursday 25 July 2019 at 15:49
JB> To: "user@cassandra.apache.org"
JB> Subject: Performance impact with ALLOW FILTERING clause.
JB>
JB> Hello Folks,
JB> I was going through the documentation and saw in many places that ALLOW FILTERING causes performance unpredictability. Our developers say
JB> the ALLOW FILTERING clause is implicitly added to a bunch of queries by the spark-cassandra-connector and they cannot control it; however, at the
JB> same time we see unpredictability in application performance - just as the documentation says.
JB> I'm trying to understand why a connector would add a clause to a query when this can negatively impact database/application
JB> performance. Is it the data model that drives the connector's decision to add ALLOW FILTERING to the query automatically, or are there
JB> other reasons this clause is added?
JB> I'm not a developer, though I want to know why developers don't have any control over this.
JB> I'll appreciate your guidance here.
JB> Thanks
JB> Asad

--
With best wishes,
Alex Ott
Solutions Architect EMEA, DataStax
http://datastax.com/
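The token-range split that the connector performs instead of a single full scan can be sketched in plain Python (illustrative only - this is not connector code; it assumes the Murmur3Partitioner token range, and each resulting range would back a `... WHERE token(pk) > ? AND token(pk) <= ?` query):

```python
# Murmur3Partitioner tokens span [-2**63, 2**63 - 1].
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1

def token_splits(n):
    """Split the token ring into n contiguous (start, end] ranges,
    each of which becomes one small range query instead of a full scan."""
    step = (MAX_TOKEN - MIN_TOKEN) // n
    bounds = [MIN_TOKEN + i * step for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))

splits = token_splits(4)
```

Each split can then be assigned to the Spark executor co-located with the replicas owning that range, which is how "data locality" is achieved without ever running one giant `select *`.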
Re: Keyspace Clone in Existing Cluster
You can create all the tables in the new keyspace, copy the SSTables from the 1.0 tables to the 2.0 tables, and use nodetool refresh on the tables in KS 2.0 to tell Cassandra about them.

On Tue, Oct 29, 2019 at 4:10 PM Ankit Gadhiya wrote:

> Hello Folks,
>
> Greetings!
>
> I've a requirement in my project to set up Blue-Green deployment for
> Cassandra. E.g. say my current active schema (application pointing to) is
> Keyspace 1.0, and for my next release I want to set up Keyspace 2.0 (with
> some structural changes); all testing/validation would happen on it and,
> once successful, the app would switch its connection to Keyspace 2.0 - this
> would be the generic release deployment for our project.
>
> One of the approaches we thought of would be to create Keyspace 2.0 as a
> clone of Keyspace 1.0, including data, using sstableloader, but this would
> be time consuming; also, being a multi-node cluster (6+6 in each DC), it
> wouldn't be very feasible to do this manually on all the nodes for the
> multiple tables in that keyspace. Was wondering if we have any other
> creative way to satisfy this requirement.
>
> Appreciate your time on this.
>
> *Thanks & Regards,*
> *Ankit Gadhiya*

--
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
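A sketch of the copy-and-refresh approach, run on every node. The keyspace and table names are placeholders, and the table directory suffixes are generated UUIDs that will differ between keyspaces, so the globs below would need to resolve to exactly one directory each:

```shell
# For each table: copy the SSTable files from the old keyspace's data
# directory into the matching table directory of the new keyspace, then
# tell Cassandra to load them. ks1/ks2 and mytable are placeholder names.
cp /var/lib/cassandra/data/ks1/mytable-*/mc-*-big-* \
   /var/lib/cassandra/data/ks2/mytable-*/
nodetool refresh ks2 mytable
```

Because SSTable files are immutable, a hard link (`cp -l` on the same filesystem) avoids doubling disk usage during the clone.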
Re: COPY command with where condition
don't need to export...
>>
>> ------------------------------
>> *From:* adrien ruffie
>> *Sent:* Friday, 17 January 2020 11:39
>> *To:* Erick Ramirez ; user@cassandra.apache.org
>> *Subject:* RE: COPY command with where condition
>>
>> Thanks a lot!
>> That's good news for DSBulk! I will take a look at this solution.
>>
>> best regards,
>> Adrian
>> ------------------------------
>> *From:* Erick Ramirez
>> *Sent:* Friday, 17 January 2020 10:02
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: COPY command with where condition
>>
>> The COPY command doesn't support filtering and it doesn't perform well
>> for large tables.
>>
>> Have you considered the DSBulk tool from DataStax? Previously, it only
>> worked with DataStax Enterprise, but a few weeks ago it was made free and
>> works with open-source Apache Cassandra. For details, see this blogpost
>> <https://www.datastax.com/blog/2019/12/tools-for-apache-cassandra>.
>> Cheers!
>>
>> On Fri, Jan 17, 2020 at 6:57 PM adrien ruffie wrote:
>>
>> Hello all,
>>
>> In my company we want to export a big dataset from our Cassandra ring.
>> We looked at the COPY command, but I can't find if and how a WHERE
>> condition can be used.
>>
>> We need to export only some of the data, which must be returned by a
>> WHERE clause - unfortunately with ALLOW FILTERING, due to several old
>> tables which were poorly designed...
>>
>> Do you know a way to do that, please?
>>
>> Thanks all and best regards,
>>
>> Adrian

--
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
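A sketch of a filtered export with DSBulk's `unload -query` option (the host, keyspace, table, and WHERE clause below are placeholders; the thread above only points at DSBulk generally, so treat this invocation as an assumption to verify against the DSBulk docs):

```shell
# Unload only the rows matching a predicate into CSV files under ./export.
# ALLOW FILTERING is needed here for the same reason as in the question:
# the predicate is not covered by the table's primary key.
dsbulk unload -h 127.0.0.1 \
  -query "SELECT id, value FROM my_ks.my_table WHERE status = 'OLD' ALLOW FILTERING" \
  -url ./export
```

Unlike cqlsh COPY, DSBulk parallelizes the export and handles retries, which matters for a "big dataset" like the one described.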
Re: How to know execute_async correctly?
Hi

There are several things here:

1. When you execute a query via execute_async, instead of a ResultSet you get a ResponseFuture. You can then call .result() on it to obtain the results of the execution, or an error. Another possibility is to attach callbacks to that future object. See https://docs.datastax.com/en/developer/python-driver/3.20/getting_started/#asynchronous-queries for more details & examples.

2. You need to be very careful with batches - if you just put different queries in them, you're making your Cassandra slower, not faster. Please read https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/useBatch.html to understand when you can use them effectively.

I recommend first going through the "Developing applications with DataStax drivers" guide (https://docs.datastax.com/en/devapp/doc/devapp/aboutDrivers.html) to get an understanding of how to work with Cassandra using drivers.

lampahome at "Thu, 12 Dec 2019 12:17:42 +0800" wrote:

l> I tried to execute async by batch in python-driver. But I don't know how to check that the query executed correctly.
l> Code is like below:
l>
l>     B = BatchStatement()
l>     for x in xxx:
l>         B.add(query, (args))
l>     res = session.execute_async(B)
l>     B.clear()  # for reusing
l>     r = res.result()
l>     ## Then how to know my query works correctly? print(r)?
l>
l> I found no doc about my question in the page of ResultSet.
l> Can anyone explain?
l> thx

--
With best wishes,
Alex Ott
Principal Architect, DataStax
http://datastax.com/
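The two ways of consuming a ResponseFuture - blocking on `.result()` or attaching callbacks - follow the standard future pattern. The sketch below uses the stdlib `concurrent.futures` so it can run anywhere; with the Cassandra driver you would call `session.execute_async(stmt)` and use `add_callbacks(on_success, on_error)` instead (names per the driver docs linked above):

```python
# Illustrates the future pattern behind execute_async, without a cluster.
from concurrent.futures import ThreadPoolExecutor

def fetch_rows():
    # Stands in for the server executing the statement; a failed query
    # would raise here, and .result() would re-raise it to the caller.
    return ["row1", "row2"]

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(fetch_rows)          # like session.execute_async(stmt)
    done = []
    future.add_done_callback(lambda f: done.append(f.result()))  # callback style
    rows = future.result()                    # blocking style: rows or raised error
```

So "how do I know my query worked?" is answered by `.result()`: it either returns the rows or raises the error that occurred server-side.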
Re: Disabling Swap for Cassandra
I usually recommend the following document: https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/config/configRecommendedSettings.html - it's about DSE, but applicable to OSS Cassandra as well...

Kunal at "Thu, 16 Apr 2020 15:49:35 -0700" wrote:

K> Hello,
K>
K> I need some suggestions from you all. I am new to Cassandra and was reading Cassandra best practices. In one document, it was
K> mentioned that Cassandra should not be using swap, as it degrades performance.
K> My question is: instead of disabling swap system-wide, can we force Cassandra not to use swap? Some documentation suggests using
K> memory_locking_policy in cassandra.yaml.
K> How do I check if our Cassandra already has this parameter and still uses swap? Is there any way I can check this? I already
K> checked cassandra.yaml and don't see this parameter. Is there any other place I can check and confirm?
K> Also, can I set the memlock parameter to unlimited (64kB default), so the entire heap (Xms = Xmx) can be locked at node startup? Will that help?
K> Or if you have any other suggestions, please let me know.
K>
K> Regards,
K> Kunal

--
With best wishes,
Alex Ott
Principal Architect, DataStax
http://datastax.com/
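A quick way to answer the "is my node actually swapping, and can it lock memory?" part of the question (a sketch for Linux; the `CassandraDaemon` process-name match is an assumption about how the JVM was started):

```shell
# No output from swapon means swap is disabled system-wide.
swapon --show

# Max locked memory for the current shell; 'unlimited' is what lets the
# JVM mlock the heap at startup (the memlock limit the question asks about).
ulimit -l

# VmSwap in the process status shows how much of the Cassandra process
# is currently swapped out (0 kB is what you want to see).
grep -i vmswap /proc/"$(pgrep -f CassandraDaemon)"/status
```

Note that the mlock path in Cassandra is controlled by the JVM flag and memlock limit, not a cassandra.yaml setting, which is why the parameter wasn't found in the yaml.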
Re: How to find which table partitions having the more reads per sstables ?
There is also nodetool toppartitions: https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/nodetool/toolsToppartitions.html

Erick Ramirez at "Mon, 16 Mar 2020 22:44:44 +1100" wrote:

ER> > How to find which table partitions having the more reads per sstables in Cassandra?
ER>
ER> Your question is unclear. Do you want to know which tables are read the most? If so, you'll need to run nodetool tablestats and parse/sort
ER> the output to get the top tables based on read count.
ER> But if you want to know which are the "hottest" partitions, you'll need to have audit logging enabled to catch the incoming CQL. See Audit
ER> Logging if your cluster is running OSS C*. If your cluster is running DataStax Enterprise, see Database auditing for details. Cheers!
ER>
ER> GOT QUESTIONS? Apache Cassandra experts from the community and DataStax have answers! Share your expertise on
ER> https://community.datastax.com/.

--
With best wishes,
Alex Ott
Principal Architect, DataStax
http://datastax.com/
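A sketch of both suggestions from the thread (keyspace/table names are placeholders; the `toppartitions` argument order is as in the Cassandra 3.x nodetool, worth verifying against your version's help output):

```shell
# Sample the hottest partitions of one table for 10 seconds (10000 ms),
# reporting the most frequently read and written partition keys:
nodetool toppartitions my_keyspace my_table 10000

# Per-table read counts, for finding the most-read tables overall:
nodetool tablestats my_keyspace | grep -E 'Table:|Read Count'
```

toppartitions samples live traffic, so it only sees what happens during the sampling window; tablestats counts are cumulative since node start.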
Re: OOM only on one datacenter nodes
Have you set -Xmx32g? In that case you may get significantly less available memory because of the switch to 64-bit references. See http://java-performance.info/over-32g-heap-java/ for details, and set it slightly less than 32Gb.

Reid Pinchback at "Sun, 5 Apr 2020 00:50:43 +" wrote:

RP> Surbi:
RP>
RP> If you aren't seeing connection activity in DC2, I'd check to see if the operations hitting DC1 are quorum ops instead of local quorum. That
RP> still wouldn't explain DC2 nodes going down, but would at least explain them doing more work than might be on your radar right now.
RP> The hint replay being slow, to me, sounds like you could be fighting GC.
RP> You mentioned bumping the DC2 nodes to 32gb. You might have already been doing this but, if not, be sure to be under 32gb, like 31gb.
RP> Otherwise you're using larger object pointers and could actually have less effective ability to allocate memory.
RP> As the problem is only happening in DC2, there has to be something that is true in DC2 that isn't true in DC1. A difference in hardware, a
RP> difference in O/S version, a difference in networking config or physical infrastructure, a difference in client-triggered activity, or a
RP> difference in how repairs are handled. Somewhere, there is a difference. I'd start with focusing on that.
RP>
RP> From: Erick Ramirez
RP> Reply-To: "user@cassandra.apache.org"
RP> Date: Saturday, April 4, 2020 at 8:28 PM
RP> To: "user@cassandra.apache.org"
RP> Subject: Re: OOM only on one datacenter nodes
RP>
RP> Message from External Sender
RP> In the absence of a heapdump for you to analyse, my hypothesis is that your DC2 nodes are taking on traffic (from some client somewhere) but you're
RP> just not aware of it. The hints replay is just a side-effect of the nodes getting overloaded.
RP> To rule out my hypothesis in the first instance, my recommendation is to monitor the incoming connections to the nodes in DC2. If you don't
RP> have monitoring in place, you could simply run netstat at regular intervals and go from there. Cheers!

--
With best wishes,
Alex Ott
Principal Architect, DataStax
http://datastax.com/
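The "stay under 32gb" advice can be checked directly on a node with a JDK installed; the JVM reports whether compressed object pointers are in effect for a given heap size:

```shell
# With -Xmx31g compressed oops stay enabled; with -Xmx32g the JVM
# silently switches them off, so every reference doubles to 8 bytes.
java -Xmx31g -XX:+PrintFlagsFinal -version | grep -i UseCompressedOops
java -Xmx32g -XX:+PrintFlagsFinal -version | grep -i UseCompressedOops
```

Comparing the two lines of output shows the flag flipping from true to false, which is why a 32 GB heap can hold fewer objects than a 31 GB one.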
Re: Query data through python using IN clause
Hi

Working code is below, but I want to warn you: prefer not to use IN with partition keys. Because you'll have different partition key values, the coordinator node will need to perform queries against the other hosts that hold those partition keys; this slows down the operation and adds extra load on the coordinator node. If you execute queries in parallel (using async) for every combination of pk1 & pk2, and then consolidate the data application-side, this can be faster than a query with IN.

Answer: you need to pass a list as the value of temp - IN expects a list there...

    query = session.prepare("select * from test.table1 where pk1 IN ? and pk2=0 and ck1 > ? AND ck1 < ?;")
    temp = [1, 2, 3]
    import dateutil.parser
    ck1 = dateutil.parser.parse('2020-01-01T00:00:00Z')
    ck2 = dateutil.parser.parse('2021-01-01T00:00:00Z')
    rows = session.execute(query, (temp, ck1, ck2))
    for row in rows:
        print(row)

Nitan Kainth at "Wed, 1 Apr 2020 18:21:54 -0500" wrote:

NK> Hi There,
NK> I am trying to read data from a table with the below structure:
NK>
NK>     table1(
NK>         pk1 bigint,
NK>         pk2 bigint,
NK>         ck1 timestamp,
NK>         value text,
NK>         primary key((pk1,pk2),ck1));
NK>
NK>     query = session.prepare("select * from table1 where pk IN ? and pk2=0 and ck1 > ? AND ck1 < ?;")
NK>     temp = 1,2,3
NK>     runq = session.execute(query2, (temp, ck1, ck1))
NK>
NK>     TypeError: Received an argument of invalid type for column "in(bam_user)". Expected: , Got: ; (cannot convert argument to integer)
NK>
NK> I found examples of prepared statements for inserts but couldn't find any for select, and I'm not able to make it work.
NK> Any suggestions?

--
With best wishes,
Alex Ott
Principal Architect, DataStax
http://datastax.com/
Re: Issues, understanding how CQL works
> > What your primary key REALLY MEANS is:
> >
> > The database on reads and writes will hash(signalid+monthyear) to find
> > which hosts have the data, then
> >
> > in each data file, the data for a given (signalid, monthyear) is stored
> > sorted by fromtime and totime.
> >
> > > The database is already round about 260 GB in size.
> > > I now need to know what is the most recent entry in it; the correct
> > > column to learn this would be "insertdate".
> > >
> > > In SQL I would do something like this:
> > >
> > >     SELECT insertdate FROM tagdata.central
> > >     ORDER BY insertdate DESC LIMIT 1;
> > >
> > > In CQL, however, I just can't get it to work.
> > >
> > > What I have tried already is this:
> > >
> > >     SELECT insertdate FROM "tagdata.central"
> > >     ORDER BY insertdate DESC LIMIT 1;
> >
> > Because you didn't provide a signalid and monthyear, it doesn't know
> > which machine in your cluster to use to start the query.
> >
> > > But this gives me an error:
> > >     ERROR: ORDER BY is only supported when the partition key is restricted
> > >     by an EQ or an IN.
> >
> > Because it's designed for potentially petabytes of data per cluster, it
> > doesn't believe you really want to walk all the data and order ALL of
> > it. Instead, it assumes that when you need to use an ORDER BY, you're
> > going to have some very small piece of data - confined to a single
> > signalid/monthyear pair. And even then, the ORDER BY is going to assume
> > that you're ordering by the ordering keys you've defined - fromtime
> > first, and then totime.
> >
> > So you can do
> >
> >     SELECT ... WHERE signalid=? AND monthyear=? ORDER BY fromtime ASC
> >
> > and you can do
> >
> >     SELECT ... WHERE signalid=? AND monthyear=? ORDER BY fromtime DESC
> >
> > and you can do ranges:
> >
> >     SELECT ... WHERE signalid=? AND monthyear=? AND fromtime >= ? ORDER BY fromtime DESC
> >
> > But you have to work within the boundaries of how the data is stored.
> > It's stored grouped by signalid+monthyear, and then sorted by fromtime,
> > and then sorted by totime.
> >
> > > So, after some trial and error and a lot of Googling, I learned that I
> > > must include all rows from the PRIMARY KEY from left to right in my
> > > query. Thus, this is the "best" I can get to work:
> > >
> > >     SELECT *
> > >     FROM "tagdata.central"
> > >     WHERE "signalid" = 4002
> > >       AND "monthyear" = 201908
> > >     ORDER BY "fromtime" DESC
> > >     LIMIT 10;
> > >
> > > The "monthyear" column, I crafted like a fool by incrementing the date
> > > one month after another until no results could be found anymore.
> > > The "signalid" I grabbed from one of the unrestricted "SELECT * FROM" -
> > > query results. But these can't be as easily guessed as the "monthyear"
> > > values could.
> > >
> > > This is where I'm stuck!
> > >
> > > 1. This does not really feel like the ideal way to go. I think there is
> > > something more mature in modern IT systems. Can anyone tell me what is a
> > > better way to get this information?
> >
> > You can denormalize. Because Cassandra allows you to have very large
> > clusters, you can make multiple tables sorted in different ways to
> > enable the queries you need to run. Normal data modeling is to build
> > tables based on the SELECT statements you need to do (unless you're very
> > advanced, in which case you do it based on the transaction semantics of
> > the INSERT/UPDATE statements, but that's probably not you).
> >
> > Or you can use a more flexible database.
> >
> > > 2. I need a way to learn all values that are in the "monthyear" and
> > > "signalid" columns in order to be able to craft that query.
> > > How can I achieve that in a reasonable way? As I said: the DB is round
> > > about 260 GB, which makes it next to impossible to just "have a look" at
> > > the output of "SELECT *"...
> >
> > You probably want to keep another table of monthyear + signalid pairs.
--
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
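The "keep another table of monthyear + signalid pairs" suggestion can be sketched in CQL (table and column names here follow the thread but are only illustrative; the application would maintain this table alongside each write to the main table):

```cql
CREATE TABLE tagdata_partitions (
    signalid  bigint,
    monthyear int,
    PRIMARY KEY (signalid, monthyear)
);

-- written by the application together with each insert into the main table:
INSERT INTO tagdata_partitions (signalid, monthyear) VALUES (4002, 201908);

-- then the existing (signalid, monthyear) pairs can be enumerated cheaply,
-- instead of guessing monthyear values one by one:
SELECT monthyear FROM tagdata_partitions WHERE signalid = 4002;
```

Because the INSERT is an upsert, writing the same pair repeatedly is harmless, so no read-before-write is needed to maintain the lookup table.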
Re: Tool for schema upgrades
Hi

Look at https://github.com/patka/cassandra-migration - it should be good.

P.S. Here is the list of tools that I have assembled over the years:

- https://github.com/hhandoko/cassandra-migration
- https://github.com/Contrast-Security-OSS/cassandra-migration
- https://github.com/juxt/joplin
- https://github.com/o19s/trireme
- https://github.com/golang-migrate/migrate
- https://github.com/Cobliteam/cassandra-migrate
- https://github.com/patka/cassandra-migration
- https://github.com/comeara/pillar

On Thu, Oct 8, 2020 at 5:45 PM Paul Chandler wrote:

> Hi all,
>
> Can anyone recommend a tool to perform schema DDL upgrades that follows
> best practice to ensure you don't get schema mismatches when running
> multiple upgrade statements in one migration?
>
> Thanks
>
> Paul

--
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Re: tombstones - however there are no deletes
Btw, if you see a number of tombstones that is a multiple of the number of scanned rows, like in your case - that's an explicit signal of either null inserts or non-frozen collections...

On Fri 21. Aug 2020 at 20:21, Attila Wind wrote:

> right! silly me (regarding "can't have null for clustering column") :-)
> OK, the code is modified, we stopped using NULL on that column. In a few
> days we will see if this was the cause.
> Thanks for the useful info everyone! Helped a lot!
>
> Attila Wind
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
>
> On 21.08.2020 11:04, Alex Ott wrote:
>> inserting null for any column will generate a tombstone (and you can't
>> have null for a clustering column, except the case when it's an empty
>> partition with a static column).
>> if you're really inserting new data, not overwriting existing data -
>> use UNSET instead of null
>>
>> On Fri, Aug 21, 2020 at 10:45 AM Attila Wind wrote:
>>> Thanks a lot! I will process every pointer you gave - appreciated!
>>>
>>> 1. we do have a collection column in that table but that is (we have
>>> only 1 column) a frozen Map - so I guess "Tombstones are also implicitly
>>> created any time you insert or update a row which has an (unfrozen)
>>> collection column: list<>, map<> or set<>. This has to be done in order
>>> to ensure the new write replaces any existing collection entries." does
>>> not really apply here
>>>
>>> 2. "Isn't it so that explicitly setting a column to NULL also results
>>> in a tombstone"
>>> Is this true for all columns? or just clustering key cols?
>>> Because if for all cols (which would make sense maybe to me more) then
>>> we found the possible reason.. :-)
>>> As we do have an Integer column there which is actually NULL often (and
>>> so far in all cases)
>>>
>>> Attila Wind
>>>
>>> On 21.08.2020 09:49, Oleksandr Shulgin wrote:
>>>> On Fri, Aug 21, 2020 at 9:43 AM Tobias Eriksson wrote:
>>>>> Isn't it so that explicitly setting a column to NULL also results in
>>>>> a tombstone?
>>>> True, thanks for pointing that out!
>>>>> Then, as mentioned, the use of list, set, map can also result in
>>>>> tombstones. See
>>>>> https://www.instaclustr.com/cassandra-collections-hidden-tombstones-and-how-to-avoid-them/
>>>> And A. Ott has already mentioned both these possible reasons :-)
>>>>
>>>> --
>>>> Alex

--
With best wishes,
Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Re: tombstones - however there are no deletes
inserting null for any column will generate a tombstone (and you can't have null for a clustering column, except the case when it's an empty partition with a static column). if you're really inserting new data, not overwriting existing data - use UNSET instead of null

On Fri, Aug 21, 2020 at 10:45 AM Attila Wind wrote:
> Thanks a lot! I will process every pointer you gave - appreciated!
>
> 1. we do have a collection column in that table but it is (we have only 1
> column) a frozen Map - so I guess "Tombstones are also implicitly created
> any time you insert or update a row which has an (unfrozen) collection
> column: list<>, map<> or set<>. This has to be done in order to ensure the
> new write replaces any existing collection entries." does not really apply
> here
>
> 2. "Isn’t it so that explicitly setting a column to NULL also results in a
> tombstone"
> Is this true for all columns? or just clustering key cols?
> Because if for all cols (which would maybe make more sense to me) then we
> found the possible reason.. :-)
> As we do have an Integer column there which is actually NULL often (and so
> far in all cases)
>
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
>
> On 21.08.2020 09:49, Oleksandr Shulgin wrote:
> On Fri, Aug 21, 2020 at 9:43 AM Tobias Eriksson <
> tobias.eriks...@qvantel.com> wrote:
>> Isn’t it so that explicitly setting a column to NULL also results in a
>> tombstone
> True, thanks for pointing that out!
>> Then as mentioned the use of list, set, map can also result in tombstones
>> See
>> https://www.instaclustr.com/cassandra-collections-hidden-tombstones-and-how-to-avoid-them/
> And A. Ott has already mentioned both these possible reasons :-)
>
> --
> Alex

-- With best wishes, Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian)
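The null-vs-unset difference can be sketched in CQL (the table and column names here are made up for illustration): binding an explicit null writes a tombstone cell, while simply omitting the column writes nothing at all.

```sql
-- hypothetical table, for illustration only
CREATE TABLE demo.users (id int PRIMARY KEY, age int);

-- writes a tombstone cell for "age"
INSERT INTO demo.users (id, age) VALUES (1, null);

-- leaves "age" untouched: no cell written, no tombstone
INSERT INTO demo.users (id) VALUES (2);
```

With prepared statements the equivalent is leaving the bound value unset rather than binding null - e.g. `BoundStatement.unset()` in the Java driver 3.x (requires native protocol v4, i.e. Cassandra 2.2+).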
Re: tombstones - however there are no deletes
Tombstones are not generated only by deletes. They also appear when:

- you insert or fully update a non-frozen collection - e.g. replacing the whole value of the column with another value, like UPDATE table SET field = new_value ... - Cassandra inserts a range tombstone marker to remove any previous collection entries, even if no data previously existed. A large number of tombstones can significantly affect read performance.
- you insert null explicitly, instead of using UNSET for missing data.

On Fri, Aug 21, 2020 at 7:57 AM Attila Wind wrote:
> Hi Cassandra Gurus,
>
> Recently I captured a very interesting warning in the logs saying
>
> 2020-08-19 08:08:32.492
> [cassandra-client-keytiles_data_webhits-nio-worker-2] WARN
> com.datastax.driver.core.RequestHandler - Query '[3 bound values] select *
> from visit_session_by_start_time_v4 where container_id=? and
> first_action_time_frame_id >= ? and first_action_time_frame_id <= ?;'
> generated server side warning(s):
> *Read 6628 live rows and 6628 tombstone cells* for query SELECT * FROM
> keytiles_data_webhits.visit_session_by_start_time_v4 WHERE container_id =
> 5YzsPfE2Gcu8sd-76626 AND first_action_time_frame_id > 443837 AND
> first_action_time_frame_id <= 443670 AND user_agent_type >
> browser-mobile AND unique_webclient_id >
> 045d1683-c702-48bd-9d2b-dcf1ca87ac7c AND first_action_ts >
> 1597815766 LIMIT 6628 (see tombstone_warn_threshold)
>
> What makes this interesting to me is the fact that we never issue any kind
> of deletes against this table for now - not even row-level deletes.
> So I'm wondering what can result in tombstone creation in Cassandra -
> apart from explicit DELETE queries and TTL setup...
>
> My suspicion is (but I'm not sure) that as we are going with a "select *"
> read strategy, then calculate everything in-memory, eventually writing back
> with kinda "update *" queries to Cassandra in this table (so not updating
> just a few columns but everything), this can lead to these...
Can it?
> I tried to search around this symptom but was not successful - so I decided
> to ask you guys, maybe someone can give us a pointer...
>
> Some more info:
>
> - the table does not have TTL set - this mechanism is turned off
> - the LIMIT param in the query above comes from the paging size
> - we are using Cassandra 4 alpha3
> - we also have a few similarly built tables where we follow the above
>   described "update *" policy on the write path - however those tables are
>   counter tables... when we mass-read them into memory we also go with
>   "select *" logic, reading up tons of rows. The point is we never saw such
>   a warning for these counter tables, although we handle them the same
>   way... OK, counter tables work differently, but it's still interesting to
>   me why those never generated things like this
>
> thanks!
> --
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932

-- With best wishes, Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian)
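The non-frozen-collection case can be sketched in CQL (table names are illustrative): a full overwrite of a non-frozen map first deletes the old collection with a range tombstone - even on the very first write - whereas per-entry updates and frozen collections do not.

```sql
-- hypothetical tables, for illustration only
CREATE TABLE demo.t1 (id int PRIMARY KEY, attrs map<text, text>);          -- non-frozen
CREATE TABLE demo.t2 (id int PRIMARY KEY, attrs frozen<map<text, text>>);  -- frozen

-- replaces the whole collection: Cassandra writes a range tombstone
-- for "attrs" before the new entries, even if no prior data existed
UPDATE demo.t1 SET attrs = {'a': '1'} WHERE id = 1;

-- updates a single entry without deleting the collection: no tombstone
UPDATE demo.t1 SET attrs['a'] = '1' WHERE id = 1;

-- frozen map: stored as one cell, overwritten without a tombstone
UPDATE demo.t2 SET attrs = {'a': '1'} WHERE id = 1;
```

This is why a tombstone count that exactly matches the scanned row count points at full-row "update *" writes of a collection column or explicit nulls, not at deletes.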
Re: Understanding replication
Data is always written to all replicas in the cluster. Here is a good diagram of how this happens in a multi-DC cluster: https://docs.datastax.com/en/dse/6.7/dse-arch/datastax_enterprise/dbInternals/dbIntClientRequestsMultiDCWrites.html

Regarding the second question - theoretically, yes, it's possible, but if both writes were done for the same primary key, they will be resolved via the data timestamp (last write wins).

On Sun, Sep 20, 2020 at 5:31 PM Jai Bheemsen Rao Dhanwada < jaibheem...@gmail.com> wrote:
> Hello,
>
> I have a question regarding multi-datacenter replication.
> In a multi-datacenter setup (dc-1, dc-2), if two records are written into
> dc-1, is there a guarantee that these two records replicate to dc-2 in the
> same order, or is there a possibility that the second insert replicates
> faster than the first insert? (Assuming the first record is bigger than the
> second record in terms of packet size.)

-- With best wishes, Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian)
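The timestamp-based resolution can be sketched in CQL (keyspace, values and timestamps are illustrative): whichever write carries the higher timestamp wins on every replica, regardless of the order in which the writes arrive.

```sql
-- hypothetical table, for illustration only
CREATE TABLE demo.events (id int PRIMARY KEY, payload text);

-- two writes to the same primary key; the one with the higher
-- timestamp wins on every replica, whatever order they arrive in
INSERT INTO demo.events (id, payload) VALUES (1, 'first')  USING TIMESTAMP 1600000000000001;
INSERT INTO demo.events (id, payload) VALUES (1, 'second') USING TIMESTAMP 1600000000000002;

SELECT payload FROM demo.events WHERE id = 1;  -- 'second'
```

So per-key ordering is safe under last-write-wins, but there is no cross-key ordering guarantee between the two records in the question.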
Re: reverse paging state
Hi

For that version of the driver there is no built-in functionality for backward paging, although it's doable: https://stackoverflow.com/questions/50168236/cassandra-pagination-inside-partition/50172052#50172052

For driver 4.9.0 there is a wrapper class that emulates random paging, with a performance tradeoff: https://docs.datastax.com/en/developer/java-driver/4.9/manual/core/paging/#offset-queries

On Fri, Oct 23, 2020 at 10:00 AM Manu Chadha wrote:
> In Java driver 3.4.0, how does one revert the order of paging? I want to
> implement a “Back” button but I can’t figure out from the API docs if/how I
> can make Cassandra (via the Java driver) search backwards.
>
> https://docs.datastax.com/en/drivers/java/3.4/
>
> The code I have written currently is
>
> session.execute(whereClause
>     .setFetchSize(fetchSize)
>     .setPagingState(pagingState))
>
> Thanks
> Manu
> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

-- With best wishes, Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian)
Re: Cqlsh copy command on a larger data set
CQLSH definitely won't work for that amount of data, so you need to use other tools. But before selecting one, you need to define the requirements. For example:

1. Are you copying the data into tables with exactly the same structure?
2. Do you need to preserve metadata, like writetime & TTL?

Depending on that, you have the following choices:

- use sstableloader - it will preserve all metadata, like TTL and writetime. You just need to copy the SSTable files, or stream directly from the source cluster. But this requires copying the data into tables with exactly the same structure (and in the case of UDTs, the keyspace names should be the same)
- use DSBulk - it's a very effective tool for unloading & loading data from/to Cassandra/DSE. Use zstd compression for the offloaded data to save disk space (see the blog links below for more details). But preserving metadata could be a problem.
- use Spark + the Spark Cassandra Connector. But here, too, preserving the metadata is not an easy task, and requires programming to handle all the edge cases (see https://datastax-oss.atlassian.net/browse/SPARKC-596 for details)

Blog series on DSBulk:

- https://www.datastax.com/blog/2019/03/datastax-bulk-loader-introduction-and-loading
- https://www.datastax.com/blog/2019/04/datastax-bulk-loader-more-loading
- https://www.datastax.com/blog/2019/04/datastax-bulk-loader-common-settings
- https://www.datastax.com/blog/2019/06/datastax-bulk-loader-unloading
- https://www.datastax.com/blog/2019/07/datastax-bulk-loader-counting
- https://www.datastax.com/blog/2019/12/datastax-bulk-loader-examples-loading-other-locations

On Tue, Jul 14, 2020 at 1:47 AM Jai Bheemsen Rao Dhanwada < jaibheem...@gmail.com> wrote:
> Hello,
>
> I would like to copy some data from one cassandra cluster to another
> cassandra cluster using the CQLSH copy command. Is this a good approach
> if the dataset size on the source cluster is very high (500G - 1TB)? If not,
> what is the safe approach? And are there any limitations/known issues to
> keep in mind before attempting this?

-- With best wishes, Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian)
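As a rough sketch of the two tool-based options (host names and paths are made up; check each tool's documentation for the exact flags, and note that sstableloader expects the files laid out in a <keyspace>/<table>/ directory):

```shell
# sstableloader: stream SSTable files into the target cluster; preserves
# writetime/TTL, but requires an identical table structure on the target.
# Copy the snapshot files into a <keyspace>/<table>/ directory first.
sstableloader -d target-node1,target-node2 /backup/my_ks/my_table/

# DSBulk: unload to compressed CSV, then load into the target cluster
# (does NOT preserve writetime/TTL by default)
dsbulk unload -h source-node -k my_ks -t my_table \
    --connector.csv.compression zstd -url /backup/my_table
dsbulk load -h target-node -k my_ks -t my_table \
    --connector.csv.compression zstd -url /backup/my_table
```

For a 500G-1TB dataset, sstableloader avoids re-serializing every row through CQL, which is usually the deciding factor.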
Re: Impact of enabling authentication on performance
You can decrease the time before a change is picked up by using lower values for credentials_update_interval_in_ms, roles_update_interval_in_ms & permissions_update_interval_in_ms.

Durity, Sean R at "Tue, 2 Jun 2020 14:48:28 +" wrote:
DSR> To flesh this out a bit, I set roles_validity_in_ms and permissions_validity_in_ms to
DSR> 360 (10 minutes). The default of 2000 is far too often for my use cases. Usually I set
DSR> the RF for system_auth to 3 per DC. On a larger, busier cluster I have set it to 6 per
DSR> DC. NOTE: if you set the validity higher, it may take that amount of time before a change
DSR> in password or table permissions is picked up (usually less).
DSR> Sean Durity
DSR> -Original Message-
DSR> From: Jeff Jirsa
DSR> Sent: Tuesday, June 2, 2020 2:39 AM
DSR> To: user@cassandra.apache.org
DSR> Subject: [EXTERNAL] Re: Impact of enabling authentication on performance
DSR> Set the auth cache to a long validity
DSR> Don’t go crazy with the RF of system_auth
DSR> Drop bcrypt rounds if you see massive CPU spikes on reconnect storms
>> On Jun 1, 2020, at 11:26 PM, Gil Ganz wrote:
>>
>> Hi
>> I have a production 3.11.6 cluster which I might want to enable
>> authentication in, and I'm trying to understand what the performance
>> impact will be, if any.
>> I understand each use case might be different, but I'm trying to understand
>> if there is a common % performance hit people usually see, or if someone
>> has looked into this.
>> Gil

-- With best wishes, Alex Ott Principal Architect, DataStax http://datastax.com/

- To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For additional commands, e-mail: user-h...@cassandra.apache.org
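The knobs discussed above live in cassandra.yaml; a sketch with illustrative values (the right numbers depend on how quickly permission changes must propagate versus how much load on system_auth the cluster can absorb):

```yaml
# How long cached credentials/roles/permissions stay valid before a
# blocking re-read from system_auth (illustrative: 10 minutes)
credentials_validity_in_ms: 600000
roles_validity_in_ms: 600000
permissions_validity_in_ms: 600000

# Refresh cached entries asynchronously in the background after this
# interval, so client reads keep hitting the cache while it is updated
credentials_update_interval_in_ms: 60000
roles_update_interval_in_ms: 60000
permissions_update_interval_in_ms: 60000
```

With the update intervals set lower than the validity, changes propagate quickly without the reads ever blocking on system_auth.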
Re: Partition size, limits, recommendations for tables where all columns are part of the primary key
Hi

Yes, basically rows have no cells, as everything is in the partition key/clustering columns. You can always look into the data using sstabledump (this is for DSE 6.7 that I have running):

sstabledump ac-1-bti-Data.db
[
  {
    "partition" : {
      "key" : [ "977eb1f1-aa5b-11ea-b91a-db426f6f892c", "977ed900-aa5b-11ea-b91a-db426f6f892c" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 78,
        "clustering" : [ "test", "977ed901-aa5b-11ea-b91a-db426f6f892c" ],
        "liveness_info" : { "tstamp" : "2020-06-09T14:14:54.863249Z" },
        "cells" : [ ]
      }
    ]
  }
]

P.S. You can play with your schema and do some performance tests using https://github.com/nosqlbench/

On Tue, Jun 9, 2020 at 3:51 PM Benjamin Christenson < ben.christen...@kineticdata.com> wrote:
> Hello all, I am doing some data modeling and want to make sure that I
> understand some nuances of cell counts, partition sizes, and related
> recommendations. Am I correct in my understanding that tables for which
> every column is in the primary key will always have 0 cells?
>
> For example, using https://cql-calculator.herokuapp.com/, I tested the
> following table definition with 100 (1 million) rows per partition and
> an average value size of 255 bytes, and it returned that there were 0 cells
> and the partition took up 32 bytes total:
>
> CREATE TABLE IF NOT EXISTS widgets (
>     id timeuuid,
>     key_id timeuuid,
>     parent_id timeuuid,
>     value text,
>     PRIMARY KEY ((parent_id, key_id), value, id)
> )
>
> Obviously the total amount of disk space for this table must be more than
> 32 bytes. In this situation, how should I be reasoning about partition
> sizes (in terms of the 2B cell limit, and the 100MB-400MB partition size
> limit)? Additionally, are there other limits / potential performance
> issues I should be concerned about?
>
> Ben Christenson
> Developer
>
> Kinetic Data, Inc.
> Your business. Your process.
> 651-556-0937 | ben.christen...@kineticdata.com > www.kineticdata.com | community.kineticdata.com > > -- With best wishes,Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian)
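For reasoning about real partition sizes on disk (rather than theoretical cell counts), the usual approach is to measure on a node holding representative data - a sketch, assuming the keyspace is named my_ks as in the question's table:

```shell
# distribution of partition sizes and cell counts, plus read/write latencies
nodetool tablehistograms my_ks widgets

# per-table stats, including "Compacted partition maximum bytes"
# and "Compacted partition mean bytes"
nodetool tablestats my_ks.widgets
```

The 100MB-400MB guidance applies to the on-disk partition size these commands report, so they are the practical check regardless of how the cells are counted.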
Re: what is allowed and not allowed w.r.t altering cassandra table schema
Hi

This is quite a big topic - maybe it should be the subject of a blog post. I've spent some time working with customers on this, so here is my TLDR:

- You can add regular columns (not part of the primary key) to a table. You must not add a column with the same name as a previously dropped column - this will lead to errors in commit log replay, and to problems with data in existing SSTables.
- You can drop regular columns, with some limitations:
  - you can't drop columns that are part of the primary key (if you need to do this, see the next section);
  - it's not possible to drop columns on tables that have materialized views, secondary or search indexes - if you still need to drop the column, you need to drop the view or index, and re-create it after the change;
  - if you drop a column and then re-add it, DSE does not restore the values written before the column was dropped.
- It's not possible to change a column's type - this functionality did exist in some versions, but only for "compatible" data types, and it was removed from Cassandra as part of CASSANDRA-12443, due to scalability problems. Don't try to drop the column & re-add it with a new type - you'll get corrupt data. An actual change of a column's type should be done by:
  - adding a new column with the desired type
  - running migration code that copies data from the existing column
  - dropping the old column
  To support process continuity, applications may work with both columns (read & write) during the migration, and use only the new one after the migration has happened.
- You can rename a column that is part of the primary key, but not a regular column.
- You can't change the primary key - you need to create a new table with the desired primary key, and migrate the data into it.
- You can add a new field to a UDT, and you can rename a field in a UDT, but you can't drop one.

There are some tools to support schema evolution, for example: https://github.com/hhandoko/cassandra-migration

On Wed, Jul 15, 2020 at 8:39 PM Manu Chadha wrote:
> Hi
>
> What is allowed and not allowed w.r.t. altering a Cassandra table schema?
>
> Creating the right schema seems like the most important step w.r.t. using
> Cassandra. Coming from a relational background, I still struggle to create
> a schema which leverages duplication and per-query tables (I end up creating
> relationships between tables).
>
> Even if I am able to create a schema which doesn't have relationships
> between tables for now, my application will evolve in future and I might
> then have to change the schema to avoid creating relationships.
>
> In that respect, what would I be able to change and not change in a
> schema? If I add a new field (non-key), then for existing values I suppose
> the value of that new field will be null/empty. But if I am allowed to make
> the new field an additional key, then what will be the value of this key for
> existing data?
>
> Thanks
> Manu
>
> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

-- With best wishes, Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian)
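The type-change procedure above can be sketched in CQL (table and column names are illustrative; the copy step has to be driven by application code or a tool such as DSBulk or Spark, since CQL has no cross-column UPDATE ... SELECT):

```sql
-- hypothetical: "amount" is an int that needs to become a bigint
ALTER TABLE my_ks.orders ADD amount_v2 bigint;

-- application-side migration: for each row, read "amount" and write it
-- back into "amount_v2" - no single CQL statement can do this in bulk
-- UPDATE my_ks.orders SET amount_v2 = ? WHERE id = ?;

-- once all data is copied and the application only reads amount_v2:
ALTER TABLE my_ks.orders DROP amount;
```

Keeping the application reading from both columns (preferring the new one) until the copy finishes is what makes the migration safe to run online.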
Re: Cqlsh copy command on a larger data set
If you didn't export the TTL explicitly, and didn't load it back, then you'll end up with non-expiring data.

On Thu, Jul 16, 2020 at 7:48 PM Jai Bheemsen Rao Dhanwada < jaibheem...@gmail.com> wrote:
> I tried to verify the metadata. In the case of writetime it is set to the
> insert time, but the TTL value is showing as null. Is this expected? Does
> this mean this record will never expire after the insert?
> Is there any alternative to preserve the TTL?
>
> In the new table, inserted with cqlsh and DSBulk:
> cqlsh > SELECT ttl(secret) from ks_blah.cf_blah ;
>
> ttl(secret)
> --
> null
> null
>
> (2 rows)
>
> In the old table, where the data was written from the application:
>
> cqlsh > SELECT ttl(secret) from ks_old.cf_old ;
>
> ttl(secret)
>
> 4517461
> 4525958
>
> (2 rows)
>
> On Wed, Jul 15, 2020 at 1:17 PM Jai Bheemsen Rao Dhanwada <
> jaibheem...@gmail.com> wrote:
>> thank you
>>
>> On Wed, Jul 15, 2020 at 1:11 PM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>> Alex is referring to the "writetime" and "ttl" values for each cell.
>>> Most tools copy via CQL writes and don't by default copy those previous
>>> writetime and ttl values, and instead just give a new writetime value
>>> which matches the copy time rather than the initial insert time.
>>>
>>> On Wed, Jul 15, 2020 at 3:01 PM Jai Bheemsen Rao Dhanwada <
>>> jaibheem...@gmail.com> wrote:
>>>> Hello Alex,
>>>>
>>>> - use DSBulk - it's a very effective tool for unloading & loading
>>>> data from/to Cassandra/DSE. Use zstd compression for the offloaded data
>>>> to save disk space (see the blog links below for more details). But
>>>> *preserving metadata* could be a problem.
>>>>
>>>> Here, what exactly do you mean by "preserving metadata"? Would you
>>>> mind explaining?
>>>> >>>> On Tue, Jul 14, 2020 at 8:50 AM Jai Bheemsen Rao Dhanwada < >>>> jaibheem...@gmail.com> wrote: >>>> >>>>> Thank you for the suggestions >>>>> >>>>> On Tue, Jul 14, 2020 at 1:42 AM Alex Ott wrote: >>>>> >>>>>> CQLSH definitely won't work for that amount of data, so you need to >>>>>> use other tools. >>>>>> >>>>>> But before selecting them, you need to define requirements. For >>>>>> example: >>>>>> >>>>>>1. Are you copying the data into tables with exactly the same >>>>>>structure? >>>>>>2. Do you need to preserve metadata, like, writetime & TTL? >>>>>> >>>>>> Depending on that, you may have following choices: >>>>>> >>>>>>- use sstableloader - it will preserve all metadata, like, ttl >>>>>>and writetime. You just need to copy SSTable files, or stream >>>>>> directly from >>>>>>the source cluster. But this will require copying of data into >>>>>> tables with >>>>>>exactly same structure (and in case of UDTs, the keyspace names >>>>>> should be >>>>>>the same) >>>>>>- use DSBulk - it's a very effective tool for unloading & loading >>>>>>data from/to Cassandra/DSE. Use zstd compression for offloaded data >>>>>> to save >>>>>>disk space (see blog links below for more details). But the >>>>>> preserving >>>>>>metadata could be a problem. >>>>>>- use Spark + Spark Cassandra Connector. 
But also, preserving the >>>>>>metadata is not an easy task, and requires programming to handle all >>>>>> edge >>>>>>cases (see https://datastax-oss.atlassian.net/browse/SPARKC-596 >>>>>>for details) >>>>>> >>>>>> >>>>>> blog series on DSBulk: >>>>>> >>>>>>- >>>>>> >>>>>> https://www.datastax.com/blog/2019/03/datastax-bulk-loader-introduction-and-loading >>>>>>- >>>>>> >>>>>> https://www.datastax.com/blog/2019/04/datastax-bulk-loader-more-loading >>>>>>- >>>>>> >>>>>> https://www.datastax.com/blog/2019/04/datastax-bulk-loader-common-settings >>>>>>- >>>>>>https://www.datastax.com/blog/2019/06/datastax-bulk-loader-unloading >>>>>>- >>>>>>https://www.datastax.com/blog/2019/07/datastax-bulk-loader-counting >>>>>>- >>>>>> >>>>>> https://www.datastax.com/blog/2019/12/datastax-bulk-loader-examples-loading-other-locations >>>>>> >>>>>> >>>>>> On Tue, Jul 14, 2020 at 1:47 AM Jai Bheemsen Rao Dhanwada < >>>>>> jaibheem...@gmail.com> wrote: >>>>>> >>>>>>> Hello, >>>>>>> >>>>>>> I would like to copy some data from one cassandra cluster to another >>>>>>> cassandra cluster using the CQLSH copy command. Is this the good >>>>>>> approach >>>>>>> if the dataset size on the source cluster is very high(500G - 1TB)? If >>>>>>> not >>>>>>> what is the safe approach? and are there any limitations/known issues to >>>>>>> keep in mind before attempting this? >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> With best wishes,Alex Ott >>>>>> http://alexott.net/ >>>>>> Twitter: alexott_en (English), alexott (Russian) >>>>>> >>>>> -- With best wishes,Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian)
Re: Cqlsh copy command on a larger data set
Look into the series of blog posts that I sent - I think it should be in the 4th post.

On Thu, Jul 16, 2020 at 8:27 PM Jai Bheemsen Rao Dhanwada < jaibheem...@gmail.com> wrote:
> okay, is there a way to export the TTL using cqlsh or DSBulk?
>
> On Thu, Jul 16, 2020 at 11:20 AM Alex Ott wrote:
>> if you didn't export the TTL explicitly, and didn't load it back, then
>> you'll end up with non-expiring data.
>>
>> On Thu, Jul 16, 2020 at 7:48 PM Jai Bheemsen Rao Dhanwada <
>> jaibheem...@gmail.com> wrote:
>>> I tried to verify the metadata. In the case of writetime it is set to
>>> the insert time, but the TTL value is showing as null. Is this expected?
>>> Does this mean this record will never expire after the insert?
>>> Is there any alternative to preserve the TTL?
>>>
>>> In the new table, inserted with cqlsh and DSBulk:
>>> cqlsh > SELECT ttl(secret) from ks_blah.cf_blah ;
>>>
>>> ttl(secret)
>>> --
>>> null
>>> null
>>>
>>> (2 rows)
>>>
>>> In the old table, where the data was written from the application:
>>>
>>> cqlsh > SELECT ttl(secret) from ks_old.cf_old ;
>>>
>>> ttl(secret)
>>>
>>> 4517461
>>> 4525958
>>>
>>> (2 rows)
>>>
>>> On Wed, Jul 15, 2020 at 1:17 PM Jai Bheemsen Rao Dhanwada wrote:
>>>> thank you
>>>>
>>>> On Wed, Jul 15, 2020 at 1:11 PM Russell Spitzer wrote:
>>>>> Alex is referring to the "writetime" and "ttl" values for each cell.
>>>>> Most tools copy via CQL writes and don't by default copy those
>>>>> previous writetime and ttl values, and instead just give a new
>>>>> writetime value which matches the copy time rather than the initial
>>>>> insert time.
>>>>>
>>>>> On Wed, Jul 15, 2020 at 3:01 PM Jai Bheemsen Rao Dhanwada wrote:
>>>>>> Hello Alex,
>>>>>>
>>>>>> - use DSBulk - it's a very effective tool for unloading & loading
>>>>>> data from/to Cassandra/DSE.
Use zstd compression for offloaded data >>>>>> to save >>>>>>disk space (see blog links below for more details). But the >>>>>> *preserving >>>>>>metadata* could be a problem. >>>>>> >>>>>> Here what exactly do you mean by "preserving metadata" ? would you >>>>>> mind explaining? >>>>>> >>>>>> On Tue, Jul 14, 2020 at 8:50 AM Jai Bheemsen Rao Dhanwada < >>>>>> jaibheem...@gmail.com> wrote: >>>>>> >>>>>>> Thank you for the suggestions >>>>>>> >>>>>>> On Tue, Jul 14, 2020 at 1:42 AM Alex Ott wrote: >>>>>>> >>>>>>>> CQLSH definitely won't work for that amount of data, so you need to >>>>>>>> use other tools. >>>>>>>> >>>>>>>> But before selecting them, you need to define requirements. For >>>>>>>> example: >>>>>>>> >>>>>>>>1. Are you copying the data into tables with exactly the same >>>>>>>>structure? >>>>>>>>2. Do you need to preserve metadata, like, writetime & TTL? >>>>>>>> >>>>>>>> Depending on that, you may have following choices: >>>>>>>> >>>>>>>>- use sstableloader - it will preserve all metadata, like, ttl >>>>>>>>and writetime. You just need to copy SSTable files, or stream >>>>>>>> directly from >>>>>>>>the source cluster. But this will require copying of data into >>>>>>>> tables with >>>>>>>>exactly same structure (and in case of UDTs, the keyspace names >>>>>>>> should be >>>>>>>>the same) >>>>>>>>- use DSBulk - it's a very effective tool for unloading & >>>>>>>>loading data from/to Cassandra/DSE. Use zstd compression for >>>
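A sketch of exporting and re-importing TTL and writetime with DSBulk, using the table names from the thread (the exact option syntax is covered in the DSBulk "unloading" blog post referenced above; treat the hosts and paths here as placeholders):

```shell
# unload the data together with its ttl and writetime, via a custom query
dsbulk unload -h source-node \
    -query "SELECT id, secret, ttl(secret) AS secret_ttl, writetime(secret) AS secret_wt FROM ks_old.cf_old" \
    -url /backup/cf_old

# load it back, applying the exported ttl/writetime per row
dsbulk load -h target-node \
    -query "INSERT INTO ks_blah.cf_blah (id, secret) VALUES (:id, :secret) USING TTL :secret_ttl AND TIMESTAMP :secret_wt" \
    -url /backup/cf_old
```

Note that ttl() and writetime() are per-cell, so tables with many non-key columns need one exported pair per column that must be preserved.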
Re: Use NetworkTopologyStrategy for single data center and add data centers later
If you're planning to have another DC, then it's better to start using NetworkTopologyStrategy from the beginning - just specify the one DC, and when you get another one, it will be simple to expand to it (see the documentation: https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsAddDCToCluster.html). One caveat: ALTER KEYSPACE replaces the whole replication map rather than merging into it, so when adding dc2 you need to list dc1 again as well, not just dc2. When adding a new DC, for the system keyspaces you can use the following script to perform the adjustments: https://github.com/DataStax-Toolkit/cassandra-dse-helper-scripts/tree/master/adjust-keyspaces (it can be used for non-system keyspaces as well)

On Sat, Dec 19, 2020 at 10:21 AM Manu Chadha wrote:
> Is it possible to use NetworkTopologyStrategy when creating a keyspace and
> add data centers later?
>
> I am just starting with an MVP application and I don't expect much
> traffic or data. Thus I have created only one data center. However, I'd
> like to add more data centers later if needed.
>
> I notice that the replication factor for each data center needs to be
> specified at the time of keyspace creation
>
> CREATE KEYSPACE "Excalibur"
> WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};
>
> As I only have dc1 at the moment, could I just do
>
> CREATE KEYSPACE "Excalibur"
> WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};
>
> and when I have another datacenter, say dc2, could I edit the Excalibur
> keyspace?
>
> ALTER KEYSPACE "Excalibur"
> WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc2' : 2};
>
> or can I start with SimpleStrategy now and change to
> NetworkTopologyStrategy later? I suspect this might not work as I think
> this needs changing the snitch etc.
>
> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

-- With best wishes, Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian)
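A sketch of that evolution in CQL, using the keyspace name from the question (note that the replication map given to ALTER KEYSPACE must list every DC, since it replaces the previous map rather than merging with it):

```sql
-- start with a single DC
CREATE KEYSPACE "Excalibur"
  WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

-- later, when dc2 exists: list BOTH DCs, then run
-- "nodetool rebuild -- dc1" on each new dc2 node to stream the data
ALTER KEYSPACE "Excalibur"
  WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2};
```

Altering with only 'dc2' in the map, as in the question, would silently drop the dc1 replicas.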
Re: Last stored value metadata table
What about using "per partition limit 1" on that table? On Tue, Nov 10, 2020 at 8:39 AM Gábor Auth wrote: > Hi, > > Short story: storing time series of measurements (key(name, timestamp), > value). > > The problem: get the list of the last `value` of every `name`. > > Is there a Cassandra friendly solution to store the last value of every > `name` in a separate metadata table? It will come with a lot of > tombstones... any other solution? :) > > -- > Bye, > Auth Gábor > -- With best wishes,Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian)
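A sketch of that approach, assuming a table shaped like the one described (names are illustrative): with name as the partition key and the timestamp as a descending clustering column, PER PARTITION LIMIT 1 (Cassandra 3.6+) returns the latest value of every name without a separate metadata table and its tombstones.

```sql
-- hypothetical time-series table, newest measurement first
CREATE TABLE demo.measurements (
    name  text,
    ts    timestamp,
    value double,
    PRIMARY KEY (name, ts)
) WITH CLUSTERING ORDER BY (ts DESC);

-- newest row of every partition, i.e. the last value of every name
SELECT name, ts, value FROM demo.measurements PER PARTITION LIMIT 1;
```

Note this is still a scan across all partitions, so it suits a moderate number of names; for very many names, a Spark job or per-name queries would scale better.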
Re: local read from coordinator
Token-aware policy doesn't work for token range queries (at least in the Java driver 3.x). You need to force the driver to do the reading using a specific token as the routing key. Here is a Java implementation of the token range scanning algorithm that Spark uses: https://github.com/alexott/cassandra-dse-playground/blob/master/driver-1.x/src/main/java/com/datastax/alexott/demos/TokenRangesScan.java

I'm not aware whether the Python driver is able to set the routing key explicitly, but a whitelist policy should help.

On Wed, Nov 11, 2020 at 7:03 AM Erick Ramirez wrote:
> Yes, use a token-aware policy so the driver will pick a coordinator where
> the token (partition) exists. Cheers!

-- With best wishes, Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian)
Re: Re: local read from coordinator
If you force the routing key, then the replica that owns the data will be selected as the coordinator.

On Wed, Nov 11, 2020 at 12:35 PM onmstester onmstester wrote:
> Thanx,
>
> But I'm OK with the coordinator part; actually I was looking for a kind of
> read CL to force reading from the coordinator only, with no connections to
> other nodes!
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
> Forwarded message
> From: Alex Ott
> To: "user"
> Date: Wed, 11 Nov 2020 11:28:56 +0330
> Subject: Re: local read from coordinator
> Forwarded message
>
> token-aware policy doesn't work for token range queries (at least in the
> Java driver 3.x). You need to force the driver to do the reading using a
> specific token as a routing key. Here is a Java implementation of the token
> range scanning algorithm that Spark uses:
> https://github.com/alexott/cassandra-dse-playground/blob/master/driver-1.x/src/main/java/com/datastax/alexott/demos/TokenRangesScan.java
>
> I'm not aware whether the Python driver is able to set the routing key
> explicitly, but a whitelist policy should help
>
> On Wed, Nov 11, 2020 at 7:03 AM Erick Ramirez wrote:
> Yes, use a token-aware policy so the driver will pick a coordinator where
> the token (partition) exists. Cheers!
>
> --
> With best wishes, Alex Ott
> http://alexott.net/
> Twitter: alexott_en (English), alexott (Russian)

-- With best wishes, Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian)
Re: local read from coordinator
Jeff, I was talking about the driver -> coordinator communication, not about where the data will be read from.

On Wed, Nov 11, 2020 at 3:24 PM Jeff Jirsa wrote:

> This isn't necessarily true, and cassandra has no coordinator-only
> consistency level to force this behavior.
>
> (The snitch is going to pick the best option for local_one reads, and any
> compactions or latency deviations from load will make it likely that
> another replica is chosen in practice.)
>
> On Nov 11, 2020, at 3:46 AM, Alex Ott wrote:
>
>> if you force the routing key, then the replica that owns the data will be
>> selected as coordinator.
>>
>> On Wed, Nov 11, 2020 at 12:35 PM onmstester onmstester wrote:
>>
>>> Thanx,
>>>
>>> But I'm OK with the coordinator part; actually I was looking for some
>>> kind of read CL to force reads from the coordinator only, with no other
>>> connections to other nodes!
>>>
>>> Sent using Zoho Mail <https://www.zoho.com/mail/>
>>>
>>> Forwarded message
>>> From: Alex Ott
>>> To: "user"
>>> Date: Wed, 11 Nov 2020 11:28:56 +0330
>>> Subject: Re: local read from coordinator
>>> Forwarded message
>>>
>>> A token-aware policy doesn't work for token range queries (at least in
>>> the Java driver 3.x). You need to force the driver to do the reading
>>> using a specific token as a routing key. Here is a Java implementation
>>> of the token range scanning algorithm that Spark uses:
>>> https://github.com/alexott/cassandra-dse-playground/blob/master/driver-1.x/src/main/java/com/datastax/alexott/demos/TokenRangesScan.java
>>>
>>> I'm not aware if the Python driver is able to set the routing key
>>> explicitly, but a whitelist policy should help.
>>>
>>> On Wed, Nov 11, 2020 at 7:03 AM Erick Ramirez wrote:
>>>
>>> Yes, use a token-aware policy so the driver will pick a coordinator
>>> where the token (partition) exists. Cheers!

--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
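The Spark-style token range scanning mentioned in the thread boils down to splitting the full Murmur3 token range into contiguous subranges and issuing one `SELECT ... WHERE token(pk) > ? AND token(pk) <= ?` query per subrange, each routed to the replica owning that range. A minimal sketch of the splitting step (plain arithmetic, not the linked Java code):

```python
# Murmur3 partitioner token bounds used by Cassandra.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def split_ring(n):
    """Split the full token range into n contiguous (start, end] subranges.
    A real implementation would instead start from the cluster's actual
    token ring so each subrange maps to a single replica set."""
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + total * i // n for i in range(n + 1)]
    bounds[-1] = MAX_TOKEN  # guard against integer-division shortfall
    return list(zip(bounds[:-1], bounds[1:]))
```

Each `(start, end]` pair then becomes the bind values of one token range query.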
Re: Which open source or free tool do you use to monitor cassandra clusters?
Look at https://github.com/datastax/metric-collector-for-apache-cassandra

On Wed, Jun 16, 2021 at 5:21 PM Surbhi Gupta wrote:

> Hi,
>
> Which open source or free tool do you use to monitor cassandra clusters
> that has features similar to OpsCenter?
>
> Thanks
> Surbhi

--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
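Per that project's README, the collector is enabled by attaching its agent jar to Cassandra's JVM via cassandra-env.sh; a sketch (the /opt/mcac path is an assumption - use wherever the release tarball was unpacked):

```shell
# cassandra-env.sh - append at the end; restart Cassandra afterwards.
# MCAC_ROOT is an assumed install location, not a predefined variable.
MCAC_ROOT=/opt/mcac
JVM_OPTS="$JVM_OPTS -javaagent:${MCAC_ROOT}/lib/datastax-mcac-agent.jar"
```

The agent then exposes metrics for scraping (e.g. by Prometheus, with the dashboards shipped in the same repository).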
Re: Changing num_tokens and migrating to 4.0
If the nodes are almost the same except for disk space, then giving them more tokens may make the situation worse - they will get more requests than the other nodes, and won't have the resources to process them. In Cassandra, disk size isn't the main "success" factor - it's memory, CPU, disk type (SSD), etc.

On Sat, Mar 20, 2021 at 5:26 PM Lapo Luchini wrote:

> Hi, thanks for the suggestions!
> I'll definitely migrate to 4.0 after all this is done, then.
>
> The old prod DC I fear can't suffer losing a node right now (a few nodes
> have their disks 70% full), but I can maybe find a third node for the new
> DC right away.
>
> BTW the new nodes have got 3× the disk space, but are not so much
> different regarding CPU and RAM: does it make any sense to give them a
> bit more num_tokens (maybe 20-30 instead of 16) than the rest of the old
> DC hosts, or do "asymmetrical" clusters lead to problems?
>
> No real need to do that anyway: moving from 6 nodes to (eventually) 8
> should be enough to lessen the load on the disks, and before more space
> is needed I will probably have more nodes.
>
> Lapo
>
> On 2021-03-20 16:23, Alex Ott wrote:
>> I personally would maybe go the following way (need to calculate how
>> many joins/decommissions there will be at the end):
>>
>> * Decommission one node from the prod DC
>> * Form a new DC from the two new machines and the decommissioned one
>> * Rebuild the DC from the existing one, make sure that repair finished, etc.
>> * Switch traffic
>> * Remove the old DC
>> * Add nodes from the old DC one by one into the new DC
>
> -----
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org

--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
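The "more requests" point follows from how vnodes distribute ownership: a node's expected share of the data (and hence of replica reads and writes) is roughly proportional to its num_tokens. A back-of-the-envelope sketch with the numbers discussed (node names are hypothetical):

```python
def expected_share(num_tokens_by_node):
    """Expected fraction of the ring each node owns, assuming random vnode
    placement: proportional to its share of the total token count."""
    total = sum(num_tokens_by_node.values())
    return {node: t / total for node, t in num_tokens_by_node.items()}

# Six existing nodes at num_tokens=16, two bigger new nodes bumped to 32.
cluster = {f"old{i}": 16 for i in range(1, 7)}
cluster.update({"new1": 32, "new2": 32})
shares = expected_share(cluster)
# Each new node now carries ~2x the load of an old node - fine for disk,
# but it needs ~2x the CPU/memory headroom too.
```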
Re: Changing num_tokens and migrating to 4.0
There are several things to consider here:

- You can't have a DC of two nodes with RF=3...
- Are you sure that the new DC will handle all the production traffic?
- If the new nodes are much more powerful than the others (memory/CPU/disk
  type), that could also cause unpredictable spikes when a request hits a
  "smaller" node.

I personally would maybe go the following way (need to calculate how many joins/decommissions there will be at the end):

- Decommission one node from the prod DC
- Form a new DC from the two new machines and the decommissioned one
- Rebuild the DC from the existing one, make sure that repair finished, etc.
- Switch traffic
- Remove the old DC
- Add nodes from the old DC one by one into the new DC

The upgrade to Cassandra 4.0 should be done either before or after that - you shouldn't do it while bootstrapping/decommissioning...

On Sat, Mar 20, 2021 at 4:09 PM Lapo Luchini wrote:

> I have a 6-node production cluster running 3.11.9 with the default
> num_tokens=256... which is fine, but I later discovered it is a bit of a
> hassle to repair, and it is probably better to lower that to 16.
>
> I'm adding two new nodes with much higher storage space and I was
> wondering which migration strategy is better.
>
> If I got it correct, I was thinking about this:
> 1. add the 2 new nodes as a new "temporary DC", with num_tokens=16 RF=3
> 2. repair it all, then test it a bit
> 3. switch production applications to "DC-temp"
> 4. drop the old 6-node DC
> 5. re-create it from scratch with num_tokens=16 RF=3
> 6. switch production applications to "main DC" again
> 7. drop "DC-temp", eventually integrating its nodes into "main DC"
>
> I'd also like to migrate from 3.11.9 to 4.0-beta2 (I'm running on
> FreeBSD, so those are the options): does it make sense to do it during
> the mentioned "num_tokens migration" (at step 1, or 5), or does it make
> more sense to do it as step 8, an in-place rolling upgrade of each of
> the 6 (or 8) nodes?
>
> Did I get it correctly?
> Can it be done "better"?
>
> Thanks in advance for any suggestion or correction!
>
> --
> Lapo Luchini
> l...@lapo.it

--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
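For reference, the decommission/rebuild steps above map to roughly these commands (a sketch only: the keyspace name `my_ks` and DC names `dc_old`/`dc_new` are placeholders, and replication must include the new DC before rebuild will stream anything):

```shell
# 1. On the node being moved out of the old DC:
nodetool decommission

# 2./3. After the new DC's nodes (num_tokens=16) have joined: extend
#       replication to the new DC, stream existing data on each new-DC
#       node, then repair.
cqlsh -e "ALTER KEYSPACE my_ks WITH replication =
  {'class': 'NetworkTopologyStrategy', 'dc_old': 3, 'dc_new': 3}"
nodetool rebuild -- dc_old
nodetool repair --full

# 4./5. Once clients have switched to the new DC, drop the old DC from
#       replication before decommissioning its nodes:
cqlsh -e "ALTER KEYSPACE my_ks WITH replication =
  {'class': 'NetworkTopologyStrategy', 'dc_new': 3}"
```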