Re: Issues, understanding how CQL works
Hi List,

I just wanted to share what I learned from my investigations in the meantime, including the solution to my initial questions:

1) "I need to know the latest "fromtime" line in the whole DB"
2) "I need a way to learn all values that are in the "monthyear" and "signalid" columns ..."

To answer 2) first: this turns out to be one single query:

    SELECT DISTINCT signalid, monthyear FROM central;

This provides a list consisting of exactly these value pairs, but all of them. With this list available, it is also possible to find the "highest" fromtime per pair with the following query:

    SELECT max(fromtime) AS maxfromtime FROM central WHERE signalid={signalid} AND monthyear={monthyear};

Wrapping this in a simple "for" loop in any programming language results in a comparatively short list, from which finding the "highest" fromtime integer is easy.

Thanks to everyone who helped me!

Best regards,
Marc Richter

On 22.04.20 17:57, Marc Richter wrote:
> [...]
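The "for" loop over the DISTINCT pairs described in the message above can be sketched in Python. The driver calls are an assumption (the DataStax cassandra-driver's `session.execute` API); the reduction step is plain Python and works on any iterable of values, so it can be checked without a cluster.

```python
# Sketch of the "DISTINCT pairs, then per-pair max(fromtime)" scheme from the
# thread. The driver usage (cassandra-driver) is an assumption; the reduction
# logic below is pure Python.

def global_max(per_pair_maxima):
    """Reduce the per-(signalid, monthyear) maxima to the single global maximum.
    Entries may be None when a pair returned no value; those are skipped."""
    values = [v for v in per_pair_maxima if v is not None]
    return max(values) if values else None

def find_latest_fromtime(session):
    """Run the two-step scheme against a live session (assumed driver API)."""
    pairs = session.execute("SELECT DISTINCT signalid, monthyear FROM tagdata.central")
    maxima = []
    for row in pairs:
        res = session.execute(
            "SELECT max(fromtime) AS maxfromtime FROM tagdata.central "
            "WHERE signalid=%s AND monthyear=%s",
            (row.signalid, row.monthyear),
        )
        maxima.append(res.one().maxfromtime)
    return global_max(maxima)

# The pure part can be checked without a cluster:
print(global_max([1588000000, None, 1590000000, 1589000000]))  # 1590000000
```

The same shape works for the most recent "insertdate" if that aggregate is selected per pair instead.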
Re: Issues, understanding how CQL works
Seems as if sstable2json is deprecated; see [1] and [2]. So, dsbulk [3] it is, I guess. I downloaded it and crafted the following command line from the docs [4] for my use case:

    $ ../dsbulk-1.5.0/bin/dsbulk unload -h '["MY_CASSANDRA_IP"]' \
        --driver.advanced.auth-provider.class PlainTextAuthProvider \
        -u cassandra -p MY_PASSWORD -k tagdata -t central -c json \
        --connector.json.compression gzip -url /path/to/big/storage

This results in multiple JSON files compressed with GZIP; that seems to be exactly what I needed in this case!

There's only one thing that I do not really understand: besides the GZIP archives, it also creates two logfiles. One of them (unload-errors.log) contains some Java stack traces, and I do not understand what those lines are supposed to say (added to pastebin to not render the mail unreadable): https://pastebin.com/WpYvqxAA

What are those lines supposed to tell me?

Best regards,
Marc Richter

[1] https://docs.datastax.com/en/cassandra-oss/2.2/cassandra/tools/toolsSSTable2Json.html
[2] https://issues.apache.org/jira/browse/CASSANDRA-9618
[3] https://downloads.datastax.com/#bulk-loader
[4] https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkRef.html

On 22.04.20 16:15, Marc Richter wrote:
> [...]
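Once the dsbulk unload above has produced its gzip-compressed JSON files, the most recent "insertdate" can be found offline with a short script. This is a sketch: it assumes (as dsbulk's json connector writes by default) one JSON record per line, and the directory path is a placeholder; only Python's standard library is used.

```python
# Sketch: scan gzip-compressed JSON-lines files (as produced by the dsbulk
# unload in the thread) for the largest "insertdate". The output directory
# path is a placeholder; assumes one JSON record per line.
import glob
import gzip
import json

def max_insertdate(lines):
    """Return the largest 'insertdate' across an iterable of JSON lines."""
    best = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        value = json.loads(line).get("insertdate")
        if value is not None and (best is None or value > best):
            best = value
    return best

def scan_directory(path):
    """Walk every *.json.gz file under `path` and reduce to one maximum."""
    best = None
    for name in glob.glob(f"{path}/*.json.gz"):
        with gzip.open(name, "rt", encoding="utf-8") as fh:
            candidate = max_insertdate(fh)
            if candidate is not None and (best is None or candidate > best):
                best = candidate
    return best

print(max_insertdate(['{"insertdate": 3}', '{"insertdate": 9}', '{}']))  # 9
```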
Re: Issues, understanding how CQL works
This sounds like a promising way; thank you for bringing this up! I will see if I can manage it with this approach.

Best regards,
Marc Richter

On 22.04.20 15:38, Durity, Sean R wrote:
> I thought this might be a single-time use case request. I think my first approach would be to use something like dsbulk to unload the data and then reload it into a table designed for the query you want to do (as long as you have adequate disk space). I think like a DBA/admin first. Dsbulk creates csv files, so you could move that data to any kind of database, if you chose.
>
> An alternative approach would be to use a driver that supports paging (I think this would be most of them) and write a program to walk the data set and output what you need in whatever format you need. Or, since this is a single-node scenario, you could try sstable2json to export the sstables (files on disk) into JSON, if that is a more workable format for you.
>
> Sean Durity – Staff Systems Engineer, Cassandra
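Sean's second suggestion above, writing a program that walks the whole data set with a paging driver, can be sketched as follows. The driver specifics (cassandra-driver's `SimpleStatement` with `fetch_size`, which makes the driver fetch pages transparently) are assumptions; the tracking logic is pure Python and works on any iterable of rows.

```python
# Sketch of the "walk the whole table with a paging driver" alternative from
# the thread. Driver specifics are assumptions (cassandra-driver); the
# single-pass tracking below is independent of the driver.

def track_max(rows, column="insertdate"):
    """Single pass over rows, remembering the largest value of `column`.
    Accepts dicts or row objects with attribute access."""
    best = None
    for row in rows:
        value = row[column] if isinstance(row, dict) else getattr(row, column)
        if value is not None and (best is None or value > best):
            best = value
    return best

def walk_table(session):
    """Full-table walk; the driver pages results transparently (assumed API)."""
    from cassandra.query import SimpleStatement  # cassandra-driver
    stmt = SimpleStatement("SELECT insertdate FROM tagdata.central", fetch_size=5000)
    return track_max(session.execute(stmt))

print(track_max([{"insertdate": 10}, {"insertdate": 42}, {"insertdate": 7}]))  # 42
```

On a 260 GB single node this is the "expensive but run-once" full scan the thread discusses; it will take a long time but needs no schema changes.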
Re: Issues, understanding how CQL works
Hi Jeff,

thank you for your exhaustive and verbose answer! Also, a very big "Thank you!" to all the other repliers; I hope you understand that I summarize all your feedback in this single answer.

From what I understand from your answers, Cassandra seems to be optimized to store (and read) data in exactly the way the data structure has been designed for. That makes it very inflexible, but allows it to do that single job very effectively as a trade-off.

I also understand, the more I dig into Cassandra, that the team I am supporting is using Cassandra kind of wrong; for example, they have only one node and so use neither the load-balancing nor the redundancy capabilities Cassandra offers. Thus, a maybe relevant side note: all the data resides on just one single node; maybe that info is important, because we know on which node the data is (I know that Cassandra internally applies the same hashing voodoo as if there were 1k nodes, but maybe this is important anyway).

Anyway: I do not really care if a query or effort to find this information is sub-optimal or very "expensive" in terms of efficiency or system load, since this isn't something that I need to extract on a regular basis, but only once. Because of that, it doesn't need to be optimal or efficient; I also do not care if it blocks the node for several hours, since Cassandra is only working on this single request. I really need this info (most recent "insertdate") only once. Is there, considering this, a way to do that?

> Because you didnt provide a signalid and monthyear, it doesn't know
> which machine in your cluster to use to start the query.

I know this already; thanks for confirming that I got this correct! But what do I do if I do not know all "signalid"s? How can I learn them? Is it maybe possible to get a full list of all "signalid"s? Or is it possible to "re-arrange" the data in the cluster or something else that enables me to learn what's the most recent "insertdate"? I really do not care if I need to do some expensive copy-all-data move, but I do not know what is possible and how to do it.

Best regards,
Marc Richter

On 21.04.20 19:20, Jeff Jirsa wrote:
> On Tue, Apr 21, 2020 at 6:20 AM Marc Richter <m...@marc-richter.info> wrote:
>
>> Hi everyone,
>>
>> I'm very new to Cassandra. I have, however, some experience with SQL.
>
> The biggest thing to remember is that Cassandra is designed to scale out to massive clusters - like thousands of instances. To do that, you can't assume it's ever ok to read all of the data, because that doesn't scale. So cassandra takes shortcuts / optimizations to make it possible to ADDRESS all of that data, but not SCAN it.
>
>> I need to extract some information from a Cassandra database that has
>> the following table definition:
>>
>> CREATE TABLE tagdata.central (
>>     signalid int,
>>     monthyear int,
>>     fromtime bigint,
>>     totime bigint,
>>     avg decimal,
>>     insertdate bigint,
>>     max decimal,
>>     min decimal,
>>     readings text,
>>     PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
>> )
>
> What your primary key REALLY MEANS is:
> - The database on reads and writes will hash(signalid+monthyear) to find which hosts have the data, then
> - In each data file, the data for a given (signalid,monthyear) is stored sorted by fromtime and totime
>
>> The database is already round about 260 GB in size. I now need to know
>> what is the most recent entry in it; the correct column to learn this
>> would be "insertdate". In SQL I would do something like this:
>>
>> SELECT insertdate FROM tagdata.central ORDER BY insertdate DESC LIMIT 1;
>>
>> In CQL, however, I just can't get it to work. What I have tried already is this:
>>
>> SELECT insertdate FROM "tagdata.central" ORDER BY insertdate DESC LIMIT 1;
>
> Because you didnt provide a signalid and monthyear, it doesn't know which machine in your cluster to use to start the query.
>
>> But this gives me an error:
>> ERROR: ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
>
> Because it's designed for potentially petabytes of data per cluster, it doesn't believe you really want to walk all the data and order ALL of it. Instead, it assumes that when you need to use an ORDER BY, you're going to have some very small piece of data - confined to a single signalid/monthyear pair. And even then, the ORDER is going to assume that you're ordering it by the ordering keys you've defined - fromtime first, and then totime.
>
> So you can do:
> SELECT ... WHERE signalid=? AND monthyear=? ORDER BY fromtime ASC
>
> And you can do:
> SELECT ... WHERE signalid=? AND monthyear=? ORDER BY fromtime DESC
>
> And you can do ranges:
> SELECT ... WHERE signalid=? AND monthyear=? AND fromtime >= ? ORDER BY fromtime DESC
>
> But you have to work within the b
Issues, understanding how CQL works
Hi everyone,

I'm very new to Cassandra. I have, however, some experience with SQL.

I need to extract some information from a Cassandra database that has the following table definition:

    CREATE TABLE tagdata.central (
        signalid int,
        monthyear int,
        fromtime bigint,
        totime bigint,
        avg decimal,
        insertdate bigint,
        max decimal,
        min decimal,
        readings text,
        PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
    )

The database is already round about 260 GB in size. I now need to know what is the most recent entry in it; the correct column to learn this would be "insertdate". In SQL I would do something like this:

    SELECT insertdate FROM tagdata.central ORDER BY insertdate DESC LIMIT 1;

In CQL, however, I just can't get it to work. What I have tried already is this:

    SELECT insertdate FROM "tagdata.central" ORDER BY insertdate DESC LIMIT 1;

But this gives me an error:

    ERROR: ORDER BY is only supported when the partition key is restricted by an EQ or an IN.

So, after some trial and error and a lot of googling, I learned that I must include all columns of the PRIMARY KEY from left to right in my query. Thus, this is the "best" I can get to work:

    SELECT * FROM "tagdata.central"
    WHERE "signalid" = 4002
        AND "monthyear" = 201908
    ORDER BY "fromtime" DESC
    LIMIT 10;

The "monthyear" column I crafted like a fool by incrementing the date one month after another until no more results could be found. The "signalid" I grabbed from one of the unrestricted "SELECT * FROM" query results. But these can't be guessed as easily as the "monthyear" values. This is where I'm stuck!

1. This does not really feel like the ideal way to go. I think there is something more mature in modern IT systems. Can anyone tell me a better way to get this information?
2. I need a way to learn all values that are in the "monthyear" and "signalid" columns in order to be able to craft that query. How can I achieve that in a reasonable way?

As I said: the DB is round about 260 GB, which makes it next to impossible to just "have a look" at the output of "SELECT *".

Thanks for your help!

Best regards,
Marc Richter

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org
Re: Unable to integrate jmx_prometheus_javaagent
Nevermind - after hours of investigation, I found the solution myself, just after having sent the mail to the list ...

Some resources on the web highlight the importance of wrapping what follows "-javaagent:" in double quotes, but exactly this turns out to be the issue: note that the log complains that it could not find

    "/etc/cassandra/prometheus/jmx_prometheus_javaagent-0.12.0.jar

Note the leading double quote here ... Removing the quotes makes it work like a charm.

Sorry for bothering!

BR,
Marc

On 25.07.19 18:02, Marc Richter wrote:
> [...]
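Putting the fix together, the working cassandra-env.sh fragment looks like the one below: the same -javaagent option as in the original mail, but without the inner double quotes around the jar path. Paths and the port are taken from the thread; adjust them to your layout.

```shell
# Working cassandra-env.sh fragment per the fix above: no double quotes
# inside the -javaagent value. Paths/port are the ones from the thread.
PROMETHEUS_AGENT="-javaagent:/etc/cassandra/prometheus/jmx_prometheus_javaagent-0.12.0.jar=7070:/etc/cassandra/prometheus/jmx_prometheus_javaagent_cassandra.yml"
JVM_OPTS="$JVM_OPTS $PROMETHEUS_AGENT"
echo "$PROMETHEUS_AGENT"
```

After restarting the node, the exporter should answer on the chosen port, e.g. `curl -s localhost:7070/metrics | head`.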
Unable to integrate jmx_prometheus_javaagent
Hi everyone,

I have an existing Cassandra node (3.7). Now, I'd like to be able to grab metrics from it for my Prometheus + Grafana based monitoring.

I downloaded "jmx_prometheus_javaagent-0.12.0.jar" from [1] and copied it to "/etc/cassandra/prometheus/jmx_prometheus_javaagent-0.12.0.jar". I also downloaded "cassandra.yml" from [2] and saved it to "/etc/cassandra/prometheus/jmx_prometheus_javaagent_cassandra.yml". Next, I appended the following to my cassandra-env.sh:

```
PROMETHEUS_AGENT='-javaagent:"/etc/cassandra/prometheus/jmx_prometheus_javaagent-0.12.0.jar=7070:/etc/cassandra/prometheus/jmx_prometheus_javaagent_cassandra.yml"'
JVM_OPTS="$JVM_OPTS $PROMETHEUS_AGENT"
```

When I now try to start my Cassandra node, it fails and writes this to my logfile:

```
Error opening zip file or JAR manifest missing : "/etc/cassandra/prometheus/jmx_prometheus_javaagent-0.12.0.jar
Error occurred during initialization of VM
agent library failed to init: instrument
```

I'm using the official Cassandra Docker image [3], tag 3.7. I found the steps I did here in many online resources, but I could not find any issue which matches what I'm facing. Does anybody have an idea?

BR,
Marc

[1] https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.12.0/jmx_prometheus_javaagent-0.12.0.jar
[2] https://raw.githubusercontent.com/prometheus/jmx_exporter/master/example_configs/cassandra.yml
[3] https://hub.docker.com/_/cassandra
Securing cluster communication
Hi everyone,

I'm completely new to Cassandra, so please do not roast me for asking obvious stuff.

I managed to set up one Cassandra node and enter some data into it successfully. Next, I installed a second node, which connected to that first one via port 7000 and synced all that data from it. This worked fine as well.

But doing so leaves me puzzled a bit because of the security aspect: neither did I need to authenticate to the seeding (first) node, nor did I find a resource which describes how to secure that cluster communication by implementing some kind of authentication which prevents everyone on the same net from connecting to the nodes.

How is this dealt with in Cassandra? Is setting up firewalls the only way to allow only some nodes to connect to ports 7000/7001?

BR,
Marc
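For reference, the usual answer to the question above is internode TLS: with `require_client_auth` enabled, nodes must present a certificate from your truststore to join, which acts as internode authentication. Below is a sketch of the relevant cassandra.yaml settings for the 3.x line; keystore/truststore paths and passwords are placeholders, and client (CQL) authentication is a separate setting (`authenticator`). Firewalling ports 7000/7001 is still recommended on top.

```yaml
# Sketch: cassandra.yaml internode encryption/authentication (Cassandra 3.x).
# Paths and passwords are placeholders; every node needs its own keystore and
# a truststore containing the cluster's trusted certificates.
server_encryption_options:
    internode_encryption: all        # encrypt all node-to-node traffic (port 7001)
    keystore: /etc/cassandra/conf/keystore.jks
    keystore_password: CHANGE_ME
    truststore: /etc/cassandra/conf/truststore.jks
    truststore_password: CHANGE_ME
    require_client_auth: true        # mutual TLS: only trusted nodes may connect
```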