Re: Issues, understanding how CQL works

2020-04-21 Thread Voytek Jarnot
As I learned the hard way (and as has already been implied), design your
tables to support your queries.

We have, for example, 9 tables storing the same data, because users wish to
query it in different ways. It could be several more tables (if one were
being a purist), but secondary indexes get us the rest of the way there.
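
To make that idea concrete, here is a rough sketch (table and column names
below are invented for illustration, not our actual schema): the same
measurements stored twice, keyed once per access pattern.

CREATE TABLE metrics_by_signal (
    signalid int,
    ts bigint,
    value decimal,
    PRIMARY KEY ((signalid), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

CREATE TABLE metrics_by_month (
    monthyear int,
    ts bigint,
    signalid int,
    value decimal,
    PRIMARY KEY ((monthyear), ts, signalid)
) WITH CLUSTERING ORDER BY (ts DESC, signalid ASC);

The application writes every row to both tables; each query then reads the
table whose partition key matches the question being asked.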

On Tue, Apr 21, 2020 at 8:20 AM Marc Richter  wrote:

> Hi everyone,
>
> I'm very new to Cassandra. I have, however, some experience with SQL.
>
> I need to extract some information from a Cassandra database that has
> the following table definition:
>
> CREATE TABLE tagdata.central (
> signalid int,
> monthyear int,
> fromtime bigint,
> totime bigint,
> avg decimal,
> insertdate bigint,
> max decimal,
> min decimal,
> readings text,
> PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
> )
>
> The database is already around 260 GB in size.
> I now need to know what the most recent entry in it is; the column
> that indicates this is "insertdate".
>
> In SQL I would do something like this:
>
> SELECT insertdate FROM tagdata.central
> ORDER BY insertdate DESC LIMIT 1;
>
> In CQL, however, I just can't get it to work.
>
> What I have tried already is this:
>
> SELECT insertdate FROM "tagdata.central"
> ORDER BY insertdate DESC LIMIT 1;
>
> But this gives me an error:
> ERROR: ORDER BY is only supported when the partition key is restricted
> by an EQ or an IN.
>
> So, after some trial and error and a lot of Googling, I learned that I
> must include all columns of the PRIMARY KEY from left to right in my
> query. Thus, this is the "best" I can get to work:
>
>
> SELECT
> *
> FROM
> "tagdata.central"
> WHERE
> "signalid" = 4002
> AND "monthyear" = 201908
> ORDER BY
> "fromtime" DESC
> LIMIT 10;
>
>
> The "monthyear" values I crafted like a fool, by incrementing the date
> one month at a time until no results could be found anymore.
> The "signalid" values I grabbed from one of the unrestricted "SELECT * FROM"
> query results. But these can't be guessed as easily as the "monthyear"
> values can.
>
> This is where I'm stuck!
>
> 1. This does not really feel like the ideal way to go; I think there must be
> something more mature in modern IT systems. Can anyone tell me a
> better way to get this information?
>
> 2. I need a way to learn all values that are in the "monthyear" and
> "signalid" columns in order to be able to craft that query.
> How can I achieve that in a reasonable way? As I said, the DB is around
> 260 GB, which makes it next to impossible to just "have a look" at
> the output of "SELECT *".
>
> Thanks for your help!
>
> Best regards,
> Marc Richter
>
>


Re: Issues, understanding how CQL works

2020-04-21 Thread Jeff Jirsa
On Tue, Apr 21, 2020 at 6:20 AM Marc Richter  wrote:

> Hi everyone,
>
> I'm very new to Cassandra. I have, however, some experience with SQL.
>

The biggest thing to remember is that Cassandra is designed to scale out to
massive clusters - like thousands of instances. To do that, it can't
assume it's ever OK to read all of the data, because that doesn't scale. So
Cassandra takes shortcuts / optimizations that make it possible to ADDRESS
all of that data, but not to SCAN it.


>
> I need to extract some information from a Cassandra database that has
> the following table definition:
>
> CREATE TABLE tagdata.central (
> signalid int,
> monthyear int,
> fromtime bigint,
> totime bigint,
> avg decimal,
> insertdate bigint,
> max decimal,
> min decimal,
> readings text,
> PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
> )
>
>
What your primary key REALLY MEANS is:

On reads and writes, the database will hash (signalid, monthyear) to find
which hosts have the data, and then,

in each data file, the data for a given (signalid, monthyear) is stored
sorted by fromtime and then by totime.
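
A quick way to see that split (purely an illustration, reusing the values
you already posted) is the token() function, which returns the hash Cassandra
uses to place a partition:

-- token(...) is the partition hash; fromtime/totime show the sort order
-- inside that partition
SELECT token(signalid, monthyear), fromtime, totime
FROM tagdata.central
WHERE signalid = 4002 AND monthyear = 201908
LIMIT 5;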



> The database is already around 260 GB in size.
> I now need to know what the most recent entry in it is; the column
> that indicates this is "insertdate".
>
> In SQL I would do something like this:
>
> SELECT insertdate FROM tagdata.central
> ORDER BY insertdate DESC LIMIT 1;
>
> In CQL, however, I just can't get it to work.
>
> What I have tried already is this:
>
> SELECT insertdate FROM "tagdata.central"
> ORDER BY insertdate DESC LIMIT 1;
>

Because you didn't provide a signalid and monthyear, it doesn't know which
machine in your cluster to use to start the query.


>
> But this gives me an error:
> ERROR: ORDER BY is only supported when the partition key is restricted
> by an EQ or an IN.
>

Because it's designed for potentially petabytes of data per cluster, it
doesn't believe you really want to walk all the data and order ALL of it.
Instead, it assumes that when you need to use an ORDER BY, you're going to
have some very small piece of data - confined to a single
signalid/monthyear pair. And even then, the ORDER BY is going to assume that
you're ordering by the clustering keys you've defined - fromtime first,
and then totime.

So you can do

 SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime ASC
And you can do

 SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime DESC

And you can do ranges:

 SELECT ... WHERE signalid=? and monthyear=? AND fromtime >= ? ORDER BY
fromtime DESC

But you have to work within the boundaries of how the data is stored: it's
grouped by (signalid, monthyear), then sorted by fromtime, and then by
totime within equal fromtime values.
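
So the newest row within one partition - newest by clustering order, not by
insertdate - would be something like this (a sketch using your example values):

SELECT insertdate, fromtime, totime
FROM tagdata.central
WHERE signalid = 4002 AND monthyear = 201908
ORDER BY fromtime DESC, totime DESC
LIMIT 1;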



> So, after some trial and error and a lot of Googling, I learned that I
> must include all columns of the PRIMARY KEY from left to right in my
> query. Thus, this is the "best" I can get to work:
>
>
> SELECT
> *
> FROM
> "tagdata.central"
> WHERE
> "signalid" = 4002
> AND "monthyear" = 201908
> ORDER BY
> "fromtime" DESC
> LIMIT 10;
>
>
> The "monthyear" values I crafted like a fool, by incrementing the date
> one month at a time until no results could be found anymore.
> The "signalid" values I grabbed from one of the unrestricted "SELECT * FROM"
> query results. But these can't be guessed as easily as the "monthyear"
> values can.
>
> This is where I'm stuck!
>
> 1. This does not really feel like the ideal way to go; I think there must be
> something more mature in modern IT systems. Can anyone tell me a
> better way to get this information?
>

You can denormalize. Because Cassandra allows you to have very large
clusters, you can make multiple tables, sorted in different ways, to enable
the queries you need to run. Normal data modeling is to build tables based
on the SELECT statements you need to run (unless you're very advanced, in
which case you model based on the transaction semantics of the
INSERT/UPDATE statements, but that's probably not you).
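
As a sketch of what that could look like for your "most recent insertdate"
question (the table name and the day bucket are mine, not anything in your
schema):

CREATE TABLE tagdata.central_by_insertdate (
    day int,            -- e.g. 20200421, bounds the partition size
    insertdate bigint,
    signalid int,
    monthyear int,
    PRIMARY KEY ((day), insertdate, signalid, monthyear)
) WITH CLUSTERING ORDER BY (insertdate DESC, signalid ASC, monthyear ASC);

-- "most recent entry" then becomes a single-partition read:
SELECT insertdate, signalid, monthyear
FROM tagdata.central_by_insertdate
WHERE day = 20200421
LIMIT 1;

You'd populate it from the same code path that writes to tagdata.central.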

Or you can use a more flexible database.


>
> 2. I need a way to learn all values that are in the "monthyear" and
> "signalid" columns in order to be able to craft that query.
> How can I achieve that in a reasonable way? As I said, the DB is around
> 260 GB, which makes it next to impossible to just "have a look" at
> the output of "SELECT *".
>

You probably want to keep another table of monthyear + signalid pairs.
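
As a sketch (names invented, just to show the shape), maintained by whatever
writes to tagdata.central:

CREATE TABLE tagdata.signal_months (
    signalid int,
    monthyear int,
    PRIMARY KEY ((signalid), monthyear)
);

-- which months exist for a given signal:
SELECT monthyear FROM tagdata.signal_months WHERE signalid = 4002;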


Re: Issues, understanding how CQL works

2020-04-21 Thread Reid Pinchback
Marc, have you had any exposure to DynamoDB at all?  The API approach is 
different, but the fundamental concepts are similar.  That’s actually a better 
reference point to have than an RDBMS, because really only a small subset of 
RDBMS usage patterns overlaps with CQL.  If you were, for example, dealing 
with databases that did a lot of table partitioning and supported apps that 
focused on bulk loads and analytics at the partition level, then you would be 
in a space somewhat similar to C*.

C* is at its best when your common usage pattern, at least on reads, is 
effectively “I want a bunch of stuff, so you may as well give it to me by the 
bunch… what I do with the bunch after is my problem”.  That’s very different 
from an RDBMS, which historically has always tried to find some balance between 
minimizing disk I/O and network I/O… but if it takes developers a lot more head 
scratching to get there, it was considered an acceptable investment to help 
scale the usage of an expensive resource.

As a result, language features for the two cases are quite different.

From: Elliott Sims 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, April 21, 2020 at 12:13 PM
To: "user@cassandra.apache.org" 
Subject: Re: Issues, understanding how CQL works

The short answer is that CQL isn't SQL.  It looks a bit like it, but the 
structure of the data is totally different.  Essentially (ignoring secondary 
indexes, which have some issues in practice and I think are generally not 
recommended) the only way to look the data up is by the partition key.  
Anything else is a full-table scan and if you need more querying flexibility 
Cassandra is probably not your best option.   With only 260GB, I think I'd lean 
towards suggesting PostgreSQL or MySQL.

On Tue, Apr 21, 2020 at 7:20 AM Marc Richter  wrote:
Hi everyone,

I'm very new to Cassandra. I have, however, some experience with SQL.

I need to extract some information from a Cassandra database that has
the following table definition:

CREATE TABLE tagdata.central (
signalid int,
monthyear int,
fromtime bigint,
totime bigint,
avg decimal,
insertdate bigint,
max decimal,
min decimal,
readings text,
PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
)

The database is already around 260 GB in size.
I now need to know what the most recent entry in it is; the column
that indicates this is "insertdate".

In SQL I would do something like this:

SELECT insertdate FROM tagdata.central
ORDER BY insertdate DESC LIMIT 1;

In CQL, however, I just can't get it to work.

What I have tried already is this:

SELECT insertdate FROM "tagdata.central"
ORDER BY insertdate DESC LIMIT 1;

But this gives me an error:
ERROR: ORDER BY is only supported when the partition key is restricted
by an EQ or an IN.

So, after some trial and error and a lot of Googling, I learned that I
must include all columns of the PRIMARY KEY from left to right in my
query. Thus, this is the "best" I can get to work:


SELECT
*
FROM
"tagdata.central"
WHERE
"signalid" = 4002
AND "monthyear" = 201908
ORDER BY
"fromtime" DESC
LIMIT 10;


The "monthyear" values I crafted like a fool, by incrementing the date
one month at a time until no results could be found anymore.
The "signalid" values I grabbed from one of the unrestricted "SELECT * FROM"
query results. But these can't be guessed as easily as the "monthyear"
values can.

This is where I'm stuck!

1. This does not really feel like the ideal way to go; I think there must be
something more mature in modern IT systems. Can anyone tell me a
better way to get this information?

2. I need a way to learn all values that are in the "monthyear" and
"signalid" columns in order to be able to craft that query.
How can I achieve that in a reasonable way? As I said, the DB is around
260 GB, which makes it next to impossible to just "have a look" at
the output of "SELECT *".

Thanks for your help!

Best regards,
Marc Richter




Re: Impact of setting low value for flag -XX:MaxDirectMemorySize

2020-04-21 Thread Reid Pinchback
Note that from a performance standpoint, it’s hard to see a reason to care 
about releasing the memory unless you are co-tenanting C* with something else 
that’s significant in its memory demands, and significant on a schedule 
anti-correlated with when C* needs that memory.

If you aren’t doing that, then conceivably the only other time you’d care is if 
you are seeing read or write stalls on disk I/O because the O/S buffer cache is 
too small.  But if you were getting a lot of impact from stalls, then it would 
mean C* was very busy… and if it’s very busy then it’s likely using its buffers 
as they are intended.

From: HImanshu Sharma 
Reply-To: "user@cassandra.apache.org" 
Date: Saturday, April 18, 2020 at 2:06 AM
To: "user@cassandra.apache.org" 
Subject: Re: Impact of setting low value for flag -XX:MaxDirectMemorySize

From the codebase, as far as I understood: once a buffer is allocated, it is 
not freed but added to a recyclable pool. When a new request comes in, an 
attempt is made to fetch memory from the recyclable pool, and if none is 
available a new allocation is made. If that new allocation would breach the 
memory limit, we get this OOM error.

I would like to know whether my understanding is correct.
If it is, is there a way to get this buffer pool shrunk when traffic is low? 
What I have observed on my system is that this memory stays static even when 
there is no traffic.

Regards
Manish

On Sat, Apr 18, 2020 at 11:13 AM Erick Ramirez  wrote:
Like most things, it depends on (a) what you're allowing and (b) how much your 
nodes require. MaxDirectMemorySize is the upper-bound for off-heap memory used 
for the direct byte buffer. C* uses it for Netty so if your nodes are busy 
servicing requests, they'd have more IO threads consuming memory.

During low traffic periods, there's less memory allocated to service requests 
and they eventually get freed up by GC tasks. But if traffic volumes are high, 
memory doesn't get freed up quick enough so the max is reached. When this 
happens, you'll see OOMs like "OutOfMemoryError: Direct buffer memory" show up 
in the logs.
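
For reference, the flag itself is just a JVM option, typically added to 
jvm.options (or appended to JVM_OPTS in cassandra-env.sh). The 4G below is 
purely an example, not a recommendation:

-XX:MaxDirectMemorySize=4G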

You can play around with different values but make sure you test it 
exhaustively before trying it out in production. Cheers!

GOT QUESTIONS? Apache Cassandra experts from the community and DataStax have 
answers! Share your expertise on 
https://community.datastax.com/.


Re: Issues, understanding how CQL works

2020-04-21 Thread Elliott Sims
The short answer is that CQL isn't SQL.  It looks a bit like it, but the
structure of the data is totally different.  Essentially (ignoring
secondary indexes, which have some issues in practice and I think are
generally not recommended) the only way to look the data up is by the
partition key.  Anything else is a full-table scan and if you need more
querying flexibility Cassandra is probably not your best option.   With
only 260GB, I think I'd lean towards suggesting PostgreSQL or MySQL.
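
To make that concrete: filtering on a non-key column such as insertdate is
only possible as a scan, which Cassandra refuses unless you opt in. A sketch,
with a made-up timestamp literal:

SELECT insertdate FROM tagdata.central
WHERE insertdate > 1587427200000 ALLOW FILTERING;  -- full scan, avoid at 260GB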

On Tue, Apr 21, 2020 at 7:20 AM Marc Richter  wrote:

> Hi everyone,
>
> I'm very new to Cassandra. I have, however, some experience with SQL.
>
> I need to extract some information from a Cassandra database that has
> the following table definition:
>
> CREATE TABLE tagdata.central (
> signalid int,
> monthyear int,
> fromtime bigint,
> totime bigint,
> avg decimal,
> insertdate bigint,
> max decimal,
> min decimal,
> readings text,
> PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
> )
>
> The database is already around 260 GB in size.
> I now need to know what the most recent entry in it is; the column
> that indicates this is "insertdate".
>
> In SQL I would do something like this:
>
> SELECT insertdate FROM tagdata.central
> ORDER BY insertdate DESC LIMIT 1;
>
> In CQL, however, I just can't get it to work.
>
> What I have tried already is this:
>
> SELECT insertdate FROM "tagdata.central"
> ORDER BY insertdate DESC LIMIT 1;
>
> But this gives me an error:
> ERROR: ORDER BY is only supported when the partition key is restricted
> by an EQ or an IN.
>
> So, after some trial and error and a lot of Googling, I learned that I
> must include all columns of the PRIMARY KEY from left to right in my
> query. Thus, this is the "best" I can get to work:
>
>
> SELECT
> *
> FROM
> "tagdata.central"
> WHERE
> "signalid" = 4002
> AND "monthyear" = 201908
> ORDER BY
> "fromtime" DESC
> LIMIT 10;
>
>
> The "monthyear" values I crafted like a fool, by incrementing the date
> one month at a time until no results could be found anymore.
> The "signalid" values I grabbed from one of the unrestricted "SELECT * FROM"
> query results. But these can't be guessed as easily as the "monthyear"
> values can.
>
> This is where I'm stuck!
>
> 1. This does not really feel like the ideal way to go; I think there must be
> something more mature in modern IT systems. Can anyone tell me a
> better way to get this information?
>
> 2. I need a way to learn all values that are in the "monthyear" and
> "signalid" columns in order to be able to craft that query.
> How can I achieve that in a reasonable way? As I said, the DB is around
> 260 GB, which makes it next to impossible to just "have a look" at
> the output of "SELECT *".
>
> Thanks for your help!
>
> Best regards,
> Marc Richter
>
>


Re: Write to SSTables to do really fast initial load of database (e.g. for migration)

2020-04-21 Thread Eric Evans
On Tue, Apr 21, 2020 at 4:16 AM Erick Ramirez 
wrote:

> If you're asking about CQLSSTableWriter to create your own SSTables which
> you then load with the sstableloader utility, then it's still pretty much
> the same. BUT...
>
> The use-case for that has pretty much evaporated since the fastest way of
> bulk loading data (in my opinion) is using the DataStax Bulk Loader [1].
> It's the fastest because:
>

To be clear: CQLSSTableWriter and sstableloader are the canonical way to
write and bulk load SSTables for the Apache Cassandra project.
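
For completeness, loading pre-built SSTables then comes down to pointing
sstableloader at the generated keyspace/table directory (the host and path
below are placeholders):

sstableloader -d 10.0.0.1 /path/to/generated/tagdata/central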


>
>- you don't need to write code to use it
>- you can load any data in CSV or JSON format
>- you directly load to your cluster bypassing the sstableloader step
>completely
>
> Check it out and see if it meets your requirements. I think you'll find it
> will save you a lot of time in the long run. Cheers!
>
> [1] https://www.datastax.com/blog/2019/12/tools-for-apache-cassandra
>
> GOT QUESTIONS? Apache Cassandra experts from the community and DataStax
> have answers! Share your expertise on https://community.datastax.com/.
>
>
And likewise, this mailing list is the correct place to ask questions for
the Apache Cassandra project.


-- 
Eric Evans
john.eric.ev...@gmail.com


Issues, understanding how CQL works

2020-04-21 Thread Marc Richter

Hi everyone,

I'm very new to Cassandra. I have, however, some experience with SQL.

I need to extract some information from a Cassandra database that has 
the following table definition:


CREATE TABLE tagdata.central (
signalid int,
monthyear int,
fromtime bigint,
totime bigint,
avg decimal,
insertdate bigint,
max decimal,
min decimal,
readings text,
PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
)

The database is already around 260 GB in size.
I now need to know what the most recent entry in it is; the column
that indicates this is "insertdate".


In SQL I would do something like this:

SELECT insertdate FROM tagdata.central
ORDER BY insertdate DESC LIMIT 1;

In CQL, however, I just can't get it to work.

What I have tried already is this:

SELECT insertdate FROM "tagdata.central"
ORDER BY insertdate DESC LIMIT 1;

But this gives me an error:
ERROR: ORDER BY is only supported when the partition key is restricted 
by an EQ or an IN.


So, after some trial and error and a lot of Googling, I learned that I
must include all columns of the PRIMARY KEY from left to right in my
query. Thus, this is the "best" I can get to work:



SELECT
*
FROM
"tagdata.central"
WHERE
"signalid" = 4002
AND "monthyear" = 201908
ORDER BY
"fromtime" DESC
LIMIT 10;


The "monthyear" values I crafted like a fool, by incrementing the date
one month at a time until no results could be found anymore.
The "signalid" values I grabbed from one of the unrestricted "SELECT * FROM"
query results. But these can't be guessed as easily as the "monthyear"
values can.


This is where I'm stuck!

1. This does not really feel like the ideal way to go; I think there must be
something more mature in modern IT systems. Can anyone tell me a
better way to get this information?


2. I need a way to learn all values that are in the "monthyear" and 
"signalid" columns in order to be able to craft that query.
How can I achieve that in a reasonable way? As I said, the DB is around
260 GB, which makes it next to impossible to just "have a look" at
the output of "SELECT *".


Thanks for your help!

Best regards,
Marc Richter





Re: Write to SSTables to do really fast initial load of database (e.g. for migration)

2020-04-21 Thread Erick Ramirez
If you're asking about CQLSSTableWriter to create your own SSTables which
you then load with the sstableloader utility, then it's still pretty much
the same. BUT...

The use-case for that has pretty much evaporated since the fastest way of
bulk loading data (in my opinion) is using the DataStax Bulk Loader [1].
It's the fastest because:

   - you don't need to write code to use it
   - you can load any data in CSV or JSON format
   - you directly load to your cluster bypassing the sstableloader step
   completely
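
A typical invocation looks roughly like this (flags quoted from memory, so
check the dsbulk docs for the exact options; the file path is a placeholder):

dsbulk load -url /path/to/export.csv -k tagdata -t central -header true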

Check it out and see if it meets your requirements. I think you'll find it
will save you a lot of time in the long run. Cheers!

[1] https://www.datastax.com/blog/2019/12/tools-for-apache-cassandra

GOT QUESTIONS? Apache Cassandra experts from the community and DataStax
have answers! Share your expertise on https://community.datastax.com/.


Write to SSTables to do really fast initial load of database (e.g. for migration)

2020-04-21 Thread Tobias Eriksson
Hi
I would like to evaluate writing my “own” SSTable files and then importing 
them into Cassandra.
My goal is to beat the performance of traditional CQL INSERTs (yes, even with 
prepared statements).
I have been trying to find code examples for how to write SSTables, but could 
only really find an example for Cassandra 2.x,
https://www.datastax.com/blog/2014/09/using-cassandra-bulk-loader-updated
which dates back to 2014…

Does anyone have experience with writing their own SSTables and doing really 
fast imports?
Any links / articles / blog posts / …?

Using Cassandra 3.11

-Tobias