Re: Write to SSTables to do really fast initial load of database (e.g. for migration)

2020-04-22 Thread Tobias Eriksson
Thanx all for the good tips
-Tobias


From: Eric Evans 
Reply to: "user@cassandra.apache.org" 
Date: Tuesday, 21 April 2020 at 16:02
To: "user@cassandra.apache.org" 
Subject: Re: Write to SSTables to do really fast initial load of database (e.g. 
for migration)



On Tue, Apr 21, 2020 at 4:16 AM Erick Ramirez <erick.rami...@datastax.com> wrote:
If you're asking about CQLSSTableWriter to create your own SSTables which you 
then load with the sstableloader utility, then it's still pretty much the same. 
BUT...

The use case for that has pretty much evaporated, since the fastest way of bulk 
loading data (in my opinion) is the DataStax Bulk Loader [1]. It's the fastest 
because:

To be clear: CQLSSTableWriter and sstableloader are the canonical way to write 
and bulk load SSTables for the Apache Cassandra project.


  *   you don't need to write code to use it
  *   you can load any data in CSV or JSON format
  *   you load directly into your cluster, bypassing the sstableloader step completely
Check it out and see if it meets your requirements. I think you'll find it will 
save you a lot of time in the long run. Cheers!

[1] https://www.datastax.com/blog/2019/12/tools-for-apache-cassandra

GOT QUESTIONS? Apache Cassandra experts from the community and DataStax have 
answers! Share your expertise on https://community.datastax.com/.


And likewise, this mailing list is the correct place to ask questions for the 
Apache Cassandra project.


--
Eric Evans
john.eric.ev...@gmail.com


Re: Issues, understanding how CQL works

2020-04-22 Thread Alex Ott
Not directly related, but you can try zstd as the compression; in my tests it
performed the offload faster, with a slightly worse compression ratio.


Re: Issues, understanding how CQL works

2020-04-22 Thread Marc Richter

Seems as if sstable2json is deprecated; see [1] and [2].

So, dsbulk [3] it is, I guess.

I downloaded it and crafted the following commandline from the docs [4] 
for my use case:


$ ../dsbulk-1.5.0/bin/dsbulk unload -h '["MY_CASSANDRA_IP"]' \
  --driver.advanced.auth-provider.class PlainTextAuthProvider \
  -u cassandra -p MY_PASSWORD -k tagdata -t central -c json \
  --connector.json.compression gzip -url /path/to/big/storage

This seems to result in multiple JSON files compressed with GZIP; seems 
to be exactly what I needed to help me in this case!
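Once the unload finishes, the most recent "insertdate" can be recovered straight from those archives, without touching the cluster again. A minimal sketch, assuming dsbulk wrote newline-delimited JSON (one record per line) into files matching *.json.gz and that each record carries the insertdate field from the schema discussed in this thread; the function name is made up for illustration:

```python
import glob
import gzip
import json
import os

def most_recent_insertdate(dump_dir):
    """Scan gzipped JSON-lines files and return the largest insertdate seen."""
    latest = None
    for path in glob.glob(os.path.join(dump_dir, "*.json.gz")):
        # dsbulk's json connector is assumed to emit one JSON record per line
        with gzip.open(path, mode="rt", encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                row = json.loads(line)
                value = row.get("insertdate")
                if value is not None and (latest is None or value > latest):
                    latest = value
    return latest
```

Since insertdate is a bigint (an epoch-style timestamp), the numeric maximum is the newest entry.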


There's only one thing that I do not really understand:
Besides the GZIP archives, it also creates two logfiles. One of them 
(unload-errors.log) contains some Java stack traces. I do not understand what 
those lines are supposed to say:


(Added it to pastebin to not render the mail unreadable):

https://pastebin.com/WpYvqxAA

What are those lines supposed to tell me?

Best regards,
Marc Richter




[1] 
https://docs.datastax.com/en/cassandra-oss/2.2/cassandra/tools/toolsSSTable2Json.html

[2] https://issues.apache.org/jira/browse/CASSANDRA-9618
[3] https://downloads.datastax.com/#bulk-loader
[4] https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkRef.html


Re: Impact of setting low value for flag -XX:MaxDirectMemorySize

2020-04-22 Thread Reid Pinchback
If the memory wasn’t being used, and it got pushed to swap, then the right 
thing happened.  It’s a common misconception that swap is bad.  The use of swap 
isn’t bad.  What is bad is if you find data churning in and out of swap space a 
lot so that your latency increases either due to the page faults or due to 
contention between swap activity and other disk I/O.  For the case it sounds 
like we’ve been discussing, where the buffers aren’t in use, basically all that 
would happen is that memory garbage would be shoved out of the way.  Honestly 
the thought I’d had in mind when you first described this would be to 
intentionally use cgroups to twiddle swappiness so that a short-term co-tenant 
load could be prioritized and shove stale C* memory out of the way, then 
twiddle the settings back when you prefer C* to be the winner in resource 
demand.

From: manish khandelwal 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, April 22, 2020 at 7:23 AM
To: "user@cassandra.apache.org" 
Subject: Re: Impact of setting low value for flag -XX:MaxDirectMemorySize

Message from External Sender
I am running spark (max heap 4G) and a java application (4G) with my Cassandra 
server (8G).

After heavy loading, if I run a Spark process, some main memory is pushed into 
swap. But if I restart Cassandra and then execute the Spark process, memory is 
not pushed into swap.

The idea behind asking the above question was: is -XX:MaxDirectMemorySize the 
right knob to use to contain the off-heap memory? I understand that I have to 
test, as Eric said, since I might get an OutOfMemoryError. Or are there any 
other, better options available for handling such situations?



On Tue, Apr 21, 2020 at 9:52 PM Reid Pinchback <rpinchb...@tripadvisor.com> wrote:
Note that from a performance standpoint, it’s hard to see a reason to care 
about releasing the memory unless you are co-tenanting C* with something else 
that’s significant in its memory demands, and significant on a schedule 
anti-correlated with when C* needs that memory.

If you aren’t doing that, then conceivably the only other time you’d care is if 
you are seeing read or write stalls on disk I/O because the O/S buffer cache is 
too small.  But if you were getting a lot of impact from stalls, then it would 
mean C* was very busy… and if it’s very busy, then it’s likely using its buffers 
as they are intended.

From: HImanshu Sharma <himanshusharma0...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Saturday, April 18, 2020 at 2:06 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Impact of setting low value for flag -XX:MaxDirectMemorySize

Message from External Sender
From the codebase, as far as I understood it: once a buffer is allocated, it is 
not freed but added to a recyclable pool. When a new request comes in, an effort 
is made to fetch memory from the recyclable pool, and if none is available a new 
allocation request is made. And if, while allocating, the memory limit is 
breached, then we get this OOM error.

I would like to know whether my understanding is correct.
If it is, is there a way to get this buffer pool reduced when there is low 
traffic? Because what I have observed in my system is that this memory remains 
static even if there is no traffic.

Regards
Manish

On Sat, Apr 18, 2020 at 11:13 AM Erick Ramirez <erick.rami...@datastax.com> wrote:
Like most things, it depends on (a) what you're allowing and (b) how much your 
nodes require. MaxDirectMemorySize is the upper-bound for off-heap memory used 
for the direct byte buffer. C* uses it for Netty so if your nodes are busy 
servicing requests, they'd have more IO threads consuming memory.

During low traffic periods, there's less memory allocated to service requests 
and they eventually get freed up by GC tasks. But if traffic volumes are high, 
memory doesn't get freed up quick enough so the max is reached. When this 
happens, you'll see OOMs like "OutOfMemoryError: Direct buffer memory" show up 
in the logs.

You can play around with different values but make sure you test it 
exhaustively before trying it out in production. Cheers!
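For reference, if you do experiment with the flag, it lives with the other JVM settings in conf/jvm.options (split into jvm*-server.options files on Cassandra 4.x); the 4G value below is purely an illustration, not a recommendation:

```
# conf/jvm.options -- cap off-heap direct byte buffer usage
# (example value only; size it from your own load testing)
-XX:MaxDirectMemorySize=4G
```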



Re: Issues, understanding how CQL works

2020-04-22 Thread Aakash Pandhi
Marc,
In DSE, CQL offers an option called CAPTURE, which can save the output of a 
query to a designated file. Maybe you can use that option to save all the values 
you need in that file, to see all signalids or whichever columns you need. The 
file may grow big based on your dataset, so I am not sure what limit it imposes 
on file size. But if you are selecting only 1 or 2 columns it should be fine, I 
assume.
Here is the syntax:

CAPTURE 'file_name'

CAPTURE appends query results to a file.

Sincerely,

Aakash Pandhi
 


Re: Issues, understanding how CQL works

2020-04-22 Thread Marc Richter

This sounds like a promising way; thank you for bringing this up!

I will see if I can manage it with this approach.

Best regards,
Marc Richter



On 22.04.20 15:38, Durity, Sean R wrote:

I thought this might be a single-time use case request. I think my first 
approach would be to use something like dsbulk to unload the data and then 
reload it into a table designed for the query you want to do (as long as you 
have adequate disk space). I think like a DBA/admin first. Dsbulk creates csv 
files, so you could move that data to any kind of database, if you chose.

An alternative approach would be to use a driver that supports paging (I think 
this would be most of them) and write a program to walk the data set and output 
what you need in whatever format you need.

Or, since this is a single node scenario, you could try sstable2json to export 
the sstables (files on disk) into JSON, if that is a more workable format for 
you.

Sean Durity – Staff Systems Engineer, Cassandra

-Original Message-
From: Marc Richter 
Sent: Wednesday, April 22, 2020 6:22 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Issues, understanding how CQL works

Hi Jeff,

thank you for your exhaustive and verbose answer!
Also, a very big "Thank you!" to all the other replyers; I hope you
understand that I summarize all your feedback in this single answer.

  From what I understand from your answers, Cassandra seems to be
optimized to store (and read) data in only exactly that way that the
data structure has been designed for. That makes it very inflexible, but
allows it to do that single job very effectively for a trade-off.

I also understand, the more I dig into Cassandra, that the team I am
supporting is using Cassandra kind of wrong; they, for example, have only one
node and so use neither the load-balancing nor the redundancy capabilities
Cassandra offers.
Thus, a maybe-relevant side note: all the data resides on just one single
node; maybe that info is important, because we know on which node the
data is (I know that Cassandra internally is applying the same hashing
voodoo as if there were 1k nodes, but maybe this is important anyway).

Anyway: I do not really care if a query or effort to find this
information is sub-optimal or very "expensive" in terms of efficiency
or system load, since this isn't something that I need to extract on a
regular basis, but only once. Due to that, it doesn't need to be optimal
or efficient; I also do not care if it blocks the node for several
hours, since Cassandra is only working on this single request. I really
need this info (the most recent "insertdate") only once.
Is there, considering this, a way to do that?

  > Because you didnt provide a signalid and monthyear, it doesn't know
  > which machine in your cluster to use to start the query.

I know this already; thanks for confirming that I got this correct! But
what do I do if I do not know all "signalid"s? How can I learn them?

Is it maybe possible to get a full list of all "signalid"s? Or is it
possible to "re-arrange" the data in the cluster, or something, that
enables me to learn what the most recent "insertdate" is?
I really do not care if I need to do some expensive copy-all-data
move, but I do not know what is possible and how to do that.

Best regards,
Marc Richter

On 21.04.20 19:20, Jeff Jirsa wrote:



On Tue, Apr 21, 2020 at 6:20 AM Marc Richter <m...@marc-richter.info> wrote:

 Hi everyone,

 I'm very new to Cassandra. I have, however, some experience with SQL.


The biggest thing to remember is that Cassandra is designed to scale out
to massive clusters - like thousands of instances. To do that, you can't
assume it's ever ok to read all of the data, because that doesn't scale.
So cassandra takes shortcuts / optimizations to make it possible to
ADDRESS all of that data, but not SCAN it.


 I need to extract some information from a Cassandra database that has
 the following table definition:

 CREATE TABLE tagdata.central (
 signalid int,
 monthyear int,
 fromtime bigint,
 totime bigint,
 avg decimal,
 insertdate bigint,
 max decimal,
 min decimal,
 readings text,
 PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
 )


What your primary key REALLY MEANS is:

The database on reads and writes will hash(signalid+monthyear) to find
which hosts have the data, then

In each data file, the data for a given (signalid,monthyear) is stored
sorted by fromtime and totime
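That routing step can be illustrated with a toy model. Real Cassandra hashes the serialized partition key with Murmur3 onto a token ring; the sketch below substitutes an md5-mod-N scheme purely for illustration, to show why a read must supply both signalid and monthyear: without the full partition key there is nothing to hash, and therefore no node to ask.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # a pretend 3-node cluster

def owner_of(signalid, monthyear):
    """Map the composite partition key (signalid, monthyear) to a node.

    Illustration only: Cassandra actually uses Murmur3 over the serialized
    key and a token ring, not md5-mod-N, but the consequence is the same --
    the full partition key must be known before a node can be chosen.
    """
    key = f"{signalid}:{monthyear}".encode()
    token = int.from_bytes(hashlib.md5(key).digest()[:8], "big")
    return NODES[token % len(NODES)]
```

The mapping is deterministic: the same (signalid, monthyear) pair always lands on the same node, while different pairs spread across the cluster.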

 The database is already of round about 260 GB in size.
 I now need to know what is the most recent entry in it; the correct
 column to learn this would be "insertdate".

 In SQL I would do something like this:

 SELECT insertdate FROM tagdata.central
 ORDER BY insertdate DESC LIMIT 1;

 In CQL, however, I just can't get it to work.

 What I have tried already is this:

 SELECT insertdate FROM "tagdata.central"
 ORDER BY 

Re: Issues, understanding how CQL works

2020-04-22 Thread Alex Ott
DSBulk also works with JSON...
If the transformations of the data are complex, I would go with Spark running
in local mode and process the data there.


RE: Issues, understanding how CQL works

2020-04-22 Thread Durity, Sean R
I thought this might be a single-time use case request. I think my first 
approach would be to use something like dsbulk to unload the data and then 
reload it into a table designed for the query you want to do (as long as you 
have adequate disk space). I think like a DBA/admin first. Dsbulk creates csv 
files, so you could move that data to any kind of database, if you chose.

An alternative approach would be to use a driver that supports paging (I think 
this would be most of them) and write a program to walk the data set and output 
what you need in whatever format you need.
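The paging idea above can be sketched without committing to a particular driver. In the sketch below, `execute` is a stand-in for whatever paged-query call your driver provides (a real driver would handle fetch size and paging state transparently); the table and column names come from the schema discussed in this thread, and the two-step walk (DISTINCT over the partition keys, then a scan per partition) is only one way to organize it:

```python
def newest_insertdate(execute):
    """Walk every partition and track the largest insertdate seen.

    `execute` is any callable that takes a CQL string and returns an
    iterable of row dicts -- a stand-in for a real driver session.
    """
    latest = None
    # Step 1: enumerate partitions; DISTINCT over the partition key
    # touches only partition-level data rather than every row.
    partitions = execute("SELECT DISTINCT signalid, monthyear FROM tagdata.central")
    for p in partitions:
        # Step 2: per partition, page through its rows and track the max.
        rows = execute(
            "SELECT insertdate FROM tagdata.central "
            "WHERE signalid = %d AND monthyear = %d" % (p["signalid"], p["monthyear"])
        )
        for row in rows:
            if latest is None or row["insertdate"] > latest:
                latest = row["insertdate"]
    return latest
```

With a real session object you would pass something like `lambda q: session.execute(q)`; the per-partition queries are cheap individually, so this walk can run in the background without blocking the node.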

Or, since this is a single node scenario, you could try sstable2json to export 
the sstables (files on disk) into JSON, if that is a more workable format for 
you.

Sean Durity – Staff Systems Engineer, Cassandra

-Original Message-
From: Marc Richter 
Sent: Wednesday, April 22, 2020 6:22 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Issues, understanding how CQL works

Hi Jeff,

thank you for your exhaustive and verbose answer!
Also, a very big "Thank you!" to all the other repliers; I hope you
understand that I summarize all your feedback in this single answer.

From what I understand from your answers, Cassandra seems to be
optimized to store (and read) data in exactly the way the
data structure has been designed for. That makes it very inflexible, but
allows it to do that single job very effectively as a trade-off.

I also understand, the more I dig into Cassandra, that the team I am
supporting is using Cassandra kind of wrong; for example, they have
only one node and so use neither the load-balancing nor the
redundancy capabilities Cassandra offers.
Thus, maybe relevant side-note: All the data resides on just one single
node; maybe that info is important, because we know on which node the
data is (I know that Cassandra internally is applying the same Hashing -
Voodoo as if there were 1k nodes, but maybe this is important anyways).

Anyways: I do not really care if a query or effort to find this
information is sub-optimal or very "expensive" in terms of efficiency
or system load, since this isn't something that I need to extract on a
regular basis, but only once. Due to that, it doesn't need to be optimal
or efficient; I also do not care if it blocks the node for several
hours, since Cassandra is only working on this single request. I really
need this info (most recent "insertdate") only once.
Is there, considering this, a way to do that?

 > Because you didn't provide a signalid and monthyear, it doesn't know
 > which machine in your cluster to use to start the query.

I know this already; thanks for confirming that I got this correct! But
what do I do then if I do not know all "signalid"s? How can I learn them?

Is it maybe possible to get a full list of all "signalid"s? Or is it
possible to "re-arrange" the data in the cluster or something that
enables me to learn what's the most recent "insertdate"?
I really do not care if I need to do some expensive copy-all-data
move, but I do not know what is possible and how to do that.
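For what it's worth, CQL can answer this directly: `SELECT DISTINCT` is allowed on the full partition key, and Cassandra 2.2 and later ship a built-in `max()` aggregate, so a one-off loop could look like this (the `?` placeholders are filled in per partition):

```cql
-- List every partition; DISTINCT is allowed on the (full) partition key,
-- so this walks the partition index rather than all row data:
SELECT DISTINCT signalid, monthyear FROM tagdata.central;

-- Then, per (signalid, monthyear) pair from the list above, ask that one
-- partition for its largest insertdate (max() is built in since 2.2):
SELECT max(insertdate) FROM tagdata.central
WHERE signalid = ? AND monthyear = ?;
```

Since insertdate is neither a clustering column nor indexed, the max() call still reads the whole partition server-side; for a one-time extraction over all partitions that is expensive but workable.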

Best regards,
Marc Richter

On 21.04.20 19:20, Jeff Jirsa wrote:
>
>
> On Tue, Apr 21, 2020 at 6:20 AM Marc Richter wrote:
>
> Hi everyone,
>
> I'm very new to Cassandra. I have, however, some experience with SQL.
>
>
> The biggest thing to remember is that Cassandra is designed to scale out
> to massive clusters - like thousands of instances. To do that, you can't
> assume it's ever ok to read all of the data, because that doesn't scale.
> So cassandra takes shortcuts / optimizations to make it possible to
> ADDRESS all of that data, but not SCAN it.
>
>
> I need to extract some information from a Cassandra database that has
> the following table definition:
>
> CREATE TABLE tagdata.central (
> signalid int,
> monthyear int,
> fromtime bigint,
> totime bigint,
> avg decimal,
> insertdate bigint,
> max decimal,
> min decimal,
> readings text,
> PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
> )
>
>
> What your primary key REALLY MEANS is:
>
> The database on reads and writes will hash(signalid+monthyear) to find
> which hosts have the data, then
>
> In each data file, the data for a given (signalid,monthyear) is stored
> sorted by fromtime and totime
>
> The database is already of round about 260 GB in size.
> I now need to know what is the most recent entry in it; the correct
> column to learn this would be "insertdate".
>
> In SQL I would do something like this:
>
> SELECT insertdate FROM tagdata.central
> ORDER BY insertdate DESC LIMIT 1;
>
> In CQL, however, I just can't get it to work.
>
> What I have tried already is this:
>
> SELECT insertdate FROM "tagdata.central"
> ORDER BY insertdate DESC LIMIT 1;
>
>
> Because you didn't provide a signalid and monthyear, it doesn't know
> which machine in your cluster to 

Re: Impact of setting low value for flag -XX:MaxDirectMemorySize

2020-04-22 Thread manish khandelwal
I am running Spark (max heap 4G) and a Java application (4G) alongside my
Cassandra server (8G).

After heavy loading, if I run a Spark process, some main memory is pushed
into swap. But if I restart Cassandra and then execute the Spark process,
memory is not pushed into swap.

The idea behind asking the above question was: is -XX:MaxDirectMemorySize
the right knob to use to contain the off-heap memory? I understand, as Eric
said, that I have to test, since I might get an OutOfMemoryError. Or are
there any other better options available for handling such situations?
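For reference, the knob itself is just a JVM option passed to Cassandra at startup; the value below is purely illustrative and would need to be sized from observed off-heap usage:

```
# In cassandra-env.sh (or jvm.options on newer Cassandra versions);
# 2G is an example only - derive the size from observed off-heap usage:
JVM_OPTS="$JVM_OPTS -XX:MaxDirectMemorySize=2G"
```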



On Tue, Apr 21, 2020 at 9:52 PM Reid Pinchback 
wrote:

> Note that from a performance standpoint, it’s hard to see a reason to care
> about releasing the memory unless you are co-tenanting C* with something
> else that’s significant in its memory demands, and significant on a
> schedule anti-correlated with when C* needs that memory.
>
>
>
> If you aren’t doing that, then conceivably the only other time you’d care
> is if you are seeing read or write stalls on disk I/O because O/S buffer
> cache is too small.  But if you were getting a lot of impact from stalls,
> then it would mean C* was very busy… and if it’s very busy then it’s likely
> using its buffers as they are intended.
>
>
>
> *From: *HImanshu Sharma 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Saturday, April 18, 2020 at 2:06 AM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: Impact of setting low value for flag
> -XX:MaxDirectMemorySize
>
>
>
> *Message from External Sender*
>
> From the codebase as much I understood, if once a buffer is being
> allocated, then it is not freed and added to a recyclable pool. When a new
> request comes effort is made to fetch memory from recyclable pool and if is
> not available new allocation request is made. And while allocating a new
> request if memory limit is breached then we get this oom error.
>
>
>
> I would like to know is my understanding correct
>
> If what I am thinking is correct, is there way we can get this buffer pool
> reduced when there is low traffic because what I have observed in my system
> this memory remains static even if there is no traffic.
>
>
>
> Regards
>
> Manish
>
>
>
> On Sat, Apr 18, 2020 at 11:13 AM Erick Ramirez 
> wrote:
>
> Like most things, it depends on (a) what you're allowing and (b) how much
> your nodes require. MaxDirectMemorySize is the upper-bound for off-heap
> memory used for the direct byte buffer. C* uses it for Netty so if your
> nodes are busy servicing requests, they'd have more IO threads consuming
> memory.
>
>
>
> During low traffic periods, there's less memory allocated to service
> requests and they eventually get freed up by GC tasks. But if traffic
> volumes are high, memory doesn't get freed up quick enough so the max is
> reached. When this happens, you'll see OOMs like "OutOfMemoryError:
> Direct buffer memory" show up in the logs.
>
>
>
> You can play around with different values but make sure you test it
> exhaustively before trying it out in production. Cheers!
>
>
>
> GOT QUESTIONS? Apache Cassandra experts from the community and DataStax
> have answers! Share your expertise on https://community.datastax.com/.
>
>
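The allocate-or-recycle behavior described earlier in this thread can be sketched in miniature as follows. This is a toy model of the pattern (allocate fresh buffers until a cap, recycle released ones, fail when the cap would be exceeded), not Cassandra's actual NIO buffer pool:

```python
class DirectBufferPool:
    """Toy model of a capped direct-buffer pool with recycling."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.in_use_or_pooled = 0   # bytes allocated and never returned to the OS
        self.free = []              # sizes of recycled buffers

    def acquire(self, size):
        if size in self.free:                      # prefer a recycled buffer
            self.free.remove(size)
            return bytearray(size)
        if self.in_use_or_pooled + size > self.max_bytes:
            # analogous to "OutOfMemoryError: Direct buffer memory"
            raise MemoryError("direct buffer limit exceeded")
        self.in_use_or_pooled += size              # fresh allocation is permanent
        return bytearray(size)

    def release(self, size):
        self.free.append(size)                     # pooled, not freed

pool = DirectBufferPool(max_bytes=1024)
pool.acquire(512)
pool.release(512)
pool.acquire(512)          # reuses the recycled buffer, no new allocation
```

Because released buffers stay in the pool, the process's direct-memory footprint sits at its high-water mark even when traffic drops, which matches the "memory remains static even if there is no traffic" observation.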


Re: Issues, understanding how CQL works

2020-04-22 Thread Marc Richter

Hi Jeff,

thank you for your exhaustive and verbose answer!
Also, a very big "Thank you!" to all the other repliers; I hope you 
understand that I summarize all your feedback in this single answer.


From what I understand from your answers, Cassandra seems to be 
optimized to store (and read) data in exactly the way the 
data structure has been designed for. That makes it very inflexible, but 
allows it to do that single job very effectively as a trade-off.


I also understand, the more I dig into Cassandra, that the team I am 
supporting is using Cassandra kind of wrong; for example, they have 
only one node and so use neither the load-balancing nor the 
redundancy capabilities Cassandra offers.
Thus, maybe relevant side-note: All the data resides on just one single 
node; maybe that info is important, because we know on which node the 
data is (I know that Cassandra internally is applying the same Hashing - 
Voodoo as if there were 1k nodes, but maybe this is important anyways).


Anyways: I do not really care if a query or effort to find this 
information is sub-optimal or very "expensive" in terms of efficiency 
or system load, since this isn't something that I need to extract on a 
regular basis, but only once. Due to that, it doesn't need to be optimal 
or efficient; I also do not care if it blocks the node for several 
hours, since Cassandra is only working on this single request. I really 
need this info (most recent "insertdate") only once.

Is there, considering this, a way to do that?

> Because you didn't provide a signalid and monthyear, it doesn't know
> which machine in your cluster to use to start the query.

I know this already; thanks for confirming that I got this correct! But 
what do I do then if I do not know all "signalid"s? How can I learn them?


Is it maybe possible to get a full list of all "signalid"s? Or is it 
possible to "re-arrange" the data in the cluster or something that 
enables me to learn what's the most recent "insertdate"?
I really do not care if I need to do some expensive copy-all-data 
move, but I do not know what is possible and how to do that.


Best regards,
Marc Richter

On 21.04.20 19:20, Jeff Jirsa wrote:



On Tue, Apr 21, 2020 at 6:20 AM Marc Richter wrote:


Hi everyone,

I'm very new to Cassandra. I have, however, some experience with SQL.


The biggest thing to remember is that Cassandra is designed to scale out 
to massive clusters - like thousands of instances. To do that, you can't 
assume it's ever ok to read all of the data, because that doesn't scale. 
So cassandra takes shortcuts / optimizations to make it possible to 
ADDRESS all of that data, but not SCAN it.



I need to extract some information from a Cassandra database that has
the following table definition:

CREATE TABLE tagdata.central (
signalid int,
monthyear int,
fromtime bigint,
totime bigint,
avg decimal,
insertdate bigint,
max decimal,
min decimal,
readings text,
PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
)


What your primary key REALLY MEANS is:

The database on reads and writes will hash(signalid+monthyear) to find 
which hosts have the data, then


In each data file, the data for a given (signalid,monthyear) is stored 
sorted by fromtime and totime
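A toy model of that placement, just to make the two steps concrete. The `hash()` call and node count are stand-ins, not Cassandra's actual Murmur3 token ring:

```python
NODES = 4

def node_for(signalid, monthyear):
    # stand-in for: hash the partition key -> token -> replica lookup
    return hash((signalid, monthyear)) % NODES

# Within one partition, rows are kept sorted by the clustering
# columns (fromtime, totime):
rows = [
    (7, 201904, 300, 400),   # (signalid, monthyear, fromtime, totime)
    (7, 201904, 100, 200),
    (7, 201904, 100, 150),
]
partition_order = sorted(rows, key=lambda r: (r[2], r[3]))
# fromtime 100 / totime 150 first, then 100/200, then 300/400
```

Any query that supplies the partition key goes straight to one node and reads rows in that clustering order; any query that does not would have to visit every node, which is what Cassandra refuses to do implicitly.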


The database is already of round about 260 GB in size.
I now need to know what is the most recent entry in it; the correct
column to learn this would be "insertdate".

In SQL I would do something like this:

SELECT insertdate FROM tagdata.central
ORDER BY insertdate DESC LIMIT 1;

In CQL, however, I just can't get it to work.

What I have tried already is this:

SELECT insertdate FROM "tagdata.central"
ORDER BY insertdate DESC LIMIT 1;


Because you didn't provide a signalid and monthyear, it doesn't know 
which machine in your cluster to use to start the query.



But this gives me an error:
ERROR: ORDER BY is only supported when the partition key is restricted
by an EQ or an IN.


Because it's designed for potentially petabytes of data per cluster, it 
doesn't believe you really want to walk all the data and order ALL of 
it. Instead, it assumes that when you need to use an ORDER BY, you're 
going to have some very small piece of data - confined to a single 
signalid/monthyear pair. And even then, the ORDER is going to assume 
that you're ordering it by the clustering keys you've defined - fromtime 
first, and then totime.


So you can do

  SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime ASC
And you can do

  SELECT ... WHERE signalid=? and monthyear=? ORDER BY fromtime DESC

And you can do ranges:

  SELECT ... WHERE signalid=? and monthyear=? AND fromtime >= ? ORDER BY 
fromtime DESC


But you have to work within the boundaries of how the data is stored. 
It's stored grouped by signalid+monthyear, and then sorted by fromtime, 
and 

Re: Issues, understanding how CQL works

2020-04-22 Thread Pekka Enberg
Hi Marc,

On Tue, Apr 21, 2020 at 4:20 PM Marc Richter  wrote:

> The database is already of round about 260 GB in size.
> I now need to know what is the most recent entry in it; the correct
> column to learn this would be "insertdate".
>
> In SQL I would do something like this:
>
> SELECT insertdate FROM tagdata.central
> ORDER BY insertdate DESC LIMIT 1;
>
> In CQL, however, I just can't get it to work.
>

As others have already pointed out, you need to design your data model to
support the queries you need. CQL is not SQL and you cannot query the data
in arbitrary ways if your data model does not support it, at least not
efficiently.
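To make that concrete with a hypothetical sketch of query-first modeling (the table and its bucket column are illustrative assumptions, not something from this thread): if "latest insertdate" were a recurring query, a companion table ordered by it could serve the query directly:

```cql
-- Hypothetical companion table, maintained at write time, that makes
-- "most recent insertdate" a single partition read; bucket is a small
-- fixed set of values used to spread writes across partitions.
CREATE TABLE tagdata.central_by_insertdate (
    bucket int,
    insertdate bigint,
    signalid int,
    monthyear int,
    PRIMARY KEY ((bucket), insertdate, signalid, monthyear)
) WITH CLUSTERING ORDER BY (insertdate DESC, signalid ASC, monthyear ASC);

-- The newest entry in a bucket is the first clustering row:
SELECT insertdate FROM tagdata.central_by_insertdate
WHERE bucket = 0 LIMIT 1;
```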

Although the context is DynamoDB, I have found the following presentation
by Rick Houlihan to be pretty good on this topic and more or less
applicable to Cassandra. The part about "NoSQL data modeling" starts at the
22:45 mark:

https://www.youtube.com/watch?v=HaEPXoXVf2k&feature=youtu.be&t=1363

Hope this helps!

- Pekka