Re: [HELP] Cassandra 4.1.1 Repeated Bootstrapping Failure

2023-09-11 Thread Bowen Song via user

Hi Scott,


Thank you for pointing this out. I found it too, but I deemed it
irrelevant for the following reasons:


 * it was fixed in 4.1.1, as you have correctly pointed out; and
 * the error message is slightly different, "writevAddresses" vs
   "writeAddress"; and
 * in that ticket, streaming got stuck for 15 minutes without any
   streaming-related logs, whereas in my case everything worked fine up
   until it suddenly timed out.

Therefore I did not mention it in the email.


Regards,

Bowen


On 11/09/2023 22:24, C. Scott Andreas wrote:

Bowen, thanks for reaching out.

My mind immediately jumped to a ticket which has very similar 
pathology: "CASSANDRA-18110 
<https://issues.apache.org/jira/browse/CASSANDRA-18110>: Streaming 
progress virtual table lock contention can trigger TCP_USER_TIMEOUT 
and fail streaming" -- but I see this was fixed in 4.1.1.


On Sep 11, 2023, at 2:09 PM, Bowen Song via user wrote:



  *Description*

When adding a new node to an existing cluster, the new node 
bootstrapping fails with the 
"io.netty.channel.unix.Errors$NativeIoException: writeAddress(..) 
failed: Connection timed out" error from the streaming source node. 
Resuming the bootstrap with "nodetool bootstrap resume" works, but 
the resumed bootstrap can fail too. We often need to run "nodetool 
bootstrap resume" a couple of times to complete the bootstrapping on 
a joining node.



  Steps that produced the error

(I'm hesitant to say "step to reproduce", because I failed to 
reproduce the error on a testing cluster)
Install Cassandra 4.1.1 on new servers, using two of the existing 
nodes as seed nodes, start the new node and let it join the cluster. 
Watch the logs.



  Environment

All nodes, existing or new, have the same software versions as below.

Cassandra: version 4.1.1
Java: OpenJDK 11
OS: Debian 11

Each existing node has a 1TB SSD, 64GB memory and a 6-core CPU, with
num_tokens set to 4.
Each new node has a 2TB SSD, 128GB memory and a 16-core CPU, with
num_tokens set to 8.


Cassandra is in a single DC, single rack setup with about 130 nodes, 
and all non-system keyspaces have RF=3


Relevant config options:

  stream_throughput_outbound: 15MiB/s
  streaming_connections_per_host: 2
  auto_bootstrap: not set, default to true
  internode_tcp_user_timeout: not set, default to 30 seconds
  internode_streaming_tcp_user_timeout: not set, default to 5 minutes
  streaming_keep_alive_period: not set, default to 5 minutes
  streaming_state_expires: not set, default to 3 days
  streaming_state_size: not set, default to 40MiB
  streaming_stats_enabled: not set, default to true
  uuid_sstable_identifiers_enabled: true (turned on after
   upgrading to 4.1 last year)


  What we have tried

*Tried*: checking the hardware and network
*Result*: everything appears to be fine

*Tried*: Google searching for the error message 
"io.netty.channel.unix.Errors$NativeIoException: writeAddress(..) 
failed: Connection timed out"
*Result*: only one matching result was found, and it points to 
CASSANDRA-16143 
<https://issues.apache.org/jira/browse/CASSANDRA-16143>. That 
certainly doesn't apply in our case, as it was fixed in 4.0, and I 
also don't believe our data centre grade SSDs are that slow.


*Tried*: reducing the stream_throughput_outbound from 30 to 15 MiB/s
*Result*: did not help, no sign of any improvement

*Tried*: analysing the logs from the joining node and the streaming
source nodes
*Result*: the error says the write connection timed out on the 
sending end, but a few seconds before that, both sending and 
receiving ends of the connection were still communicating with each 
other. I couldn't make sense of it.


*Tried*: bootstrapping a different node of the same spec
*Result*: same error reproduced

*Tried*: attempting to reproduce the error on a testing cluster
*Result*: unable to reproduce this error on a smaller testing cluster
with fewer nodes, less powerful hardware, the same Cassandra, Java and OS
versions, the same config, the same schema, less data and the same mixed
vnode counts.


*Tried*: keep retrying with "nodetool bootstrap resume"
*Result*: this works and unblocked us from adding new nodes to the 
cluster, but this obviously is not how it should be done.



  What do I expect from posting this

I suspect this is a bug in Cassandra, but I lack the evidence to support
that, and I lack the expertise to debug Cassandra (or any other Java
application).
It would be much appreciated if anyone could offer me some help on
this, or point me in a direction that may lead to a solution.



  Relevant logs

Note: IP addresses, keyspace and table names are redacted. The IP
address ending in 111 is the joining node, and the IP address ending
in 182 was one of the streaming source nodes.


The logs from the joining node (IP: xxx.xxx

Re: [HELP] Cassandra 4.1.1 Repeated Bootstrapping Failure

2023-09-11 Thread C. Scott Andreas

Bowen, thanks for reaching out.

My mind immediately jumped to a ticket which has very similar pathology:
"CASSANDRA-18110: Streaming progress virtual table lock contention can trigger
TCP_USER_TIMEOUT and fail streaming" -- but I see this was fixed in 4.1.1.

On Sep 11, 2023, at 2:09 PM, Bowen Song via user wrote:
[quoted message trimmed; the original posting appears in full below]

[HELP] Cassandra 4.1.1 Repeated Bootstrapping Failure

2023-09-11 Thread Bowen Song via user


 *Description*

When adding a new node to an existing cluster, the new node 
bootstrapping fails with the 
"io.netty.channel.unix.Errors$NativeIoException: writeAddress(..) 
failed: Connection timed out" error from the streaming source node. 
Resuming the bootstrap with "nodetool bootstrap resume" works, but the 
resumed bootstrap can fail too. We often need to run "nodetool bootstrap 
resume" a couple of times to complete the bootstrapping on a joining node.



 Steps that produced the error

(I'm hesitant to say "step to reproduce", because I failed to reproduce 
the error on a testing cluster)
Install Cassandra 4.1.1 on new servers, using two of the existing nodes 
as seed nodes, start the new node and let it join the cluster. Watch the 
logs.



 Environment

All nodes, existing or new, have the same software versions as below.

   Cassandra: version 4.1.1
   Java: OpenJDK 11
   OS: Debian 11

Each existing node has a 1TB SSD, 64GB memory and a 6-core CPU, with
num_tokens set to 4.
Each new node has a 2TB SSD, 128GB memory and a 16-core CPU, with
num_tokens set to 8.


Cassandra is in a single DC, single rack setup with about 130 nodes, and 
all non-system keyspaces have RF=3


Relevant config options:

  stream_throughput_outbound: 15MiB/s
  streaming_connections_per_host: 2
  auto_bootstrap: not set, default to true
  internode_tcp_user_timeout: not set, default to 30 seconds
  internode_streaming_tcp_user_timeout: not set, default to 5 minutes
  streaming_keep_alive_period: not set, default to 5 minutes
  streaming_state_expires: not set, default to 3 days
  streaming_state_size: not set, default to 40MiB
  streaming_stats_enabled: not set, default to true
  uuid_sstable_identifiers_enabled: true (turned on after upgrading
   to 4.1 last year)


 What we have tried

*Tried*: checking the hardware and network
*Result*: everything appears to be fine

*Tried*: Google searching for the error message 
"io.netty.channel.unix.Errors$NativeIoException: writeAddress(..) 
failed: Connection timed out"
*Result*: only one matching result was found, and it points to 
CASSANDRA-16143 <https://issues.apache.org/jira/browse/CASSANDRA-16143>. 
That certainly doesn't apply in our case, as it was fixed in 4.0, and I 
also don't believe our data centre grade SSDs are that slow.


*Tried*: reducing the stream_throughput_outbound from 30 to 15 MiB/s
*Result*: did not help, no sign of any improvement
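
For reference, the same cap can also be adjusted on a live node without a
restart. A sketch only: nodetool has historically interpreted this value in
megabits per second, so double-check the units on 4.1 before relying on it.

    nodetool getstreamthroughput        # show the current cap
    nodetool setstreamthroughput 120    # illustrative value; confirm units first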

*Tried*: analysing the logs from the joining node and the streaming source
nodes
*Result*: the error says the write connection timed out on the sending 
end, but a few seconds before that, both sending and receiving ends of 
the connection were still communicating with each other. I couldn't make 
sense of it.


*Tried*: bootstrapping a different node of the same spec
*Result*: same error reproduced

*Tried*: attempting to reproduce the error on a testing cluster
*Result*: unable to reproduce this error on a smaller testing cluster
with fewer nodes, less powerful hardware, the same Cassandra, Java and OS
versions, the same config, the same schema, less data and the same mixed
vnode counts.


*Tried*: keep retrying with "nodetool bootstrap resume"
*Result*: this works and unblocked us from adding new nodes to the 
cluster, but this obviously is not how it should be done.
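
A minimal sketch of that retry loop, assuming it runs on the joining node
with nodetool on PATH, and that "nodetool bootstrap resume" blocks until the
resumed streaming completes or fails:

    #!/usr/bin/env bash
    # keep resuming the failed bootstrap until the node leaves JOINING mode
    until nodetool netstats | grep -q "Mode: NORMAL"; do
        nodetool bootstrap resume || true   # retry again if streaming fails
        sleep 60                            # brief pause before re-checking
    done
    echo "Bootstrap completed."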



 What do I expect from posting this

I suspect this is a bug in Cassandra, but I lack the evidence to support
that, and I lack the expertise to debug Cassandra (or any other Java
application).
It would be much appreciated if anyone could offer me some help on this,
or point me in a direction that may lead to a solution.



 Relevant logs

Note: IP addresses, keyspace and table names are redacted. The IP address
ending in 111 is the joining node, and the IP address ending in 182 was
one of the streaming source nodes.


The logs from the joining node (IP: xxx.xxx.xxx.111):

   DEBUG [Stream-Deserializer-/xxx.xxx.xxx.182:7000-e0e09450]
   2023-09-09 15:59:13,555 StreamDeserializingTask.java:74 - [Stream
   #69de5e80-4f21-11ee-abc5-1de0bb481b0e channel: e0e09450] Received
   Prepare SYNACK ( 440 files}
   INFO  [Stream-Deserializer-/xxx.xxx.xxx.182:7000-e0e09450]
   2023-09-09 15:59:13,556 StreamResultFuture.java:187 - [Stream
   #69de5e80-4f21-11ee-abc5-1de0bb481b0e ID#0] Prepare completed.
   Receiving 440 files(38.941GiB), sending 0 files(0.000KiB)
   DEBUG [Stream-Deserializer-/xxx.xxx.xxx.182:7000-e0e09450]
   2023-09-09 15:59:13,556 StreamCoordinator.java:148 - Connecting next
   session 69de5e80-4f21-11ee-abc5-1de0bb481b0e with /95.217.36.91:7000.
   INFO  [Stream-Deserializer-/xxx.xxx.xxx.182:7000-e0e09450]
   2023-09-09 15:59:13,556 StreamSession.java:368 - [Stream
   #69de5e80-4f21-11ee-abc5-1de0bb481b0e] Starting streaming to
   95.217.36.91:7000
   DEBUG [Stream-Deserializer-/xxx.xxx.xxx.182:7000-e0e0945

Re: Help determining pending compactions

2022-11-07 Thread Richard Hesse
Thanks for the tip Eric. We're actually on 3.2 and the issue isn't with the
Reaper. The issue is with Cassandra. It will report that a table has
pending compactions, but it will never actually start compacting. The
pending number stays at that level until we run a manual compaction.
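
For anyone wondering how to pin down which table is stuck: the per-table
PendingCompactions gauge is exposed over JMX, so something like the sketch
below can be pointed at each suspect table (jmxterm is just one example of a
JMX client; the keyspace and table names are placeholders):

    # query the table-level pending-compactions gauge (default JMX port 7199)
    echo "get -b org.apache.cassandra.metrics:type=Table,keyspace=my_ks,scope=my_table,name=PendingCompactions Value" \
      | java -jar jmxterm-uber.jar -l localhost:7199 -n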

-richard


On Mon, Nov 7, 2022 at 8:53 AM Eric Ferrenbach <
eric.ferrenb...@milliporesigma.com> wrote:

> We had issues where Reaper would never actually start some repairs.  The
> GUI would say RUNNING but the progress would be 0/.
>
> Datastax support said there is a bug and recommended upgrading to 3.2.
>
> Upgrading Reaper to 3.2 resolved our issue.
>
>
>
> Hope this helps.
>
> Eric
>
>
>
>
>
>
>
> *From:* Richard Hesse 
> *Sent:* Sunday, October 30, 2022 12:07 PM
> *To:* user@cassandra.apache.org
> *Subject:* Help determining pending compactions
>
>
>
>
>
>
> Hi, I'm hoping to get some help with a vexing issue with one of our
> keyspaces. During Reaper repair sessions, one keyspace will end up with
> hanging, non-started compactions. That is, the number of compactions as
> reported by nodetool compactionstats stays flat and there are no running
> compactions. Is there a way to determine which tables Cassandra is stuck on
> here?
>
>
>
> Looking at graphs of pending compactions during Reaper sessions, the
> number of compactions will shoot up (as expected). The number will work its
> way down, and sometimes it will stop and plateau at a fixed level. A full
> compaction will get things going again, but we prefer to avoid those (even
> with the -s option).
>
>
>
> I realize there are various compaction tuning parameters regarding minimum
> age, tombstone percentage, etc but I need to know which sstables to look at
> first before blindly changing values. These are leveled compaction strategy
> tables FWIW.
>
>
>
> TIA!
>
>
>
> -richard
>
>
>
>


RE: Help determining pending compactions

2022-11-07 Thread Eric Ferrenbach
We had issues where Reaper would never actually start some repairs.  The GUI 
would say RUNNING but the progress would be 0/.
Datastax support said there is a bug and recommended upgrading to 3.2.
Upgrading Reaper to 3.2 resolved our issue.

Hope this helps.
Eric



From: Richard Hesse 
Sent: Sunday, October 30, 2022 12:07 PM
To: user@cassandra.apache.org
Subject: Help determining pending compactions



Hi, I'm hoping to get some help with a vexing issue with one of our keyspaces. 
During Reaper repair sessions, one keyspace will end up with hanging, 
non-started compactions. That is, the number of compactions as reported by 
nodetool compactionstats stays flat and there are no running compactions. Is 
there a way to determine which tables Cassandra is stuck on here?

Looking at graphs of pending compactions during Reaper sessions, the number of 
compactions will shoot up (as expected). The number will work its way down, and 
sometimes it will stop and plateau at a fixed level. A full compaction will get 
things going again, but we prefer to avoid those (even with the -s option).

I realize there are various compaction tuning parameters regarding minimum age, 
tombstone percentage, etc but I need to know which sstables to look at first 
before blindly changing values. These are leveled compaction strategy tables 
FWIW.

TIA!

-richard





Re: Help determining pending compactions

2022-10-30 Thread Richard Hesse
Sorry about that. 4.0.6

On Sun, Oct 30, 2022, 11:19 AM Dinesh Joshi  wrote:

> It would be helpful if you could tell us what version of Cassandra you’re
> using?
>
> Dinesh
>
> > On Oct 30, 2022, at 10:07 AM, Richard Hesse  wrote:
> >
> > 
> > Hi, I'm hoping to get some help with a vexing issue with one of our
> keyspaces. During Reaper repair sessions, one keyspace will end up with
> hanging, non-started compactions. That is, the number of compactions as
> reported by nodetool compactionstats stays flat and there are no running
> compactions. Is there a way to determine which tables Cassandra is stuck on
> here?
> >
> > Looking at graphs of pending compactions during Reaper sessions, the
> number of compactions will shoot up (as expected). The number will work its
> way down, and sometimes it will stop and plateau at a fixed level. A full
> compaction will get things going again, but we prefer to avoid those (even
> with the -s option).
> >
> > I realize there are various compaction tuning parameters regarding
> minimum age, tombstone percentage, etc but I need to know which sstables to
> look at first before blindly changing values. These are leveled compaction
> strategy tables FWIW.
> >
> > TIA!
> >
> > -richard
>


Re: Help determining pending compactions

2022-10-30 Thread Dinesh Joshi
It would be helpful if you could tell us what version of Cassandra you’re using?

Dinesh

> On Oct 30, 2022, at 10:07 AM, Richard Hesse  wrote:
> 
> 
> Hi, I'm hoping to get some help with a vexing issue with one of our 
> keyspaces. During Reaper repair sessions, one keyspace will end up with 
> hanging, non-started compactions. That is, the number of compactions as 
> reported by nodetool compactionstats stays flat and there are no running 
> compactions. Is there a way to determine which tables Cassandra is stuck on 
> here?
> 
> Looking at graphs of pending compactions during Reaper sessions, the number 
> of compactions will shoot up (as expected). The number will work its way 
> down, and sometimes it will stop and plateau at a fixed level. A full 
> compaction will get things going again, but we prefer to avoid those (even 
> with the -s option).
> 
> I realize there are various compaction tuning parameters regarding minimum 
> age, tombstone percentage, etc but I need to know which sstables to look at 
> first before blindly changing values. These are leveled compaction strategy 
> tables FWIW.
> 
> TIA!
> 
> -richard


Help determining pending compactions

2022-10-30 Thread Richard Hesse
Hi, I'm hoping to get some help with a vexing issue with one of our
keyspaces. During Reaper repair sessions, one keyspace will end up with
hanging, non-started compactions. That is, the number of compactions as
reported by nodetool compactionstats stays flat and there are no running
compactions. Is there a way to determine which tables Cassandra is stuck on
here?

Looking at graphs of pending compactions during Reaper sessions, the number
of compactions will shoot up (as expected). The number will work its way
down, and sometimes it will stop and plateau at a fixed level. A full
compaction will get things going again, but we prefer to avoid those (even
with the -s option).

I realize there are various compaction tuning parameters regarding minimum
age, tombstone percentage, etc but I need to know which sstables to look at
first before blindly changing values. These are leveled compaction strategy
tables FWIW.

TIA!

-richard


Re: Need urgent help in cassandra modelling

2022-03-19 Thread MyWorld
Anyone have any clue?


On Wed, Mar 9, 2022 at 7:01 PM MyWorld  wrote:

> Hi all,
> Some problems with the display. Resending my query-
>
> I am modelling a table for a shopping site where we store products for
> customers and their data in json. Max prods for a customer is 10k.
>
> We initially designed this table with the architecture below:
> cust_prods(cust_id bigint PK, prod_id bigint CK, prod_data text).
> cust_id is partition key, prod_id is clustering key and combination of
> these is unique.
>
> Now, we have a requirement to store one ordering column "prod_order
> (bigint)" which is pre-calculated from service end after complex
> calculation.
> Further we now want to want to display products in pagination (100 per
> page in order of prod_order)
>
> Please suggest the new architecture for the table "cust_prods" so that :
> 1. We could be still able to update a single prod info based on cust_id +
> prod_id
> 2. We could store data in order of "prod_order" so that fetching in
> limit(pagination) becomes easy.
>
> Note : prod_order could change for a product frequently
>
> Regards,
> Ashish
>
> On Wed, Mar 9, 2022 at 6:55 PM MyWorld  wrote:
>
>> Hi all,
>>
>> I am modelling a table for a shopping site where we store products for
>> customers and their data in json. Max prods for a customer is 10k.
>>
>> >>We initially designed this table with the architecture below:
>> cust_prods(cust_id bigint PK, prod_id bigint CK, prod_data text).
>> cust_id is partition key, prod_id is clustering key and combination of
>> these is unique.
>>
>> >>Now, we have a requirement to store one ordering column "prod_order
>> (bigint)" which is pre-calculated from service end after complex
>> calculation.
>> Further we now want to want to display products in pagination (100 per
>> page in order of prod_order)
>>
>> Please suggest the new architecture for the table "cust_prods" so that :
>> 1. We could be still able to update a single prod info based on cust_id +
>> prod_id
>> 2. We could store data in order of "prod_order" so that fetching in
>> limit(pagination) becomes easy.
>>
>> Note : prod_order could change for a product frequently
>>
>> Regards,
>> Ashish
>>
>


Re: Need urgent help in cassandra modelling

2022-03-09 Thread MyWorld
Hi all,
Some problems with the display. Resending my query-

I am modelling a table for a shopping site where we store products for
customers and their data in json. Max prods for a customer is 10k.

We initially designed this table with the architecture below:
cust_prods(cust_id bigint PK, prod_id bigint CK, prod_data text).
cust_id is partition key, prod_id is clustering key and combination of
these is unique.

Now, we have a requirement to store one ordering column "prod_order
(bigint)" which is pre-calculated from service end after complex
calculation.
Further, we now want to display products with pagination (100 per page,
in order of prod_order).

Please suggest a new architecture for the table "cust_prods" so that:
1. We are still able to update a single product's info based on cust_id +
prod_id
2. We can store data in order of "prod_order" so that fetching with a
limit (pagination) becomes easy.

Note : prod_order could change for a product frequently

Regards,
Ashish

On Wed, Mar 9, 2022 at 6:55 PM MyWorld  wrote:

> Hi all,
>
> I am modelling a table for a shopping site where we store products for
> customers and their data in json. Max prods for a customer is 10k.
>
> >>We initially designed this table with the architecture below:
> cust_prods(cust_id bigint PK, prod_id bigint CK, prod_data text).
> cust_id is partition key, prod_id is clustering key and combination of
> these is unique.
>
> >>Now, we have a requirement to store one ordering column "prod_order
> (bigint)" which is pre-calculated from service end after complex
> calculation.
> Further we now want to want to display products in pagination (100 per
> page in order of prod_order)
>
> Please suggest the new architecture for the table "cust_prods" so that :
> 1. We could be still able to update a single prod info based on cust_id +
> prod_id
> 2. We could store data in order of "prod_order" so that fetching in
> limit(pagination) becomes easy.
>
> Note : prod_order could change for a product frequently
>
> Regards,
> Ashish
>


Need urgent help in cassandra modelling

2022-03-09 Thread MyWorld
Hi all,

I am modelling a table for a shopping site where we store products for
customers and their data in json. Max prods for a customer is 10k.

>>We initially designed this table with the architecture below:
cust_prods(cust_id bigint PK, prod_id bigint CK, prod_data text).
cust_id is partition key, prod_id is clustering key and combination of
these is unique.

>>Now, we have a requirement to store one ordering column "prod_order
(bigint)" which is pre-calculated from service end after complex
calculation.
Further, we now want to display products with pagination (100 per page,
in order of prod_order).

Please suggest a new architecture for the table "cust_prods" so that:
1. We are still able to update a single product's info based on cust_id +
prod_id
2. We can store data in order of "prod_order" so that fetching with a
limit (pagination) becomes easy.

Note : prod_order could change for a product frequently
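
One possible layout, sketched purely for illustration (the keyspace name and
details are placeholders, not a recommendation from this thread): keep the
existing table for point updates by cust_id + prod_id, and maintain a second,
query-oriented table clustered by prod_order that the application rewrites
(delete the old row, insert the new one) whenever prod_order changes. Paging
within a customer can then use the driver's native paging, or an explicit
range on prod_order.

    # illustrative schema sketch applied via cqlsh; "shop" is a placeholder keyspace
    cqlsh -e "
    CREATE TABLE IF NOT EXISTS shop.cust_prods (
        cust_id    bigint,
        prod_id    bigint,
        prod_order bigint,
        prod_data  text,
        PRIMARY KEY ((cust_id), prod_id)              -- point updates by cust_id + prod_id
    );
    CREATE TABLE IF NOT EXISTS shop.cust_prods_by_order (
        cust_id    bigint,
        prod_order bigint,
        prod_id    bigint,
        prod_data  text,
        PRIMARY KEY ((cust_id), prod_order, prod_id)  -- paged reads in prod_order
    ) WITH CLUSTERING ORDER BY (prod_order ASC, prod_id ASC);"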

Regards,
Ashish


Re: TWCS repair and compact help

2021-06-29 Thread Kane Wilson
>
> Oh.  So our data is all messed up now because of the “nodetool compact” I
> ran.
>
>
>
> Hi Erick.  Thanks for the quick reply.
>
>
>
> I just want to be sure about compact.  I saw Cassandra will do compaction
> by itself even when I do not run “nodetool compact” manually (nodetool
> compactionstats always has some compaction running).  So this automatic
> compact by Cassandra will clean up the tombstoned data files?
>

They won't "compact", but will rather just be deleted once all the data in
the file passes its expiration time.


> Another question I have is, is there a way to un-mess my messed up data
> now?
>

Not really. The easiest way would be to re-insert all your data. If you're
not having any read performance issues, you might be better off just waiting
the 7 days until the large SSTable is dropped.
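
A rough way to watch for that, using the same sstablemetadata fields quoted
further down in this thread (the data path and file-name pattern are
illustrative):

    # print expiry-related metadata for every SSTable of the table
    for f in /var/lib/cassandra/data/test_db/minute_rate-*/*-big-Data.db; do
        echo "== $f"
        sstablemetadata "$f" | grep -E "Maximum timestamp|max local deletion time|TTL max|droppable tombstones"
    done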

-- 
raft.so - Cassandra consulting, support, and managed services


RE: TWCS repair and compact help

2021-06-29 Thread Eric Wong
Oh.  So our data is all messed up now because of the "nodetool compact" I ran.

Hi Erick.  Thanks for the quick reply.

I just want to be sure about compact.  I saw Cassandra will do compaction by 
itself even when I do not run "nodetool compact" manually (nodetool 
compactionstats always has some compaction running).  So this automatic compact 
by Cassandra will clean up the tombstoned data files?

Another question I have is, is there a way to un-mess my messed up data now?

Thanks,
Eric


From: Erick Ramirez 
Sent: Tuesday, June 29, 2021 6:34 PM
To: user@cassandra.apache.org
Subject: Re: TWCS repair and compact help

You definitely shouldn't perform manual compactions -- you should let the 
normal compaction tasks take care of it. It is unnecessary to manually run 
compactions since it creates more problems than it solves as I've explained in 
this post -- 
https://community.datastax.com/questions/6396/. Cheers!


Re: TWCS repair and compact help

2021-06-29 Thread Gábor Auth
Hi,

On Tue, Jun 29, 2021 at 12:34 PM Erick Ramirez 
wrote:

> You definitely shouldn't perform manual compactions -- you should let the
> normal compaction tasks take care of it. It is unnecessary to manually run
> compactions since it creates more problems than it solves as I've explained
> in this post -- https://community.datastax.com/questions/6396/. Cheers!
>

Same issue here... I want to replace SizeTieredCompactionStrategy with
TimeWindowCompactionStrategy, but I cannot manage to split the existing
SSTables into daily SSTables. Any idea about it? :)
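
For the record, the strategy switch itself is just an ALTER (a sketch below;
keyspace/table names and window settings are placeholders). The hard part is
the existing SSTables: they keep whatever window their newest data falls
into, so they generally sit there until they expire or are otherwise
recompacted.

    cqlsh -e "ALTER TABLE my_ks.my_table WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1'};"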

-- 
Bye,
Auth Gábor (https://iotguru.cloud)


Re: TWCS repair and compact help

2021-06-29 Thread Erick Ramirez
You definitely shouldn't perform manual compactions -- you should let the
normal compaction tasks take care of it. It is unnecessary to manually run
compactions since it creates more problems than it solves as I've explained
in this post -- https://community.datastax.com/questions/6396/. Cheers!


TWCS repair and compact help

2021-06-29 Thread Eric Wong
Hi:

We need some help on cassandra repair and compact for a table that uses TWCS.  
We are running cassandra 4.0-rc1.  A database called test_db, biggest table 
"minute_rate", storing time-series data.  It has the following configuration:

CREATE TABLE test_db.minute_rate (
market smallint,
sin bigint,
field smallint,
slot timestamp,
close frozen,
high frozen,
low frozen,
open frozen,
PRIMARY KEY ((market, sin, field), slot)
) WITH CLUSTERING ORDER BY (slot ASC)
AND additional_write_policy = '99p'
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND cdc = false
AND comment = ''
AND compaction = {'class': 
'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 
'compaction_window_size': '4', 'compaction_window_unit': 'HOURS', 
'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND default_time_to_live = 604800
AND extensions = {}
AND gc_grace_seconds = 86400
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair = 'BLOCKING'
AND speculative_retry = '99p';


minute_rate is configured to use TWCS.  We can see that it creates about 8 data
files per day.  Each data file is relatively small, around 2GB.  So far so good.

However, when "nodetool compact test_db minute_rate" ran on the weekend, it 
seems to consolidate all small data files into one big files.  nodetool compact 
after 2 or 3 weeks we ended up having bigger and bigger data files.  eating up 
all disk space on the machine.

From what I understand about TWCS, Cassandra will simply drop the data files
when the records inside the data file are older than the default_time_to_live of
604800 (7 days).  But somehow this is not what we are seeing.  When I run
sstablemetadata on the oldest data file (sample below), I can see the tombstone
drop times all got updated, resulting in the data file never getting removed.

This leads me to think I am configuring things the wrong way.  So I want to know:
when using TWCS, do we need to repair and compact?

I saw in cassandra-reaper (we use reaper for repair) it is configured to skip 
TWCS.

Should I stop running "nodetool compact test_db minute_rate"?  If without 
"nodetool compact", will cassandra clean up the tombstoned data file?

Thanks,
Eric



# sstablemetadata na-6681303-big-Data.db
SSTable: 
/var/lib/cassandra/data/test_db/minute_rate-d7955270f31d11ea88fabb8dcc37b800/na-6681303-big
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Bloom Filter FP chance: 0.01
Minimum timestamp: 1622528285761740 (06/01/2021 06:18:05)
Maximum timestamp: 1624699757821614 (06/26/2021 09:29:17)
SSTable min local deletion time: 1624767191 (06/27/2021 04:13:11)
SSTable max local deletion time: 2147483647 (no tombstones)
Compressor: org.apache.cassandra.io.compress.LZ4Compressor
Compression ratio: 0.2753232939720765
TTL min: 0
TTL max: 604800 (7 days)
First token: -9223371506914883753 (3870:16432974:7)
Last token: 9223358359954725918 (505:23187788:6)
minClusteringValues: [2021-04-12T16:54:00.000Z]
maxClusteringValues: [2021-06-26T09:28:00.000Z]
Estimated droppable tombstones: 0.3683520383378445
SSTable Level: 0
Repaired at: 0
Pending repair: --
Replay positions covered: {}
totalColumnsSet: 17179881008
totalRows: 4413102532
Estimated tombstone drop times:
   Drop Time| Count  (%)  Histogram
   1624768080 (06/27/2021 04:28:00) | 2 (  0)
   1624771440 (06/27/2021 05:24:00) |98 (  0)
   1624777020 (06/27/2021 06:57:00) |85 (  0)
   1624781640 (06/27/2021 08:14:00) |74 (  0)
   1624786080 (06/27/2021 09:28:00) |74 (  0)
   1624790280 (06/27/2021 10:38:00) |66 (  0)
   1624794900 (06/27/2021 11:55:00) |87 (  0)
   1624800060 (06/27/2021 13:21:00) |83 (  0)
   1624804680 (06/27/2021 14:38:00) |108064 (  0)
   1624809180 (06/27/2021 15:53:00) |304148 (  0)
   1624812540 (06/27/2021 16:49:00) |133188 (  0)
   1624819440 (06/27/2021 18:44:00) |88 (  0)
   1624824060 (06/27/2021 20:01:00) |73 (  0)
   1624828080 (06/27/2021 21:08:00) |66 (  0)
   1624832520 (06/27/2021 22:22:00) | 1 (  0)
   1624835880 (06/27/2021 23:18:00) |  23578916 (  0) o
   1624839720 (06/28/2021 00:22:00) |  21783899 (  0) o
   1624843740 (06/28/2021 01:29:00) |  22758204 (  0) o
   1624848120 (06/28/2021 02:42:00) |  25237306 (  0) o
   1624853520 (06/28/2021 04:12:00) |  44003185 (  0) O.
   1624858080 (06/28/2021 05:28:00) | 145977595 (  0) O
   1624862460 (06/28/2021 06:41:00) | 331875915 (  1) OOOo
   1624866060 (06/28/2021 07:41:00) | 463284230 (  2) .
   1624869540 (06/28/2021 08:39:00) | 455732185 ( 

Is there any Cassandra genius who can help me solve the sudden startup error "Error: Could not find or load main class -ea"?

2019-12-24 Thread Nimbus Lin
To Cassandra's developers and users:
CC dimo:

 Firstly, thanks to dimo for his guidance, but as my previous mail shows, there is
no -ea variable in Cassandra's startup program and configuration.
The CentOS 6.9 OS environment also doesn't have an -ea variable. And since the
Cassandra startup fails, I can't use jinfo to check Cassandra's startup JVM
environment.
 Since the error suddenly happened after I customized the Cassandra code, and
occurs even with the re-downloaded original Cassandra 3.11.4/3, I am resending
this error to the dev mailing list. The error happens as below:
My OS environment is a VMware CentOS 6.9 VM running on Windows 10. I could
run the Cassandra 3.11.4 git-clone source by using "bin/cassandra", but after I
changed some code in Eclipse and compiled without any error, not only the
previously runnable source version but also the re-downloaded 3.11.4-bin.tar.gz
and 3.11.3 from the official website fail to start up;
they suddenly can't be run with "./bin/cassandra". The steps and logs are
as below:

[gloCalHelp.com@gloCalHelp5 apache-cassandra-3.11.4]$ ./bin/cassandra &
[1] 5872
[gloCalHelp.com@gloCalHelp5 apache-cassandra-3.11.4]$ classname is+ org.apache.cassandra.service.CassandraDaemon +CLASSPATH
is+./bin/../conf:./bin/../build/classes/main:./bin/../build/classes/thrift:./bin/../lib/airline-0.6.jar:./bin/../lib/antlr-runtime-3.5.2.jar:./bin/../lib/apache-cassandra-3.11.4.jar:./bin/../lib/apache-cassandra-thrift-3.11.4.jar:./bin/../lib/asm-5.0.4.jar:./bin/../lib/caffeine-2.2.6.jar:./bin/../lib/cassandra-driver-core-3.0.1-shaded.jar:./bin/../lib/commons-cli-1.1.jar:./bin/../lib/commons-codec-1.9.jar:./bin/../lib/commons-lang3-3.1.jar:./bin/../lib/commons-math3-3.2.jar:./bin/../lib/compress-lzf-0.8.4.jar:./bin/../lib/concurrentlinkedhashmap-lru-1.4.jar:./bin/../lib/concurrent-trees-2.4.0.jar:./bin/../lib/disruptor-3.0.1.jar:./bin/../lib/ecj-4.4.2.jar:./bin/../lib/guava-18.0.jar:./bin/../lib/HdrHistogram-2.1.9.jar:./bin/../lib/high-scale-lib-1.0.6.jar:./bin/../lib/hppc-0.5.4.jar:./bin/../lib/jackson-core-asl-1.9.13.jar:./bin/../lib/jackson-mapper-asl-1.9.13.jar:./bin/../lib/jamm-0.3.0.jar:./bin/../lib/javax.inject.jar:./bin/../lib/jbcrypt-0.3m.jar:./bin/../lib/jcl-over-slf4j-1.7.7.jar:./bin/../lib/jctools-core-1.2.1.jar:./bin/../lib/jflex-1.6.0.jar:./bin/../lib/jna-4.2.2.jar:./bin/../lib/joda-time-2.4.jar:./bin/../lib/json-simple-1.1.jar:./bin/../lib/jstackjunit-0.0.1.jar:./bin/../lib/libthrift-0.9.2.jar:./bin/../lib/log4j-over-slf4j-1.7.7.jar:./bin/../lib/logback-classic-1.1.3.jar:./bin/../lib/logback-core-1.1.3.jar:./bin/../lib/lz4-1.3.0.jar:./bin/../lib/metrics-core-3.1.5.jar:./bin/../lib/metrics-jvm-3.1.5.jar:./bin/../lib/metrics-logback-3.1.5.jar:./bin/../lib/netty-all-4.0.44.Final.jar:./bin/../lib/ohc-core-0.4.4.jar:./bin/../lib/ohc-core-j8-0.4.4.jar:./bin/../lib/reporter-config3-3.0.3.jar:./bin/../lib/reporter-config-base-3.0.3.jar:./bin/../lib/sigar-1.6.4.jar:./bin/../lib/slf4j-api-1.7.7.jar:./bin/../lib/snakeyaml-1.11.jar:./bin/../lib/snappy-java-1.1.1.7.jar:./bin/../lib/snowball-stemmer-1.3.0.581.1.jar:./bin/../lib/ST4-4.0.8.jar:./bin/../lib/stream-2.5.2.jar:./bin/../lib/thrift-server-0.3.7.jar:./bin/../lib/jsr223//.jar

Error: Could not find or load main class -ea
[1]+ Done ./bin/cassandra
[gloCalHelp.com@gloCalHelp5 apache-cassandra-3.11.4]$ free -m
             total       used       free     shared    buffers     cached
Mem:          4567        801       3766          5         20        190
-/+ buffers/cache:         590       3977
Swap:         1031          0       1031
and the main class CassandraDaemon and the classpath are there, as
":./bin/../lib/apache-cassandra-3.11.4.jar:" shows.

I would very much appreciate your guidance. Thank you in advance.

Sincerely yours,
Georgelin
www_8ems_...@sina.com
website: gloCalHelp.com
mobile:0086 180 5986 1565


-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org
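
For anyone hitting the same failure: "Could not find or load main class -ea"
usually means the java launcher variable ended up empty, so the first JVM
flag (-ea) is parsed as the class name. A few quick checks, sketched here as
a debugging aid rather than a known fix:

    which java && java -version                       # confirm a JDK is on the PATH
    echo "JAVA_HOME=$JAVA_HOME"                       # the launch scripts derive the java binary from this
    bash -x bin/cassandra -f 2>&1 | grep -m1 " exec"  # print the traced exec line, i.e. the exact java command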



Re: Need help on dealing with Cassandra robustness and zombie data

2019-07-01 Thread Jeff Jirsa
What you’re describing is likely impossible to do in cassandra the way you’re 
thinking

The only practical way to do it is extending gcgs and making the tombstone 
reads less expensive (ordering the clustering columns so you’re not scanning 
the tombstones, or breaking the partitions into buckets so you can cap the 
tombstones per partition), or using a strongly consistent database (I’d 
probably use something like sharded MySQL or similar) 
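
A tiny illustration of the bucketing idea (table and column names are
placeholders): folding a coarse time bucket into the partition key caps how
many tombstones any single read has to scan, because each query only touches
one bucket.

    cqlsh -e "
    CREATE TABLE IF NOT EXISTS my_ks.events_by_bucket (
        entity_id  bigint,
        day_bucket int,          -- e.g. days since epoch; part of the partition key
        ts         timestamp,
        payload    text,
        PRIMARY KEY ((entity_id, day_bucket), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC);"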

> On Jul 1, 2019, at 6:45 AM, yuping wang  wrote:
> 
> Thank you; very helpful.
> But we do have some difficulties
> #1 Cassandra process itself didn’t go down when marked as “DN”... (the node 
> itself might just be temporary having some hiccup and not reachable )... so 
> would not auto-start still help?
> #2 we can’t set longer gc grace because we are very sensitive to latency ... 
> and we have a lot data in and data out... so we can’t afford keep that large 
> tombstone
> #3 the question what is the reliable way to detect change of node status? We 
> tried to use a crontab job to poll nodestatus every 5 minutes... but we still 
> end up missing some change of status especially if the node is bouncing up 
> and down... also by the time we detect and try to replace node permanently, 
> we might already exceeded that grace period.
> 
> Thanks again,
> Yuping 
> 
> On Jul 1, 2019, at 9:02 AM, Rhys Campbell 
>  wrote:
> 
> #1 Set the cassandra service to not auto-start.
> #2 Longer gc_grace time would help
> #3 Rebootstrap?
> 
> If the node doesn't come back within gc_grace,_seconds, remove the node, wipe 
> it, and bootstrap it again.
> 
> https://docs.datastax.com/en/archived/cassandra/2.0/cassandra/dml/dml_about_deletes_c.html
> 
> 
> 
yuping wang  wrote on Mon., 1 July 2019, 13:33:
>> Hi all,
>> 
>>   Sorry for the interruption. But I need help.
>> 
>> 
>>Due to specific reasons of our use case,  we have gc grace on the order 
>> of 10 minutes instead of default 10 days. Since we have a large amount of 
>> nodes in our Cassandra fleet, not surprisingly, we encounter occasionally  
>> node status going from up to down and up again. The problem is when the down 
>> node rejoins the cluster after 15 minutes, it automatically adds already 
>> deleted data back and causing zombie data.
>> our questions:
>> Is there a way to not allow a down node to rejoin the cluster?
>> or is there a way to configure rejoining node not adding stale data back 
>> regardless of how long the node is down before rejoining
>> or is there a way to auto clean up the data when rejoining ?
>> We know adding those data back is a conservative approach to avoid data loss 
>> but in our specific case, we are not worried about deleted data being 
>> revived we don’t have such use case. We really need a non-defaul option 
>> to never add back deleted data on rejoining nodes.
>> this functionality will ultimately be a deciding factor on whether we can 
>> continue with Cassandra.
>> 
>> Thanks again,


Re: Need help on dealing with Cassandra robustness and zombie data

2019-07-01 Thread yuping wang
Thank you; very helpful.
But we do have some difficulties
#1 The Cassandra process itself didn’t go down when marked as “DN”... (the node
itself might just be temporarily having some hiccup and not be reachable)... so
would disabling auto-start still help?
#2 We can’t set a longer gc grace because we are very sensitive to latency ...
and we have a lot of data in and data out... so we can’t afford to keep that
many tombstones.
#3 The question is: what is a reliable way to detect a change of node status? We
tried using a crontab job to poll node status every 5 minutes... but we still
end up missing some changes of status, especially if the node is bouncing up
and down... also, by the time we detect it and try to replace the node permanently,
we might already have exceeded that grace period.

Thanks again,
Yuping 

On Jul 1, 2019, at 9:02 AM, Rhys Campbell 
 wrote:

#1 Set the cassandra service to not auto-start.
#2 Longer gc_grace time would help
#3 Rebootstrap?

If the node doesn't come back within gc_grace,_seconds, remove the node, wipe 
it, and bootstrap it again.

https://docs.datastax.com/en/archived/cassandra/2.0/cassandra/dml/dml_about_deletes_c.html



yuping wang  wrote on Mon., 1 July 2019, 13:33:
> Hi all,
> 
>   Sorry for the interruption. But I need help.
> 
> 
>Due to specific reasons of our use case,  we have gc grace on the order of 
> 10 minutes instead of default 10 days. Since we have a large amount of nodes 
> in our Cassandra fleet, not surprisingly, we encounter occasionally  node 
> status going from up to down and up again. The problem is when the down node 
> rejoins the cluster after 15 minutes, it automatically adds already deleted 
> data back and causing zombie data.
> our questions:
> Is there a way to not allow a down node to rejoin the cluster?
> or is there a way to configure rejoining node not adding stale data back 
> regardless of how long the node is down before rejoining
> or is there a way to auto clean up the data when rejoining ?
> We know adding those data back is a conservative approach to avoid data loss 
> but in our specific case, we are not worried about deleted data being 
> revived we don’t have such use case. We really need a non-defaul option 
> to never add back deleted data on rejoining nodes.
> this functionality will ultimately be a deciding factor on whether we can 
> continue with Cassandra.
> 
> Thanks again,


Re: Need help on dealing with Cassandra robustness and zombie data

2019-07-01 Thread Rhys Campbell
#1 Set the cassandra service to not auto-start.
#2 Longer gc_grace time would help
#3 Rebootstrap?

If the node doesn't come back within gc_grace_seconds, remove the node,
wipe it, and bootstrap it again.

https://docs.datastax.com/en/archived/cassandra/2.0/cassandra/dml/dml_about_deletes_c.html
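
A bare-bones sketch of that remove/wipe/re-bootstrap sequence (the host ID,
data paths and service name are placeholders; run the removenode step from a
healthy node and the rest on the node being rebuilt):

    nodetool status                   # note the Host ID of the DN node
    nodetool removenode <host-id>     # run from a live node in the cluster
    # then, on the node being rebuilt:
    sudo systemctl stop cassandra
    sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
    sudo systemctl start cassandra    # the node re-bootstraps and streams fresh data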



yuping wang  wrote on Mon., 1 July 2019, 13:33:

> Hi all,
>
>   Sorry for the interruption. But I need help.
>
>
>Due to specific reasons of our use case,  we have gc grace on the order
> of 10 minutes instead of default 10 days. Since we have a large amount of
> nodes in our Cassandra fleet, not surprisingly, we encounter occasionally
>  node status going from up to down and up again. The problem is when the
> down node rejoins the cluster after 15 minutes, it automatically adds
> already deleted data back and causing zombie data.
>
> our questions:
>
>1. Is there a way to not allow a down node to rejoin the cluster?
>2. or is there a way to configure rejoining node not adding stale data
>back regardless of how long the node is down before rejoining
>3. or is there a way to auto clean up the data when rejoining ?
>
> We know adding those data back is a conservative approach to avoid data
> loss but in our specific case, we are not worried about deleted data being
> revived we don’t have such use case. We really need a non-defaul option
> to never add back deleted data on rejoining nodes.
>
> this functionality will ultimately be a deciding factor on whether we can
> continue with Cassandra.
>
>
> Thanks again,
>


Need help on dealing with Cassandra robustness and zombie data

2019-07-01 Thread yuping wang
Hi all,

  Sorry for the interruption. But I need help.


   Due to specific reasons of our use case, we have gc grace on the order of
10 minutes instead of the default 10 days. Since we have a large number of nodes
in our Cassandra fleet, not surprisingly, we occasionally see node status
going from up to down and up again. The problem is that when the down node rejoins
the cluster after 15 minutes, it automatically adds already-deleted data back,
causing zombie data.
Our questions:
1. Is there a way to not allow a down node to rejoin the cluster?
2. Or is there a way to configure a rejoining node to not add stale data back,
regardless of how long the node was down before rejoining?
3. Or is there a way to auto clean up the data when rejoining?
We know adding that data back is a conservative approach to avoid data loss,
but in our specific case we are not worried about deleted data being
revived; we don't have such a use case. We really need a non-default option to
never add back deleted data on rejoining nodes.
This functionality will ultimately be a deciding factor in whether we can
continue with Cassandra.

Thanks again,

Re: Ansible scripts for Cassandra to help with automation needs

2019-02-14 Thread Abdul Patel
One idea would be a rolling restart of the complete cluster; that script would
be a huge help.
I also just read a blog post saying that The Last Pickle group has come up with
a tool called 'cstart' or something similar which can help with rolling restarts.
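
A minimal shell sketch of such a rolling restart (the host list, SSH access
and the service name are assumptions; an Ansible playbook would wrap the same
steps):

    #!/usr/bin/env bash
    # restart one node at a time, waiting for it to come back before moving on
    while read -r host; do
        echo "=== $host"
        ssh "$host" "nodetool drain && sudo systemctl restart cassandra"
        until ssh "$host" "nodetool info 2>/dev/null" | grep -q "Native Transport active.*true"; do
            sleep 10   # node still restarting (or JMX not up yet)
        done
    done < cassandra_hosts.txt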


On Thursday, February 14, 2019, Jeff Jirsa  wrote:

>
>
>
> On Feb 13, 2019, at 9:51 PM, Kenneth Brotman 
> wrote:
>
> I want to generate a variety of Ansible scripts to share with the Apache
> Cassandra community.  I’ll put them in a Github repository.  Just email me
> offline what scripts would help the most.
>
>
>
> Does this exist already?  I can’t find it.  Let me know if it does.
>
>
> Not aware of any repo that does this, but it’s a good idea
>
>
>
> If not, let’s put it together for the community.  Maybe we’ll end up with
> a download right on the Apache Cassandra web site or packaged with future
> releases of Cassandra.
>
>
>
> Kenneth Brotman
>
>
>
> P.S.  Terraform is next!
>
>


Re: Ansible scripts for Cassandra to help with automation needs

2019-02-13 Thread Jeff Jirsa



> On Feb 13, 2019, at 9:51 PM, Kenneth Brotman  
> wrote:
> 
> I want to generate a variety of Ansible scripts to share with the Apache 
> Cassandra community.  I’ll put them in a Github repository.  Just email me 
> offline what scripts would help the most.
>  
> Does this exist already?  I can’t find it.  Let me know if it does.

Not aware of any repo that does this, but it’s a good idea

>  
> If not, let’s put it together for the community.  Maybe we’ll end up with a 
> download right on the Apache Cassandra web site or packaged with future 
> releases of Cassandra.
>  
> Kenneth Brotman
>  
> P.S.  Terraform is next!


Ansible scripts for Cassandra to help with automation needs

2019-02-13 Thread Kenneth Brotman
I want to generate a variety of Ansible scripts to share with the Apache
Cassandra community.  I'll put them in a Github repository.  Just email me
offline what scripts would help the most. 

 

Does this exist already?  I can't find it.  Let me know if it does.

 

If not, let's put it together for the community.  Maybe we'll end up with a
download right on the Apache Cassandra web site or packaged with future
releases of Cassandra.

 

Kenneth Brotman

 

P.S.  Terraform is next!



RE: Help with sudden spike in read requests

2019-02-01 Thread Kenneth Brotman
If it’s a legacy write table why does it write 10% of the time?  Maybe it’s the 
design of the big legacy table you mentioned.  It could be so many things.  

 

Is it the same time of day? 

Same days of the week or month?  

Are there analytics run at that time?  

What are you using for monitoring and how did you find out it was happening?  

Is this a DSE cluster or OSS Cassandra cluster?

 

Kenneth Brotman

 

From: Subroto Barua [mailto:sbarua...@yahoo.com.INVALID] 
Sent: Friday, February 01, 2019 10:48 AM
To: user@cassandra.apache.org
Subject: Re: Help with sudden spike in read requests

 

We migrated one of the application from on-Prem to aws; the queries are very 
light, more like registration info;

 

Queries from the new app is via pk of data type, “text”, no cc (this table has 
about 200 rows; however the legacy table (more like reference table) has 
several million rows, about 800 sstables per node, using lcs (9:1, read-write 
ratio)

Subroto 


On Feb 1, 2019, at 10:33 AM, Kenneth Brotman  
wrote:

Do you have that many queries?  You could just review them and your data model 
to see if there was an error of some kind.  How long has it been happening?  
What changed since it started happening?

 

Kenneth Brotman

 

From: Subroto Barua [mailto:sbarua...@yahoo.com.INVALID] 
Sent: Friday, February 01, 2019 10:13 AM
To: user@cassandra.apache.org
Subject: Re: Help with sudden spike in read requests

 

Vnode is 256

C*: 3.0.15 on m4.4xlarge gp2 vol

 

There are 2 more DCs on bare metal (raid 10 and older machines) attached to 
this cluster and we have not seen this behavior on on-prem servers 

 

If this event is triggered by some bad query/queries, what is the best way to 
trap it?

Subroto 


On Feb 1, 2019, at 8:55 AM, Kenneth Brotman  
wrote:

If you had a query that went across the partitions and especially if you had 
vNodes set high, that would do it.

 

Kenneth Brotman

 

From: Subroto Barua [mailto:sbarua...@yahoo.com.INVALID] 
Sent: Friday, February 01, 2019 8:45 AM
> To: user@cassandra.apache.org
Subject: Help with sudden spike in read requests

 

In our production cluster, we observed sudden spike (over 160 MB/s) in read 
requests on *all* Cassandra nodes for a very short period (less than a min); 
this event happens few times a day.

 

I am not able to get to the bottom of this issue, nothing interesting in 
system.log or from app level; repair was not running

 

Does anyone have any thoughts on what could have triggered this event? Under 
what condition C* (if it is tied to c*) will trigger this type of event?

 

Thanks!

 

Subroto



Re: Help with sudden spike in read requests

2019-02-01 Thread Subroto Barua
We migrated one of the applications from on-prem to AWS; the queries are very
light, more like registration info.

Queries from the new app are via a partition key of data type "text", with no
clustering columns (this table has about 200 rows); however, the legacy table
(more like a reference table) has several million rows, about 800 sstables per
node, using LCS (9:1 read-write ratio).

Subroto 

> On Feb 1, 2019, at 10:33 AM, Kenneth Brotman  
> wrote:
> 
> Do you have that many queries?  You could just review them and your data 
> model to see if there was an error of some kind.  How long has it been 
> happening?  What changed since it started happening?
>  
> Kenneth Brotman
>  
> From: Subroto Barua [mailto:sbarua...@yahoo.com.INVALID] 
> Sent: Friday, February 01, 2019 10:13 AM
> To: user@cassandra.apache.org
> Subject: Re: Help with sudden spike in read requests
>  
> Vnode is 256
> C*: 3.0.15 on m4.4xlarge gp2 vol
>  
> There are 2 more DCs on bare metal (raid 10 and older machines) attached to 
> this cluster and we have not seen this behavior on on-prem servers 
>  
> If this event is triggered by some bad query/queries, what is the best way to 
> trap it?
> 
> Subroto 
> 
> On Feb 1, 2019, at 8:55 AM, Kenneth Brotman  
> wrote:
> 
> If you had a query that went across the partitions and especially if you had 
> vNodes set high, that would do it.
>  
> Kenneth Brotman
>  
> From: Subroto Barua [mailto:sbarua...@yahoo.com.INVALID] 
> Sent: Friday, February 01, 2019 8:45 AM
> To: User cassandra.apache.org
> Subject: Help with sudden spike in read requests
>  
> In our production cluster, we observed sudden spike (over 160 MB/s) in read 
> requests on *all* Cassandra nodes for a very short period (less than a min); 
> this event happens few times a day.
>  
> I am not able to get to the bottom of this issue, nothing interesting in 
> system.log or from app level; repair was not running
>  
> Does anyone have any thoughts on what could have triggered this event? Under 
> what condition C* (if it is tied to c*) will trigger this type of event?
>  
> Thanks!
>  
> Subroto


RE: Help with sudden spike in read requests

2019-02-01 Thread Kenneth Brotman
Do you have that many queries?  You could just review them and your data model 
to see if there was an error of some kind.  How long has it been happening?  
What changed since it started happening?
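
One low-effort way to review what is actually hitting the cluster is to sample
a small fraction of requests into the built-in trace tables for a while. A
minimal sketch, assuming the standard tools; the 0.001 sample rate is a
placeholder, and tracing itself adds a little load:

    # sample roughly 0.1% of requests on this node
    nodetool settraceprobability 0.001

    -- after the next spike, look for long or unusual sessions in cqlsh
    SELECT session_id, started_at, duration, parameters
      FROM system_traces.sessions LIMIT 50;

    # turn sampling back off when done
    nodetool settraceprobability 0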

 

Kenneth Brotman

 

From: Subroto Barua [mailto:sbarua...@yahoo.com.INVALID] 
Sent: Friday, February 01, 2019 10:13 AM
To: user@cassandra.apache.org
Subject: Re: Help with sudden spike in read requests

 

Vnode is 256

C*: 3.0.15 on m4.4xlarge gp2 vol

 

There are 2 more DCs on bare metal (raid 10 and older machines) attached to 
this cluster and we have not seen this behavior on on-prem servers 

 

If this event is triggered by some bad query/queries, what is the best way to 
trap it?

Subroto 


On Feb 1, 2019, at 8:55 AM, Kenneth Brotman  
wrote:

If you had a query that went across the partitions and especially if you had 
vNodes set high, that would do it.

 

Kenneth Brotman

 

From: Subroto Barua [mailto:sbarua...@yahoo.com.INVALID] 
Sent: Friday, February 01, 2019 8:45 AM
To: User cassandra.apache.org 
Subject: Help with sudden spike in read requests

 

In our production cluster, we observed sudden spike (over 160 MB/s) in read 
requests on *all* Cassandra nodes for a very short period (less than a min); 
this event happens few times a day.

 

I am not able to get to the bottom of this issue, nothing interesting in 
system.log or from app level; repair was not running

 

Does anyone have any thoughts on what could have triggered this event? Under 
what condition C* (if it is tied to c*) will trigger this type of event?

 

Thanks!

 

Subroto



Re: Help with sudden spike in read requests

2019-02-01 Thread Subroto Barua
num_tokens (vnodes) is set to 256
C*: 3.0.15 on m4.4xlarge with gp2 volumes

There are 2 more DCs on bare metal (RAID 10 and older machines) attached to 
this cluster, and we have not seen this behavior on the on-prem servers.

If this event is triggered by some bad query/queries, what is the best way to 
trap it?

Subroto 

> On Feb 1, 2019, at 8:55 AM, Kenneth Brotman  
> wrote:
> 
> If you had a query that went across the partitions and especially if you had 
> vNodes set high, that would do it.
>  
> Kenneth Brotman
>  
> From: Subroto Barua [mailto:sbarua...@yahoo.com.INVALID] 
> Sent: Friday, February 01, 2019 8:45 AM
> To: User cassandra.apache.org
> Subject: Help with sudden spike in read requests
>  
> In our production cluster, we observed sudden spike (over 160 MB/s) in read 
> requests on *all* Cassandra nodes for a very short period (less than a min); 
> this event happens few times a day.
>  
> I am not able to get to the bottom of this issue, nothing interesting in 
> system.log or from app level; repair was not running
>  
> Does anyone have any thoughts on what could have triggered this event? Under 
> what condition C* (if it is tied to c*) will trigger this type of event?
>  
> Thanks!
>  
> Subroto


RE: Help with sudden spike in read requests

2019-02-01 Thread Kenneth Brotman
If you had a query that went across the partitions and especially if you had 
vNodes set high, that would do it.
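
To make that concrete (a sketch with made-up keyspace/table names), the
difference is between a read that is pinned to one partition and one that has
to visit token ranges on every node -- and with num_tokens at 256 per node,
that is a very large number of ranges:

    -- pinned to a single partition; only its replicas are touched
    SELECT * FROM ks.ref_table WHERE pk = 'some-key';

    -- unrestricted scan: the coordinator walks every token range in the cluster
    SELECT * FROM ks.ref_table;

    -- secondary-index / ALLOW FILTERING reads can fan out the same way
    SELECT * FROM ks.ref_table WHERE status = 0 ALLOW FILTERING;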

 

Kenneth Brotman

 

From: Subroto Barua [mailto:sbarua...@yahoo.com.INVALID] 
Sent: Friday, February 01, 2019 8:45 AM
To: User cassandra.apache.org
Subject: Help with sudden spike in read requests

 

In our production cluster, we observed sudden spike (over 160 MB/s) in read 
requests on *all* Cassandra nodes for a very short period (less than a min); 
this event happens few times a day.

 

I am not able to get to the bottom of this issue, nothing interesting in 
system.log or from app level; repair was not running

 

Does anyone have any thoughts on what could have triggered this event? Under 
what condition C* (if it is tied to c*) will trigger this type of event?

 

Thanks!

 

Subroto



Help with sudden spike in read requests

2019-02-01 Thread Subroto Barua
In our production cluster, we observed a sudden spike (over 160 MB/s) in read 
requests on *all* Cassandra nodes for a very short period (less than a minute); 
this event happens a few times a day.
I am not able to get to the bottom of this issue; there is nothing interesting 
in system.log or at the app level, and repair was not running.
Does anyone have any thoughts on what could have triggered this event? Under 
what conditions will C* (if it is tied to C*) trigger this type of event?
Thanks!
Subroto

Re: Help in understanding strange cassandra CPU usage

2018-12-09 Thread Michael Shuler
On 12/9/18 4:09 AM, Devaki, Srinivas wrote:
> 
> Cassandra Version: 2.2.4

There have been over 300 bug fixes and improvements in the nearly 3
years between 2.2.4 and the latest 2.2.13 release. Somewhere in there
was a GC logging addition as I scanned the changes, which could help
with troubleshooting / tuning. I think that testing the current 2.2
release may also be prudent to rule out some issue that has already been
found & fixed.

https://github.com/apache/cassandra/blob/cassandra-2.2.13/CHANGES.txt#L1-L352
https://github.com/apache/cassandra/blob/cassandra-2.2.13/NEWS.txt#L1-L140

-- 
Kind regards,
Michael

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Help in understanding strange cassandra CPU usage

2018-12-09 Thread Jeff Jirsa
Sounds like over time you’re ending up doing something odd - maybe you’re 
leaking CQL connections or something, and it gets more and more expensive to 
manage them until you invoke the breaker, then it drops.

Will probably take someone going through a heap dump to really understand 
what’s going on, which is unfortunate because it’s a fair amount of effort. 
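
If it does come to that, a sketch of the two obvious starting points (9042
assumes the default native transport port; the dump path and process match are
placeholders):

    # count established client connections to the native protocol port
    ss -tn state established '( sport = :9042 )' | tail -n +2 | wc -l

    # capture a heap dump from the Cassandra JVM for offline analysis
    jmap -dump:live,format=b,file=/tmp/cassandra-heap.hprof $(pgrep -f CassandraDaemon)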

-- 
Jeff Jirsa


> On Dec 9, 2018, at 2:09 AM, Devaki, Srinivas  wrote:
> 
> Hi Guys,
> 
> Since the start of our org, cassandra used to be a SPOF, due to recent 
> priorities we changed our code base so that cassandra won't be SPOF anymore, 
> and during that process we made a kill switch within the code(PHP), this kill 
> switch would ensure that no connection is made to the cassandra for any 
> queries.
> 
> During the testing phase of kill switch we have identified a strange 
> behaviour that CPU and Load Average would go down from 400%(cpu), 14-20(load 
> on a 16 core machine) to 20%(cpu), 2-3(load)
> 
> and even if the kill switch is activated only for 30 secs, then cpu would go 
> down from 400 to 20, and maintain at 20% for atleast 24 hrs before it starts 
> to increase back to 400 and stay consistent from then. and this is for all 
> the nodes but not just a few.
> 
> Details:
> Cassandra Version: 2.2.4
> Number of Nodes: 8
> AWS Instance Type: c4.4xlarge
> Number of Open Files: 30k to 50k (depending on number of auto scaled php 
> nodes)
> 
> Would be grateful for any explanation regarding this strange behaviour
> 
> Thanks & Regards
> Srinivas Devaki
> SRE/SDE at Zomato
> 
> 

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Help in understanding strange cassandra CPU usage

2018-12-09 Thread Devaki, Srinivas
Hi Guys,

Since the start of our org, Cassandra has been a SPOF. Due to recent
priorities we changed our code base so that Cassandra won't be a SPOF
anymore, and during that process we added a kill switch in the (PHP) code;
this kill switch ensures that no connection is made to Cassandra for any
queries.

During the testing phase of the kill switch we identified a strange
behaviour: CPU and load average would go down from 400% CPU and a load of
14-20 (on a 16-core machine) to 20% CPU and a load of 2-3.

Even if the kill switch is activated for only 30 seconds, the CPU drops
from 400% to 20% and stays at around 20% for at least 24 hours before it
starts to climb back to 400% and stays consistent from then on. And this
happens on all the nodes, not just a few.

Details:
Cassandra Version: 2.2.4
Number of Nodes: 8
AWS Instance Type: c4.4xlarge
Number of Open Files: 30k to 50k (depending on number of auto scaled php
nodes)

Would be grateful for any explanation regarding this strange behaviour

Thanks & Regards
Srinivas Devaki
SRE/SDE at Zomato


Re: Cassandra HEAP Suggestion.. Need a help

2018-05-24 Thread Elliott Sims
JVM GC tuning can be pretty complex, but the simplest solution to OOM is
probably switching to G1GC and feeding it a rather large heap.
Theoretically a smaller heap and carefully-tuned CMS collector is more
efficient, but CMS is kind of fragile and tuning it is more of a black art,
where you can generally get into a state of "good enough" with G1 and a
bigger heap as long as there's physically enough RAM.

If you're on 2.x I'd strongly advise updating to 3 (probably 3.11.x), as
there were some pretty significant improvements in memory allocation.  3.11
also lets you move some things off-heap.
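
As a rough illustration only (the heap size is a placeholder and should be
sized to the machine), the G1 route is just a few settings -- in 3.11 they live
in jvm.options, in older versions the equivalent flags go into JVM_OPTS in
cassandra-env.sh:

    # jvm.options (3.11): comment out the CMS section, then enable G1
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=500
    # fixed heap, e.g. 16G on a 64G box; leave the rest for page cache
    -Xms16G
    -Xmx16G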

On Thu, May 10, 2018, 10:23 PM Jeff Jirsa  wrote:

> There's no single right answer. It depends a lot on the read/write
> patterns and other settings (onheap memtable, offheap memtable, etc).
>
> One thing that's probably always true, if you're using ParNew/CMS, 16G
> heap is a bit large, but may be appropriate for some read heavy workloads,
> but you'd want to make sure you start CMS earlier than default (set CMS
> initiating occupancy lower than default). May find it easier to do
> something like 12/3 or 12/4, and leave the remaining RAM for page cache.
>
> CASSANDRA-8150 has a bunch of notes for tuning GC configs (
> https://issues.apache.org/jira/browse/CASSANDRA-8150 ), and Amy's 2.1
> tuning guide is pretty solid too (
> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html )
>
>
>
>
>
> On Fri, May 11, 2018 at 10:30 AM, Mokkapati, Bhargav (Nokia - IN/Chennai)
>  wrote:
>
>> Hi Team,
>>
>>
>>
>> I have 64GB of total system memory. 5 node cluster.
>>
>>
>>
>> x ~# free -m
>>
>>   totalusedfree  shared  buff/cache
>> available
>>
>> Mem:  64266   17549   41592  665124
>> 46151
>>
>> Swap: 0   0   0
>>
>> x ~#
>>
>>
>>
>> and “egrep -c 'processor([[:space:]]+):.*' /proc/cpuinfo” giving 12 cpu
>> cores.
>>
>>
>>
>> Currently Cassandra-env.sh calculating MAX_HEAP_SIZE as ‘8GB’ and
>> HEAP_NEWSIZE as ‘1200 MB’
>>
>>
>>
>> I am facing Java insufficient memory issue and Cassandra service is
>> getting down.
>>
>>
>>
>> I going to hard code the HEAP values in Cassandra-env.sh as below.
>>
>>
>>
>> MAX_HEAP_SIZE="16G"  (1/4 of total RAM)
>>
>> HEAP_NEWSIZE="4G" (1/4 of MAX_HEAP_SIZE)
>>
>>
>>
>> Is these values correct for my setup in production? Is there any
>> disadvantages doing this?
>>
>>
>>
>> Please let me know if any of you people faced the same issue.
>>
>>
>>
>> Thanks in advance!
>>
>>
>>
>> Best regards,
>>
>> Bhargav M
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>


Re: Cassandra HEAP Suggestion.. Need a help

2018-05-10 Thread Jeff Jirsa
There's no single right answer. It depends a lot on the read/write patterns
and other settings (onheap memtable, offheap memtable, etc).

One thing that's probably always true, if you're using ParNew/CMS, 16G heap
is a bit large, but may be appropriate for some read heavy workloads, but
you'd want to make sure you start CMS earlier than default (set CMS
initiating occupancy lower than default). May find it easier to do
something like 12/3 or 12/4, and leave the remaining RAM for page cache.

CASSANDRA-8150 has a bunch of notes for tuning GC configs (
https://issues.apache.org/jira/browse/CASSANDRA-8150 ), and Amy's 2.1
tuning guide is pretty solid too (
https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html )
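
In cassandra-env.sh terms, the 12/3 shape with an earlier CMS start could look
roughly like the sketch below; the occupancy fraction is a placeholder to tune,
just lower than whatever the stock file ships with:

    # cassandra-env.sh
    MAX_HEAP_SIZE="12G"
    HEAP_NEWSIZE="3G"
    # start CMS cycles earlier so concurrent mode failures are less likely
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=60"
    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"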





On Fri, May 11, 2018 at 10:30 AM, Mokkapati, Bhargav (Nokia - IN/Chennai) <
bhargav.mokkap...@nokia.com> wrote:

> Hi Team,
>
>
>
> I have 64GB of total system memory. 5 node cluster.
>
>
>
> x ~# free -m
>
>   totalusedfree  shared  buff/cache
> available
>
> Mem:  64266   17549   41592  665124
> 46151
>
> Swap: 0   0   0
>
> x ~#
>
>
>
> and “egrep -c 'processor([[:space:]]+):.*' /proc/cpuinfo” giving 12 cpu
> cores.
>
>
>
> Currently Cassandra-env.sh calculating MAX_HEAP_SIZE as ‘8GB’ and
> HEAP_NEWSIZE as ‘1200 MB’
>
>
>
> I am facing Java insufficient memory issue and Cassandra service is
> getting down.
>
>
>
> I going to hard code the HEAP values in Cassandra-env.sh as below.
>
>
>
> MAX_HEAP_SIZE="16G"  (1/4 of total RAM)
>
> HEAP_NEWSIZE="4G" (1/4 of MAX_HEAP_SIZE)
>
>
>
> Is these values correct for my setup in production? Is there any
> disadvantages doing this?
>
>
>
> Please let me know if any of you people faced the same issue.
>
>
>
> Thanks in advance!
>
>
>
> Best regards,
>
> Bhargav M
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>


Cassandra HEAP Suggestion.. Need a help

2018-05-10 Thread Mokkapati, Bhargav (Nokia - IN/Chennai)
Hi Team,

I have 64GB of total system memory. 5 node cluster.

x ~# free -m
              total        used        free      shared  buff/cache   available
Mem:          64266       17549       41592          66        5124       46151
Swap:             0           0           0
x ~#

and "egrep -c 'processor([[:space:]]+):.*' /proc/cpuinfo" giving 12 cpu cores.

Currently cassandra-env.sh calculates MAX_HEAP_SIZE as '8GB' and HEAP_NEWSIZE 
as '1200 MB'.

I am facing a Java insufficient-memory issue and the Cassandra service is 
going down.

I am going to hard-code the HEAP values in cassandra-env.sh as below.

MAX_HEAP_SIZE="16G"  (1/4 of total RAM)
HEAP_NEWSIZE="4G" (1/4 of MAX_HEAP_SIZE)

Are these values correct for my setup in production? Are there any 
disadvantages to doing this?

Please let me know if any of you people faced the same issue.

Thanks in advance!

Best regards,
Bhargav M










Re: Help needed to enbale Client-to-node encryption(SSL)

2018-02-19 Thread Alain RODRIGUEZ
>
>  (2.0 is getting pretty old and isn't supported, you may want to consider
> upgrading; 2.1 would be the smallest change and least risk, but it, too, is
> near end of life)


I would upgrade as well. Yet I think moving from Cassandra 2.0 to Cassandra
2.2 directly is doable smoothly and preferable (it still deserves to be tested
for each environment). Nowadays it might be a better move given that, as
Jeff said, Cassandra 2.1 support is already very limited and will be stopped
sometime soon.

Here is the difference:


>- Apache Cassandra 2.2 is supported until *4.0 release (date TBD)*.
>  The latest release is 2.2.12 (pgp, md5 and sha1), released on 2018-02-16.
>- Apache Cassandra 2.1 is supported until *4.0 release (date TBD)* with
>  *critical fixes only*. The latest release is 2.1.20 (pgp, md5 and sha1),
>  released on 2018-02-16.
>
>
It's not a huge difference, but we have been doing upgrades from 2.0 and
2.1 to 2.2 directly, if I remember correctly. I would say you have the
choice to skip a major version this time, and I wanted to share in case it
might be worth it for you.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-02-16 16:34 GMT+00:00 Jeff Jirsa :

> http://thelastpickle.com/blog/2015/09/30/hardening-
> cassandra-step-by-step-part-1-server-to-server.html
>
> https://www.youtube.com/watch?v=CKt0XVPogf4
>
> (2.0 is getting pretty old and isn't supported, you may want to consider
> upgrading; 2.1 would be the smallest change and least risk, but it, too, is
> near end of life)
>
>
>
> On Fri, Feb 16, 2018 at 8:05 AM, Prachi Rath 
> wrote:
>
>> Hi,
>>
>> I am using cassandra version  2.0 . My goal is to do cassandra client to
>> node security using SSL with my self-signed CA.
>>
>> What would be the recommended procedure for enabling SSL on  cassandra
>> version 2.0.17 .
>>
>> Thanks,
>> Prachi
>>
>
>


Re: Help needed to enbale Client-to-node encryption(SSL)

2018-02-16 Thread Jeff Jirsa
http://thelastpickle.com/blog/2015/09/30/hardening-cassandra-step-by-step-part-1-server-to-server.html

https://www.youtube.com/watch?v=CKt0XVPogf4

(2.0 is getting pretty old and isn't supported, you may want to consider
upgrading; 2.1 would be the smallest change and least risk, but it, too, is
near end of life)
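
For the client-to-node piece specifically, the work boils down to a keystore on
each node plus the client_encryption_options block in cassandra.yaml. A minimal
sketch -- paths, passwords and the alias are placeholders, and the option names
should be double-checked against the docs for your exact 2.0.x version:

    # create a keystore per node; sign the certificate with your CA as the
    # blog post above describes
    keytool -genkeypair -keyalg RSA -alias node1 -validity 365 \
        -keystore /etc/cassandra/conf/server-keystore.jks

    # cassandra.yaml
    client_encryption_options:
        enabled: true
        keystore: /etc/cassandra/conf/server-keystore.jks
        keystore_password: changeit
        require_client_auth: false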



On Fri, Feb 16, 2018 at 8:05 AM, Prachi Rath  wrote:

> Hi,
>
> I am using cassandra version  2.0 . My goal is to do cassandra client to
> node security using SSL with my self-signed CA.
>
> What would be the recommended procedure for enabling SSL on  cassandra
> version 2.0.17 .
>
> Thanks,
> Prachi
>


Help needed to enbale Client-to-node encryption(SSL)

2018-02-16 Thread Prachi Rath
Hi,

I am using Cassandra version 2.0. My goal is to set up Cassandra client-to-node
security using SSL with my self-signed CA.

What would be the recommended procedure for enabling SSL on Cassandra
version 2.0.17?

Thanks,
Prachi


Re: Need help with incremental repair

2017-10-30 Thread Blake Eggleston
Ah cool, I didn't realize reaper did that.

On October 30, 2017 at 1:29:26 PM, Paulo Motta (pauloricard...@gmail.com) wrote:

> This is also the case for full repairs, if I'm not mistaken. Assuming I'm not 
> missing something here, that should mean that he shouldn't need to mark 
> sstables as unrepaired? 

That's right, but he mentioned that he is using reaper which uses 
subrange repair if I'm not mistaken, which doesn't do anti-compaction. 
So in that case he should probably mark data as unrepaired when no 
longer using incremental repair. 

2017-10-31 3:52 GMT+11:00 Blake Eggleston : 
>> Once you run incremental repair, your data is permanently marked as 
>> repaired 
> 
> This is also the case for full repairs, if I'm not mistaken. I'll admit I'm 
> not as familiar with the quirks of repair in 2.2, but prior to 
> 4.0/CASSANDRA-9143, any global repair ends with an anticompaction that marks 
> sstables as repaired. Looking at the RepairRunnable class, this does seem to 
> be the case. Assuming I'm not missing something here, that should mean that 
> he shouldn't need to mark sstables as unrepaired? 

- 
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org 
For additional commands, e-mail: user-h...@cassandra.apache.org 



Re: Need help with incremental repair

2017-10-30 Thread Paulo Motta
> This is also the case for full repairs, if I'm not mistaken. Assuming I'm not 
> missing something here, that should mean that he shouldn't need to mark 
> sstables as unrepaired?

That's right, but he mentioned that he is using reaper which uses
subrange repair if I'm not mistaken, which doesn't do anti-compaction.
So in that case he should probably mark data as unrepaired when no
longer using incremental repair.

2017-10-31 3:52 GMT+11:00 Blake Eggleston :
>> Once you run incremental repair, your data is permanently marked as
>> repaired
>
> This is also the case for full repairs, if I'm not mistaken. I'll admit I'm
> not as familiar with the quirks of repair in 2.2, but prior to
> 4.0/CASSANDRA-9143, any global repair ends with an anticompaction that marks
> sstables as repaired. Looking at the RepairRunnable class, this does seem to
> be the case. Assuming I'm not missing something here, that should mean that
> he shouldn't need to mark sstables as unrepaired?

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Need help with incremental repair

2017-10-30 Thread Blake Eggleston
> Once you run incremental repair, your data is permanently marked as repaired

This is also the case for full repairs, if I'm not mistaken. I'll admit I'm not 
as familiar with the quirks of repair in 2.2, but prior to 4.0/CASSANDRA-9143, 
any global repair ends with an anticompaction that marks sstables as repaired. 
Looking at the RepairRunnable class, this does seem to be the case. Assuming 
I'm not missing something here, that should mean that he shouldn't need to mark 
sstables as unrepaired?


Re: Need help with incremental repair

2017-10-30 Thread kurt greaves
Yes mark them as unrepaired first. You can get sstablerepairedset from
source if you need (probably make sure you get the correct branch/tag).
It's just a shell script so as long as you have C* installed in a
default/canonical location it should work.
https://github.com/apache/cassandra/blob/trunk/tools/bin/sstablerepairedset​
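
Roughly, per node, the procedure could look like the sketch below (paths and
keyspace names are placeholders; the node should be stopped while the sstable
metadata is rewritten):

    # with cassandra stopped on this node
    find /var/lib/cassandra/data/my_keyspace -name '*-Data.db' > /tmp/sstables.txt
    sstablerepairedset --really-set --is-unrepaired -f /tmp/sstables.txt

    # verify afterwards: "Repaired at" should read 0 again
    sstablemetadata /var/lib/cassandra/data/my_keyspace/my_table-*/*-Data.db | grep 'Repaired at'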


Re: Need help with incremental repair

2017-10-29 Thread Aiman Parvaiz
Thanks Blake and Paulo for the response.

Yes, the idea is to go back to non-incremental repairs. I am waiting for all 
the "anticompaction after repair" activities to complete and, in my 
understanding (thanks to Blake for the explanation), I can then run a full repair 
on that KS and get back to my non-incremental repair regimen.


I assume that I should mark the SSTables as unrepaired first and then run a full 
repair?

Also, although I am installing Cassandra from the dsc22 package on my CentOS 7, I 
couldn't find the sstable tools installed; I need to figure that out too.


From: Paulo Motta <pauloricard...@gmail.com>
Sent: Sunday, October 29, 2017 1:56:38 PM
To: user@cassandra.apache.org
Subject: Re: Need help with incremental repair

> Assuming the situation is just "we accidentally ran incremental repair", you 
> shouldn't have to do anything. It's not going to hurt anything

Once you run incremental repair, your data is permanently marked as
repaired, and is no longer compacted with new non-incrementally
repaired data. This can cause read fragmentation and prevent deleted
data from being purged. If you ever run incremental repair and want to
switch to non-incremental repair, you should manually mark your
repaired SSTables as not-repaired with the sstablerepairedset tool.

2017-10-29 3:05 GMT+11:00 Blake Eggleston <beggles...@apple.com>:
> Hey Aiman,
>
> Assuming the situation is just "we accidentally ran incremental repair", you
> shouldn't have to do anything. It's not going to hurt anything. Pre-4.0
> incremental repair has some issues that can cause a lot of extra streaming,
> and inconsistencies in some edge cases, but as long as you're running full
> repairs before gc grace expires, everything should be ok.
>
> Thanks,
>
> Blake
>
>
> On October 28, 2017 at 1:28:42 AM, Aiman Parvaiz (ai...@steelhouse.com)
> wrote:
>
> Hi everyone,
>
> We seek your help in a issue we are facing in our 2.2.8 version.
>
> We have 24 nodes cluster spread over 3 DCs.
>
> Initially, when the cluster was in a single DC we were using The Last Pickle
> reaper 0.5 to repair it with incremental repair set to false. We added 2
> more DCs. Now the problem is that accidentally on one of the newer DCs we
> ran nodetool repair  without realizing that for 2.2 the default
> option is incremental.
>
> I am not seeing any errors in the logs till now but wanted to know what
> would be the best way to handle this situation. To make things a little more
> complicated, the node on which we triggered this repair is almost out of
> disk and we had to restart C* on it.
>
> I can see a bunch of "anticompaction after repair" under Opscenter Activites
> across various nodes in the 3 DCs.
>
>
> Any help, suggestion would be appreciated.
>
> Thanks
>
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Need help with incremental repair

2017-10-29 Thread Paulo Motta
> Assuming the situation is just "we accidentally ran incremental repair", you 
> shouldn't have to do anything. It's not going to hurt anything

Once you run incremental repair, your data is permanently marked as
repaired, and is no longer compacted with new non-incrementally
repaired data. This can cause read fragmentation and prevent deleted
data from being purged. If you ever run incremental repair and want to
switch to non-incremental repair, you should manually mark your
repaired SSTables as not-repaired with the sstablerepairedset tool.

2017-10-29 3:05 GMT+11:00 Blake Eggleston <beggles...@apple.com>:
> Hey Aiman,
>
> Assuming the situation is just "we accidentally ran incremental repair", you
> shouldn't have to do anything. It's not going to hurt anything. Pre-4.0
> incremental repair has some issues that can cause a lot of extra streaming,
> and inconsistencies in some edge cases, but as long as you're running full
> repairs before gc grace expires, everything should be ok.
>
> Thanks,
>
> Blake
>
>
> On October 28, 2017 at 1:28:42 AM, Aiman Parvaiz (ai...@steelhouse.com)
> wrote:
>
> Hi everyone,
>
> We seek your help in a issue we are facing in our 2.2.8 version.
>
> We have 24 nodes cluster spread over 3 DCs.
>
> Initially, when the cluster was in a single DC we were using The Last Pickle
> reaper 0.5 to repair it with incremental repair set to false. We added 2
> more DCs. Now the problem is that accidentally on one of the newer DCs we
> ran nodetool repair  without realizing that for 2.2 the default
> option is incremental.
>
> I am not seeing any errors in the logs till now but wanted to know what
> would be the best way to handle this situation. To make things a little more
> complicated, the node on which we triggered this repair is almost out of
> disk and we had to restart C* on it.
>
> I can see a bunch of "anticompaction after repair" under Opscenter Activites
> across various nodes in the 3 DCs.
>
>
> Any help, suggestion would be appreciated.
>
> Thanks
>
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Need help with incremental repair

2017-10-28 Thread Blake Eggleston
Hey Aiman,

Assuming the situation is just "we accidentally ran incremental repair", you 
shouldn't have to do anything. It's not going to hurt anything. Pre-4.0 
incremental repair has some issues that can cause a lot of extra streaming, and 
inconsistencies in some edge cases, but as long as you're running full repairs 
before gc grace expires, everything should be ok.
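
For completeness, on 2.2 the full (non-incremental) form has to be requested 
explicitly when repairing by hand -- a one-line sketch, keyspace name is a 
placeholder:

    # default on 2.2+ is incremental; -full restores the old behaviour
    nodetool repair -full -pr my_keyspace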

Thanks,

Blake


On October 28, 2017 at 1:28:42 AM, Aiman Parvaiz (ai...@steelhouse.com) wrote:

Hi everyone,

We seek your help in a issue we are facing in our 2.2.8 version.

We have 24 nodes cluster spread over 3 DCs.

Initially, when the cluster was in a single DC we were using The Last Pickle 
reaper 0.5 to repair it with incremental repair set to false. We added 2 more 
DCs. Now the problem is that accidentally on one of the newer DCs we ran 
nodetool repair  without realizing that for 2.2 the default option is 
incremental. 

I am not seeing any errors in the logs till now but wanted to know what would 
be the best way to handle this situation. To make things a little more 
complicated, the node on which we triggered this repair is almost out of disk 
and we had to restart C* on it.

I can see a bunch of "anticompaction after repair" under Opscenter Activites 
across various nodes in the 3 DCs.



Any help, suggestion would be appreciated.

Thanks




Need help with incremental repair

2017-10-28 Thread Aiman Parvaiz
Hi everyone,

We seek your help with an issue we are facing on our 2.2.8 version.

We have a 24-node cluster spread over 3 DCs.

Initially, when the cluster was in a single DC we were using The Last Pickle 
reaper 0.5 to repair it with incremental repair set to false. We added 2 more 
DCs. Now the problem is that accidentally on one of the newer DCs we ran 
nodetool repair  without realizing that for 2.2 the default option is 
incremental.

I am not seeing any errors in the logs till now but wanted to know what would 
be the best way to handle this situation. To make things a little more 
complicated, the node on which we triggered this repair is almost out of disk 
and we had to restart C* on it.

I can see a bunch of "anticompaction after repair" under Opscenter Activites 
across various nodes in the 3 DCs.


Any help, suggestion would be appreciated.

Thanks



How Can I get started with Using Cassandra and Netbeans- Please help

2017-09-29 Thread Lutaya Shafiq Holmes
How Can I get started with Using Cassandra and Netbeans- Please help
-- 
Lutaaya Shafiq
Web: www.ronzag.com | i...@ronzag.com
Mobile: +256702772721 | +256783564130
Twitter: @lutayashafiq
Skype: lutaya5
Blog: lutayashafiq.com
http://www.fourcornersalliancegroup.com/?a=shafiqholmes

"The most beautiful people we have known are those who have known defeat,
known suffering, known struggle, known loss and have found their way out of
the depths. These persons have an appreciation, a sensitivity and an
understanding of life that fills them with compassion, gentleness and a
deep loving concern. Beautiful people do not just happen." - *Elisabeth
Kubler-Ross*

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Help in c* Data modelling

2017-07-23 Thread @Nandan@
Hi ,
The best way is to go with a table-per-query plan and distribute the
common columns into both tables.
This will help you support the queries, and reads and writes will be fast.
The only drawback is that you have to insert the common data into both tables
at the same time, which can easily be handled on the client side.
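
One common way to do that client-side dual write is a logged batch, which
guarantees both inserts eventually apply (at the cost of extra coordinator
work). A sketch against the two-table layout from Varun's reply, with made-up
values:

    BEGIN BATCH
      INSERT INTO test.user       (account_id, pid, disp_name, status)
      VALUES (1, 100, 'Alice', 0);
      INSERT INTO test.user_index (account_id, status, disp_name, pid)
      VALUES (1, 0, 'Alice', 100);
    APPLY BATCH;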

On Mon, Jul 24, 2017 at 6:10 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> Using a different table to answer each query is the correct answer here
> assuming there's a significant amount of data.
>
> If you don't have that much data, maybe you should consider using a
> database like Postgres which gives you query flexibility instead of
> horizontal scalability.
> On Sun, Jul 23, 2017 at 1:10 PM techpyaasa . <techpya...@gmail.com> wrote:
>
>> Hi vladyu/varunbarala
>>
>> Instead of creating second table as you said can I just have one(first)
>> table below and get all rows with status=0.
>>
>> CREATE TABLE IF NOT EXISTS test.user ( account_id bigint, pid bigint, 
>> disp_name text, status int, PRIMARY KEY (account_id, pid) ) WITH CLUSTERING 
>> ORDER BY (pid ASC);
>>>
>>
>> I mean get all rows within same partition(account_id) whose status=0(say 
>> some value) using *UDF/UDA* in c* ?
>>
>>>
>>> select group_by_status from test.user;
>>
>>
>> where group_by_status is UDA/UDF
>>
>>
>> Thanks in advance
>> TechPyaasa
>>
>>
>> On Sun, Jul 23, 2017 at 10:42 PM, Vladimir Yudovin <vla...@winguzone.com>
>> wrote:
>>
>>> Hi,
>>>
>>> unfortunately ORDER BY is supported for clustering columns only...
>>>
>>> *Winguzone <https://winguzone.com?from=list> - Cloud Cassandra Hosting*
>>>
>>>
>>>  On Sun, 23 Jul 2017 12:49:36 -0400 *techpyaasa .
>>> <techpya...@gmail.com <techpya...@gmail.com>>* wrote 
>>>
>>> Hi Varun,
>>>
>>> Thanks a lot for your reply.
>>>
>>> In this case if I want to update status(status can be updated for given
>>> account_id, pid) , I need to delete existing row in 2nd table & add new
>>> one...  :( :(
>>>
>>> Its like hitting cassandra twice for 1 change.. :(
>>>
>>>
>>>
>>> On Sun, Jul 23, 2017 at 8:42 PM, Varun Barala <varunbaral...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>> You can create pseudo index table.
>>>
>>> IMO, structure can be:-
>>>
>>>
>>> CREATE TABLE IF NOT EXISTS test.user ( account_id bigint, pid bigint, 
>>> disp_name text, status int, PRIMARY KEY (account_id, pid) ) WITH CLUSTERING 
>>> ORDER BY (pid ASC);
>>> CREATE TABLE IF NOT EXISTS test.user_index ( account_id bigint, pid bigint, 
>>> disp_name text, status int, PRIMARY KEY ((account_id, status), disp_name) ) 
>>> WITH CLUSTERING ORDER BY (disp_name ASC);
>>>
>>> to support query *:-  select * from site24x7.wm_current_status where
>>> uid=1 order by dispName asc;*
>>> You can use *in condition* on last partition key *status *in table
>>> *test.user_index.*
>>>
>>>
>>> *It depends on your use case and amount of data as well. It can be
>>> optimized more...*
>>> Thanks!!
>>>
>>> On Sun, Jul 23, 2017 at 2:48 AM, techpyaasa . <techpya...@gmail.com>
>>> wrote:
>>>
>>> Hi ,
>>>
>>> We have a table like below :
>>>
>>> CREATE TABLE ks.cf ( accountId bigint, pid bigint, dispName text,
>>> status int, PRIMARY KEY (accountId, pid) ) WITH CLUSTERING ORDER BY (pid
>>> ASC);
>>>
>>>
>>>
>>> We would like to have following queries possible on the above table:
>>>
>>> select * from site24x7.wm_current_status where uid=1 and mid=1;
>>> select * from site24x7.wm_current_status where uid=1 order by dispName
>>> asc;
>>> select * from site24x7.wm_current_status where uid=1 and status=0 order
>>> by dispName asc;
>>>
>>> I know first query is possible by default , but I want the last 2
>>> queries also to work.
>>>
>>> So can some one please let me know how can I achieve the same in
>>> cassandra(c*-2.1.17). I'm ok with applying indexes etc,
>>>
>>> Thanks
>>> TechPyaasa
>>>
>>>
>>>
>>


Re: Help in c* Data modelling

2017-07-23 Thread Jonathan Haddad
Using a different table to answer each query is the correct answer here
assuming there's a significant amount of data.

If you don't have that much data, maybe you should consider using a
database like Postgres which gives you query flexibility instead of
horizontal scalability.
On Sun, Jul 23, 2017 at 1:10 PM techpyaasa .  wrote:

> Hi vladyu/varunbarala
>
> Instead of creating second table as you said can I just have one(first)
> table below and get all rows with status=0.
>
> CREATE TABLE IF NOT EXISTS test.user ( account_id bigint, pid bigint, 
> disp_name text, status int, PRIMARY KEY (account_id, pid) ) WITH CLUSTERING 
> ORDER BY (pid ASC);
>>
>
> I mean get all rows within same partition(account_id) whose status=0(say some 
> value) using *UDF/UDA* in c* ?
>
>>
>> select group_by_status from test.user;
>
>
> where group_by_status is UDA/UDF
>
>
> Thanks in advance
> TechPyaasa
>
>
> On Sun, Jul 23, 2017 at 10:42 PM, Vladimir Yudovin 
> wrote:
>
>> Hi,
>>
>> unfortunately ORDER BY is supported for clustering columns only...
>>
>> *Winguzone  - Cloud Cassandra Hosting*
>>
>>
>>  On Sun, 23 Jul 2017 12:49:36 -0400 *techpyaasa .
>> >* wrote 
>>
>> Hi Varun,
>>
>> Thanks a lot for your reply.
>>
>> In this case if I want to update status(status can be updated for given
>> account_id, pid) , I need to delete existing row in 2nd table & add new
>> one...  :( :(
>>
>> Its like hitting cassandra twice for 1 change.. :(
>>
>>
>>
>> On Sun, Jul 23, 2017 at 8:42 PM, Varun Barala 
>> wrote:
>>
>> Hi,
>> You can create pseudo index table.
>>
>> IMO, structure can be:-
>>
>>
>> CREATE TABLE IF NOT EXISTS test.user ( account_id bigint, pid bigint, 
>> disp_name text, status int, PRIMARY KEY (account_id, pid) ) WITH CLUSTERING 
>> ORDER BY (pid ASC);
>> CREATE TABLE IF NOT EXISTS test.user_index ( account_id bigint, pid bigint, 
>> disp_name text, status int, PRIMARY KEY ((account_id, status), disp_name) ) 
>> WITH CLUSTERING ORDER BY (disp_name ASC);
>>
>> to support query *:-  select * from site24x7.wm_current_status where
>> uid=1 order by dispName asc;*
>> You can use *in condition* on last partition key *status *in table
>> *test.user_index.*
>>
>>
>> *It depends on your use case and amount of data as well. It can be
>> optimized more...*
>> Thanks!!
>>
>> On Sun, Jul 23, 2017 at 2:48 AM, techpyaasa . 
>> wrote:
>>
>> Hi ,
>>
>> We have a table like below :
>>
>> CREATE TABLE ks.cf ( accountId bigint, pid bigint, dispName text, status
>> int, PRIMARY KEY (accountId, pid) ) WITH CLUSTERING ORDER BY (pid ASC);
>>
>>
>>
>> We would like to have following queries possible on the above table:
>>
>> select * from site24x7.wm_current_status where uid=1 and mid=1;
>> select * from site24x7.wm_current_status where uid=1 order by dispName
>> asc;
>> select * from site24x7.wm_current_status where uid=1 and status=0 order
>> by dispName asc;
>>
>> I know first query is possible by default , but I want the last 2 queries
>> also to work.
>>
>> So can some one please let me know how can I achieve the same in
>> cassandra(c*-2.1.17). I'm ok with applying indexes etc,
>>
>> Thanks
>> TechPyaasa
>>
>>
>>
>


Re: Help in c* Data modelling

2017-07-23 Thread techpyaasa .
Hi vladyu/varunbarala

Instead of creating a second table as you said, can I just have the one (first)
table below and get all rows with status=0?

CREATE TABLE IF NOT EXISTS test.user ( account_id bigint, pid bigint,
disp_name text, status int, PRIMARY KEY (account_id, pid) ) WITH
CLUSTERING ORDER BY (pid ASC);
>

I mean, get all rows within the same partition (account_id) whose
status=0 (say, some value) using a *UDF/UDA* in C*?

>
> select group_by_status from test.user;


where group_by_status is a UDA/UDF


Thanks in advance
TechPyaasa


On Sun, Jul 23, 2017 at 10:42 PM, Vladimir Yudovin 
wrote:

> Hi,
>
> unfortunately ORDER BY is supported for clustering columns only...
>
> *Winguzone  - Cloud Cassandra Hosting*
>
>
>  On Sun, 23 Jul 2017 12:49:36 -0400 *techpyaasa .
> >* wrote 
>
> Hi Varun,
>
> Thanks a lot for your reply.
>
> In this case if I want to update status(status can be updated for given
> account_id, pid) , I need to delete existing row in 2nd table & add new
> one...  :( :(
>
> Its like hitting cassandra twice for 1 change.. :(
>
>
>
> On Sun, Jul 23, 2017 at 8:42 PM, Varun Barala 
> wrote:
>
> Hi,
> You can create pseudo index table.
>
> IMO, structure can be:-
>
>
> CREATE TABLE IF NOT EXISTS test.user ( account_id bigint, pid bigint, 
> disp_name text, status int, PRIMARY KEY (account_id, pid) ) WITH CLUSTERING 
> ORDER BY (pid ASC);
> CREATE TABLE IF NOT EXISTS test.user_index ( account_id bigint, pid bigint, 
> disp_name text, status int, PRIMARY KEY ((account_id, status), disp_name) ) 
> WITH CLUSTERING ORDER BY (disp_name ASC);
>
> to support query *:-  select * from site24x7.wm_current_status where
> uid=1 order by dispName asc;*
> You can use *in condition* on last partition key *status *in table
> *test.user_index.*
>
>
> *It depends on your use case and amount of data as well. It can be
> optimized more...*
> Thanks!!
>
> On Sun, Jul 23, 2017 at 2:48 AM, techpyaasa . 
> wrote:
>
> Hi ,
>
> We have a table like below :
>
> CREATE TABLE ks.cf ( accountId bigint, pid bigint, dispName text, status
> int, PRIMARY KEY (accountId, pid) ) WITH CLUSTERING ORDER BY (pid ASC);
>
>
>
> We would like to have following queries possible on the above table:
>
> select * from site24x7.wm_current_status where uid=1 and mid=1;
> select * from site24x7.wm_current_status where uid=1 order by dispName asc;
> select * from site24x7.wm_current_status where uid=1 and status=0 order by
> dispName asc;
>
> I know first query is possible by default , but I want the last 2 queries
> also to work.
>
> So can some one please let me know how can I achieve the same in
> cassandra(c*-2.1.17). I'm ok with applying indexes etc,
>
> Thanks
> TechPyaasa
>
>
>


Re: Help in c* Data modelling

2017-07-23 Thread Vladimir Yudovin
Hi,



unfortunately ORDER BY is supported for clustering columns only...

 

Winguzone - Cloud Cassandra Hosting






 On Sun, 23 Jul 2017 12:49:36 -0400 techpyaasa . 
techpya...@gmail.com wrote 




Hi Varun,



Thanks a lot for your reply.



In this case if I want to update status (status can be updated for given 
account_id, pid), I need to delete the existing row in the 2nd table & add a new 
one...  :( :(



Its like hitting cassandra twice for 1 change.. :(



 





On Sun, Jul 23, 2017 at 8:42 PM, Varun Barala varunbaral...@gmail.com 
wrote:

Hi,


You can create pseudo index table.




IMO, structure can be:-





CREATE TABLE IF NOT EXISTS test.user ( account_id bigint, pid bigint, disp_name 
text, status int, PRIMARY KEY (account_id, pid) ) WITH CLUSTERING ORDER BY (pid ASC);

CREATE TABLE IF NOT EXISTS test.user_index ( account_id bigint, pid bigint, 
disp_name text, status int, PRIMARY KEY ((account_id, status), disp_name) ) 
WITH CLUSTERING ORDER BY (disp_name ASC);

to support query :-  select * from site24x7.wm_current_status where uid=1 order 
by dispName asc;


You can use in condition on last partition key status in table test.user_index.


It depends on your use case and amount of data as well. It can be optimized 
more...


Thanks!!




On Sun, Jul 23, 2017 at 2:48 AM, techpyaasa . techpya...@gmail.com 
wrote:

Hi ,



We have a table like below :



CREATE TABLE ks.cf ( accountId bigint, pid bigint, dispName text, status int, 
PRIMARY KEY (accountId, pid) ) WITH CLUSTERING ORDER BY (pid ASC);




We would like to have following queries possible on the above table:



select * from site24x7.wm_current_status where uid=1 and mid=1;

select * from site24x7.wm_current_status where uid=1 order by dispName asc;

select * from site24x7.wm_current_status where uid=1 and status=0 order by 
dispName asc;




I know first query is possible by default , but I want the last 2 queries also 
to work.



So can some one please let me know how can I achieve the same in 
cassandra(c*-2.1.17). I'm ok with applying indexes etc,



Thanks
TechPyaasa















Re: Help in c* Data modelling

2017-07-23 Thread techpyaasa .
Hi Varun,

Thanks a lot for your reply.

In this case if I want to update the status (status can be updated for a given
account_id, pid), I need to delete the existing row in the 2nd table & add a new
one...  :( :(

It's like hitting Cassandra twice for 1 change.. :(



On Sun, Jul 23, 2017 at 8:42 PM, Varun Barala 
wrote:

> Hi,
>
> You can create pseudo index table.
>
> IMO, structure can be:-
>
>
> CREATE TABLE IF NOT EXISTS test.user ( account_id bigint, pid bigint, 
> disp_name text, status int, PRIMARY KEY (account_id, pid) ) WITH CLUSTERING 
> ORDER BY (pid ASC);
> CREATE TABLE IF NOT EXISTS test.user_index ( account_id bigint, pid bigint, 
> disp_name text, status int, PRIMARY KEY ((account_id, status), disp_name) ) 
> WITH CLUSTERING ORDER BY (disp_name ASC);
>
>
> to support query *:-  select * from site24x7.wm_current_status where
> uid=1 order by dispName asc;*
> You can use *in condition* on last partition key *status *in table
>
> *test.user_index.*
>
>
>
> *It depends on your use case and amount of data as well. It can be
> optimized more...*
> Thanks!!
>
> On Sun, Jul 23, 2017 at 2:48 AM, techpyaasa . 
> wrote:
>
>> Hi ,
>>
>> We have a table like below :
>>
>> CREATE TABLE ks.cf ( accountId bigint, pid bigint, dispName text, status
>>> int, PRIMARY KEY (accountId, pid) ) WITH CLUSTERING ORDER BY (pid ASC);
>>
>>
>>
>> We would like to have following queries possible on the above table:
>>
>> select * from site24x7.wm_current_status where uid=1 and mid=1;
>> select * from site24x7.wm_current_status where uid=1 order by dispName
>> asc;
>> select * from site24x7.wm_current_status where uid=1 and status=0 order
>> by dispName asc;
>>
>> I know first query is possible by default , but I want the last 2 queries
>> also to work.
>>
>> So can some one please let me know how can I achieve the same in
>> cassandra(c*-2.1.17). I'm ok with applying indexes etc,
>>
>> Thanks
>> TechPyaasa
>>
>
>


Re: Help in c* Data modelling

2017-07-23 Thread Varun Barala
Hi,

You can create pseudo index table.

IMO, structure can be:-


CREATE TABLE IF NOT EXISTS test.user ( account_id bigint, pid bigint,
disp_name text, status int, PRIMARY KEY (account_id, pid) ) WITH
CLUSTERING ORDER BY (pid ASC);
CREATE TABLE IF NOT EXISTS test.user_index ( account_id bigint, pid
bigint, disp_name text, status int, PRIMARY KEY ((account_id, status),
disp_name) ) WITH CLUSTERING ORDER BY (disp_name ASC);


to support query *:-  select * from site24x7.wm_current_status where uid=1
order by dispName asc;*
You can use an *IN condition* on the last partition key column *status* in table
*test.user_index*.



*It depends on your use case and amount of data as well. It can be
optimized more...*
Thanks!!

On Sun, Jul 23, 2017 at 2:48 AM, techpyaasa .  wrote:

> Hi ,
>
> We have a table like below :
>
> CREATE TABLE ks.cf ( accountId bigint, pid bigint, dispName text, status
>> int, PRIMARY KEY (accountId, pid) ) WITH CLUSTERING ORDER BY (pid ASC);
>
>
>
> We would like to have following queries possible on the above table:
>
> select * from site24x7.wm_current_status where uid=1 and mid=1;
> select * from site24x7.wm_current_status where uid=1 order by dispName asc;
> select * from site24x7.wm_current_status where uid=1 and status=0 order by
> dispName asc;
>
> I know first query is possible by default , but I want the last 2 queries
> also to work.
>
> So can some one please let me know how can I achieve the same in
> cassandra(c*-2.1.17). I'm ok with applying indexes etc,
>
> Thanks
> TechPyaasa
>


Help in c* Data modelling

2017-07-22 Thread techpyaasa .
Hi ,

We have a table like below :

CREATE TABLE ks.cf ( accountId bigint, pid bigint, dispName text, status
> int, PRIMARY KEY (accountId, pid) ) WITH CLUSTERING ORDER BY (pid ASC);



We would like to have following queries possible on the above table:

select * from site24x7.wm_current_status where uid=1 and mid=1;
select * from site24x7.wm_current_status where uid=1 order by dispName asc;
select * from site24x7.wm_current_status where uid=1 and status=0 order by
dispName asc;

I know the first query is possible by default, but I want the last 2 queries
also to work.

So can someone please let me know how I can achieve the same in
Cassandra (C* 2.1.17)? I'm OK with applying indexes etc.

Thanks
TechPyaasa


Re: need help tuning dropped mutation messages

2017-07-06 Thread Subroto Barua
c* version: 3.0.11
cross_node_timeout: true
range_request_timeout_in_ms: 1
write_request_timeout_in_ms: 2000
counter_write_request_timeout_in_ms: 5000
cas_contention_timeout_in_ms: 1000

On Thursday, July 6, 2017, 11:43:44 AM PDT, Subroto Barua 
 wrote:

I am seeing these errors:
MessagingService.java: 1013 -- MUTATION messages dropped in last 5000 ms: 0 for 
internal timeout and 4 for cross node timeout
write consistency @ LOCAL_QUORUM is failing on a 3-node cluster and 18-node 
cluster..

need help tuning dropped mutation messages

2017-07-06 Thread Subroto Barua
I am seeing these errors:
MessagingService.java: 1013 -- MUTATION messages dropped in last 5000 ms: 0 for 
internal timeout and 4 for cross node timeout
Write consistency @ LOCAL_QUORUM is failing on a 3-node cluster and an 18-node 
cluster.

Re: Help with data modelling (from MySQL to Cassandra)

2017-03-27 Thread Zoltan Lorincz
Great suggestion! Thanks Avi!

On Mon, Mar 27, 2017 at 3:47 PM, Avi Kivity <a...@scylladb.com> wrote:

> You can use static columns to and just one table:
>
>
> CREATE TABLE documents (
>
> doc_id uuid,
>
> element_id uuid,
>
> description text static,
>
> doc_title text static,
>
> element_title text,
>
> PRIMARY KEY (doc_id, element_id)
>
> );
>
> The static columns are present once per unique doc_id.
>
>
>
> On 03/27/2017 01:08 PM, Zoltan Lorincz wrote:
>
> Hi Alexander,
>
> thank you for your help! I think we found the answer:
>
> CREATE TABLE documents (
> doc_id uuid,
> description text,
> title text,
> PRIMARY KEY (doc_id)
>  );
>
> CREATE TABLE nodes (
> doc_id uuid,
> element_id uuid,
> title text,
> PRIMARY KEY (doc_id, element_id)
> );
>
> We can retrieve all elements with the following query:
>  SELECT * FROM elements WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613
>  UPDATE elements SET title='Hello' WHERE 
> doc_id=131cfa55-181e-431e-7956-fe449139d613
> AND element_id=a5e41c5d-fd69-45d1-959b-2fe7a1578949;
>
> Zoltan.
>
>
> On Mon, Mar 27, 2017 at 9:47 AM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Hi Zoltan,
>>
>> you must try to avoid multi partition queries as much as possible.
>> Instead, use asynchronous queries to grab several partitions concurrently.
>> Try to send no more than  ~100 queries at the same time to avoid DDOS-ing
>> your cluster.
>> This would leave you roughly with 1000+ async queries groups to run.
>> Performance will really depend on your hardware, consistency level, load
>> balancing policy, partition fragmentation (how many updates you'll run on
>> each element over time) and the SLA you're expecting.
>>
>> If that approach doesn't meet your SLA requirements, you can try to use
>> wide partitions and group elements under buckets :
>>
>> CREATE TABLE elements (
>> doc_id long,
>> bucket long,
>> element_id long,
>> element_content text,
>> PRIMARY KEY((doc_id, bucket), element_id)
>> )
>>
>> The bucket here could be a modulus of the element_id (or of the hash of
>> element_id if it is not a numerical value). This way you can spread
>> elements over the cluster and access them directly if you have the doc_id
>> and the element_id to perform updates.
>> You'll get to run less queries concurrently but they'll take more time
>> than individual ones in the first scenario (1 partition per element). You
>> should benchmark both solutions to see which one gives best performance.
>> Bucket your elements so that your partitions don't grow over 100MB. Large
>> partitions are silent cluster killers (1GB+ partitions are a direct threat
>> to cluster stability)...
>>
>> To ensure best performance, use prepared statements along with the
>> TokenAwarePolicy
>> <http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html>
>>  to
>> avoid unnecessary coordination.
>>
>> Cheers,
>>
>>
>> On Mon, Mar 27, 2017 at 4:40 AM Zoltan Lorincz <zol...@gmail.com> wrote:
>>
>>> Querying by (doc_id and element_id ) OR just by (element_id) is fine,
>>> but the real question is, will it be efficient to query 100k+ primary keys
>>> in the elements table?
>>> e.g.
>>>
>>> SELECT * FROM elements WHERE element_id IN (element_id1, element_id2,
>>> element_id3,  element_id100K+)  ?
>>>
>>> The elements_id is a primary key.
>>>
>>> Thank you?
>>>
>>>
>>> On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec <matija0...@gmail.com>
>>> wrote:
>>>
>>> Have one table hold document metadata (doc_id, title, description, ...)
>>> and have another table elements where partition key is doc_id and
>>> clustering key is element_id.
>>> Only problem here is if you need to query and/or update element just by
>>> element_id but I don't know your queries up front.
>>>
>>> On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz <zol...@gmail.com>
>>> wrote:
>>>
>>> Dear cassandra users,
>>>
>>> We have the following structure in MySql:
>>>
>>> documents->[doc_id(primary key), title, description]
>>> elements->[element_id(primary key), doc_id(index), title, description]
>>>
>>> Notation: table name->[column1(key or index), column2, …]
>>>
>>> We

Re: Help with data modelling (from MySQL to Cassandra)

2017-03-27 Thread Avi Kivity

You can use static columns to and just one table:


CREATE TABLE documents (

doc_id uuid,

element_id uuid,

description text static,

doc_title text static,

element_title text,

PRIMARY KEY (doc_id, element_id)

);


The static columns are present once per unique doc_id.
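
A quick usage sketch (UUIDs reused from elsewhere in the thread, titles made
up) showing that the document-level fields are written once and come back on
every element row:

    -- write the static, document-level fields once (no element_id needed)
    INSERT INTO documents (doc_id, doc_title, description)
    VALUES (131cfa55-181e-431e-7956-fe449139d613, 'User guide', 'All elements of the guide');

    -- write element rows as usual
    INSERT INTO documents (doc_id, element_id, element_title)
    VALUES (131cfa55-181e-431e-7956-fe449139d613,
            a5e41c5d-fd69-45d1-959b-2fe7a1578949, 'Chapter 1');

    -- every returned row carries the same doc_title / description
    SELECT doc_id, doc_title, element_id, element_title
    FROM documents
    WHERE doc_id = 131cfa55-181e-431e-7956-fe449139d613;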


On 03/27/2017 01:08 PM, Zoltan Lorincz wrote:

Hi Alexander,

thank you for your help! I think we found the answer:

CREATE TABLE documents (
doc_id uuid,
description text,
title text,
PRIMARY KEY (doc_id)
 );

CREATE TABLE nodes (
doc_id uuid,
element_id uuid,
title text,
PRIMARY KEY (doc_id, element_id)
);

We can retrieve all elements with the following query:
 SELECT * FROM elements WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613
 UPDATE elements SET title='Hello' WHERE 
doc_id=131cfa55-181e-431e-7956-fe449139d613 AND 
element_id=a5e41c5d-fd69-45d1-959b-2fe7a1578949;


Zoltan.


On Mon, Mar 27, 2017 at 9:47 AM, Alexander Dejanovski 
<a...@thelastpickle.com <mailto:a...@thelastpickle.com>> wrote:


Hi Zoltan,

you must try to avoid multi partition queries as much as possible.
Instead, use asynchronous queries to grab several partitions
concurrently.
Try to send no more than  ~100 queries at the same time to avoid
DDOS-ing your cluster.
This would leave you roughly with 1000+ async queries groups to
run. Performance will really depend on your hardware, consistency
level, load balancing policy, partition fragmentation (how many
updates you'll run on each element over time) and the SLA you're
expecting.

If that approach doesn't meet your SLA requirements, you can try
to use wide partitions and group elements under buckets :

CREATE TABLE elements (
doc_id long,
bucket long,
element_id long,
element_content text,
PRIMARY KEY((doc_id, bucket), element_id)
)

The bucket here could be a modulus of the element_id (or of the
hash of element_id if it is not a numerical value). This way you
can spread elements over the cluster and access them directly if
you have the doc_id and the element_id to perform updates.
You'll get to run less queries concurrently but they'll take more
time than individual ones in the first scenario (1 partition per
element). You should benchmark both solutions to see which one
gives best performance.
Bucket your elements so that your partitions don't grow over
100MB. Large partitions are silent cluster killers (1GB+
partitions are a direct threat to cluster stability)...

To ensure best performance, use prepared statements along with the
TokenAwarePolicy

<http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html>
 to
avoid unnecessary coordination.

Cheers,


On Mon, Mar 27, 2017 at 4:40 AM Zoltan Lorincz <zol...@gmail.com
<mailto:zol...@gmail.com>> wrote:

Querying by (doc_id and element_id ) OR just by (element_id)
is fine, but the real question is, will it be efficient to
query 100k+ primary keys in the elements table?
e.g.

SELECT * FROM elements WHERE element_id IN (element_id1,
element_id2, element_id3,  element_id100K+)  ?

The elements_id is a primary key.

Thank you?


On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec
<matija0...@gmail.com <mailto:matija0...@gmail.com>> wrote:

Have one table hold document metadata (doc_id, title,
description, ...) and have another table elements where
partition key is doc_id and clustering key is element_id.
Only problem here is if you need to query and/or update
element just by element_id but I don't know your queries
up front.

On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz
<zol...@gmail.com <mailto:zol...@gmail.com>> wrote:

Dear cassandra users,

We have the following structure in MySql:

documents->[doc_id(primary key), title, description]
elements->[element_id(primary key), doc_id(index),
title, description]

Notation: table name->[column1(key or index), column2, …]

We want to transfer the data to Cassandra.

Each document can contain a large number of elements
(between 1 and 100k+)

We have two requirements:
a) Load all elements for a given doc_id quickly
b) Update the value of one individual element quickly


We were thinking on the following cassandra
configurations:

Option A

documents->[doc_id(primary key), title, description,
elements] (elements could be a SET or a TEXT, each
time 

Re: Help with data modelling (from MySQL to Cassandra)

2017-03-27 Thread Zoltan Lorincz
Thank you Matija! Because I am a newbie, it was not clear to me that I am
able to query by the partition key alone (without providing the clustering key),
sorry about that!
Zoltan.

On Mon, Mar 27, 2017 at 1:54 PM, Matija Gobec <matija0...@gmail.com> wrote:

> Thats exactly what I described. IN queries can be used sometimes but I
> usually run parallel async as Alexander explained.
>
> On Mon, Mar 27, 2017 at 12:08 PM, Zoltan Lorincz <zol...@gmail.com> wrote:
>
>> Hi Alexander,
>>
>> thank you for your help! I think we found the answer:
>>
>> CREATE TABLE documents (
>> doc_id uuid,
>> description text,
>> title text,
>> PRIMARY KEY (doc_id)
>>  );
>>
>> CREATE TABLE nodes (
>> doc_id uuid,
>> element_id uuid,
>> title text,
>> PRIMARY KEY (doc_id, element_id)
>> );
>>
>> We can retrieve all elements with the following query:
>>  SELECT * FROM elements WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613
>>  UPDATE elements SET title='Hello' WHERE 
>> doc_id=131cfa55-181e-431e-7956-fe449139d613
>> AND element_id=a5e41c5d-fd69-45d1-959b-2fe7a1578949;
>>
>> Zoltan.
>>
>>
>> On Mon, Mar 27, 2017 at 9:47 AM, Alexander Dejanovski <
>> a...@thelastpickle.com> wrote:
>>
>>> Hi Zoltan,
>>>
>>> you must try to avoid multi partition queries as much as possible.
>>> Instead, use asynchronous queries to grab several partitions concurrently.
>>> Try to send no more than  ~100 queries at the same time to avoid
>>> DDOS-ing your cluster.
>>> This would leave you roughly with 1000+ async queries groups to run.
>>> Performance will really depend on your hardware, consistency level, load
>>> balancing policy, partition fragmentation (how many updates you'll run on
>>> each element over time) and the SLA you're expecting.
>>>
>>> If that approach doesn't meet your SLA requirements, you can try to use
>>> wide partitions and group elements under buckets :
>>>
>>> CREATE TABLE elements (
>>> doc_id long,
>>> bucket long,
>>> element_id long,
>>> element_content text,
>>> PRIMARY KEY((doc_id, bucket), element_id)
>>> )
>>>
>>> The bucket here could be a modulus of the element_id (or of the hash of
>>> element_id if it is not a numerical value). This way you can spread
>>> elements over the cluster and access them directly if you have the doc_id
>>> and the element_id to perform updates.
>>> You'll get to run less queries concurrently but they'll take more time
>>> than individual ones in the first scenario (1 partition per element). You
>>> should benchmark both solutions to see which one gives best performance.
>>> Bucket your elements so that your partitions don't grow over 100MB.
>>> Large partitions are silent cluster killers (1GB+ partitions are a direct
>>> threat to cluster stability)...
>>>
>>> To ensure best performance, use prepared statements along with the
>>> TokenAwarePolicy
>>> <http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html>
>>>  to
>>> avoid unnecessary coordination.
>>>
>>> Cheers,
>>>
>>>
>>> On Mon, Mar 27, 2017 at 4:40 AM Zoltan Lorincz <zol...@gmail.com> wrote:
>>>
>>>> Querying by (doc_id and element_id ) OR just by (element_id) is fine,
>>>> but the real question is, will it be efficient to query 100k+ primary keys
>>>> in the elements table?
>>>> e.g.
>>>>
>>>> SELECT * FROM elements WHERE element_id IN (element_id1, element_id2,
>>>> element_id3,  element_id100K+)  ?
>>>>
>>>> The elements_id is a primary key.
>>>>
>>>> Thank you?
>>>>
>>>>
>>>> On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec <matija0...@gmail.com>
>>>> wrote:
>>>>
>>>> Have one table hold document metadata (doc_id, title, description, ...)
>>>> and have another table elements where partition key is doc_id and
>>>> clustering key is element_id.
>>>> Only problem here is if you need to query and/or update element just by
>>>> element_id but I don't know your queries up front.
>>>>
>>>> On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz <zol...@gmail.com>
>>>> wrote:
>>>>
>>>> Dear cassandra users,
>>>>
>>>> We hav

Re: Help with data modelling (from MySQL to Cassandra)

2017-03-27 Thread Matija Gobec
That's exactly what I described. IN queries can be used sometimes, but I
usually run parallel async queries as Alexander explained.

On Mon, Mar 27, 2017 at 12:08 PM, Zoltan Lorincz <zol...@gmail.com> wrote:

> Hi Alexander,
>
> thank you for your help! I think we found the answer:
>
> CREATE TABLE documents (
> doc_id uuid,
> description text,
> title text,
> PRIMARY KEY (doc_id)
>  );
>
> CREATE TABLE nodes (
> doc_id uuid,
> element_id uuid,
> title text,
> PRIMARY KEY (doc_id, element_id)
> );
>
> We can retrieve all elements with the following query:
>  SELECT * FROM elements WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613
>  UPDATE elements SET title='Hello' WHERE 
> doc_id=131cfa55-181e-431e-7956-fe449139d613
> AND element_id=a5e41c5d-fd69-45d1-959b-2fe7a1578949;
>
> Zoltan.
>
>
> On Mon, Mar 27, 2017 at 9:47 AM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Hi Zoltan,
>>
>> you must try to avoid multi partition queries as much as possible.
>> Instead, use asynchronous queries to grab several partitions concurrently.
>> Try to send no more than  ~100 queries at the same time to avoid DDOS-ing
>> your cluster.
>> This would leave you roughly with 1000+ async queries groups to run.
>> Performance will really depend on your hardware, consistency level, load
>> balancing policy, partition fragmentation (how many updates you'll run on
>> each element over time) and the SLA you're expecting.
>>
>> If that approach doesn't meet your SLA requirements, you can try to use
>> wide partitions and group elements under buckets :
>>
>> CREATE TABLE elements (
>> doc_id long,
>> bucket long,
>> element_id long,
>> element_content text,
>> PRIMARY KEY((doc_id, bucket), element_id)
>> )
>>
>> The bucket here could be a modulus of the element_id (or of the hash of
>> element_id if it is not a numerical value). This way you can spread
>> elements over the cluster and access them directly if you have the doc_id
>> and the element_id to perform updates.
>> You'll get to run less queries concurrently but they'll take more time
>> than individual ones in the first scenario (1 partition per element). You
>> should benchmark both solutions to see which one gives best performance.
>> Bucket your elements so that your partitions don't grow over 100MB. Large
>> partitions are silent cluster killers (1GB+ partitions are a direct threat
>> to cluster stability)...
>>
>> To ensure best performance, use prepared statements along with the
>> TokenAwarePolicy
>> <http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html>
>>  to
>> avoid unnecessary coordination.
>>
>> Cheers,
>>
>>
>> On Mon, Mar 27, 2017 at 4:40 AM Zoltan Lorincz <zol...@gmail.com> wrote:
>>
>>> Querying by (doc_id and element_id ) OR just by (element_id) is fine,
>>> but the real question is, will it be efficient to query 100k+ primary keys
>>> in the elements table?
>>> e.g.
>>>
>>> SELECT * FROM elements WHERE element_id IN (element_id1, element_id2,
>>> element_id3,  element_id100K+)  ?
>>>
>>> The elements_id is a primary key.
>>>
>>> Thank you?
>>>
>>>
>>> On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec <matija0...@gmail.com>
>>> wrote:
>>>
>>> Have one table hold document metadata (doc_id, title, description, ...)
>>> and have another table elements where partition key is doc_id and
>>> clustering key is element_id.
>>> Only problem here is if you need to query and/or update element just by
>>> element_id but I don't know your queries up front.
>>>
>>> On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz <zol...@gmail.com>
>>> wrote:
>>>
>>> Dear cassandra users,
>>>
>>> We have the following structure in MySql:
>>>
>>> documents->[doc_id(primary key), title, description]
>>> elements->[element_id(primary key), doc_id(index), title, description]
>>>
>>> Notation: table name->[column1(key or index), column2, …]
>>>
>>> We want to transfer the data to Cassandra.
>>>
>>> Each document can contain a large number of elements (between 1 and
>>> 100k+)
>>>
>>> We have two requirements:
>>> a) Load all elements for a given doc_id quickly
>>> b) Update the value of one individual element quickly
>>>
&g

Re: Help with data modelling (from MySQL to Cassandra)

2017-03-27 Thread Zoltan Lorincz
Hi Alexander,

thank you for your help! I think we found the answer:

CREATE TABLE documents (
doc_id uuid,
description text,
title text,
PRIMARY KEY (doc_id)
 );

CREATE TABLE elements (
doc_id uuid,
element_id uuid,
title text,
PRIMARY KEY (doc_id, element_id)
);

We can retrieve all elements of a document with the first query, and update a
single element with the second:
 SELECT * FROM elements WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613;
 UPDATE elements SET title='Hello'
   WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613
   AND element_id=a5e41c5d-fd69-45d1-959b-2fe7a1578949;

Zoltan.


On Mon, Mar 27, 2017 at 9:47 AM, Alexander Dejanovski <
a...@thelastpickle.com> wrote:

> Hi Zoltan,
>
> you must try to avoid multi partition queries as much as possible.
> Instead, use asynchronous queries to grab several partitions concurrently.
> Try to send no more than  ~100 queries at the same time to avoid DDOS-ing
> your cluster.
> This would leave you roughly with 1000+ async queries groups to run.
> Performance will really depend on your hardware, consistency level, load
> balancing policy, partition fragmentation (how many updates you'll run on
> each element over time) and the SLA you're expecting.
>
> If that approach doesn't meet your SLA requirements, you can try to use
> wide partitions and group elements under buckets :
>
> CREATE TABLE elements (
> doc_id long,
> bucket long,
> element_id long,
> element_content text,
> PRIMARY KEY((doc_id, bucket), element_id)
> )
>
> The bucket here could be a modulus of the element_id (or of the hash of
> element_id if it is not a numerical value). This way you can spread
> elements over the cluster and access them directly if you have the doc_id
> and the element_id to perform updates.
> You'll get to run less queries concurrently but they'll take more time
> than individual ones in the first scenario (1 partition per element). You
> should benchmark both solutions to see which one gives best performance.
> Bucket your elements so that your partitions don't grow over 100MB. Large
> partitions are silent cluster killers (1GB+ partitions are a direct threat
> to cluster stability)...
>
> To ensure best performance, use prepared statements along with the
> TokenAwarePolicy
> <http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html>
>  to
> avoid unnecessary coordination.
>
> Cheers,
>
>
> On Mon, Mar 27, 2017 at 4:40 AM Zoltan Lorincz <zol...@gmail.com> wrote:
>
>> Querying by (doc_id and element_id ) OR just by (element_id) is fine, but
>> the real question is, will it be efficient to query 100k+ primary keys in
>> the elements table?
>> e.g.
>>
>> SELECT * FROM elements WHERE element_id IN (element_id1, element_id2,
>> element_id3,  element_id100K+)  ?
>>
>> The elements_id is a primary key.
>>
>> Thank you?
>>
>>
>> On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec <matija0...@gmail.com>
>> wrote:
>>
>> Have one table hold document metadata (doc_id, title, description, ...)
>> and have another table elements where partition key is doc_id and
>> clustering key is element_id.
>> Only problem here is if you need to query and/or update element just by
>> element_id but I don't know your queries up front.
>>
>> On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz <zol...@gmail.com>
>> wrote:
>>
>> Dear cassandra users,
>>
>> We have the following structure in MySql:
>>
>> documents->[doc_id(primary key), title, description]
>> elements->[element_id(primary key), doc_id(index), title, description]
>>
>> Notation: table name->[column1(key or index), column2, …]
>>
>> We want to transfer the data to Cassandra.
>>
>> Each document can contain a large number of elements (between 1 and
>> 100k+)
>>
>> We have two requirements:
>> a) Load all elements for a given doc_id quickly
>> b) Update the value of one individual element quickly
>>
>>
>> We were thinking on the following cassandra configurations:
>>
>> Option A
>>
>> documents->[doc_id(primary key), title, description, elements] (elements
>> could be a SET or a TEXT, each time new elements are added (they are never
>> removed) we would append it to this column)
>> elements->[element_id(primary key), title, description]
>>
>> Loading a document:
>>
>>  a) Load document with given  and get all element ids
>> SELECT * from documents where doc_id=‘id’
>>
>>  b) Load all elements with the given ids
>> SELECT * FROM elements where element_id IN (ids loaded from query a)

Re: Help with data modelling (from MySQL to Cassandra)

2017-03-27 Thread Alexander Dejanovski
Hi Zoltan,

you must try to avoid multi partition queries as much as possible. Instead,
use asynchronous queries to grab several partitions concurrently.
Try to send no more than  ~100 queries at the same time to avoid DDOS-ing
your cluster.
This would leave you roughly with 1000+ async queries groups to run.
Performance will really depend on your hardware, consistency level, load
balancing policy, partition fragmentation (how many updates you'll run on
each element over time) and the SLA you're expecting.
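A minimal sketch of that throttled-async pattern with the DataStax Java driver 3.x
(this targets the one-partition-per-element variant; the Session, the schema and
the 100-query cap are assumptions to adapt):

import com.datastax.driver.core.*;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.Semaphore;

// Fetch many elements by primary key with at most ~100 queries in flight.
static List<Row> fetchElements(Session session, List<UUID> elementIds) {
    PreparedStatement ps = session.prepare("SELECT * FROM elements WHERE element_id = ?");
    Semaphore inFlight = new Semaphore(100);                 // cap concurrent requests
    List<ResultSetFuture> futures = new ArrayList<>();
    for (UUID id : elementIds) {
        inFlight.acquireUninterruptibly();                   // blocks once 100 are outstanding
        ResultSetFuture f = session.executeAsync(ps.bind(id));
        f.addListener(inFlight::release, MoreExecutors.directExecutor());
        futures.add(f);
    }
    List<Row> rows = new ArrayList<>();
    for (ResultSetFuture f : futures) {
        rows.addAll(f.getUninterruptibly().all());           // wait for and collect each result
    }
    return rows;
}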

If that approach doesn't meet your SLA requirements, you can try to use
wide partitions and group elements under buckets :

CREATE TABLE elements (
    doc_id bigint,
    bucket bigint,
    element_id bigint,
    element_content text,
    PRIMARY KEY ((doc_id, bucket), element_id)
);

The bucket here could be a modulus of the element_id (or of the hash of
element_id if it is not a numerical value). This way you can spread
elements over the cluster and access them directly if you have the doc_id
and the element_id to perform updates.
You'll get to run fewer queries concurrently, but they'll take more time than
the individual ones in the first scenario (one partition per element). You should
benchmark both solutions to see which one gives the best performance.
Bucket your elements so that your partitions don't grow over 100MB. Large
partitions are silent cluster killers (1GB+ partitions are a direct threat
to cluster stability)...
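As a concrete illustration of the modulus idea (a sketch only, assuming a fixed
number of buckets and numeric element ids as in the table above; any stable hash
works, e.g. hash a uuid first):

static final long NUM_BUCKETS = 16;   // fixed up front; changing it later means re-bucketing existing rows

// Derive a stable, non-negative bucket from the element id.
static long bucketFor(long elementId) {
    return Math.floorMod(elementId, NUM_BUCKETS);
}

Updates recompute the same bucket from the element_id they already have, while
loading a whole document means querying each of the NUM_BUCKETS partitions (for
example with the async pattern sketched above).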

To ensure best performance, use prepared statements along with the
TokenAwarePolicy
<http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html>
to avoid unnecessary coordination.
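For example, wiring that up with the Java driver 3.0 might look roughly like this
(contact point and keyspace are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

// Token-aware routing on top of the usual DC-aware round robin policy.
Cluster cluster = Cluster.builder()
    .addContactPoint("10.0.0.1")                                     // placeholder contact point
    .withLoadBalancingPolicy(
        new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
    .build();
Session session = cluster.connect("mykeyspace");                     // placeholder keyspace
// Prepared statements let the policy route each bound statement to a replica.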

Cheers,


On Mon, Mar 27, 2017 at 4:40 AM Zoltan Lorincz  wrote:

> Querying by (doc_id and element_id ) OR just by (element_id) is fine, but
> the real question is, will it be efficient to query 100k+ primary keys in
> the elements table?
> e.g.
>
> SELECT * FROM elements WHERE element_id IN (element_id1, element_id2,
> element_id3,  element_id100K+)  ?
>
> The elements_id is a primary key.
>
> Thank you?
>
>
> On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec 
> wrote:
>
> Have one table hold document metadata (doc_id, title, description, ...)
> and have another table elements where partition key is doc_id and
> clustering key is element_id.
> Only problem here is if you need to query and/or update element just by
> element_id but I don't know your queries up front.
>
> On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz  wrote:
>
> Dear cassandra users,
>
> We have the following structure in MySql:
>
> documents->[doc_id(primary key), title, description]
> elements->[element_id(primary key), doc_id(index), title, description]
>
> Notation: table name->[column1(key or index), column2, …]
>
> We want to transfer the data to Cassandra.
>
> Each document can contain a large number of elements (between 1 and 100k+)
>
> We have two requirements:
> a) Load all elements for a given doc_id quickly
> b) Update the value of one individual element quickly
>
>
> We were thinking on the following cassandra configurations:
>
> Option A
>
> documents->[doc_id(primary key), title, description, elements] (elements
> could be a SET or a TEXT, each time new elements are added (they are never
> removed) we would append it to this column)
> elements->[element_id(primary key), title, description]
>
> Loading a document:
>
>  a) Load document with given  and get all element ids
> SELECT * from documents where doc_id=‘id’
>
>  b) Load all elements with the given ids
> SELECT * FROM elements where element_id IN (ids loaded from query a)
>
>
> Option B
>
> documents->[doc_id(primary key), title, description]
> elements->[element_id(primary key), doc_id(secondary index), title,
> description]
>
> Loading a document:
>  a) SELECT * from elements where doc_id=‘id’
>
>
> Neither solutions doesn’t seem to be good, in Option A, even if we are
> querying by Primary keys, the second query will have 100k+ primary key id’s
> in the WHERE clause, and the second solution looks like an anti pattern in
> cassandra.
>
> Could anyone give any advice how would we create a model for our use case?
>
> Thank you in advance,
> Zoltan.
>
>
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Help with data modelling (from MySQL to Cassandra)

2017-03-26 Thread Zoltan Lorincz
Querying by (doc_id and element_id ) OR just by (element_id) is fine, but
the real question is, will it be efficient to query 100k+ primary keys in
the elements table?
e.g.

SELECT * FROM elements WHERE element_id IN (element_id1, element_id2,
element_id3,  element_id100K+)  ?

The element_id is a primary key.

Thank you!


On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec  wrote:

> Have one table hold document metadata (doc_id, title, description, ...)
> and have another table elements where partition key is doc_id and
> clustering key is element_id.
> Only problem here is if you need to query and/or update element just by
> element_id but I don't know your queries up front.
>
> On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz  wrote:
>
>> Dear cassandra users,
>>
>> We have the following structure in MySql:
>>
>> documents->[doc_id(primary key), title, description]
>> elements->[element_id(primary key), doc_id(index), title, description]
>>
>> Notation: table name->[column1(key or index), column2, …]
>>
>> We want to transfer the data to Cassandra.
>>
>> Each document can contain a large number of elements (between 1 and
>> 100k+)
>>
>> We have two requirements:
>> a) Load all elements for a given doc_id quickly
>> b) Update the value of one individual element quickly
>>
>>
>> We were thinking on the following cassandra configurations:
>>
>> Option A
>>
>> documents->[doc_id(primary key), title, description, elements] (elements
>> could be a SET or a TEXT, each time new elements are added (they are never
>> removed) we would append it to this column)
>> elements->[element_id(primary key), title, description]
>>
>> Loading a document:
>>
>>  a) Load document with given  and get all element ids
>> SELECT * from documents where doc_id=‘id’
>>
>>  b) Load all elements with the given ids
>> SELECT * FROM elements where element_id IN (ids loaded from query a)
>>
>>
>> Option B
>>
>> documents->[doc_id(primary key), title, description]
>> elements->[element_id(primary key), doc_id(secondary index), title,
>> description]
>>
>> Loading a document:
>>  a) SELECT * from elements where doc_id=‘id’
>>
>>
>> Neither solutions doesn’t seem to be good, in Option A, even if we are
>> querying by Primary keys, the second query will have 100k+ primary key id’s
>> in the WHERE clause, and the second solution looks like an anti pattern in
>> cassandra.
>>
>> Could anyone give any advice how would we create a model for our use case?
>>
>> Thank you in advance,
>> Zoltan.
>>
>>
>


Re: Help with data modelling (from MySQL to Cassandra)

2017-03-26 Thread Matija Gobec
Have one table hold document metadata (doc_id, title, description, ...) and
have another table elements where partition key is doc_id and clustering
key is element_id.
Only problem here is if you need to query and/or update element just by
element_id but I don't know your queries up front.

On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz  wrote:

> Dear cassandra users,
>
> We have the following structure in MySql:
>
> documents->[doc_id(primary key), title, description]
> elements->[element_id(primary key), doc_id(index), title, description]
>
> Notation: table name->[column1(key or index), column2, …]
>
> We want to transfer the data to Cassandra.
>
> Each document can contain a large number of elements (between 1 and 100k+)
>
> We have two requirements:
> a) Load all elements for a given doc_id quickly
> b) Update the value of one individual element quickly
>
>
> We were thinking on the following cassandra configurations:
>
> Option A
>
> documents->[doc_id(primary key), title, description, elements] (elements
> could be a SET or a TEXT, each time new elements are added (they are never
> removed) we would append it to this column)
> elements->[element_id(primary key), title, description]
>
> Loading a document:
>
>  a) Load document with given  and get all element ids
> SELECT * from documents where doc_id=‘id’
>
>  b) Load all elements with the given ids
> SELECT * FROM elements where element_id IN (ids loaded from query a)
>
>
> Option B
>
> documents->[doc_id(primary key), title, description]
> elements->[element_id(primary key), doc_id(secondary index), title,
> description]
>
> Loading a document:
>  a) SELECT * from elements where doc_id=‘id’
>
>
> Neither solutions doesn’t seem to be good, in Option A, even if we are
> querying by Primary keys, the second query will have 100k+ primary key id’s
> in the WHERE clause, and the second solution looks like an anti pattern in
> cassandra.
>
> Could anyone give any advice how would we create a model for our use case?
>
> Thank you in advance,
> Zoltan.
>
>


Help with data modelling (from MySQL to Cassandra)

2017-03-26 Thread Zoltan Lorincz
Dear cassandra users,

We have the following structure in MySql:

documents->[doc_id(primary key), title, description]
elements->[element_id(primary key), doc_id(index), title, description]

Notation: table name->[column1(key or index), column2, …]

We want to transfer the data to Cassandra.

Each document can contain a large number of elements (between 1 and 100k+)

We have two requirements:
a) Load all elements for a given doc_id quickly
b) Update the value of one individual element quickly


We were thinking of the following Cassandra configurations:

Option A

documents->[doc_id(primary key), title, description, elements] (elements
could be a SET or a TEXT, each time new elements are added (they are never
removed) we would append it to this column)
elements->[element_id(primary key), title, description]

Loading a document:

 a) Load document with the given doc_id and get all element ids
SELECT * from documents where doc_id=‘id’

 b) Load all elements with the given ids
SELECT * FROM elements where element_id IN (ids loaded from query a)


Option B

documents->[doc_id(primary key), title, description]
elements->[element_id(primary key), doc_id(secondary index), title,
description]

Loading a document:
 a) SELECT * from elements where doc_id=‘id’


Neither solutions doesn’t seem to be good, in Option A, even if we are
querying by Primary keys, the second query will have 100k+ primary key id’s
in the WHERE clause, and the second solution looks like an anti pattern in
cassandra.

Could anyone give any advice how would we create a model for our use case?

Thank you in advance,
Zoltan.


Re: HELP with bulk loading

2017-03-14 Thread Artur R
Thank you all!
It turns out that the fastest ways are https://github.com/brianmhess/cassandra-loader
and COPY FROM.

So I decided to stick with COPY FROM as it is built-in and easy to use.
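For anyone finding this thread later, a minimal COPY FROM invocation looks roughly
like this (keyspace, table, columns and file path are placeholders; HELP COPY in
cqlsh lists all the options):

cqlsh> COPY myks.mytable (id, col1, col2)
       FROM '/data/export.csv'
       WITH HEADER = true AND NUMPROCESSES = 8;

As Stefania notes below, Cythonizing the driver matters a lot for COPY throughput.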

On Fri, Mar 10, 2017 at 2:22 PM, Ahmed Eljami 
wrote:

> Hi,
>
> >3. sstableloader is slow too. Assuming that I have new empty C* cluster,
> how can I improve the upload speed? Maybe disable replication or some other
> settings while streaming and then turn it back?
>
> Maybe you can accelerate you load with the option -cph (connection per
> host): https://issues.apache.org/jira/browse/CASSANDRA-3668 and -t=1000
>
> With cph=12 and t=1000,  I went from 56min (default value) to 11min for
> table of 50Gb.
>
>
>
> 2017-03-10 2:09 GMT+01:00 Stefania Alborghetti  datastax.com>:
>
>> When I tested cqlsh COPY FROM for CASSANDRA-11053
>> ,
>> I was able to import about 20 GB in under 4 minutes on a cluster with 8
>> nodes using the same benchmark created for cassandra-loader, provided the
>> driver was Cythonized, instructions in this blog post
>> .
>> The performance was similar to cassandra-loader.
>>
>> Depending on your schema, one or the other may do slightly better.
>>
>> On Fri, Mar 10, 2017 at 8:11 AM, Ryan Svihla  wrote:
>>
>>> I suggest using cassandra loader
>>>
>>> https://github.com/brianmhess/cassandra-loader
>>>
>>> On Mar 9, 2017 5:30 PM, "Artur R"  wrote:
>>>
 Hello all!

 There are ~500gb of CSV files and I am trying to find the way how to
 upload them to C* table (new empty C* cluster of 3 nodes, replication
 factor 2) within reasonable time (say, 10 hours using 3-4 instance of
 c3.8xlarge EC2 nodes).

 My first impulse was to use CQLSSTableWriter, but it is too slow is of
 single instance and I can't efficiently parallelize it (just creating Java
 threads) because after some moment it always "hangs" (looks like GC is
 overstressed) and eats all available memory.

 So the questions are:
 1. What is the best way to bulk-load huge amount of data to new C*
 cluster?

 This comment here: https://issues.apache.org/jira/browse/CASSANDRA-9323
 :

 The preferred way to bulk load is now COPY; see CASSANDRA-11053
>  and linked
> tickets


 is confusing because I read that the CQLSSTableWriter + sstableloader
 is much faster than COPY. Who is right?

 2. Is there any real examples of multi-threaded using of
 CQLSSTableWriter?
 Maybe ready to use libraries like: https://github.com/spotify/hdfs2cass
 ?

 3. sstableloader is slow too. Assuming that I have new empty C*
 cluster, how can I improve the upload speed? Maybe disable replication or
 some other settings while streaming and then turn it back?

 Thanks!
 Artur.

>>>
>>
>>
>> --
>>
>> 
>>
>> STEFANIA ALBORGHETTI
>>
>> Software engineer | +852 6114 9265 <+852%206114%209265> |
>> stefania.alborghe...@datastax.com
>>
>>
>> [image: http://www.datastax.com/cloud-applications]
>> 
>>
>>
>>
>>
>
>
> --
> Cordialement;
>
> Ahmed ELJAMI
>


Re: HELP with bulk loading

2017-03-10 Thread Ahmed Eljami
Hi,

>3. sstableloader is slow too. Assuming that I have new empty C* cluster,
how can I improve the upload speed? Maybe disable replication or some other
settings while streaming and then turn it back?

Maybe you can accelerate your load with the option -cph (connections per
host): https://issues.apache.org/jira/browse/CASSANDRA-3668 and -t=1000.

With cph=12 and t=1000, I went from 56 min (default values) to 11 min for a
table of 50 GB.
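For reference, on the command line that looks roughly like the following (seed
host and sstable directory are placeholders; -t is the stream throttle):

sstableloader -d 10.0.0.1 -cph 12 -t 1000 /path/to/myks/mytable/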



2017-03-10 2:09 GMT+01:00 Stefania Alborghetti <
stefania.alborghe...@datastax.com>:

> When I tested cqlsh COPY FROM for CASSANDRA-11053
> ,
> I was able to import about 20 GB in under 4 minutes on a cluster with 8
> nodes using the same benchmark created for cassandra-loader, provided the
> driver was Cythonized, instructions in this blog post
> .
> The performance was similar to cassandra-loader.
>
> Depending on your schema, one or the other may do slightly better.
>
> On Fri, Mar 10, 2017 at 8:11 AM, Ryan Svihla  wrote:
>
>> I suggest using cassandra loader
>>
>> https://github.com/brianmhess/cassandra-loader
>>
>> On Mar 9, 2017 5:30 PM, "Artur R"  wrote:
>>
>>> Hello all!
>>>
>>> There are ~500gb of CSV files and I am trying to find the way how to
>>> upload them to C* table (new empty C* cluster of 3 nodes, replication
>>> factor 2) within reasonable time (say, 10 hours using 3-4 instance of
>>> c3.8xlarge EC2 nodes).
>>>
>>> My first impulse was to use CQLSSTableWriter, but it is too slow is of
>>> single instance and I can't efficiently parallelize it (just creating Java
>>> threads) because after some moment it always "hangs" (looks like GC is
>>> overstressed) and eats all available memory.
>>>
>>> So the questions are:
>>> 1. What is the best way to bulk-load huge amount of data to new C*
>>> cluster?
>>>
>>> This comment here: https://issues.apache.org/jira/browse/CASSANDRA-9323:
>>>
>>> The preferred way to bulk load is now COPY; see CASSANDRA-11053
  and linked
 tickets
>>>
>>>
>>> is confusing because I read that the CQLSSTableWriter + sstableloader is
>>> much faster than COPY. Who is right?
>>>
>>> 2. Is there any real examples of multi-threaded using of
>>> CQLSSTableWriter?
>>> Maybe ready to use libraries like: https://github.com/spotify/hdfs2cass
>>> ?
>>>
>>> 3. sstableloader is slow too. Assuming that I have new empty C* cluster,
>>> how can I improve the upload speed? Maybe disable replication or some other
>>> settings while streaming and then turn it back?
>>>
>>> Thanks!
>>> Artur.
>>>
>>
>
>
> --
>
> 
>
> STEFANIA ALBORGHETTI
>
> Software engineer | +852 6114 9265 <+852%206114%209265> |
> stefania.alborghe...@datastax.com
>
>
> [image: http://www.datastax.com/cloud-applications]
> 
>
>
>
>


-- 
Cordialement;

Ahmed ELJAMI


Re: HELP with bulk loading

2017-03-09 Thread Stefania Alborghetti
When I tested cqlsh COPY FROM for CASSANDRA-11053
,
I was able to import about 20 GB in under 4 minutes on a cluster with 8
nodes using the same benchmark created for cassandra-loader, provided the
driver was Cythonized, instructions in this blog post
.
The performance was similar to cassandra-loader.

Depending on your schema, one or the other may do slightly better.

On Fri, Mar 10, 2017 at 8:11 AM, Ryan Svihla  wrote:

> I suggest using cassandra loader
>
> https://github.com/brianmhess/cassandra-loader
>
> On Mar 9, 2017 5:30 PM, "Artur R"  wrote:
>
>> Hello all!
>>
>> There are ~500gb of CSV files and I am trying to find the way how to
>> upload them to C* table (new empty C* cluster of 3 nodes, replication
>> factor 2) within reasonable time (say, 10 hours using 3-4 instance of
>> c3.8xlarge EC2 nodes).
>>
>> My first impulse was to use CQLSSTableWriter, but it is too slow is of
>> single instance and I can't efficiently parallelize it (just creating Java
>> threads) because after some moment it always "hangs" (looks like GC is
>> overstressed) and eats all available memory.
>>
>> So the questions are:
>> 1. What is the best way to bulk-load huge amount of data to new C*
>> cluster?
>>
>> This comment here: https://issues.apache.org/jira/browse/CASSANDRA-9323:
>>
>> The preferred way to bulk load is now COPY; see CASSANDRA-11053
>>>  and linked
>>> tickets
>>
>>
>> is confusing because I read that the CQLSSTableWriter + sstableloader is
>> much faster than COPY. Who is right?
>>
>> 2. Is there any real examples of multi-threaded using of CQLSSTableWriter?
>> Maybe ready to use libraries like: https://github.com/spotify/hdfs2cass?
>>
>> 3. sstableloader is slow too. Assuming that I have new empty C* cluster,
>> how can I improve the upload speed? Maybe disable replication or some other
>> settings while streaming and then turn it back?
>>
>> Thanks!
>> Artur.
>>
>


-- 



STEFANIA ALBORGHETTI

Software engineer | +852 6114 9265 | stefania.alborghe...@datastax.com


[image: http://www.datastax.com/cloud-applications]



Re: HELP with bulk loading

2017-03-09 Thread Ryan Svihla
I suggest using cassandra loader

https://github.com/brianmhess/cassandra-loader

On Mar 9, 2017 5:30 PM, "Artur R"  wrote:

> Hello all!
>
> There are ~500gb of CSV files and I am trying to find the way how to
> upload them to C* table (new empty C* cluster of 3 nodes, replication
> factor 2) within reasonable time (say, 10 hours using 3-4 instance of
> c3.8xlarge EC2 nodes).
>
> My first impulse was to use CQLSSTableWriter, but it is too slow is of
> single instance and I can't efficiently parallelize it (just creating Java
> threads) because after some moment it always "hangs" (looks like GC is
> overstressed) and eats all available memory.
>
> So the questions are:
> 1. What is the best way to bulk-load huge amount of data to new C* cluster?
>
> This comment here: https://issues.apache.org/jira/browse/CASSANDRA-9323:
>
> The preferred way to bulk load is now COPY; see CASSANDRA-11053
>>  and linked
>> tickets
>
>
> is confusing because I read that the CQLSSTableWriter + sstableloader is
> much faster than COPY. Who is right?
>
> 2. Is there any real examples of multi-threaded using of CQLSSTableWriter?
> Maybe ready to use libraries like: https://github.com/spotify/hdfs2cass?
>
> 3. sstableloader is slow too. Assuming that I have new empty C* cluster,
> how can I improve the upload speed? Maybe disable replication or some other
> settings while streaming and then turn it back?
>
> Thanks!
> Artur.
>


HELP with bulk loading

2017-03-09 Thread Artur R
Hello all!

There are ~500gb of CSV files and I am trying to find the way how to upload
them to C* table (new empty C* cluster of 3 nodes, replication factor 2)
within reasonable time (say, 10 hours using 3-4 instance of c3.8xlarge EC2
nodes).

My first impulse was to use CQLSSTableWriter, but it is too slow as a single
instance and I can't efficiently parallelize it (by just creating Java threads)
because after some time it always "hangs" (it looks like the GC is
overstressed) and eats all available memory.

So the questions are:
1. What is the best way to bulk-load a huge amount of data into a new C* cluster?

This comment here: https://issues.apache.org/jira/browse/CASSANDRA-9323:

The preferred way to bulk load is now COPY; see CASSANDRA-11053
>  and linked tickets


is confusing because I read that the CQLSSTableWriter + sstableloader is
much faster than COPY. Who is right?

2. Are there any real examples of multi-threaded use of CQLSSTableWriter?
Maybe ready-to-use libraries like https://github.com/spotify/hdfs2cass?

3. sstableloader is slow too. Assuming that I have a new, empty C* cluster,
how can I improve the upload speed? Maybe disable replication or some other
settings while streaming and then turn them back on?

Thanks!
Artur.


Re: Attached profiled data but need help understanding it

2017-03-06 Thread Romain Hardouin
Hi Kant,
You'll find more information about ixgbevf here:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/sriov-networking.html
I repeat myself but don't underestimate VMs placement: same AZ? same placement
group? etc.
Note that LWT are not discouraged but as the doc says: "[...] reserve
lightweight transactions for those situations where they are absolutely
necessary;"
I hope you'll be able to achieve what you want with more powerful VMs. Let us know!
Best,
Romain
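(A quick way to check which driver an interface is actually using on a Linux
instance, assuming the interface is eth0:)

ethtool -i eth0     # "driver: ixgbevf" means SR-IOV enhanced networking is in use
modinfo ixgbevf     # shows whether the module is installed and its version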
 

On Monday, 6 March 2017 at 10:49, Kant Kodali wrote:
 

 Hi Romain,
We may be able to achieve what we need without LWT but that would require bunch 
of changes from the application side and possibly introducing caching layers 
and designing solution around that. But for now, we are constrained to use 
LWT's for another month or so. All said, I still would like to see the 
discouraged features such as LWT's, secondary indexes, triggers get better over 
time so it would really benefit users.
Agreed High park/unpark is a sign of excessive context switching but any ideas 
why this is happening? yes today we will be experimenting with c3.2Xlarge and 
see what the numbers look like and slowly scale up from there.
How do I make sure I install  ixgbevf driver? Do M4.xlarge or C3.2Xlarge don't 
already have it? when I googled " ixgbevf driver" it tells me it is ethernet 
driver...I thought all instances by default run on ethernet on AWS. can you 
please give more context on this?
Thanks,kant
On Fri, Mar 3, 2017 at 4:42 AM, Romain Hardouin  wrote:

Also, I should have mentioned that it would be a good idea to spawn your three 
benchmark instances in the same AZ, then try with one instance on each AZ to 
see how network latency affects your LWT rate. The lower latency is achievable 
with three instances on the same placement group of course but it's kinda 
dangerous for production. 





   

Re: Attached profiled data but need help understanding it

2017-03-06 Thread Kant Kodali
Hi Romain,

We may be able to achieve what we need without LWT but that would require
bunch of changes from the application side and possibly introducing caching
layers and designing solution around that. But for now, we are constrained
to use LWT's for another month or so. All said, I still would like to see
the discouraged features such as LWT's, secondary indexes, triggers get
better over time so it would really benefit users.

Agreed High park/unpark is a sign of excessive context switching but any
ideas why this is happening? yes today we will be experimenting with
c3.2Xlarge and see what the numbers look like and slowly scale up from
there.

How do I make sure I install  ixgbevf driver? Do M4.xlarge or C3.2Xlarge
don't already have it? when I googled " ixgbevf driver" it tells me it is
ethernet driver...I thought all instances by default run on ethernet on
AWS. can you please give more context on this?

Thanks,
kant

On Fri, Mar 3, 2017 at 4:42 AM, Romain Hardouin  wrote:

> Also, I should have mentioned that it would be a good idea to spawn your
> three benchmark instances in the same AZ, then try with one instance on
> each AZ to see how network latency affects your LWT rate. The lower latency
> is achievable with three instances on the same placement group of course
> but it's kinda dangerous for production.
>
>
>


Re: Attached profiled data but need help understanding it

2017-03-03 Thread Romain Hardouin
Also, I should have mentioned that it would be a good idea to spawn your three 
benchmark instances in the same AZ, then try with one instance on each AZ to 
see how network latency affects your LWT rate. The lower latency is achievable 
with three instances on the same placement group of course but it's kinda 
dangerous for production. 



Re: Attached profiled data but need help understanding it

2017-03-02 Thread Romain Hardouin
Hi Kant,
> By backporting you mean I should cherry pick CASSANDRA-11966 commit and 
> compile from source?
Yes
Regarding the network utilization: you checked throughput but latency is more 
important for LWT. That's why you should make sure your m4 instances (both C* 
and client) are using ixgbevf driver.
I agree 1500 writes/s is not impressive, but 4 vCPUs is low. It depends on the
workload, but my experience is that an AWS instance starts to be powerful with
16 vCPUs (e.g. c3.4xlarge). And beware of EBS (again, that's my experience,
YMMV).
High park/unpark is a sign of excessive context switching. If I were you I 
would make a LWT benchmark with 3 x c3.4xlarge or c3.8xlarge (32 vCPUs, SSD 
instance store). Spawn spot instances to save money and be sure to tune 
cassandra.yaml accordingly e.g. concurrent_writes.
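(As a rough illustration, assuming 32-vCPU nodes; these are only starting points
to benchmark against, not recommendations:)

# cassandra.yaml
concurrent_writes: 256   # the yaml comments suggest roughly 8 x number_of_cores
concurrent_reads: 64     # reads are disk-bound; roughly 16 x number_of_drives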
Finally, a naive question but I must ask you... are you really sure you need 
LWT? Can't you achieve your goal without it?

 Best,
Romain

On Thursday, 2 March 2017 at 10:31, Kant Kodali <k...@peernova.com> wrote:
 

 Hi Romain,
Any ideas on this? I am not sure why there is so much time being spent in Park 
and Unpark methods as produced by thread dump? Also, could you please look into 
my responses from other email? It would greatly help.
Thanks,kant
On Tue, Feb 28, 2017 at 10:20 PM, Kant Kodali <k...@peernova.com> wrote:

Hi Romain,
I am using Cassandra version 3.0.9 and here is the generated report  (Graphical 
view) of my thread dump as well!. Just send this over in case if it helps.
Thanks,kant
On Tue, Feb 28, 2017 at 7:51 PM, Kant Kodali <k...@peernova.com> wrote:

Hi Romain,
Thanks again. My response are inline.
kant

On Tue, Feb 28, 2017 at 10:04 AM, Romain Hardouin <romainh...@yahoo.fr> wrote:

> we are currently using 3.0.9.  should we use 3.8 or 3.10
No, don't use 3.X in production unless you really need a major feature.I would 
advise to stick to 3.0.X (i.e. 3.0.11 now).You can backport CASSANDRA-11966 
easily but of course you have to deploy from source as a prerequisite.

   By backporting you mean I should cherry pick CASSANDRA-11966 commit and 
compile from source?

> I haven't done any tuning yet.
So it's a good news because maybe there is room for improvement
> Can I change this on a running instance? If so, how? or does it require a 
> downtime?
You can throttle compaction at runtime with "nodetool setcompactionthroughput". 
Be sure to read all nodetool commmands, some of them are really useful for a 
day to day tuning/management. 
If GC is fine, then check other things -> "[...] different pool sizes for NTR, 
concurrent reads and writes, compaction executors, etc. Also check if you can 
improve network latency (e.g. VF or ENA on AWS)."
Regarding thread pools, some of them can be resized at runtime via JMX.
> 5000 is the target.
Right now you reached 1500. Is it per node or for the cluster?We don't know 
your setup so it's hard to say it's doable. Can you provide more details? VM, 
physical nodes, #nodes, etc.Generally speaking LWT should be seldom used. AFAIK 
you won't achieve 10,000 writes/s per node.
Maybe someone on the list already made some tuning for heavy LWT workload?

    1500 total cluster.  
    I have a 8 node cassandra cluster. Each node is AWS m4.xlarge instance (so 
4 vCPU, 16GB, 1Gbit network=125MB/s)
    I have 1 node (m4.xlarge) for my application which just inserts a bunch of 
data and each insert is an LWT
 
    I tested the network throughput of the node.  I can get up 98 MB/s.
    Now, when I start my application. I see that Cassandra nodes Receive rate/ 
throughput is about 4MB/s (yes it is in Mega Bytes. I checked this by running 
sudo iftop -B). The Disk I/O is also same and the Cassandra process CPU usage 
is about 360% (the max is 400% since it is a 4 core machine). The application 
node transmission throughput is about 6MB/s. so even with 4MB/s receive 
throughput at Cassandra node the CPU is almost maxed out. I am not sure what 
this says about Cassandra? But, what I can tell is that Network is way 
underutilized and that 8 nodes are unnecessary so we plan to bring it down to 4 
nodes except each node this time will have 8 cores. All said, I am still not 
sure how to scale up from 1500 writes/sec?       

Best,
Romain








   

Re: Attached profiled data but need help understanding it

2017-02-28 Thread Kant Kodali
Hi Romain,

I am using Cassandra version 3.0.9 and here is the generated report

(Graphical view) of my thread dump as well!. Just send this over in case if
it helps.

Thanks,
kant

On Tue, Feb 28, 2017 at 7:51 PM, Kant Kodali  wrote:

> Hi Romain,
>
> Thanks again. My response are inline.
>
> kant
>
> On Tue, Feb 28, 2017 at 10:04 AM, Romain Hardouin 
> wrote:
>
>> > we are currently using 3.0.9.  should we use 3.8 or 3.10
>>
>> No, don't use 3.X in production unless you really need a major feature.
>> I would advise to stick to 3.0.X (i.e. 3.0.11 now).
>> You can backport CASSANDRA-11966 easily but of course you have to deploy
>> from source as a prerequisite.
>>
>
>   * By backporting you mean I should cherry pick CASSANDRA-11966 commit
> and compile from source?*
>
>>
>> > I haven't done any tuning yet.
>>
>> So it's a good news because maybe there is room for improvement
>>
>> > Can I change this on a running instance? If so, how? or does it require
>> a downtime?
>>
>> You can throttle compaction at runtime with "nodetool
>> setcompactionthroughput". Be sure to read all nodetool commmands, some of
>> them are really useful for a day to day tuning/management.
>>
>> If GC is fine, then check other things -> "[...] different pool sizes for
>> NTR, concurrent reads and writes, compaction executors, etc. Also check if
>> you can improve network latency (e.g. VF or ENA on AWS)."
>>
>> Regarding thread pools, some of them can be resized at runtime via JMX.
>>
>> > 5000 is the target.
>>
>> Right now you reached 1500. Is it per node or for the cluster?
>> We don't know your setup so it's hard to say it's doable. Can you provide
>> more details? VM, physical nodes, #nodes, etc.
>> Generally speaking LWT should be seldom used. AFAIK you won't achieve
>> 10,000 writes/s per node.
>>
>> Maybe someone on the list already made some tuning for heavy LWT workload?
>>
>
> *1500 total cluster.  *
>
> *I have a 8 node cassandra cluster. Each node is AWS m4.xlarge
> instance (so 4 vCPU, 16GB, 1Gbit network=125MB/s)*
>
>
>
> *I have 1 node (m4.xlarge) for my application which just inserts a
> bunch of data and each insert is an LWT I tested the network throughput
> of the node.  I can get up 98 MB/s.*
>
> *Now, when I start my application. I see that Cassandra nodes Receive
> rate/ throughput is about 4MB/s (yes it is in Mega Bytes. I checked this by
> running sudo iftop -B). The Disk I/O is also same and the Cassandra process
> CPU usage is about 360% (the max is 400% since it is a 4 core machine). The
> application node transmission throughput is about 6MB/s. so even with 4MB/s
> receive throughput at Cassandra node the CPU is almost maxed out. I am not
> sure what this says about Cassandra? But, what I can tell is that Network
> is way underutilized and that 8 nodes are unnecessary so we plan to bring
> it down to 4 nodes except each node this time will have 8 cores. All said,
> I am still not sure how to scale up from 1500 writes/sec? *
>
>
>>
>> Best,
>>
>> Romain
>>
>>
>


Re: Attached profiled data but need help understanding it

2017-02-28 Thread Kant Kodali
Hi Romain,

Thanks again. My response are inline.

kant

On Tue, Feb 28, 2017 at 10:04 AM, Romain Hardouin 
wrote:

> > we are currently using 3.0.9.  should we use 3.8 or 3.10
>
> No, don't use 3.X in production unless you really need a major feature.
> I would advise to stick to 3.0.X (i.e. 3.0.11 now).
> You can backport CASSANDRA-11966 easily but of course you have to deploy
> from source as a prerequisite.
>

  * By backporting you mean I should cherry pick CASSANDRA-11966 commit and
compile from source?*

>
> > I haven't done any tuning yet.
>
> So it's a good news because maybe there is room for improvement
>
> > Can I change this on a running instance? If so, how? or does it require
> a downtime?
>
> You can throttle compaction at runtime with "nodetool
> setcompactionthroughput". Be sure to read all nodetool commmands, some of
> them are really useful for a day to day tuning/management.
>
> If GC is fine, then check other things -> "[...] different pool sizes for
> NTR, concurrent reads and writes, compaction executors, etc. Also check if
> you can improve network latency (e.g. VF or ENA on AWS)."
>
> Regarding thread pools, some of them can be resized at runtime via JMX.
>
> > 5000 is the target.
>
> Right now you reached 1500. Is it per node or for the cluster?
> We don't know your setup so it's hard to say it's doable. Can you provide
> more details? VM, physical nodes, #nodes, etc.
> Generally speaking LWT should be seldom used. AFAIK you won't achieve
> 10,000 writes/s per node.
>
> Maybe someone on the list already made some tuning for heavy LWT workload?
>

*1500 total cluster.  *

*I have a 8 node cassandra cluster. Each node is AWS m4.xlarge instance
(so 4 vCPU, 16GB, 1Gbit network=125MB/s)*



*I have 1 node (m4.xlarge) for my application which just inserts a
bunch of data, and each insert is an LWT. I tested the network throughput
of the node; I can get up to 98 MB/s.*

*Now, when I start my application. I see that Cassandra nodes Receive
rate/ throughput is about 4MB/s (yes it is in Mega Bytes. I checked this by
running sudo iftop -B). The Disk I/O is also same and the Cassandra process
CPU usage is about 360% (the max is 400% since it is a 4 core machine). The
application node transmission throughput is about 6MB/s. so even with 4MB/s
receive throughput at Cassandra node the CPU is almost maxed out. I am not
sure what this says about Cassandra? But, what I can tell is that Network
is way underutilized and that 8 nodes are unnecessary so we plan to bring
it down to 4 nodes except each node this time will have 8 cores. All said,
I am still not sure how to scale up from 1500 writes/sec? *


>
> Best,
>
> Romain
>
>


Re: Attached profiled data but need help understanding it

2017-02-28 Thread Romain Hardouin
> we are currently using 3.0.9.  should we use 3.8 or 3.10
No, don't use 3.X in production unless you really need a major feature. I would
advise to stick to 3.0.X (i.e. 3.0.11 now). You can backport CASSANDRA-11966
easily but of course you have to deploy from source as a prerequisite.
> I haven't done any tuning yet.
So that's good news, because maybe there is room for improvement.
> Can I change this on a running instance? If so, how? or does it require a
> downtime?
You can throttle compaction at runtime with "nodetool setcompactionthroughput".
Be sure to read all the nodetool commands, some of them are really useful for
day-to-day tuning/management.
If GC is fine, then check other things -> "[...] different pool sizes for NTR, 
concurrent reads and writes, compaction executors, etc. Also check if you can 
improve network latency (e.g. VF or ENA on AWS)."
Regarding thread pools, some of them can be resized at runtime via JMX.
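For example (the value is only illustrative):

nodetool setcompactionthroughput 16   # throttle compaction to 16 MB/s at runtime
nodetool getcompactionthroughput      # confirm the current setting
nodetool compactionstats              # keep an eye on pending compactions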
> 5000 is the target.
Right now you reached 1500. Is it per node or for the cluster? We don't know
your setup so it's hard to say it's doable. Can you provide more details? VM,
physical nodes, #nodes, etc.
Generally speaking LWT should be seldom used. AFAIK you won't achieve
10,000 writes/s per node.
Maybe someone on the list already made some tuning for heavy LWT workload?
Best,
Romain


Re: Attached profiled data but need help understanding it

2017-02-27 Thread Kant Kodali
Hi! My answers are inline.

On Mon, Feb 27, 2017 at 11:48 AM, Kant Kodali <k...@peernova.com> wrote:

>
>
> On Mon, Feb 27, 2017 at 10:30 AM, Romain Hardouin <romainh...@yahoo.fr>
> wrote:
>
>> Hi,
>>
>> Regarding shared pool workers see CASSANDRA-11966. You may have to
>> backport it depending on your Cassandra version.
>>
>
> *we are currently using 3.0.9.  should we use 3.8 or 3.10?*
>
>>
>> Did you try to lower compaction throughput to see if it helps? Be sure to
>> keep an eye on pending compactions, SSTables count and SSTable per read of
>> course.
>>
>
>*I haven't done any tuning yet. Can I change this on a running
> instance? If so, how? or does it require a downtime?*
>
>>
>> "alloc" is the memory allocation rate. You can see that compactions are
>> GC intensive.
>>
>
>> You won't be able to achieve impressive writes/s with LWT. But maybe
>> there is room for improvement. Try GC tuning, different pool sizes for NTR,
>> concurrent reads and writes, compaction executors, etc. Also check if you
>> can improve network latency (e.g. VF or ENA on AWS).
>>
>
>* GC seems to be fine because when I checked GC is about 0.25%. Total
> GC time is about 6minutes since the node is up and running for about 50
> hours*.
>
>>
>> What LWT rate would you want to achieve?  *5000 is the target.*
>>
>> Best,
>>
>> Romain
>>
>>
>>
>> Le Lundi 27 février 2017 12h48, Kant Kodali <k...@peernova.com> a écrit :
>>
>>
>> Also Attached is a flamed graph generated from a thread dump.
>>
>> On Mon, Feb 27, 2017 at 2:32 AM, Kant Kodali <k...@peernova.com> wrote:
>>
>> Hi,
>>
>> Attached are the stats of my Cassandra node running on a 4-core CPU. I am
>> using sjk-plus tool for the first time so what are the things I should
>> watched out for in my attached screenshot? I can see the CPU is almost
>> maxed out but should I say that is because of compaction or
>> shared-worker-pool threads (which btw, I dont know what they are doing
>> perhaps I need to take threadump)? Also what is alloc for each thread?
>>
>> I have a insert heavy workload (almost like an ingest running against
>> cassandra cluster) and in my case all writes are LWT.
>>
>> The current throughput is 1500 writes/sec where each write is about 1KB.
>> How can I tune something for a higher throughput? Any pointers or
>> suggestions would help.
>>
>> Thanks much,
>> kant
>>
>>
>>
>>
>>
>


Re: Attached profiled data but need help understanding it

2017-02-27 Thread Kant Kodali
On Mon, Feb 27, 2017 at 10:30 AM, Romain Hardouin <romainh...@yahoo.fr>
wrote:

> Hi,
>
> Regarding shared pool workers see CASSANDRA-11966. You may have to
> backport it depending on your Cassandra version.
>

*we are currently using 3.0.9.  should we use 3.8 or 3.10?*

>
> Did you try to lower compaction throughput to see if it helps? Be sure to
> keep an eye on pending compactions, SSTables count and SSTable per read of
> course.
>

   *I haven't done any tuning yet. Can I change this on a running instance?
If so, how? or does it require a downtime?*

>
> "alloc" is the memory allocation rate. You can see that compactions are GC
> intensive.
>

> You won't be able to achieve impressive writes/s with LWT. But maybe there
> is room for improvement. Try GC tuning, different pool sizes for NTR,
> concurrent reads and writes, compaction executors, etc. Also check if you
> can improve network latency (e.g. VF or ENA on AWS).
>

   * GC seems to be fine because when I checked GC is about 0.25%. Total GC
time is about 6minutes since the node is up and running for about 50 hours*.

>
> What LWT rate would you want to achieve?  *5000 is the target.*
>
> Best,
>
> Romain
>
>
>
> Le Lundi 27 février 2017 12h48, Kant Kodali <k...@peernova.com> a écrit :
>
>
> Also Attached is a flamed graph generated from a thread dump.
>
> On Mon, Feb 27, 2017 at 2:32 AM, Kant Kodali <k...@peernova.com> wrote:
>
> Hi,
>
> Attached are the stats of my Cassandra node running on a 4-core CPU. I am
> using sjk-plus tool for the first time so what are the things I should
> watched out for in my attached screenshot? I can see the CPU is almost
> maxed out but should I say that is because of compaction or
> shared-worker-pool threads (which btw, I dont know what they are doing
> perhaps I need to take threadump)? Also what is alloc for each thread?
>
> I have a insert heavy workload (almost like an ingest running against
> cassandra cluster) and in my case all writes are LWT.
>
> The current throughput is 1500 writes/sec where each write is about 1KB.
> How can I tune something for a higher throughput? Any pointers or
> suggestions would help.
>
> Thanks much,
> kant
>
>
>
>
>


Re: Attached profiled data but need help understanding it

2017-02-27 Thread Romain Hardouin
Hi,
Regarding shared pool workers see CASSANDRA-11966. You may have to backport it 
depending on your Cassandra version. 
Did you try to lower compaction throughput to see if it helps? Be sure to keep 
an eye on pending compactions, SSTables count and SSTable per read of course.
"alloc" is the memory allocation rate. You can see that compactions are GC 
intensive.
You won't be able to achieve impressive writes/s with LWT. But maybe there is 
room for improvement. Try GC tuning, different pool sizes for NTR, concurrent 
reads and writes, compaction executors, etc. Also check if you can improve 
network latency (e.g. VF or ENA on AWS).
What LWT rate would you want to achieve?
Best,
Romain
 

On Monday, 27 February 2017 at 12:48, Kant Kodali <k...@peernova.com> wrote:
 

 Also Attached is a flamed graph generated from a thread dump.
On Mon, Feb 27, 2017 at 2:32 AM, Kant Kodali <k...@peernova.com> wrote:

Hi,
Attached are the stats of my Cassandra node running on a 4-core CPU. I am using 
sjk-plus tool for the first time so what are the things I should watched out 
for in my attached screenshot? I can see the CPU is almost maxed out but should 
I say that is because of compaction or shared-worker-pool threads (which btw, I 
dont know what they are doing perhaps I need to take threadump)? Also what is 
alloc for each thread? 
I have a insert heavy workload (almost like an ingest running against cassandra 
cluster) and in my case all writes are LWT.
The current throughput is 1500 writes/sec where each write is about 1KB. How 
can I tune something for a higher throughput? Any pointers or suggestions would 
help.

Thanks much,kant



   

Re: Attached profiled data but need help understanding it

2017-02-27 Thread Kant Kodali
Also Attached is a flamed graph generated from a thread dump.

On Mon, Feb 27, 2017 at 2:32 AM, Kant Kodali <k...@peernova.com> wrote:

> Hi,
>
> Attached are the stats of my Cassandra node running on a 4-core CPU. I am
> using sjk-plus tool for the first time so what are the things I should
> watched out for in my attached screenshot? I can see the CPU is almost
> maxed out but should I say that is because of compaction or
> shared-worker-pool threads (which btw, I dont know what they are doing
> perhaps I need to take threadump)? Also what is alloc for each thread?
>
> I have a insert heavy workload (almost like an ingest running against
> cassandra cluster) and in my case all writes are LWT.
>
> The current throughput is 1500 writes/sec where each write is about 1KB.
> How can I tune something for a higher throughput? Any pointers or
> suggestions would help.
>
> Thanks much,
> kant
>


Re: Help with cassandra triggers

2017-01-17 Thread Jonathan Haddad
Triggers only get executed on the coordinator. There's no remote DC
trigger.

What you need is Change Data Capture (CDC).
https://issues.apache.org/jira/browse/CASSANDRA-8844
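For reference, CDC (available from Cassandra 3.8 onwards) is enabled node-wide in
cassandra.yaml and then per table, roughly:

# cassandra.yaml
cdc_enabled: true

-- then per table, in CQL
ALTER TABLE my_ks.my_table WITH cdc = true;

A consumer then has to read (and clean up) the commit log segments that Cassandra
writes into the cdc_raw directory on each node.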

On Tue, Jan 17, 2017 at 9:40 AM suraj pasuparthy <suraj.pasupar...@gmail.com>
wrote:

> Hello
> We have a usecase where we need to support triggers with multiple
> datacenters.
> The use case is we need is
> 1) Data is written into DC1.
> 2) The Sync configured will sync the data to DC2
> 3) when the data is written into DC2, we need a trigger to fire on DC2.
>
> I have tested the triggers for a local write. But i do not see the trigger
> on sync.
> Could anyone please help me out with this ?
>
> thanks
> Suraj
>


Help with cassandra triggers

2017-01-17 Thread suraj pasuparthy
Hello
We have a use case where we need to support triggers with multiple
datacenters.
The use case is:
1) Data is written into DC1.
2) The Sync configured will sync the data to DC2
3) when the data is written into DC2, we need a trigger to fire on DC2.

I have tested the triggers for a local write, but I do not see the trigger
fire on sync.
Could anyone please help me out with this?

thanks
Suraj


Re: Help

2017-01-15 Thread Jonathan Haddad
I've heard enough stories of firewall issues that I'm willing to bet it's
the problem, if it's sitting between the nodes.
On Sun, Jan 15, 2017 at 9:32 AM Anshu Vajpayee <anshu.vajpa...@gmail.com>
wrote:

> The setup is not on cloud. We have a few nodes in one DC (DC1) and the same
> number of nodes in the other DC (DC2). We have a dedicated firewall in front
> of the nodes.
>
> Reads and writes happen with LOCAL_QUORUM, so those are not affected, but
> hints accumulate from one DC to the other for replication. Hints are also
> timing out sporadically in the logs.
>
> describecluster didn't show any error, but in some cases it was taking a
> long time.
>
> On Sun, Jan 15, 2017 at 3:01 AM, Aleksandr Ivanov <ale...@gmail.com>
> wrote:
>
> Could you share a bit about your cluster setup? Do you use a cloud for your
> deployment, or dedicated firewalls in front of the nodes?
>
> If gossip shows that everything is up, it doesn't mean that all nodes can
> communicate with each other. I have noticed situations where a TCP connection
> was killed by the firewall and Cassandra didn't reconnect automatically. This
> can easily be detected with the nodetool describecluster command.
>
> Aleksandr
>
> Gossip shows all nodes are up.
>
> But when we perform writes, the coordinator stores hints. It means the
> coordinator was not able to deliver the writes to a few nodes, even though
> the consistency requirement was met.
>
> The nodes for which writes were failing are in a different DC. Those nodes
> do not have any load.
>
> Gossip shows everything is up. I already set the write timeout to 60 sec,
> but it didn't help.
>
> Has anyone encountered this scenario? Network-wise everything is fine.
>
> Cassandra version is 2.1.13
>
> --
> *Regards,*
> *Anshu *
>
>
>
>
>
> --
> *Regards,*
> *Anshu *
>
>
>


Re: Help

2017-01-15 Thread Anshu Vajpayee
The setup is not on cloud. We have a few nodes in one DC (DC1) and the same
number of nodes in the other DC (DC2). We have a dedicated firewall in front
of the nodes.

Reads and writes happen with LOCAL_QUORUM, so those are not affected, but
hints accumulate from one DC to the other for replication. Hints are also
timing out sporadically in the logs.

describecluster didn't show any error, but in some cases it was taking a
long time.

On Sun, Jan 15, 2017 at 3:01 AM, Aleksandr Ivanov <ale...@gmail.com> wrote:

> Could you share a bit about your cluster setup? Do you use a cloud for your
> deployment, or dedicated firewalls in front of the nodes?
>
> If gossip shows that everything is up, it doesn't mean that all nodes can
> communicate with each other. I have noticed situations where a TCP connection
> was killed by the firewall and Cassandra didn't reconnect automatically. This
> can easily be detected with the nodetool describecluster command.
>
> Aleksandr
>
>> Gossip shows all nodes are up.
>>
>> But when we perform writes, the coordinator stores hints. It means the
>> coordinator was not able to deliver the writes to a few nodes, even though
>> the consistency requirement was met.
>>
>> The nodes for which writes were failing are in a different DC. Those
>> nodes do not have any load.
>>
>> Gossip shows everything is up. I already set the write timeout to 60 sec,
>> but it didn't help.
>>
>> Has anyone encountered this scenario? Network-wise everything is fine.
>>
>> Cassandra version is 2.1.13
>>
>> --
>> *Regards,*
>> *Anshu *
>>
>>
>>


-- 
*Regards,*
*Anshu *


Re: Help

2017-01-14 Thread Aleksandr Ivanov
Could you share a bit about your cluster setup? Do you use a cloud for your
deployment, or dedicated firewalls in front of the nodes?

If gossip shows that everything is up, it doesn't mean that all nodes can
communicate with each other. I have noticed situations where a TCP connection
was killed by the firewall and Cassandra didn't reconnect automatically. This
can easily be detected with the nodetool describecluster command.

Aleksandr

 shows - all nodes are up.
>
> But when  we perform writes , coordinator stores the hints. It means  -
> coordinator was not able to deliver the writes to few nodes after meeting
> consistency requirements.
>
> The nodes for which  writes were failing, are in different DC. Those nodes
> do not have any load.
>
> Gossips shows everything is up.  I already set write timeout to 60 sec,
> but no help.
>
> Can anyone encounter this scenario ? Network side everything is fine.
>
> Cassandra version is 2.1.13
>
> --
> *Regards,*
> *Anshu *
>
>
>


Re: Help

2017-01-09 Thread Chris Lohfink
Do you have any monitoring set up around garbage collections?  A GC pause +
network latency > write timeout will cause intermittent hints.

On Sun, Jan 8, 2017 at 10:30 PM, Anshu Vajpayee <anshu.vajpa...@gmail.com>
wrote:

> Gossip shows all nodes are up.
>
> But when we perform writes, the coordinator stores hints. It means the
> coordinator was not able to deliver the writes to a few nodes, even though
> the consistency requirement was met.
>
> The nodes for which writes were failing are in a different DC. Those nodes
> do not have any load.
>
> Gossip shows everything is up. I already set the write timeout to 60 sec,
> but it didn't help.
>
> Has anyone encountered this scenario? Network-wise everything is fine.
>
> Cassandra version is 2.1.13
>
> --
> *Regards,*
> *Anshu *
>
>
>


Re: Help

2017-01-09 Thread Edward Capriolo
On Sun, Jan 8, 2017 at 11:30 PM, Anshu Vajpayee <anshu.vajpa...@gmail.com>
wrote:

> Gossip shows all nodes are up.
>
> But when we perform writes, the coordinator stores hints. It means the
> coordinator was not able to deliver the writes to a few nodes, even though
> the consistency requirement was met.
>
> The nodes for which writes were failing are in a different DC. Those nodes
> do not have any load.
>
> Gossip shows everything is up. I already set the write timeout to 60 sec,
> but it didn't help.
>
> Has anyone encountered this scenario? Network-wise everything is fine.
>
> Cassandra version is 2.1.13
>
> --
> *Regards,*
> *Anshu *
>
>
>
This suggests you have some intermittent network issues. I would suggest
using query tracing

http://cassandra.apache.org/doc/latest/tools/cqlsh.html

Hopefully you can use that to determine why some operations are failing.
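A minimal cqlsh session along these lines (the keyspace, table and statement
are placeholders for whatever your application actually writes):

    TRACING ON;
    CONSISTENCY LOCAL_QUORUM;
    -- run a write shaped like the application's
    INSERT INTO my_ks.my_table (id, val) VALUES (uuid(), 'probe');

cqlsh then prints the trace for that statement; slow or missing steps against
the remote-DC replicas would point at the hop where the hints are coming from.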


Help

2017-01-08 Thread Anshu Vajpayee
Gossip shows all nodes are up.

But when we perform writes, the coordinator stores hints. It means the
coordinator was not able to deliver the writes to a few nodes, even though
the consistency requirement was met.

The nodes for which writes were failing are in a different DC. Those nodes
do not have any load.

Gossip shows everything is up. I already set the write timeout to 60 sec, but
it didn't help.

Has anyone encountered this scenario? Network-wise everything is fine.

Cassandra version is 2.1.13

-- 
*Regards,*
*Anshu *


Re: Schema help required

2016-12-18 Thread Sagar Jambhulkar
Thanks Alain for the help. I will give these options a try.

On Dec 18, 2016 10:01 PM, "Alain RODRIGUEZ" <arodr...@gmail.com> wrote:

> Hi Sagar,
>
>
>> But using Cassandra as a queue is a known anti-pattern, causing
>> tombstones etc.
>> But I could not think of any other way. Does anyone have any other
>> suggestion, so as to not delete after a pair is created?
>
>
> I believe you could try using a fixed TTL (defined at the table level, for
> example), then use the TWCS compaction strategy and compaction options that
> would efficiently manage tombstones. A colleague at The Last Pickle just
> wrote an article about TWCS:
> http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html, and there is a lot
> more information around, including a talk this year at the summit from Jeff,
> who contributed TWCS to Apache Cassandra:
> https://www.youtube.com/watch?v=PWtekUWCIaw
>
> Also, using time buckets in the partition key could help make sure
> tombstones are correctly removed and are not being scanned when
> requesting new data. Yet do not use *only* a time bucket in the partition
> key, as it would lead to hotspots. For a given date, only one node (+
> replicas) would handle the write / read load.
>
> So using "day + something else" as a partition key and "TWCS + fixed TTLs"
> *could* be a good way to move forward.
>
> I would give it a try with the cassandra-stress tool that is shipped
> alongside Apache Cassandra and allows the use of a user defined schema.
>
> C*heers,
>
> Alain
>
> 2016-12-17 21:02 GMT+01:00 Sagar Jambhulkar <sagar.jambhul...@gmail.com>:
>
>> Hi,
>> Needed a suggestion for a schema query. I want to build a reconciliation
>> process using Cassandra. Basically, two or more systems send messages to a
>> reconciliation process. The reconciliation process first does a level-one
>> match of ids and then does a complete comparison of the messages.
>>
>> The best I could think of is something like a queue table with ids. My
>> consumer thread(s) would poll this table, create a pair, and would have to
>> delete from this table. But using Cassandra as a queue is a known
>> anti-pattern, causing tombstones etc.
>> But I could not think of any other way. Does anyone have any other
>> suggestion, so as to not delete after a pair is created? Is Cassandra not
>> the correct technology for a recon process?
>>
>> Thanks,
>> Sagar
>>
>
>


Re: Schema help required

2016-12-18 Thread Alain RODRIGUEZ
Hi Sagar,


> But using Cassandra as a queue is a known anti-pattern, causing
> tombstones etc.
> But I could not think of any other way. Does anyone have any other
> suggestion, so as to not delete after a pair is created?


I believe you could try using a fixed TTL (defined at the table level, for
example), then use the TWCS compaction strategy and compaction options that
would efficiently manage tombstones. A colleague at The Last Pickle just
wrote an article about TWCS:
http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html, and there is a
lot more information around, including a talk this year at the summit from
Jeff, who contributed TWCS to Apache Cassandra:
https://www.youtube.com/watch?v=PWtekUWCIaw.

Also, using time buckets in the partition key could help make sure
tombstones are correctly removed and are not being scanned when
requesting new data. Yet do not use *only* a time bucket in the partition
key, as it would lead to hotspots. For a given date, only one node (+
replicas) would handle the write / read load.

So using "day + something else" as a partition key and "TWCS + fixed TTLs"
*could* be a good way to move forward.
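As a rough sketch of what that could look like (the table and column names are
invented, and the numbers are only placeholders to adapt to your retention and
traffic):

    CREATE TABLE recon.pending_messages (
        day     date,
        source  text,
        id      text,
        payload text,
        PRIMARY KEY ((day, source), id)
    ) WITH default_time_to_live = 259200   -- fixed TTL, e.g. 3 days
      AND compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1'
      };

With a layout like this, expired data is dropped together with its whole
SSTable instead of leaving tombstones to scan through.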

I would give it a try with the cassandra-stress tool that is shipped
alongside Apache Cassandra and allows the use of a user-defined schema.

C*heers,

Alain

2016-12-17 21:02 GMT+01:00 Sagar Jambhulkar <sagar.jambhul...@gmail.com>:

> Hi,
> Needed a suggestion for a schema query. I want to build a reconciliation
> using Cassandra. Basically two or more systems send message to a
> reconciliation process. The reconciliation process first does a level one
> match of id's and than does complete comparison of messages.
>
> The best I could think of is a like a queue table with id's. My consumer
> thread/s would, poll this table, create a pair and would have to delete
> from this table. But this is a known anti pattern to not use Cassandra as a
> queue causing tombstones etc.
> But I could not think of any other way. Does anyone have any other
> suggestion so as to not delete after a pair is created. Is Cassandra not
> the correct technology for a recon process?
>
> Thanks,
> Sagar
>


Schema help required

2016-12-17 Thread Sagar Jambhulkar
Hi,
Needed a suggestion for a schema query. I want to build a reconciliation
process using Cassandra. Basically, two or more systems send messages to a
reconciliation process. The reconciliation process first does a level-one
match of ids and then does a complete comparison of the messages.

The best I could think of is something like a queue table with ids. My
consumer thread(s) would poll this table, create a pair, and would have to
delete from this table. But using Cassandra as a queue is a known
anti-pattern, causing tombstones etc.
But I could not think of any other way. Does anyone have any other
suggestion, so as to not delete after a pair is created? Is Cassandra not
the correct technology for a recon process?

Thanks,
Sagar


Re: ITrigger - Help

2016-11-11 Thread siddharth verma
I haven't tried CDC either.
I've read about it at an abstract level and suggested it as an option for
exploration.

We too use a trigger in production to indicate which primary key has been
acted upon (update/insert/delete).
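(For what it's worth, wiring such a trigger up is a single CQL statement once
the jar with the ITrigger implementation has been dropped into the triggers
directory on every node; the class and table names below are made up.)

    CREATE TRIGGER audit_trigger ON my_ks.my_table
        USING 'com.example.AuditTrigger';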

Regards

On Sat, Nov 12, 2016 at 12:08 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> Using CDC is going to be... difficult.  First off (to my knowledge) all
> you get is a CommitLogReader.  If you take a look at the Mutation class
> (everything is serialized and deserialized there), there's no user
> reference.  You only get a keyspace, key, and a PartitionUpdate, which
> don't include any user information.
>
> Next, you may need to dedupe your messages, since you will get RF messages
> for every mutation.  CDC is per-node, vs triggers which are executed at the
> coordinator level.  This may not apply to you as you only want queries that
> came through cqlsh, but I don't see a reasonable way to differentiate all
> the mutations anyway so I think this is a bust.
>
> I haven't spent a lot of time in this code, happy to be corrected if I'm
> wrong.
>
> Jon
>
>
> On Fri, Nov 11, 2016 at 10:14 AM siddharth verma <
> sidd.verma29.l...@gmail.com> wrote:
>
>> Hi Sathish,
>> You could look into, Change Data Capture (CDC) (
>> https://issues.apache.org/jira/browse/CASSANDRA-8844 .
>> It might help you for some of your requirements.
>>
>> Regards
>> Siddharth Verma
>>
>> On Fri, Nov 11, 2016 at 11:34 PM, Jonathan Haddad <j...@jonhaddad.com>
>> wrote:
>>
>> cqlsh uses the Python driver, I don't see how there would be any way to
>> differentiate where the request came from unless you stuck an extra field
>> in the table that you always write when you're not in cqlsh, or you
>> modified cqlsh to include that field whenever it did an insert.
>>
>> Checking iTrigger source, all you get is a reference to the ColumnFamily
>> and some metadata.  At a glance of trunk, it doesn't look like you get the
>> user that initiated the query.
>>
>> To be honest, I wouldn't do any of this, it feels like it's going to
>> become an error prone mess.  Your best bet is to layer something on top of
>> the driver yourself.  The cleanest way I can think of, long term, is to
>> submit a JIRA / patch to enable some class loading & listener hooks in
>> cqlsh itself.  Without a patch and a really good use case I don't know who
>> would want to maintain that though, as it would lock the team into using
>> Python for cqlsh.
>>
>> Jon
>>
>> On Fri, Nov 11, 2016 at 9:52 AM sat <sathish.al...@gmail.com> wrote:
>>
>> Hi,
>>
>> We are planning to use ITrigger to be notified of changes when we execute
>> scripts or run commands at the cqlsh prompt. If the operation is performed
>> through our application's CRUD API, we plan to handle the notification in
>> the CRUD API itself; however, if a user performs some operation (like a
>> write at the cqlsh prompt), we want to capture those changes and update
>> the modules that are listening for them.
>>
>> Could you please let us know whether it is possible to differentiate
>> updates done through the cqlsh prompt from those done through the
>> application?
>>
>> We also thought about creating multiple users in Cassandra and using a
>> different user for cqlsh and for the application. If we go with this
>> approach, do we get the user who modified the table in the ITrigger
>> implementation (i.e., the augment method)?
>>
>> Basically, we are trying to limit/restrict the usage of ITrigger to just
>> the cqlsh prompt, as it is a little complex and risky (we came to know it
>> can impact the Cassandra instance running on that node).
>>
>> Thanks and Regards
>> A.SathishKumar
>>
>>
>>
>>
>> --
>> Siddharth Verma
>> (Visit https://github.com/siddv29/cfs for a high speed cassandra full
>> table scan)
>>
>


-- 
Siddharth Verma
(Visit https://github.com/siddv29/cfs for a high speed cassandra full table
scan)


Re: ITrigger - Help

2016-11-11 Thread Jonathan Haddad
Using CDC is going to be... difficult.  First off (to my knowledge) all you
get is a CommitLogReader.  If you take a look at the Mutation class
(everything is serialized and deserialized there), there's no user
reference.  You only get a keyspace, key, and a PartitionUpdate, which
don't include any user information.

Next, you may need to dedupe your messages, since you will get RF messages
for every mutation.  CDC is per-node, vs triggers which are executed at the
coordinator level.  This may not apply to you as you only want queries that
came through cqlsh, but I don't see a reasonable way to differentiate all
the mutations anyway so I think this is a bust.

I haven't spent a lot of time in this code, happy to be corrected if I'm
wrong.

Jon

On Fri, Nov 11, 2016 at 10:14 AM siddharth verma <
sidd.verma29.l...@gmail.com> wrote:

> Hi Sathish,
> You could look into, Change Data Capture (CDC) (
> https://issues.apache.org/jira/browse/CASSANDRA-8844 .
> It might help you for some of your requirements.
>
> Regards
> Siddharth Verma
>
> On Fri, Nov 11, 2016 at 11:34 PM, Jonathan Haddad <j...@jonhaddad.com>
> wrote:
>
> cqlsh uses the Python driver, I don't see how there would be any way to
> differentiate where the request came from unless you stuck an extra field
> in the table that you always write when you're not in cqlsh, or you
> modified cqlsh to include that field whenever it did an insert.
>
> Checking iTrigger source, all you get is a reference to the ColumnFamily
> and some metadata.  At a glance of trunk, it doesn't look like you get the
> user that initiated the query.
>
> To be honest, I wouldn't do any of this, it feels like it's going to
> become an error prone mess.  Your best bet is to layer something on top of
> the driver yourself.  The cleanest way I can think of, long term, is to
> submit a JIRA / patch to enable some class loading & listener hooks in
> cqlsh itself.  Without a patch and a really good use case I don't know who
> would want to maintain that though, as it would lock the team into using
> Python for cqlsh.
>
> Jon
>
> On Fri, Nov 11, 2016 at 9:52 AM sat <sathish.al...@gmail.com> wrote:
>
> Hi,
>
> We are planning to use ITrigger to notify changes, when we execute scripts
> or run commands in cqlsh prompt. If the operation is performed through our
> application CRUD API, we are planning to handle notification in our CRUD
> API itself, however if user performs some operation(like write operation in
> cqlsh prompt) we want to handle those changes and update modules that are
> listening to those changes.
>
> Could you please let us know whether it is possible to differentiate
> updates done through cqlsh prompt and through application.
>
> We also thought about creating multiple users in cassandra and using
> different user for cqlsh and for the application. If we go with this
> approach, do we get the user who modified the table in ITrigger
> implementation (ie., augment method)
>
>
> Basically we are trying to limit/restrict usage of ITrigger just for cqlsh
> prompt as it is little complex and risky (came to know it will impact
> cassandra running in that node).
>
> Thanks and Regards
> A.SathishKumar
>
>
>
>
> --
> Siddharth Verma
> (Visit https://github.com/siddv29/cfs for a high speed cassandra full
> table scan)
>


Re: ITrigger - Help

2016-11-11 Thread sat
Hi Siddharth Verma,

We explored this option; it seems it outputs the changes only to a log file,
and we cannot get notified via a listener class. Could you please tell us
what kind of information is pushed into the commit log and when we should
read it? Do we need to instantiate CommitLogReader.java and read it every few
seconds? Could you please provide a detailed example/tutorial of how to use
this?

Thanks and Regards
A.SathishKumar

On Fri, Nov 11, 2016 at 10:13 AM, siddharth verma <
sidd.verma29.l...@gmail.com> wrote:

> Hi Sathish,
> You could look into, Change Data Capture (CDC) (
> https://issues.apache.org/jira/browse/CASSANDRA-8844 .
> It might help you for some of your requirements.
>
> Regards
> Siddharth Verma
>
> On Fri, Nov 11, 2016 at 11:34 PM, Jonathan Haddad <j...@jonhaddad.com>
> wrote:
>
>> cqlsh uses the Python driver, I don't see how there would be any way to
>> differentiate where the request came from unless you stuck an extra field
>> in the table that you always write when you're not in cqlsh, or you
>> modified cqlsh to include that field whenever it did an insert.
>>
>> Checking iTrigger source, all you get is a reference to the ColumnFamily
>> and some metadata.  At a glance of trunk, it doesn't look like you get the
>> user that initiated the query.
>>
>> To be honest, I wouldn't do any of this, it feels like it's going to
>> become an error prone mess.  Your best bet is to layer something on top of
>> the driver yourself.  The cleanest way I can think of, long term, is to
>> submit a JIRA / patch to enable some class loading & listener hooks in
>> cqlsh itself.  Without a patch and a really good use case I don't know who
>> would want to maintain that though, as it would lock the team into using
>> Python for cqlsh.
>>
>> Jon
>>
>> On Fri, Nov 11, 2016 at 9:52 AM sat <sathish.al...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We are planning to use ITrigger to notify changes, when we execute
>>> scripts or run commands in cqlsh prompt. If the operation is performed
>>> through our application CRUD API, we are planning to handle notification in
>>> our CRUD API itself, however if user performs some operation(like write
>>> operation in cqlsh prompt) we want to handle those changes and update
>>> modules that are listening to those changes.
>>>
>>> Could you please let us know whether it is possible to differentiate
>>> updates done through cqlsh prompt and through application.
>>>
>>> We also thought about creating multiple users in cassandra and using
>>> different user for cqlsh and for the application. If we go with this
>>> approach, do we get the user who modified the table in ITrigger
>>> implementation (ie., augment method)
>>>
>>>
>>> Basically we are trying to limit/restrict usage of ITrigger just for
>>> cqlsh prompt as it is little complex and risky (came to know it will impact
>>> cassandra running in that node).
>>>
>>> Thanks and Regards
>>> A.SathishKumar
>>>
>>>
>
>
> --
> Siddharth Verma
> (Visit https://github.com/siddv29/cfs for a high speed cassandra full
> table scan)
>



-- 
A.SathishKumar
044-24735023


Re: ITrigger - Help

2016-11-11 Thread sat
Hi Jon,

Thanks for your prompt answer.

Thanks
A.SathishKumar

On Fri, Nov 11, 2016 at 10:04 AM, Jonathan Haddad  wrote:

> cqlsh uses the Python driver, I don't see how there would be any way to
> differentiate where the request came from unless you stuck an extra field
> in the table that you always write when you're not in cqlsh, or you
> modified cqlsh to include that field whenever it did an insert.
>
> Checking iTrigger source, all you get is a reference to the ColumnFamily
> and some metadata.  At a glance of trunk, it doesn't look like you get the
> user that initiated the query.
>
> To be honest, I wouldn't do any of this, it feels like it's going to
> become an error prone mess.  Your best bet is to layer something on top of
> the driver yourself.  The cleanest way I can think of, long term, is to
> submit a JIRA / patch to enable some class loading & listener hooks in
> cqlsh itself.  Without a patch and a really good use case I don't know who
> would want to maintain that though, as it would lock the team into using
> Python for cqlsh.
>
> Jon
>
> On Fri, Nov 11, 2016 at 9:52 AM sat  wrote:
>
>> Hi,
>>
>> We are planning to use ITrigger to notify changes, when we execute
>> scripts or run commands in cqlsh prompt. If the operation is performed
>> through our application CRUD API, we are planning to handle notification in
>> our CRUD API itself, however if user performs some operation(like write
>> operation in cqlsh prompt) we want to handle those changes and update
>> modules that are listening to those changes.
>>
>> Could you please let us know whether it is possible to differentiate
>> updates done through cqlsh prompt and through application.
>>
>> We also thought about creating multiple users in cassandra and using
>> different user for cqlsh and for the application. If we go with this
>> approach, do we get the user who modified the table in ITrigger
>> implementation (ie., augment method)
>>
>>
>> Basically we are trying to limit/restrict usage of ITrigger just for
>> cqlsh prompt as it is little complex and risky (came to know it will impact
>> cassandra running in that node).
>>
>> Thanks and Regards
>> A.SathishKumar
>>
>>


-- 
A.SathishKumar
044-24735023


Re: ITrigger - Help

2016-11-11 Thread siddharth verma
Hi Sathish,
You could look into Change Data Capture (CDC):
https://issues.apache.org/jira/browse/CASSANDRA-8844
It might help you with some of your requirements.

Regards
Siddharth Verma

On Fri, Nov 11, 2016 at 11:34 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> cqlsh uses the Python driver, I don't see how there would be any way to
> differentiate where the request came from unless you stuck an extra field
> in the table that you always write when you're not in cqlsh, or you
> modified cqlsh to include that field whenever it did an insert.
>
> Checking iTrigger source, all you get is a reference to the ColumnFamily
> and some metadata.  At a glance of trunk, it doesn't look like you get the
> user that initiated the query.
>
> To be honest, I wouldn't do any of this, it feels like it's going to
> become an error prone mess.  Your best bet is to layer something on top of
> the driver yourself.  The cleanest way I can think of, long term, is to
> submit a JIRA / patch to enable some class loading & listener hooks in
> cqlsh itself.  Without a patch and a really good use case I don't know who
> would want to maintain that though, as it would lock the team into using
> Python for cqlsh.
>
> Jon
>
> On Fri, Nov 11, 2016 at 9:52 AM sat <sathish.al...@gmail.com> wrote:
>
>> Hi,
>>
>> We are planning to use ITrigger to notify changes, when we execute
>> scripts or run commands in cqlsh prompt. If the operation is performed
>> through our application CRUD API, we are planning to handle notification in
>> our CRUD API itself, however if user performs some operation(like write
>> operation in cqlsh prompt) we want to handle those changes and update
>> modules that are listening to those changes.
>>
>> Could you please let us know whether it is possible to differentiate
>> updates done through cqlsh prompt and through application.
>>
>> We also thought about creating multiple users in cassandra and using
>> different user for cqlsh and for the application. If we go with this
>> approach, do we get the user who modified the table in ITrigger
>> implementation (ie., augment method)
>>
>>
>> Basically we are trying to limit/restrict usage of ITrigger just for
>> cqlsh prompt as it is little complex and risky (came to know it will impact
>> cassandra running in that node).
>>
>> Thanks and Regards
>> A.SathishKumar
>>
>>


-- 
Siddharth Verma
(Visit https://github.com/siddv29/cfs for a high speed cassandra full table
scan)


