Re: Query on Performance Dip

2024-04-05 Thread Jon Haddad
Try changing the chunk length parameter in the table's compression settings to 4 KB,
and reduce readahead to 16 KB if you’re using EBS, or 4 KB if you’re using a
decent local SSD or NVMe drive.

Counters read before write, so read-path tuning matters for counter workloads.
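For reference, a minimal sketch of both changes, assuming a hypothetical table ks.tbl and a hypothetical block device /dev/nvme0n1 (substitute your own names, and verify the device with lsblk first):

    # Set the compression chunk length on the table to 4 KB (hypothetical ks.tbl)
    cqlsh -e "ALTER TABLE ks.tbl WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4};"

    # Reduce readahead; blockdev counts 512-byte sectors, so 32 sectors = 16 KB and 8 sectors = 4 KB
    sudo blockdev --setra 32 /dev/nvme0n1    # EBS: 32 (16 KB); local SSD/NVMe: 8 (4 KB)

The new chunk length only applies to SSTables written after the change, so existing data keeps the old chunk size until it is rewritten (for example by compaction or nodetool upgradesstables -a).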

—
Jon Haddad
Rustyrazorblade Consulting
rustyrazorblade.com


On Fri, Apr 5, 2024 at 9:27 AM Subroto Barua  wrote:

> follow up question on performance issue with 'counter writes'- is there a
> parameter or condition that limits the allocation rate for
> 'CounterMutationStage'? I see 13-18mb/s for 4.1.4 Vs 20-25mb/s for 4.0.5.
>
> The back-end infra is same for both the clusters and same test cases/data
> model.
> On Saturday, March 30, 2024 at 08:40:28 AM PDT, Jon Haddad <
> j...@jonhaddad.com> wrote:
>
>
> Hi,
>
> Unfortunately, the numbers you're posting have no meaning without
> context.  The speculative retries could be the cause of a problem, or you
> could simply be executing enough queries and you have a fairly high
> variance in latency which triggers them often.  It's unclear how many
> queries / second you're executing and there's no historical information to
> suggest if what you're seeing now is an anomaly or business as usual.
>
> If you want to determine if your theory that speculative retries are
> causing your performance issue, then you could try changing speculative
> retry to a fixed value instead of a percentile, such as 50MS.  It's easy
> enough to try and you can get an answer to your question almost immediately.
>
> The problem with this is that you're essentially guessing based on very
> limited information - the output of a nodetool command you've run "every
> few secs".  I prefer to use a more data driven approach.  Get a CPU flame
> graph and figure out where your time is spent:
> https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler/
>
> The flame graph will reveal where your time is spent, and you can focus on
> improving that, rather than looking at a random statistic that you've
> picked.
>
> I just gave a talk at SCALE on distributed systems performance
> troubleshooting.  You'll be better off following a methodical process than
> guessing at potential root causes, because the odds of you correctly
> guessing the root cause in a system this complex is close to zero.  My talk
> is here: https://www.youtube.com/watch?v=VX9tHk3VTLE
>
> I'm guessing you don't have dashboards in place if you're relying on
> nodetool output with grep.  If your cluster is under 6 nodes, you can take
> advantage of AxonOps's free tier: https://axonops.com/
>
> Good dashboards are essential for these types of problems.
>
> Jon
>
>
>
> On Sat, Mar 30, 2024 at 2:33 AM ranju goel  wrote:
>
> Hi All,
>
> On debugging the cluster for performance dip seen while using 4.1.4,  i
> found high speculation retries Value in nodetool tablestats during read
> operation.
>
> I ran the below tablestats command and checked its output after every few
> secs and noticed that retries are on rising side. Also there is one open
> ticket (https://issues.apache.org/jira/browse/CASSANDRA-18766) similar to
> this.
> /usr/share/cassandra/bin/nodetool -u  -pw  -p 
> tablestats  | grep -i 'Speculative retries'
>
>
>
> Speculative retries: 11633
>
> ..
>
> ..
>
> Speculative retries: 13727
>
>
>
> Speculative retries: 14256
>
> Speculative retries: 14855
>
> Speculative retries: 14858
>
> Speculative retries: 14859
>
> Speculative retries: 14873
>
> Speculative retries: 14875
>
> Speculative retries: 14890
>
> Speculative retries: 14893
>
> Speculative retries: 14896
>
> Speculative retries: 14901
>
> Speculative retries: 14905
>
> Speculative retries: 14946
>
> Speculative retries: 14948
>
> Speculative retries: 14957
>
>
> Suspecting this could be performance dip cause.  Please add in case anyone
> knows more about it.
>
>
> Regards
>
>
>
>
>
>
>
>
> On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user <
> user@cassandra.apache.org> wrote:
>
> we are seeing similar perf issues with counter writes - to reproduce:
>
> cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate
> threads=50 -mode native cql3 user= password= -name 
>
>
> op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
> latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
> Total GC count: 750 (4.1) and 744 (4.0)
> Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)
>
>
> On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel <
> goel.ra...@gmail.com> wrote:
>
>
> Hi All,
>
> Was going through this mail chain
> (https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html)
>  and was wondering that if this could cause a performance degradation in
> 4.1 without changing compactionThroughput.
>
> As seeing performance dip in Read/Write after upgrading from 4.0 to 4.1.

Re: Query on Performance Dip

2024-04-05 Thread Subroto Barua via user
 follow up question on the performance issue with 'counter writes': is there a
parameter or condition that limits the allocation rate for
'CounterMutationStage'? I see 13-18 MB/s for 4.1.4 vs 20-25 MB/s for 4.0.5.

The back-end infra is the same for both clusters, and the test cases/data model are the same.
On Saturday, March 30, 2024 at 08:40:28 AM PDT, Jon Haddad 
 wrote:  
 
 Hi,

Unfortunately, the numbers you're posting have no meaning without context.  The 
speculative retries could be the cause of a problem, or you could simply be 
executing enough queries and you have a fairly high variance in latency which 
triggers them often.  It's unclear how many queries / second you're executing 
and there's no historical information to suggest if what you're seeing now is 
an anomaly or business as usual.
If you want to determine if your theory that speculative retries are causing 
your performance issue, then you could try changing speculative retry to a 
fixed value instead of a percentile, such as 50MS.  It's easy enough to try and 
you can get an answer to your question almost immediately.
The problem with this is that you're essentially guessing based on very limited 
information - the output of a nodetool command you've run "every few secs".  I 
prefer to use a more data driven approach.  Get a CPU flame graph and figure 
out where your time is spent: 
https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler/
The flame graph will reveal where your time is spent, and you can focus on 
improving that, rather than looking at a random statistic that you've picked.
I just gave a talk at SCALE on distributed systems performance troubleshooting. 
 You'll be better off following a methodical process than guessing at potential 
root causes, because the odds of you correctly guessing the root cause in a 
system this complex is close to zero.  My talk is here: 
https://www.youtube.com/watch?v=VX9tHk3VTLE
I'm guessing you don't have dashboards in place if you're relying on nodetool 
output with grep.  If your cluster is under 6 nodes, you can take advantage of 
AxonOps's free tier: https://axonops.com/
Good dashboards are essential for these types of problems.    
Jon


On Sat, Mar 30, 2024 at 2:33 AM ranju goel  wrote:

Hi All,
On debugging the cluster for performance dip seen while using 4.1.4,  i found 
high speculation retries Value in nodetool tablestats during read operation.
I ran the below tablestats command and checked its output after every few secs 
and noticed that retries are on rising side. Also there is one open ticket 
(https://issues.apache.org/jira/browse/CASSANDRA-18766) similar to 
this.
/usr/share/cassandra/bin/nodetool -u  -pw  -p  
tablestats  | grep -i 'Speculative retries' 

                    

    Speculative retries: 11633

                ..

                ..

                Speculative retries: 13727

     

    Speculative retries: 14256

    Speculative retries: 14855

    Speculative retries: 14858

    Speculative retries: 14859

    Speculative retries: 14873

    Speculative retries: 14875

    Speculative retries: 14890

    Speculative retries: 14893

    Speculative retries: 14896

    Speculative retries: 14901

    Speculative retries: 14905

    Speculative retries: 14946

    Speculative retries: 14948

    Speculative retries: 14957




Suspecting this could be performance dip cause.  Please add in case anyone 
knows more about it.




Regards













On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user 
 wrote:

 we are seeing similar perf issues with counter writes - to reproduce:

cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate 
threads=50 -mode native cql3 user= password= -name  


op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
Total GC count: 750 (4.1) and 744 (4.0)
Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)

On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel 
 wrote:  
 
 Hi All,

Was going through this mail chain 
(https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html) and was 
wondering that if this could cause a performance degradation in 4.1 without 
changing compactionThroughput. 

As seeing performance dip in Read/Write after upgrading from 4.0 to 4.1.

Regards
Ranju

  

Re: Query on Performance Dip

2024-03-30 Thread Jon Haddad
Hi,

Unfortunately, the numbers you're posting have no meaning without context.
The speculative retries could be the cause of a problem, or you could
simply be executing enough queries and you have a fairly high variance in
latency which triggers them often.  It's unclear how many queries / second
you're executing and there's no historical information to suggest if what
you're seeing now is an anomaly or business as usual.

If you want to determine if your theory that speculative retries are
causing your performance issue, then you could try changing speculative
retry to a fixed value instead of a percentile, such as 50MS.  It's easy
enough to try and you can get an answer to your question almost immediately.
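For example, a sketch against a hypothetical table ks.tbl (the keyspace and table name are placeholders):

    # Switch speculative retry from a percentile to a fixed threshold
    cqlsh -e "ALTER TABLE ks.tbl WITH speculative_retry = '50ms';"

    # Revert to the percentile-based default afterwards
    cqlsh -e "ALTER TABLE ks.tbl WITH speculative_retry = '99PERCENTILE';"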

The problem with this is that you're essentially guessing based on very
limited information - the output of a nodetool command you've run "every
few secs".  I prefer to use a more data-driven approach.  Get a CPU flame
graph and figure out where your time is spent:
https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler/
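A hedged sketch of collecting one with async-profiler, assuming it is unpacked in the current directory and Cassandra is the only CassandraDaemon process on the host (duration and output path are illustrative; recent async-profiler releases ship the same functionality as the asprof binary):

    # Sample CPU for 60 seconds and write an HTML flame graph
    PID=$(pgrep -f CassandraDaemon)
    ./profiler.sh -e cpu -d 60 -f /tmp/cassandra-cpu-flame.html "$PID"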

The flame graph will reveal where your time is spent, and you can focus on
improving that, rather than looking at a random statistic that you've
picked.

I just gave a talk at SCALE on distributed systems performance
troubleshooting.  You'll be better off following a methodical process than
guessing at potential root causes, because the odds of you correctly
guessing the root cause in a system this complex are close to zero.  My talk
is here: https://www.youtube.com/watch?v=VX9tHk3VTLE

I'm guessing you don't have dashboards in place if you're relying on
nodetool output with grep.  If your cluster is under 6 nodes, you can take
advantage of AxonOps's free tier: https://axonops.com/

Good dashboards are essential for these types of problems.

Jon



On Sat, Mar 30, 2024 at 2:33 AM ranju goel  wrote:

> Hi All,
>
> On debugging the cluster for performance dip seen while using 4.1.4,  i
> found high speculation retries Value in nodetool tablestats during read
> operation.
>
> I ran the below tablestats command and checked its output after every few
> secs and noticed that retries are on rising side. Also there is one open
> ticket (https://issues.apache.org/jira/browse/CASSANDRA-18766) similar to
> this.
> /usr/share/cassandra/bin/nodetool -u  -pw  -p 
> tablestats  | grep -i 'Speculative retries'
>
>
>
> Speculative retries: 11633
>
> ..
>
> ..
>
> Speculative retries: 13727
>
>
>
> Speculative retries: 14256
>
> Speculative retries: 14855
>
> Speculative retries: 14858
>
> Speculative retries: 14859
>
> Speculative retries: 14873
>
> Speculative retries: 14875
>
> Speculative retries: 14890
>
> Speculative retries: 14893
>
> Speculative retries: 14896
>
> Speculative retries: 14901
>
> Speculative retries: 14905
>
> Speculative retries: 14946
>
> Speculative retries: 14948
>
> Speculative retries: 14957
>
>
> Suspecting this could be performance dip cause.  Please add in case anyone
> knows more about it.
>
>
> Regards
>
>
>
>
>
>
>
>
> On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user <
> user@cassandra.apache.org> wrote:
>
>> we are seeing similar perf issues with counter writes - to reproduce:
>>
>> cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate
>> threads=50 -mode native cql3 user= password= -name 
>>
>>
>> op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
>> latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
>> Total GC count: 750 (4.1) and 744 (4.0)
>> Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)
>>
>>
>> On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel <
>> goel.ra...@gmail.com> wrote:
>>
>>
>> Hi All,
>>
>> Was going through this mail chain
>> (https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html)
>>  and was wondering that if this could cause a performance degradation in
>> 4.1 without changing compactionThroughput.
>>
>> As seeing performance dip in Read/Write after upgrading from 4.0 to 4.1.
>>
>> Regards
>> Ranju
>>
>


Re: Query on Performance Dip

2024-03-30 Thread ranju goel
Hi All,

While debugging the cluster for the performance dip seen on 4.1.4, I
found a high 'Speculative retries' value in nodetool tablestats during read
operations.

I ran the below tablestats command and checked its output every few
seconds, and noticed that the retries keep rising. There is also an open
ticket (https://issues.apache.org/jira/browse/CASSANDRA-18766) describing a
similar issue.
/usr/share/cassandra/bin/nodetool -u  -pw  -p 
tablestats  | grep -i 'Speculative retries'



Speculative retries: 11633

..

..

Speculative retries: 13727



Speculative retries: 14256

Speculative retries: 14855

Speculative retries: 14858

Speculative retries: 14859

Speculative retries: 14873

Speculative retries: 14875

Speculative retries: 14890

Speculative retries: 14893

Speculative retries: 14896

Speculative retries: 14901

Speculative retries: 14905

Speculative retries: 14946

Speculative retries: 14948

Speculative retries: 14957


I suspect this could be the cause of the performance dip.  Please chime in if
anyone knows more about it.
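As an aside, the counter above is cumulative, so a per-interval delta is easier to reason about than the absolute value. A small sketch, with credentials, keyspace/table and interval as placeholders:

    # Print the increase in speculative retries per minute for one table (placeholders throughout)
    prev=0
    while true; do
      cur=$(nodetool -u <user> -pw <password> tablestats <keyspace>.<table> \
            | awk -F': ' '/Speculative retries/ {print $2}')
      echo "$(date '+%H:%M:%S') speculative_retries_delta=$((cur - prev))"   # first sample is only a baseline
      prev=$cur
      sleep 60
    done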


Regards








On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user <
user@cassandra.apache.org> wrote:

> we are seeing similar perf issues with counter writes - to reproduce:
>
> cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate
> threads=50 -mode native cql3 user= password= -name 
>
>
> op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
> latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
> Total GC count: 750 (4.1) and 744 (4.0)
> Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)
>
>
> On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel <
> goel.ra...@gmail.com> wrote:
>
>
> Hi All,
>
> Was going through this mail chain
> (https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html)
>  and was wondering that if this could cause a performance degradation in
> 4.1 without changing compactionThroughput.
>
> As seeing performance dip in Read/Write after upgrading from 4.0 to 4.1.
>
> Regards
> Ranju
>


Re: Query on Performance Dip

2024-03-27 Thread Subroto Barua via user
 we are seeing similar perf issues with counter writes - to reproduce:

cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate 
threads=50 -mode native cql3 user= password= -name  


op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
Total GC count: 750 (4.1) and 744 (4.0)
Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)

On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel 
 wrote:  
 
 Hi All,

Was going through this mail chain 
(https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html) and was 
wondering that if this could cause a performance degradation in 4.1 without 
changing compactionThroughput. 

As seeing performance dip in Read/Write after upgrading from 4.0 to 4.1.

Regards
Ranju

Re: Query on version 4.1.3

2024-01-11 Thread Luciano Greiner
We are about to do the same upgrade, although aiming for v4.1.2

Highly interested in this topic as well.

Luciano Greiner


On Thu, Jan 11, 2024 at 4:13 AM ranju goel  wrote:
>
> Hi Everyone,
>
> We are planning to upgrade from 4.0.11 to 4.1.3, the main motive of upgrading 
> is 4.0.11 going EOS in July 2024.
>
> On analyzing JIRAs, found an Open ticket, CASSANDRA-18766 (high speculative 
> retries on v4.1.3) which talks about Performance Degradation and no activity 
> seen since September.
>
> Wanted to know is anyone using version 4.1.3 and facing this issue.
>
>
>
> Best Regards
>
> Ranju


Re: Query on Token range

2023-06-10 Thread ranju goel
Thanks, it helped, but I am also looking for a way to get the total number of token
ranges assigned to that node, which I am currently doing manually (by
subtracting) using nodetool ring.

Best Regards
Ranju

On Fri, Jun 9, 2023 at 12:50 PM guo Maxwell  wrote:

> I think nodetool info with --token may do some help.
>
> On Fri, Jun 9, 2023 at 3:09 PM ranju goel  wrote:
>
>> Hi everyone,
>>
>> Is there any faster way to calculate the number of token ranges allocated
>> to a node
>> (x.y.z.w)?
>>
>> I used the manual way by subtracting the last token with the start token
>> shown in the nodetool ring, but it is time consuming.
>>
>>
>>
>> x.y.z.w RAC1   UpNormal 88 GiB  100.00%
>> -5972602825521846313
>> x.y.z.w1   RAC1   UpNormal 87 GiB  100.00%
>> -5956172717199559280
>>
>> Best Regards
>> Ranju Jain
>>
>
>
> --
> you are the apple of my eye !
>


Re: Query on Token range

2023-06-09 Thread guo Maxwell
I think nodetool info with --tokens may be of some help.
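A couple of hedged one-liners along those lines (x.y.z.w is a placeholder for the node address; the exact output formats can vary slightly between versions, so check them first):

    # Count the tokens owned by the node nodetool connects to
    nodetool info --tokens | grep -c '^Token'

    # Or count the token ranges a specific node owns, from nodetool ring
    nodetool ring | awk '$1 == "x.y.z.w"' | wc -l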

On Fri, Jun 9, 2023 at 3:09 PM ranju goel  wrote:

> Hi everyone,
>
> Is there any faster way to calculate the number of token ranges allocated
> to a node
> (x.y.z.w)?
>
> I used the manual way by subtracting the last token with the start token
> shown in the nodetool ring, but it is time consuming.
>
>
>
> x.y.z.w RAC1   UpNormal 88 GiB  100.00%
> -5972602825521846313
> x.y.z.w1   RAC1   UpNormal 87 GiB  100.00%
> -5956172717199559280
>
> Best Regards
> Ranju Jain
>


-- 
you are the apple of my eye !


Re: Query for Cassandra Driver

2022-12-22 Thread manish khandelwal
Hi Deepti

I think you can reach out to
https://groups.google.com/a/lists.datastax.com/g/cpp-driver-user.

Regards
Manish

On Fri, Dec 23, 2022 at 12:52 PM Deepti Sharma S via user <
user@cassandra.apache.org> wrote:

> Hello Team,
>
>
>
> Could you please help in answering below query.
>
>
>
>
>
> Regards,
>
> Deepti Sharma
> * PMP® & ITIL*
>
>
>
> *From:* Deepti Sharma S via user 
> *Sent:* 20 December 2022 18:39
> *To:* user@cassandra.apache.org
> *Cc:* Nandita Singh S 
> *Subject:* Query for Cassandra Driver
>
>
>
> Hello Team,
>
>
>
> We have an Application following C++98 standard, compiled with gcc version
> 7.5.0 on SUSE Linux.
>
> We are currently using DataStax C/C++ Driver(Version 2.6) and its working
> fine with application(C++98).
>
> Now We have a requirement to update DataStax C/C++ Driver to latest
> version 2.16.
>
> We want to know whether DataStax C/C++ Driver latest version 2.16 is also
> compatible with application(C++98)
>
>
>
>
>
> Regards,
>
> Deepti Sharma
> * PMP® & ITIL*
>
>
>


RE: Query for Cassandra Driver

2022-12-22 Thread Deepti Sharma S via user
Hello Team,

Could you please help in answering the query below.


Regards,
Deepti Sharma
PMP® & ITIL

From: Deepti Sharma S via user 
Sent: 20 December 2022 18:39
To: user@cassandra.apache.org
Cc: Nandita Singh S 
Subject: Query for Cassandra Driver

Hello Team,

We have an application following the C++98 standard, compiled with gcc version 
7.5.0 on SUSE Linux.
We are currently using the DataStax C/C++ Driver (version 2.6) and it is working fine 
with the application (C++98).
We now have a requirement to update the DataStax C/C++ Driver to the latest 
version, 2.16.
We want to know whether DataStax C/C++ Driver version 2.16 is also 
compatible with the application (C++98).


Regards,
Deepti Sharma
PMP® & ITIL



Re: Query regarding EOS for Cassandra version 3.11.13

2022-12-15 Thread manish khandelwal
3.11.x versions will be maintained till May July 2023. Please refer
https://cassandra.apache.org/_/download.html


On Thu, Dec 15, 2022, 20:55 Pranav Kumar (EXT) via user <
user@cassandra.apache.org> wrote:

> Hi Team,
>
>
>
> Could you please help us to know when version 3.11.13 is going to be EOS?
> Till when we are going to get fixes for the version 3.11.13.
>
>
>
> Regards,
>
> Pranav
>


Re: Query drivertimeout PT2S

2022-11-09 Thread Cédrick Lunven
Hi,

DataStax Cassandra 4.14 is actually the driver's version. Almost the latest
https://mvnrepository.com/artifact/com.datastax.oss/java-driver-core

It would be useful to know which version of Cassandra you are using, even
if I would be surprised if it were actually the cause of your error.

As it has been mentioned above the root cause is
=> A client-side timeout considering that the request is too slow, server
did not respond in time.

The reasons are legion:
- The cluster can be busy (hot partitions)
- You may be querying more and more data, which takes more and more time (large
partitions)
- You have not designed your data model based on your queries, and, as a
result, do cross-partition queries.
- You perform full scans of your cluster with (stupid) ALLOW FILTERING.
- You are using the IN clause (same as above)
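To make the cross-partition points concrete, a sketch of the two query shapes involved, using a hypothetical table ks.orders with partition key customer_id:

    # Cross-partition scan: touches many partitions across the cluster, a classic timeout source
    cqlsh -e "SELECT * FROM ks.orders WHERE status = 'OPEN' ALLOW FILTERING;"

    # Partition-restricted read: the partition key is supplied, so a single partition is read
    cqlsh -e "SELECT * FROM ks.orders WHERE customer_id = 42;"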

But, to be honest, my gut feeling is something that I keep seeing come
back:
=> I bet you might have moved to the latest SPRING DATA CASSANDRA version.
This thing keeps preparing everything like crazy, which leads to timeouts
popping up here and there. It is not easy to spot, because if you prepare
the same statement twice nothing happens; it has been kept in the cache. Is
there anything else at work?

Has anybody heard about the same issues lately?

On Wed, Nov 9, 2022 at 4:10 PM Durity, Sean R via user <
user@cassandra.apache.org> wrote:

> From the subject, this looks like a client-side timeout (thrown by the
> driver). I have seen situations where the client/driver timeout of 2
> seconds is a shorter timeout than on the server side (10 seconds). So, the
> server doesn’t really note any problem. Unless this is a very remote client
> and you suspect network-related latency, I would start by looking at the
> query that generates the timeout and the schema of the table. Make sure
> that you are querying WITHIN a partition and not ACROSS partitions. There
> are plenty of other potential problems, but you would need to give us more
> data to guide those discussions.
>
>
>
> Sean R. Durity
>
>
>
> *From:* Bowen Song via user 
> *Sent:* Tuesday, November 8, 2022 1:53 PM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Query drivertimeout PT2S
>
>
>
> This is a mailing list for the Apache Cassandra, and that's not the same
> as DataStax Enterprise Cassandra you are using. We may still be able to
> help here if you could provide more details, such as the queries, table
> schema, system stats (cpu, ram, disk io, network, and so on), logs, table
> stats, etc., but if it's a DSE Cassandra specific issue, you may have
> better luck contacting DataStax directly or posting it on the DataStax
> Community [community.datastax.com]
> <https://urldefense.com/v3/__https:/community.datastax.com/topics/82/cassandra.html__;!!M-nmYVHPHQ!JeQFduuBvu8AGLCMA3uqnA0pnlFvt5Iqg2uUP1aQQXjlHf7LRNqEotOSqIsxVc0j6sT7uh_U3G__LbiyoZ-2QrKTN0Q$>
> .
>
> On 08/11/2022 14:58, Shagun Bakliwal wrote:
>
> Hi All,
>
>
>
> My application is frequently getting timeout errors since 2 weeks now. I'm
> using datastax Cassandra 4.14
>
>
>
> Can someone help me here?
>
>
>
> Thanks,
>
> Shagun
>
>
>
>
>

-- 
Cedrick Lunven
e. cedrick.lun...@datastax.com
w. www.datastax.com


RE: Query drivertimeout PT2S

2022-11-09 Thread Durity, Sean R via user
From the subject, this looks like a client-side timeout (thrown by the
driver). I have seen situations where the client/driver timeout of 2 seconds
is a shorter timeout than on the server side (10 seconds). So, the server
doesn’t really note any problem. Unless this is a very remote client and you
suspect network-related latency, I would start by looking at the query that
generates the timeout and the schema of the table. Make sure that you are
querying WITHIN a partition and not ACROSS partitions. There are plenty of
other potential problems, but you would need to give us more data to guide
those discussions.
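If the 2-second client-side default is indeed what is being hit, the relevant knob on the Java driver 4.x side is basic.request.timeout (the "PT2S" in the error is exactly that 2-second default). A sketch, assuming an application.conf on the application's classpath; raising it only buys headroom, and fixing the query and data model as described above is the real cure:

    # Append a longer request timeout to the driver config (file location depends on your build)
    cat >> src/main/resources/application.conf <<'EOF'
    datastax-java-driver {
      basic.request.timeout = 5 seconds
    }
    EOF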

Sean R. Durity

From: Bowen Song via user 
Sent: Tuesday, November 8, 2022 1:53 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Query drivertimeout PT2S


This is a mailing list for the Apache Cassandra, and that's not the same as 
DataStax Enterprise Cassandra you are using. We may still be able to help here 
if you could provide more details, such as the queries, table schema, system 
stats (cpu, ram, disk io, network, and so on), logs, table stats, etc., but if 
it's a DSE Cassandra specific issue, you may have better luck contacting 
DataStax directly or posting it on the DataStax Community 
[community.datastax.com]<https://urldefense.com/v3/__https:/community.datastax.com/topics/82/cassandra.html__;!!M-nmYVHPHQ!JeQFduuBvu8AGLCMA3uqnA0pnlFvt5Iqg2uUP1aQQXjlHf7LRNqEotOSqIsxVc0j6sT7uh_U3G__LbiyoZ-2QrKTN0Q$>.
On 08/11/2022 14:58, Shagun Bakliwal wrote:
Hi All,

My application is frequently getting timeout errors since 2 weeks now. I'm 
using datastax Cassandra 4.14

Can someone help me here?

Thanks,
Shagun




Re: Query drivertimeout PT2S

2022-11-08 Thread Bowen Song via user
This is the mailing list for Apache Cassandra, which is not the same 
as the DataStax Enterprise Cassandra you are using. We may still be able to 
help here if you could provide more details, such as the queries, table 
schema, system stats (CPU, RAM, disk I/O, network, and so on), logs, 
table stats, etc., but if it's a DSE Cassandra specific issue, you may 
have better luck contacting DataStax directly or posting it on the 
DataStax Community.


On 08/11/2022 14:58, Shagun Bakliwal wrote:

Hi All,

My application is frequently getting timeout errors since 2 weeks now. 
I'm using datastax Cassandra 4.14


Can someone help me here?

Thanks,
Shagun

Re: Query around Data Modelling -2

2022-07-01 Thread Bowen Song via user
I don't recall myself ever seen any recommendation on periodically 
running major compactions. Can you share the source of your information?


During the major compaction, the server will be under heavy load, and it 
will need to rewrite ALL sstables. This actually hurts the read 
performance while the compaction is running.


The most important factor of read performance is the amount of data each 
node has to scan in order to complete the read query. Large partitions, 
too many tombstones, partition spread in too many sstables, etc. all 
hurts the performance. You will need to find the bottleneck and act on 
it in order to improve read performance.


Artificially spreading the data from one LCS table into many tables with 
identical schema is not likely to improve the read performance. The only 
benefit you get is more compaction parallelisation, and that may further 
hurt the read performance if the bottleneck is CPU usage, disk IO, or GC.


If you know the table is heavily read, and you have a performance issue 
with that, maybe it's time to redesign the table schema and optimise for 
the most frequently used read queries.


On 01/07/2022 11:29, MyWorld wrote:

 Michiel, This is not in our use case. Since our data is not time 
series, there is no TTL in our case.


Bowen, I think this is what is generally recommend to run a major 
compaction once in a week for better read performance.


On Fri, Jul 1, 2022, 6:52 AM Michiel Saelen 
 wrote:


Hi,

We did do compaction job every week in the past to keep the disk
space used under control as we had mainly data in the table that
needs to expire with TTL and were also using levelled compaction.

In our case we had different TTL’s in the same table and the
partitions were spread over multiple ssTables, as the partitions
were never closing and therefor kept on pushing changes we ended
up with repair actions that had to cover a lot of ssTables which
is heavy on memory and CPU.
By changing the compaction strategy to TWCS

<https://cassandra.apache.org/doc/latest/cassandra/operating/compaction/twcs.html>,
splitting the table into different tables with their own TTL and
adding a part to the partition key (e.g. the day of the year) to
close the partitions, so they can be “marked” as repaired, we were
able to get rid of these heavy compaction actions.

Not sure if you have the same use case, just wanted to share this
info.

Kind regards,

Michiel

*Michiel Saelen *|Principal Solution Architect

Email michiel.sae...@skyline.be <mailto:michiel.sae...@skyline.be>



Skyline Communications

39 Hong Kong Street #02-01 |Singapore 059678
www.skyline.be <https://www.skyline.be>|+65 6920 1145



*From:* Bowen Song 
*Sent:* Friday, July 1, 2022 08:48
*To:* user@cassandra.apache.org
*Subject:* Re: Query around Data Modelling -2





And why do you do that?

On 30/06/2022 16:35, MyWorld wrote:

We run major compaction once in a week

On Thu, Jun 30, 2022, 8:14 PM Bowen Song  wrote:

I have noticed this "running a weekly repair and
compaction job".

What do you mean weekly compaction job? Have you disabled
the auto-compaction on the table and is relying on weekly
scheduled compactions? Or running weekly major
compactions? Neither of these sounds right.

On 30/06/2022 15:03, MyWorld wrote:

Hi all,

Another query around data Modelling.

We have a existing table with below structure:

Table(PK,CK, col1,col2, col3, col4,col5)

Now each Pk here have 1k - 10k Clustering keys. Each
PK has size from 10MB to 80MB. We have overall 100+
millions partitions. Also we have set levelled
compactions in place so as to get better read response

Re: Query around Data Modelling -2

2022-07-01 Thread MyWorld
Michiel, this is not our use case. Since our data is not time series,
there is no TTL in our case.

Bowen, I think it is generally recommended to run a major
compaction once a week for better read performance.

On Fri, Jul 1, 2022, 6:52 AM Michiel Saelen 
wrote:

> Hi,
>
> We did do compaction job every week in the past to keep the disk space
> used under control as we had mainly data in the table that needs to expire
> with TTL and were also using levelled compaction.
>
> In our case we had different TTL’s in the same table and the partitions
> were spread over multiple ssTables, as the partitions were never closing
> and therefor kept on pushing changes we ended up with repair actions that
> had to cover a lot of ssTables which is heavy on memory and CPU.
> By changing the compaction strategy to TWCS
> <https://cassandra.apache.org/doc/latest/cassandra/operating/compaction/twcs.html>,
> splitting the table into different tables with their own TTL and adding a
> part to the partition key (e.g. the day of the year) to close the
> partitions, so they can be “marked” as repaired, we were able to get rid of
> these heavy compaction actions.
>
>
>
> Not sure if you have the same use case, just wanted to share this info.
>
>
>
> Kind regards,
>
> Michiel
>
>
>
>
>
>
>
>
> *Michiel Saelen *| Principal Solution Architect
>
> Email michiel.sae...@skyline.be
>
>
>
> Skyline Communications
>
> 39 Hong Kong Street #02-01 | Singapore 059678
> www.skyline.be | +65 6920 1145 <+6569201145>
>
>
>
>
>
>
>
>
>
>
> *From:* Bowen Song 
> *Sent:* Friday, July 1, 2022 08:48
> *To:* user@cassandra.apache.org
> *Subject:* Re: Query around Data Modelling -2
>
>
>
>
>
>
> And why do you do that?
>
> On 30/06/2022 16:35, MyWorld wrote:
>
> We run major compaction once in a week
>
>
>
> On Thu, Jun 30, 2022, 8:14 PM Bowen Song  wrote:
>
> I have noticed this "running a weekly repair and compaction job".
>
> What do you mean weekly compaction job? Have you disabled the
> auto-compaction on the table and is relying on weekly scheduled
> compactions? Or running weekly major compactions? Neither of these sounds
> right.
>
> On 30/06/2022 15:03, MyWorld wrote:
>
> Hi all,
>
>
>
> Another query around data Modelling.
>
>
>
> We have a existing table with below structure:
>
> Table(PK,CK, col1,col2, col3, col4,col5)
>
>
>
> Now each Pk here have 1k - 10k Clustering keys. Each PK has size from 10MB
> to 80MB. We have overall 100+ millions partitions. Also we have set
> levelled compactions in place so as to get better read response time.
>
>
>
> We are currently on 3.11.x version of Cassandra. On running a weekly
> repair and compaction job, this model because of levelled compaction
> (occupied till Level 3) consume heavy cpu resource and impact db
> performance.
>
>
>
> Now what if we divide this table in 10 with each table containing 1/10
> partitions. So now each table will be limited to levelled compaction upto
> level-2. I think this would ease down read as well as compaction task.
>
>
>
> What is your opinion on this?
>
> Even if we upgrade to ver 4.0, is the second model ok?
>
>
>
>


RE: Query around Data Modelling -2

2022-06-30 Thread Michiel Saelen
Hi,

We did run a compaction job every week in the past to keep the disk space used 
under control, as we mainly had data in the table that needed to expire with TTL, 
and we were also using levelled compaction.
In our case we had different TTLs in the same table and the partitions were 
spread over multiple SSTables; because the partitions were never closing and 
therefore kept receiving changes, we ended up with repair actions that had to 
cover a lot of SSTables, which is heavy on memory and CPU.
By changing the compaction strategy to 
TWCS<https://cassandra.apache.org/doc/latest/cassandra/operating/compaction/twcs.html>,
 splitting the table into different tables with their own TTL, and adding a part 
to the partition key (e.g. the day of the year) to close the partitions so 
they can be “marked” as repaired, we were able to get rid of these heavy 
compaction actions.
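For reference, a minimal sketch of that kind of change, with a hypothetical table ks.events_7d, 1-day windows and a 7-day TTL (the actual table, window size and TTL used here are not given in the thread):

    # Move a TTL'd table to TimeWindowCompactionStrategy with 1-day windows
    cqlsh -e "ALTER TABLE ks.events_7d
              WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                                 'compaction_window_unit': 'DAYS',
                                 'compaction_window_size': 1}
              AND default_time_to_live = 604800;"    # 604800 s = 7 days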

Not sure if you have the same use case, just wanted to share this info.

Kind regards,
Michiel



Michiel Saelen | Principal Solution Architect
Email michiel.sae...@skyline.be<mailto:michiel.sae...@skyline.be>

Skyline Communications
39 Hong Kong Street #02-01 | Singapore 059678
www.skyline.be<https://www.skyline.be> | +65 6920 1145


From: Bowen Song 
Sent: Friday, July 1, 2022 08:48
To: user@cassandra.apache.org
Subject: Re: Query around Data Modelling -2



And why do you do that?
On 30/06/2022 16:35, MyWorld wrote:
We run major compaction once in a week

On Thu, Jun 30, 2022, 8:14 PM Bowen Song mailto:bo...@bso.ng>> 
wrote:

I have noticed this "running a weekly repair and compaction job".

What do you mean weekly compaction job? Have you disabled the auto-compaction 
on the table and is relying on weekly scheduled compactions? Or running weekly 
major compactions? Neither of these sounds right.
On 30/06/2022 15:03, MyWorld wrote:
Hi all,

Another query around data Modelling.

We have a existing table with below structure:
Table(PK,CK, col1,col2, col3, col4,col5)

Now each Pk here have 1k - 10k Clustering keys. Each PK has size from 10MB to 
80MB. We have overall 100+ millions partitions. Also we have set levelled 
compactions in place so as to get better read response time.

We are currently on 3.11.x version of Cassandra. On running a weekly repair and 
compaction job, this model because of levelled compaction (occupied till Level 
3) consume heavy cpu resource and impact db performance.

Now what if we divide this table in 10 with each table containing 1/10 
partitions. So now each table will be limited to levelled compaction upto 
level-2. I think this would ease down read as well as compaction task.

What is your opinion on this?
Even if we upgrade to ver 4.0, is the second model ok?



Re: Query around Data Modelling -2

2022-06-30 Thread Bowen Song

And why do you do that?

On 30/06/2022 16:35, MyWorld wrote:

We run major compaction once in a week

On Thu, Jun 30, 2022, 8:14 PM Bowen Song  wrote:

I have noticed this "running a weekly repair and compaction job".

What do you mean weekly compaction job? Have you disabled the
auto-compaction on the table and is relying on weekly scheduled
compactions? Or running weekly major compactions? Neither of these
sounds right.

On 30/06/2022 15:03, MyWorld wrote:

Hi all,

Another query around data Modelling.

We have a existing table with below structure:
Table(PK,CK, col1,col2, col3, col4,col5)

Now each Pk here have 1k - 10k Clustering keys. Each PK has size
from 10MB to 80MB. We have overall 100+ millions partitions. Also
we have set levelled compactions in place so as to get better
read response time.

We are currently on 3.11.x version of Cassandra. On running a
weekly repair and compaction job, this model because of levelled
compaction (occupied till Level 3) consume heavy cpu resource and
impact db performance.

Now what if we divide this table in 10 with each table containing
1/10 partitions. So now each table will be limited to levelled
compaction upto level-2. I think this would ease down read as
well as compaction task.

What is your opinion on this?
Even if we upgrade to ver 4.0, is the second model ok?


Re: Query around Data Modelling -2

2022-06-30 Thread MyWorld
We run a major compaction once a week

On Thu, Jun 30, 2022, 8:14 PM Bowen Song  wrote:

> I have noticed this "running a weekly repair and compaction job".
>
> What do you mean weekly compaction job? Have you disabled the
> auto-compaction on the table and is relying on weekly scheduled
> compactions? Or running weekly major compactions? Neither of these sounds
> right.
> On 30/06/2022 15:03, MyWorld wrote:
>
> Hi all,
>
> Another query around data Modelling.
>
> We have a existing table with below structure:
> Table(PK,CK, col1,col2, col3, col4,col5)
>
> Now each Pk here have 1k - 10k Clustering keys. Each PK has size from 10MB
> to 80MB. We have overall 100+ millions partitions. Also we have set
> levelled compactions in place so as to get better read response time.
>
> We are currently on 3.11.x version of Cassandra. On running a weekly
> repair and compaction job, this model because of levelled compaction
> (occupied till Level 3) consume heavy cpu resource and impact db
> performance.
>
> Now what if we divide this table in 10 with each table containing 1/10
> partitions. So now each table will be limited to levelled compaction upto
> level-2. I think this would ease down read as well as compaction task.
>
> What is your opinion on this?
> Even if we upgrade to ver 4.0, is the second model ok?
>
>


Re: Query around Data Modelling -2

2022-06-30 Thread Bowen Song

I have noticed this "running a weekly repair and compaction job".

What do you mean by a weekly compaction job? Have you disabled 
auto-compaction on the table and are relying on weekly scheduled 
compactions? Or are you running weekly major compactions? Neither of these 
sounds right.


On 30/06/2022 15:03, MyWorld wrote:

Hi all,

Another query around data Modelling.

We have a existing table with below structure:
Table(PK,CK, col1,col2, col3, col4,col5)

Now each Pk here have 1k - 10k Clustering keys. Each PK has size from 
10MB to 80MB. We have overall 100+ millions partitions. Also we have 
set levelled compactions in place so as to get better read response time.


We are currently on 3.11.x version of Cassandra. On running a weekly 
repair and compaction job, this model because of levelled compaction 
(occupied till Level 3) consume heavy cpu resource and impact db 
performance.


Now what if we divide this table in 10 with each table containing 1/10 
partitions. So now each table will be limited to levelled compaction 
upto level-2. I think this would ease down read as well as compaction 
task.


What is your opinion on this?
Even if we upgrade to ver 4.0, is the second model ok?


Re: Query around Data Modelling -2

2022-06-30 Thread MyWorld
Hi Jeff,
We are running repair with -pr option.

You are right, it would have no or very minimal impact on reads (considering
the fact that data now has to be read from 2 levels instead of 3). But my guess is
there is no negative impact from this model 2.


On Thu, Jun 30, 2022, 7:41 PM Jeff Jirsa  wrote:

> How are you running repair? -pr? Or -st/-et?
>
> 4.0 gives you real incremental repair which helps. Splitting the table
> won’t make reads faster. It will increase the potential parallelization of
> compaction.
>
> On Jun 30, 2022, at 7:04 AM, MyWorld  wrote:
>
> 
> Hi all,
>
> Another query around data Modelling.
>
> We have a existing table with below structure:
> Table(PK,CK, col1,col2, col3, col4,col5)
>
> Now each Pk here have 1k - 10k Clustering keys. Each PK has size from 10MB
> to 80MB. We have overall 100+ millions partitions. Also we have set
> levelled compactions in place so as to get better read response time.
>
> We are currently on 3.11.x version of Cassandra. On running a weekly
> repair and compaction job, this model because of levelled compaction
> (occupied till Level 3) consume heavy cpu resource and impact db
> performance.
>
> Now what if we divide this table in 10 with each table containing 1/10
> partitions. So now each table will be limited to levelled compaction upto
> level-2. I think this would ease down read as well as compaction task.
>
> What is your opinion on this?
> Even if we upgrade to ver 4.0, is the second model ok?
>
>


Re: Query around Data Modelling -2

2022-06-30 Thread Jeff Jirsa
How are you running repair? -pr? Or -st/-et?

4.0 gives you real incremental repair which helps. Splitting the table won’t 
make reads faster. It will increase the potential parallelization of 
compaction. 

> On Jun 30, 2022, at 7:04 AM, MyWorld  wrote:
> 
> 
> Hi all,
> 
> Another query around data Modelling.
> 
> We have a existing table with below structure:
> Table(PK,CK, col1,col2, col3, col4,col5)
> 
> Now each Pk here have 1k - 10k Clustering keys. Each PK has size from 10MB to 
> 80MB. We have overall 100+ millions partitions. Also we have set levelled 
> compactions in place so as to get better read response time.
> 
> We are currently on 3.11.x version of Cassandra. On running a weekly repair 
> and compaction job, this model because of levelled compaction (occupied till 
> Level 3) consume heavy cpu resource and impact db performance.
> 
> Now what if we divide this table in 10 with each table containing 1/10 
> partitions. So now each table will be limited to levelled compaction upto 
> level-2. I think this would ease down read as well as compaction task.
> 
> What is your opinion on this?
> Even if we upgrade to ver 4.0, is the second model ok?
> 


Re: Query around Data Modelling

2022-06-22 Thread MyWorld
Thanks a lot Jeff, Michiel and Manish for your replies. Really helpful.

On Thu, Jun 23, 2022, 9:50 AM Jeff Jirsa  wrote:

> This is assuming each row is like … I dunno 10-1000 bytes. If you’re
> storing like a huge 1mb blob use two tables for sure.
>
> On Jun 22, 2022, at 9:06 PM, Jeff Jirsa  wrote:
>
> 
>
> Ok so here’s how I would think about this
>
> The writes don’t matter. (There’s a tiny tiny bit of nuance in one table
> where you can contend adding to the memtable but the best cassandra
> engineers on earth probably won’t notice that unless you have really super
> hot partitions, so ignore the write path).
>
> The reads are where it changes
>
> In both models/cases, you’ll use the partition index to seek to where the
> partition starts.
>
> In model 2 table 1 if you use ck+col1+… the read will load the column
> index and use that to jump to within 64kb of the col1 value you specify
>
> In model 2 table 2, if you use ck+col3+…, same thing - column index can
> jump to within 64k
>
> What you give up in model one is the granularity of that jump. If you use
> model 1 and col3 instead of col1, the read will have to scan the partition.
> In your case, with 80 rows, that may still be within that 64kb block - you
> may not get more granular than that anyway. And even if it’s slightly
> larger, you’re probably going to be compressing 64k chunks - maybe you have
> to decompress one extra chunk on read if your 1000 rows goes past 64k, but
> you likely won’t actually notice. You’re technically asking the server to
> read and skip data it doesn’t need to return - it’s not really the most
> efficient, but at that partition size it’s noise. You could also just
> return all 80-100 rows, let the server do slightly less work and filter
> client side - also valid, probably slightly worse than the server side
> filter.
>
> Having one table instead of two, though, probably saves you a ton of disk
> space ($€£), and the lower disk space may also mean that data stays in page
> cache, so the extra read may not even go to disk anyway.
>
> So with your actual data shape, I imagine you won’t really notice the
> nominal inefficiency of the first model, and I’d be inclined to do that
> until you demonstrate it won’t work (I bet it works fine for a long long
> time).
>
> On Jun 22, 2022, at 7:11 PM, MyWorld  wrote:
>
> 
> Hi Jeff,
> Let me know how no of rows have an impact here.
> May be today I have 80-100 rows per partition. But what if I started
> storing 2-4k rows per partition. However total partition size is still
> under 100 MB
>
> On Thu, Jun 23, 2022, 7:18 AM Jeff Jirsa  wrote:
>
>> How many rows per partition in each model?
>>
>>
>> > On Jun 22, 2022, at 6:38 PM, MyWorld  wrote:
>> >
>> > 
>> > Hi all,
>> >
>> > Just a small query around data Modelling.
>> > Suppose we have to design the data model for 2 different use cases
>> which will query the data on same set of (partion+clustering key). So
>> should we maintain a seperate table for each or a single table.
>> >
>> > Model1 - Combined table
>> > Table(Pk,CK, col1,col2, col3, col4,col5)
>> >
>> > Model2 - Seperate tables
>> > Table1(Pk,CK,col1,col2,col3)
>> > Table2(Pk,CK,col3,col4,col45)
>> >
>> > So here partion and clustering keys are same. Also note column col3 is
>> required in both use cases.
>> >
>> > As per my thought in Model2, partition size would be less. There would
>> be less sstables and when I use level compaction, it could be easily
>> maintained. So should be better read performance.
>> >
>> > Please help me to highlight the drawback and advantage of each data
>> model. Here we have a mix kind of workload (read/write)
>>
>


Re: Query around Data Modelling

2022-06-22 Thread Jeff Jirsa
This is assuming each row is like … I dunno 10-1000 bytes. If you’re storing 
like a huge 1mb blob use two tables for sure.  

> On Jun 22, 2022, at 9:06 PM, Jeff Jirsa  wrote:
> 
> 
> 
> Ok so here’s how I would think about this
> 
> The writes don’t matter. (There’s a tiny tiny bit of nuance in one table 
> where you can contend adding to the memtable but the best cassandra engineers 
> on earth probably won’t notice that unless you have really super hot 
> partitions, so ignore the write path).
> 
> The reads are where it changes
> 
> In both models/cases, you’ll use the partition index to seek to where the 
> partition starts. 
> 
> In model 2 table 1 if you use ck+col1+… the read will load the column index 
> and use that to jump to within 64kb of the col1 value you specify 
> 
> In model 2 table 2, if you use ck+col3+…, same thing - column index can jump 
> to within 64k
> 
> What you give up in model one is the granularity of that jump. If you use 
> model 1 and col3 instead of col1, the read will have to scan the partition. 
> In your case, with 80 rows, that may still be within that 64kb block - you 
> may not get more granular than that anyway. And even if it’s slightly larger, 
> you’re probably going to be compressing 64k chunks - maybe you have to 
> decompress one extra chunk on read if your 1000 rows goes past 64k, but you 
> likely won’t actually notice. You’re technically asking the server to read 
> and skip data it doesn’t need to return - it’s not really the most efficient, 
> but at that partition size it’s noise. You could also just return all 80-100 
> rows, let the server do slightly less work and filter client side - also 
> valid, probably slightly worse than the server side filter. 
> 
> Having one table instead of two, though, probably saves you a ton of disk 
> space ($€£), and the lower disk space may also mean that data stays in page 
> cache, so the extra read may not even go to disk anyway.
> 
> So with your actual data shape, I imagine you won’t really notice the nominal 
> inefficiency of the first model, and I’d be inclined to do that until you 
> demonstrate it won’t work (I bet it works fine for a long long time). 
> 
>>> On Jun 22, 2022, at 7:11 PM, MyWorld  wrote:
>>> 
>> 
>> Hi Jeff,
>> Let me know how no of rows have an impact here.
>> May be today I have 80-100 rows per partition. But what if I started storing 
>> 2-4k rows per partition. However total partition size is still under 100 MB 
>> 
>>> On Thu, Jun 23, 2022, 7:18 AM Jeff Jirsa  wrote:
>>> How many rows per partition in each model?
>>> 
>>> 
>>> > On Jun 22, 2022, at 6:38 PM, MyWorld  wrote:
>>> > 
>>> > 
>>> > Hi all,
>>> > 
>>> > Just a small query around data Modelling.
>>> > Suppose we have to design the data model for 2 different use cases which 
>>> > will query the data on same set of (partion+clustering key). So should we 
>>> > maintain a seperate table for each or a single table. 
>>> > 
>>> > Model1 - Combined table
>>> > Table(Pk,CK, col1,col2, col3, col4,col5)
>>> > 
>>> > Model2 - Seperate tables
>>> > Table1(Pk,CK,col1,col2,col3)
>>> > Table2(Pk,CK,col3,col4,col45)
>>> > 
>>> > So here partion and clustering keys are same. Also note column col3 is 
>>> > required in both use cases.
>>> > 
>>> > As per my thought in Model2, partition size would be less. There would be 
>>> > less sstables and when I use level compaction, it could be easily 
>>> > maintained. So should be better read performance.
>>> > 
>>> > Please help me to highlight the drawback and advantage of each data 
>>> > model. Here we have a mix kind of workload (read/write)


Re: Query around Data Modelling

2022-06-22 Thread Jeff Jirsa


Ok so here’s how I would think about this

The writes don’t matter. (There’s a tiny tiny bit of nuance in one table where 
you can contend adding to the memtable but the best cassandra engineers on 
earth probably won’t notice that unless you have really super hot partitions, 
so ignore the write path).

The reads are where it changes

In both models/cases, you’ll use the partition index to seek to where the 
partition starts. 

In model 2 table 1 if you use ck+col1+… the read will load the column index and 
use that to jump to within 64kb of the col1 value you specify 

In model 2 table 2, if you use ck+col3+…, same thing - column index can jump to 
within 64k

What you give up in model one is the granularity of that jump. If you use model 
1 and col3 instead of col1, the read will have to scan the partition. In your 
case, with 80 rows, that may still be within that 64kb block - you may not get 
more granular than that anyway. And even if it’s slightly larger, you’re 
probably going to be compressing 64k chunks - maybe you have to decompress one 
extra chunk on read if your 1000 rows goes past 64k, but you likely won’t 
actually notice. You’re technically asking the server to read and skip data it 
doesn’t need to return - it’s not really the most efficient, but at that 
partition size it’s noise. You could also just return all 80-100 rows, let the 
server do slightly less work and filter client side - also valid, probably 
slightly worse than the server side filter. 

Having one table instead of two, though, probably saves you a ton of disk space 
($€£), and the lower disk space may also mean that data stays in page cache, so 
the extra read may not even go to disk anyway.

So with your actual data shape, I imagine you won’t really notice the nominal 
inefficiency of the first model, and I’d be inclined to do that until you 
demonstrate it won’t work (I bet it works fine for a long long time). 
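For concreteness, a sketch of the two models from the original question as CQL, with hypothetical text/int types (the thread only gives key and column names, so the types and keyspace are assumptions):

    # Model 1: one combined table
    cqlsh -e "CREATE TABLE ks.combined (pk text, ck text, col1 int, col2 int, col3 int, col4 int, col5 int, PRIMARY KEY (pk, ck));"

    # Model 2: two tables sharing the same key, with col3 duplicated in both
    cqlsh -e "CREATE TABLE ks.t1 (pk text, ck text, col1 int, col2 int, col3 int, PRIMARY KEY (pk, ck));"
    cqlsh -e "CREATE TABLE ks.t2 (pk text, ck text, col3 int, col4 int, col5 int, PRIMARY KEY (pk, ck));"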

> On Jun 22, 2022, at 7:11 PM, MyWorld  wrote:
> 
> 
> Hi Jeff,
> Let me know how no of rows have an impact here.
> May be today I have 80-100 rows per partition. But what if I started storing 
> 2-4k rows per partition. However total partition size is still under 100 MB 
> 
>> On Thu, Jun 23, 2022, 7:18 AM Jeff Jirsa  wrote:
>> How many rows per partition in each model?
>> 
>> 
>> > On Jun 22, 2022, at 6:38 PM, MyWorld  wrote:
>> > 
>> > 
>> > Hi all,
>> > 
>> > Just a small query around data Modelling.
>> > Suppose we have to design the data model for 2 different use cases which 
>> > will query the data on same set of (partion+clustering key). So should we 
>> > maintain a seperate table for each or a single table. 
>> > 
>> > Model1 - Combined table
>> > Table(Pk,CK, col1,col2, col3, col4,col5)
>> > 
>> > Model2 - Seperate tables
>> > Table1(Pk,CK,col1,col2,col3)
>> > Table2(Pk,CK,col3,col4,col45)
>> > 
>> > So here partion and clustering keys are same. Also note column col3 is 
>> > required in both use cases.
>> > 
>> > As per my thought in Model2, partition size would be less. There would be 
>> > less sstables and when I use level compaction, it could be easily 
>> > maintained. So should be better read performance.
>> > 
>> > Please help me to highlight the drawback and advantage of each data model. 
>> > Here we have a mix kind of workload (read/write)


Re: Query around Data Modelling

2022-06-22 Thread MyWorld
Hi Jeff,
Let me know how the number of rows has an impact here.
Maybe today I have 80-100 rows per partition, but what if I started
storing 2-4k rows per partition? The total partition size would still be
under 100 MB.

On Thu, Jun 23, 2022, 7:18 AM Jeff Jirsa  wrote:

> How many rows per partition in each model?
>
>
> > On Jun 22, 2022, at 6:38 PM, MyWorld  wrote:
> >
> > 
> > Hi all,
> >
> > Just a small query around data Modelling.
> > Suppose we have to design the data model for 2 different use cases which
> will query the data on the same set of (partition+clustering key). So should we
> maintain a separate table for each or a single table.
> >
> > Model1 - Combined table
> > Table(Pk,CK, col1,col2, col3, col4,col5)
> >
> > Model2 - Separate tables
> > Table1(Pk,CK,col1,col2,col3)
> > Table2(Pk,CK,col3,col4,col5)
> >
> > So here partition and clustering keys are the same. Also note column col3 is
> required in both use cases.
> >
> > As per my thought in Model2, partition size would be less. There would
> be less sstables and when I use level compaction, it could be easily
> maintained. So should be better read performance.
> >
> > Please help me to highlight the drawback and advantage of each data
> model. Here we have a mix kind of workload (read/write)
>


RE: Query around Data Modelling

2022-06-22 Thread Michiel Saelen
I guess it will depend on your use case.
If the columns for table1 and table2 are significant in size, it might be the 
case that model 2 is faster, and you could run the two queries in parallel, but …
If you always need to retrieve both the row from table1 and the row from table2, 
then running both queries together adds some overhead in memory, CPU, …
The answer will really depend on how much data you push into each table and how 
frequently (will there be a difference in how partitions are spread over 
SSTables for the two tables?), and on how you want to retrieve it.

The only way to know for sure would be to run benchmark tests with 
representative data for your use case.
NoSQLBench might be interesting to look into. If 
you are not familiar with it, it might take you a bit of time to figure out 
how to get representative tests/results.

Kind regards,

Michiel Saelen | Principal Solution Architect
Email michiel.sae...@skyline.be

Skyline Communications
39 Hong Kong Street #02-01 | Singapore 059678
www.skyline.be | +65 6920 1145


From: MyWorld 
Sent: Thursday, June 23, 2022 09:38
To: user@cassandra.apache.org
Subject: Query around Data Modelling


Hi all,

Just a small query around data Modelling.
Suppose we have to design the data model for 2 different use cases which will 
query the data on the same set of (partition+clustering key). So should we maintain a 
separate table for each or a single table.

Model1 - Combined table
Table(Pk,CK, col1,col2, col3, col4,col5)

Model2 - Separate tables
Table1(Pk,CK,col1,col2,col3)
Table2(Pk,CK,col3,col4,col5)

So here partition and clustering keys are the same. Also note column col3 is required 
in both use cases.

As per my thought in Model2, partition size would be less. There would be less 
sstables and when I use level compaction, it could be easily maintained. So 
should be better read performance.

Please help me to highlight the drawback and advantage of each data model. Here 
we have a mix kind of workload (read/write)


Re: Query around Data Modelling

2022-06-22 Thread manish khandelwal
Table1 should be fine: if some column values are not entered, Cassandra
will not create entries for them, so the partition will be almost the same in
both cases.

On Thu, Jun 23, 2022, 07:08 MyWorld  wrote:

> Hi all,
>
> Just a small query around data Modelling.
> Suppose we have to design the data model for 2 different use cases which
> will query the data on the same set of (partition+clustering key). So should we
> maintain a separate table for each or a single table.
>
> Model1 - Combined table
> Table(Pk,CK, col1,col2, col3, col4,col5)
>
> Model2 - Separate tables
> Table1(Pk,CK,col1,col2,col3)
> Table2(Pk,CK,col3,col4,col5)
>
> So here partition and clustering keys are the same. Also note column col3 is
> required in both use cases.
>
> As per my thought in Model2, partition size would be less. There would be
> less sstables and when I use level compaction, it could be easily
> maintained. So should be better read performance.
>
> Please help me to highlight the drawback and advantage of each data model.
> Here we have a mix kind of workload (read/write)
>


Re: Query around Data Modelling

2022-06-22 Thread Jeff Jirsa
How many rows per partition in each model?


> On Jun 22, 2022, at 6:38 PM, MyWorld  wrote:
> 
> 
> Hi all,
> 
> Just a small query around data Modelling.
> Suppose we have to design the data model for 2 different use cases which will 
> query the data on the same set of (partition+clustering key). So should we maintain 
> a separate table for each or a single table. 
> 
> Model1 - Combined table
> Table(Pk,CK, col1,col2, col3, col4,col5)
> 
> Model2 - Separate tables
> Table1(Pk,CK,col1,col2,col3)
> Table2(Pk,CK,col3,col4,col5)
> 
> So here partition and clustering keys are the same. Also note column col3 is 
> required in both use cases.
> 
> As per my thought in Model2, partition size would be less. There would be 
> less sstables and when I use level compaction, it could be easily maintained. 
> So should be better read performance.
> 
> Please help me to highlight the drawback and advantage of each data model. 
> Here we have a mix kind of workload (read/write)


Re: Query timed out after PT2M

2022-02-08 Thread Joe Obernberger
Update - the answer was spark.cassandra.input.split.sizeInMB. The 
default value is 512MBytes.  Setting this to 50 resulted in a lot more 
splits and the job ran in under 11 minutes; no timeout errors.  In this 
case the job was a simple count.  10 minutes 48 seconds for over 8.2 
billion rows.  Fast!
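For anyone searching later, here is roughly where that setting goes, based on 
the SparkSession snippet quoted further down in this thread (host, port, master 
URL, and keyspace/table are the values from that snippet; the class name is made 
up, and this is an illustrative sketch rather than a drop-in job):

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CountDocTable {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
            .builder()
            .appName("SparkCassandraApp")
            .config("spark.cassandra.connection.host", "chaos")
            .config("spark.cassandra.connection.port", "9042")
            // Smaller splits -> more Spark tasks, each reading a smaller slice
            // of the token range, so no single task runs long enough to hit
            // the driver-side read timeout.
            .config("spark.cassandra.input.split.sizeInMB", "50")
            .master("spark://aether.querymasters.com:8181")
            .getOrCreate();

        Dataset<Row> dataset = spark.read()
            .format("org.apache.spark.sql.cassandra")
            .option("keyspace", "doc")
            .option("table", "doc")
            .load();

        System.out.println(dataset.count());
    }
}
```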


Good times ahead.

-Joe

On 2/8/2022 10:00 AM, Joe Obernberger wrote:


Update - I believe that for large tables, the 
spark.cassandra.read.timeoutMS needs to be very long; like 4 hours or 
longer.  The job now runs much longer, but still doesn't complete.  
I'm now facing this all too familiar error:
com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: 
Cassandra timeout during read query at consistency LOCAL_ONE (1 
responses were required but only 0 replica responded)


In the past this has been due to clocks being out of sync (not the 
issue here), or a table that has been written to with LOCAL_ONE 
instead of LOCAL_QUORUM.  I don't believe either of those are the 
case.  To be sure, I ran a repair on the table overnight (about 17 
hours to complete).  For the next test, I set the 
spark.cassandra.connection.timeoutMS to 60000 (default is 5000), and 
the spark.cassandra.query.retry.count to -1.


Suggestions?  Thoughts?

Thanks all.

-Joe

On 2/7/2022 10:35 AM, Joe Obernberger wrote:


Some more info.  Tried different GC strategies - no luck.
It only happens on large tables (more than 1 billion rows). Works 
fine on a 300million row table.  There is very high CPU usage during 
the run.


I've tried setting spark.dse.continuousPagingEnabled to false and 
I've tried setting spark.cassandra.input.readsPerSec to 10; no effect.


Stats:

nodetool cfstats doc.doc
Total number of tables: 82

Keyspace : doc
    Read Count: 9620329
    Read Latency: 0.5629605546754171 ms
    Write Count: 510561482
    Write Latency: 0.02805177028806885 ms
    Pending Flushes: 0
    Table: doc
    SSTable count: 77
    Old SSTable count: 0
    Space used (live): 82061188941
    Space used (total): 82061188941
    Space used by snapshots (total): 0
    Off heap memory used (total): 317037065
    SSTable Compression Ratio: 0.3816525125492022
    Number of partitions (estimate): 101021793
    Memtable cell count: 209646
    Memtable data size: 44087966
    Memtable off heap memory used: 0
    Memtable switch count: 10
    Local read count: 25665
    Local read latency: NaN ms
    Local write count: 2459322
    Local write latency: NaN ms
    Pending flushes: 0
    Percent repaired: 0.0
    Bytes repaired: 0.000KiB
    Bytes unrepaired: 184.869GiB
    Bytes pending repair: 0.000KiB
    Bloom filter false positives: 2063
    Bloom filter false ratio: 0.01020
    Bloom filter space used: 169249016
    Bloom filter off heap memory used: 169248400
    Index summary off heap memory used: 50863401
    Compression metadata off heap memory used: 96925264
    Compacted partition minimum bytes: 104
    Compacted partition maximum bytes: 943127
    Compacted partition mean bytes: 1721
    Average live cells per slice (last five minutes): NaN
    Maximum live cells per slice (last five minutes): 0
    Average tombstones per slice (last five minutes): NaN
    Maximum tombstones per slice (last five minutes): 0
    Dropped Mutations: 0


nodetool tablehistograms doc.doc
doc/doc histograms
Percentile   Read Latency   Write Latency   SSTables   Partition Size   Cell Count
                 (micros)        (micros)                      (bytes)
50%                  0.00            0.00       0.00             1109           86
75%                  0.00            0.00       0.00             3311          215
95%                  0.00            0.00       0.00             3311          215
98%                  0.00            0.00       0.00             3311          215
99%                  0.00            0.00       0.00             3311          215
Min                  0.00            0.00       0.00              104            5
Max                  0.00            0.00       0.00           943127         2299


I'm stuck.

-Joe


On 2/3/2022 9:30 PM, manish khandelwal wrote:
It maybe the case you have lots of tombstones in this table which is 
making reads slow and timeouts during bulk reads.


On Fri, Feb 4, 2022, 03:23 Joe Obernberger 
 wrote:


So it turns out that number after PT is increments of 60
seconds.  I changed the timeout to 960000, and now I get PT16M
(960000/60000).  Since I'm still getting 

Re: Query timed out after PT2M

2022-02-08 Thread Joe Obernberger
Update - I believe that for large tables, the 
spark.cassandra.read.timeoutMS needs to be very long; like 4 hours or 
longer.  The job now runs much longer, but still doesn't complete.  I'm 
now facing this all too familiar error:
com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: 
Cassandra timeout during read query at consistency LOCAL_ONE (1 
responses were required but only 0 replica responded)


In the past this has been due to clocks being out of sync (not the issue 
here), or a table that has been written to with LOCAL_ONE instead of 
LOCAL_QUORUM.  I don't believe either of those are the case.  To be 
sure, I ran a repair on the table overnight (about 17 hours to 
complete).  For the next test, I set the 
spark.cassandra.connection.timeoutMS to 60000 (default is 5000), and the 
spark.cassandra.query.retry.count to -1.
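A sketch of how those three connector settings would be passed to the 
SparkSession builder shown further down in this thread (the 4-hour read timeout 
is just the figure mentioned above expressed in milliseconds, the key names are 
the ones from the connector reference linked below, and the class name is made 
up):

```
import org.apache.spark.sql.SparkSession;

public class TimeoutTuningSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
            .builder()
            .appName("SparkCassandraApp")
            .config("spark.cassandra.connection.host", "chaos")
            .config("spark.cassandra.connection.port", "9042")
            // Driver-side read timeout: "very long, like 4 hours", in ms.
            .config("spark.cassandra.read.timeoutMS", "14400000")
            // Connection timeout raised from the 5000 ms default.
            .config("spark.cassandra.connection.timeoutMS", "60000")
            // -1 = retry failed reads indefinitely.
            .config("spark.cassandra.query.retry.count", "-1")
            .getOrCreate();
    }
}
```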


Suggestions?  Thoughts?

Thanks all.

-Joe

On 2/7/2022 10:35 AM, Joe Obernberger wrote:


Some more info.  Tried different GC strategies - no luck.
It only happens on large tables (more than 1 billion rows). Works fine 
on a 300million row table.  There is very high CPU usage during the run.


I've tried setting spark.dse.continuousPagingEnabled to false and I've 
tried setting spark.cassandra.input.readsPerSec to 10; no effect.


Stats:

nodetool cfstats doc.doc
Total number of tables: 82

Keyspace : doc
    Read Count: 9620329
    Read Latency: 0.5629605546754171 ms
    Write Count: 510561482
    Write Latency: 0.02805177028806885 ms
    Pending Flushes: 0
    Table: doc
    SSTable count: 77
    Old SSTable count: 0
    Space used (live): 82061188941
    Space used (total): 82061188941
    Space used by snapshots (total): 0
    Off heap memory used (total): 317037065
    SSTable Compression Ratio: 0.3816525125492022
    Number of partitions (estimate): 101021793
    Memtable cell count: 209646
    Memtable data size: 44087966
    Memtable off heap memory used: 0
    Memtable switch count: 10
    Local read count: 25665
    Local read latency: NaN ms
    Local write count: 2459322
    Local write latency: NaN ms
    Pending flushes: 0
    Percent repaired: 0.0
    Bytes repaired: 0.000KiB
    Bytes unrepaired: 184.869GiB
    Bytes pending repair: 0.000KiB
    Bloom filter false positives: 2063
    Bloom filter false ratio: 0.01020
    Bloom filter space used: 169249016
    Bloom filter off heap memory used: 169248400
    Index summary off heap memory used: 50863401
    Compression metadata off heap memory used: 96925264
    Compacted partition minimum bytes: 104
    Compacted partition maximum bytes: 943127
    Compacted partition mean bytes: 1721
    Average live cells per slice (last five minutes): NaN
    Maximum live cells per slice (last five minutes): 0
    Average tombstones per slice (last five minutes): NaN
    Maximum tombstones per slice (last five minutes): 0
    Dropped Mutations: 0


nodetool tablehistograms doc.doc
doc/doc histograms
Percentile   Read Latency   Write Latency   SSTables   Partition Size   Cell Count
                 (micros)        (micros)                      (bytes)
50%                  0.00            0.00       0.00             1109           86
75%                  0.00            0.00       0.00             3311          215
95%                  0.00            0.00       0.00             3311          215
98%                  0.00            0.00       0.00             3311          215
99%                  0.00            0.00       0.00             3311          215
Min                  0.00            0.00       0.00              104            5
Max                  0.00            0.00       0.00           943127         2299


I'm stuck.

-Joe


On 2/3/2022 9:30 PM, manish khandelwal wrote:
It maybe the case you have lots of tombstones in this table which is 
making reads slow and timeouts during bulk reads.


On Fri, Feb 4, 2022, 03:23 Joe Obernberger 
 wrote:


So it turns out that number after PT is increments of 60
seconds.  I changed the timeout to 960000, and now I get PT16M
(960000/60000).  Since I'm still getting timeouts, something else
must be wrong.

Exception in thread "main" org.apache.spark.SparkException: Job
aborted due to stage failure: Task 306 in stage 0.0 failed 4
times, most recent failure: Lost task 306.3 in stage 0.0 (TID
1180) (172.16.100.39 executor 0):
com.datastax.oss.driver.api.core.DriverTimeoutException: Query
timed out after PT16M

Re: Query timed out after PT2M

2022-02-07 Thread Joe Obernberger

Some more info.  Tried different GC strategies - no luck.
It only happens on large tables (more than 1 billion rows).  Works fine 
on a 300million row table.  There is very high CPU usage during the run.


I've tried setting spark.dse.continuousPagingEnabled to false and I've 
tried setting spark.cassandra.input.readsPerSec to 10; no effect.


Stats:

nodetool cfstats doc.doc
Total number of tables: 82

Keyspace : doc
    Read Count: 9620329
    Read Latency: 0.5629605546754171 ms
    Write Count: 510561482
    Write Latency: 0.02805177028806885 ms
    Pending Flushes: 0
    Table: doc
    SSTable count: 77
    Old SSTable count: 0
    Space used (live): 82061188941
    Space used (total): 82061188941
    Space used by snapshots (total): 0
    Off heap memory used (total): 317037065
    SSTable Compression Ratio: 0.3816525125492022
    Number of partitions (estimate): 101021793
    Memtable cell count: 209646
    Memtable data size: 44087966
    Memtable off heap memory used: 0
    Memtable switch count: 10
    Local read count: 25665
    Local read latency: NaN ms
    Local write count: 2459322
    Local write latency: NaN ms
    Pending flushes: 0
    Percent repaired: 0.0
    Bytes repaired: 0.000KiB
    Bytes unrepaired: 184.869GiB
    Bytes pending repair: 0.000KiB
    Bloom filter false positives: 2063
    Bloom filter false ratio: 0.01020
    Bloom filter space used: 169249016
    Bloom filter off heap memory used: 169248400
    Index summary off heap memory used: 50863401
    Compression metadata off heap memory used: 96925264
    Compacted partition minimum bytes: 104
    Compacted partition maximum bytes: 943127
    Compacted partition mean bytes: 1721
    Average live cells per slice (last five minutes): NaN
    Maximum live cells per slice (last five minutes): 0
    Average tombstones per slice (last five minutes): NaN
    Maximum tombstones per slice (last five minutes): 0
    Dropped Mutations: 0


nodetool tablehistograms doc.doc
doc/doc histograms
Percentile   Read Latency   Write Latency   SSTables   Partition Size   Cell Count
                 (micros)        (micros)                      (bytes)
50%                  0.00            0.00       0.00             1109           86
75%                  0.00            0.00       0.00             3311          215
95%                  0.00            0.00       0.00             3311          215
98%                  0.00            0.00       0.00             3311          215
99%                  0.00            0.00       0.00             3311          215
Min                  0.00            0.00       0.00              104            5
Max                  0.00            0.00       0.00           943127         2299


I'm stuck.

-Joe


On 2/3/2022 9:30 PM, manish khandelwal wrote:
It maybe the case you have lots of tombstones in this table which is 
making reads slow and timeouts during bulk reads.


On Fri, Feb 4, 2022, 03:23 Joe Obernberger 
 wrote:


So it turns out that number after PT is increments of 60 seconds. 
I changed the timeout to 960000, and now I get PT16M
(960000/60000).  Since I'm still getting timeouts, something else
must be wrong.

Exception in thread "main" org.apache.spark.SparkException: Job
aborted due to stage failure: Task 306 in stage 0.0 failed 4
times, most recent failure: Lost task 306.3 in stage 0.0 (TID
1180) (172.16.100.39 executor 0):
com.datastax.oss.driver.api.core.DriverTimeoutException: Query
timed out after PT16M
    at

com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:206)
    at

com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
    at

com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
    at

com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
    at

com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
    at

org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
    at


Re: Query timed out after PT2M

2022-02-04 Thread Joe Obernberger

I've tried several different GC settings - but still getting timeouts.
Using openJDK 11 with:
-XX:+UseG1GC
-XX:+ParallelRefProcEnabled
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=70
-XX:ParallelGCThreads=24
-XX:ConcGCThreads=24

Machine has 40 cores.  Xmx is set to 32G.
13 node cluster.

Any ideas on what else to try?

-Joe

On 2/4/2022 10:39 AM, Joe Obernberger wrote:


Still no go.  Oddly, I can use trino and do a count OK, but with spark 
I get the timeouts.  I don't believe tombstones are an issue:


nodetool cfstats doc.doc
Total number of tables: 82

Keyspace : doc
    Read Count: 1514288521
    Read Latency: 0.5080819034089475 ms
    Write Count: 12716563031
    Write Latency: 0.1462260620347646 ms
    Pending Flushes: 0
    Table: doc
    SSTable count: 72
    Old SSTable count: 0
    Space used (live): 74097778114
    Space used (total): 74097778114
    Space used by snapshots (total): 0
    Off heap memory used (total): 287187173
    SSTable Compression Ratio: 0.38644718028460934
    Number of partitions (estimate): 94111032
    Memtable cell count: 175084
    Memtable data size: 36945327
    Memtable off heap memory used: 0
    Memtable switch count: 677
    Local read count: 16237350
    Local read latency: 0.639 ms
    Local write count: 314822497
    Local write latency: 0.061 ms
    Pending flushes: 0
    Percent repaired: 0.0
    Bytes repaired: 0.000KiB
    Bytes unrepaired: 164.168GiB
    Bytes pending repair: 0.000KiB
    Bloom filter false positives: 154552
    Bloom filter false ratio: 0.01059
    Bloom filter space used: 152765592
    Bloom filter off heap memory used: 152765016
    Index summary off heap memory used: 48349869
    Compression metadata off heap memory used: 86072288
    Compacted partition minimum bytes: 104
    Compacted partition maximum bytes: 943127
    Compacted partition mean bytes: 1609
    Average live cells per slice (last five minutes): 1108.6270918991
    Maximum live cells per slice (last five minutes): 1109
    Average tombstones per slice (last five minutes): 1.0
    Maximum tombstones per slice (last five minutes): 1
    Dropped Mutations: 0

Other things to check?

-Joe

On 2/3/2022 9:30 PM, manish khandelwal wrote:
It maybe the case you have lots of tombstones in this table which is 
making reads slow and timeouts during bulk reads.


On Fri, Feb 4, 2022, 03:23 Joe Obernberger 
 wrote:


So it turns out that number after PT is increments of 60
seconds.  I changed the timeout to 960000, and now I get PT16M
(960000/60000).  Since I'm still getting timeouts, something else
must be wrong.

Exception in thread "main" org.apache.spark.SparkException: Job
aborted due to stage failure: Task 306 in stage 0.0 failed 4
times, most recent failure: Lost task 306.3 in stage 0.0 (TID
1180) (172.16.100.39 executor 0):
com.datastax.oss.driver.api.core.DriverTimeoutException: Query
timed out after PT16M
    at

com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:206)
    at

com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
    at

com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
    at

com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
    at

com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
    at

org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
    at

org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
    at

org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
    at
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
    at


Re: Query timed out after PT2M

2022-02-04 Thread Joe Obernberger
Still no go.  Oddly, I can use trino and do a count OK, but with spark I 
get the timeouts.  I don't believe tombstones are an issue:


nodetool cfstats doc.doc
Total number of tables: 82

Keyspace : doc
    Read Count: 1514288521
    Read Latency: 0.5080819034089475 ms
    Write Count: 12716563031
    Write Latency: 0.1462260620347646 ms
    Pending Flushes: 0
    Table: doc
    SSTable count: 72
    Old SSTable count: 0
    Space used (live): 74097778114
    Space used (total): 74097778114
    Space used by snapshots (total): 0
    Off heap memory used (total): 287187173
    SSTable Compression Ratio: 0.38644718028460934
    Number of partitions (estimate): 94111032
    Memtable cell count: 175084
    Memtable data size: 36945327
    Memtable off heap memory used: 0
    Memtable switch count: 677
    Local read count: 16237350
    Local read latency: 0.639 ms
    Local write count: 314822497
    Local write latency: 0.061 ms
    Pending flushes: 0
    Percent repaired: 0.0
    Bytes repaired: 0.000KiB
    Bytes unrepaired: 164.168GiB
    Bytes pending repair: 0.000KiB
    Bloom filter false positives: 154552
    Bloom filter false ratio: 0.01059
    Bloom filter space used: 152765592
    Bloom filter off heap memory used: 152765016
    Index summary off heap memory used: 48349869
    Compression metadata off heap memory used: 86072288
    Compacted partition minimum bytes: 104
    Compacted partition maximum bytes: 943127
    Compacted partition mean bytes: 1609
    Average live cells per slice (last five minutes): 1108.6270918991
    Maximum live cells per slice (last five minutes): 1109
    Average tombstones per slice (last five minutes): 1.0
    Maximum tombstones per slice (last five minutes): 1
    Dropped Mutations: 0

Other things to check?

-Joe

On 2/3/2022 9:30 PM, manish khandelwal wrote:
It maybe the case you have lots of tombstones in this table which is 
making reads slow and timeouts during bulk reads.


On Fri, Feb 4, 2022, 03:23 Joe Obernberger 
 wrote:


So it turns out that number after PT is increments of 60 seconds. 
I changed the timeout to 960000, and now I get PT16M
(960000/60000).  Since I'm still getting timeouts, something else
must be wrong.

Exception in thread "main" org.apache.spark.SparkException: Job
aborted due to stage failure: Task 306 in stage 0.0 failed 4
times, most recent failure: Lost task 306.3 in stage 0.0 (TID
1180) (172.16.100.39 executor 0):
com.datastax.oss.driver.api.core.DriverTimeoutException: Query
timed out after PT16M
    at

com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:206)
    at

com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
    at

com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
    at

com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
    at

com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
    at

org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
    at

org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
    at

org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
    at
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
    at

org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
    at

org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
    at scala.Option.foreach(Option.scala:407)
    at

org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
    at

org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
    at
 

Re: Query timed out after PT2M

2022-02-03 Thread manish khandelwal
It maybe the case you have lots of tombstones in this table which is making
reads slow and timeouts during bulk reads.

On Fri, Feb 4, 2022, 03:23 Joe Obernberger 
wrote:

> So it turns out that number after PT is increments of 60 seconds.  I
> changed the timeout to 960000, and now I get PT16M (960000/60000).  Since
> I'm still getting timeouts, something else must be wrong.
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 306 in stage 0.0 failed 4 times, most recent
> failure: Lost task 306.3 in stage 0.0 (TID 1180) (172.16.100.39 executor
> 0): com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed
> out after PT16M
> at
> com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:206)
> at
> com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
> at
> com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
> at
> com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
> at
> com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:829)
>
> Driver stacktrace:
> at
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
> at
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
> at
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
> at
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
> at
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
> at
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
> at scala.Option.foreach(Option.scala:407)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Query
> timed out after PT16M
> at
> com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:206)
> at
> com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
> at
> com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
> at
> com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
> at
> com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>
> -Joe
> On 2/3/2022 3:30 PM, Joe Obernberger wrote:
>
> I did find this:
>
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md
>
> And "spark.cassandra.read.timeoutMS" is set to 12.
>
> Running a test now, and I think that is it.  Thank you Scott.
>
> -Joe
> On 2/3/2022 3:19 PM, Joe Obernberger wrote:
>
> Thank you Scott!
> I am using the spark cassandra connector.  Code:
>
> SparkSession spark = SparkSession
> .builder()
> .appName("SparkCassandraApp")
> .config("spark.cassandra.connection.host", "chaos")
> .config("spark.cassandra.connection.port", "9042")
> .master("spark://aether.querymasters.com:8181")
> .getOrCreate();
>
> Would I set PT2M in there?  Like .config("pt2m","300") ?
> I'm not familiar with jshell, so I'm not sure where you're getting that
> duration from.
>
> Right now, I'm just doing a count:
> Dataset<Row> dataset =
> spark.read().format("org.apache.spark.sql.cassandra")
> .options(new HashMap<String, String>() {
> {
> put("keyspace", "doc");
> put("table", "doc");
> }
> }).load();
>
> dataset.count();
>
>
> Thank you!
>
> -Joe
> On 2/3/2022 3:01 PM, C. Scott Andreas wrote:
>
> Hi Joe, it looks like "PT2M" may refer to a timeout value 

Re: Query timed out after PT2M

2022-02-03 Thread Joe Obernberger
So it turns out that number after PT is increments of 60 seconds.  I 
changed the timeout to 960000, and now I get PT16M (960000/60000).  
Since I'm still getting timeouts, something else must be wrong.


Exception in thread "main" org.apache.spark.SparkException: Job aborted 
due to stage failure: Task 306 in stage 0.0 failed 4 times, most recent 
failure: Lost task 306.3 in stage 0.0 (TID 1180) (172.16.100.39 executor 
0): com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed 
out after PT16M
    at 
com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:206)
    at 
com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
    at 
com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
    at 
com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
    at 
com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)

    at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
    at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
    at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
    at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
    at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
    at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
    at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)

    at scala.Option.foreach(Option.scala:407)
    at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
    at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
    at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
    at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)

    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: 
Query timed out after PT16M
    at 
com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:206)
    at 
com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
    at 
com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
    at 
com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
    at 
com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)


-Joe

On 2/3/2022 3:30 PM, Joe Obernberger wrote:


I did find this:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md

And "spark.cassandra.read.timeoutMS" is set to 12.

Running a test now, and I think that is it.  Thank you Scott.

-Joe

On 2/3/2022 3:19 PM, Joe Obernberger wrote:


Thank you Scott!
I am using the spark cassandra connector.  Code:

SparkSession spark = SparkSession
    .builder()
    .appName("SparkCassandraApp")
    .config("spark.cassandra.connection.host", "chaos")
    .config("spark.cassandra.connection.port", "9042")
.master("spark://aether.querymasters.com:8181")
    .getOrCreate();

Would I set PT2M in there?  Like .config("pt2m","300") ?
I'm not familiar with jshell, so I'm not sure where you're getting 
that duration from.


Right now, I'm just doing a count:
Dataset<Row> dataset = 
spark.read().format("org.apache.spark.sql.cassandra")

    .options(new HashMap<String, String>() {
    {
    put("keyspace", "doc");
    put("table", "doc");
    }
    }).load();

dataset.count();


Thank you!

-Joe

On 2/3/2022 3:01 PM, C. Scott Andreas wrote:
Hi Joe, it looks like "PT2M" may refer to a timeout value that could 
be set by your Spark job's initialization of the client. I don't see 
a string matching this in the Cassandra codebase itself, but I do 
see that this is parseable as a Duration.


```
jshell> java.time.Duration.parse("PT2M").getSeconds()
$7 ==> 120
```

The server-side log you see is likely an indicator of the timeout 
from the 

Re: Query timed out after PT2M

2022-02-03 Thread Joe Obernberger

I did find this:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md

And "spark.cassandra.read.timeoutMS" is set to 12.

Running a test now, and I think that is it.  Thank you Scott.

-Joe

On 2/3/2022 3:19 PM, Joe Obernberger wrote:


Thank you Scott!
I am using the spark cassandra connector.  Code:

SparkSession spark = SparkSession
    .builder()
    .appName("SparkCassandraApp")
    .config("spark.cassandra.connection.host", "chaos")
    .config("spark.cassandra.connection.port", "9042")
    .master("spark://aether.querymasters.com:8181")
    .getOrCreate();

Would I set PT2M in there?  Like .config("pt2m","300") ?
I'm not familiar with jshell, so I'm not sure where you're getting 
that duration from.


Right now, I'm just doing a count:
Dataset<Row> dataset = 
spark.read().format("org.apache.spark.sql.cassandra")

    .options(new HashMap<String, String>() {
    {
    put("keyspace", "doc");
    put("table", "doc");
    }
    }).load();

dataset.count();


Thank you!

-Joe

On 2/3/2022 3:01 PM, C. Scott Andreas wrote:
Hi Joe, it looks like "PT2M" may refer to a timeout value that could 
be set by your Spark job's initialization of the client. I don't see 
a string matching this in the Cassandra codebase itself, but I do see 
that this is parseable as a Duration.


```
jshell> java.time.Duration.parse("PT2M").getSeconds()
$7 ==> 120
```

The server-side log you see is likely an indicator of the timeout 
from the server's perspective. You might consider checking logs from 
the replicas for dropped reads, query aborts due to scanning more 
tombstones than the configured max, or other conditions indicating 
overload/inability to serve a response.


If you're running a Spark job, I'd recommend using the DataStax Spark 
Cassandra Connector which distributes your query to executors 
addressing slices of the token range which will land on replica sets, 
avoiding the scatter-gather behavior that can occur if using the Java 
driver alone.


Cheers,

– Scott


On Feb 3, 2022, at 11:42 AM, Joe Obernberger 
 wrote:



Hi all - using a Cassandra 4.0.1 and a spark job running against a 
large

table (~8 billion rows) and I'm getting this error on the client side:
Query timed out after PT2M

On the server side I see a lot of messages like:
DEBUG [Native-Transport-Requests-39] 2022-02-03 14:39:56,647
ReadCallback.java:119 - Timed out; received 0 of 1 responses

The same code works on another table in the same Cassandra cluster that
is about 300 million rows and completes in about 2 minutes.  The 
cluster

is 13 nodes.

I can't find what PT2M means.  Perhaps the table needs a repair? Other
ideas?
Thank you!

-Joe



 

Re: Query timed out after PT2M

2022-02-03 Thread Joe Obernberger

Thank you Scott!
I am using the spark cassandra connector.  Code:

SparkSession spark = SparkSession
    .builder()
    .appName("SparkCassandraApp")
    .config("spark.cassandra.connection.host", "chaos")
    .config("spark.cassandra.connection.port", "9042")
    .master("spark://aether.querymasters.com:8181")
    .getOrCreate();

Would I set PT2M in there?  Like .config("pt2m","300") ?
I'm not familiar with jshell, so I'm not sure where you're getting that 
duration from.


Right now, I'm just doing a count:
Dataset<Row> dataset = spark.read().format("org.apache.spark.sql.cassandra")
    .options(new HashMap<String, String>() {
    {
    put("keyspace", "doc");
    put("table", "doc");
    }
    }).load();

dataset.count();


Thank you!

-Joe

On 2/3/2022 3:01 PM, C. Scott Andreas wrote:
Hi Joe, it looks like "PT2M" may refer to a timeout value that could 
be set by your Spark job's initialization of the client. I don't see a 
string matching this in the Cassandra codebase itself, but I do see 
that this is parseable as a Duration.


```
jshell> java.time.Duration.parse("PT2M").getSeconds()
$7 ==> 120
```

The server-side log you see is likely an indicator of the timeout from 
the server's perspective. You might consider checking logs from the 
replicas for dropped reads, query aborts due to scanning more 
tombstones than the configured max, or other conditions indicating 
overload/inability to serve a response.


If you're running a Spark job, I'd recommend using the DataStax Spark 
Cassandra Connector which distributes your query to executors 
addressing slices of the token range which will land on replica sets, 
avoiding the scatter-gather behavior that can occur if using the Java 
driver alone.


Cheers,

– Scott


On Feb 3, 2022, at 11:42 AM, Joe Obernberger 
 wrote:



Hi all - using a Cassandra 4.0.1 and a spark job running against a large
table (~8 billion rows) and I'm getting this error on the client side:
Query timed out after PT2M

On the server side I see a lot of messages like:
DEBUG [Native-Transport-Requests-39] 2022-02-03 14:39:56,647
ReadCallback.java:119 - Timed out; received 0 of 1 responses

The same code works on another table in the same Cassandra cluster that
is about 300 million rows and completes in about 2 minutes.  The cluster
is 13 nodes.

I can't find what PT2M means.  Perhaps the table needs a repair? Other
ideas?
Thank you!

-Joe



 

Re: Query timed out after PT2M

2022-02-03 Thread C. Scott Andreas

Hi Joe, it looks like "PT2M" may refer to a timeout value that could be set by 
your Spark job's initialization of the client. I don't see a string matching 
this in the Cassandra codebase itself, but I do see that this is parseable as a 
Duration.

```
jshell> java.time.Duration.parse("PT2M").getSeconds()
$7 ==> 120
```

The server-side log you see is likely an indicator of the timeout from the 
server's perspective. You might consider checking logs from the replicas for 
dropped reads, query aborts due to scanning more tombstones than the configured 
max, or other conditions indicating overload/inability to serve a response.

If you're running a Spark job, I'd recommend using the DataStax Spark Cassandra 
Connector which distributes your query to executors addressing slices of the 
token range which will land on replica sets, avoiding the scatter-gather 
behavior that can occur if using the Java driver alone.

Cheers,

– Scott

On Feb 3, 2022, at 11:42 AM, Joe Obernberger  wrote:

Hi all - using a Cassandra 4.0.1 and a spark job running against a large table 
(~8 billion rows) and I'm getting this error on the client side:
Query timed out after PT2M

On the server side I see a lot of messages like:
DEBUG [Native-Transport-Requests-39] 2022-02-03 14:39:56,647 
ReadCallback.java:119 - Timed out; received 0 of 1 responses

The same code works on another table in the same Cassandra cluster that is 
about 300 million rows and completes in about 2 minutes.  The cluster is 13 
nodes.

I can't find what PT2M means.  Perhaps the table needs a repair? Other ideas?
Thank you!

-Joe

Re: Query timed out after PT1M

2021-04-13 Thread Bowen Song

Ouch, counters.

Counters in Cassandra have pretty bad performance comparing to 
everything else in Cassandra or counters (and their equivalent, such as 
integer types) in other mainstream databases, and they often are 
inaccurate too. I personally would recommend against the use of counters 
in Cassandra. You may need to add more nodes to deal with the peak load 
in order to avoid the timeouts if you can't move away from using 
counters in Cassandra.



On 13/04/2021 17:45, Joe Obernberger wrote:


Thank you Bowen - I wasn't familiar with PT1M.
I'm doing the following:

update doc.seq set doccount=doccount+? where id=?
Which runs OK.
Immediately following the update, I do:
select doccount from doc.seq where id=?
It is the above statement that is throwing the error under heavy load.

The select also frequently fails with a "No node was available to 
execute the query".  I wait 50mSec and retry and that typically 
works.  Sometimes it will retry as many as 15 times before getting a 
response, but this PT1M error is new.


Running: nodetool cfstats doc.seq results in:

Total number of tables: 80

Keyspace : doc
    Read Count: 57965255
    Read Latency: 0.3294544486347899 ms
    Write Count: 384658145
    Write Latency: 0.1954830251859089 ms
    Pending Flushes: 0
    Table: seq
    SSTable count: 9
    Space used (live): 48344
    Space used (total): 48344
    Space used by snapshots (total): 0
    Off heap memory used (total): 376
    SSTable Compression Ratio: 0.6227272727272727
    Number of partitions (estimate): 35
    Memtable cell count: 6517
    Memtable data size: 264
    Memtable off heap memory used: 0
    Memtable switch count: 154
    Local read count: 12900131
    Local read latency: NaN ms
    Local write count: 15981389
    Local write latency: NaN ms
    Pending flushes: 0
    Percent repaired: 10.69
    Bloom filter false positives: 0
    Bloom filter false ratio: 0.0
    Bloom filter space used: 168
    Bloom filter off heap memory used: 96
    Index summary off heap memory used: 168
    Compression metadata off heap memory used: 112
    Compacted partition minimum bytes: 125
    Compacted partition maximum bytes: 149
    Compacted partition mean bytes: 149
    Average live cells per slice (last five minutes): NaN
    Maximum live cells per slice (last five minutes): 0
    Average tombstones per slice (last five minutes): NaN
    Maximum tombstones per slice (last five minutes): 0
    Dropped Mutations: 0

-Joe

On 4/13/2021 12:35 PM, Bowen Song wrote:


The error message is clear, it was a DriverTimeoutException, and it 
was because the query timed out after one minute.


Note: "PT1M" means a period of one minute, see 
https://en.wikipedia.org/wiki/ISO_8601#Durations


If you need help from us to find out why did it happen, you will need 
to share a bit more information with us, such as the CQL query and 
the table definition.



On 13/04/2021 16:53, Joe Obernberger wrote:

I'm getting this error:
com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed 
out after PT1M


but I can't find any documentation on this message.  Anyone know 
what this means?  I'm updating a counter value and then doing a 
select from the table.  The table that I'm selecting from is very 
small <100 rows.


Thank you!

-Joe




 


Re: Query timed out after PT1M

2021-04-13 Thread Joe Obernberger
Interestingly, I just tried creating two CqlSession objects and when I 
use both instead of a single CqlSession for all queries, the 'No Node 
available to execute query' no longer happens.  In other words, if I 
use a different CqlSession for updating the doc.seq table, it works.  
If that session is shared with other queries, I get the errors.
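A tiny sketch of that workaround as described above - one CqlSession dedicated 
to the doc.seq counter updates and a separate one for everything else, so the 
counter traffic gets its own connection pools and in-flight request slots. The 
id value and the second query are placeholders; this just illustrates the 
two-session split, not a recommended fix:

```
import com.datastax.oss.driver.api.core.CqlSession;

public class SplitSessionsSketch {
    public static void main(String[] args) {
        // Session used only for the doc.seq counter table...
        try (CqlSession counterSession = CqlSession.builder().build();
             // ...and a separate session for the rest of the workload.
             CqlSession mainSession = CqlSession.builder().build()) {

            counterSession.execute(
                "UPDATE doc.seq SET doccount = doccount + 1 WHERE id = 'docs'");
            mainSession.execute("SELECT release_version FROM system.local");
        }
    }
}
```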


-Joe

On 4/13/2021 12:35 PM, Bowen Song wrote:


The error message is clear, it was a DriverTimeoutException, and it 
was because the query timed out after one minute.


Note: "PT1M" means a period of one minute, see 
https://en.wikipedia.org/wiki/ISO_8601#Durations


If you need help from us to find out why did it happen, you will need 
to share a bit more information with us, such as the CQL query and the 
table definition.



On 13/04/2021 16:53, Joe Obernberger wrote:

I'm getting this error:
com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed 
out after PT1M


but I can't find any documentation on this message.  Anyone know 
what this means?  I'm updating a counter value and then doing a 
select from the table.  The table that I'm selecting from is very 
small <100 rows.


Thank you!

-Joe




 

Re: Query timed out after PT1M

2021-04-13 Thread Joe Obernberger

Thank you Bowen - I wasn't familiar with PT1M.
I'm doing the following:

update doc.seq set doccount=doccount+? where id=?
Which runs OK.
Immediately following the update, I do:
select doccount from doc.seq where id=?
It is the above statement that is throwing the error under heavy load.

The select also frequently fails with a "No node was available to 
execute the query".  I wait 50mSec and retry and that typically 
works.  Sometimes it will retry as many as 15 times before getting a 
response, but this PT1M error is new.
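For context, a minimal sketch of that update-then-read pattern with the Java 
driver 4.x, including the 50 ms / up-to-15-attempts retry described above. The 
id column's type (text), the literal values, and the class name are assumptions 
for illustration, not the actual production code:

```
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.NoNodeAvailableException;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;

public class SeqCounterSketch {
    public static void main(String[] args) throws InterruptedException {
        try (CqlSession session = CqlSession.builder().build()) {
            PreparedStatement update = session.prepare(
                "UPDATE doc.seq SET doccount = doccount + ? WHERE id = ?");
            PreparedStatement select = session.prepare(
                "SELECT doccount FROM doc.seq WHERE id = ?");

            session.execute(update.bind(1L, "docs"));

            // Naive retry loop for the "No node was available" failures.
            Row row = null;
            for (int attempt = 0; attempt < 15 && row == null; attempt++) {
                try {
                    row = session.execute(select.bind("docs")).one();
                } catch (NoNodeAvailableException e) {
                    Thread.sleep(50);
                }
            }
            if (row != null) {
                System.out.println(row.getLong("doccount"));
            }
        }
    }
}
```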


Running: nodetool cfstats doc.seq results in:

Total number of tables: 80

Keyspace : doc
    Read Count: 57965255
    Read Latency: 0.3294544486347899 ms
    Write Count: 384658145
    Write Latency: 0.1954830251859089 ms
    Pending Flushes: 0
    Table: seq
    SSTable count: 9
    Space used (live): 48344
    Space used (total): 48344
    Space used by snapshots (total): 0
    Off heap memory used (total): 376
    SSTable Compression Ratio: 0.6227272727272727
    Number of partitions (estimate): 35
    Memtable cell count: 6517
    Memtable data size: 264
    Memtable off heap memory used: 0
    Memtable switch count: 154
    Local read count: 12900131
    Local read latency: NaN ms
    Local write count: 15981389
    Local write latency: NaN ms
    Pending flushes: 0
    Percent repaired: 10.69
    Bloom filter false positives: 0
    Bloom filter false ratio: 0.0
    Bloom filter space used: 168
    Bloom filter off heap memory used: 96
    Index summary off heap memory used: 168
    Compression metadata off heap memory used: 112
    Compacted partition minimum bytes: 125
    Compacted partition maximum bytes: 149
    Compacted partition mean bytes: 149
    Average live cells per slice (last five minutes): NaN
    Maximum live cells per slice (last five minutes): 0
    Average tombstones per slice (last five minutes): NaN
    Maximum tombstones per slice (last five minutes): 0

    Dropped Mutations: 0

-Joe

On 4/13/2021 12:35 PM, Bowen Song wrote:


The error message is clear, it was a DriverTimeoutException, and it 
was because the query timed out after one minute.


Note: "PT1M" means a period of one minute, see 
https://en.wikipedia.org/wiki/ISO_8601#Durations


If you need help from us to find out why did it happen, you will need 
to share a bit more information with us, such as the CQL query and the 
table definition.



On 13/04/2021 16:53, Joe Obernberger wrote:

I'm getting this error:
com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed 
out after PT1M


but I can't find any documentation on this message.  Anyone know 
what this means?  I'm updating a counter value and then doing a 
select from the table.  The table that I'm selecting from is very 
small <100 rows.


Thank you!

-Joe




 

Re: Query timed out after PT1M

2021-04-13 Thread Bowen Song
The error message is clear, it was a DriverTimeoutException, and it was 
because the query timed out after one minute.


Note: "PT1M" means a period of one minute, see 
https://en.wikipedia.org/wiki/ISO_8601#Durations


If you need help from us to find out why did it happen, you will need to 
share a bit more information with us, such as the CQL query and the 
table definition.



On 13/04/2021 16:53, Joe Obernberger wrote:

I'm getting this error:
com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed 
out after PT1M


but I can't find any documentation on this message.  Anyone know what 
this means?  I'm updating a counter value and then doing a select from 
the table.  The table that I'm selecting from is very small <100 rows.


Thank you!

-Joe




Re: Query data through python using IN clause

2020-04-02 Thread Nitan Kainth
Thanks Alex.

On Thu, Apr 2, 2020 at 1:39 AM Alex Ott  wrote:

> Hi
>
> Working code is below, but I want to warn you - prefer not to use IN with
> partition keys - because you'll have different partition key values,
> coordinator node will need to perform queries to other hosts that hold
> these partition keys, and this slow downs the operation, and adds an
> additional load to the coordinating node.  If you execute queries in
> parallel (using async) for every of combination of pk1 & pk2, and then
> consolidate data application side - this could be faster than query with
> IN.
>
> Answer:
>
> You need to pass list as value of temp - IN expects list there...
>
> query = session.prepare("select * from test.table1 where pk1 IN ? and
> pk2=0 and ck1 > ? AND ck1 < ?;")
> temp = [1,2,3]
>
> import dateutil.parser
>
> ck1 = dateutil.parser.parse('2020-01-01T00:00:00Z')
> ck2 = dateutil.parser.parse('2021-01-01T00:00:00Z')
>
> rows = session.execute(query, (temp, ck1, ck2))
> for row in rows:
>     print(row)
>
>
>
>
> Nitan Kainth  at "Wed, 1 Apr 2020 18:21:54 -0500" wrote:
>  NK> Hi There,
>
>  NK> I am trying to read data from table as below structure:
>
>  NK> table1(
>  NK> pk1 bigint,
>  NK> pk2 bigint,
>  NK> ck1 timestamp,
>  NK> value text,
>  NK> primary key((pk1,pk2),ck1);
>
>  NK> query = session.prepare("select * from table1 where pk IN ? and pk2=0
> and ck1 > ? AND ck1 < ?;")
>
>  NK> temp = 1,2,3
>
>  NK> runq = session.execute(query2, (temp,ck1, ck1))
>
>  NK> TypeError: Received an argument of invalid type for column
> "in(bam_user)". Expected: ,
> Got:
>  NK> ; (cannot convert argument
> to integer)
>
>  NK> I found examples for prepared statements for inserts but couldn't
> find any for select and not able to make it to work.
>
>  NK> Any suggestions?
>
>
>
> --
> With best wishes,
> Alex Ott
> Principal Architect, DataStax
> http://datastax.com/
>


Re: Query data through python using IN clause

2020-04-02 Thread Alex Ott
Hi

Working code is below, but I want to warn you - prefer not to use IN with
partition keys - because you'll have different partition key values,
coordinator node will need to perform queries to other hosts that hold
these partition keys, and this slow downs the operation, and adds an
additional load to the coordinating node.  If you execute queries in
parallel (using async) for every of combination of pk1 & pk2, and then
consolidate data application side - this could be faster than query with IN.

Answer:

You need to pass list as value of temp - IN expects list there...

query = session.prepare("select * from test.table1 where pk1 IN ? and pk2=0 and 
ck1 > ? AND ck1 < ?;")
temp = [1,2,3]

import dateutil.parser

ck1 = dateutil.parser.parse('2020-01-01T00:00:00Z')
ck2 = dateutil.parser.parse('2021-01-01T00:00:00Z')

rows = session.execute(query, (temp, ck1, ck2))
for row in rows:
    print(row)




Nitan Kainth  at "Wed, 1 Apr 2020 18:21:54 -0500" wrote:
 NK> Hi There,

 NK> I am trying to read data from table as below structure:

 NK> table1(
 NK> pk1 bigint,
 NK> pk2 bigint,
 NK> ck1 timestamp,
 NK> value text,
 NK> primary key((pk1,pk2),ck1);

 NK> query = session.prepare("select * from table1 where pk IN ? and pk2=0 and 
ck1 > ? AND ck1 < ?;")

 NK> temp = 1,2,3

 NK> runq = session.execute(query2, (temp,ck1, ck1))

 NK> TypeError: Received an argument of invalid type for column "in(bam_user)". 
Expected: , Got:
 NK> ; (cannot convert argument to 
integer)

 NK> I found examples for prepared statements for inserts but couldn't find any 
for select and not able to make it to work. 

 NK> Any suggestions?



-- 
With best wishes,
Alex Ott
Principal Architect, DataStax
http://datastax.com/




Re: Query timeouts after Cassandra Migration

2020-02-07 Thread Reid Pinchback
Ankit, are the instance types identical in the new cluster, with I/O 
configuration identical at the system level, and are the Java settings for C* 
identical between the two clusters?  With radical timing differences happening 
periodically, the two things I’d have on my radar would be garbage collections 
and problems in flushing dirty pages.  Even if neither of those are the issue, 
one way or another, timeouts make me hunt for the resource everybody is queued 
up on.

From: Erick Ramirez 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, February 6, 2020 at 10:08 PM
To: "user@cassandra.apache.org" 
Subject: Re: Query timeouts after Cassandra Migration

So do you advise copying tokens in such cases ? What procedure is advisable ?

Specifically for your case with 3 nodes + RF=3, it won't make a difference so 
leave it as it is.

Latency increased on target cluster.

Have you tried to run a trace of the queries which are slow? It will help you 
identify where the slowness is coming from. Cheers!


Re: Query timeouts after Cassandra Migration

2020-02-06 Thread Erick Ramirez
>
> So do you advise copying tokens in such cases ? What procedure is
> advisable ?
>

Specifically for your case with 3 nodes + RF=3, it won't make a difference
so leave it as it is.


> Latency increased on target cluster.
>

Have you tried to run a trace of the queries which are slow? It will help
you identify where the slowness is coming from. Cheers!


Re: Query timeouts after Cassandra Migration

2020-02-06 Thread Ankit Gadhiya
Thanks, Erick.
So do you advise copying tokens in such cases ? What procedure is advisable
?

Latency increased on target cluster. I’d double check on storage disks but
it should be same.


— Ankit

On Thu, Feb 6, 2020 at 9:07 PM Erick Ramirez  wrote:

> I didn’t copy tokens since it’s an identical cluster and we have RF as 3
>> on 3 node cluster. Is it still needed , why?
>>
>
> In C*, same number of nodes alone isn't enough. Clusters aren't really
> identical unless token assignments are the same. In your case though since
> each node has a full copy of the data (RF = N nodes), they "appear"
> identical.
>
> I recently migrated Cassandra keyspace data from one Azure cluster (3
>> Nodes) to another (3 nodes different region) using simple sstable copy.
>> Post this , we are observing overall response time has increased and
>> timeouts every 20 mins.
>>
>
>  You mean the response time on the source cluster increased? Or the
> destination cluster? I can't see how the copy could affect latency unless
> you're using premium storage disks and you've maxed out the throughput on
> them. For example, P30 disks are capped at 200MB/s.
>
> Do I need to copy anything from system*
>
>
> No, system tables are local to a node. Only ever copy the application
> keyspaces. Cheers!
>
-- 
*Thanks & Regards,*
*Ankit Gadhiya*


Re: Query timeouts after Cassandra Migration

2020-02-06 Thread Erick Ramirez
>
> I didn’t copy tokens since it’s an identical cluster and we have RF as 3
> on 3 node cluster. Is it still needed , why?
>

In C*, same number of nodes alone isn't enough. Clusters aren't really
identical unless token assignments are the same. In your case though since
each node has a full copy of the data (RF = N nodes), they "appear"
identical.

I recently migrated Cassandra keyspace data from one Azure cluster (3
> Nodes) to another (3 nodes different region) using simple sstable copy.
> Post this , we are observing overall response time has increased and
> timeouts every 20 mins.
>

 You mean the response time on the source cluster increased? Or the
destination cluster? I can't see how the copy could affect latency unless
you're using premium storage disks and you've maxed out the throughput on
them. For example, P30 disks are capped at 200MB/s.

Do I need to copy anything from system*


No, system tables are local to a node. Only ever copy the application
keyspaces. Cheers!


Re: Query timeouts after Cassandra Migration

2020-02-06 Thread Ankit Gadhiya
Hi Michael,

Thanks for your response.

I didn’t copy tokens since it’s an identical cluster and we have RF as 3 on
3 node cluster. Is it still needed , why?

Don’t see anything in cassandra log as such. I don’t have debugs enabled.


Thanks & Regards,
Ankit

On Thu, Feb 6, 2020 at 1:47 PM Michael Shuler 
wrote:

> Did you copy the tokens from cluster1 to the new cluster2? Same Cassandra
> version, same instance type/size? What do the logs on cluster2 say that
> looks different from the cluster1 norm? There are a number of `nodetool`
> utilities that may help you see what is happening on the new cluster2.
>
> Michael
>
> On 2/6/20 8:09 AM, Ankit Gadhiya wrote:
> > Hi Folks,
> >
> > I recently migrated Cassandra keyspace data from one Azure cluster (3
> > Nodes) to another (3 nodes different region) using simple sstable copy.
> > Post this , we are observing overall response time has increased and
> > timeouts every 20 mins.
> >
> > Has anyone faced such in their experiences ?
> > Do I need to copy anything from system*
> > Anything wrt statistics/cache ?
> >
> > Your time and responses on this are much appreciated.
> >
> >
> > Thanks & Regards,
> > Ankit
> > --
> > *Thanks & Regards,*
> > *Ankit Gadhiya*
> >
>
>
> --
*Thanks & Regards,*
*Ankit Gadhiya*


Re: Query timeouts after Cassandra Migration

2020-02-06 Thread Michael Shuler
Did you copy the tokens from cluster1 to the new cluster2? Same Cassandra 
version, same instance type/size? What do the logs on cluster2 say that 
looks different from the cluster1 norm? There are a number of `nodetool` 
utilities that may help you see what is happening on the new cluster2.
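A quick, scriptable check of the version and token questions above (a sketch
with the Python driver; contact points are placeholders): release_version and
tokens are visible in system.local on every node.

```
from cassandra.cluster import Cluster

# One representative contact point per cluster (placeholders).
for label, contact in [('cluster1', '10.0.0.1'), ('cluster2', '10.0.1.1')]:
    cluster = Cluster([contact])
    session = cluster.connect()
    local = session.execute("SELECT release_version, tokens FROM system.local").one()
    print(label, local.release_version, len(local.tokens), "tokens on", contact)
    cluster.shutdown()
```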


Michael

On 2/6/20 8:09 AM, Ankit Gadhiya wrote:

Hi Folks,

I recently migrated Cassandra keyspace data from one Azure cluster (3 
Nodes) to another (3 nodes different region) using simple sstable copy. 
Post this , we are observing overall response time has increased and 
timeouts every 20 mins.


Has anyone faced such in their experiences ?
Do I need to copy anything from system*
Anything wrt statistics/cache ?

Your time and responses on this are much appreciated.


Thanks & Regards,
Ankit
--
*Thanks & Regards,*
*Ankit Gadhiya*






Re: Query failure

2019-03-14 Thread Léo FERLIN SUTTON
I checked and the configuration file matched on all the nodes.

I checked `cqlsh  --cqlversion "3.4.0" -u cassandra_superuser -p
my_password nodeXX 9042` with each node and finally one failed.

It had somehow not been restarted since the configuration change. It was
not responsive to `systemctl stop/start/restart cassandra`, but once I
finally got it to restart, my issues disappeared.

Thank you so much for the help !

Regards,

Leo


On Thu, Mar 14, 2019 at 1:38 PM Sam Tunnicliffe  wrote:

> Hi Leo
>
> my guess would be that your configuration is not consistent across all
> nodes in the cluster. The responses you’re seeing are totally indicative of
> being connected to a node where PasswordAuthenticator is not enabled in
> cassandra.yaml.
>
> Thanks,
> Sam
>
> On 14 Mar 2019, at 10:56, Léo FERLIN SUTTON 
> wrote:
>
> Hello !
>
> Recently I have noticed some clients are having errors almost every time
> they try to contact my Cassandra cluster.
>
> The error messages vary but there is one constant : *It's not constant* !
> Let me show you :
>
> From the client host :
>
> `cqlsh  --cqlversion "3.4.0" -u cassandra_superuser -p my_password
> cassandra_address 9042`
>
> The CL commands will fail half of the time :
>
> ```
> cassandra_vault_superuser@cqlsh> CREATE ROLE leo333 WITH PASSWORD =
> 'leo4' AND LOGIN=TRUE;
> InvalidRequest: Error from server: code=2200 [Invalid query]
> message="org.apache.cassandra.auth.CassandraRoleManager doesn't support
> PASSWORD"
> cassandra_vault_superuser@cqlsh> CREATE ROLE leo333 WITH PASSWORD =
> 'leo4' AND LOGIN=TRUE;
> ```
>
> Same with grants :
> ```
> cassandra_vault_superuser@cqlsh> GRANT read_write_role TO leo333;
> Unauthorized: Error from server: code=2100 [Unauthorized] message="You
> have to be logged in and not anonymous to perform this request"
> cassandra_vault_superuser@cqlsh> GRANT read_write_role TO leo333;
> ```
>
> Same with `list roles` :
> ```
> cassandra_vault_superuser@cqlsh> list roles;
>
>  role | super | login
> | options
>
> --+---+---+-
> cassandra |  True |  True
> |{}
> [...]
>
> cassandra_vault_superuser@cqlsh> list roles;
> Unauthorized: Error from server: code=2100 [Unauthorized] message="You
> have to be logged in and not anonymous to perform this request"
> ```
>
> My Cassandra  (3.0.18) configuration seems correct :
> ```
> authenticator: PasswordAuthenticator
> authorizer: CassandraAuthorizer
> role_manager: CassandraRoleManager
> ```
>
> The system_auth schema seems correct as well :
> `CREATE KEYSPACE system_auth WITH replication = {'class':
> 'NetworkTopologyStrategy', 'my_dc': '3'}  AND durable_writes = true;`
>
>
> I am only having those errors when :
>
>   * I am on a non local client.
>   * Via `cqlsh`
>   * Or via the vaultproject client (
> https://www.vaultproject.io/docs/secrets/databases/cassandra.html) (1
> error occurred: You have to be logged in and not anonymous to perform this
> request)
>
> If I am using cqlsh (with authentication) but from a Cassandra node it
> works 100% of the time.
>
> Any ideas about what might be going wrong?
>
> Regards,
>
> Leo
>
>
>


Re: Query failure

2019-03-14 Thread Sam Tunnicliffe
Hi Leo

my guess would be that your configuration is not consistent across all nodes in 
the cluster. The responses you’re seeing are totally indicative of being 
connected to a node where PasswordAuthenticator is not enabled in 
cassandra.yaml. 
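A minimal sketch of checking each node individually (addresses, credentials
and driver version are assumptions; it is the same idea as running cqlsh
against every host): pin the Python driver to one host at a time so a node
with authentication left disabled cannot hide behind the others.

```
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.auth import PlainTextAuthProvider
from cassandra.policies import WhiteListRoundRobinPolicy

nodes = ['10.0.0.1', '10.0.0.2', '10.0.0.3']   # placeholder node addresses
auth = PlainTextAuthProvider(username='cassandra_superuser', password='my_password')

for node in nodes:
    # Restrict the driver to this single host so queries cannot be routed elsewhere.
    profile = ExecutionProfile(load_balancing_policy=WhiteListRoundRobinPolicy([node]))
    cluster = Cluster([node], auth_provider=auth,
                      execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    try:
        session = cluster.connect()
        session.execute("LIST ROLES")          # fails if auth is not active on this node
        print(node, "OK")
    except Exception as exc:
        print(node, "FAILED:", exc)
    finally:
        cluster.shutdown()
```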

Thanks,
Sam

> On 14 Mar 2019, at 10:56, Léo FERLIN SUTTON  
> wrote:
> 
> Hello !
> 
> Recently I have noticed some clients are having errors almost every time they 
> try to contact my Cassandra cluster.
> 
> The error messages vary but there is one constant : It's not constant ! Let 
> me show you : 
> 
> From the client host : 
> 
> `cqlsh  --cqlversion "3.4.0" -u cassandra_superuser -p my_password 
> cassandra_address 9042`
> 
> The CL commands will fail half of the time :
> 
> ```
> cassandra_vault_superuser@cqlsh> CREATE ROLE leo333 WITH PASSWORD = 'leo4' 
> AND LOGIN=TRUE;
> InvalidRequest: Error from server: code=2200 [Invalid query] 
> message="org.apache.cassandra.auth.CassandraRoleManager doesn't support 
> PASSWORD"
> cassandra_vault_superuser@cqlsh> CREATE ROLE leo333 WITH PASSWORD = 'leo4' 
> AND LOGIN=TRUE;
> ```
> 
> Same with grants : 
> ```
> cassandra_vault_superuser@cqlsh> GRANT read_write_role TO leo333;
> Unauthorized: Error from server: code=2100 [Unauthorized] message="You have 
> to be logged in and not anonymous to perform this request"
> cassandra_vault_superuser@cqlsh> GRANT read_write_role TO leo333;
> ```
> 
> Same with `list roles` : 
> ```
> cassandra_vault_superuser@cqlsh> list roles;
> 
>  role | super | login | 
> options
> --+---+---+-
> cassandra |  True |  True |   
>  {}
> [...]
> 
> cassandra_vault_superuser@cqlsh> list roles;
> Unauthorized: Error from server: code=2100 [Unauthorized] message="You have 
> to be logged in and not anonymous to perform this request"
> ```
> 
> My Cassandra  (3.0.18) configuration seems correct : 
> ```
> authenticator: PasswordAuthenticator
> authorizer: CassandraAuthorizer
> role_manager: CassandraRoleManager
> ```
> 
> The system_auth schema seems correct as well : 
> `CREATE KEYSPACE system_auth WITH replication = {'class': 
> 'NetworkTopologyStrategy', 'my_dc': '3'}  AND durable_writes = true;`
> 
> 
> I am only having those errors when : 
> 
>   * I am on a non local client. 
>   * Via `cqlsh`
>   * Or via the vaultproject client 
> (https://www.vaultproject.io/docs/secrets/databases/cassandra.html 
> ) (1 error 
> occurred: You have to be logged in and not anonymous to perform this request)
> 
> If I am using cqlsh (with authentication) but from a Cassandra node it 
> works 100% of the time.
> 
> Any ideas about what might be going wrong?
> 
> Regards,
> 
> Leo
> 



Re: Query With Limit Clause

2018-11-07 Thread shalom sagges
Thanks a lot for the info :)

On Tue, Nov 6, 2018 at 11:11 AM DuyHai Doan  wrote:

> Cassandra will execute such request using a Partition Range Scan.
>
> See more details here http://www.doanduyhai.com/blog/?p=13191, chapter E
> Cluster Read Path (look at the formula of Concurrency Factor)
>
>
>
> On Tue, Nov 6, 2018 at 8:21 AM shalom sagges 
> wrote:
>
>> Hi All,
>>
>> If I run for example:
>> select * from myTable limit 3;
>>
>> Does Cassandra do a full table scan regardless of the limit?
>>
>> Thanks!
>>
>
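To make the answer above concrete, a small sketch with the Python driver
(contact point, keyspace and table are placeholders): the coordinator walks
token ranges for the scan but stops as soon as the LIMIT is satisfied, and
paging keeps the client from pulling more than it asks for.

```
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['10.0.0.1']).connect('my_keyspace')

# LIMIT bounds the result; fetch_size bounds how much is transferred per page.
stmt = SimpleStatement("SELECT * FROM myTable LIMIT 3", fetch_size=3)
for row in session.execute(stmt):
    print(row)
```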


Re: Query on Data Modelling of a specific usecase

2017-04-20 Thread Naresh Yadav
Hi Jon,

Thanks for your guidance.

In the above-mentioned table I can have a different scale depending on the report.

One report may have 1 rows.
Second report may have half million rows.
Third report may have 1 million rows.
Fourth report may have 10 million rows.

As this is time-series data, that was the main reason for modelling it in Cassandra.
We preferred a separate table for each report as there is no use case for
querying across reports, and lighter reports will also work faster.
I can plan to reduce the number of tables drastically by combining lighter
reports into one table at the application level.

Could you suggest an optimal table design, keeping in mind one table at a
scale of 10 million to 1 billion rows, for the queries mentioned?

Thanks,
Naresh Yadav

On Wed, Apr 19, 2017 at 9:26 PM, Jon Haddad 
wrote:

> How much data do you plan to store in each table?
>
> I’ll be honest, this doesn’t sound like a Cassandra use case at first
> glance.  1 table per report x 1000 is going to be a bad time.  Odds are
> with different queries, you’ll need multiple views, so lets call that a
> handful of tables per report.  Sounds to me like you need CSV (for small
> reports) or Parquet + a file system (for large ones).
>
> Jon
>
>
> On Apr 18, 2017, at 11:34 PM, Naresh Yadav  wrote:
>
> Looking for cassandra expert's recommendation on above usecase, please
> reply.
>
> On Mon, Apr 17, 2017 at 7:37 PM, Naresh Yadav 
> wrote:
>
>> Hi all,
>>
>> This is my existing table configured on apache-cassandra-3.0.9:
>>
>> CREATE TABLE report_id1 (
>>mc_id text,
>>tag_id text,
>>e_date timestamp,
>>value text,
>>PRIMARY KEY ((mc_id, tag_id), e_date)
>> );
>>
>> I create table dynamically for each report from application. Need to
>> support upto 1000 reports means 1000 such tables.
>> unique mc_id will be in range of 5 to 100 in a report.
>> For a mc_id there will be unique tag_id in range of 100 to 1 million in a
>> report.
>> For a mc_id, tag_id there will be unique e_date values in range of 10 to
>> 5000.
>>
>> Current queries to answer :
>> 1)SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN('a','b','c') AND
>> e_date='16Apr2017 23:59:59';
>> 2)SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN('a','b','c') AND
>> e_date >='01Apr2017 00:00:00' AND e_date <='16Apr2017 23:59:59;
>>
>> 3)SELECT * FROM report_id1 WHERE mc_id='x' AND e_date='16Apr2017
>> 23:59:59';
>>Current design this works with ALLOW FILTERING ONLY
>> 4)SELECT * FROM report_id1 WHERE mc_id='x' AND e_date >='01Apr2017
>> 00:00:00' AND e_date <='16Apr2017 23:59:59';
>>Current design this works with ALLOW FILTERING ONLY
>>
>> Looking for better design for this case, keeping in mind dynamic tables
>> usecase and queries listed.
>>
>> Thanks in advance,
>> Naresh
>>
>>
>
>


Re: Query on Data Modelling of a specific usecase

2017-04-19 Thread Jon Haddad
How much data do you plan to store in each table?

I’ll be honest, this doesn’t sound like a Cassandra use case at first glance.  
1 table per report x 1000 is going to be a bad time.  Odds are with different 
queries, you’ll need multiple views, so let’s call that a handful of tables per 
report.  Sounds to me like you need CSV (for small reports) or Parquet + a file 
system (for large ones).

Jon


> On Apr 18, 2017, at 11:34 PM, Naresh Yadav  wrote:
> 
> Looking for cassandra expert's recommendation on above usecase, please reply.
> 
> On Mon, Apr 17, 2017 at 7:37 PM, Naresh Yadav  > wrote:
> Hi all,
> 
> This is my existing table configured on apache-cassandra-3.0.9:
> 
> CREATE TABLE report_id1 (
>mc_id text,
>tag_id text,
>e_date timestamp,
>value text,
>PRIMARY KEY ((mc_id, tag_id), e_date)
> );
> 
> I create table dynamically for each report from application. Need to support 
> upto 1000 reports means 1000 such tables.
> unique mc_id will be in range of 5 to 100 in a report.
> For a mc_id there will be unique tag_id in range of 100 to 1 million in a 
> report.
> For a mc_id, tag_id there will be unique e_date values in range of 10 to 5000.
> 
> Current queries to answer : 
> 1)SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN('a','b','c') AND 
> e_date='16Apr2017 23:59:59';
> 2)SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN('a','b','c') AND 
> e_date >='01Apr2017 00:00:00' AND e_date <='16Apr2017 23:59:59;
> 
> 3)SELECT * FROM report_id1 WHERE mc_id='x' AND e_date='16Apr2017 23:59:59';
>Current design this works with ALLOW FILTERING ONLY
> 4)SELECT * FROM report_id1 WHERE mc_id='x' AND e_date >='01Apr2017 00:00:00' 
> AND e_date <='16Apr2017 23:59:59';
>Current design this works with ALLOW FILTERING ONLY
>
> Looking for better design for this case, keeping in mind dynamic tables 
> usecase and queries listed.   
> 
> Thanks in advance,
> Naresh
> 
> 



Re: Query on Data Modelling of a specific usecase

2017-04-19 Thread Naresh Yadav
Looking for cassandra expert's recommendation on above usecase, please
reply.

On Mon, Apr 17, 2017 at 7:37 PM, Naresh Yadav  wrote:

> Hi all,
>
> This is my existing table configured on apache-cassandra-3.0.9:
>
> CREATE TABLE report_id1 (
>mc_id text,
>tag_id text,
>e_date timestamp,
>value text,
>PRIMARY KEY ((mc_id, tag_id), e_date)
> );
>
> I create table dynamically for each report from application. Need to
> support upto 1000 reports means 1000 such tables.
> unique mc_id will be in range of 5 to 100 in a report.
> For a mc_id there will be unique tag_id in range of 100 to 1 million in a
> report.
> For a mc_id, tag_id there will be unique e_date values in range of 10 to
> 5000.
>
> Current queries to answer :
> 1)SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN('a','b','c') AND
> e_date='16Apr2017 23:59:59';
> 2)SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN('a','b','c') AND
> e_date >='01Apr2017 00:00:00' AND e_date <='16Apr2017 23:59:59;
>
> 3)SELECT * FROM report_id1 WHERE mc_id='x' AND e_date='16Apr2017 23:59:59';
>Current design this works with ALLOW FILTERING ONLY
> 4)SELECT * FROM report_id1 WHERE mc_id='x' AND e_date >='01Apr2017
> 00:00:00' AND e_date <='16Apr2017 23:59:59';
>Current design this works with ALLOW FILTERING ONLY
>
> Looking for better design for this case, keeping in mind dynamic tables
> usecase and queries listed.
>
> Thanks in advance,
> Naresh
>
>


RE: Query on Cassandra clusters

2017-01-03 Thread SEAN_R_DURITY
A couple thoughts (for after you up/downgrade to one version for all nodes):

- 16 GB of total RAM on a node is a minimum I would use; 32 would be 
much better.

- With a lower amount of memory, I think I would keep memtables on-heap 
in order to keep a tighter rein on how much they use. If you are consistently 
using 75% or more of heap space, you need more (either more nodes or more 
memory per node).

- I would try giving Cassandra 50% of the RAM on the host, and remove 
any client or non-Cassandra processes. Nodes should be dedicated to Cassandra 
(for production).

- For disk, my rule for size-tiered compaction is that you need 50% overhead IF 
it is primarily a single-table application (90%+ of data in one table). 
Otherwise, I am OK with 35-40% overhead. Just know you can hit issues down the 
road as the sstables get larger.


Sean Durity
From: Sumit Anvekar [mailto:sumit.anve...@gmail.com]
Sent: Wednesday, December 21, 2016 3:47 PM
To: user@cassandra.apache.org
Subject: Re: Query on Cassandra clusters

Thank you Alain for the detailed explanation.
To answer your question on Java version, JVM settings and memory usage: we are 
using 1.8.0_45, precisely
>java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
JVM settings are identical on all nodes (cassandra-env.sh is identical).
Further, when I say high memory usage: Cassandra is using heap (-Xmx3767M) 
and off-heap of about 6GB out of the total system memory of 14.7 GB. Along with 
this there are other processes running on this system, which brings the 
overall memory usage to >95%. This brings me to another point: is heap 
memory + off-heap (the sum of the Space used (total) values from nodetool cfstats) 
the total memory used by Cassandra on a node?
Also, on the disk front, what is a good amount of empty space to leave 
unused on the partition (~50%, should it be?) considering we use the 
SizeTieredCompaction strategy?

On Wed, Dec 21, 2016 at 6:30 PM, Alain RODRIGUEZ 
<arodr...@gmail.com<mailto:arodr...@gmail.com>> wrote:
Hi Sumit,

1. I have a Cassandra cluster with 11 nodes, 5 of which have Cassandra version 
3.0.3 and then newer 5 nodes have 3.6.0 version.

I strongly recommend to:


  *   Stick with one version of Apache Cassandra per cluster.
  *   Always be as close as possible from the last minor release of the 
Cassandra version in use.

So you really should not be using 3.0.6 AND 3.6.0 but rather 3.0.10 OR 3.7 
(currently). Note that Cassandra 3.X (with X > 0) uses a tick-tock release cycle 
where odd numbers are bug fixes only and even numbers introduce new features as well.

Running multiple versions for a long period can induce errors; Cassandra is 
built to handle multiple versions only to give operators the time to run a 
rolling restart. No streaming (adding / removing / repairing nodes) should 
happen during this period. Also, I have seen in the past some cases where 
changing the schema was an issue with multiple versions, leading to schema 
disagreements.

Due to this scenario, a couple boxes are running very high on memory (95% 
usage) whereas some of the older version nodes have just 60-70% memory usage.

Hard to say if this is related to the mutiple versions of Cassandra but it 
could. Are you sure nodes are using the same JVM / GC options 
(cassandra-env.sh) and Java version?

Also, what is exactly "high on memory 95%"? Are we talking about heap or Native 
memory. Isn't the memory used as page cache (that would still be available for 
the system)?

2. To counter #1, I am planning to upgrade system configuration of the nodes 
where there is higher memory usage. But the question is, will it be a problem 
if we have a Cassandra cluster, where in a couple of nodes have double the 
system configuration than other nodes in the cluster.

It is not a problem per se to have distinct configurations on distinct nodes. 
Cassandra does it very well, and it is frequently used to test some 
configuration change on a canary node, to prevent it from impacting the whole 
service.

Yet, all the nodes should be doing the same work (unless you have some 
heterogenous hardware and are using distinct number of vnodes on each node). 
Keeping things homogenous allows the operator to easily compare how nodes are 
doing and it makes reasoning about Cassandra, as well as troubleshooting issues 
a way easier.

So I would:

- Fully upgrade / downgrade asap to a chosen version (3.X is known as being not 
yet stable, but going back to 3.0.X might be more painful)
- Make sure nodes are well balanced and using the same number of ranges 
'nodetool status '
- Make sure the node are using the same Java version and JVM settings.

Hope that helps,

C*heers,
---
Alain Rodriguez - @arodream - 
al...@thelastpickle.com<mailto:al...@thelastpickle.com>
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Query

2016-12-30 Thread Work
Actually, "noSQL" is a misleading misnomer. With C* you have CQL which is 
adapted from SQL syntax and purpose.

For a poster boy, try Netflix.

Regards,

James 

Sent from my iPhone

> On Dec 30, 2016, at 4:59 AM, Sikander Rafiq <hafiz_ra...@hotmail.com> wrote:
> 
> Thanks for your comments/suggestions.
> 
> 
> Yes I understand my project needs and requirements. Surely it requires to 
> handle huge data for what i'm exploring what suits for it.
> 
> 
> Though Cassandra is distributed, scalable and highly available, but it is 
> NoSql means Sql part is missing and needs to be handled.
> 
> 
> 
> Can anyone please tell me some big name who is using Cassandra for handling 
> its huge data sets like Twitter etc.
> 
> 
> 
> Sent from Outlook
> 
> 
>  
> From: Edward Capriolo <edlinuxg...@gmail.com>
> Sent: Friday, December 30, 2016 5:53 AM
> To: user@cassandra.apache.org
> Subject: Re: Query
>  
> You should start with understanding your needs. Once you understand your need 
> you can pick the software that fits your need. Starting with a software stack 
> is backwards.
> 
>> On Thu, Dec 29, 2016 at 11:34 PM, Ben Slater <ben.sla...@instaclustr.com> 
>> wrote:
>> I wasn’t familiar with Gizzard either so I thought I’d take a look. The 
>> first things on their github readme is:
>> NB: This project is currently not recommended as a base for new consumers.
>> (And no commits since 2013)
>> 
>> So, Cassandra definitely looks like a better choice as your datastore for a 
>> new project.
>> 
>> Cheers
>> Ben
>> 
>>> On Fri, 30 Dec 2016 at 12:41 Manoj Khangaonkar <khangaon...@gmail.com> 
>>> wrote:
>>> I am not that familiar with gizzard but with gizzard + mysql , you have 
>>> multiple moving parts in the system that need to managed separately. You'll 
>>> need the mysql expert for mysql and the gizzard expert to manage the 
>>> distributed part. It can be argued that long term this will have higher 
>>> adminstration cost
>>> 
>>> Cassandra's value add is its simple peer to peer architecture that is easy 
>>> to manage - a single database solution that is distributed, scalable, 
>>> highly available etc. In other words, once you gain expertise cassandra, 
>>> you get everything in one package.
>>> 
>>> regards
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Dec 29, 2016 at 4:05 AM, Sikander Rafiq <hafiz_ra...@hotmail.com> 
>>> wrote:
>>> Hi,
>>> 
>>> I'm exploring Cassandra for handling large data sets for mobile app, but 
>>> i'm not clear where it stands.
>>> 
>>> 
>>> If we use MySQL as  underlying database and Gizzard for building custom 
>>> distributed databases (with arbitrary storage technology) and Memcached for 
>>> highly queried data, then where lies Cassandra?
>>> 
>>> 
>>> 
>>> As i have read that Twitter uses both Cassandra and Gizzard. Please explain 
>>> me where Cassandra will act.
>>> 
>>> 
>>> Thanks in advance.
>>> 
>>> 
>>> Regards,
>>> 
>>> Sikander
>>> 
>>> 
>>> 
>>> Sent from Outlook
>>> 
>>> 
>>> 
>>> -- 
>>> http://khangaonkar.blogspot.com/
> 


RE: Query

2016-12-30 Thread SEAN_R_DURITY
A few of the many companies that rely on Cassandra are mentioned here:
http://cassandra.apache.org
Apple, Netflix, Weather Channel, etc.
(Not nearly as good as the Planet Cassandra list that DataStax used to 
maintain. Boo for the Apache/DataStax squabble!)

DataStax has a list of many case studies, too, with their enterprise version of 
Cassandra:
http://www.datastax.com/resources/casestudies


Sean Durity

From: Sikander Rafiq [mailto:hafiz_ra...@hotmail.com]
Sent: Friday, December 30, 2016 8:00 AM
To: user@cassandra.apache.org
Subject: Re: Query


Thanks for your comments/suggestions.



Yes I understand my project needs and requirements. Surely it requires to 
handle huge data for what i'm exploring what suits for it.



Though Cassandra is distributed, scalable and highly available, but it is NoSql 
means Sql part is missing and needs to be handled.



Can anyone please tell me some big name who is using Cassandra for handling its 
huge data sets like Twitter etc.





Sent from Outlook<http://aka.ms/weboutlook>


From: Edward Capriolo <edlinuxg...@gmail.com<mailto:edlinuxg...@gmail.com>>
Sent: Friday, December 30, 2016 5:53 AM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: Query

You should start with understanding your needs. Once you understand your need 
you can pick the software that fits your need. Starting with a software stack is 
backwards.

On Thu, Dec 29, 2016 at 11:34 PM, Ben Slater 
<ben.sla...@instaclustr.com<mailto:ben.sla...@instaclustr.com>> wrote:
I wasn't familiar with Gizzard either so I thought I'd take a look. The first 
things on their github readme is:
NB: This project is currently not recommended as a base for new consumers.
(And no commits since 2013)

So, Cassandra definitely looks like a better choice as your datastore for a new 
project.

Cheers
Ben

On Fri, 30 Dec 2016 at 12:41 Manoj Khangaonkar 
<khangaon...@gmail.com<mailto:khangaon...@gmail.com>> wrote:
I am not that familiar with gizzard but with gizzard + mysql , you have 
multiple moving parts in the system that need to managed separately. You'll 
need the mysql expert for mysql and the gizzard expert to manage the 
distributed part. It can be argued that long term this will have higher 
adminstration cost
Cassandra's value add is its simple peer to peer architecture that is easy to 
manage - a single database solution that is distributed, scalable, highly 
available etc. In other words, once you gain expertise cassandra, you get 
everything in one package.
regards




On Thu, Dec 29, 2016 at 4:05 AM, Sikander Rafiq 
<hafiz_ra...@hotmail.com<mailto:hafiz_ra...@hotmail.com>> wrote:

Hi,

I'm exploring Cassandra for handling large data sets for mobile app, but i'm 
not clear where it stands.



If we use MySQL as  underlying database and Gizzard for building custom 
distributed databases (with arbitrary storage technology) and Memcached for 
highly queried data, then where lies Cassandra?



As i have read that Twitter uses both Cassandra and Gizzard. Please explain me 
where Cassandra will act.



Thanks in advance.



Regards,

Sikander




Sent from Outlook<http://aka.ms/weboutlook>


--
http://khangaonkar.blogspot.com/





Re: Query

2016-12-30 Thread Sikander Rafiq
Thanks for your comments/suggestions.


Yes, I understand my project's needs and requirements. It certainly requires 
handling huge data, which is why I'm exploring what suits it best.


Though Cassandra is distributed, scalable and highly available, it is NoSQL, 
which means the SQL part is missing and needs to be handled.


Can anyone please tell me some big names that are using Cassandra to handle 
huge data sets, like Twitter etc.?



Sent from Outlook<http://aka.ms/weboutlook>



From: Edward Capriolo <edlinuxg...@gmail.com>
Sent: Friday, December 30, 2016 5:53 AM
To: user@cassandra.apache.org
Subject: Re: Query

You should start with understanding your needs. Once you understand your need 
you can pick the software that fits your need. Starting with a software stack is 
backwards.

On Thu, Dec 29, 2016 at 11:34 PM, Ben Slater 
<ben.sla...@instaclustr.com<mailto:ben.sla...@instaclustr.com>> wrote:
I wasn't familiar with Gizzard either so I thought I'd take a look. The first 
things on their github readme is:
NB: This project is currently not recommended as a base for new consumers.
(And no commits since 2013)

So, Cassandra definitely looks like a better choice as your datastore for a new 
project.

Cheers
Ben

On Fri, 30 Dec 2016 at 12:41 Manoj Khangaonkar 
<khangaon...@gmail.com<mailto:khangaon...@gmail.com>> wrote:
I am not that familiar with gizzard but with gizzard + mysql , you have 
multiple moving parts in the system that need to managed separately. You'll 
need the mysql expert for mysql and the gizzard expert to manage the 
distributed part. It can be argued that long term this will have higher 
adminstration cost

Cassandra's value add is its simple peer to peer architecture that is easy to 
manage - a single database solution that is distributed, scalable, highly 
available etc. In other words, once you gain expertise cassandra, you get 
everything in one package.

regards





On Thu, Dec 29, 2016 at 4:05 AM, Sikander Rafiq 
<hafiz_ra...@hotmail.com<mailto:hafiz_ra...@hotmail.com>> wrote:

Hi,

I'm exploring Cassandra for handling large data sets for mobile app, but i'm 
not clear where it stands.


If we use MySQL as  underlying database and Gizzard for building custom 
distributed databases (with arbitrary storage technology) and Memcached for 
highly queried data, then where lies Cassandra?


As i have read that Twitter uses both Cassandra and Gizzard. Please explain me 
where Cassandra will act.


Thanks in advance.


Regards,

Sikander



Sent from Outlook<http://aka.ms/weboutlook>



--
http://khangaonkar.blogspot.com/



Re: Query

2016-12-29 Thread Edward Capriolo
You should start with understanding your needs. Once you understand your
needs you can pick the software that fits them. Starting with a software
stack is backwards.

On Thu, Dec 29, 2016 at 11:34 PM, Ben Slater 
wrote:

> I wasn’t familiar with Gizzard either so I thought I’d take a look. The
> first things on their github readme is:
> *NB: This project is currently not recommended as a base for new
> consumers.*
> (And no commits since 2013)
>
> So, Cassandra definitely looks like a better choice as your datastore for
> a new project.
>
> Cheers
> Ben
>
> On Fri, 30 Dec 2016 at 12:41 Manoj Khangaonkar 
> wrote:
>
>> I am not that familiar with gizzard but with gizzard + mysql , you have
>> multiple moving parts in the system that need to managed separately. You'll
>> need the mysql expert for mysql and the gizzard expert to manage the
>> distributed part. It can be argued that long term this will have higher
>> adminstration cost
>>
>> Cassandra's value add is its simple peer to peer architecture that is
>> easy to manage - a single database solution that is distributed, scalable,
>> highly available etc. In other words, once you gain expertise cassandra,
>> you get everything in one package.
>>
>> regards
>>
>>
>>
>>
>>
>> On Thu, Dec 29, 2016 at 4:05 AM, Sikander Rafiq 
>> wrote:
>>
>> Hi,
>>
>> I'm exploring Cassandra for handling large data sets for mobile app, but
>> i'm not clear where it stands.
>>
>>
>> If we use MySQL as  underlying database and Gizzard for building custom
>> distributed databases (with arbitrary storage technology) and Memcached for
>> highly queried data, then where lies Cassandra?
>>
>>
>> As i have read that Twitter uses both Cassandra and Gizzard. Please
>> explain me where Cassandra will act.
>>
>>
>> Thanks in advance.
>>
>>
>> Regards,
>>
>> Sikander
>>
>>
>> Sent from Outlook 
>>
>>
>>
>>
>> --
>> http://khangaonkar.blogspot.com/
>>
>


Re: Query

2016-12-29 Thread Ben Slater
I wasn’t familiar with Gizzard either so I thought I’d take a look. The
first things on their github readme is:
*NB: This project is currently not recommended as a base for new consumers.*
(And no commits since 2013)

So, Cassandra definitely looks like a better choice as your datastore for a
new project.

Cheers
Ben

On Fri, 30 Dec 2016 at 12:41 Manoj Khangaonkar 
wrote:

> I am not that familiar with gizzard but with gizzard + mysql , you have
> multiple moving parts in the system that need to managed separately. You'll
> need the mysql expert for mysql and the gizzard expert to manage the
> distributed part. It can be argued that long term this will have higher
> adminstration cost
>
> Cassandra's value add is its simple peer to peer architecture that is easy
> to manage - a single database solution that is distributed, scalable,
> highly available etc. In other words, once you gain expertise cassandra,
> you get everything in one package.
>
> regards
>
>
>
>
>
> On Thu, Dec 29, 2016 at 4:05 AM, Sikander Rafiq 
> wrote:
>
> Hi,
>
> I'm exploring Cassandra for handling large data sets for mobile app, but
> i'm not clear where it stands.
>
>
> If we use MySQL as  underlying database and Gizzard for building custom
> distributed databases (with arbitrary storage technology) and Memcached for
> highly queried data, then where lies Cassandra?
>
>
> As i have read that Twitter uses both Cassandra and Gizzard. Please
> explain me where Cassandra will act.
>
>
> Thanks in advance.
>
>
> Regards,
>
> Sikander
>
>
> Sent from Outlook 
>
>
>
>
> --
> http://khangaonkar.blogspot.com/
>


Re: Query

2016-12-29 Thread Manoj Khangaonkar
I am not that familiar with Gizzard, but with Gizzard + MySQL you have
multiple moving parts in the system that need to be managed separately. You'll
need the MySQL expert for MySQL and the Gizzard expert to manage the
distributed part. It can be argued that long term this will have a higher
administration cost.

Cassandra's value add is its simple peer to peer architecture that is easy
to manage - a single database solution that is distributed, scalable,
highly available etc. In other words, once you gain expertise cassandra,
you get everything in one package.

regards





On Thu, Dec 29, 2016 at 4:05 AM, Sikander Rafiq 
wrote:

> Hi,
>
> I'm exploring Cassandra for handling large data sets for mobile app, but
> i'm not clear where it stands.
>
>
> If we use MySQL as  underlying database and Gizzard for building custom
> distributed databases (with arbitrary storage technology) and Memcached for
> highly queried data, then where lies Cassandra?
>
>
> As i have read that Twitter uses both Cassandra and Gizzard. Please
> explain me where Cassandra will act.
>
>
> Thanks in advance.
>
>
> Regards,
>
> Sikander
>
>
> Sent from Outlook 
>



-- 
http://khangaonkar.blogspot.com/


Re: Query on Cassandra clusters

2016-12-21 Thread Sumit Anvekar
Thank you Alain for the detailed explanation.

To answer your question on Java version, JVM settings and memory usage: we
are using 1.8.0_45, precisely
>java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

JVM settings are identical on all nodes (cassandra-env.sh is identical).

Further, when I say high memory usage: Cassandra is using heap
(-Xmx3767M) and off-heap of about 6GB out of the total system memory of
14.7 GB. Along with this there are other processes running on this system,
which brings the overall memory usage to >95%. This brings me to another
point: is *heap memory* + *off-heap (the sum of the Space used
(total) values from nodetool cfstats)* the total memory used by Cassandra on a
node?

Also, on the disk front, what is a good amount of empty space to leave
unused on the partition (~50%, should it be?) considering we use the
SizeTieredCompaction strategy?

On Wed, Dec 21, 2016 at 6:30 PM, Alain RODRIGUEZ  wrote:

> Hi Sumit,
>
> 1. I have a Cassandra cluster with 11 nodes, 5 of which have Cassandra
>> version 3.0.3 and then newer 5 nodes have 3.6.0 version.
>
>
> I strongly recommend to:
>
>
>- Stick with one version of Apache Cassandra per cluster.
>- Always be as close as possible from the last minor release of the
>Cassandra version in use.
>
>
> So you *really should* not be using 3.0.6 *AND* 3.6.0 but rather 3.0.10
> *OR* 3.7 (currently). Note that Cassandra 3.X (with X > 0) uses a tick-tock
> release cycle where odd numbers are bug fixes only and even numbers introduce
> new features as well.
>
> Running multiple versions for a long period can induce errors; Cassandra
> is built to handle multiple versions only to give operators the time to
> run a rolling restart. No streaming (adding / removing / repairing nodes)
> should happen during this period. Also, I have seen in the past some cases
> where changing the schema was an issue with multiple versions, leading
> to schema disagreements.
>
> Due to this scenario, a couple boxes are running very high on memory (95%
>> usage) whereas some of the older version nodes have just 60-70% memory
>> usage.
>
>
> Hard to say if this is related to the mutiple versions of Cassandra but it
> could. Are you sure nodes are using the same JVM / GC options
> (cassandra-env.sh) and Java version?
>
> Also, what is exactly "high on memory 95%"? Are we talking about heap or
> Native memory. Isn't the memory used as page cache (that would still be
> available for the system)?
>
> 2. To counter #1, I am planning to upgrade system configuration of the
>> nodes where there is higher memory usage. But the question is, will it be a
>> problem if we have a Cassandra cluster, where in a couple of nodes have
>> double the system configuration than other nodes in the cluster.
>>
>
> It is not a problem per se to have distinct configurations on distinct
> nodes. Cassandra does it very well, and it is frequently used to test some
> configuration change on a canary node, to prevent it from impacting the
> whole service.
>
> Yet, all the nodes should be doing the same work (unless you have some
> heterogenous hardware and are using distinct number of vnodes on each
> node). Keeping things homogenous allows the operator to easily compare how
> nodes are doing and it makes reasoning about Cassandra, as well as
> troubleshooting issues a way easier.
>
> So I would:
>
> - Fully upgrade / downgrade asap to a chosen version (3.X is known as
> being not yet stable, but going back to 3.0.X might be more painful)
> - Make sure nodes are well balanced and using the same number of ranges
> 'nodetool status '
> - Make sure the node are using the same Java version and JVM settings.
>
> Hope that helps,
>
> C*heers,
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2016-12-21 8:22 GMT+01:00 Sumit Anvekar :
>
>> I have a couple questions.
>>
>> 1. I have a Cassandra cluster with 11 nodes, 5 of which have Cassandra
>> version 3.0.3 and then newer 5 nodes have 3.6.0 version. It has been running
>> fine until recently I am seeing higher amount of data residing in newer
>> boxes. The configuration file (YAML file) is exactly same on all nodes
>> (except for the node host names). Wondering if the version has something to
>> do with this scenario. Due to this scenario, a couple boxes are running
>> very high on memory (95% usage) whereas some of the older version nodes
>> have just 60-70% memory usage.
>>
>> 2. To counter #1, I am planning to upgrade system configuration of the
>> nodes where there is higher memory usage. But the question is, will it be a
>> problem if we have a Cassandra cluster, where in a couple of nodes have
>> double the system configuration than other nodes in the cluster.
>>
>> 

Re: Query on Cassandra clusters

2016-12-21 Thread Alain RODRIGUEZ
Hi Sumit,

1. I have a Cassandra cluster with 11 nodes, 5 of which have Cassandra
> version 3.0.3 and then newer 5 nodes have 3.6.0 version.


I strongly recommend to:


   - Stick with one version of Apache Cassandra per cluster.
   - Always be as close as possible from the last minor release of the
   Cassandra version in use.


So you *really should* not be using 3.0.6 *AND* 3.6.0 but rather 3.0.10 *OR*
3.7 (currently). Note that Cassandra 3.X (with X > 0) uses a tick-tock
release cycle where odd numbers are bug fixes only and even numbers introduce
new features as well.

Running multiple versions for a long period can induce errors; Cassandra is
built to handle multiple versions only to give operators the time to run
a rolling restart. No streaming (adding / removing / repairing nodes)
should happen during this period. Also, I have seen in the past some cases
where changing the schema was an issue with multiple versions, leading
to schema disagreements.

Due to this scenario, a couple boxes are running very high on memory (95%
> usage) whereas some of the older version nodes have just 60-70% memory
> usage.


Hard to say if this is related to the multiple versions of Cassandra, but it
could be. Are you sure the nodes are using the same JVM / GC options
(cassandra-env.sh) and Java version?

Also, what is exactly "high on memory 95%"? Are we talking about heap or
Native memory. Isn't the memory used as page cache (that would still be
available for the system)?

2. To counter #1, I am planning to upgrade system configuration of the
> nodes where there is higher memory usage. But the question is, will it be a
> problem if we have a Cassandra cluster, where in a couple of nodes have
> double the system configuration than other nodes in the cluster.
>

It is not a problem per se to have distinct configurations on distinct
nodes. Cassandra does it very well, and it is frequently used to test some
configuration change on a canary node, to prevent it from impacting the
whole service.

Yet, all the nodes should be doing the same work (unless you have some
heterogeneous hardware and are using a distinct number of vnodes on each
node). Keeping things homogeneous allows the operator to easily compare how
nodes are doing, and it makes reasoning about Cassandra, as well as
troubleshooting issues, much easier.

So I would:

- Fully upgrade / downgrade asap to a chosen version (3.X is known as being
not yet stable, but going back to 3.0.X might be more painful)
- Make sure nodes are well balanced and using the same number of ranges
'nodetool status '
- Make sure the node are using the same Java version and JVM settings.

Hope that helps,

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-12-21 8:22 GMT+01:00 Sumit Anvekar :

> I have a couple questions.
>
> 1. I have a Cassandra cluster with 11 nodes, 5 of which have Cassandra
> version 3.0.3 and then newer 5 nodes have 3.6.0 version. It has been running
> fine until recently I am seeing higher amount of data residing in newer
> boxes. The configuration file (YAML file) is exactly same on all nodes
> (except for the node host names). Wondering if the version has something to
> do with this scenario. Due to this scenario, a couple boxes are running
> very high on memory (95% usage) whereas some of the older version nodes
> have just 60-70% memory usage.
>
> 2. To counter #1, I am planning to upgrade system configuration of the
> nodes where there is higher memory usage. But the question is, will it be a
> problem if we have a Cassandra cluster, where in a couple of nodes have
> double the system configuration than other nodes in the cluster.
>
> Appreciate any comment on the same.
>
> Sumit.
>


Re: Query regarding spark on cassandra

2016-04-28 Thread Siddharth Verma
Anyways, thanks for your reply.


On Thu, Apr 28, 2016 at 1:59 PM, Hannu Kröger  wrote:

> Ok, then I don’t understand the problem.
>
> Hannu
>
> On 28 Apr 2016, at 11:19, Siddharth Verma 
> wrote:
>
> Hi Hannu,
>
> Had the issue been caused due to read, the insert, and delete statement
> would have been erroneous.
> "I saw the stdout from web-ui of spark, and the query along with true was
> printed for both the queries.".
> The statements were correct as seen on the UI.
> Thanks,
> Siddharth Verma
>
>
>
> On Thu, Apr 28, 2016 at 1:22 PM, Hannu Kröger  wrote:
>
>> Hi,
>>
>> could it be consistency level issue? If you use ONE for reads and writes,
>> might be that sometimes you don't get what you are writing.
>>
>> See:
>>
>> https://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
>>
>> Br,
>> Hannu
>>
>>
>> 2016-04-27 20:41 GMT+03:00 Siddharth Verma 
>> :
>>
>>> Hi,
>>> I dont know, if someone has faced this problem or not.
>>> I am running a job where some data is loaded from cassandra table. From
>>> that data, i make some insert and delete statements.
>>> and execute it (using forEach)
>>>
>>> Code snippet:
>>> boolean deleteStatus=
>>> connector.openSession().execute(delete).wasApplied();
>>> boolean  insertStatus =
>>> connector.openSession().execute(insert).wasApplied();
>>> System.out.println(delete+":"+deleteStatus);
>>> System.out.println(insert+":"+insertStatus);
>>>
>>> When i run it locally, i see the respective results in the table.
>>>
>>> However when i run it on a cluster, sometimes the result is displayed
>>> and sometime the changes don't take place.
>>> I saw the stdout from web-ui of spark, and the query along with true was
>>> printed for both the queries.
>>>
>>> I can't understand, what could be the issue.
>>>
>>> Any help would be appreciated.
>>>
>>> Thanks,
>>> Siddharth Verma
>>>
>>
>>
>
>


Re: Query regarding spark on cassandra

2016-04-28 Thread Hannu Kröger
Ok, then I don’t understand the problem.

Hannu

> On 28 Apr 2016, at 11:19, Siddharth Verma  
> wrote:
> 
> Hi Hannu,
> 
> Had the issue been caused due to read, the insert, and delete statement would 
> have been erroneous.
> "I saw the stdout from web-ui of spark, and the query along with true was 
> printed for both the queries.".
> The statements were correct as seen on the UI.
> Thanks,
> Siddharth Verma
> 
> 
> 
> On Thu, Apr 28, 2016 at 1:22 PM, Hannu Kröger  > wrote:
> Hi,
> 
> could it be consistency level issue? If you use ONE for reads and writes, 
> might be that sometimes you don't get what you are writing.
> 
> See:
> https://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
>  
> 
> 
> Br,
> Hannu
> 
> 
> 2016-04-27 20:41 GMT+03:00 Siddharth Verma  >:
> Hi,
> I dont know, if someone has faced this problem or not.
> I am running a job where some data is loaded from cassandra table. From that 
> data, i make some insert and delete statements.
> and execute it (using forEach)
> 
> Code snippet:
> boolean deleteStatus= connector.openSession().execute(delete).wasApplied();
> boolean  insertStatus = connector.openSession().execute(insert).wasApplied();
> System.out.println(delete+":"+deleteStatus);
> System.out.println(insert+":"+insertStatus);
> 
> When i run it locally, i see the respective results in the table.
> 
> However when i run it on a cluster, sometimes the result is displayed and 
> sometime the changes don't take place.
> I saw the stdout from web-ui of spark, and the query along with true was 
> printed for both the queries.
> 
> I can't understand, what could be the issue.
> 
> Any help would be appreciated.
> 
> Thanks,
> Siddharth Verma
> 
> 



Re: Query regarding spark on cassandra

2016-04-28 Thread Siddharth Verma
Hi Hannu,

Had the issue been caused due to read, the insert, and delete statement
would have been erroneous.
"I saw the stdout from web-ui of spark, and the query along with true was
printed for both the queries.".
The statements were correct as seen on the UI.
Thanks,
Siddharth Verma



On Thu, Apr 28, 2016 at 1:22 PM, Hannu Kröger  wrote:

> Hi,
>
> could it be consistency level issue? If you use ONE for reads and writes,
> might be that sometimes you don't get what you are writing.
>
> See:
>
> https://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
>
> Br,
> Hannu
>
>
> 2016-04-27 20:41 GMT+03:00 Siddharth Verma :
>
>> Hi,
>> I dont know, if someone has faced this problem or not.
>> I am running a job where some data is loaded from cassandra table. From
>> that data, i make some insert and delete statements.
>> and execute it (using forEach)
>>
>> Code snippet:
>> boolean deleteStatus=
>> connector.openSession().execute(delete).wasApplied();
>> boolean  insertStatus =
>> connector.openSession().execute(insert).wasApplied();
>> System.out.println(delete+":"+deleteStatus);
>> System.out.println(insert+":"+insertStatus);
>>
>> When i run it locally, i see the respective results in the table.
>>
>> However when i run it on a cluster, sometimes the result is displayed and
>> sometime the changes don't take place.
>> I saw the stdout from web-ui of spark, and the query along with true was
>> printed for both the queries.
>>
>> I can't understand, what could be the issue.
>>
>> Any help would be appreciated.
>>
>> Thanks,
>> Siddharth Verma
>>
>
>


Re: Query regarding spark on cassandra

2016-04-28 Thread Hannu Kröger
Hi,

could it be consistency level issue? If you use ONE for reads and writes,
might be that sometimes you don't get what you are writing.

See:
https://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
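For reference, pinning the consistency level per statement looks roughly like
this with the Python driver (a sketch only; keyspace, table and contact point
are placeholders, and the Spark connector exposes similar read/write
consistency settings):

```
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['10.0.0.1']).connect('my_keyspace')

# Require a quorum of local replicas to acknowledge the write.
insert = SimpleStatement(
    "INSERT INTO my_table (id, val) VALUES (1, 'x')",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)
session.execute(insert)
```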

Br,
Hannu


2016-04-27 20:41 GMT+03:00 Siddharth Verma :

> Hi,
> I dont know, if someone has faced this problem or not.
> I am running a job where some data is loaded from cassandra table. From
> that data, i make some insert and delete statements.
> and execute it (using forEach)
>
> Code snippet:
> boolean deleteStatus= connector.openSession().execute(delete).wasApplied();
> boolean  insertStatus =
> connector.openSession().execute(insert).wasApplied();
> System.out.println(delete+":"+deleteStatus);
> System.out.println(insert+":"+insertStatus);
>
> When i run it locally, i see the respective results in the table.
>
> However when i run it on a cluster, sometimes the result is displayed and
> sometime the changes don't take place.
> I saw the stdout from web-ui of spark, and the query along with true was
> printed for both the queries.
>
> I can't understand, what could be the issue.
>
> Any help would be appreciated.
>
> Thanks,
> Siddharth Verma
>


Re: Query regarding spark on cassandra

2016-04-28 Thread Siddharth Verma
Edit:
1. dc2 node has been removed.
nodetool status shows only active nodes.
2. Repair done on all nodes.
3. Cassandra restarted

Still it doesn't solve the problem.

On Thu, Apr 28, 2016 at 9:00 AM, Siddharth Verma <
verma.siddha...@snapdeal.com> wrote:

> Hi, If the info could be used
> we are using two DCs
> dc1 - 3 nodes
> dc2 - 1 node
> however, dc2 has been down for 3-4 weeks, and we haven't removed it yet.
>
> spark slaves on same machines as the cassandra nodes.
> each node has two instances of slaves.
>
> spark master on a separate machine.
>
> If anyone could provide insight to the problem, it would be helpful.
>
> Thanks
>
> On Wed, Apr 27, 2016 at 11:11 PM, Siddharth Verma <
> verma.siddha...@snapdeal.com> wrote:
>
>> Hi,
>> I dont know, if someone has faced this problem or not.
>> I am running a job where some data is loaded from cassandra table. From
>> that data, i make some insert and delete statements.
>> and execute it (using forEach)
>>
>> Code snippet:
>> boolean deleteStatus=
>> connector.openSession().execute(delete).wasApplied();
>> boolean  insertStatus =
>> connector.openSession().execute(insert).wasApplied();
>> System.out.println(delete+":"+deleteStatus);
>> System.out.println(insert+":"+insertStatus);
>>
>> When i run it locally, i see the respective results in the table.
>>
>> However when i run it on a cluster, sometimes the result is displayed and
>> sometime the changes don't take place.
>> I saw the stdout from web-ui of spark, and the query along with true was
>> printed for both the queries.
>>
>> I can't understand, what could be the issue.
>>
>> Any help would be appreciated.
>>
>> Thanks,
>> Siddharth Verma
>>
>
>


Re: Query regarding spark on cassandra

2016-04-27 Thread Siddharth Verma
Hi, If the info could be used
we are using two DCs
dc1 - 3 nodes
dc2 - 1 node
however, dc2 has been down for 3-4 weeks, and we haven't removed it yet.

spark slaves on same machines as the cassandra nodes.
each node has two instances of slaves.

spark master on a separate machine.

If anyone could provide insight to the problem, it would be helpful.

Thanks

On Wed, Apr 27, 2016 at 11:11 PM, Siddharth Verma <
verma.siddha...@snapdeal.com> wrote:

> Hi,
> I dont know, if someone has faced this problem or not.
> I am running a job where some data is loaded from cassandra table. From
> that data, i make some insert and delete statements.
> and execute it (using forEach)
>
> Code snippet:
> boolean deleteStatus= connector.openSession().execute(delete).wasApplied();
> boolean  insertStatus =
> connector.openSession().execute(insert).wasApplied();
> System.out.println(delete+":"+deleteStatus);
> System.out.println(insert+":"+insertStatus);
>
> When i run it locally, i see the respective results in the table.
>
> However when i run it on a cluster, sometimes the result is displayed and
> sometime the changes don't take place.
> I saw the stdout from web-ui of spark, and the query along with true was
> printed for both the queries.
>
> I can't understand, what could be the issue.
>
> Any help would be appreciated.
>
> Thanks,
> Siddharth Verma
>


Re: Query regarding CassandraJavaRDD while running spark job on cassandra

2016-03-24 Thread Kai Wang
I suggest you post this to spark-cassandra-connector list.

On Sat, Mar 12, 2016 at 12:52 AM, Siddharth Verma <
verma.siddha...@snapdeal.com> wrote:

> In cassandra I have a table with the following schema.
>
> CREATE TABLE my_keyspace.my_table1 (
> col_1 text,
> col_2 text,
> col_3 text,
> col_4 text,
> col_5 text,
> col_6 text,
> col_7 text,
> PRIMARY KEY (col_1, col_2, col_3)
> ) WITH CLUSTERING ORDER BY (col_2 ASC, col_3 ASC);
>
> For processing I create a spark job.
>
> CassandraJavaRDD data1 =
> function.cassandraTable("my_keyspace", "my_table1")
>
>
> 1. Does it guarantee mutual exclusivity of fetched rows across all RDDs
> which are on worker nodes?
> (At the cost of redundancy and verbosity, I will reiterate.
> Suppose I have an entry in the table : ('1','2','3','4','5','6','7')
> What I mean to ask is, when I perform transformations/actions on data1
> RDD), can I be sure that the above entry will be present on ONLY ONE worker
> node?)
>
> 2. All the data pertaining to one partition will be on one node?
> (Suppose I have the following entries in the table :
> ('p1','c2_1','c3_1','4','5','6','7')
> ('p1','c2_2','c3'_2,'4','5','6','7')
> ('p1','c2_3','c3_3','4','5','6','7')
> ('p1','c2_4','c3_4','4','5','6','7')
> ('p1' )
> ('p1' )
> ('p1' )
> All the data for the same partition will be present on only one node?
> )
>
> 3. If i have a DC specifically for analytics, and I place the spark worker
> on the same machines as cassandra node, for that entire DC.
> Can I make sure that the spark worker fetches the data from the token
> range present on that node? (i.e. the node doesn't fetch data present on a
> different node)
> 3.1 (as with the above statement which doesn't have a 'where' clause).
> 3.2 (as with the above statement which has a 'where' clause).
>


Re: Query Consistency Issues...

2015-12-15 Thread Paulo Motta
What cassandra and driver versions are you running?

It may be that the second update is getting the same timestamp as the
first, or even a lower timestamp if it's being processed by another server
with unsynced clock, so that update may be getting lost.

If you have high frequency updates in the same partition from the same
client you should probably use client-side timestamps with a configured
timestamp generator on the driver, available in Cassandra 2.1 and Java
driver 2.1.2, and default in java driver 3.0.

For more information:
- http://www.datastax.com/dev/blog/java-driver-2-1-2-native-protocol-v3
- https://datastax.github.io/java-driver/features/query_timestamps/
-
https://docs.datastax.com/en/developer/cpp-driver/2.1/cpp-driver/reference/clientsideTimestamps.html
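
For reference, a minimal sketch of wiring a client-side timestamp generator into the Java
driver (2.1.2 or later); the contact point and keyspace are placeholders:

import com.datastax.driver.core.AtomicMonotonicTimestampGenerator;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Every statement sent through this Cluster gets a client-side, strictly
// increasing timestamp, so back-to-back updates from the same client can no
// longer tie or go backwards because of coordinator clock differences.
Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")                                  // placeholder
        .withTimestampGenerator(new AtomicMonotonicTimestampGenerator())
        .build();
Session session = cluster.connect("my_keyspace");                      // placeholder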

2015-12-15 11:36 GMT-08:00 James Carman :

> We are encountering a situation in our environment (a 6-node Cassandra
> ring) where we are trying to insert a row and then immediately update it,
> using LOCAL_QUORUM consistency (replication factor = 3).  I have replicated
> the issue using the following code:
>
> https://gist.github.com/jwcarman/72714e6d0ea3508e24cc
>
> Should we expect this to work?  Should LOCAL_QUORUM be sufficient?  If so,
> what type of setup issues would we look for which would cause these types
> of issues?
>
> Thanks,
>
> James
>


Re: Query Consistency Issues...

2015-12-15 Thread James Carman
On Tue, Dec 15, 2015 at 2:57 PM Paulo Motta 
wrote:

> What cassandra and driver versions are you running?
>
>
We are using 2.1.7.1


> It may be that the second update is getting the same timestamp as the
> first, or even a lower timestamp if it's being processed by another server
> with unsynced clock, so that update may be getting lost.
>
>
So, we need to look for clock sync issues between nodes in our ring?  How
close do they need to be?


> If you have high frequency updates in the same partition from the same
> client you should probably use client-side timestamps with a configured
> timestamp generator on the driver, available in Cassandra 2.1 and Java
> driver 2.1.2, and default in java driver 3.0.
>
>
Very cool!  If we have multiple nodes in our application, I suppose *their*
clocks will have to be sync'ed for this to work, right?


Re: Query Consistency Issues...

2015-12-15 Thread Steve Robenalt
I agree with Jon. It's almost a statistical certainty that such updates
will be processed out of order some of the time because the clock sync
between machines will never be perfect.

Depending on how your actual code that shows this problem is structured,
there are ways to reduce or eliminate such issues. If the successive
updates are always expected to occur together in a specific order, you can
wrap them in a BatchStatement, which forces them to use the same
coordinator node and thus preserves the ordering of the updates. If there
is a causal relationship driving the order of the updates, a Light Weight
Transaction might be appropriate. Another strategy is to publish an event
to a topic after the first update and a subscriber can then trigger the
second.
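
A minimal sketch of the BatchStatement option with the Java driver; "session",
"insertStatement" and "updateStatement" stand in for the application's session and its two
prepared/bound statements:

import com.datastax.driver.core.BatchStatement;

// Groups the two mutations so they are sent together through one coordinator,
// as described above.
BatchStatement batch = new BatchStatement();   // LOGGED batch by default
batch.add(insertStatement);
batch.add(updateStatement);
session.execute(batch);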

There are other options, but I've used the above 3 to solve this problem
whenever I've encountered this situation and haven't found a case where I
needed another.

HTH,
Steve

On Tue, Dec 15, 2015 at 12:56 PM, Jonathan Haddad  wrote:

> High volume updates to a single key in a distributed system that relies on
> a timestamp for conflict resolution is not a particularly great idea.  If
> you ever do this from multiple clients you'll find unexpected results at
> least some of the time.
>
> On Tue, Dec 15, 2015 at 12:41 PM Paulo Motta 
> wrote:
>
>> > We are using 2.1.7.1
>>
>> Then you should be able to use the java driver timestamp generators.
>>
>> > So, we need to look for clock sync issues between nodes in our ring?
>> How close do they need to be?
>>
>> millisecond precision, since that is the server precision for timestamps,
>> so NTP should probably do the job. If your application has sub-millisecond
>> updates in the same partitions, you'd probably need to use client-side
>> timestamps anyway, since they allow setting timestamps with sub-ms
>> precision.
>>
>> > Very cool!  If we have multiple nodes in our application, I suppose
>> *their* clocks will have to be sync'ed for this to work, right?
>>
>> correct, you may also use ntp to synchronize clocks between clients.
>>
>>
>> 2015-12-15 12:19 GMT-08:00 James Carman :
>>
>>>
>>>
>>> On Tue, Dec 15, 2015 at 2:57 PM Paulo Motta 
>>> wrote:
>>>
 What cassandra and driver versions are you running?


>>> We are using 2.1.7.1
>>>
>>>
 It may be that the second update is getting the same timestamp as the
 first, or even a lower timestamp if it's being processed by another server
 with unsynced clock, so that update may be getting lost.


>>> So, we need to look for clock sync issues between nodes in our ring?
>>> How close do they need to be?
>>>
>>>
 If you have high frequency updates in the same partition from the same
 client you should probably use client-side timestamps with a configured
 timestamp generator on the driver, available in Cassandra 2.1 and Java
 driver 2.1.2, and default in java driver 3.0.


>>> Very cool!  If we have multiple nodes in our application, I suppose
>>> *their* clocks will have to be sync'ed for this to work, right?
>>>
>>
>>


-- 
Steve Robenalt
Software Architect
sroben...@highwire.org 
(office/cell): 916-505-1785

HighWire Press, Inc.
425 Broadway St, Redwood City, CA 94063
www.highwire.org

Technology for Scholarly Communication


Re: Query Consistency Issues...

2015-12-15 Thread Jonathan Haddad
High volume updates to a single key in a distributed system that relies on
a timestamp for conflict resolution is not a particularly great idea.  If
you ever do this from multiple clients you'll find unexpected results at
least some of the time.

On Tue, Dec 15, 2015 at 12:41 PM Paulo Motta 
wrote:

> > We are using 2.1.7.1
>
> Then you should be able to use the java driver timestamp generators.
>
> > So, we need to look for clock sync issues between nodes in our ring?
> How close do they need to be?
>
> millisecond precision, since that is the server precision for timestamps,
> so NTP should probably do the job. If your application has sub-millisecond
> updates in the same partitions, you'd probably need to use client-side
> timestamps anyway, since they allow setting timestamps with sub-ms
> precision.
>
> > Very cool!  If we have multiple nodes in our application, I suppose
> *their* clocks will have to be sync'ed for this to work, right?
>
> correct, you may also use ntp to synchronize clocks between clients.
>
>
> 2015-12-15 12:19 GMT-08:00 James Carman :
>
>>
>>
>> On Tue, Dec 15, 2015 at 2:57 PM Paulo Motta 
>> wrote:
>>
>>> What cassandra and driver versions are you running?
>>>
>>>
>> We are using 2.1.7.1
>>
>>
>>> It may be that the second update is getting the same timestamp as the
>>> first, or even a lower timestamp if it's being processed by another server
>>> with unsynced clock, so that update may be getting lost.
>>>
>>>
>> So, we need to look for clock sync issues between nodes in our ring?  How
>> close do they need to be?
>>
>>
>>> If you have high frequency updates in the same partition from the same
>>> client you should probably use client-side timestamps with a configured
>>> timestamp generator on the driver, available in Cassandra 2.1 and Java
>>> driver 2.1.2, and default in java driver 3.0.
>>>
>>>
>> Very cool!  If we have multiple nodes in our application, I suppose
>> *their* clocks will have to be sync'ed for this to work, right?
>>
>
>


Re: query statement return empty

2015-07-30 Thread Jeff Jirsa
What consistency level are you using with your query?
What replication factor are you using on your keyspace?
Have you run repair?

The most likely explanation is that you wrote with low consistency (ANY, ONE, 
etc), and that one or more replicas does not have the cell. You’re then reading 
with low consistency (ONE, etc), and occasionally the coordinator chooses a 
replica without the data, so it is returning an empty result.

You can either increase your consistency level on reads and/or writes, or you 
can run repair to get the data on all nodes.
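
For example, from cqlsh you can raise the read consistency for the session before re-running
the query (QUORUM here is only an illustration; whether it helps depends on your replication
factor and on the consistency the writes used):

cqlsh> CONSISTENCY QUORUM;
cqlsh> select ratio from t_test where id = 1;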



From:  鄢来琼
Reply-To:  user@cassandra.apache.org
Date:  Thursday, July 30, 2015 at 6:02 PM
To:  user@cassandra.apache.org
Subject:  query statement return empty

Hi ALL

 

The result of the “select * from t_test where id = 1” statement is not consistent.

Could you tell me why?

 

test case:

i = 0
while i < 5:
    result = cassandra_session.execute("select ratio from t_test where id = 1")
    print result
    i += 1

 

testing result:

[Row(ratio=Decimal('0.000'))]

[]

[Row(ratio=Decimal('0.000'))]

[Row(ratio=Decimal('0.000'))]

[Row(ratio=Decimal('0.000'))]

 

Cassandra cluster:

My Cassandra version is 2.12, 

the Cassandra cluster has 9 nodes.

The python driver version is 2.6

 

I have tested both the AsyncoreConnection and LibevConnection; the results are
inconsistent in both cases.

 

Thanks a lot.

 

Peter





RE: query statement return empty

2015-07-30 Thread 鄢来琼
The replication factor is 3, and we have tested it using the “ALL”/“QUORUM”
consistency levels; the results are still inconsistent.
But when we rewrite it using Java or C#, the results are consistent.

Thanks.

From: Jeff Jirsa [mailto:jeff.ji...@crowdstrike.com]
Sent: Friday, July 31, 2015 9:15 AM
To: user@cassandra.apache.org
Subject: Re: query statement return empty

What consistency level are you using with your query?
What replication factor are you using on your keyspace?
Have you run repair?

The most likely explanation is that you wrote with low consistency (ANY, ONE, 
etc), and that one or more replicas does not have the cell. You’re then reading 
with low consistency (ONE, etc), and occasionally the coordinator chooses a 
replica without the data, so it is returning an empty result.

You can either increase your consistency level on reads and/or writes, or you 
can run repair to get the data on all nodes.



From: 鄢来琼
Reply-To: user@cassandra.apache.org
Date: Thursday, July 30, 2015 at 6:02 PM
To: user@cassandra.apache.org
Subject: query statement return empty

Hi ALL

The result of the “select * from t_test where id = 1” statement is not consistent.
Could you tell me why?

test case:
i = 0
while i < 5:
    result = cassandra_session.execute("select ratio from t_test where id = 1")
    print result
    i += 1

testing result:
[Row(ratio=Decimal('0.000'))]
[]
[Row(ratio=Decimal('0.000'))]
[Row(ratio=Decimal('0.000'))]
[Row(ratio=Decimal('0.000'))]

Cassandra cluster:
My Cassandra version is 2.12,
the Cassandra cluster has 9 nodes.
The python driver version is 2.6

I have tested both the AsyncoreConnection and LibevConnection; the results are
inconsistent in both cases.

Thanks a lot.

Peter


Re: Query returning tombstones

2015-05-03 Thread horschi
Hi Jens,

thanks a lot for the link! Your ticket seems very similar to my request.

kind regards,
Christian


On Sat, May 2, 2015 at 2:25 PM, Jens Rantil jens.ran...@tink.se wrote:

 Hi Christian,

 I just know Sylvain explicitly stated he wasn't a fan of exposing
 tombstones here:
 https://issues.apache.org/jira/browse/CASSANDRA-8574?focusedCommentId=14292063&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14292063

 Cheers,
 Jens

 On Wed, Apr 29, 2015 at 12:43 PM, horschi hors...@gmail.com wrote:

 Hi,

 did anybody ever raise a feature request for selecting tombstones in
 CQL/thrift?

 It would be nice if I could use CQLSH to see where my tombstones are
 coming from. This would be much more convenient than using sstable2json.

 Maybe someone can point me to an existing jira-ticket, but I also
 appreciate any other feedback :-)

 regards,
 Christian




 --
 Jens Rantil
 Backend engineer
 Tink AB

 Email: jens.ran...@tink.se
 Phone: +46 708 84 18 32
 Web: www.tink.se

 Facebook https://www.facebook.com/#!/tink.se Linkedin
 http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary
  Twitter https://twitter.com/tink



Re: query contains IN on the partition key and an ORDER BY

2015-05-02 Thread Robert Wille
Bag the IN clause and execute multiple parallel queries instead. It’s more 
performant anyway.
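
A rough sketch of that approach with the DataStax Java driver; the prepared statement text
mirrors the query above, but the session, the key list and the ordering assumption (dtime as
the clustering column) are mine:

import java.util.ArrayList;
import java.util.List;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;

// One async query per partition key instead of one IN (...) query.
PreparedStatement ps = session.prepare(
        "SELECT * FROM gps.log WHERE imeih = ? AND dtime < '2015-01-30 23:59:59' "
      + "ORDER BY dtime DESC LIMIT 1");
List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
for (String imeih : imeihKeys) {                  // the keys formerly in the IN clause
    futures.add(session.executeAsync(ps.bind(imeih)));
}
List<Row> rows = new ArrayList<Row>();
for (ResultSetFuture future : futures) {
    for (Row row : future.getUninterruptibly()) { // blocks until that query returns
        rows.add(row);
    }
}
// merge/sort "rows" by dtime on the client if a single global ordering is needed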

On May 2, 2015, at 11:46 AM, Abhishek Singh Bailoo 
abhishek.singh.bai...@gmail.commailto:abhishek.singh.bai...@gmail.com wrote:

Hi

I have run into the following issue 
https://issues.apache.org/jira/browse/CASSANDRA-6722 when running a query 
(contains IN on the partition key and an ORDER BY ) using datastax driver for 
Java.

However, I am able to run this query alright in cqlsh.

cqlsh: show version;
[cqlsh 5.0.1 | Cassandra 2.1.2 | CQL spec 3.2.0 | Native protocol v3]

cqlsh:gps select * from log where imeih in 
('862170011627815@2015-01-29@03','862170011627815@2015-01-30@21','862170011627815@2015-01-30@04')
 and dtime < '2015-01-30 23:59:59' order by dtime desc limit 1;

The same query when run via datastax Java driver gives the following error:

Exception in thread main 
com.datastax.driver.core.exceptions.InvalidQueryException: Cannot page queries 
with both ORDER BY and a IN restriction on the partition key; you must either 
remove the ORDER BY or the IN and sort client side, or disable paging for this 
query
at 
com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:35)

Any ideas?

Thanks,
Abhishek.



Re: Query returning tombstones

2015-05-02 Thread Jens Rantil
Hi Christian,

I just know Sylvain explicitly stated he wasn't a fan of exposing
tombstones here:
https://issues.apache.org/jira/browse/CASSANDRA-8574?focusedCommentId=14292063&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14292063

Cheers,
Jens

On Wed, Apr 29, 2015 at 12:43 PM, horschi hors...@gmail.com wrote:

 Hi,

 did anybody ever raise a feature request for selecting tombstones in
 CQL/thrift?

 It would be nice if I could use CQLSH to see where my tombstones are
 coming from. This would be much more convenient than using sstable2json.

 Maybe someone can point me to an existing jira-ticket, but I also
 appreciate any other feedback :-)

 regards,
 Christian




-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se

Facebook https://www.facebook.com/#!/tink.se Linkedin
http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary
 Twitter https://twitter.com/tink


Re: query by column size

2015-02-13 Thread chandra Varahala
I already have a secondary index on that column, but how do I query that
column by size?

thanks
chandra

On Fri, Feb 13, 2015 at 3:30 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

 There is no automatic indexing in Cassandra. There are secondary indexes,
 but not for these cases.
 You could use a solution like DSE, to get data automatically indexed on
 solr, in each node, as soon as data comes. Then you could do such a query
 on solr.
 If the query can be slow, you could run a MR job over all rows, filtering
 the ones you want.
 []s

 From: user@cassandra.apache.org
 Subject: Re:query by column size

 Greetings,

 I have one column family with 10 columns; in one of the columns we store
 xml/json.
 Is there a way I can query that column where size > 50kb? Assuming I
 have an index on that column.

 thanks
 CV.





Re: query by column size

2015-02-13 Thread Tyler Hobbs
On Fri, Feb 13, 2015 at 11:18 AM, chandra Varahala 
hadoopandcassan...@gmail.com wrote:

 I have already secondary index on that column, but how to I query that
 column by size ?


You can't.  If this is a query that you want to do regularly and
efficiently, I suggest creating a second table to act as an index (or
materialized view of sorts).  Whenever your application writes a row to the
original table with a column > 50kb, it should also update the second table.
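
A sketch of what such an index table could look like (names, types and the bucketing scheme
are made up for illustration):

-- One row here per oversized row in the original table, so they can be listed directly.
CREATE TABLE large_payloads (
    bucket int,          -- coarse bucket to avoid a single giant partition
    id uuid,             -- primary key of the row in the original table
    payload_size int,
    PRIMARY KEY (bucket, id)
);

-- Application-side: whenever the payload being written is larger than 50kb,
-- write the original row as usual and also:
-- INSERT INTO large_payloads (bucket, id, payload_size) VALUES (?, ?, ?);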


-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: Query strategy with respect to tombstones

2014-12-17 Thread Ryan Svihla
So first, limits are good; the unbounded row count per user can eventually
eat you, which I suspect is what's happening here. You may be better off
partitioning your data with some reasonable limits, but this is a bigger
domain modeling conversation.
Second, tombstone overflowing is typically a canary for a data model that
no longer fits the application's needs.

Typical options for tombstones are:

   1. gc_grace_seconds to a much lower number. I'm not a huge fan of this
   strategy as it means you can easily introduce inconsistency if you don't
   handle repairs before gc_grace_seconds.
   2. partition data in a way that makes it easier to manage tombstones, if
   there is a logical way to allocate data, either by problem domain or time
   then you can at some point safely truncate 'aged out' data.
   
http://lostechies.com/ryansvihla/2014/10/20/domain-modeling-around-deletes-or-using-cassandra-as-a-queue-even-when-you-know-better/

I will say I'm not a huge fan of the soft delete pattern in Cassandra, it's
like a permanent tombstone.
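
As a concrete illustration of option 2, a time-bucketed variant of a ((userid), id) table
might look like the sketch below; the column names and the per-month bucket are assumptions,
pick whatever granularity matches the delete pattern:

CREATE TABLE rows_by_user_and_month (
    userid uuid,
    bucket text,          -- e.g. '2014-12': one partition per user per month
    id timeuuid,
    payload text,
    PRIMARY KEY ((userid, bucket), id)
);
-- The "all rows for a user" query becomes one query per bucket, and buckets
-- that have aged out (or been deleted wholesale) no longer have to be scanned.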

On Wed, Dec 17, 2014 at 6:38 AM, Jens Rantil jens.ran...@tink.se wrote:

   Hi,

 I have a table with composite primary id ((userid), id). Some patterns
 about my table:
  * Each user generally has 0-3000 rows. But there is currently no upper
 limit.
  * Deleting rows for a user is extremely rare, but when done it can be
 done thousands of rows at a time.
  * The absolutely most common query is to select all rows for a user.

 Recently I saw a user that previously had 65000 tombstones when querying
 for all his rows. system.log was printing TombstoneOverwhelmingException.

 What are my options to avoid this overwhelming tombstone exception? I am
 willing to have slower queries than actually not being able to query at
 all. I see a couple of options:
  * Using an anti-column to mark rows as deleted. I could then control the
 rate of which I am writing tombstones by occasionally deleting
 anti-columns/rows with their equivalent rows.
  * Simply raise tombstone_failure_threshold. AFAIK, this will eventually
 make me run into possible GC issues.
  * Use fetchSize to limit the number of rows paged through. This would
 make every single query slower, and would not entirely avoid the
 possibility of getting TombstoneOverwhelmingException.

 Have I missed any alternatives here?

  In the best of worlds, the fetchSize property would also honour the
 number of tombstones, but I don’t think that would be possible, right?

 Thanks,
 Jens

 ——— Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se
 Phone: +46 708 84 18 32 Web: www.tink.se Facebook Linkedin Twitter



-- 

http://www.datastax.com/

Ryan Svihla

Solution Architect

https://twitter.com/foundev
http://www.linkedin.com/pub/ryan-svihla/12/621/727/

DataStax is the fastest, most scalable distributed database technology,
delivering Apache Cassandra to the world’s most innovative enterprises.
Datastax is built to be agile, always-on, and predictably scalable to any
size. With more than 500 customers in 45 countries, DataStax is the
database technology and transactional backbone of choice for the worlds
most innovative companies such as Netflix, Adobe, Intuit, and eBay.


Re: query tracing

2014-11-15 Thread Jimmy Lin
Well, we are able to do the tracing under normal load, but not yet able to
turn on tracing on demand during heavy load from the client side (due to the
hard-to-predict traffic pattern).

Under normal load we saw that most of the query time (for one particular
row we focus on) is spent between
merging data from memtables and (2-3) sstables, and
reading 10xx live cells and 2x tombstone cells.

Our CQL basically pulls out one row that has about 1000 columns (approx. 800k
of data). This table is already on leveled compaction.

But once we get a series of the exact same CQL (against the same row), the response
time starts to degrade dramatically, from the normal 300-500ms to 1 sec or
even 4 sec.
The rest of the system seems to remain fine: no obvious latency spikes in
reads/writes within the same keyspace or a different keyspace.

So I wonder what is causing the sudden increase in latency for the exact same
CQL? What are we saturating? If we had saturated the disk IO, other tables
would see a similar effect, but we didn't see that.
Is there any table-specific factor that may contribute to the slowness?

thanks








On Mon, Nov 10, 2014 at 7:21 AM, DuyHai Doan doanduy...@gmail.com wrote:

 As Jonathan said, it's better to activate query tracing client side. It'll
 give you better flexibility of when to turn on  off tracing and on which
 table. Server-side tracing is global (all tables) and probabilistic, thus
 may not give satisfactory level of debugging.

  Programmatically it's pretty simple to achieve and coupled with a good
 logging framework (LogBack for Java), you'll even have dynamic logging on
 production without having to redeploy client code. I have implemented it in
 Achilles very easily by wrapping over the Regular/Bound/Simple statements
 of Java driver and display the bound values at runtime :
 https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging

 On Mon, Nov 10, 2014 at 3:52 PM, Johnny Miller johnny.p.mil...@gmail.com
 wrote:

 Be cautious enabling query tracing. Great tool for dev/testing/diagnosing
 etc.. - but it does persist data to the system_traces keyspace with a TTL
 of 24 hours and will, as a consequence, consume resources.

 http://www.datastax.com/dev/blog/advanced-request-tracing-in-cassandra-1-2


 On 7 Nov 2014, at 20:20, Jonathan Haddad j...@jonhaddad.com wrote:

 Personally I've found that using query timing + log aggregation on the
 client side is more effective than trying to mess with tracing probability
 in order to find a single query which has recently become a problem.  I
 recommend wrapping your session with something that can automatically log
 the statement on a slow query, then use tracing to identify exactly what
 happened.  This way finding your problem is not a matter of chance.



 On Fri Nov 07 2014 at 9:41:38 AM Chris Lohfink clohfin...@gmail.com
 wrote:

 It saves a lot of information for each request thats traced so there is
 significant overhead.  If you start at a low probability and move it up
 based on the load impact it will provide a lot of insight and you can
 control the cost.

 ---
 Chris Lohfink

 On Fri, Nov 7, 2014 at 11:35 AM, Jimmy Lin y2klyf+w...@gmail.com
 wrote:

 is there any significant  performance penalty if one turn on Cassandra
 query tracing, through DataStax java driver (say, per every query request
 of some trouble query)?

 More sampling seems better but then doing so may also slow down the
 system in some other ways?

 thanks








Re: query tracing

2014-11-15 Thread Jens Rantil
Maybe you should try to lower your read repair probability?


—
Sent from Mailbox

On Sat, Nov 15, 2014 at 9:40 AM, Jimmy Lin y2klyf+w...@gmail.com wrote:

 Well we are able to do the tracing under normal load, but not yet able to
 turn on tracing on demand during heavy load from client side(due to hard to
 predict traffic pattern).
 under normal load we saw most of the time query spent (in one particular
 row we focus on) between
 merging data from memtables and (2-3) sstables
 Read 10xx live cell and 2x tomstones cell.
 Our cql basically pull out one row that has about 1000 columns(approx. 800k
 size of data). This table already in level compaction.
 But once we get a series of exact same cql(against same row), the response
 time start to dramatically degraded from normal 300-500ms to like 1 sec or
 4 sec.
 Other part of the system seems remain fine, no obvious latency spike In
 read/write within the same keyspace or different keyspace.
 So I wonder what is causing the sudden increase in latency of exact same
 cql? what do we saturated ? if we saturated the disk IO, other part of the
 tables will see similar effect but we didn't see it.
 is there any table specific factor may contribute to the slowness?
 thanks
 On Mon, Nov 10, 2014 at 7:21 AM, DuyHai Doan doanduy...@gmail.com wrote:
 As Jonathan said, it's better to activate query tracing client side. It'll
 give you better flexibility of when to turn on  off tracing and on which
 table. Server-side tracing is global (all tables) and probabilistic, thus
 may not give satisfactory level of debugging.

  Programmatically it's pretty simple to achieve and coupled with a good
 logging framework (LogBack for Java), you'll even have dynamic logging on
 production without having to redeploy client code. I have implemented it in
 Achilles very easily by wrapping over the Regular/Bound/Simple statements
 of Java driver and display the bound values at runtime :
 https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging

 On Mon, Nov 10, 2014 at 3:52 PM, Johnny Miller johnny.p.mil...@gmail.com
 wrote:

 Be cautious enabling query tracing. Great tool for dev/testing/diagnosing
 etc.. - but it does persist data to the system_traces keyspace with a TTL
 of 24 hours and will, as a consequence, consume resources.

 http://www.datastax.com/dev/blog/advanced-request-tracing-in-cassandra-1-2


 On 7 Nov 2014, at 20:20, Jonathan Haddad j...@jonhaddad.com wrote:

 Personally I've found that using query timing + log aggregation on the
 client side is more effective than trying to mess with tracing probability
 in order to find a single query which has recently become a problem.  I
 recommend wrapping your session with something that can automatically log
 the statement on a slow query, then use tracing to identify exactly what
 happened.  This way finding your problem is not a matter of chance.



 On Fri Nov 07 2014 at 9:41:38 AM Chris Lohfink clohfin...@gmail.com
 wrote:

 It saves a lot of information for each request thats traced so there is
 significant overhead.  If you start at a low probability and move it up
 based on the load impact it will provide a lot of insight and you can
 control the cost.

 ---
 Chris Lohfink

 On Fri, Nov 7, 2014 at 11:35 AM, Jimmy Lin y2klyf+w...@gmail.com
 wrote:

 is there any significant  performance penalty if one turn on Cassandra
 query tracing, through DataStax java driver (say, per every query request
 of some trouble query)?

 More sampling seems better but then doing so may also slow down the
 system in some other ways?

 thanks







Re: query tracing

2014-11-15 Thread Jimmy Lin
Hi Jens,
interesting idea, but I thought read repair happens in the background, and so
wouldn't affect the actual read request coming from the real client?



On Sat, Nov 15, 2014 at 1:04 AM, Jens Rantil jens.ran...@tink.se wrote:

 Maybe you should try to lower your read repair probability?

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Sat, Nov 15, 2014 at 9:40 AM, Jimmy Lin y2klyf+w...@gmail.com wrote:

  Well we are able to do the tracing under normal load, but not yet able
 to turn on tracing on demand during heavy load from client side(due to hard
 to predict traffic pattern).

 under normal load we saw most of the time query spent (in one particular
 row we focus on) between
 merging data from memtables and (2-3) sstables
 Read 10xx live cell and 2x tomstones cell.

 Our cql basically pull out one row that has about 1000 columns(approx.
 800k size of data). This table already in level compaction.

 But once we get a series of exact same cql(against same row), the
 response time start to dramatically degraded from normal 300-500ms to like
 1 sec or 4 sec.
 Other part of the system seems remain fine, no obvious latency spike In
 read/write within the same keyspace or different keyspace.

 So I wonder what is causing the sudden increase in latency of exact same
 cql? what do we saturated ? if we saturated the disk IO, other part of the
 tables will see similar effect but we didn't see it.
 is there any table specific factor may contribute to the slowness?

 thanks








 On Mon, Nov 10, 2014 at 7:21 AM, DuyHai Doan doanduy...@gmail.com
 wrote:

 As Jonathan said, it's better to activate query tracing client side.
 It'll give you better flexibility of when to turn on  off tracing and on
 which table. Server-side tracing is global (all tables) and probabilistic,
 thus may not give satisfactory level of debugging.

  Programmatically it's pretty simple to achieve and coupled with a good
 logging framework (LogBack for Java), you'll even have dynamic logging on
 production without having to redeploy client code. I have implemented it in
 Achilles very easily by wrapping over the Regular/Bound/Simple statements
 of Java driver and display the bound values at runtime :
 https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging

 On Mon, Nov 10, 2014 at 3:52 PM, Johnny Miller 
 johnny.p.mil...@gmail.com wrote:

 Be cautious enabling query tracing. Great tool for
 dev/testing/diagnosing etc.. - but it does persist data to the
 system_traces keyspace with a TTL of 24 hours and will, as a consequence,
 consume resources.


 http://www.datastax.com/dev/blog/advanced-request-tracing-in-cassandra-1-2


 On 7 Nov 2014, at 20:20, Jonathan Haddad j...@jonhaddad.com wrote:

 Personally I've found that using query timing + log aggregation on the
 client side is more effective than trying to mess with tracing probability
 in order to find a single query which has recently become a problem.  I
 recommend wrapping your session with something that can automatically log
 the statement on a slow query, then use tracing to identify exactly what
 happened.  This way finding your problem is not a matter of chance.



 On Fri Nov 07 2014 at 9:41:38 AM Chris Lohfink clohfin...@gmail.com
 wrote:

 It saves a lot of information for each request thats traced so there
 is significant overhead.  If you start at a low probability and move it up
 based on the load impact it will provide a lot of insight and you can
 control the cost.

 ---
 Chris Lohfink

 On Fri, Nov 7, 2014 at 11:35 AM, Jimmy Lin y2klyf+w...@gmail.com
 wrote:

  is there any significant  performance penalty if one turn on
 Cassandra query tracing, through DataStax java driver (say, per every 
 query
 request of some trouble query)?

 More sampling seems better but then doing so may also slow down the
 system in some other ways?

 thanks










Re: query tracing

2014-11-10 Thread Johnny Miller
Be cautious enabling query tracing. Great tool for dev/testing/diagnosing etc.. 
- but it does persist data to the system_traces keyspace with a TTL of 24 hours 
and will, as a consequence, consume resources.

http://www.datastax.com/dev/blog/advanced-request-tracing-in-cassandra-1-2


 On 7 Nov 2014, at 20:20, Jonathan Haddad j...@jonhaddad.com wrote:
 
 Personally I've found that using query timing + log aggregation on the client 
 side is more effective than trying to mess with tracing probability in order 
 to find a single query which has recently become a problem.  I recommend 
 wrapping your session with something that can automatically log the statement 
 on a slow query, then use tracing to identify exactly what happened.  This 
 way finding your problem is not a matter of chance.
 
 
 
 On Fri Nov 07 2014 at 9:41:38 AM Chris Lohfink clohfin...@gmail.com wrote:
 It saves a lot of information for each request thats traced so there is 
 significant overhead.  If you start at a low probability and move it up based 
 on the load impact it will provide a lot of insight and you can control the 
 cost.
 
 ---
 Chris Lohfink
 
 On Fri, Nov 7, 2014 at 11:35 AM, Jimmy Lin y2klyf+w...@gmail.com wrote:
 is there any significant  performance penalty if one turn on Cassandra query 
 tracing, through DataStax java driver (say, per every query request of some 
 trouble query)?
 
 More sampling seems better but then doing so may also slow down the system in 
 some other ways?
 
 thanks
 
 
 



Re: query tracing

2014-11-10 Thread DuyHai Doan
As Jonathan said, it's better to activate query tracing client side. It'll
give you better flexibility over when to turn tracing on & off and on which
table. Server-side tracing is global (all tables) and probabilistic, thus it
may not give a satisfactory level of debugging.

 Programmatically it's pretty simple to achieve and coupled with a good
logging framework (LogBack for Java), you'll even have dynamic logging on
production without having to redeploy client code. I have implemented it in
Achilles very easily by wrapping over the Regular/Bound/Simple statements
of Java driver and display the bound values at runtime :
https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging
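
With the plain Java driver (no Achilles) the per-statement equivalent is roughly this
sketch; the CQL text is a placeholder:

import com.datastax.driver.core.QueryTrace;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.SimpleStatement;

// Tracing is enabled for this one statement only, so the system_traces overhead
// mentioned below is paid just for the query being investigated.
SimpleStatement stmt =
        new SimpleStatement("SELECT * FROM my_keyspace.my_table WHERE id = 1");  // placeholder
stmt.enableTracing();
ResultSet rs = session.execute(stmt);
QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
for (QueryTrace.Event event : trace.getEvents()) {
    System.out.println(event.getSourceElapsedMicros() + " us : " + event.getDescription());
}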

On Mon, Nov 10, 2014 at 3:52 PM, Johnny Miller johnny.p.mil...@gmail.com
wrote:

 Be cautious enabling query tracing. Great tool for dev/testing/diagnosing
 etc.. - but it does persist data to the system_traces keyspace with a TTL
 of 24 hours and will, as a consequence, consume resources.

 http://www.datastax.com/dev/blog/advanced-request-tracing-in-cassandra-1-2


 On 7 Nov 2014, at 20:20, Jonathan Haddad j...@jonhaddad.com wrote:

 Personally I've found that using query timing + log aggregation on the
 client side is more effective than trying to mess with tracing probability
 in order to find a single query which has recently become a problem.  I
 recommend wrapping your session with something that can automatically log
 the statement on a slow query, then use tracing to identify exactly what
 happened.  This way finding your problem is not a matter of chance.



 On Fri Nov 07 2014 at 9:41:38 AM Chris Lohfink clohfin...@gmail.com
 wrote:

 It saves a lot of information for each request thats traced so there is
 significant overhead.  If you start at a low probability and move it up
 based on the load impact it will provide a lot of insight and you can
 control the cost.

 ---
 Chris Lohfink

 On Fri, Nov 7, 2014 at 11:35 AM, Jimmy Lin y2klyf+w...@gmail.com wrote:

 is there any significant  performance penalty if one turn on Cassandra
 query tracing, through DataStax java driver (say, per every query request
 of some trouble query)?

 More sampling seems better but then doing so may also slow down the
 system in some other ways?

 thanks







Re: query tracing

2014-11-07 Thread Robert Coli
On Fri, Nov 7, 2014 at 9:35 AM, Jimmy Lin y2klyf+w...@gmail.com wrote:

 is there any significant  performance penalty if one turn on Cassandra
 query tracing, through DataStax java driver (say, per every query request
 of some trouble query)?


What does 'significant' mean in your sentence? I'm pretty sure the answer
for most meanings of it is no.

=Rob


Re: query tracing

2014-11-07 Thread Chris Lohfink
It saves a lot of information for each request that's traced, so there is
significant overhead.  If you start at a low probability and move it up
based on the load impact it will provide a lot of insight and you can
control the cost.

---
Chris Lohfink

On Fri, Nov 7, 2014 at 11:35 AM, Jimmy Lin y2klyf+w...@gmail.com wrote:

 is there any significant  performance penalty if one turn on Cassandra
 query tracing, through DataStax java driver (say, per every query request
 of some trouble query)?

 More sampling seems better but then doing so may also slow down the system
 in some other ways?

 thanks





Re: query tracing

2014-11-07 Thread Jonathan Haddad
Personally I've found that using query timing + log aggregation on the
client side is more effective than trying to mess with tracing probability
in order to find a single query which has recently become a problem.  I
recommend wrapping your session with something that can automatically log
the statement on a slow query, then use tracing to identify exactly what
happened.  This way finding your problem is not a matter of chance.
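
A bare-bones version of that wrapper might look like this; the threshold and the logging
target are arbitrary, and a real one would also cover executeAsync:

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;

// Times every query and logs the statement when it crosses a threshold, so
// tracing can then be turned on for just that statement.
public class TimedQueries {
    private static final long SLOW_MS = 200;            // arbitrary threshold

    public static ResultSet execute(Session session, Statement statement) {
        long start = System.currentTimeMillis();
        ResultSet rs = session.execute(statement);
        long elapsedMs = System.currentTimeMillis() - start;
        if (elapsedMs > SLOW_MS) {
            System.err.println("slow query (" + elapsedMs + " ms): " + statement);
        }
        return rs;
    }
}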



On Fri Nov 07 2014 at 9:41:38 AM Chris Lohfink clohfin...@gmail.com wrote:

 It saves a lot of information for each request thats traced so there is
 significant overhead.  If you start at a low probability and move it up
 based on the load impact it will provide a lot of insight and you can
 control the cost.

 ---
 Chris Lohfink

 On Fri, Nov 7, 2014 at 11:35 AM, Jimmy Lin y2klyf+w...@gmail.com wrote:

 is there any significant  performance penalty if one turn on Cassandra
 query tracing, through DataStax java driver (say, per every query request
 of some trouble query)?

 More sampling seems better but then doing so may also slow down the
 system in some other ways?

 thanks






Re: Query returns incomplete result

2014-05-19 Thread Aaron Morton
Calling execute the second time runs the query a second time, and it looks like 
the query mutates instance state during the pagination. 

What happens if you only call execute() once ? 

Cheers
Aaron

-
Aaron Morton
New Zealand
@aaronmorton

Co-Founder  Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 8/05/2014, at 8:03 pm, Lu, Boying boying...@emc.com wrote:

 Hi, All,
  
 I use the astyanax 1.56.48 + Cassandra 2.0.6 in my test codes and do some 
 query like this:
  
 query = keyspace.prepareQuery(..).getKey(…)
 .autoPaginate(true)
 .withColumnRange(new RangeBuilder().setLimit(pageSize).build());
  
  ColumnList<IndexColumnName> result;
 result= query.execute().getResult();
 while (!result.isEmpty()) {
 //handle result here
 result= query.execute().getResult();
 }
  
  There are 2003 records in the DB; if the pageSize is set to 1100, I get only 
  2002 records back,
  and if the pageSize is set to 3000, I can get all 2003 records back.
  
 Does anyone know why? Is it a bug?
  
 Thanks
  
 Boying



Re: Query first 1 columns for each partitioning keys in CQL?

2014-05-19 Thread Bryan Talbot
I think there are several issues in your schema and queries.

First, the schema can't efficiently return the single newest post for every
author. It can efficiently return the newest N posts for a particular
author.

On Fri, May 16, 2014 at 11:53 PM, 後藤 泰陽 matope@gmail.com wrote:


 But I consider LIMIT to be a keyword to limits result numbers from WHOLE
 results retrieved by the SELECT statement.



This is happening due to the incorrect use of minTimeuuid() function. All
of your created_at values are equal so you're essentially getting 2 (order
not defined) values that have the lowest created_at value.

The minTimeuuid() function is meant to be used in the WHERE clause of a
SELECT statement, often together with maxTimeuuid(), to do BETWEEN-style queries
on timeuuid values.
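
For example, against the posts schema above, a range query over one author's partition
might look like this (the dates are illustrative):

SELECT entry FROM posts
 WHERE author = 'john'
   AND created_at > maxTimeuuid('2013-02-01 00:00+0000')
   AND created_at < minTimeuuid('2013-03-01 00:00+0000');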




 The result with SELECT.. LIMIT is below. Unfortunately, This is not what I
 wanted.
 I wante latest posts of each authors. (Now I doubt if CQL3 can't represent
 it)

 cqlsh:blog_test create table posts(
  ... author ascii,
  ... created_at timeuuid,
  ... entry text,
  ... primary key(author,created_at)
  ... )WITH CLUSTERING ORDER BY (created_at DESC);
 cqlsh:blog_test
 cqlsh:blog_test insert into posts(author,created_at,entry) values
 ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
 john');
 cqlsh:blog_test insert into posts(author,created_at,entry) values
 ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by
 john');
 cqlsh:blog_test insert into posts(author,created_at,entry) values
 ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
 mike');
 cqlsh:blog_test insert into posts(author,created_at,entry) values
 ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by
 mike');
 cqlsh:blog_test select * from posts limit 2;

  author | created_at   | entry

 +--+--
mike | 1c4d9000-83e9-11e2-8080-808080808080 |  This is a new entry by
 mike
mike | 4e52d000-6d1f-11e2-8080-808080808080 | This is an old entry by
 mike






To get most recent posts by a particular author, you'll need statements
more like this:

cqlsh:test insert into posts(author,created_at,entry) values
('john',now(),'This is an old entry by john'); cqlsh:test insert into
posts(author,created_at,entry) values ('john',now(),'This is a new entry by
john'); cqlsh:test insert into posts(author,created_at,entry) values
('mike',now(),'This is an old entry by mike'); cqlsh:test insert into
posts(author,created_at,entry) values ('mike',now(),'This is a new entry by
mike');

and then you can get posts by 'john' ordered by newest to oldest as:

cqlsh:test select author, created_at, dateOf(created_at), entry from posts
where author = 'john' limit 2 ;

 author | created_at   | dateOf(created_at)   |
entry
+--+--+--
   john | 7cb1ac30-df85-11e3-bb46-4d2d68f17aa6 | 2014-05-19 11:43:36-0700 |
 This is a new entry by john
   john | 74bb6750-df85-11e3-bb46-4d2d68f17aa6 | 2014-05-19 11:43:23-0700 |
This is an old entry by john


-Bryan


Re: Query first 1 columns for each partitioning keys in CQL?

2014-05-17 Thread 後藤 泰陽
Hello,

Thank you for your addressing.

But I consider LIMIT to be a keyword that limits the number of results from the
WHOLE result set retrieved by the SELECT statement.
The result with SELECT .. LIMIT is below. Unfortunately, this is not what I
wanted.
I wanted the latest posts of each author. (Now I doubt whether CQL3 can represent it.)

 cqlsh:blog_test create table posts(
  ... author ascii,
  ... created_at timeuuid,
  ... entry text,
  ... primary key(author,created_at)
  ... )WITH CLUSTERING ORDER BY (created_at DESC);
 cqlsh:blog_test 
 cqlsh:blog_test insert into posts(author,created_at,entry) values 
 ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by john');
 cqlsh:blog_test insert into posts(author,created_at,entry) values 
 ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by john');
 cqlsh:blog_test insert into posts(author,created_at,entry) values 
 ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by mike');
 cqlsh:blog_test insert into posts(author,created_at,entry) values 
 ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by mike');
 cqlsh:blog_test select * from posts limit 2;
 
  author | created_at   | entry
 +--+--
mike | 1c4d9000-83e9-11e2-8080-808080808080 |  This is a new entry by mike
mike | 4e52d000-6d1f-11e2-8080-808080808080 | This is an old entry by mike



On 2014/05/16 at 23:54, Jonathan Lacefield jlacefi...@datastax.com wrote:

 Hello,
 
  Have you looked at using the CLUSTERING ORDER BY and LIMIT features of CQL3?
 
  These may help you achieve your goals.
 
   
 http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refClstrOrdr.html
   
 http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html
 
 Jonathan Lacefield
 Solutions Architect, DataStax
 (404) 822 3487
 
 
 
 
 
 
 On Fri, May 16, 2014 at 12:23 AM, Matope Ono matope@gmail.com wrote:
 Hi, I'm modeling some queries in CQL3.
 
 I'd like to query first 1 columns for each partitioning keys in CQL3.
 
 For example:
 
 create table posts(
   author ascii,
   created_at timeuuid,
   entry text,
   primary key(author,created_at)
 );
 insert into posts(author,created_at,entry) values 
 ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by john');
 insert into posts(author,created_at,entry) values 
 ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by john');
 insert into posts(author,created_at,entry) values 
 ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by mike');
 insert into posts(author,created_at,entry) values 
 ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by mike');
 
 And I want results like below.
 
 mike,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by mike
 john,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by john
 
 I think that this is what SELECT FIRST  statements did in CQL2.
 
 The only way I came across in CQL3 is retrieve whole records and drop 
 manually,
 but it's obviously not efficient.
 
 Could you please tell me more straightforward way in CQL3?
 



Re: Query first 1 columns for each partitioning keys in CQL?

2014-05-17 Thread DuyHai Doan
Clearly with your current data model, having X latest post for each author
is not possible.

 However, what's about this ?

CREATE TABLE latest_posts_per_user (
   author ascii,
   latest_post map<uuid,text>,
   PRIMARY KEY (author)
)

 The latest_post will keep a collection of X latest posts for each user.
Now the challenge is to update this latest_post map every time an user
create a new post. This can be done in a single CQL3 statement: UPDATE
latest_posts_per_user SET latest_post = latest_post + {new_uuid: 'new
entry', oldest_uuid: null} WHERE author = xxx;

 You'll need to know the uuid of the oldest post to remove it from the map



On Sat, May 17, 2014 at 8:53 AM, 後藤 泰陽 matope@gmail.com wrote:

 Hello,

 Thank you for your addressing.

 But I consider LIMIT to be a keyword to limits result numbers from WHOLE
 results retrieved by the SELECT statement.
 The result with SELECT.. LIMIT is below. Unfortunately, This is not what I
 wanted.
 I wante latest posts of each authors. (Now I doubt if CQL3 can't represent
 it)

 cqlsh:blog_test create table posts(
  ... author ascii,
  ... created_at timeuuid,
  ... entry text,
  ... primary key(author,created_at)
  ... )WITH CLUSTERING ORDER BY (created_at DESC);
 cqlsh:blog_test
 cqlsh:blog_test insert into posts(author,created_at,entry) values
 ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
 john');
 cqlsh:blog_test insert into posts(author,created_at,entry) values
 ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by
 john');
 cqlsh:blog_test insert into posts(author,created_at,entry) values
 ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
 mike');
 cqlsh:blog_test insert into posts(author,created_at,entry) values
 ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by
 mike');
 cqlsh:blog_test select * from posts limit 2;

  author | created_at   | entry

 +--+--
mike | 1c4d9000-83e9-11e2-8080-808080808080 |  This is a new entry by
 mike
mike | 4e52d000-6d1f-11e2-8080-808080808080 | This is an old entry by
 mike




 On 2014/05/16 at 23:54, Jonathan Lacefield jlacefi...@datastax.com wrote:

 Hello,

  Have you looked at using the CLUSTERING ORDER BY and LIMIT features of
 CQL3?

  These may help you achieve your goals.


 http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refClstrOrdr.html

 http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html

 Jonathan Lacefield
 Solutions Architect, DataStax
 (404) 822 3487
 http://www.linkedin.com/in/jlacefield

 http://www.datastax.com/cassandrasummit14



 On Fri, May 16, 2014 at 12:23 AM, Matope Ono matope@gmail.com wrote:

 Hi, I'm modeling some queries in CQL3.

 I'd like to query first 1 columns for each partitioning keys in CQL3.

 For example:

 create table posts(
 author ascii,
 created_at timeuuid,
 entry text,
 primary key(author,created_at)
 );
 insert into posts(author,created_at,entry) values
 ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
 john');
 insert into posts(author,created_at,entry) values
 ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by john');
 insert into posts(author,created_at,entry) values
 ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
 mike');
 insert into posts(author,created_at,entry) values
 ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by mike');


 And I want results like below.

 mike,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by mike
 john,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by john


 I think that this is what SELECT FIRST  statements did in CQL2.

 The only way I came across in CQL3 is retrieve whole records and drop
 manually,
 but it's obviously not efficient.

 Could you please tell me more straightforward way in CQL3?






Re: Query first 1 columns for each partitioning keys in CQL?

2014-05-17 Thread Matope Ono
Hmm. Something like a user-managed index looks like the only way to do what I
want to do.
Thank you, I'll try that.


2014-05-17 18:07 GMT+09:00 DuyHai Doan doanduy...@gmail.com:

 Clearly with your current data model, having X latest post for each author
 is not possible.

  However, what's about this ?

 CREATE TABLE latest_posts_per_user (
 author ascii,
 latest_post map<uuid,text>,
PRIMARY KEY (author)
 )

  The latest_post will keep a collection of X latest posts for each user.
 Now the challenge is to update this latest_post map every time an user
 create a new post. This can be done in a single CQL3 statement: UPDATE
 latest_posts_per_user SET latest_post = latest_post + {new_uuid: 'new
 entry', oldest_uuid: null} WHERE author = xxx;

  You'll need to know the uuid of the oldest post to remove it from the map



 On Sat, May 17, 2014 at 8:53 AM, 後藤 泰陽 matope@gmail.com wrote:

 Hello,

 Thank you for your addressing.

 But I consider LIMIT to be a keyword to limits result numbers from WHOLE
 results retrieved by the SELECT statement.
 The result with SELECT.. LIMIT is below. Unfortunately, This is not what
 I wanted.
 I wante latest posts of each authors. (Now I doubt if CQL3 can't
 represent it)

 cqlsh:blog_test create table posts(
  ... author ascii,
  ... created_at timeuuid,
  ... entry text,
  ... primary key(author,created_at)
  ... )WITH CLUSTERING ORDER BY (created_at DESC);
 cqlsh:blog_test
 cqlsh:blog_test insert into posts(author,created_at,entry) values
 ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
 john');
 cqlsh:blog_test insert into posts(author,created_at,entry) values
 ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by
 john');
 cqlsh:blog_test insert into posts(author,created_at,entry) values
 ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
 mike');
 cqlsh:blog_test insert into posts(author,created_at,entry) values
 ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by
 mike');
 cqlsh:blog_test select * from posts limit 2;

  author | created_at   | entry

 +--+--
mike | 1c4d9000-83e9-11e2-8080-808080808080 |  This is a new entry by
 mike
mike | 4e52d000-6d1f-11e2-8080-808080808080 | This is an old entry by
 mike




 On 2014/05/16 at 23:54, Jonathan Lacefield jlacefi...@datastax.com wrote:

 Hello,

  Have you looked at using the CLUSTERING ORDER BY and LIMIT features of
 CQL3?

  These may help you achieve your goals.


 http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refClstrOrdr.html

 http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html

 Jonathan Lacefield
 Solutions Architect, DataStax
 (404) 822 3487
 http://www.linkedin.com/in/jlacefield

 http://www.datastax.com/cassandrasummit14



 On Fri, May 16, 2014 at 12:23 AM, Matope Ono matope@gmail.comwrote:

 Hi, I'm modeling some queries in CQL3.

 I'd like to query first 1 columns for each partitioning keys in CQL3.

 For example:

 create table posts(
 author ascii,
 created_at timeuuid,
 entry text,
 primary key(author,created_at)
 );
 insert into posts(author,created_at,entry) values
 ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
 john');
 insert into posts(author,created_at,entry) values
 ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by 
 john');
 insert into posts(author,created_at,entry) values
 ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
 mike');
 insert into posts(author,created_at,entry) values
 ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by 
 mike');


 And I want results like below.

 mike,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by mike
 john,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by john


 I think that this is what SELECT FIRST  statements did in CQL2.

 The only way I came across in CQL3 is retrieve whole records and drop
 manually,
 but it's obviously not efficient.

 Could you please tell me more straightforward way in CQL3?







Re: Query first 1 columns for each partitioning keys in CQL?

2014-05-16 Thread Jonathan Lacefield
Hello,

 Have you looked at using the CLUSTERING ORDER BY and LIMIT features of
CQL3?

 These may help you achieve your goals.


http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refClstrOrdr.html

http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html

Jonathan Lacefield
Solutions Architect, DataStax
(404) 822 3487
http://www.linkedin.com/in/jlacefield

http://www.datastax.com/cassandrasummit14



On Fri, May 16, 2014 at 12:23 AM, Matope Ono matope@gmail.com wrote:

 Hi, I'm modeling some queries in CQL3.

 I'd like to query first 1 columns for each partitioning keys in CQL3.

 For example:

 create table posts(
 author ascii,
 created_at timeuuid,
 entry text,
 primary key(author,created_at)
 );
 insert into posts(author,created_at,entry) values
 ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
 john');
 insert into posts(author,created_at,entry) values
 ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by john');
 insert into posts(author,created_at,entry) values
 ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
 mike');
 insert into posts(author,created_at,entry) values
 ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by mike');


 And I want results like below.

 mike,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by mike
 john,1c4d9000-83e9-11e2-8080-808080808080,This is a new entry by john


 I think that this is what SELECT FIRST  statements did in CQL2.

 The only way I came across in CQL3 is retrieve whole records and drop
 manually,
 but it's obviously not efficient.

 Could you please tell me more straightforward way in CQL3?



Re: Query on blob col using CQL3

2014-02-28 Thread Mikhail Stepura

Did you try http://cassandra.apache.org/doc/cql3/CQL.html#blobFun ?
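
For example (table and column names are made up), either a hex blob literal or one of the
conversion functions should work:

SELECT * FROM my_keyspace.my_table WHERE key_col = 0x0012f05a;              -- hex literal
SELECT * FROM my_keyspace.my_table WHERE key_col = textAsBlob('row-key-1'); -- if the bytes are UTF-8 text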


On 2/28/14, 9:14, Senthil, Athinanthny X. -ND wrote:

Can anyone suggest how to query on a blob column via CQL3? I get a bad
request error saying it cannot parse the data. I want to look up by a key column
which is defined as blob.

But I am able to look up the data via the OpsCenter data explorer. Is there a
conversion function I need to use?




Sent from my Galaxy S®III




