Re: Cassandra p95 latencies

2023-08-25 Thread Andrew Weaver
Do you have the SSTables-per-read metric from before and after you increased
the key cache size? If it was high before, that may have been the culprit,
meaning compaction tuning is in order.
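
(For anyone who wants to pull that metric, a minimal sketch; it assumes
nodetool is on PATH on the node, and "my_ks"/"my_table" are placeholder
names:)

    # Dump per-table read histograms; the "SSTables" column is the number
    # of SSTables touched per read at each percentile. Consistently high
    # values there point at compaction tuning.
    import subprocess

    out = subprocess.run(
        ["nodetool", "tablehistograms", "my_ks", "my_table"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(out)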

On Fri, Aug 25, 2023, 12:35 PM Shaurya Gupta  wrote:

> Thanks everyone.
> Updating this thread -
> We increased the key cache size from 100 MB to 200 MB, and we believe that
> has brought the latency down from 40 ms p95 to 6 ms p95. I think there is
> still scope for improvement, as both writes and reads are presently at 6 ms
> p95; I would expect writes to be lower. But we are good with 6 ms for now at
> least.
>
> On Mon, Aug 14, 2023 at 11:56 AM Elliott Sims via user <
> user@cassandra.apache.org> wrote:
>
>> 1.  Check for Nagle/delayed-ack, but probably nodelay is getting set by
>> the driver so it shouldn't be a problem.
>> 2.  Check for network latency (just regular old ping among hosts, during
>> traffic)
>> 3.  Check your GC metrics and see if garbage collections line up with
>> outliers.  Some tuning can help there, depending on the pattern, but 40ms
>> p99 at least would be fairly normal for G1GC.
>> 4.  Check actual local write times, and I/O times with iostat.  If you
>> have spinning drives, 40ms is fairly expected.  It's high but not totally
>> unexpected for consumer-grade SSDs.  For enterprise-grade SSDs commit times
>> that long would be very unusual.  What are your commitlog_sync settings?
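
(On item 4: a quick way to check the commitlog_sync settings without shell
access, sketched with the Python driver against a placeholder contact point;
Cassandra 4.0 exposes its effective cassandra.yaml values in the
system_views.settings virtual table:)

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])  # placeholder contact point
    session = cluster.connect()
    # system_views.settings is a node-local name/value view of the config.
    for row in session.execute("SELECT name, value FROM system_views.settings"):
        if row.name.startswith("commitlog_sync"):
            print(row.name, "=", row.value)
    cluster.shutdown()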
>>
>> On Mon, Aug 14, 2023 at 8:43 AM Josh McKenzie 
>> wrote:
>>
>>> The queries are rightly designed
>>>
>>> Data modeling in Cassandra is 100% gray space; there unfortunately is no
>>> right or wrong design. You'll need to share basic shapes / contours of your
>>> data model for other folks to help you; seemingly innocuous things in a
>>> data model can cause unexpected issues w/C*'s storage engine paradigm
>>> thanks to the partitioning and data storage happening under the hood.
>>>
>>> If you were seeing single digit ms on 3.0.X or 3.11.X and 40ms p95 on
>>> 4.0 I'd immediately look to the DB as being the culprit. For all other
>>> cases, you should be seeing single digit ms as queries in C* generally boil
>>> down to key/value lookups (partition key) to a list of rows you either
>>> point query (key/value #2) or range scan via clustering keys and pull back
>>> out.
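
(A concrete sketch of those two access patterns with the Python driver, using
placeholder schema names and LOCAL_QUORUM as in the original question:)

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("my_ks")  # placeholders

    # Point query: partition key plus full clustering key (key/value #2).
    point = SimpleStatement(
        "SELECT * FROM my_table WHERE pk = 1 AND ck = 42",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    # Range scan within a single partition via the clustering key.
    scan = SimpleStatement(
        "SELECT * FROM my_table WHERE pk = 1 AND ck >= 10 AND ck < 20",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    for stmt in (point, scan):
        for row in session.execute(stmt):
            print(row)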
>>>
>>> There's also paging to take into consideration (whether you're using it
>>> or not, what your page size is) and the data itself (do you have thousands
>>> of columns? Multi-MB blobs you're pulling back out? etc). All can play into
>>> this.
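
(Page size is set per statement in the Python driver via fetch_size; a
minimal sketch with placeholder names:)

    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect()  # placeholder contact point
    # fetch_size is the page size; the driver fetches subsequent pages
    # transparently as the result set is iterated.
    stmt = SimpleStatement("SELECT * FROM my_ks.my_table WHERE pk = 1",
                           fetch_size=500)
    for row in session.execute(stmt):
        print(row)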
>>>
>>> On Fri, Aug 11, 2023, at 3:40 PM, Jeff Jirsa wrote:
>>>
>>> You’re going to have to help us help you
>>>
>>> 4.0 is pretty widely deployed. I’m not aware of a perf regression
>>>
>>> Can you give us a schema (anonymized) and queries and show us a trace ?
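
(A hedged sketch of capturing such a trace with the Python driver; contact
point and query are placeholders:)

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()  # placeholder contact point
    rs = session.execute("SELECT * FROM my_ks.my_table WHERE pk = 1",
                         trace=True)
    trace = rs.get_query_trace()  # server-side trace for this request
    print("coordinator:", trace.coordinator, "duration:", trace.duration)
    for e in trace.events:
        print(e.source_elapsed, e.source, e.description)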
>>>
>>>
>>> On Aug 10, 2023, at 10:18 PM, Shaurya Gupta 
>>> wrote:
>>>
>>> 
>>> The queries are rightly designed, as I already explained. 40 ms is way
>>> too high compared to what I have seen with other DBs, and many times with
>>> Cassandra 3.x versions.
>>> CPU consumption, as I mentioned, is not high; it is around 20%.
>>>
>>> On Thu, Aug 10, 2023 at 5:14 PM MyWorld  wrote:
>>>
>>> Hi,
>>> P95 should not be a problem if the table is rightly designed. Levelled
>>> compaction strategy further reduces this; however, it consumes some
>>> resources. For reads, caching is also helpful.
>>> Can you check your CPU iowait, as it could be the reason for the delay?
>>>
>>> Regards,
>>> Ashish
>>>
>>> On Fri, 11 Aug, 2023, 04:58 Shaurya Gupta, 
>>> wrote:
>>>
>>> Hi community
>>>
>>> What is the expected p95 latency for Cassandra read and write queries
>>> executed with LOCAL_QUORUM over a table with 3 replicas? The queries are
>>> done using the partition + clustering key, and the row size is not large,
>>> maybe 1-2 KB maximum.
>>> Assume CPU is not a bottleneck.
>>>
>>> We observe these to be 40 ms p95 for reads and the same for writes. This
>>> looks very high compared to what we expected. We are using Cassandra 4.0.
>>>
>>> Any documentation / numbers will be helpful.
>>>
>>> Thanks
>>> --
>>> Shaurya Gupta
>>>
>>>
>>>
>>> --
>>> Shaurya Gupta
>>>
>>>
>>>
>>
>
>
> --
> Shaurya Gupta
>
>
>


Re: Testing Cassandra connectivity at application startup

2023-08-25 Thread Andrew Weaver
For a readiness probe and for ongoing ECV checks, just making sure the
driver is initialized is enough. I've seen problems recently with
applications running "select cluster_name from system.local" for ECV
checks.  We haven't dug into it in detail yet, but with a large number of
clients it puts a lot of load on just a handful of nodes in the cluster. So,
as always, test it well before putting something like this in production,
especially for large deployments.
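
For example, a no-query readiness check along those lines might look like the
sketch below with the Python driver; cassandra_ready is a hypothetical helper
that reuses the application's existing Session:

    from cassandra.cluster import Cluster, Session

    def cassandra_ready(session: Session) -> bool:
        # Ready once the driver's metadata shows at least one live host;
        # no CQL is issued at all.
        return any(h.is_up for h in session.cluster.metadata.all_hosts())

    session = Cluster(["127.0.0.1"]).connect()  # placeholder contact point
    print(cassandra_ready(session))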

On Fri, Aug 25, 2023, 1:19 PM Raphael Mazelier  wrote:

> That's a good way to do it!
> On 25/08/2023 20:10, Shaurya Gupta wrote:
>
> We don't plan to open a new connection. It should use the same
> connection(s) which the application will use.
>
> On Fri, Aug 25, 2023 at 10:59 AM Raphael Mazelier 
> wrote:
>
>> Mind that a new connection is really costly for C*.
>> So at startup it's fine, but not in a liveness or readiness check IMO.
>> For the query, why not select 1;?
>>
>> --
>>
>> Raphael Mazelier
>>
>>
>> On 25/08/2023 19:38, Shaurya Gupta wrote:
>>
>> Hi community
>>
>> We want to validate Cassandra connectivity from the application container
>> when it starts up and before it reports as healthy to K8s. Is doing
>>
>>> select * from our_keyspace.table limit 1
>>
>> fine, or is it an inefficient query that should not be fired on a prod
>> cluster?
>>
>> Any other suggestions ?
>>
>> --
>> Shaurya Gupta
>>
>>
>>
>
> --
> Shaurya Gupta
>
>
>


Re: Testing Cassandra connectivity at application startup

2023-08-25 Thread Raphael Mazelier
That's a good way to do it!

On 25/08/2023 20:10, Shaurya Gupta wrote:

> We don't plan to open a new connection. It should use the same connection(s) 
> which the application will use.
>
> On Fri, Aug 25, 2023 at 10:59 AM Raphael Mazelier  wrote:
>
>> Mind that a new connection is really costly for C*.
>> So at startup it's fine, but not in a liveness or readiness check IMO. For
>> the query, why not select 1;?
>>
>> --
>>
>> Raphael Mazelier
>>
>> On 25/08/2023 19:38, Shaurya Gupta wrote:
>>
>>> Hi community
>>>
>>> We want to validate Cassandra connectivity from the application container
>>> when it starts up and before it reports as healthy to K8s. Is doing
>>>
 select * from our_keyspace.table limit 1
>>>
>>> fine, or is it an inefficient query that should not be fired on a prod
>>> cluster?
>>>
>>> Any other suggestions ?
>>> --
>>>
>>> Shaurya Gupta
>
> --
>
> Shaurya Gupta

Re: Testing Cassandra connectivity at application startup

2023-08-25 Thread Shaurya Gupta
We don't plan to open a new connection. It should use the same
connection(s) which the application will use.

On Fri, Aug 25, 2023 at 10:59 AM Raphael Mazelier  wrote:

> Mind that a new connection is really costly for C*.
> So at startup it's fine, but not in a liveness or readiness check IMO.
> For the query, why not select 1;?
>
> --
>
> Raphael Mazelier
>
>
> On 25/08/2023 19:38, Shaurya Gupta wrote:
>
> Hi community
>
> We want to validate Cassandra connectivity from the application container
> when it starts up and before it reports as healthy to K8s. Is doing
>
>> select * from our_keyspace.table limit 1
>
> fine, or is it an inefficient query that should not be fired on a prod
> cluster?
>
> Any other suggestions ?
>
> --
> Shaurya Gupta
>
>
>

-- 
Shaurya Gupta


Re: Testing Cassandra connectivity at application startup

2023-08-25 Thread C. Scott Andreas
“select * from …” without a predicate from a user table would be very 
expensive, yes.

A query from a small, node-local system table such as “select * from 
system.peers” would make a better health check. 

- Scott
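
With the Python driver that might look like the sketch below, reusing the
application's existing session rather than opening a new connection (names
and the timeout are placeholders):

    from cassandra.cluster import Cluster, Session

    def cassandra_healthy(session: Session) -> bool:
        # Cheap, node-local read: system.peers is small and never fans out
        # to other replicas, unlike "select * from ..." on a user table.
        try:
            session.execute("SELECT peer FROM system.peers", timeout=2)
            return True
        except Exception:
            return False

    session = Cluster(["127.0.0.1"]).connect()  # placeholder contact point
    print(cassandra_healthy(session))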

> On Aug 25, 2023, at 10:58 AM, Raphael Mazelier  wrote:
> 
> 
> Mind that a new connection is really costly for C*.
> So at startup it's fine, but not in a liveness or readiness check IMO. For
> the query, why not select 1;?
> 
> --
> 
> Raphael Mazelier
> 
> 
> 
> On 25/08/2023 19:38, Shaurya Gupta wrote:
>> Hi community
>> 
>> We want to validate Cassandra connectivity from the application container
>> when it starts up and before it reports as healthy to K8s. Is doing
>>> select * from our_keyspace.table limit 1
>> fine, or is it an inefficient query that should not be fired on a prod
>> cluster?
>> 
>> Any other suggestions ?
>> 
>> --
>> Shaurya Gupta
>> 
>> 


Re: Testing Cassandra connectivity at application startup

2023-08-25 Thread Raphael Mazelier
Mind that a new connection is really costly for C*.
So at startup it's fine, but not in a liveness or readiness check IMO. For the
query, why not select 1;?

--

Raphael Mazelier

On 25/08/2023 19:38, Shaurya Gupta wrote:

> Hi community
>
> We want to validate Cassandra connectivity from the application container
> when it starts up and before it reports as healthy to K8s. Is doing
>
>> select * from our_keyspace.table limit 1
>
> fine, or is it an inefficient query that should not be fired on a prod
> cluster?
>
> Any other suggestions ?
> --
>
> Shaurya Gupta

Testing Cassandra connectivity at application startup

2023-08-25 Thread Shaurya Gupta
Hi community

We want to validate Cassandra connectivity from the application container
when it starts up and before it reports as healthy to K8s. Is doing

> select * from our_keyspace.table limit 1

fine, or is it an inefficient query that should not be fired on a prod
cluster?

Any other suggestions ?

-- 
Shaurya Gupta


Re: Cassandra p95 latencies

2023-08-25 Thread Shaurya Gupta
Thanks everyone.
Updating this thread -
We increased the key cache size from 100 MB to 200 MB, and we believe that
has brought the latency down from 40 ms p95 to 6 ms p95. I think there is
still scope for improvement, as both writes and reads are presently at 6 ms
p95; I would expect writes to be lower. But we are good with 6 ms for now at
least.
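
(For others tuning the same knob: on 4.0 you can watch cache size and hit
ratio through the system_views.caches virtual table, e.g. with the Python
driver and a placeholder contact point:)

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()  # placeholder contact point
    # One row per cache (key, row, counter, chunk) with capacity, size,
    # and hit statistics.
    for row in session.execute("SELECT * FROM system_views.caches"):
        print(row)
    session.cluster.shutdown()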

On Mon, Aug 14, 2023 at 11:56 AM Elliott Sims via user <
user@cassandra.apache.org> wrote:

> 1.  Check for Nagle/delayed-ack, but probably nodelay is getting set by
> the driver so it shouldn't be a problem.
> 2.  Check for network latency (just regular old ping among hosts, during
> traffic)
> 3.  Check your GC metrics and see if garbage collections line up with
> outliers.  Some tuning can help there, depending on the pattern, but 40ms
> p99 at least would be fairly normal for G1GC.
> 4.  Check actual local write times, and I/O times with iostat.  If you
> have spinning drives, 40ms is fairly expected.  It's high but not totally
> unexpected for consumer-grade SSDs.  For enterprise-grade SSDs commit times
> that long would be very unusual.  What are your commitlog_sync settings?
>
> On Mon, Aug 14, 2023 at 8:43 AM Josh McKenzie 
> wrote:
>
>> The queries are rightly designed
>>
>> Data modeling in Cassandra is 100% gray space; there unfortunately is no
>> right or wrong design. You'll need to share basic shapes / contours of your
>> data model for other folks to help you; seemingly innocuous things in a
>> data model can cause unexpected issues w/C*'s storage engine paradigm
>> thanks to the partitioning and data storage happening under the hood.
>>
>> If you were seeing single digit ms on 3.0.X or 3.11.X and 40ms p95 on 4.0
>> I'd immediately look to the DB as being the culprit. For all other cases,
>> you should be seeing single digit ms as queries in C* generally boil down
>> to key/value lookups (partition key) to a list of rows you either point
>> query (key/value #2) or range scan via clustering keys and pull back out.
>>
>> There's also paging to take into consideration (whether you're using it
>> or not, what your page size is) and the data itself (do you have thousands
>> of columns? Multi-MB blobs you're pulling back out? etc). All can play into
>> this.
>>
>> On Fri, Aug 11, 2023, at 3:40 PM, Jeff Jirsa wrote:
>>
>> You’re going to have to help us help you
>>
>> 4.0 is pretty widely deployed. I’m not aware of a perf regression
>>
>> Can you give us a schema (anonymized) and queries and show us a trace ?
>>
>>
>> On Aug 10, 2023, at 10:18 PM, Shaurya Gupta 
>> wrote:
>>
>> 
>> The queries are rightly designed, as I already explained. 40 ms is way too
>> high compared to what I have seen with other DBs, and many times with
>> Cassandra 3.x versions.
>> CPU consumption, as I mentioned, is not high; it is around 20%.
>>
>> On Thu, Aug 10, 2023 at 5:14 PM MyWorld  wrote:
>>
>> Hi,
>> P95 should not be a problem if the table is rightly designed. Levelled
>> compaction strategy further reduces this; however, it consumes some
>> resources. For reads, caching is also helpful.
>> Can you check your CPU iowait, as it could be the reason for the delay?
>>
>> Regards,
>> Ashish
>>
>> On Fri, 11 Aug, 2023, 04:58 Shaurya Gupta, 
>> wrote:
>>
>> Hi community
>>
>> What is the expected p95 latency for Cassandra read and write queries
>> executed with LOCAL_QUORUM over a table with 3 replicas? The queries are
>> done using the partition + clustering key, and the row size is not large,
>> maybe 1-2 KB maximum.
>> Assume CPU is not a bottleneck.
>>
>> We observe these to be 40 ms p95 for reads and the same for writes. This
>> looks very high compared to what we expected. We are using Cassandra 4.0.
>>
>> Any documentation / numbers will be helpful.
>>
>> Thanks
>> --
>> Shaurya Gupta
>>
>>
>>
>> --
>> Shaurya Gupta
>>
>>
>>
>


-- 
Shaurya Gupta