Re: COUNTER timeout

2021-09-15 Thread Joe Obernberger

Thank you!
Clocks were out of sync; chronyd wasn't chrony'ding.
Going so much faster now!  Cheers.

-Joe
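
For anyone else chasing the same clock-skew symptom: a minimal Java sketch for measuring a node's clock offset against an NTP server (this assumes the Apache Commons Net library is on the classpath; the server name is a placeholder - point it at whatever time source chronyd is supposed to use):

import java.net.InetAddress;
import org.apache.commons.net.ntp.NTPUDPClient;
import org.apache.commons.net.ntp.TimeInfo;

public class ClockOffsetCheck {
    public static void main(String[] args) throws Exception {
        NTPUDPClient client = new NTPUDPClient();
        client.setDefaultTimeout(5000); // fail fast if the server is unreachable
        // Placeholder time source - use your own pool or internal NTP server.
        InetAddress server = InetAddress.getByName("pool.ntp.org");
        TimeInfo info = client.getTime(server);
        info.computeDetails(); // derives offset/delay from the NTP packet
        System.out.println("Local clock offset vs NTP: " + info.getOffset() + " ms");
        client.close();
    }
}

Run it on each Cassandra node; offsets that differ between nodes by more than a few hundred milliseconds are enough to produce the cross-node latency numbers seen later in this thread.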


Re: COUNTER timeout

2021-09-15 Thread Bowen Song
Well, the log says cross-node timeout, with latency a bit over 44 seconds. Here are a few of the most likely causes:

1. The clocks are not in sync - please check the time on each server, and ensure an NTP client is running on all Cassandra servers.

2. Long stop-the-world GC pauses - please check the GC logs and make sure this isn't the case (one way to spot long pauses remotely is sketched just after this list).

3. Overload - please monitor the CPU usage and disk IO when the timeouts happen and make sure they are not the bottleneck.
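
On cause 2: one way to spot long GC pauses without access to the GC logs is to sample the JVM's garbage-collector MBeans over JMX. A minimal sketch, assuming JMX is enabled on Cassandra's default port 7199 with no authentication (adjust the host, port, and credentials to your setup):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class GcPauseCheck {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost"; // node to inspect
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Every JVM exposes one MBean per collector under this pattern.
            for (ObjectName gc : mbsc.queryNames(
                    new ObjectName("java.lang:type=GarbageCollector,*"), null)) {
                long count = (Long) mbsc.getAttribute(gc, "CollectionCount");
                long timeMs = (Long) mbsc.getAttribute(gc, "CollectionTime");
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getKeyProperty("name"), count, timeMs);
            }
        }
    }
}

Sampling these counters twice, a few seconds apart, turns a long stop-the-world pause into an obvious jump in CollectionTime.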





Re: COUNTER timeout

2021-09-15 Thread Joe Obernberger

Thank you Erick - looking through all the logs on the nodes I found this:

INFO  [CompactionExecutor:17551] 2021-09-15 15:13:20,524 CompactionTask.java:245 - Compacted (fb0cdca0-1658-11ec-9098-dd70c3a3487a) 4 sstables to [/data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96619-big,] to level=0.  9.762MiB to 9.672MiB (~99% of original) in 3,873ms.  Read Throughput = 2.520MiB/s, Write Throughput = 2.497MiB/s, Row Throughput = ~125,729/s.  255,171 total partitions merged to 251,458.  Partition merge counts were {1:247758, 2:3687, 3:13, }
INFO  [NonPeriodicTasks:1] 2021-09-15 15:13:20,524 SSTable.java:111 - Deleting sstable: /data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96618-big
INFO  [NonPeriodicTasks:1] 2021-09-15 15:13:20,525 SSTable.java:111 - Deleting sstable: /data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96575-big
INFO  [NonPeriodicTasks:1] 2021-09-15 15:13:20,526 SSTable.java:111 - Deleting sstable: /data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96607-big
INFO  [NonPeriodicTasks:1] 2021-09-15 15:13:20,532 SSTable.java:111 - Deleting sstable: /data/7/cassandra/data/doc/fieldcounts-03b67080ada111ebade9fdc1d34336d3/nb-96554-big
DEBUG [epollEventLoopGroup-5-85] 2021-09-15 15:13:20,642 InitialConnectionHandler.java:121 - Response to STARTUP sent, configuring pipeline for 5/v5
DEBUG [epollEventLoopGroup-5-85] 2021-09-15 15:13:20,643 InitialConnectionHandler.java:153 - Configured pipeline: DefaultChannelPipeline{(frameDecoder = org.apache.cassandra.net.FrameDecoderCrc), (frameEncoder = org.apache.cassandra.net.FrameEncoderCrc), (cqlProcessor = org.apache.cassandra.transport.CQLMessageHandler), (exceptionHandler = org.apache.cassandra.transport.ExceptionHandlers$PostV5ExceptionHandler)}
INFO  [ScheduledTasks:1] 2021-09-15 15:13:21,976 MessagingMetrics.java:206 - COUNTER_MUTATION_RSP messages were dropped in last 5000 ms: 0 internal and 1 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 44285 ms


So - yes, nodes are dropping mutations.  I did find a node where one of 
the drives was pegged.  Fixed that - but it's still happening.  This 
happened after adding a relatively large node (.44) to the cluster:


nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.100.251  526.35 GiB  200     35.1%             660f476c-a124-4ca0-b55f-75efe56370da  rack1
UN  172.16.100.252  537.14 GiB  200     34.8%             e83aa851-69b4-478f-88f6-60e657ea6539  rack1
UN  172.16.100.249  548.82 GiB  200     34.6%             49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
UN  172.16.100.36   561.85 GiB  200     35.0%             d9702f96-256e-45ae-8e12-69a42712be50  rack1
UN  172.16.100.39   547.86 GiB  200     34.2%             93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
UN  172.16.100.253  11.52 GiB   4       0.7%              a1a16910-9167-4174-b34b-eb859d36347e  rack1
UN  172.16.100.248  560.63 GiB  200     35.0%             4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
UN  172.16.100.44   432.76 GiB  200     34.7%             b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
UN  172.16.100.37   331.31 GiB  120     20.5%             08a19658-40be-4e55-8709-812b3d4ac750  rack1
UN  172.16.100.250  501.62 GiB  200     35.3%             b74b6e65-af63-486a-b07f-9e304ec30a39  rack1


At this point I'm not sure what's going on.  Some repairs have failed 
over the past few days.


-Joe
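
The dropped-message counts in that MessagingMetrics line are also published over JMX (and summarized by nodetool tpstats), so they can be watched per node instead of grepped out of logs. A sketch in the same vein as the GC check above; the MBean name follows Cassandra's DroppedMessage metrics and the Count attribute comes from the Dropwizard JMX reporter, so treat both as assumptions to verify against your version:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DroppedCounterMutations {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Total COUNTER_MUTATION messages this node has dropped since startup.
            ObjectName dropped = new ObjectName(
                    "org.apache.cassandra.metrics:type=DroppedMessage,"
                            + "scope=COUNTER_MUTATION,name=Dropped");
            System.out.println(host + " dropped counter mutations: "
                    + mbsc.getAttribute(dropped, "Count"));
        }
    }
}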


Re: COUNTER timeout

2021-09-15 Thread Bowen Song
Check the logs on the Cassandra servers first. Many different things can 
cause the same result, and you will have to dig deeper to discover the 
true cause.





Re: COUNTER timeout

2021-09-14 Thread Erick Ramirez
The obvious conclusion is that the nodes can't keep up, so it would
be interesting to know how often you're issuing the counter updates. Also,
how are the commit log disks performing on the nodes? If you have
monitoring in place, check the IO stats/metrics. And finally, review the
logs on the nodes to see if they are indeed dropping mutations. Cheers!


COUNTER timeout

2021-09-14 Thread Joe Obernberger

I'm getting a lot of the following errors during ingest of data:

com.datastax.oss.driver.api.core.servererrors.WriteTimeoutException: 
Cassandra timeout during COUNTER write query at consistency ONE (1 
replica were required but only 0 acknowledged the write)
    at 
com.datastax.oss.driver.api.core.servererrors.WriteTimeoutException.copy(WriteTimeoutException.java:96)
    at 
com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
    at 
com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:53)
    at 
com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30)
    at 
com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230)
    at 
com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:54)


The CQL being executed is:
"update doc.seq set doccount=doccount+? where id=?"

Table is:

CREATE TABLE doc.seq (
    id text PRIMARY KEY,
    doccount counter
) WITH additional_write_policy = '99p'
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND cdc = false
    AND comment = ''
    AND compaction = {'class': 
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 
'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '16', 'class': 
'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND extensions = {}
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99p';

The doc.seq table only has 356 rows in total.  What could cause this 
timeout error?

Thank you!

-Joe
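
For context on where that exception surfaces: the stack trace above comes from a synchronous execute() in the DataStax Java driver 4.x (the SyncCqlSession.execute frame). A minimal sketch of issuing the same counter update with a bounded retry - the contact-point defaults, id value, and retry count are placeholder assumptions, not the actual ingest code, and note that counter increments are not idempotent, so retrying a timed-out increment can double-count:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.BoundStatement;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.servererrors.WriteTimeoutException;

public class CounterUpdateExample {
    public static void main(String[] args) {
        // Builder defaults connect to localhost:9042; the datacenter name
        // matches the nodetool output earlier in this thread.
        try (CqlSession session = CqlSession.builder()
                .withLocalDatacenter("datacenter1")
                .build()) {
            PreparedStatement ps = session.prepare(
                    "update doc.seq set doccount=doccount+? where id=?");
            BoundStatement bound = ps.bind(1L, "some-doc-id"); // hypothetical values
            int attempts = 0;
            while (true) {
                try {
                    session.execute(bound);
                    break;
                } catch (WriteTimeoutException e) {
                    // A timed-out increment may still have been applied on the
                    // server, so this blind retry trades accuracy for progress.
                    if (++attempts >= 3) throw e;
                }
            }
        }
    }
}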