Data Corruption due to multiple Cassandra 2.1 processes?

2018-08-06 Thread Steinmaurer, Thomas
Hello,

With 2.1, if a second Cassandra process/instance is started on a host (by 
accident), can this result in some sort of corruption, even though Cassandra 
will exit at some point because it cannot bind TCP ports that are already in 
use?

What we have seen in this scenario is something like this:

ERROR [main] 2018-08-05 21:10:24,046 CassandraDaemon.java:120 - Error starting 
local jmx server:
java.rmi.server.ExportException: Port already in use: 7199; nested exception is:
java.net.BindException: Address already in use (Bind failed)
...

But it then continues with things like opening system and even user tables:

INFO  [main] 2018-08-05 21:10:24,060 CacheService.java:110 - Initializing key 
cache with capacity of 100 MBs.
INFO  [main] 2018-08-05 21:10:24,067 CacheService.java:132 - Initializing row 
cache with capacity of 0 MBs
INFO  [main] 2018-08-05 21:10:24,073 CacheService.java:149 - Initializing 
counter cache with capacity of 50 MBs
INFO  [main] 2018-08-05 21:10:24,074 CacheService.java:160 - Scheduling counter 
cache save to every 7200 seconds (going to save all keys).
INFO  [main] 2018-08-05 21:10:24,161 ColumnFamilyStore.java:365 - Initializing 
system.sstable_activity
INFO  [SSTableBatchOpen:2] 2018-08-05 21:10:24,692 SSTableReader.java:475 - 
Opening 
/var/opt/xxx-managed/cassandra/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-165
 (2023 bytes)
INFO  [SSTableBatchOpen:3] 2018-08-05 21:10:24,692 SSTableReader.java:475 - 
Opening 
/var/opt/xxx-managed/cassandra/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-167
 (2336 bytes)
INFO  [SSTableBatchOpen:1] 2018-08-05 21:10:24,692 SSTableReader.java:475 - 
Opening 
/var/opt/xxx-managed/cassandra/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-166
 (2686 bytes)
INFO  [main] 2018-08-05 21:10:24,755 ColumnFamilyStore.java:365 - Initializing 
system.hints
INFO  [SSTableBatchOpen:1] 2018-08-05 21:10:24,758 SSTableReader.java:475 - 
Opening 
/var/opt/xxx-managed/cassandra/system/hints-2666e20573ef38b390fefecf96e8f0c7/system-hints-ka-377
 (46210621 bytes)
INFO  [main] 2018-08-05 21:10:24,766 ColumnFamilyStore.java:365 - Initializing 
system.compaction_history
INFO  [SSTableBatchOpen:1] 2018-08-05 21:10:24,768 SSTableReader.java:475 - 
Opening 
/var/opt/xxx-managed/cassandra/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/system-compaction_history-ka-129
 (91269 bytes)
...

Replaying commit logs:

...
INFO  [main] 2018-08-05 21:10:25,896 CommitLogReplayer.java:267 - Replaying 
/var/opt/dynatrace-managed/cassandra/commitlog/CommitLog-4-1533133668366.log
INFO  [main] 2018-08-05 21:10:25,896 CommitLogReplayer.java:270 - Replaying 
/var/opt/dynatrace-managed/cassandra/commitlog/CommitLog-4-1533133668366.log 
(CL version 4, messaging version 8)
...

It is even writing memtables already (below only system tables are pasted, but 
user tables are flushed as well):

...
INFO  [MemtableFlushWriter:4] 2018-08-05 21:11:52,524 Memtable.java:347 - 
Writing Memtable-size_estimates@1941663179(2.655MiB serialized bytes, 325710 
ops, 2%/0% of on/off-heap limit)
INFO  [MemtableFlushWriter:3] 2018-08-05 21:11:52,552 Memtable.java:347 - 
Writing Memtable-peer_events@1474667699(0.199KiB serialized bytes, 4 ops, 0%/0% 
of on/off-heap limit)
...

Until it comes to a point where it can't bind ports like the storage port 7000:

ERROR [main] 2018-08-05 21:11:54,350 CassandraDaemon.java:395 - Fatal 
configuration error
org.apache.cassandra.exceptions.ConfigurationException: /XXX:7000 is in use by 
another process.  Change listen_address:storage_port in cassandra.yaml to 
values that do not conflict with other services
at 
org.apache.cassandra.net.MessagingService.getServerSockets(MessagingService.java:495)
 ~[apache-cassandra-2.1.18.jar:2.1.18]
...

Until Cassandra stops:

...
INFO  [StorageServiceShutdownHook] 2018-08-05 21:11:54,361 Gossiper.java:1454 - 
Announcing shutdown
...


So we have a window of around 2 minutes where the second Cassandra process is 
meddling with existing data, although it shouldn't be.

Sounds like a potential candidate for data corruption, right? E.g. later on we 
see things like the following (while the shutdown was still in progress?):

WARN  [SharedPool-Worker-1] 2018-08-05 21:11:58,181 
AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
Thread[SharedPool-Worker-1,5,main]: {}
java.lang.RuntimeException: java.io.FileNotFoundException: 
/var/opt/xxx-managed/cassandra/xxx/xxx-fdc68b70950611e8ad7179f2d5bfa3cf/xxx-xxx-ka-15-Data.db
 (No such file or directory)
at 
org.apache.cassandra.io.compress.CompressedRandomAccessReader.open(CompressedRandomAccessReader.java:52)
 ~[apache-cassandra-2.1.18.jar:2.1.18]
at 
org.apache.cassandra.io.util.CompressedPoolingSegmentedFile.createPooledReader(CompressedPoolingSegmentedFile.java:95)
 ~[apache-cassandra-2.1.18.jar:2.1.18]
at 
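For operators who want a guard against this scenario, here is a minimal, hypothetical pre-start check (not part of the original report): it refuses to launch a second instance if the usual Cassandra ports already have a listener. The port numbers are the 2.1 defaults (JMX 7199, storage 7000, native transport 9042); the actual listen/rpc addresses come from cassandra.yaml and may not be 127.0.0.1.

import socket
import sys

CASSANDRA_PORTS = {7199: "JMX", 7000: "storage", 9042: "native transport"}

def port_in_use(port, host="127.0.0.1"):
    # connect_ex returns 0 when something is already listening on the port
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        return s.connect_ex((host, port)) == 0

if __name__ == "__main__":
    busy = {name: port for port, name in CASSANDRA_PORTS.items() if port_in_use(port)}
    if busy:
        print("Refusing to start Cassandra, ports already in use: %s" % busy,
              file=sys.stderr)
        sys.exit(1)
    # otherwise hand over to the normal startup script / service manager

A wrapper like this only narrows the window; a service manager (or pid/lock file) that enforces a single instance per host is the more robust fix.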

Re: Hinted Handoff

2018-08-06 Thread kurt greaves
>
> Does Cassandra TTL out the hints after max_hint_window_in_ms? From my
> understanding, Cassandra only stops collecting hints after
> max_hint_window_in_ms but can still keep replaying the hints if the node
> comes back again. Is this correct? Is there a way to TTL out hints?


No, but it won't send hints that have passed the HH window. Also, this
shouldn't be caused by HH, as the hints maintain the original timestamp with
which they were written.

Honestly, this sounds more like a use case for a distributed cache rather
than Cassandra. Keeping data for 30 minutes and then deleting it is going
to be a nightmare to manage in Cassandra.
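As a rough illustration of the gc_grace_seconds / hint-window interplay discussed further down this thread (a sketch, not taken from any of the messages): note the unit mismatch between max_hint_window_in_ms (milliseconds) and gc_grace_seconds (seconds). The values below are the 3h default hint window and the 15-minute GC grace mentioned in this thread.

max_hint_window_in_ms = 3 * 60 * 60 * 1000   # cassandra.yaml default: 10800000 ms
gc_grace_seconds = 15 * 60                   # per-table setting: 900 s

if max_hint_window_in_ms > gc_grace_seconds * 1000:
    print("Hint window exceeds gc_grace_seconds: a replayed hint can re-insert "
          "data whose tombstone has already been purged.")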

On 7 August 2018 at 07:20, Agrawal, Pratik wrote:

> Does Cassandra TTL out the hints after max_hint_window_in_ms? From my
> understanding, Cassandra only stops collecting hints after
> max_hint_window_in_ms but can still keep replaying the hints if the node
> comes back again. Is this correct? Is there a way to TTL out hints?
>
>
>
> Thanks,
>
> Pratik
>
>
>
> *From: *Kyrylo Lebediev 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Monday, August 6, 2018 at 4:10 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: Hinted Handoff
>
>
>
> Small gc_grace_seconds value lowers max allowed node downtime, which is 15
> minutes in your case. After 15 minutes of downtime you'll need to replace
> the node, as you described. This interval looks too short to be able to do
> planned maintenance. So, in case you set larger value for gc_grace_seconds
> (lets say, hours or a day) will you get visible read amplification / waste
> a lot of disk space / issues with compactions?
>
>
>
> Hinted handoff may be the reason in case hinted handoff window is longer
> than gc_grace_seconds. To me it looks like hinted handoff window
> (max_hint_window_in_ms in cassandra.yaml, which defaults to 3h) must always
> be set to a value less than gc_grace_seconds.
>
>
>
> Regards,
>
> Kyrill
> --
>
> *From:* Agrawal, Pratik 
> *Sent:* Monday, August 6, 2018 8:22:27 PM
> *To:* user@cassandra.apache.org
> *Subject:* Hinted Handoff
>
>
>
> Hello all,
>
> We use Cassandra in non-conventional way, where our data is short termed
> (life cycle of about 20-30 minutes) where each record is updated ~5 times
> and then deleted. We have GC grace of 15 minutes.
>
> We are seeing 2 problems
>
> 1.) A certain number of Cassandra nodes goes down and then we remove it
> from the cluster using Cassandra removenode command and replace the dead
> nodes with new nodes. While new nodes are joining in, we see more nodes
> down (which are not actually down) but we see following errors in the log
>
> “Gossip not settled after 321 polls. Gossip Stage
> active/pending/completed: 1/816/0”
>
>
>
> To fix the issue, I restarted the server and the nodes now appear to be up
> and the problem is solved
>
>
>
> Can this problem be related to https://issues.apache.org/jira/browse/CASSANDRA-6590 ?
>
>
>
> 2.) Meanwhile, after restarting the nodes mentioned above, we see that
> some old deleted data is resurrected (because of short lifecycle of our
> data). My guess at the moment is that these data is resurrected due to
> hinted handoff. Interesting point to note here is that data keeps
> resurrecting at periodic intervals (like an hour) and then finally stops.
> Could this be caused by hinted handoff? if so is there any setting which we
> can set to specify that “invalidate, hinted handoff data after 5-10
> minutes”.
>
>
>
> Thanks,
> Pratik
>


Re: Bootstrap OOM issues with Cassandra 3.11.1

2018-08-06 Thread Jeff Jirsa


Upgrading to 3.11.3 may fix it (there were some memory recycling bugs fixed 
recently), but analyzing the heap will be the best option.

If you can print out the heap histogram and stack trace, or open a heap dump in 
YourKit or VisualVM or MAT, and show us what's at the top of the reclaimed 
objects, we may be able to figure out what's going on.

-- 
Jeff Jirsa
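A minimal sketch (not from the thread) of capturing what Jeff asks for: a class histogram, a thread dump, and a full heap dump for YourKit / VisualVM / MAT. It assumes the JDK's jmap/jstack are on the PATH and that it runs as the same user as the Cassandra process; the PID and file names are placeholders.

import subprocess

def capture_heap_artifacts(pid, prefix="cassandra-oom"):
    # Class histogram: instance counts and bytes per class, much cheaper than a full dump.
    with open(prefix + "-histo.txt", "w") as out:
        subprocess.run(["jmap", "-histo:live", str(pid)], stdout=out, check=True)
    # Thread dump, to see what the remaining busy thread is doing.
    with open(prefix + "-threads.txt", "w") as out:
        subprocess.run(["jstack", str(pid)], stdout=out, check=True)
    # Full heap dump (can be large and pauses the JVM while it is written).
    subprocess.run(["jmap", "-dump:live,format=b,file=%s.hprof" % prefix, str(pid)],
                   check=True)

# capture_heap_artifacts(12345)  # replace 12345 with the Cassandra PID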


> On Aug 6, 2018, at 5:42 PM, Jeff Jirsa  wrote:
> 
> Are you using materialized views or secondary indices? 
> 
> -- 
> Jeff Jirsa
> 
> 
>> On Aug 6, 2018, at 3:49 PM, Laszlo Szabo  
>> wrote:
>> 
>> Hello All,
>> 
>> I'm having JVM unstable / OOM errors when attempting to auto bootstrap a 9th 
>> node to an existing 8 node cluster (256 tokens).  Each machine has 24 cores 
>> 148GB RAM and 10TB (2TB used).  Under normal operation the 8 nodes have JVM 
>> memory configured with Xms35G and Xmx35G, and handle 2-4 billion inserts per 
>> day.  There are never updates, deletes, or sparsely populated rows.  
>> 
>> For the bootstrap node, I've tried memory values from 35GB to 135GB in 10GB 
>> increments. I've tried using both memtable_allocation_types (heap_buffers 
>> and offheap_buffers).  I've not tried modifying the 
>> memtable_cleanup_threshold but instead have tried memtable_flush_writers 
>> from 2 to 8.  I've tried memtable_(off)heap_space_in_mb from 2 to 6. 
>>  I've tried both CMS and G1 garbage collection with various settings.  
>> 
>> Typically, after streaming about ~2TB of data, CPU load will hit a maximum, 
>> and the "nodetool info" heap memory will, over the course of an hour, 
>> approach the maximum.  At that point, CPU load will drop to a single thread 
>> with minimal activity until the system becomes unstable and eventually the 
>> OOM error occurs.
>> 
>> Excerpt of the system log is below, and what I consistently see is the 
>> MemtableFlushWriter and the MemtableReclaimMemory pending queues grow as the 
>> memory becomes depleted, but the number of completed seems to stop changing 
>> a few minutes after the CPU load spikes.
>> 
>> One other data point is there seems to be a huge number of mutations that 
>> occur after most of the stream has occured.  Concurrent_writes is set at 256 
>> with the queue getting as high as 200K before dropping down.  
>> 
>> Any suggestions for yaml changes or jvm changes?  JVM.options is currently 
>> the default with the memory set to the max, the current YAML file is below.
>> 
>> Thanks!
>> 
>> 
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,329 StatusLogger.java:51 - 
>>> MutationStage 1 2  191498052 0  
>>>0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,331 StatusLogger.java:51 - 
>>> ViewMutationStage 0 0  0 0  
>>>0
>>> INFO  [Service Thread] 2018-08-06 17:49:26,338 StatusLogger.java:51 - 
>>> PerDiskMemtableFlushWriter_0 0 0   5865 0   
>>>   0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,343 StatusLogger.java:51 - 
>>> ReadStage 0 0  0 0  
>>>0
>>> INFO  [Service Thread] 2018-08-06 17:49:26,347 StatusLogger.java:51 - 
>>> ValidationExecutor0 0  0 0  
>>>0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,360 StatusLogger.java:51 - 
>>> RequestResponseStage  0 0  8 0  
>>>0
>>> INFO  [Service Thread] 2018-08-06 17:49:26,380 StatusLogger.java:51 - 
>>> Sampler   0 0  0 0  
>>>0
>>> INFO  [Service Thread] 2018-08-06 17:49:26,382 StatusLogger.java:51 - 
>>> MemtableFlushWriter   8 74293   4716 0  
>>>0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,388 StatusLogger.java:51 - 
>>> ReadRepairStage   0 0  0 0  
>>>0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,389 StatusLogger.java:51 - 
>>> CounterMutationStage  0 0  0 0  
>>>0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,404 StatusLogger.java:51 - 
>>> MiscStage 0 0  0 0  
>>>0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,407 StatusLogger.java:51 - 
>>> CompactionExecutor813493 0  
>>>0
>>> INFO  [Service Thread] 2018-08-06 17:49:26,410 StatusLogger.java:51 - 
>>> InternalResponseStage 0 0 16 0  
>>>0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,413 StatusLogger.java:51 - 
>>> MemtableReclaimMemory 1  6066356 0  
>>>0
>>> INFO  [Service Thread] 2018-08-06 17:49:26,421 StatusLogger.java:51 - 
>>> 

Re: Bootstrap OOM issues with Cassandra 3.11.1

2018-08-06 Thread Jeff Jirsa
Are you using materialized views or secondary indices? 

-- 
Jeff Jirsa


> On Aug 6, 2018, at 3:49 PM, Laszlo Szabo  
> wrote:
> 
> Hello All,
> 
> I'm having JVM unstable / OOM errors when attempting to auto bootstrap a 9th 
> node to an existing 8 node cluster (256 tokens).  Each machine has 24 cores 
> 148GB RAM and 10TB (2TB used).  Under normal operation the 8 nodes have JVM 
> memory configured with Xms35G and Xmx35G, and handle 2-4 billion inserts per 
> day.  There are never updates, deletes, or sparsely populated rows.  
> 
> For the bootstrap node, I've tried memory values from 35GB to 135GB in 10GB 
> increments. I've tried using both memtable_allocation_types (heap_buffers and 
> offheap_buffers).  I've not tried modifying the memtable_cleanup_threshold 
> but instead have tried memtable_flush_writers from 2 to 8.  I've tried 
> memtable_(off)heap_space_in_mb from 2 to 6.  I've tried both CMS and 
> G1 garbage collection with various settings.  
> 
> Typically, after streaming about ~2TB of data, CPU load will hit a maximum, 
> and the "nodetool info" heap memory will, over the course of an hour, 
> approach the maximum.  At that point, CPU load will drop to a single thread 
> with minimal activity until the system becomes unstable and eventually the 
> OOM error occurs.
> 
> Excerpt of the system log is below, and what I consistently see is the 
> MemtableFlushWriter and the MemtableReclaimMemory pending queues grow as the 
> memory becomes depleted, but the number of completed seems to stop changing a 
> few minutes after the CPU load spikes.
> 
> One other data point is there seems to be a huge number of mutations that 
> occur after most of the stream has occured.  Concurrent_writes is set at 256 
> with the queue getting as high as 200K before dropping down.  
> 
> Any suggestions for yaml changes or jvm changes?  JVM.options is currently 
> the default with the memory set to the max, the current YAML file is below.
> 
> Thanks!
> 
> 
>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,329 StatusLogger.java:51 - 
>> MutationStage 1 2  191498052 0   
>>   0
>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,331 StatusLogger.java:51 - 
>> ViewMutationStage 0 0  0 0   
>>   0
>> INFO  [Service Thread] 2018-08-06 17:49:26,338 StatusLogger.java:51 - 
>> PerDiskMemtableFlushWriter_0 0 0   5865 0
>>  0
>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,343 StatusLogger.java:51 - 
>> ReadStage 0 0  0 0   
>>   0
>> INFO  [Service Thread] 2018-08-06 17:49:26,347 StatusLogger.java:51 - 
>> ValidationExecutor0 0  0 0   
>>   0
>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,360 StatusLogger.java:51 - 
>> RequestResponseStage  0 0  8 0   
>>   0
>> INFO  [Service Thread] 2018-08-06 17:49:26,380 StatusLogger.java:51 - 
>> Sampler   0 0  0 0   
>>   0
>> INFO  [Service Thread] 2018-08-06 17:49:26,382 StatusLogger.java:51 - 
>> MemtableFlushWriter   8 74293   4716 0   
>>   0
>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,388 StatusLogger.java:51 - 
>> ReadRepairStage   0 0  0 0   
>>   0
>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,389 StatusLogger.java:51 - 
>> CounterMutationStage  0 0  0 0   
>>   0
>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,404 StatusLogger.java:51 - 
>> MiscStage 0 0  0 0   
>>   0
>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,407 StatusLogger.java:51 - 
>> CompactionExecutor813493 0   
>>   0
>> INFO  [Service Thread] 2018-08-06 17:49:26,410 StatusLogger.java:51 - 
>> InternalResponseStage 0 0 16 0   
>>   0
>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,413 StatusLogger.java:51 - 
>> MemtableReclaimMemory 1  6066356 0   
>>   0
>> INFO  [Service Thread] 2018-08-06 17:49:26,421 StatusLogger.java:51 - 
>> AntiEntropyStage  0 0  0 0   
>>   0
>> INFO  [Service Thread] 2018-08-06 17:49:26,430 StatusLogger.java:51 - 
>> CacheCleanupExecutor  0 0  0 0   
>>   0
>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,431 StatusLogger.java:51 - 
>> PendingRangeCalculator0 0  9 0   
>>   0
>> INFO  [Service Thread] 2018-08-06 17:49:26,436 StatusLogger.java:61 - 
>> 

Bootstrap OOM issues with Cassandra 3.11.1

2018-08-06 Thread Laszlo Szabo
 Hello All,

I'm having JVM instability / OOM errors when attempting to auto-bootstrap a
9th node to an existing 8-node cluster (256 tokens).  Each machine has 24
cores, 148GB RAM, and 10TB of disk (2TB used).  Under normal operation the 8
nodes have JVM memory configured with Xms35G and Xmx35G, and handle 2-4 billion
inserts per day.  There are never updates, deletes, or sparsely populated
rows.

For the bootstrap node, I've tried memory values from 35GB to 135GB in 10GB
increments. I've tried using both memtable_allocation_types (heap_buffers
and offheap_buffers).  I've not tried modifying the
memtable_cleanup_threshold but instead have tried memtable_flush_writers
from 2 to 8.  I've tried memtable_(off)heap_space_in_mb from 2 to
6.  I've tried both CMS and G1 garbage collection with various
settings.

Typically, after streaming about ~2TB of data, CPU load will hit a maximum,
and the "nodetool info" heap memory will, over the course of an hour,
approach the maximum.  At that point, CPU load will drop to a single thread
with minimal activity until the system becomes unstable and eventually the
OOM error occurs.

Excerpt of the system log is below, and what I consistently see is the
MemtableFlushWriter and the MemtableReclaimMemory pending queues grow as
the memory becomes depleted, but the number of completed seems to stop
changing a few minutes after the CPU load spikes.

One other data point is that there seems to be a huge number of mutations that
occur after most of the stream has completed.  concurrent_writes is set at
256, with the queue getting as high as 200K before dropping down.

Any suggestions for yaml or JVM changes?  jvm.options is currently
the default with the memory set to the max; the current YAML file is below.

Thanks!


INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,329 StatusLogger.java:51 -
> MutationStage 1 2  191498052 0
>0
> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,331 StatusLogger.java:51 -
> ViewMutationStage 0 0  0 0
>0
> INFO  [Service Thread] 2018-08-06 17:49:26,338 StatusLogger.java:51 -
> PerDiskMemtableFlushWriter_0 0 0   5865 0
>0
> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,343 StatusLogger.java:51 -
> ReadStage 0 0  0 0
>0
> INFO  [Service Thread] 2018-08-06 17:49:26,347 StatusLogger.java:51 -
> ValidationExecutor0 0  0 0
>0
> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,360 StatusLogger.java:51 -
> RequestResponseStage  0 0  8 0
>0
> INFO  [Service Thread] 2018-08-06 17:49:26,380 StatusLogger.java:51 -
> Sampler   0 0  0 0
>0
> INFO  [Service Thread] 2018-08-06 17:49:26,382 StatusLogger.java:51 -
> MemtableFlushWriter   8 74293   4716 0
>0
> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,388 StatusLogger.java:51 -
> ReadRepairStage   0 0  0 0
>0
> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,389 StatusLogger.java:51 -
> CounterMutationStage  0 0  0 0
>0
> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,404 StatusLogger.java:51 -
> MiscStage 0 0  0 0
>0
> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,407 StatusLogger.java:51 -
> CompactionExecutor813493 0
>0
> INFO  [Service Thread] 2018-08-06 17:49:26,410 StatusLogger.java:51 -
> InternalResponseStage 0 0 16 0
>0
> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,413 StatusLogger.java:51 -
> MemtableReclaimMemory 1  6066356 0
>0
> INFO  [Service Thread] 2018-08-06 17:49:26,421 StatusLogger.java:51 -
> AntiEntropyStage  0 0  0 0
>0
> INFO  [Service Thread] 2018-08-06 17:49:26,430 StatusLogger.java:51 -
> CacheCleanupExecutor  0 0  0 0
>0
> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,431 StatusLogger.java:51 -
> PendingRangeCalculator0 0  9 0
>0
> INFO  [Service Thread] 2018-08-06 17:49:26,436 StatusLogger.java:61 -
> CompactionManager 819




 Current Yaml

num_tokens: 256

hinted_handoff_enabled: true

hinted_handoff_throttle_in_kb: 10240

max_hints_delivery_threads: 8

hints_flush_period_in_ms: 1

max_hints_file_size_in_mb: 128

batchlog_replay_throttle_in_kb: 10240

authenticator: AllowAllAuthenticator

authorizer: AllowAllAuthorizer

role_manager: CassandraRoleManager


Re: dynamic_snitch=false, prioritisation/order or reads from replicas

2018-08-06 Thread Kyrylo Lebediev
Thank you for replying, Alain!


Better use of caches for 'pinned' requests explains the CL=ONE case well.


But in the case of CL=QUORUM/LOCAL_QUORUM, if I'm not wrong, the read request is 
sent to all replicas, waiting for the first 2 to reply.

When dynamic snitching is turned on, the "data" request is sent to "the fastest 
replica", and the "digest" requests to the rest of the replicas.

But a digest is anyway the same read operation [from SSTables through the 
filesystem cache] plus calculating and sending a hash to the coordinator. It 
looks like the only change with dynamic_snitch=false is that the "data" request 
is sent to a fixed node instead of "currently the fastest one".

So, if there are no mistakes in the description above, the improvement 
shouldn't be very visible for CL=*QUORUM...


Did you get improved response times for CL=ONE only, or for higher CLs as well?


Indeed an interesting thread in Jira.


Thanks,

Kyrill


From: Alain RODRIGUEZ 
Sent: Monday, August 6, 2018 8:26:43 PM
To: user@cassandra.apache.org
Subject: Re: dynamic_snitch=false, prioritisation/order or reads from replicas

Hello,

There are reports (in this ML too) that disabling dynamic snitching decreases 
response time.

I confirm that I have seen this improvement on clusters under pressure.

What effects stand behind this improvement?

My understanding is that this is due to the fact that the clients are then 
'pinned', more sticking to specific nodes when the dynamic snitching is off. I 
guess there is a better use of caches and in-memory structures, reducing the 
amount of disk read needed, which can lead to way more performances than 
switching from node to node as soon as the score of some node is not good 
enough.
I am also not sure that the score calculation is always relevant, thus 
increasing the threshold before switching reads to another node is still often 
worst than disabling it completely. I am not sure if the score calculation was 
fixed, but in most cases, I think it's safer to run with 'dynamic_snitch: 
false'. Anyway, it's possible to test it on a canary node (or entire rack) and 
look at the p99 for read latencies for example :).

This ticket is old, but was precisely on that topic: 
https://issues.apache.org/jira/browse/CASSANDRA-6908

C*heers
---
Alain Rodriguez - @arodream - 
al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-08-04 15:37 GMT+02:00 Kyrylo Lebediev <kyrylo_lebed...@epam.com.invalid>:

Hello!


In case when dynamic snitching is enabled data is read from 'the fastest 
replica' and other replicas send digests for CL=QUORUM/LOCAL_QUORUM .

When dynamic snitching is disabled, as the concept of the fastest replica 
disappears, which rules are used to choose from which replica to read actual 
data (not digests):

 1) when all replicas are online

 2) when the node primarily responsible for the token range is offline.


There are reports (in this ML too) that disabling dynamic snitching decreases 
response time.

What effects stand behind this improvement?


Regards,

Kyrill



Re: Hinted Handoff

2018-08-06 Thread Agrawal, Pratik
Does Cassandra TTL out the hints after max_hint_window_in_ms? From my 
understanding, Cassandra only stops collecting hints after 
max_hint_window_in_ms but can still keep replaying the hints if the node comes 
back again. Is this correct? Is there a way to TTL out hints?

Thanks,
Pratik

From: Kyrylo Lebediev 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, August 6, 2018 at 4:10 PM
To: "user@cassandra.apache.org" 
Subject: Re: Hinted Handoff


Small gc_grace_seconds value lowers max allowed node downtime, which is 15 
minutes in your case. After 15 minutes of downtime you'll need to replace the 
node, as you described. This interval looks too short to be able to do planned 
maintenance. So, in case you set larger value for gc_grace_seconds (lets say, 
hours or a day) will you get visible read amplification / waste a lot of disk 
space / issues with compactions?



Hinted handoff may be the reason in case hinted handoff window is longer than 
gc_grace_seconds. To me it looks like hinted handoff window 
(max_hint_window_in_ms in cassandra.yaml, which defaults to 3h) must always be 
set to a value less than gc_grace_seconds.



Regards,

Kyrill


From: Agrawal, Pratik 
Sent: Monday, August 6, 2018 8:22:27 PM
To: user@cassandra.apache.org
Subject: Hinted Handoff


Hello all,

We use Cassandra in non-conventional way, where our data is short termed (life 
cycle of about 20-30 minutes) where each record is updated ~5 times and then 
deleted. We have GC grace of 15 minutes.

We are seeing 2 problems

1.) A certain number of Cassandra nodes goes down and then we remove it from 
the cluster using Cassandra removenode command and replace the dead nodes with 
new nodes. While new nodes are joining in, we see more nodes down (which are 
not actually down) but we see following errors in the log

“Gossip not settled after 321 polls. Gossip Stage active/pending/completed: 
1/816/0”



To fix the issue, I restarted the server and the nodes now appear to be up and 
the problem is solved



Can this problem be related to 
https://issues.apache.org/jira/browse/CASSANDRA-6590 ?



2.) Meanwhile, after restarting the nodes mentioned above, we see that some old 
deleted data is resurrected (because of short lifecycle of our data). My guess 
at the moment is that these data is resurrected due to hinted handoff. 
Interesting point to note here is that data keeps resurrecting at periodic 
intervals (like an hour) and then finally stops. Could this be caused by hinted 
handoff? if so is there any setting which we can set to specify that 
“invalidate, hinted handoff data after 5-10 minutes”.



Thanks,
Pratik


Re: Hinted Handoff

2018-08-06 Thread Kyrylo Lebediev
A small gc_grace_seconds value lowers the maximum allowed node downtime, which is 
15 minutes in your case. After 15 minutes of downtime you'll need to replace the 
node, as you described. This interval looks too short to be able to do planned 
maintenance. So, if you set a larger value for gc_grace_seconds (let's say hours, 
or a day), will you get visible read amplification, waste a lot of disk space, or 
have issues with compactions?


Hinted handoff may be the reason if the hinted handoff window is longer than 
gc_grace_seconds. To me it looks like the hinted handoff window 
(max_hint_window_in_ms in cassandra.yaml, which defaults to 3h) must always be 
set to a value less than gc_grace_seconds.


Regards,

Kyrill
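A hedged sketch (not part of the original message) of applying the advice above with the DataStax Python driver: raising gc_grace_seconds on a table so that it stays above the hint window. The contact point, keyspace, and table names are placeholders.

from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])   # placeholder contact point
session = cluster.connect()

# Raise gc_grace_seconds to one day (86400 s), comfortably above the
# default 3h (10800000 ms) max_hint_window_in_ms from cassandra.yaml.
session.execute("ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 86400")

cluster.shutdown()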


From: Agrawal, Pratik 
Sent: Monday, August 6, 2018 8:22:27 PM
To: user@cassandra.apache.org
Subject: Hinted Handoff


Hello all,

We use Cassandra in non-conventional way, where our data is short termed (life 
cycle of about 20-30 minutes) where each record is updated ~5 times and then 
deleted. We have GC grace of 15 minutes.

We are seeing 2 problems

1.) A certain number of Cassandra nodes goes down and then we remove it from 
the cluster using Cassandra removenode command and replace the dead nodes with 
new nodes. While new nodes are joining in, we see more nodes down (which are 
not actually down) but we see following errors in the log

“Gossip not settled after 321 polls. Gossip Stage active/pending/completed: 
1/816/0”



To fix the issue, I restarted the server and the nodes now appear to be up and 
the problem is solved



Can this problem be related to 
https://issues.apache.org/jira/browse/CASSANDRA-6590 ?



2.) Meanwhile, after restarting the nodes mentioned above, we see that some old 
deleted data is resurrected (because of short lifecycle of our data). My guess 
at the moment is that these data is resurrected due to hinted handoff. 
Interesting point to note here is that data keeps resurrecting at periodic 
intervals (like an hour) and then finally stops. Could this be caused by hinted 
handoff? if so is there any setting which we can set to specify that 
“invalidate, hinted handoff data after 5-10 minutes”.



Thanks,
Pratik


ETL options from Hive/Presto/s3 to cassandra

2018-08-06 Thread srimugunthan dhandapani
Hi all,
We have data that gets loaded into Hive/Presto every few hours.
We want that data to be transferred to Cassandra tables.
What are some high-performance ETL options for transferring data
from Hive or Presto into Cassandra?

Also, does anybody have any performance numbers comparing
- loading data from S3 to Cassandra using sstableloader
- loading data from S3 to Cassandra using other means (like the Spark API)?

Thanks,
mugunthan
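One possible approach, sketched here rather than taken from the thread: a Spark job using the DataStax spark-cassandra-connector can read the Hive-backed data (or Parquet on S3) and write it straight into a Cassandra table. The keyspace, table, contact points, source table, and connector version below are placeholders and need to match your environment.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-to-cassandra")
    # assumes the connector is on the classpath, e.g. submitted with
    # --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2
    .config("spark.cassandra.connection.host", "10.0.0.1,10.0.0.2")
    .enableHiveSupport()
    .getOrCreate()
)

# Source: a Hive table; for S3, spark.read.parquet("s3a://bucket/path") works the same way.
df = spark.table("events_hourly")

# Sink: an existing Cassandra table whose columns match the DataFrame schema.
(df.write
   .format("org.apache.spark.sql.cassandra")
   .options(keyspace="analytics", table="events")
   .mode("append")
   .save())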


Re: dynamic_snitch=false, prioritisation/order or reads from replicas

2018-08-06 Thread Alain RODRIGUEZ
Hello,


> There are reports (in this ML too) that disabling dynamic snitching
> decreases response time.


I confirm that I have seen this improvement on clusters under pressure.

What effects stand behind this improvement?
>

My understanding is that this is due to the fact that the clients are then
'pinned', sticking more to specific nodes when dynamic snitching is off. I
guess there is better use of caches and in-memory structures, reducing the
amount of disk reads needed, which can lead to much better performance than
switching from node to node as soon as the score of some node is not good
enough.
I am also not sure that the score calculation is always relevant, and
increasing the threshold before switching reads to another node is still
often worse than disabling it completely. I am not sure if the score
calculation was fixed, but in most cases I think it's safer to run with
'dynamic_snitch: false'. Anyway, it's possible to test it on a canary node
(or an entire rack) and look at the p99 read latencies, for example :).

This ticket is old, but was precisely on that topic:
https://issues.apache.org/jira/browse/CASSANDRA-6908
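To make the 'pinning' intuition above concrete, a toy simulation (not from the thread, and not Cassandra code): when reads for a key always go to the same replica, that replica's cache covers the hot key set far better than when reads rotate across all three replicas. All numbers are made up purely for illustration.

import random

random.seed(42)
requests = [random.randint(0, 9999) for _ in range(100000)]   # hot partition keys
CACHE_SIZE = 3000                                             # entries per replica "cache"

def hit_rate(choose_replica):
    caches = [dict() for _ in range(3)]        # insertion-ordered, FIFO-evicted caches
    hits = 0
    for key in requests:
        cache = caches[choose_replica(key)]
        if key in cache:
            hits += 1
        else:
            if len(cache) >= CACHE_SIZE:
                cache.pop(next(iter(cache)))   # evict the oldest entry
            cache[key] = True
    return hits / len(requests)

print("pinned  :", hit_rate(lambda k: k % 3))                 # same replica for a given key
print("rotating:", hit_rate(lambda k: random.randrange(3)))   # replica changes per request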

C*heers
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-08-04 15:37 GMT+02:00 Kyrylo Lebediev :

> Hello!
>
>
> In case when dynamic snitching is enabled data is read from 'the fastest
> replica' and other replicas send digests for CL=QUORUM/LOCAL_QUORUM .
>
> When dynamic snitching is disabled, as the concept of the fastest replica
> disappears, which rules are used to choose from which replica to read
> actual data (not digests):
>
>  1) when all replicas are online
>
>  2) when the node primarily responsible for the token range is offline.
>
>
> There are reports (in this ML too) that disabling dynamic snitching
> decreases response time.
>
> What effects stand behind this improvement?
>
>
> Regards,
>
> Kyrill
>


Hinted Handoff

2018-08-06 Thread Agrawal, Pratik
Hello all,

We use Cassandra in a non-conventional way: our data is short-lived (a life 
cycle of about 20-30 minutes), with each record updated ~5 times and then 
deleted. We have a GC grace of 15 minutes.

We are seeing 2 problems:

1.) A certain number of Cassandra nodes go down, and we then remove them from 
the cluster using the Cassandra removenode command and replace the dead nodes 
with new nodes. While the new nodes are joining, we see more nodes marked down 
(which are not actually down) and the following errors in the log:

“Gossip not settled after 321 polls. Gossip Stage active/pending/completed: 
1/816/0”



To fix the issue, I restarted the server, and the nodes now appear to be up and 
the problem is solved.



Can this problem be related to 
https://issues.apache.org/jira/browse/CASSANDRA-6590 ?



2.) Meanwhile, after restarting the nodes mentioned above, we see that some old 
deleted data is resurrected (because of the short life cycle of our data). My 
guess at the moment is that this data is resurrected due to hinted handoff. An 
interesting point to note is that data keeps resurrecting at periodic 
intervals (like an hour) and then finally stops. Could this be caused by hinted 
handoff? If so, is there any setting we can use to specify "invalidate hinted 
handoff data after 5-10 minutes"?



Thanks,
Pratik