[jira] [Commented] (CASSANDRA-15041) UncheckedExecutionException if authentication/authorization query fails

2019-06-10 Thread JIRA


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860306#comment-16860306
 ] 

Per Otterström commented on CASSANDRA-15041:


Had to make some adjustments while implementing this.

When we fail to perform authorization it is not always possible to convert 
whatever-exception-we-get into an {{UnavailableException}}, since the 
{{UnavailableException}} constructor requires a bunch of parameters (the 
consistency level plus the required and live node counts). I didn't feel 
comfortable changing this to achieve our goals here, so I went with the other 
proposal to convert the failure into an {{UnauthorizedException}} instead. But 
I'm happy to discuss options. Worth considering: since {{IAuthorizer}} is a 
public plug-in interface, it should define generic behavior. And, for example, 
it would be somewhat awkward for an {{LDAPAuthorizer}} to throw an 
{{UnavailableException}} if it fails to contact the LDAP server, so the 
{{UnauthorizedException}} may be a better fit anyway.
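As a rough illustration (not the actual patch), here is a minimal sketch of the 
kind of translation a hypothetical LDAP-backed plug-in could do so that a 
backend failure surfaces as an {{UnauthorizedException}} rather than an opaque 
{{RuntimeException}}. The {{LdapPermissionLookup}} class and its {{LdapClient}} 
are invented for the example; only {{UnauthorizedException}}, 
{{AuthenticatedUser}}, {{IResource}} and {{Permission}} are Cassandra types.

{code:java}
import java.util.Set;

import org.apache.cassandra.auth.AuthenticatedUser;
import org.apache.cassandra.auth.IResource;
import org.apache.cassandra.auth.Permission;
import org.apache.cassandra.exceptions.UnauthorizedException;

public class LdapPermissionLookup
{
    // Hypothetical backend client; not part of Cassandra.
    public interface LdapClient
    {
        Set<Permission> lookupPermissions(String role, String resource) throws Exception;
    }

    private final LdapClient ldap;

    public LdapPermissionLookup(LdapClient ldap)
    {
        this.ldap = ldap;
    }

    // Same shape as IAuthorizer.authorize(user, resource): translate any
    // backend failure into a protocol-level UnauthorizedException instead of
    // letting a raw RuntimeException bubble up through the auth cache.
    public Set<Permission> authorize(AuthenticatedUser user, IResource resource)
    {
        try
        {
            return ldap.lookupPermissions(user.getName(), resource.getName());
        }
        catch (Exception e)
        {
            // The original exception could be logged here before translating it.
            throw new UnauthorizedException("Unable to authorize " + user.getName() +
                                            " on " + resource.getName() +
                                            ": authorization backend unavailable");
        }
    }
}
{code}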

A side effect of signaling {{UnauthorizedException}} instead of 
{{UnavailableException}} is that the issue with stale entries from the Caffeine 
cache doesn't show any more. This is because the driver will not retry on 
{{UnauthorizedException}}, and the Caffeine issue only shows up when there are 
repeated queries for failing keys. But IMO we should still address it; I 
created CASSANDRA-15153 for that.

Also, I had a setback with one of the goals of this ticket: making the 
background cache reload thread silent when it fails. It turns out the error 
message is logged deep down in the Guava {{LoadingCache}}. The only option I 
see for pre-4.0 is to mute it in the logback config.
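For reference, a hypothetical fragment for {{conf/logback.xml}} along those 
lines. The logger name here is an assumption - it should be whatever logger the 
warning actually shows up under in the log output:

{code:xml}
<!-- Hypothetical: silence the cache-refresh warning described above.
     Adjust the logger name to match what actually appears in the logs. -->
<logger name="com.google.common.cache.LocalCache" level="OFF"/>
{code}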

PR for [dtest|https://github.com/apache/cassandra-dtest/pull/52].

The patches for Cassandra differ a bit between 2.2/3.0, 3.11 and trunk. I'm not 
sure what the best way is to provide these patches to simplify review and 
merging into the upstream repo. Below are links to the individual branches on 
my GitHub clone without merge commits - is that OK? Lots of dtests are timing 
out since I only have the free service, but I will try to run failing tests 
locally.

||Patch||CI||
|[15041-cassandra-2.2|https://github.com/eperott/cassandra/tree/15041-cassandra-2.2]|[CircleCI|https://circleci.com/gh/eperott/workflows/cassandra/tree/cci%2F15041-cassandra-2.2]|
|[15041-cassandra-3.0|https://github.com/eperott/cassandra/tree/15041-cassandra-3.0]|[CircleCI|https://circleci.com/gh/eperott/workflows/cassandra/tree/cci%2F15041-cassandra-3.0]|
|[15041-cassandra-3.11|https://github.com/eperott/cassandra/tree/15041-cassandra-3.11]|[CircleCI|https://circleci.com/gh/eperott/workflows/cassandra/tree/cci%2F15041-cassandra-3.11]|
|[15041-trunk|https://github.com/eperott/cassandra/tree/15041-trunk]|[CircleCI|https://circleci.com/gh/eperott/workflows/cassandra/tree/cci%2F15041-trunk]|

> UncheckedExecutionException if authentication/authorization query fails
> ---
>
> Key: CASSANDRA-15041
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15041
> Project: Cassandra
>  Issue Type: Bug
>  Components: Feature/Authorization
>Reporter: Per Otterström
>Assignee: Per Otterström
>Priority: Normal
> Fix For: 2.2.15, 3.0.19, 3.11.5, 4.0
>
>
> If a cache update for permissions/credentials/roles fails with 
> UnavailableException, this comes back to the client as UncheckedExecutionException.
> Stack trace on server side:
> {noformat}
> ERROR [Native-Transport-Requests-1] 2019-03-04 16:30:51,537 
> ErrorMessage.java:384 - Unexpected exception during request
> com.google.common.util.concurrent.UncheckedExecutionException: 
> com.google.common.util.concurrent.UncheckedExecutionException: 
> java.lang.RuntimeException: 
> org.apache.cassandra.exceptions.UnavailableException: Cannot achieve 
> consistency level QUORUM
> at 
> com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) 
> ~[guava-18.0.jar:na]
> at com.google.common.cache.LocalCache.get(LocalCache.java:3937) 
> ~[guava-18.0.jar:na]
> at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) 
> ~[guava-18.0.jar:na]
> at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824)
>  ~[guava-18.0.jar:na]
> at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:97) 
> ~[apache-cassandra-3.11.4.jar:3.11.4]
> at 
> org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45)
>  ~[apache-cassandra-3.11.4.jar:3.11.4]
> at 
> org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104)
>  ~[apache-cassandra-3.11.4.jar:3.11.4]
> at 
> org.apache.cassandra.service.ClientState.authorize(ClientState.java:439) 
> ~[apache-cassandra-3.11.4.jar:3.11.4]
> at 
> 

[jira] [Commented] (CASSANDRA-15066) Improvements to Internode Messaging

2019-06-10 Thread Aleksey Yeschenko (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860129#comment-16860129
 ] 

Aleksey Yeschenko commented on CASSANDRA-15066:
---

Agreed on readiness to commit in its current state. To complete the list 
(non-exhaustively), below are some notable changes on my part.

To start with, the largest change has been the redesign of large message 
handling, suggested by [~xedin] during review. Whereas previously we'd have a 
companion thread deserializing the large message as new frames kept coming, we 
now accumulate all the frames needed to deserialize the large message and then 
schedule a task directly on that message's verb's {{Stage}} - a task that 
deserializes the message and executes the verb handler in one go. This not only 
simplifies the logic in {{InboundMessageHandler}}, but also increases locality 
and reduces the lifetime of large messages on the heap.
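Purely as a schematic of that accumulate-then-process flow (every name here is 
invented, and the real {{InboundMessageHandler}}/{{FrameDecoder}} machinery is 
considerably more involved), using plain {{java.util.concurrent}}:

{code:java}
import java.io.ByteArrayOutputStream;
import java.util.concurrent.ExecutorService;
import java.util.function.Consumer;

// Schematic only: frames of one large message are accumulated as they arrive;
// once the message is complete, a single task is scheduled that deserializes
// it and runs the verb handler in one go, rather than deserializing on a
// companion thread while frames are still arriving.
final class LargeMessageAccumulator
{
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final int expectedSize;
    private final ExecutorService verbStage;              // stand-in for the verb's Stage
    private final Consumer<byte[]> deserializeAndHandle;  // stand-in for "deserialize + run handler"

    LargeMessageAccumulator(int expectedSize, ExecutorService verbStage, Consumer<byte[]> deserializeAndHandle)
    {
        this.expectedSize = expectedSize;
        this.verbStage = verbStage;
        this.deserializeAndHandle = deserializeAndHandle;
    }

    // Called on the decoder/event-loop thread for every incoming frame.
    void onFrame(byte[] frame)
    {
        buffer.write(frame, 0, frame.length);
        if (buffer.size() >= expectedSize)
        {
            byte[] whole = buffer.toByteArray();
            // One task does both deserialization and handler execution,
            // which keeps the work local and shortens the message's time on heap.
            verbStage.execute(() -> deserializeAndHandle.accept(whole));
        }
    }
}
{code}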

Other changes include: 
- Fixed a bug with double-release of permits on deserialization exceptions in 
{{InboundMessageHandler}}
- Fixed a failure to signal a {{WaitQueue}} when releasing permits back after a 
partial allocation failure
- Fixed {{FrameDecoder}} not propagating {{channelClose()}} to 
{{InboundMessageHandler}}
- Fixed several legacy handshake issues
- Fixed legacy LZ4 frame encoder and decoder performance (broken Netty xxhash 
behaviour)
- Fixed mutation forwarding to remote DCs mistakenly including the picked 
forwarder node itself (spotted by [~jmeredithco])
- Started immediately expiring callbacks for all forwarded mutation 
destinations when failing to send to the forwarder
- Introduced inbound backpressure counters (throttled count and nanos)
- Started treating all deserialization exceptions as non-fatal, to prevent 
unnecessary message loss and reconnects
- Factored out header fields from {{Message}} into a standalone {{Header}} 
class to prevent double-deserialization of some fields and to clean up callback 
signatures
- Introduced max message size config param, akin to max mutation size - set to 
endpoint reserve capacity by default
- Introduced an MPSC linked queue with volatile offer semantics and 
non-blocking {{poll()}} and {{drain()}}, and used it to fix visibility issues 
or blocking behaviour in {{OutboundMessageQueue}}, 
{{InboundMessageHandler.WaitQueue}}, and Netty's event loops; then used it to 
minimise the amount of signalling done when {{InboundMessageHandler}} gets 
registered on the wait queue (a sketch of the general technique follows this 
list)
- Refactored callbacks and the callback map ({{RequestCallbacks}}) to allow 
reusing the same request ID for multiple messages, and got rid of an extra 
object per entry
- Building on the refactoring above, reduced and mostly eliminated allocation 
of extra {{Message}} objects, saving on {{serializedSize}} invocations and some 
garbage
- Reworked integration between {{InboundMessageHandler}} and {{FrameDecoder}} 
for clarity and performance
- Fixed {{FrameDecoder}} over-issuing {{channel.read()}} calls in some 
circumstances
- Refactored {{InboundMessageHandler}} frame handling and callbacks
- Pushed processing exception handling to the callbacks/message sink
- Added a lot of comments/documentation and tests, made various logging 
improvements, and improved thread names
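As referenced in the MPSC queue item above, here is a minimal sketch of the 
general multi-producer/single-consumer linked-queue technique (the classic 
exchange-the-tail design). It is illustrative only and is not the queue class 
introduced by the patch:

{code:java}
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Sketch of an MPSC linked queue: lock-free offer for many producers,
// non-blocking poll()/drain() for the single consumer.
final class MpscLinkedQueue<E>
{
    private static final class Node<E>
    {
        volatile Node<E> next;
        E value;
        Node(E value) { this.value = value; }
    }

    // Producers atomically swap the tail; only the consumer advances head.
    private final AtomicReference<Node<E>> tail;
    private Node<E> head; // accessed by the single consumer only

    MpscLinkedQueue()
    {
        Node<E> stub = new Node<>(null);
        tail = new AtomicReference<>(stub);
        head = stub;
    }

    // Multi-producer offer: exchange the tail, then publish the link with a
    // volatile write so the consumer can observe the new node.
    public void offer(E e)
    {
        Node<E> node = new Node<>(e);
        Node<E> prev = tail.getAndSet(node);
        prev.next = node;
    }

    // Single-consumer, non-blocking poll: returns null if nothing is visible
    // yet (including the brief window before a producer publishes its link).
    public E poll()
    {
        Node<E> next = head.next;
        if (next == null)
            return null;
        E value = next.value;
        next.value = null; // help GC
        head = next;
        return value;
    }

    // Drain everything currently visible to the consumer.
    public void drain(Consumer<E> consumer)
    {
        for (E e; (e = poll()) != null; )
            consumer.accept(e);
    }
}
{code}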

Also, some changes were made by [~ifesdjeen] directly, in addition to his many 
helpful review corrections:
- Introduced an in-JVM proxy to test expirations and closure, and added tests 
for inbound expirations
- Fixed a bug in the outbound virtual table (overflow_count/overflow_bytes had 
swapped values), and the same bug in outbound metrics
- Introduced {{UnknownColumnsException}} in more places instead of 
{{RuntimeException}}
- Fixed {{Message.Builder.builder(Message)}} to copy over the original flags

> Improvements to Internode Messaging
> ---
>
> Key: CASSANDRA-15066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Internode
>Reporter: Benedict
>Assignee: Benedict
>Priority: High
> Fix For: 4.0
>
> Attachments: 20k_backfill.png, 60k_RPS.png, 
> 60k_RPS_CPU_bottleneck.png, backfill_cass_perf_ft_msg_tst.svg, 
> baseline_patch_vs_30x.png, increasing_reads_latency.png, 
> many_reads_cass_perf_ft_msg_tst.svg
>
>
> CASSANDRA-8457 introduced asynchronous networking to internode messaging, but 
> there have been several follow-up endeavours to improve some semantic issues. 
>  CASSANDRA-14503 and CASSANDRA-13630 are the latest such efforts, and were 
> combined some months ago into a single overarching refactor of the original 
> work, to address some of the issues that have been discovered.  Given the 
> criticality of this work to the project, we wanted to bring some more eyes to 
> bear to ensure the release goes ahead smoothly.  In doing so, we 

[jira] [Comment Edited] (CASSANDRA-15066) Improvements to Internode Messaging

2019-06-10 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860113#comment-16860113
 ] 

Benedict edited comment on CASSANDRA-15066 at 6/10/19 4:13 PM:
---

The patch is now ready to commit from my point of view, with many small fixes 
and clarity improvements.  On my part, the main work since our last discussion 
has been to introduce {{ConnectionBurnTest}} and its corresponding 
{{Verifier}}, that together exercise and verify a wide variety of connection 
behaviours.  This isn’t completely exhaustive, but it is _close_, and helped 
shake out many of the bugs we have fixed.  [~ifesdjeen] has also been offering 
excellent review feedback and bug reports, that have been incorporated into 
both the stylistic improvements (particularly the {{OutboundConnection}} state 
machine) and fixes below.

h3. Improvements
* {{OutboundConnection}}: introduce simple state machine for connection status
* Hooks for verification, including {{OutboundMessageCallbacks}} and 
{{OutboundDebugCallbacks}}
* Log canonical and actual addresses on connection
* Simplify/clarify {{OutboundConnectionSettings}} and 
{{OutboundConnection.template}} semantics
* Abstract the {{MonotonicClock}} source we use, so that users can opt to pay 
the cost of {{System.nanoTime}} and avoid the trade-offs inherent in using a 
clock with {{~2ms + scheduler error}} granularity if they prefer (see the 
sketch below).
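To illustrate the trade-off in the last item - this is a toy sketch, not 
Cassandra's actual {{MonotonicClock}} API - a precise source pays for 
{{System.nanoTime()}} on every call, while an approximate source is a cheap 
read of a cached value that is only as accurate as its refresh interval plus 
whatever error the scheduler introduces:

{code:java}
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the precise-vs-approximate monotonic time trade-off.
interface MonotonicTimeSource
{
    long nanos();

    // Precise: pays the cost of System.nanoTime() on every call.
    MonotonicTimeSource PRECISE = System::nanoTime;

    // Approximate: a cheap volatile read, accurate only to roughly the
    // refresh interval plus scheduler error.
    static MonotonicTimeSource approximate(ScheduledExecutorService scheduler, long refreshMillis)
    {
        AtomicLong cached = new AtomicLong(System.nanoTime());
        scheduler.scheduleAtFixedRate(() -> cached.set(System.nanoTime()),
                                      refreshMillis, refreshMillis, TimeUnit.MILLISECONDS);
        return cached::get;
    }
}
{code}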

h3. Fixes
* Event Loop delivery schedules self for later execution if work remains after 
completing a batch, instead of spinning until link saturated
* Large message delivery did not handle exception recovery correctly
* Could have attempted to continue delivery after invalidating connection
* Could have failed to re-schedule after invalidation
* Stop using autoRead to prevent queued reads
* Accurately handle error bounds when using approximate time
* Ensure impossible to overflow stack in OMQ.lock by moving to iterative vs 
recursive evaluation
* Data race between large and event loop thread when closing connection
* Do not immediately propagate connection close in decoder; defer until all 
received messages have been processed
* {{FrameDecoderLegacy}} should not {{discard}} itself on encountering a 
corrupt frame; waits for IMH to do so
* {{OutboundConnectionInitiator}} cancellation would leak connection if 
cancelled after established but before handshake completed
* LARGE_MESSAGE_THRESHOLD includes frame overheads

h3. Unrelated Fixes
* CASSANDRA-10726 introduced an unnecessary digest calculation for CL {{ONE}} 
reads



was (Author: benedict):
The patch is now ready to commit from my point of view, with many small fixes 
and clarity improvements.  On my part, the main work since our last discussion 
has been to introduce {{ConnectionBurnTest}} and its corresponding 
{{Verifier}}, that together exercise and verify a wide variety of connection 
behaviours.  This isn’t completely exhaustive, but it is _close_, and helped 
shake out many of the bugs we have fixed.  [~ifesdjeen] has also been offering 
excellent review feedback and bug reports, that have been incorporated into 
both the stylistic improvements (particularly the OutboundConnection state 
machine) and fixes below.

h3. Improvements
* {{OutboundConnection}}: introduce simple state machine for connection status
* Hooks for verification, including {{OutboundMessageCallbacks}} and 
{{OutboundDebugCallbacks}}
* Log canonical and actual addresses on connection
* Simplify/clarify {{OutboundConnectionSettings}} and 
{{OutboundConnection.template}} semantics
* Abstract {{MonotonicClock}} source we use, so that users can opt to pay the 
cost of {{System.nanoTime}} and avoid the trade-offs inherent in using a clock 
with {{~2ms+scheduler error}} granularity if they prefer.

h3. Fixes
* Event Loop delivery schedules self for later execution if work remains after 
completing a batch, instead of spinning until link saturated
* Large message delivery did not handle exception recovery correctly
* Could have attempted to continue delivery after invalidating connection
* Could have failed to re-schedule after invalidation
* Stop using autoRead to prevent queued reads
* Accurately handle error bounds when using approximate time
* Ensure impossible to overflow stack in OMQ.lock by moving to iterative vs 
recursive evaluation
* Data race between large and event loop thread when closing connection
* Do not immediately propagate connection close in decoder; defer until all 
received messages have been processed
* {{FrameDecoderLegacy}} should not {{discard}} itself on encountering a 
corrupt frame; waits for IMH to do so
* {{OutboundConnectionInitiator}} cancellation would leak connection if 
cancelled after established but before handshake completed
* LARGE_MESSAGE_THRESHOLD includes frame overheads

h3. Unrelated Fixes
* CASSANDRA-10726 introduced an unnecessary digest calculation for CL {{ONE}} 
reads

[jira] [Commented] (CASSANDRA-15066) Improvements to Internode Messaging

2019-06-10 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860113#comment-16860113
 ] 

Benedict commented on CASSANDRA-15066:
--

The patch is now ready to commit from my point of view, with many small fixes 
and clarity improvements.  On my part, the main work since our last discussion 
has been to introduce {{ConnectionBurnTest}} and its corresponding 
{{Verifier}}, that together exercise and verify a wide variety of connection 
behaviours.  This isn’t completely exhaustive, but it is _close_, and helped 
shake out many of the bugs we have fixed.  [~ifesdjeen] has also been offering 
excellent review feedback and bug reports, that have been incorporated into 
both the stylistic improvements (particularly the OutboundConnection state 
machine) and fixes below.

h3. Improvements
* {{OutboundConnection}}: introduce simple state machine for connection status
* Hooks for verification, including {{OutboundMessageCallbacks}} and 
{{OutboundDebugCallbacks}}
* Log canonical and actual addresses on connection
* Simplify/clarify {{OutboundConnectionSettings}} and 
{{OutboundConnection.template}} semantics
* Abstract {{MonotonicClock}} source we use, so that users can opt to pay the 
cost of {{System.nanoTime}} and avoid the trade-offs inherent in using a clock 
with {{~2ms+scheduler error}} granularity if they prefer.

h3. Fixes
* Event Loop delivery schedules self for later execution if work remains after 
completing a batch, instead of spinning until link saturated
* Large message delivery did not handle exception recovery correctly
* Could have attempted to continue delivery after invalidating connection
* Could have failed to re-schedule after invalidation
* Stop using autoRead to prevent queued reads
* Accurately handle error bounds when using approximate time
* Ensure impossible to overflow stack in OMQ.lock by moving to iterative vs 
recursive evaluation
* Data race between large and event loop thread when closing connection
* Do not immediately propagate connection close in decoder; defer until all 
received messages have been processed
* {{FrameDecoderLegacy}} should not {{discard}} itself on encountering a 
corrupt frame; waits for IMH to do so
* {{OutboundConnectionInitiator}} cancellation would leak connection if 
cancelled after established but before handshake completed
* LARGE_MESSAGE_THRESHOLD includes frame overheads

h3. Unrelated Fixes
* CASSANDRA-10726 introduced an unnecessary digest calculation for CL {{ONE}} 
reads


> Improvements to Internode Messaging
> ---
>
> Key: CASSANDRA-15066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Internode
>Reporter: Benedict
>Assignee: Benedict
>Priority: High
> Fix For: 4.0
>
> Attachments: 20k_backfill.png, 60k_RPS.png, 
> 60k_RPS_CPU_bottleneck.png, backfill_cass_perf_ft_msg_tst.svg, 
> baseline_patch_vs_30x.png, increasing_reads_latency.png, 
> many_reads_cass_perf_ft_msg_tst.svg
>
>
> CASSANDRA-8457 introduced asynchronous networking to internode messaging, but 
> there have been several follow-up endeavours to improve some semantic issues. 
>  CASSANDRA-14503 and CASSANDRA-13630 are the latest such efforts, and were 
> combined some months ago into a single overarching refactor of the original 
> work, to address some of the issues that have been discovered.  Given the 
> criticality of this work to the project, we wanted to bring some more eyes to 
> bear to ensure the release goes ahead smoothly.  In doing so, we uncovered a 
> number of issues with messaging, some of which long standing, that we felt 
> needed to be addressed.  This patch widens the scope of CASSANDRA-14503 and 
> CASSANDRA-13630 in an effort to close the book on the messaging service, at 
> least for the foreseeable future.
> The patch includes a number of clarifying refactors that touch outside of the 
> {{net.async}} package, and a number of semantic changes to the {{net.async}} 
> packages itself.  We believe it clarifies the intent and behaviour of the 
> code while improving system stability, which we will outline in comments 
> below.
> https://github.com/belliottsmith/cassandra/tree/messaging-improvements






[jira] [Updated] (CASSANDRA-15150) Update the contact us/community page to point to Slack rather than IRC

2019-06-10 Thread Jeremy Hanna (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Hanna updated CASSANDRA-15150:
-
Status: Open  (was: Resolved)

Found a few more instances - will clean them up and submit an updated patch.

> Update the contact us/community page to point to Slack rather than IRC
> --
>
> Key: CASSANDRA-15150
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15150
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Documentation/Website
>Reporter: Jeremy Hanna
>Assignee: Jeremy Hanna
>Priority: Low
> Fix For: 4.0
>
> Attachments: CASSANDRA-15150.txt
>
>
> Update the contact us/community page to point to ASF Slack rather than IRC.  
> We can remove cassandra-builds.






[jira] [Created] (CASSANDRA-15153) Caffeine cache returns stale entries

2019-06-10 Thread JIRA
Per Otterström created CASSANDRA-15153:
--

 Summary: Caffeine cache returns stale entries
 Key: CASSANDRA-15153
 URL: https://issues.apache.org/jira/browse/CASSANDRA-15153
 Project: Cassandra
  Issue Type: Bug
Reporter: Per Otterström


Version 2.3.5 of the Caffeine cache, which we're using in various places, can 
hand out stale entries in some cases. This seems to happen when an update fails 
repeatedly, in which case Caffeine may return a previously loaded value. For 
instance, the AuthCache may hand out permissions even though the reload 
operation is failing; see CASSANDRA-15041.
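A minimal standalone sketch of that behaviour using Caffeine's 
{{refreshAfterWrite}} (illustrative only; whether this configuration matches 
the AuthCache setup exactly is an assumption, and all names and timings are 
invented): once the loader starts failing, {{get()}} keeps returning the 
previously loaded value instead of surfacing the failure.

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

public class StaleEntrySketch
{
    public static void main(String[] args) throws Exception
    {
        AtomicInteger loads = new AtomicInteger();

        LoadingCache<String, Integer> cache = Caffeine.newBuilder()
                .refreshAfterWrite(10, TimeUnit.MILLISECONDS)
                .build(key ->
                {
                    if (loads.incrementAndGet() > 1)
                        throw new RuntimeException("simulated backend failure");
                    return 42; // only the first load succeeds
                });

        System.out.println(cache.get("role")); // 42, loaded successfully

        // After the refresh interval, every get() schedules a reload that
        // fails, yet the previously loaded value keeps being returned.
        for (int i = 0; i < 3; i++)
        {
            Thread.sleep(50);
            System.out.println(cache.get("role")); // still 42
        }
    }
}
{code}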






[jira] [Commented] (CASSANDRA-15152) Batch Log - Mutation too large while bootstrapping a newly added node

2019-06-10 Thread Avraham Kalvo (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859724#comment-16859724
 ] 

Avraham Kalvo commented on CASSANDRA-15152:
---

Switching the log level to trace has disclosed the following, just before the 
error we're getting:
`TRACE [BatchlogTasks:1] 2019-06-10 05:45:40,251 BatchlogManager.java:309 - 
Replaying batch 5694cca0-8834-11e9-b262-b3ace0831935`

How should one query the `system.batches` table to see the actual list of 
mutations (blob to text? casting?)?
Would this table disclose the exact keyspace.table the mutations are related 
to?
Thanks.



> Batch Log - Mutation too large while bootstrapping a newly added node
> -
>
> Key: CASSANDRA-15152
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15152
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Batch Log
>Reporter: Avraham Kalvo
>Priority: Normal
>
> Scaling our six-node cluster by three more nodes, we came upon behavior in 
> which bootstrap appears hung in `UJ` state (the two previously added nodes had 
> joined within approximately 2.5 hours).
> Examining the logs, the following became apparent shortly after the bootstrap 
> process had commenced for this node:
> ```
> ERROR [BatchlogTasks:1] 2019-06-05 14:43:46,508 CassandraDaemon.java:207 - 
> Exception in thread Thread[BatchlogTasks:1,5,main]
> java.lang.IllegalArgumentException: Mutation of 108035175 bytes is too large 
> for the maximum size of 16777216
> at 
> org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:256) 
> ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:520) 
> ~[apache-cassandra-3.0.10.jar:3.0.10]
> at 
> org.apache.cassandra.db.Keyspace.applyNotDeferrable(Keyspace.java:399) 
> ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.db.Mutation.apply(Mutation.java:213) 
> ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.db.Mutation.apply(Mutation.java:227) 
> ~[apache-cassandra-3.0.10.jar:3.0.10]
> at 
> org.apache.cassandra.batchlog.BatchlogManager$ReplayingBatch.sendSingleReplayMutation(BatchlogManager.java:427)
>  ~[apache-cassandra-3.0.10.jar:3.0.10]
> at 
> org.apache.cassandra.batchlog.BatchlogManager$ReplayingBatch.sendReplays(BatchlogManager.java:402)
>  ~[apache-cassandra-3.0.10.jar:3.0.10]
> at 
> org.apache.cassandra.batchlog.BatchlogManager$ReplayingBatch.replay(BatchlogManager.java:318)
>  ~[apache-cassandra-3.0.10.jar:3.0.10]
> at 
> org.apache.cassandra.batchlog.BatchlogManager.processBatchlogEntries(BatchlogManager.java:238)
>  ~[apache-cassandra-3.0.10.jar:3.0.10]
> at 
> org.apache.cassandra.batchlog.BatchlogManager.replayFailedBatches(BatchlogManager.java:207)
>  ~[apache-cassandra-3.0.10.jar:3.0.10]
> at 
> org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118)
>  ~[apache-cassandra-3.0.10.jar:3.0.10]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [na:1.8.0_201]
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) 
> [na:1.8.0_201]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>  [na:1.8.0_201]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>  [na:1.8.0_201]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [na:1.8.0_201]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [na:1.8.0_201]
> at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
> ```
> It has been repeating itself in the logs since then.
> We decided to discard the newly added, apparently still joining, node by doing 
> the following:
> 1. at first - simply restarting it, which resulted in it starting up 
> apparently normally
> 2. then - decommissioning it by issuing `nodetool decommission`; this took a 
> long time (over 2.5 hours) and was eventually terminated by issuing `nodetool 
> removenode`
> 3. node removal was hung on a specific token, which led us to complete it by 
> force
> 4. forcing the node removal generated a corruption in one of the 
> `system.batches` SSTables, which was removed (backed up) from its underlying 
> data dir as mitigation (78MB worth)
> 5. a cluster-wide repair was run
> 6. the `Mutation too large` error is now repeating itself in three different 
> permutations (alerted sizes) on three different nodes (our standard 
> replication factor is three)
> We're not sure whether we're hitting 
>