[ https://issues.apache.org/jira/browse/CASSANDRA-11974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323462#comment-15323462 ]
Sean Thornton commented on CASSANDRA-11974:
---
Agree. I was trying to focus on a single issue here, but there are a number of
places in the code where critical threads can exit without proper handling. It
would be better, in my opinion, that these events are recognized and handled
(even if that means shutting the JVM down) vs. continuing to _appear_ to be up
and running normally (compaction thread, I'm looking at you). I think the use
of the Java assert keyword is possibly the root cause of this in a number of
places, because a failed assert raises a true Error. Most people don't think
to, or don't know how to, handle this appropriately (and really shouldn't). I
would much prefer to see something in the pattern of Spring's Assert or
commons-lang's Validate be used.
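
To make the assert point concrete, here is a minimal sketch, not taken from the Cassandra codebase. All names in it (AssertVsValidate, writeWithAssert, writeWithValidation, processAll, MAX_SHORT_LENGTH) are hypothetical, and the 0xFFFF bound is an assumption standing in for the 64K limit mentioned in the issue. The failed assert is simulated by throwing AssertionError directly so the demo does not depend on the JVM running with -ea.

```java
import java.util.function.IntConsumer;

final class AssertVsValidate {
    // Assumed bound for illustration: largest unsigned 16-bit length.
    static final int MAX_SHORT_LENGTH = 0xFFFF;

    // Behaves like a failed `assert length <= MAX_SHORT_LENGTH : length`,
    // which throws java.lang.AssertionError (an Error, not an Exception).
    static void writeWithAssert(int length) {
        if (length > MAX_SHORT_LENGTH)
            throw new AssertionError(length);
    }

    // Validate-style check: same condition, but raises a RuntimeException.
    static void writeWithValidation(int length) {
        if (length > MAX_SHORT_LENGTH)
            throw new IllegalArgumentException(
                "length " + length + " exceeds " + MAX_SHORT_LENGTH);
    }

    // A processing loop that, like the one described in the issue,
    // only guards against Exception. Returns how many items succeeded.
    static int processAll(int[] lengths, IntConsumer writer) {
        int processed = 0;
        for (int length : lengths) {
            try {
                writer.accept(length);
                processed++;
            } catch (Exception e) {
                // The bad item is skipped; the loop keeps running.
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        int[] lengths = {10, 635174, 20};
        // The oversized item is rejected and the other two still go through.
        System.out.println(processAll(lengths, AssertVsValidate::writeWithValidation));
        try {
            processAll(lengths, AssertVsValidate::writeWithAssert);
        } catch (AssertionError e) {
            // The Error escaped the loop entirely; in a thread's run() this ends the thread.
            System.out.println("loop exited: " + e);
        }
    }
}
```

With the validation variant, the oversized item is dropped and the loop continues. With the assert variant, the Error skips the catch block and takes the whole loop down with it.
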
I do think it's better for the community to provide concrete instances for the
developers to address one-by-one though. It's difficult to address more general
items without a larger effort and there are already a number of those.
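
On the more general point about critical threads exiting unnoticed, one possible shape for "recognized and handled" is an uncaught exception handler on the thread. This is only a sketch under my own assumptions: CriticalThreadWatch, startCritical, and onDeath are hypothetical names, and a real handler might log and shut the JVM down rather than just record the failure.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

final class CriticalThreadWatch {
    // Starts a named thread and routes any Throwable that kills it
    // (including Errors such as AssertionError) to the onDeath callback.
    static Thread startCritical(String name, Runnable body, Consumer<Throwable> onDeath) {
        Thread t = new Thread(body, name);
        t.setUncaughtExceptionHandler((thread, cause) -> onDeath.accept(cause));
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread t = startCritical(
            "outgoing-demo",
            () -> { throw new AssertionError(635174); },
            failure::set);
        t.join(); // the handler runs on the dying thread before join() returns
        System.out.println("thread died with: " + failure.get());
    }
}
```

The callback is where the policy decision would live: escalate, restart the thread, or stop the process, rather than carrying on as if nothing happened.
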
> Failed assert causes OutboundTcpConnection to exit
> --
>
> Key: CASSANDRA-11974
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11974
> Project: Cassandra
> Issue Type: Bug
> Components: Streaming and Messaging
> Reporter: Sean Thornton
>
> I am seeing the following in a client's cluster:
> {noformat}
> ERROR [MessagingService-Outgoing-/10.0.0.1] 2016-06-06 03:38:19,305 CassandraDaemon.java:229 - Exception in thread Thread[MessagingService-Outgoing-/10.0.0.1,5,main]
> java.lang.AssertionError: 635174
>     at org.apache.cassandra.utils.ByteBufferUtil.writeWithShortLength(ByteBufferUtil.java:290) ~[cassandra-all-2.1.12.1046.jar:2.1.12.1046]
>     at org.apache.cassandra.db.composites.AbstractCType$Serializer.serialize(AbstractCType.java:392) ~[cassandra-all-2.1.12.1046.jar:2.1.12.1046]
>     at org.apache.cassandra.db.composites.AbstractCType$Serializer.serialize(AbstractCType.java:381) ~[cassandra-all-2.1.12.1046.jar:2.1.12.1046]
>     at org.apache.cassandra.db.filter.ColumnSlice$Serializer.serialize(ColumnSlice.java:271) ~[cassandra-all-2.1.12.1046.jar:2.1.12.1046]
>     at org.apache.cassandra.db.filter.ColumnSlice$Serializer.serialize(ColumnSlice.java:259) ~[cassandra-all-2.1.12.1046.jar:2.1.12.1046]
>     at org.apache.cassandra.db.filter.SliceQueryFilter$Serializer.serialize(SliceQueryFilter.java:503) ~[cassandra-all-2.1.12.1046.jar:2.1.12.1046]
>     at org.apache.cassandra.db.filter.SliceQueryFilter$Serializer.serialize(SliceQueryFilter.java:490) ~[cassandra-all-2.1.12.1046.jar:2.1.12.1046]
>     at org.apache.cassandra.db.SliceFromReadCommandSerializer.serialize(SliceFromReadCommand.java:168) ~[cassandra-all-2.1.12.1046.jar:2.1.12.1046]
>     at org.apache.cassandra.db.ReadCommandSerializer.serialize(ReadCommand.java:143) ~[cassandra-all-2.1.12.1046.jar:2.1.12.1046]
>     at org.apache.cassandra.db.ReadCommandSerializer.serialize(ReadCommand.java:132) ~[cassandra-all-2.1.12.1046.jar:2.1.12.1046]
>     at org.apache.cassandra.net.MessageOut.serialize(MessageOut.java:121) ~[cassandra-all-2.1.12.1046.jar:2.1.12.1046]
>     at org.apache.cassandra.net.OutboundTcpConnection.writeInternal(OutboundTcpConnection.java:330) ~[cassandra-all-2.1.12.1046.jar:2.1.12.1046]
>     at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:282) ~[cassandra-all-2.1.12.1046.jar:2.1.12.1046]
>     at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:218) ~[cassandra-all-2.1.12.1046.jar:2.1.12.1046]
> {noformat}
> Obviously they somehow exceeded a 64K limit (quick and dirty suspects -
> https://docs.datastax.com/en/cql/3.1/cql/cql_reference/refLimits.html) but
> that is neither here nor there.
> The problem I see when this happens is
> {{ByteBufferUtil.writeWithShortLength}} can throw a
> {{java.lang.AssertionError}} which is a true {{Error}} that bubbles up and
> totally bypasses the {{catch (Exception e)}} clause in the message processing
> loop in {{OutboundTcpConnection.run()}} _which causes the thread to exit and
> that node to no longer communicate outgoing messages to other nodes_.
> At least from my perspective, there are two things I would like to see
> handled differently -
> * In the event of _any_ problem, I would like to see whatever details are
> available logged about the problem Message - partition key, CF data,
> anything. Right now it can be very difficult to track this down.
> * The {{java.lang.Error}} possibility needs to be handled somehow. If it's
> an assertion error, it