Cassandra DC2 nodes down after increasing write requests on DC1 nodes

2014-11-16 Thread Gabriel Menegatti
Hello,

We are using Cassandra 2.1.2 in a multi-DC cluster (30 servers on DC1 and
10 on DC2), with a keyspace replication factor of 1 on DC1 and 2 on DC2.

For some reason, when we increase the volume of write requests on DC1 (using
ONE or LOCAL_ONE), the Cassandra Java process on the DC2 nodes goes down
randomly.
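
Just to make the setup concrete, below is a minimal sketch of what the keyspace
definition and a typical write look like. It uses the DataStax Java driver purely
for illustration; the keyspace, table, and host names are placeholders, not our
real schema or client code.

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class WriteSketch {
    public static void main(String[] args) {
        // Placeholder contact point; in practice the clients talk to DC1 nodes.
        Cluster cluster = Cluster.builder().addContactPoint("dc1-node-1").build();
        Session session = cluster.connect();

        // Keyspace topology as described above: RF 1 in DC1, RF 2 in DC2.
        session.execute("CREATE KEYSPACE IF NOT EXISTS example_ks WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'DC1': 1, 'DC2': 2}");
        session.execute("CREATE TABLE IF NOT EXISTS example_ks.events "
                + "(id text PRIMARY KEY, payload text)");

        // The client only waits for one local (DC1) replica to acknowledge...
        PreparedStatement insert =
                session.prepare("INSERT INTO example_ks.events (id, payload) VALUES (?, ?)");
        BoundStatement bound = insert.bind("some-id", "some-payload");
        bound.setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
        session.execute(bound);
        // ...but the coordinator still sends two replica mutations to DC2 in the
        // background, so DC2 receives the full write volume asynchronously.

        cluster.close();
    }
}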

At the time the DC2 nodes start to go down, the load average on the DC1 nodes
is around 3-5 and on the DC2 nodes around 7-10, so it doesn't seem like a big deal.

*Taking a look at Cassandra's system.log, we found some exceptions:*

ERROR [SharedPool-Worker-43] 2014-11-15 00:39:48,596 JVMStabilityInspector.java:94 - JVM state determined to be unstable. Exiting forcefully due to:
java.lang.OutOfMemoryError: Java heap space
ERROR [CompactionExecutor:8] 2014-11-15 00:39:48,596 CassandraDaemon.java:153 - Exception in thread Thread[CompactionExecutor:8,1,main]
java.lang.OutOfMemoryError: Java heap space
ERROR [Thrift-Selector_2] 2014-11-15 00:39:48,596 Message.java:238 - Got an IOException during write!
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_25]
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_25]
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_25]
        at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_25]
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470) ~[na:1.8.0_25]
        at org.apache.thrift.transport.TNonblockingSocket.write(TNonblockingSocket.java:164) ~[libthrift-0.9.1.jar:0.9.1]
        at com.thinkaurelius.thrift.util.mem.Buffer.writeTo(Buffer.java:104) ~[thrift-server-0.3.7.jar:na]
        at com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.streamTo(FastMemoryOutputTransport.java:112) ~[thrift-server-0.3.7.jar:na]
        at com.thinkaurelius.thrift.Message.write(Message.java:222) ~[thrift-server-0.3.7.jar:na]
        at com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.handleWrite(TDisruptorServer.java:598) [thrift-server-0.3.7.jar:na]
        at com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.processKey(TDisruptorServer.java:569) [thrift-server-0.3.7.jar:na]
        at com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.select(TDisruptorServer.java:423) [thrift-server-0.3.7.jar:na]
        at com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.run(TDisruptorServer.java:383) [thrift-server-0.3.7.jar:na]
ERROR [Thread-94] 2014-11-15 00:39:48,597 CassandraDaemon.java:153 - Exception in thread Thread[Thread-94,5,main]
java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.duplicate(HeapByteBuffer.java:107) ~[na:1.8.0_25]
        at org.apache.cassandra.db.composites.AbstractCType.sliceBytes(AbstractCType.java:369) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.db.composites.AbstractCompoundCellNameType.fromByteBuffer(AbstractCompoundCellNameType.java:101) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(AbstractCType.java:397) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(AbstractCType.java:381) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.db.composites.AbstractCellNameType$5.deserialize(AbstractCellNameType.java:117) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.db.composites.AbstractCellNameType$5.deserialize(AbstractCellNameType.java:109) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:106) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:101) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:110) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserializeOneCf(Mutation.java:322) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:302) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:330) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:272) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.net.MessageIn.read(MessageIn.java:99) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:168) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:150) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:82)

Re: Cassandra DC2 nodes down after increasing write requests on DC1 nodes

2014-11-16 Thread Eric Stevens
 load average on the DC1 nodes is around 3-5 and on the DC2 nodes around 7-10

Anecdotally I can say that loads in the 7-10 range have been dangerously
high. When we had a cluster running in this range, it was falling behind
on important tasks such as compaction, and we really struggled to
successfully bootstrap or repair in that DC (a 2.1.1 cluster).

Re: Cassandra DC2 nodes down after increasing write requests on DC1 nodes

2014-11-16 Thread Gabriel Menegatti
Hi Eric,

Thanks for your reply.

I said that the load was not a big deal because OpsCenter shows these loads as
green, not as yellow or red at all.

Also, our servers have many processors/threads, so I *think* this load is not 
problematic.

My assumption is that for some reason the 10 DC2 nodes are not able to
handle the volume of requests coming from DC1's 30 nodes. Even so, from my
point of view the load on the DC2 nodes should go really high before
Cassandra goes down, but it's not doing so.
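
To put rough numbers on it (assuming writes are spread evenly): if the cluster
accepts T writes per second in total, DC1 with RF=1 stores T mutations per second
across 30 nodes (about T/30 per node), while DC2 with RF=2 has to absorb 2T
mutations per second across only 10 nodes (about T/5 per node), i.e. roughly six
times the per-node write load of a DC1 node.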

Regards,
Gabriel

Sent from mobile.

 On 16/11/2014, at 12:25, Eric Stevens migh...@gmail.com wrote:
 
  load average on the DC1 nodes is around 3-5 and on the DC2 nodes around 7-10
 
 Anecdotally I can say that loads in the 7-10 range have been dangerously
 high. When we had a cluster running in this range, it was falling behind
 on important tasks such as compaction, and we really struggled to
 successfully bootstrap or repair in that DC (a 2.1.1 cluster).

Re: Cassandra DC2 nodes down after increasing write requests on DC1 nodes

2014-11-16 Thread Tim Heckman
Hello Gabriel,

On Sun, Nov 16, 2014 at 7:25 AM, Gabriel Menegatti
gabr...@s1mbi0se.com.br wrote:
 I said that the load was not a big deal because OpsCenter shows these loads as
 green, not as yellow or red at all.

 Also, our servers have many processors/threads, so I *think* this load is
 not problematic.

I've seen Cassandra clusters fall over with less load than that on the boxes,
so I'm not sure how much I trust OpsCenter here.

However, the impact depends on the system resources you have available.
How many CPU cores do these systems have, how much total and free memory
is there, and are the underlying disks SSDs or spinning platters of rust?

 My assumption is that for some reason the 10 DC2 nodes are not able to
 handle the volume of requests coming from DC1's 30 nodes. Even so, from my
 point of view the load on the DC2 nodes should go really high before
 Cassandra goes down, but it's not doing so.

That would make sense if the nodes are under-provisioned for the work
you are trying to throw at them. The load averages and the OOM in the heap
seem to indicate that may be the problem. However, without more details
it's hard to say.
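
If it helps, the heap numbers are also visible in nodetool info, but a quick
sketch like the one below (run in any JVM on the box, purely illustrative)
pulls the core count, 1-minute load average, and heap used vs. max in one place:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class NodeStats {
    public static void main(String[] args) {
        // Logical CPU count visible to the JVM.
        int cores = Runtime.getRuntime().availableProcessors();

        // 1-minute system load average (negative if the platform can't report it).
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();

        // Heap usage of *this* JVM; attach over JMX to see Cassandra's own heap.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();

        System.out.printf("cores=%d load1m=%.2f heapUsedMB=%d heapMaxMB=%d%n",
                cores, load,
                heap.getUsed() / (1024 * 1024),
                heap.getMax() / (1024 * 1024));
    }
}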

 Regards,
 Gabriel

Cheers!
-Tim