Node hang on shutdown

2013-10-31 Thread Mikhail Mazursky
Hi.

I was upgrading my 3 node testing cluster from 2.0.1 to 2.0.2. I
successfully upgraded two nodes but the last one did not shutdown properly.
Does somebody see anything suspicious in the attached thread dump?

Regards,
Mikhail.
2013-10-31 12:09:12
Full thread dump Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode):

Attach Listener daemon prio=10 tid=0x7f15e0007800 nid=0x3467 waiting on 
condition [0x]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
- None

StorageServiceShutdownHook prio=10 tid=0x7f15e4678800 nid=0x343e waiting 
on condition [0x7f15dc97b000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0xc6fddb10 (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
at 
java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468)
at 
org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:502)
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:724)

   Locked ownable synchronizers:
- None

SIGTERM handler daemon prio=10 tid=0x7f15e001c800 nid=0x343c in 
Object.wait() [0x7f15da198000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on 0xc6fe9218 (a java.lang.Thread)
at java.lang.Thread.join(Thread.java:1260)
- locked 0xc6fe9218 (a java.lang.Thread)
at java.lang.Thread.join(Thread.java:1334)
at 
java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
at 
java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
at java.lang.Shutdown.runHooks(Shutdown.java:123)
at java.lang.Shutdown.sequence(Shutdown.java:167)
at java.lang.Shutdown.exit(Shutdown.java:212)
- locked 0xc6c5e8c0 (a java.lang.Class for java.lang.Shutdown)
at java.lang.Terminator$1.handle(Terminator.java:52)
at sun.misc.Signal$1.run(Signal.java:212)
at java.lang.Thread.run(Thread.java:724)

   Locked ownable synchronizers:
- None

MigrationStage:3 daemon prio=10 tid=0x7f15ec00f000 nid=0x343a waiting on 
condition [0x7f15dc62e000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0xc6fcdb10 (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
at 
java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)

   Locked ownable synchronizers:
- None

MutationStage:2718 daemon prio=10 tid=0x7f15f4026000 nid=0x33ea waiting 
on condition [0x7f15d857a000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0xc2a186a0 (a 
java.util.concurrent.FutureTask$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:248)
at java.util.concurrent.FutureTask.get(FutureTask.java:111)
at 
org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:407)
at 
org.apache.cassandra.db.ColumnFamilyStore.forceBlockingFlush(ColumnFamilyStore.java:818)
at 
org.apache.cassandra.db.ColumnFamilyStore.truncateBlocking(ColumnFamilyStore.java:1913)
at 
org.apache.cassandra.db.TruncateVerbHandler.doVerb(TruncateVerbHandler.java:40)
at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 

RE: Node hang on shutdown

2013-10-31 Thread Romain HARDOUIN
Hi,

So you had to kill -9 the process?
Is there something interesting in system.log?
Can you restart the node or are there any errors on startup?

Romain

Mikhail Mazursky ash...@gmail.com a écrit sur 31/10/2013 08:02:22 :

 De : Mikhail Mazursky ash...@gmail.com
 A : user@cassandra.apache.org, 
 Date : 31/10/2013 08:04
 Objet : Node hang on shutdown
 
 Hi.

 I was upgrading my 3 node testing cluster from 2.0.1 to 2.0.2. I 
 successfully upgraded two nodes but the last one did not shutdown 
 properly. Does somebody see anything suspicious in the attached thread 
dump?

 Regards,
 Mikhail.

Re: Node hang on shutdown

2013-10-31 Thread Mikhail Mazursky
Romain,

yes, I had to kill -9 to stop it.

INFO [RequestResponseStage:54] 2013-10-31 11:59:10,413 Gossiper.java (line
789) InetAddress /192.168.0.197 is now UP
 INFO [GossipStage:1] 2013-10-31 11:59:10,706 StorageService.java (line
1298) Node /192.168.0.197 state jump to normal
 INFO [GossipStage:1] 2013-10-31 12:00:55,905 Gossiper.java (line 806)
InetAddress /192.168.0.251 is now DOWN
 INFO [HANDSHAKE-/192.168.0.251] 2013-10-31 12:01:35,848
OutboundTcpConnection.java (line 386) Handshaking version with /
192.168.0.251
 INFO [GossipStage:1] 2013-10-31 12:01:35,978 Gossiper.java (line 824) Node
/192.168.0.251 has restarted, now UP
 INFO [MemoryMeter:1] 2013-10-31 12:01:35,980 Memtable.java (line 442)
CFS(Keyspace='system', ColumnFamily='peers') liveRatio is
17.361702127659573 (just-counted was 17.361702127659573).  calculation took
0m
s for 10 columns
 INFO [HANDSHAKE-/192.168.0.251] 2013-10-31 12:01:35,989
OutboundTcpConnection.java (line 386) Handshaking version with /
192.168.0.251
 INFO [RequestResponseStage:55] 2013-10-31 12:01:36,065 Gossiper.java (line
789) InetAddress /192.168.0.251 is now UP
ERROR [MigrationStage:2] 2013-10-31 12:01:36,176 CassandraDaemon.java (line
185) Exception in thread Thread[MigrationStage:2,5,main]
java.lang.RuntimeException: java.io.FileNotFoundException:
/var/lib/cassandra/data/system/schema_keyspaces/system-schema_keyspaces-jb-95-Index.db
(Too many open files)
at
org.apache.cassandra.io.util.RandomAccessReader.open(RandomAccessReader.java:102)
at
org.apache.cassandra.io.util.RandomAccessReader.open(RandomAccessReader.java:90)
at
org.apache.cassandra.io.sstable.SSTableReader.openIndexReader(SSTableReader.java:1337)
at
org.apache.cassandra.io.sstable.SSTableScanner.init(SSTableScanner.java:68)
at
org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1115)
at
org.apache.cassandra.db.RowIteratorFactory.getIterator(RowIteratorFactory.java:69)
at
org.apache.cassandra.db.ColumnFamilyStore.getSequentialIterator(ColumnFamilyStore.java:1507)
at
org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1626)
at
org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1564)
at
org.apache.cassandra.db.SystemKeyspace.serializedSchema(SystemKeyspace.java:722)
at
org.apache.cassandra.db.SystemKeyspace.serializeSchema(SystemKeyspace.java:743)
at
org.apache.cassandra.db.SystemKeyspace.serializeSchema(SystemKeyspace.java:733)
at
org.apache.cassandra.db.MigrationRequestVerbHandler.doVerb(MigrationRequestVerbHandler.java:42)
at
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.FileNotFoundException:
/var/lib/cassandra/data/system/schema_keyspaces/system-schema_keyspaces-jb-95-Index.db
(Too many open files)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.init(RandomAccessFile.java:233)
at
org.apache.cassandra.io.util.RandomAccessReader.init(RandomAccessReader.java:58)
at
org.apache.cassandra.io.util.RandomAccessReader.open(RandomAccessReader.java:98)
... 16 more
 INFO [GossipStage:1] 2013-10-31 12:01:36,549 StorageService.java (line
1298) Node /192.168.0.251 state jump to normal
 INFO [StorageServiceShutdownHook] 2013-10-31 12:02:38,067
ThriftServer.java (line 141) Stop listening to thrift clients
 INFO [StorageServiceShutdownHook] 2013-10-31 12:02:38,174 Server.java
(line 174) Stop listening for CQL clients
 INFO [StorageServiceShutdownHook] 2013-10-31 12:02:38,175 Gossiper.java
(line 1129) Announcing shutdown
 INFO [StorageServiceShutdownHook] 2013-10-31 12:02:40,175
MessagingService.java (line 665) Waiting for messaging service to quiesce
 INFO [ACCEPT-/192.168.0.232] 2013-10-31 12:02:40,176 MessagingService.java
(line 846) MessagingService shutting down server thread.

And there are a lot of such exceptions (the above one is the last before it
hanged).
There are no exceptions after upgrade and startup.
Looks like C* is leaking file handles (nothing else is running on that VM).
I will try to debug it.

Thanks for pointing me to the logs.

Regards,
Mikhail.



2013/10/31 Romain HARDOUIN romain.hardo...@urssaf.fr

 Hi,

 So you had to kill -9 the process?
 Is there something interesting in system.log?
 Can you restart the node or are there any errors on startup?

 Romain

 Mikhail Mazursky ash...@gmail.com a écrit sur 31/10/2013 08:02:22 :

  De : Mikhail Mazursky ash...@gmail.com
  A : user@cassandra.apache.org,
  Date : 31/10/2013 08:04
  Objet : Node hang on shutdown
 
  Hi.

  I was upgrading my 3 node testing cluster from 2.0.1 to 2.0.2. I
  successfully upgraded two nodes

Re: Node hang on shutdown

2013-10-31 Thread Romain HARDOUIN
Too many open files are common if you haven't set limits properly 
(/etc/security/limits.conf).
But it this case it might be a file descriptor leak.

This link can help and is still relevant for C* 2.0:
http://www.datastax.com/docs/1.1/troubleshooting/index#java-reports-an-error-saying-there-are-too-many-open-files


Re: Node hang on shutdown

2013-10-31 Thread Mikhail Mazursky
Looks like a bug. See
https://issues.apache.org/jira/browse/CASSANDRA-6275for more details.