Node hang on shutdown
Hi. I was upgrading my 3 node testing cluster from 2.0.1 to 2.0.2. I successfully upgraded two nodes but the last one did not shutdown properly. Does somebody see anything suspicious in the attached thread dump? Regards, Mikhail. 2013-10-31 12:09:12 Full thread dump Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode): Attach Listener daemon prio=10 tid=0x7f15e0007800 nid=0x3467 waiting on condition [0x] java.lang.Thread.State: RUNNABLE Locked ownable synchronizers: - None StorageServiceShutdownHook prio=10 tid=0x7f15e4678800 nid=0x343e waiting on condition [0x7f15dc97b000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0xc6fddb10 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082) at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468) at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:502) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.lang.Thread.run(Thread.java:724) Locked ownable synchronizers: - None SIGTERM handler daemon prio=10 tid=0x7f15e001c800 nid=0x343c in Object.wait() [0x7f15da198000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0xc6fe9218 (a java.lang.Thread) at java.lang.Thread.join(Thread.java:1260) - locked 0xc6fe9218 (a java.lang.Thread) at java.lang.Thread.join(Thread.java:1334) at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106) at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46) at java.lang.Shutdown.runHooks(Shutdown.java:123) at java.lang.Shutdown.sequence(Shutdown.java:167) at java.lang.Shutdown.exit(Shutdown.java:212) - locked 0xc6c5e8c0 (a java.lang.Class for java.lang.Shutdown) at java.lang.Terminator$1.handle(Terminator.java:52) at sun.misc.Signal$1.run(Signal.java:212) at java.lang.Thread.run(Thread.java:724) Locked ownable synchronizers: - None MigrationStage:3 daemon prio=10 tid=0x7f15ec00f000 nid=0x343a waiting on condition [0x7f15dc62e000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0xc6fcdb10 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082) at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) Locked ownable synchronizers: - None MutationStage:2718 daemon prio=10 tid=0x7f15f4026000 nid=0x33ea waiting on condition [0x7f15d857a000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0xc2a186a0 (a java.util.concurrent.FutureTask$Sync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303) at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:248) at java.util.concurrent.FutureTask.get(FutureTask.java:111) at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:407) at org.apache.cassandra.db.ColumnFamilyStore.forceBlockingFlush(ColumnFamilyStore.java:818) at org.apache.cassandra.db.ColumnFamilyStore.truncateBlocking(ColumnFamilyStore.java:1913) at org.apache.cassandra.db.TruncateVerbHandler.doVerb(TruncateVerbHandler.java:40) at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at
RE: Node hang on shutdown
Hi, So you had to kill -9 the process? Is there something interesting in system.log? Can you restart the node or are there any errors on startup? Romain Mikhail Mazursky ash...@gmail.com a écrit sur 31/10/2013 08:02:22 : De : Mikhail Mazursky ash...@gmail.com A : user@cassandra.apache.org, Date : 31/10/2013 08:04 Objet : Node hang on shutdown Hi. I was upgrading my 3 node testing cluster from 2.0.1 to 2.0.2. I successfully upgraded two nodes but the last one did not shutdown properly. Does somebody see anything suspicious in the attached thread dump? Regards, Mikhail.
Re: Node hang on shutdown
Romain, yes, I had to kill -9 to stop it. INFO [RequestResponseStage:54] 2013-10-31 11:59:10,413 Gossiper.java (line 789) InetAddress /192.168.0.197 is now UP INFO [GossipStage:1] 2013-10-31 11:59:10,706 StorageService.java (line 1298) Node /192.168.0.197 state jump to normal INFO [GossipStage:1] 2013-10-31 12:00:55,905 Gossiper.java (line 806) InetAddress /192.168.0.251 is now DOWN INFO [HANDSHAKE-/192.168.0.251] 2013-10-31 12:01:35,848 OutboundTcpConnection.java (line 386) Handshaking version with / 192.168.0.251 INFO [GossipStage:1] 2013-10-31 12:01:35,978 Gossiper.java (line 824) Node /192.168.0.251 has restarted, now UP INFO [MemoryMeter:1] 2013-10-31 12:01:35,980 Memtable.java (line 442) CFS(Keyspace='system', ColumnFamily='peers') liveRatio is 17.361702127659573 (just-counted was 17.361702127659573). calculation took 0m s for 10 columns INFO [HANDSHAKE-/192.168.0.251] 2013-10-31 12:01:35,989 OutboundTcpConnection.java (line 386) Handshaking version with / 192.168.0.251 INFO [RequestResponseStage:55] 2013-10-31 12:01:36,065 Gossiper.java (line 789) InetAddress /192.168.0.251 is now UP ERROR [MigrationStage:2] 2013-10-31 12:01:36,176 CassandraDaemon.java (line 185) Exception in thread Thread[MigrationStage:2,5,main] java.lang.RuntimeException: java.io.FileNotFoundException: /var/lib/cassandra/data/system/schema_keyspaces/system-schema_keyspaces-jb-95-Index.db (Too many open files) at org.apache.cassandra.io.util.RandomAccessReader.open(RandomAccessReader.java:102) at org.apache.cassandra.io.util.RandomAccessReader.open(RandomAccessReader.java:90) at org.apache.cassandra.io.sstable.SSTableReader.openIndexReader(SSTableReader.java:1337) at org.apache.cassandra.io.sstable.SSTableScanner.init(SSTableScanner.java:68) at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1115) at org.apache.cassandra.db.RowIteratorFactory.getIterator(RowIteratorFactory.java:69) at org.apache.cassandra.db.ColumnFamilyStore.getSequentialIterator(ColumnFamilyStore.java:1507) at org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1626) at org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1564) at org.apache.cassandra.db.SystemKeyspace.serializedSchema(SystemKeyspace.java:722) at org.apache.cassandra.db.SystemKeyspace.serializeSchema(SystemKeyspace.java:743) at org.apache.cassandra.db.SystemKeyspace.serializeSchema(SystemKeyspace.java:733) at org.apache.cassandra.db.MigrationRequestVerbHandler.doVerb(MigrationRequestVerbHandler.java:42) at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) Caused by: java.io.FileNotFoundException: /var/lib/cassandra/data/system/schema_keyspaces/system-schema_keyspaces-jb-95-Index.db (Too many open files) at java.io.RandomAccessFile.open(Native Method) at java.io.RandomAccessFile.init(RandomAccessFile.java:233) at org.apache.cassandra.io.util.RandomAccessReader.init(RandomAccessReader.java:58) at org.apache.cassandra.io.util.RandomAccessReader.open(RandomAccessReader.java:98) ... 16 more INFO [GossipStage:1] 2013-10-31 12:01:36,549 StorageService.java (line 1298) Node /192.168.0.251 state jump to normal INFO [StorageServiceShutdownHook] 2013-10-31 12:02:38,067 ThriftServer.java (line 141) Stop listening to thrift clients INFO [StorageServiceShutdownHook] 2013-10-31 12:02:38,174 Server.java (line 174) Stop listening for CQL clients INFO [StorageServiceShutdownHook] 2013-10-31 12:02:38,175 Gossiper.java (line 1129) Announcing shutdown INFO [StorageServiceShutdownHook] 2013-10-31 12:02:40,175 MessagingService.java (line 665) Waiting for messaging service to quiesce INFO [ACCEPT-/192.168.0.232] 2013-10-31 12:02:40,176 MessagingService.java (line 846) MessagingService shutting down server thread. And there are a lot of such exceptions (the above one is the last before it hanged). There are no exceptions after upgrade and startup. Looks like C* is leaking file handles (nothing else is running on that VM). I will try to debug it. Thanks for pointing me to the logs. Regards, Mikhail. 2013/10/31 Romain HARDOUIN romain.hardo...@urssaf.fr Hi, So you had to kill -9 the process? Is there something interesting in system.log? Can you restart the node or are there any errors on startup? Romain Mikhail Mazursky ash...@gmail.com a écrit sur 31/10/2013 08:02:22 : De : Mikhail Mazursky ash...@gmail.com A : user@cassandra.apache.org, Date : 31/10/2013 08:04 Objet : Node hang on shutdown Hi. I was upgrading my 3 node testing cluster from 2.0.1 to 2.0.2. I successfully upgraded two nodes
Re: Node hang on shutdown
Too many open files are common if you haven't set limits properly (/etc/security/limits.conf). But it this case it might be a file descriptor leak. This link can help and is still relevant for C* 2.0: http://www.datastax.com/docs/1.1/troubleshooting/index#java-reports-an-error-saying-there-are-too-many-open-files
Re: Node hang on shutdown
Looks like a bug. See https://issues.apache.org/jira/browse/CASSANDRA-6275for more details.