I have a cluster of three Cassandra 0.7 beta 2 nodes (built today from the latest trunk) running on Large, EBS-backed, x64 EC2 instances, with RF=3. I attempted to write somewhere near 500,000 records every 15 minutes from a total of 5 different client machines, using Pelops and multi-threading. My network blew up, so I'm not quite sure how many records were actually inserted, but I lost a node a couple of hours later, and the other two were at severely high memory usage. Is this a memory leak of some kind, or something I can configure around / watch for in the future?
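For context, each of the 5 client machines runs something shaped roughly like this (a trimmed sketch, not my actual code: the thread and record counts are made up, and the Pelops mutator call is stubbed out since the client API isn't the point):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class WriteHarness
{
    // ~500k records per cycle across 5 boxes => ~100k per box (illustrative)
    private static final int RECORDS_PER_CYCLE = 100000;
    private static final int THREADS = 32; // made-up per-box concurrency

    public static void main(String[] args) throws InterruptedException
    {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int i = 0; i < RECORDS_PER_CYCLE; i++)
        {
            final String rowKey = "record-" + i;
            pool.execute(new Runnable()
            {
                public void run()
                {
                    writeRecord(rowKey);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(15, TimeUnit.MINUTES); // one 15-minute cycle
    }

    private static void writeRecord(String rowKey)
    {
        // Real code: create a Pelops Mutator, write the columns for
        // rowKey, and execute at the chosen ConsistencyLevel. Stubbed here.
    }
}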
nodetool shows this:

[ec2-u...@xxx bin]$ ./nodetool -h localhost ring
Address   Status  State   Load        Token
                                      XXX
ipXXX     Down    Normal  564.76 MB   XXX
ipXXX     Up      Normal  564.83 MB   XXX
ipXXX     Up      Normal  563.06 MB   XXX

top on the box that is down shows this (dual-core x64):

Cpu(s): 19.9%us,  5.9%sy,  0.0%ni,  8.8%id, 57.3%wa,  0.0%hi,  0.0%si,  8.1%st
Mem:   7651528k total,  7611112k used,    40416k free,    66056k buffers
Swap:        0k total,        0k used,        0k free,  3294076k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
22514 root  20   0 5790m 4.0g 167m S 91.9 54.8 152:45.08 java

And I see this error in the log file:

ERROR [CompactionExecutor:1] 2010-10-21 01:35:05,318 AbstractCassandraDaemon.java (line 88) Fatal exception in thread Thread[CompactionExecutor:1,1,main]
java.io.IOError: java.io.IOException: Cannot run program "ln": java.io.IOException: error=12, Cannot allocate memory
        at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1368)
        at org.apache.cassandra.db.Table.snapshot(Table.java:163)
        at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:232)
        at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:106)
        at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:84)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: Cannot run program "ln": java.io.IOException: error=12, Cannot allocate memory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
        at org.apache.cassandra.io.util.FileUtils.createHardLinkWithExec(FileUtils.java:263)
        at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:229)
        at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1360)
        ... 9 more
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
        at java.lang.ProcessImpl.start(ProcessImpl.java:81)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
        ... 12 more
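Looking at the bottom of that trace: the snapshot falls back to forking an external "ln", and with a ~4 GB resident JVM and no swap configured, I'd guess the fork() itself is what can't allocate memory. For comparison, a hard link can be made in-process with no fork at all; here's a minimal sketch using JNA direct mapping (my own illustration, not Cassandra's actual code; assumes a 3.x jna.jar on the classpath):

import com.sun.jna.Native;

import java.io.File;
import java.io.IOException;

public final class HardLinker
{
    static
    {
        Native.register("c"); // bind the native method below to libc
    }

    // POSIX link(2): create newpath as a hard link to existing
    private static native int link(String existing, String newpath);

    public static void createHardLink(File from, File to) throws IOException
    {
        if (link(from.getAbsolutePath(), to.getAbsolutePath()) != 0)
            throw new IOException("link(2) failed (errno " + Native.getLastError()
                                  + "): " + from + " -> " + to);
    }
}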
On Wed, Oct 20, 2010 at 3:16 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> can you reproduce this by, say, running nodeprobe ring in a bash while
> loop?
>
> On Wed, Oct 20, 2010 at 3:09 PM, Bill Au <bill.w...@gmail.com> wrote:
> > One of my Cassandra servers crashed with the following:
> >
> > ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn] 2010-10-19 00:25:10,419
> > CassandraDaemon.java (line 82) Uncaught exception in thread
> > Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main]
> > java.lang.OutOfMemoryError: unable to create new native thread
> >         at java.lang.Thread.start0(Native Method)
> >         at java.lang.Thread.start(Thread.java:597)
> >         at org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:533)
> >
> > I took thread dumps in the JVM on all the other Cassandra servers in my
> > cluster. They all have thousands of threads that look like this:
> >
> > "JMX server connection timeout 183373" daemon prio=10 tid=0x00002aad230db800
> > nid=0x5cf6 in Object.wait() [0x00002aad7a316000]
> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> >         at java.lang.Object.wait(Native Method)
> >         at com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
> >         - locked <0x00002aab056ccee0> (a [I)
> >         at java.lang.Thread.run(Thread.java:619)
> >
> > It seems to me that there is a JMX thread leak in Cassandra: NodeProbe
> > creates a JMXConnector but never calls its close() method. I tried
> > setting jmx.remote.x.server.connection.timeout to 0, hoping that would
> > disable the JMX server connection timeout threads, but that did not make
> > any difference.
> >
> > Has anyone else seen this?
> >
> > Bill
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
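Regarding the NodeProbe leak Bill describes: each unclosed client connection leaves a matching "JMX server connection timeout" thread alive on the server, so the client-side fix would be to close the connector once you're done with it. A minimal sketch against the standard javax.management.remote API (localhost and 8080 are placeholders; 8080 is just the JMX port my nodes happen to use):

import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxClientSketch
{
    public static void main(String[] args) throws Exception
    {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");

        JMXConnector connector = JMXConnectorFactory.connect(url, null);
        try
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            System.out.println("MBeans visible: " + mbs.getMBeanCount());
            // ... the ring/info queries NodeProbe performs would go here ...
        }
        finally
        {
            // Without this close(), the server keeps one
            // "JMX server connection timeout" thread per connection.
            connector.close();
        }
    }
}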