For the "Too many open files" error see: http://www.datastax.com/docs/0.8/troubleshooting/index#java-reports-an-error-saying-there-are-too-many-open-files
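The fix behind that link boils down to checking and raising the per-process file-descriptor limit. A minimal sketch (the limit names and `limits.conf` lines are the usual Debian/Ubuntu mechanism; substitute the user your Cassandra JVM runs as):

```shell
# Soft and hard open-file limits for the current shell; the Cassandra
# JVM inherits these from whatever starts it.
ulimit -Sn
ulimit -Hn

# Count the descriptors a running process actually holds. $$ (this
# shell) is used here so the line is safe to copy-paste; substitute
# the JVM's PID, e.g. 1459 from the netstat output further down.
ls /proc/$$/fd | wc -l

# Persistent raise on Debian/Ubuntu: add lines like these to
# /etc/security/limits.conf, then log out/in and restart Cassandra:
#   ubuntu  soft  nofile  32768
#   ubuntu  hard  nofile  32768
```

If the descriptor count is sitting near the soft limit, raising it only buys time unless whatever is leaking descriptors (here, apparently JMX sockets) is also addressed.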
Restart the node and see if it is able to complete the pending repair this
time. Your node may simply have been stuck on this error, which caused
everything else to halt.

Joaquin Casares
DataStax Software Engineer/Support

On Mon, Jul 18, 2011 at 3:12 PM, Sameer Farooqui <cassandral...@gmail.com> wrote:

> I'm running into a quirky issue with Brisk 1.0 Beta 2 (w/ Cassandra 0.8.1).
>
> I think the last node in our cluster is having problems (10.201.x.x).
> OpsCenter and nodetool ring (run from that node) show the node as down, but
> the rest of the cluster sees it as up.
>
> If I run nodetool ring from one of the first 11 nodes, I get this output...
> everything is up:
>
> ubuntu@ip-10-85-x-x:~/brisk/resources/cassandra$ bin/nodetool -h localhost ring
> Address     DC   Rack  Status  State   Load       Owns    Token
>                                                           148873535527910577765226390751398592512
> 10.2.x.x    DC1  RAC1  Up      Normal  901.57 GB  12.50%  0
> 10.116.x.x  DC2  RAC1  Up      Normal  258.22 GB   6.25%  10633823966279326983230456482242756608
> 10.110.x.x  DC1  RAC1  Up      Normal  129.07 GB   6.25%  21267647932558653966460912964485513216
> 10.2.x.x    DC1  RAC1  Up      Normal  128.5 GB   12.50%  42535295865117307932921825928971026432
> 10.114.x.x  DC2  RAC1  Up      Normal  257.31 GB   6.25%  53169119831396634916152282411213783040
> 10.210.x.x  DC1  RAC1  Up      Normal  128.66 GB   6.25%  63802943797675961899382738893456539648
> 10.207.x.x  DC1  RAC2  Up      Normal  643.12 GB  12.50%  85070591730234615865843651857942052864
> 10.85.x.x   DC2  RAC1  Up      Normal  256.76 GB   6.25%  95704415696513942849074108340184809472
> 10.2.x.x    DC1  RAC2  Up      Normal  128.95 GB   6.25%  106338239662793269832304564822427566080
> 10.96.x.x   DC1  RAC2  Up      Normal  128.29 GB  12.50%  127605887595351923798765477786913079296
> 10.194.x.x  DC2  RAC1  Up      Normal  257.14 GB   6.25%  138239711561631250781995934269155835904
> 10.201.x.x  DC1  RAC2  Up      Normal  129.45 GB   6.25%  148873535527910577765226390751398592512
>
> However, OpsCenter shows the last node (10.201.x.x) as unresponsive:
> http://blueplastic.com/accenture/unresponsive.PNG
>
> And if I try to run nodetool ring from the 10.201.x.x node, I get
> connection errors like this:
>
> ubuntu@ip-10-194-x-x:~/brisk/resources/cassandra$ bin/nodetool -h localhost ring
> Error connection to remote JMX agent!
> java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
>         java.net.SocketTimeoutException: Read timed out]
>         at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:338)
>         at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248)
>         at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:141)
>         at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:111)
>         at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:559)
> Caused by: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
>         java.net.SocketTimeoutException: Read timed out]
>         at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:101)
>
> The tpstats command also didn't work:
>
> ubuntu@ip-10-194-x-x:~/brisk/resources/cassandra$ bin/nodetool -h localhost tpstats
> Error connection to remote JMX agent!
> java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
>         java.net.SocketTimeoutException: Read timed out]
>         at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:338)
>
> Looking at what's listening on 7199 on the node shows a bunch of results:
>
> ubuntu@ip-10-194-x-x:~/brisk/resources/cassandra$ sudo netstat -anp | grep 7199
> tcp   0   0  0.0.0.0:7199     0.0.0.0:*        LISTEN      1459/java
> tcp   8   0  10.194.x.x:7199  10.2.x.x:40135   CLOSE_WAIT  -
> tcp   8   0  127.0.0.1:7199   127.0.0.1:49835  CLOSE_WAIT  -
> tcp   8   0  127.0.0.1:7199   127.0.0.1:55087  CLOSE_WAIT  -
> tcp   8   0  127.0.0.1:7199   127.0.0.1:49837  CLOSE_WAIT  -
> tcp   8   0  127.0.0.1:7199   127.0.0.1:55647  CLOSE_WAIT  -
> tcp   8   0  127.0.0.1:7199   127.0.0.1:49833  CLOSE_WAIT  -
> tcp   8   0  127.0.0.1:7199   127.0.0.1:52935  CLOSE_WAIT  -
> tcp   8   0  127.0.0.1:7199   127.0.0.1:52940  CLOSE_WAIT  -
> tcp   8   0  10.194.x.x:7199  10.2.x.x:40141   CLOSE_WAIT  -
> tcp   8   0  127.0.0.1:7199   127.0.0.1:52936  CLOSE_WAIT  -
> tcp   8   0  127.0.0.1:7199   127.0.0.1:55646  CLOSE_WAIT  -
> tcp   8   0  127.0.0.1:7199   127.0.0.1:39098  CLOSE_WAIT  -
> tcp   8   0  127.0.0.1:7199   127.0.0.1:39095  CLOSE_WAIT  -
> tcp   8   0  127.0.0.1:7199   127.0.0.1:55086  CLOSE_WAIT  -
> tcp   8   0  127.0.0.1:7199   127.0.0.1:50575  CLOSE_WAIT  -
> [list truncated, there are about 20 more lines]
>
> The /var/log/cassandra dir shows that none of the system.log files were
> touched in the last two days.
> The tail of the system.log.1 file shows:
>
> FATAL [TASK-TRACKER-INIT] 2011-07-16 07:48:19,120 Configuration.java (line 1256) error parsing conf file: java.io.FileNotFoundException: /home/ubuntu/brisk/resources/hadoop/conf/core-site.xml (Too many open files)
> FATAL [TASK-TRACKER-INIT] 2011-07-16 07:48:19,121 Configuration.java (line 1256) error parsing conf file: java.io.FileNotFoundException: /home/ubuntu/brisk/resources/hadoop/conf/core-site.xml (Too many open files)
> FATAL [TASK-TRACKER-INIT] 2011-07-16 07:48:19,123 Configuration.java (line 1256) error parsing conf file: java.io.FileNotFoundException: /home/ubuntu/brisk/resources/hadoop/conf/core-site.xml (Too many open files)
> FATAL [TASK-TRACKER-INIT] 2011-07-16 07:48:19,124 Configuration.java (line 1256) error parsing conf file: java.io.FileNotFoundException: /home/ubuntu/brisk/resources/hadoop/conf/core-site.xml (Too many open files)
>
> Also, this node might be going through a memory leak or a spinning thread.
> Check out the top output from it (specifically the CPU & MEM):
> http://blueplastic.com/accenture/top.PNG
>
> Anything else I can do to troubleshoot this? Is this a known issue that I
> can just ignore and reboot the node?
>
> - Sameer
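The CLOSE_WAIT pile-up on 7199 and the FileNotFoundException both point the same way: the JVM has exhausted its descriptor budget, so it can neither service new JMX connections nor open its own config files. A quick way to see what is eating the descriptors before restarting (a sketch; 1459 is the JVM PID from the netstat output above, so substitute your own):

```shell
# Tally the JVM's open descriptors by lsof's TYPE column (REG = regular
# files such as SSTables and logs, IPv4/IPv6 = sockets). A socket count
# near the nofile limit matches the CLOSE_WAIT pile-up above.
sudo lsof -p 1459 | awk 'NR > 1 {print $5}' | sort | uniq -c | sort -rn

# Count JMX-port connections stuck in CLOSE_WAIT: the peer has closed,
# but the JVM never got around to closing its end of the socket.
sudo netstat -an | awk '$4 ~ /:7199$/ && $6 == "CLOSE_WAIT"' | wc -l
```

If the leak is dominated by CLOSE_WAIT sockets, a restart clears the backlog but the count is worth re-checking afterwards to see whether it climbs again.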