For the "Too many open files" error see:
http://www.datastax.com/docs/0.8/troubleshooting/index#java-reports-an-error-saying-there-are-too-many-open-files

Restart the node and check whether it can complete the pending repair this
time. The node may simply have been stuck on this error, which caused
everything else to halt.
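
Before or after the restart, it may help to compare how many file
descriptors the JVM holds with the limit it is running under. The following
is only a minimal sketch, assuming a Linux host with /proc mounted. The
CASSANDRA_PID variable is a hypothetical name used for illustration (your
netstat output lists the JVM as pid 1459); when it is unset, the sketch
inspects the current shell so that it still runs.

```shell
#!/bin/sh
# Sketch only: assumes a Linux /proc filesystem.
# CASSANDRA_PID is a hypothetical variable name, not something Brisk or
# Cassandra sets for you.

# Count the open file descriptors of the given pid.
open_fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Print the soft "Max open files" limit the given pid is running with.
soft_fd_limit() {
    awk '/^Max open files/ {print $4}' "/proc/$1/limits"
}

# Fall back to the current shell when no pid is supplied.
pid="${CASSANDRA_PID:-$$}"
echo "pid $pid: $(open_fd_count "$pid") descriptors open, soft limit $(soft_fd_limit "$pid")"
```

Reading another user's process normally requires root, so something like
"sudo env CASSANDRA_PID=1459 sh check_fds.sh" would be needed for the JVM
(check_fds.sh is a hypothetical file name). Sockets stuck in CLOSE_WAIT also
hold descriptors, so they count toward the same limit. If the count is close
to the limit, raising the open-files limit as described on the page linked
above is the usual remedy.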

Joaquin Casares
DataStax
Software Engineer/Support



On Mon, Jul 18, 2011 at 3:12 PM, Sameer Farooqui <cassandral...@gmail.com> wrote:

> I'm running into a quirky issue with Brisk 1.0 Beta 2 (w/ Cassandra 0.8.1).
>
> I think the last node in our cluster is having problems (10.201.x.x).
> OpsCenter and nodetool ring (run from that node) show the node as down, but
> the rest of the cluster sees it as up.
>
> If I run nodetool ring from one of the first 11 nodes, I get this output...
> everything is up:
>
> ubuntu@ip-10-85-x-x:~/brisk/resources/cassandra$ bin/nodetool -h localhost ring
> Address     DC   Rack  Status  State   Load       Owns    Token
>                                                           148873535527910577765226390751398592512
> 10.2.x.x    DC1  RAC1  Up      Normal  901.57 GB  12.50%  0
> 10.116.x.x  DC2  RAC1  Up      Normal  258.22 GB  6.25%   10633823966279326983230456482242756608
> 10.110.x.x  DC1  RAC1  Up      Normal  129.07 GB  6.25%   21267647932558653966460912964485513216
> 10.2.x.x    DC1  RAC1  Up      Normal  128.5 GB   12.50%  42535295865117307932921825928971026432
> 10.114.x.x  DC2  RAC1  Up      Normal  257.31 GB  6.25%   53169119831396634916152282411213783040
> 10.210.x.x  DC1  RAC1  Up      Normal  128.66 GB  6.25%   63802943797675961899382738893456539648
> 10.207.x.x  DC1  RAC2  Up      Normal  643.12 GB  12.50%  85070591730234615865843651857942052864
> 10.85.x.x   DC2  RAC1  Up      Normal  256.76 GB  6.25%   95704415696513942849074108340184809472
> 10.2.x.x    DC1  RAC2  Up      Normal  128.95 GB  6.25%   106338239662793269832304564822427566080
> 10.96.x.x   DC1  RAC2  Up      Normal  128.29 GB  12.50%  127605887595351923798765477786913079296
> 10.194.x.x  DC2  RAC1  Up      Normal  257.14 GB  6.25%   138239711561631250781995934269155835904
> 10.201.x.x  DC1  RAC2  Up      Normal  129.45 GB  6.25%   148873535527910577765226390751398592512
>
> However, OpsCenter shows the last node (10.201.x.x) as unresponsive:
> http://blueplastic.com/accenture/unresponsive.PNG
>
> And if I try to run nodetool ring from the 10.201.x.x node, I get
> connection errors like this:
>
> ubuntu@ip-10-194-x-x:~/brisk/resources/cassandra$ bin/nodetool -h localhost ring
> Error connection to remote JMX agent!
> java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
>         java.net.SocketTimeoutException: Read timed out]
>         at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:338)
>         at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248)
>         at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:141)
>         at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:111)
>         at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:559)
> Caused by: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
>         java.net.SocketTimeoutException: Read timed out]
>         at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:101)
>
>
> tpstats command also didn't work:
>
> ubuntu@ip-10-194-x-x:~/brisk/resources/cassandra$ bin/nodetool -h localhost tpstats
> Error connection to remote JMX agent!
> java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
>         java.net.SocketTimeoutException: Read timed out]
>         at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:338)
>
>
> Looking at what's listening on 7199 on the node shows a bunch of results:
>
> ubuntu@ip-10-194-x-x:~/brisk/resources/cassandra$ sudo netstat -anp | grep 7199
> tcp        0      0 0.0.0.0:7199            0.0.0.0:*               LISTEN      1459/java
> tcp        8      0 10.194.x.x:7199         10.2.x.x:40135          CLOSE_WAIT  -
> tcp        8      0 127.0.0.1:7199          127.0.0.1:49835         CLOSE_WAIT  -
> tcp        8      0 127.0.0.1:7199          127.0.0.1:55087         CLOSE_WAIT  -
> tcp        8      0 127.0.0.1:7199          127.0.0.1:49837         CLOSE_WAIT  -
> tcp        8      0 127.0.0.1:7199          127.0.0.1:55647         CLOSE_WAIT  -
> tcp        8      0 127.0.0.1:7199          127.0.0.1:49833         CLOSE_WAIT  -
> tcp        8      0 127.0.0.1:7199          127.0.0.1:52935         CLOSE_WAIT  -
> tcp        8      0 127.0.0.1:7199          127.0.0.1:52940         CLOSE_WAIT  -
> tcp        8      0 10.194.x.x:7199         10.2.x.x:40141          CLOSE_WAIT  -
> tcp        8      0 127.0.0.1:7199          127.0.0.1:52936         CLOSE_WAIT  -
> tcp        8      0 127.0.0.1:7199          127.0.0.1:55646         CLOSE_WAIT  -
> tcp        8      0 127.0.0.1:7199          127.0.0.1:39098         CLOSE_WAIT  -
> tcp        8      0 127.0.0.1:7199          127.0.0.1:39095         CLOSE_WAIT  -
> tcp        8      0 127.0.0.1:7199          127.0.0.1:55086         CLOSE_WAIT  -
> tcp        8      0 127.0.0.1:7199          127.0.0.1:50575         CLOSE_WAIT  -
> [list truncated, there are about 20 more lines]
>
>
> The /var/log/cassandra dir shows that none of the system.log files were
> touched in the last two days. The system.log.1 file's tail shows:
>
> FATAL [TASK-TRACKER-INIT] 2011-07-16 07:48:19,120 Configuration.java (line
> 1256) error parsing conf file: java.io.FileNotFoundException:
> /home/ubuntu/brisk/resources/hadoop/conf/core-site.xml (Too many open files)
> FATAL [TASK-TRACKER-INIT] 2011-07-16 07:48:19,121 Configuration.java (line
> 1256) error parsing conf file: java.io.FileNotFoundException:
> /home/ubuntu/brisk/resources/hadoop/conf/core-site.xml (Too many open files)
> FATAL [TASK-TRACKER-INIT] 2011-07-16 07:48:19,123 Configuration.java (line
> 1256) error parsing conf file: java.io.FileNotFoundException:
> /home/ubuntu/brisk/resources/hadoop/conf/core-site.xml (Too many open files)
> FATAL [TASK-TRACKER-INIT] 2011-07-16 07:48:19,124 Configuration.java (line
> 1256) error parsing conf file: java.io.FileNotFoundException:
> /home/ubuntu/brisk/resources/hadoop/conf/core-site.xml (Too many open files)
>
> Also, this node might have a memory leak or a spinning thread.
> Check out the top output from it (specifically the CPU & MEM):
> http://blueplastic.com/accenture/top.PNG
>
>
> Anything else I can do to troubleshoot this? Is this a known issue that I
> can just ignore and reboot the node?
>
> - Sameer
>
>
