Re: Repair fails with java.io.IOError: java.io.EOFException
If they are and repair has completed use node tool cleanup to remove the data the node is no longer responsible. See bootstrap section above. I've seen that said a few times so allow me to correct. Cleanup is useless after a repair. 'nodetool cleanup' removes rows the node is not responsible anymore and is thus useful only after operations that change the range a node is responsible for (bootstrap, move, decommission). After a repair, you will need compaction to kick in to see you disk usage come back to normal. -- Sylvain Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 26 Jul 2011, at 12:44, Sameer Farooqui wrote: Looks like the repair finished successfully the second time. However, the cluster is still severely unbalanced. I was hoping the repair would balance the nodes. We're using random partitioner. One node has 900GB and others have 128GB, 191GB, 129GB, 257 GB, etc. The 900GB and the 646GB are just insanely high. Not sure why or how to troubleshoot. On Fri, Jul 22, 2011 at 1:28 PM, Sameer Farooqui cassandral...@gmail.com wrote: I don't see a JVM crashlog ( hs_err_pid[pid].log) in ~/brisk/resources/cassandra/bin or /tmp. So maybe JVM didn't crash? We're running a pretty up to date with Sun Java: ubuntu@ip-10-2-x-x:/tmp$ java -version java version 1.6.0_24 Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode) I'm gonna restart the Repair process in a few more hours. If there are any additional debug or troubleshooting logs you'd like me to enable first, please let me know. - Sameer On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis jbel...@gmail.com wrote: Did you check for a JVM crash log? You should make sure you're running the latest Sun JVM, older versions and OpenJDK in particular are prone to segfaulting. On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui cassandral...@gmail.com wrote: We are starting Cassandra with brisk cassandra, so as a stand-alone process, not a service. The syslog on the node doesn't show anything regarding the Cassandra Java process around the time the last entries were made in the Cassandra system.log (2011-07-21 13:01:51): Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source = /proc/kmsg started. Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software=rsyslogd swVersion=4.2.0 x-pid=663 x-info=http://www.rsyslog.com;] (re)start The last thing in the Cassandra log before INFO Logging initialized is: INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144 I can start Repair again, but am worried that it will crash Cassandra again, so I want to turn on any debugging or helpful logs to diagnose the crash if it happens again. - Sameer On Thu, Jul 21, 2011 at 4:30 PM, aaron morton aa...@thelastpickle.com wrote: The default init.d script will direct std out/err to that file, how are you starting brisk / cassandra ? Check the syslog and other logs in /var/log to see if the OS killed cassandra. Also, what was the last thing in the casandra log before INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialised ? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:50, Sameer Farooqui wrote: Hey Aaron, I don't have any output.log files in that folder: ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls system.log system.log.11 system.log.4 system.log.7 system.log.1 system.log.2 system.log.5 system.log.8 system.log.10 system.log.3 system.log.6 system.log.9 On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.com wrote: Check /var/log/cassandra/output.log (assuming the default init scripts) A - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:13, Sameer Farooqui wrote: Hmm. Just looked at the log more closely. So, what actually happened is while Repair was running on this specific node, the Cassandra java process terminated itself automatically. The last entries in the log are: INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is
Re: Repair fails with java.io.IOError: java.io.EOFException
Was guessing something like a token move may have happened in the past. Good suggestion to also kick off a major compaction. I've seen that make a big difference even for apps that do not do deletes, but do do overwrites. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 26 Jul 2011, at 19:00, Sylvain Lebresne wrote: If they are and repair has completed use node tool cleanup to remove the data the node is no longer responsible. See bootstrap section above. I've seen that said a few times so allow me to correct. Cleanup is useless after a repair. 'nodetool cleanup' removes rows the node is not responsible anymore and is thus useful only after operations that change the range a node is responsible for (bootstrap, move, decommission). After a repair, you will need compaction to kick in to see you disk usage come back to normal. -- Sylvain Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 26 Jul 2011, at 12:44, Sameer Farooqui wrote: Looks like the repair finished successfully the second time. However, the cluster is still severely unbalanced. I was hoping the repair would balance the nodes. We're using random partitioner. One node has 900GB and others have 128GB, 191GB, 129GB, 257 GB, etc. The 900GB and the 646GB are just insanely high. Not sure why or how to troubleshoot. On Fri, Jul 22, 2011 at 1:28 PM, Sameer Farooqui cassandral...@gmail.com wrote: I don't see a JVM crashlog ( hs_err_pid[pid].log) in ~/brisk/resources/cassandra/bin or /tmp. So maybe JVM didn't crash? We're running a pretty up to date with Sun Java: ubuntu@ip-10-2-x-x:/tmp$ java -version java version 1.6.0_24 Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode) I'm gonna restart the Repair process in a few more hours. If there are any additional debug or troubleshooting logs you'd like me to enable first, please let me know. - Sameer On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis jbel...@gmail.com wrote: Did you check for a JVM crash log? You should make sure you're running the latest Sun JVM, older versions and OpenJDK in particular are prone to segfaulting. On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui cassandral...@gmail.com wrote: We are starting Cassandra with brisk cassandra, so as a stand-alone process, not a service. The syslog on the node doesn't show anything regarding the Cassandra Java process around the time the last entries were made in the Cassandra system.log (2011-07-21 13:01:51): Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source = /proc/kmsg started. Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software=rsyslogd swVersion=4.2.0 x-pid=663 x-info=http://www.rsyslog.com;] (re)start The last thing in the Cassandra log before INFO Logging initialized is: INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144 I can start Repair again, but am worried that it will crash Cassandra again, so I want to turn on any debugging or helpful logs to diagnose the crash if it happens again. - Sameer On Thu, Jul 21, 2011 at 4:30 PM, aaron morton aa...@thelastpickle.com wrote: The default init.d script will direct std out/err to that file, how are you starting brisk / cassandra ? Check the syslog and other logs in /var/log to see if the OS killed cassandra. Also, what was the last thing in the casandra log before INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialised ? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:50, Sameer Farooqui wrote: Hey Aaron, I don't have any output.log files in that folder: ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls system.log system.log.11 system.log.4 system.log.7 system.log.1 system.log.2 system.log.5 system.log.8 system.log.10 system.log.3 system.log.6 system.log.9 On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.com wrote: Check /var/log/cassandra/output.log (assuming the default init scripts) A - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:13, Sameer Farooqui wrote: Hmm. Just looked at the log more
Re: Repair fails with java.io.IOError: java.io.EOFException
Thanks for the info guys. I'm running compaction on the two very highly loaded nodes now in hopes of the data volume going down. But I'm skeptical because I don't see how it got so unbalanced in the first place (all nodes were up while the writes were being injected). I should have an update tomorrow on whether compaction rebalanced the nodes. The tokens are evenly distributed across the ring: Address DC Rack Status State Load Owns Token 148873535527910577765226390751398592512 10.192.143.x DC1 RAC1 Up Normal 643.42 GB 12.50% 0 10.192.171.x DC1 RAC1 Up Normal 128.96 GB 6.25% 21267647932558653966460912964485513216 10.210.95.x DC1 RAC1 Up Normal 128.34 GB 12.50% 42535295865117307932921825928971026432 10.211.19.x DC1 RAC1 Up Normal 128.55 GB 6.25% 63802943797675961899382738893456539648 10.68.58.x DC1 RAC2 Up Normal 643.05 GB 12.50% 85070591730234615865843651857942052864 10.110.31.x DC1 RAC2 Up Normal 128.84 GB 6.25% 106338239662793269832304564822427566080 10.96.58.x DC1 RAC2 Up Normal 128.11 GB 12.50% 127605887595351923798765477786913079296 10.210.195.x DC1 RAC2 Up Normal 129.33 GB 6.25% 148873535527910577765226390751398592512 10.114.138.x DC2 RAC1 Up Normal 258.04 GB 6.25% 10633823966279326983230456482242756608 10.203.79.x DC2 RAC1 Up Normal 257.14 GB 6.25% 53169119831396634916152282411213783040 10.242.209.x DC2 RAC1 Up Normal 256.58 GB 6.25% 95704415696513942849074108340184809472 10.38.25.x DC2 RAC1 Up Normal 257.08 GB 6.25% 138239711561631250781995934269155835904 On Tue, Jul 26, 2011 at 1:59 AM, aaron morton aa...@thelastpickle.comwrote: Was guessing something like a token move may have happened in the past. Good suggestion to also kick off a major compaction. I've seen that make a big difference even for apps that do not do deletes, but do do overwrites. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 26 Jul 2011, at 19:00, Sylvain Lebresne wrote: If they are and repair has completed use node tool cleanup to remove the data the node is no longer responsible. See bootstrap section above. I've seen that said a few times so allow me to correct. Cleanup is useless after a repair. 'nodetool cleanup' removes rows the node is not responsible anymore and is thus useful only after operations that change the range a node is responsible for (bootstrap, move, decommission). After a repair, you will need compaction to kick in to see you disk usage come back to normal. -- Sylvain Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 26 Jul 2011, at 12:44, Sameer Farooqui wrote: Looks like the repair finished successfully the second time. However, the cluster is still severely unbalanced. I was hoping the repair would balance the nodes. We're using random partitioner. One node has 900GB and others have 128GB, 191GB, 129GB, 257 GB, etc. The 900GB and the 646GB are just insanely high. Not sure why or how to troubleshoot. On Fri, Jul 22, 2011 at 1:28 PM, Sameer Farooqui cassandral...@gmail.com wrote: I don't see a JVM crashlog ( hs_err_pid[pid].log) in ~/brisk/resources/cassandra/bin or /tmp. So maybe JVM didn't crash? We're running a pretty up to date with Sun Java: ubuntu@ip-10-2-x-x:/tmp$ java -version java version 1.6.0_24 Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode) I'm gonna restart the Repair process in a few more hours. If there are any additional debug or troubleshooting logs you'd like me to enable first, please let me know. - Sameer On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis jbel...@gmail.com wrote: Did you check for a JVM crash log? You should make sure you're running the latest Sun JVM, older versions and OpenJDK in particular are prone to segfaulting. On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui cassandral...@gmail.com wrote: We are starting Cassandra with brisk cassandra, so as a stand-alone process, not a service. The syslog on the node doesn't show anything regarding the Cassandra Java process around the time the last entries were made in the Cassandra system.log (2011-07-21 13:01:51): Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source = /proc/kmsg started. Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software=rsyslogd swVersion=4.2.0 x-pid=663 x-info=http://www.rsyslog.com;] (re)start The last thing in the Cassandra log before INFO Logging initialized is: INFO
Re: Repair fails with java.io.IOError: java.io.EOFException
Looks like the repair finished successfully the second time. However, the cluster is still severely unbalanced. I was hoping the repair would balance the nodes. We're using random partitioner. One node has 900GB and others have 128GB, 191GB, 129GB, 257 GB, etc. The 900GB and the 646GB are just insanely high. Not sure why or how to troubleshoot. On Fri, Jul 22, 2011 at 1:28 PM, Sameer Farooqui cassandral...@gmail.comwrote: I don't see a JVM crashlog ( hs_err_pid[pid].log) in ~/brisk/resources/cassandra/bin or /tmp. So maybe JVM didn't crash? We're running a pretty up to date with Sun Java: ubuntu@ip-10-2-x-x:/tmp$ java -version java version 1.6.0_24 Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode) I'm gonna restart the Repair process in a few more hours. If there are any additional debug or troubleshooting logs you'd like me to enable first, please let me know. - Sameer On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis jbel...@gmail.com wrote: Did you check for a JVM crash log? You should make sure you're running the latest Sun JVM, older versions and OpenJDK in particular are prone to segfaulting. On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui cassandral...@gmail.com wrote: We are starting Cassandra with brisk cassandra, so as a stand-alone process, not a service. The syslog on the node doesn't show anything regarding the Cassandra Java process around the time the last entries were made in the Cassandra system.log (2011-07-21 13:01:51): Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source = /proc/kmsg started. Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software=rsyslogd swVersion=4.2.0 x-pid=663 x-info=http://www.rsyslog.com;] (re)start The last thing in the Cassandra log before INFO Logging initialized is: INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144 I can start Repair again, but am worried that it will crash Cassandra again, so I want to turn on any debugging or helpful logs to diagnose the crash if it happens again. - Sameer On Thu, Jul 21, 2011 at 4:30 PM, aaron morton aa...@thelastpickle.com wrote: The default init.d script will direct std out/err to that file, how are you starting brisk / cassandra ? Check the syslog and other logs in /var/log to see if the OS killed cassandra. Also, what was the last thing in the casandra log before INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialised ? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:50, Sameer Farooqui wrote: Hey Aaron, I don't have any output.log files in that folder: ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls system.log system.log.11 system.log.4 system.log.7 system.log.1 system.log.2 system.log.5 system.log.8 system.log.10 system.log.3 system.log.6 system.log.9 On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.com wrote: Check /var/log/cassandra/output.log (assuming the default init scripts) A - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:13, Sameer Farooqui wrote: Hmm. Just looked at the log more closely. So, what actually happened is while Repair was running on this specific node, the Cassandra java process terminated itself automatically. The last entries in the log are: INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21
Re: Repair fails with java.io.IOError: java.io.EOFException
Background: http://wiki.apache.org/cassandra/Operations Use node tool ring to check if the tokens are evenly distributed. If not then check the Load Balancing and Moving Nodes sections in the page above. If they are and repair has completed use node tool cleanup to remove the data the node is no longer responsible. See bootstrap section above. Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 26 Jul 2011, at 12:44, Sameer Farooqui wrote: Looks like the repair finished successfully the second time. However, the cluster is still severely unbalanced. I was hoping the repair would balance the nodes. We're using random partitioner. One node has 900GB and others have 128GB, 191GB, 129GB, 257 GB, etc. The 900GB and the 646GB are just insanely high. Not sure why or how to troubleshoot. On Fri, Jul 22, 2011 at 1:28 PM, Sameer Farooqui cassandral...@gmail.com wrote: I don't see a JVM crashlog ( hs_err_pid[pid].log) in ~/brisk/resources/cassandra/bin or /tmp. So maybe JVM didn't crash? We're running a pretty up to date with Sun Java: ubuntu@ip-10-2-x-x:/tmp$ java -version java version 1.6.0_24 Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode) I'm gonna restart the Repair process in a few more hours. If there are any additional debug or troubleshooting logs you'd like me to enable first, please let me know. - Sameer On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis jbel...@gmail.com wrote: Did you check for a JVM crash log? You should make sure you're running the latest Sun JVM, older versions and OpenJDK in particular are prone to segfaulting. On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui cassandral...@gmail.com wrote: We are starting Cassandra with brisk cassandra, so as a stand-alone process, not a service. The syslog on the node doesn't show anything regarding the Cassandra Java process around the time the last entries were made in the Cassandra system.log (2011-07-21 13:01:51): Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source = /proc/kmsg started. Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software=rsyslogd swVersion=4.2.0 x-pid=663 x-info=http://www.rsyslog.com;] (re)start The last thing in the Cassandra log before INFO Logging initialized is: INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144 I can start Repair again, but am worried that it will crash Cassandra again, so I want to turn on any debugging or helpful logs to diagnose the crash if it happens again. - Sameer On Thu, Jul 21, 2011 at 4:30 PM, aaron morton aa...@thelastpickle.com wrote: The default init.d script will direct std out/err to that file, how are you starting brisk / cassandra ? Check the syslog and other logs in /var/log to see if the OS killed cassandra. Also, what was the last thing in the casandra log before INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialised ? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:50, Sameer Farooqui wrote: Hey Aaron, I don't have any output.log files in that folder: ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls system.log system.log.11 system.log.4 system.log.7 system.log.1 system.log.2 system.log.5 system.log.8 system.log.10 system.log.3 system.log.6 system.log.9 On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.com wrote: Check /var/log/cassandra/output.log (assuming the default init scripts) A - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:13, Sameer Farooqui wrote: Hmm. Just looked at the log more closely. So, what actually happened is while Repair was running on this specific node, the Cassandra java process terminated itself automatically. The last entries in the log are: INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max is 4030726144
Re: Repair fails with java.io.IOError: java.io.EOFException
I don't see a JVM crashlog ( hs_err_pid[pid].log) in ~/brisk/resources/cassandra/bin or /tmp. So maybe JVM didn't crash? We're running a pretty up to date with Sun Java: ubuntu@ip-10-2-x-x:/tmp$ java -version java version 1.6.0_24 Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode) I'm gonna restart the Repair process in a few more hours. If there are any additional debug or troubleshooting logs you'd like me to enable first, please let me know. - Sameer On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis jbel...@gmail.com wrote: Did you check for a JVM crash log? You should make sure you're running the latest Sun JVM, older versions and OpenJDK in particular are prone to segfaulting. On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui cassandral...@gmail.com wrote: We are starting Cassandra with brisk cassandra, so as a stand-alone process, not a service. The syslog on the node doesn't show anything regarding the Cassandra Java process around the time the last entries were made in the Cassandra system.log (2011-07-21 13:01:51): Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source = /proc/kmsg started. Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software=rsyslogd swVersion=4.2.0 x-pid=663 x-info=http://www.rsyslog.com;] (re)start The last thing in the Cassandra log before INFO Logging initialized is: INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144 I can start Repair again, but am worried that it will crash Cassandra again, so I want to turn on any debugging or helpful logs to diagnose the crash if it happens again. - Sameer On Thu, Jul 21, 2011 at 4:30 PM, aaron morton aa...@thelastpickle.com wrote: The default init.d script will direct std out/err to that file, how are you starting brisk / cassandra ? Check the syslog and other logs in /var/log to see if the OS killed cassandra. Also, what was the last thing in the casandra log before INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialised ? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:50, Sameer Farooqui wrote: Hey Aaron, I don't have any output.log files in that folder: ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls system.log system.log.11 system.log.4 system.log.7 system.log.1 system.log.2 system.log.5 system.log.8 system.log.10 system.log.3 system.log.6 system.log.9 On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.com wrote: Check /var/log/cassandra/output.log (assuming the default init scripts) A - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:13, Sameer Farooqui wrote: Hmm. Just looked at the log more closely. So, what actually happened is while Repair was running on this specific node, the Cassandra java process terminated itself automatically. The last entries in the log are: INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144 When we came in this morning, nodetool ring from another node showed the 1st node as down and OpsCenter also reported it as down. Next we ran sudo netstat -anp | grep 7199 from the 1st node to see the status of the Cassandra PID and it was not running. We then started Cassandra: INFO [main]
Repair fails with java.io.IOError: java.io.EOFException
While running Repair on a 0.8.1 node, we got this error in the system.log: ERROR [Thread-23] 2011-07-21 15:48:43,868 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-23,5,main] java.io.IOError: java.io.EOFException at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66) There's just a bunch of informational messages about Gossip before this. Looks like the file or stream unexpectedly ended? http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html Is this a bug or something wrong in our environment? - Sameer
Re: Repair fails with java.io.IOError: java.io.EOFException
Looks harmless to me. On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui cassandral...@gmail.com wrote: While running Repair on a 0.8.1 node, we got this error in the system.log: ERROR [Thread-23] 2011-07-21 15:48:43,868 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-23,5,main] java.io.IOError: java.io.EOFException at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66) There's just a bunch of informational messages about Gossip before this. Looks like the file or stream unexpectedly ended? http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html Is this a bug or something wrong in our environment? - Sameer -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Repair fails with java.io.IOError: java.io.EOFException
Hmm. Just looked at the log more closely. So, what actually happened is while Repair was running on this specific node, the Cassandra java process terminated itself automatically. The last entries in the log are: INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144 When we came in this morning, nodetool ring from another node showed the 1st node as down and OpsCenter also reported it as down. Next we ran sudo netstat -anp | grep 7199 from the 1st node to see the status of the Cassandra PID and it was not running. We then started Cassandra: INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialized INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java (line 96) Heap size: 3894411264/3894411264 INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA mlockall successful INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line 121) Loading settings from file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml It was during this start process that the java.io.EOFException was seen, but yes, like you said Jonathan, the Cassandra process started back up and joined the ring. We're now wondering why the Repair failed and why Cassandra crashed in the first place. We only had default level logging enabled. Is there something else I can check or that you suspect? Should we turn the logging up to debug and retry the Repair? - Sameer On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis jbel...@gmail.com wrote: Looks harmless to me. On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui cassandral...@gmail.com wrote: While running Repair on a 0.8.1 node, we got this error in the system.log: ERROR [Thread-23] 2011-07-21 15:48:43,868 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-23,5,main] java.io.IOError: java.io.EOFException at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66) There's just a bunch of informational messages about Gossip before this. Looks like the file or stream unexpectedly ended? http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html Is this a bug or something wrong in our environment? - Sameer -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Repair fails with java.io.IOError: java.io.EOFException
Check /var/log/cassandra/output.log (assuming the default init scripts) A - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:13, Sameer Farooqui wrote: Hmm. Just looked at the log more closely. So, what actually happened is while Repair was running on this specific node, the Cassandra java process terminated itself automatically. The last entries in the log are: INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144 When we came in this morning, nodetool ring from another node showed the 1st node as down and OpsCenter also reported it as down. Next we ran sudo netstat -anp | grep 7199 from the 1st node to see the status of the Cassandra PID and it was not running. We then started Cassandra: INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialized INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java (line 96) Heap size: 3894411264/3894411264 INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA mlockall successful INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line 121) Loading settings from file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml It was during this start process that the java.io.EOFException was seen, but yes, like you said Jonathan, the Cassandra process started back up and joined the ring. We're now wondering why the Repair failed and why Cassandra crashed in the first place. We only had default level logging enabled. Is there something else I can check or that you suspect? Should we turn the logging up to debug and retry the Repair? - Sameer On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis jbel...@gmail.com wrote: Looks harmless to me. On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui cassandral...@gmail.com wrote: While running Repair on a 0.8.1 node, we got this error in the system.log: ERROR [Thread-23] 2011-07-21 15:48:43,868 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-23,5,main] java.io.IOError: java.io.EOFException at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66) There's just a bunch of informational messages about Gossip before this. Looks like the file or stream unexpectedly ended? http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html Is this a bug or something wrong in our environment? - Sameer -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Repair fails with java.io.IOError: java.io.EOFException
Hey Aaron, I don't have any output.log files in that folder: ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls system.log system.log.11 system.log.4 system.log.7 system.log.1 system.log.2 system.log.5 system.log.8 system.log.10 system.log.3 system.log.6 system.log.9 On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.comwrote: Check /var/log/cassandra/output.log (assuming the default init scripts) A - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:13, Sameer Farooqui wrote: Hmm. Just looked at the log more closely. So, what actually happened is while Repair was running on this specific node, the Cassandra java process terminated itself automatically. The last entries in the log are: INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144 When we came in this morning, nodetool ring from another node showed the 1st node as down and OpsCenter also reported it as down. Next we ran sudo netstat -anp | grep 7199 from the 1st node to see the status of the Cassandra PID and it was not running. We then started Cassandra: INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialized INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java (line 96) Heap size: 3894411264/3894411264 INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA mlockall successful INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line 121) Loading settings from file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml It was during this start process that the java.io.EOFException was seen, but yes, like you said Jonathan, the Cassandra process started back up and joined the ring. We're now wondering why the Repair failed and why Cassandra crashed in the first place. We only had default level logging enabled. Is there something else I can check or that you suspect? Should we turn the logging up to debug and retry the Repair? - Sameer On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis jbel...@gmail.comwrote: Looks harmless to me. On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui cassandral...@gmail.com wrote: While running Repair on a 0.8.1 node, we got this error in the system.log: ERROR [Thread-23] 2011-07-21 15:48:43,868 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-23,5,main] java.io.IOError: java.io.EOFException at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66) There's just a bunch of informational messages about Gossip before this. Looks like the file or stream unexpectedly ended? http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html Is this a bug or something wrong in our environment? - Sameer -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Repair fails with java.io.IOError: java.io.EOFException
The default init.d script will direct std out/err to that file, how are you starting brisk / cassandra ? Check the syslog and other logs in /var/log to see if the OS killed cassandra. Also, what was the last thing in the casandra log before INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialised ? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:50, Sameer Farooqui wrote: Hey Aaron, I don't have any output.log files in that folder: ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls system.log system.log.11 system.log.4 system.log.7 system.log.1 system.log.2 system.log.5 system.log.8 system.log.10 system.log.3 system.log.6 system.log.9 On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.com wrote: Check /var/log/cassandra/output.log (assuming the default init scripts) A - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:13, Sameer Farooqui wrote: Hmm. Just looked at the log more closely. So, what actually happened is while Repair was running on this specific node, the Cassandra java process terminated itself automatically. The last entries in the log are: INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144 When we came in this morning, nodetool ring from another node showed the 1st node as down and OpsCenter also reported it as down. Next we ran sudo netstat -anp | grep 7199 from the 1st node to see the status of the Cassandra PID and it was not running. We then started Cassandra: INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialized INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java (line 96) Heap size: 3894411264/3894411264 INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA mlockall successful INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line 121) Loading settings from file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml It was during this start process that the java.io.EOFException was seen, but yes, like you said Jonathan, the Cassandra process started back up and joined the ring. We're now wondering why the Repair failed and why Cassandra crashed in the first place. We only had default level logging enabled. Is there something else I can check or that you suspect? Should we turn the logging up to debug and retry the Repair? - Sameer On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis jbel...@gmail.com wrote: Looks harmless to me. On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui cassandral...@gmail.com wrote: While running Repair on a 0.8.1 node, we got this error in the system.log: ERROR [Thread-23] 2011-07-21 15:48:43,868 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-23,5,main] java.io.IOError: java.io.EOFException at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66) There's just a bunch of informational messages about Gossip before this. Looks like the file or stream unexpectedly ended? http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html Is this a bug or something wrong in our environment? - Sameer -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Repair fails with java.io.IOError: java.io.EOFException
We are starting Cassandra with brisk cassandra, so as a stand-alone process, not a service. The syslog on the node doesn't show anything regarding the Cassandra Java process around the time the last entries were made in the Cassandra system.log (2011-07-21 13:01:51): Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source = /proc/kmsg started. Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software=rsyslogd swVersion=4.2.0 x-pid=663 x-info=http://www.rsyslog.com;] (re)start The last thing in the Cassandra log before INFO Logging initialized is: INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144 I can start Repair again, but am worried that it will crash Cassandra again, so I want to turn on any debugging or helpful logs to diagnose the crash if it happens again. - Sameer On Thu, Jul 21, 2011 at 4:30 PM, aaron morton aa...@thelastpickle.comwrote: The default init.d script will direct std out/err to that file, how are you starting brisk / cassandra ? Check the syslog and other logs in /var/log to see if the OS killed cassandra. Also, what was the last thing in the casandra log before INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialised ? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:50, Sameer Farooqui wrote: Hey Aaron, I don't have any output.log files in that folder: ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls system.log system.log.11 system.log.4 system.log.7 system.log.1 system.log.2 system.log.5 system.log.8 system.log.10 system.log.3 system.log.6 system.log.9 On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.comwrote: Check /var/log/cassandra/output.log (assuming the default init scripts) A - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:13, Sameer Farooqui wrote: Hmm. Just looked at the log more closely. So, what actually happened is while Repair was running on this specific node, the Cassandra java process terminated itself automatically. The last entries in the log are: INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144 When we came in this morning, nodetool ring from another node showed the 1st node as down and OpsCenter also reported it as down. Next we ran sudo netstat -anp | grep 7199 from the 1st node to see the status of the Cassandra PID and it was not running. We then started Cassandra: INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialized INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java (line 96) Heap size: 3894411264/3894411264 INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA mlockall successful INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line 121) Loading settings from file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml It was during this start process that the java.io.EOFException was seen, but yes, like you said Jonathan, the Cassandra process started back up and joined the ring. We're now wondering why the Repair failed and why Cassandra crashed in the first place. We only had default level logging enabled. Is there something else I can check or that you suspect? Should we turn the logging up to debug and retry the Repair? - Sameer On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis jbel...@gmail.comwrote: Looks harmless to me. On Thu, Jul 21,
Re: Repair fails with java.io.IOError: java.io.EOFException
Did you check for a JVM crash log? You should make sure you're running the latest Sun JVM, older versions and OpenJDK in particular are prone to segfaulting. On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui cassandral...@gmail.com wrote: We are starting Cassandra with brisk cassandra, so as a stand-alone process, not a service. The syslog on the node doesn't show anything regarding the Cassandra Java process around the time the last entries were made in the Cassandra system.log (2011-07-21 13:01:51): Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v debian-sa1 /dev/null debian-sa1 1 1) Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source = /proc/kmsg started. Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software=rsyslogd swVersion=4.2.0 x-pid=663 x-info=http://www.rsyslog.com;] (re)start The last thing in the Cassandra log before INFO Logging initialized is: INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144 I can start Repair again, but am worried that it will crash Cassandra again, so I want to turn on any debugging or helpful logs to diagnose the crash if it happens again. - Sameer On Thu, Jul 21, 2011 at 4:30 PM, aaron morton aa...@thelastpickle.com wrote: The default init.d script will direct std out/err to that file, how are you starting brisk / cassandra ? Check the syslog and other logs in /var/log to see if the OS killed cassandra. Also, what was the last thing in the casandra log before INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialised ? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:50, Sameer Farooqui wrote: Hey Aaron, I don't have any output.log files in that folder: ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls system.log system.log.11 system.log.4 system.log.7 system.log.1 system.log.2 system.log.5 system.log.8 system.log.10 system.log.3 system.log.6 system.log.9 On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.com wrote: Check /var/log/cassandra/output.log (assuming the default init scripts) A - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 10:13, Sameer Farooqui wrote: Hmm. Just looked at the log more closely. So, what actually happened is while Repair was running on this specific node, the Cassandra java process terminated itself automatically. The last entries in the log are: INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max is 4030726144 INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144 When we came in this morning, nodetool ring from another node showed the 1st node as down and OpsCenter also reported it as down. Next we ran sudo netstat -anp | grep 7199 from the 1st node to see the status of the Cassandra PID and it was not running. We then started Cassandra: INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialized INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java (line 96) Heap size: 3894411264/3894411264 INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA mlockall successful INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line 121) Loading settings from file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml It was during this start process that the java.io.EOFException was seen, but yes, like you said Jonathan, the Cassandra process started back up and joined the ring. We're now wondering why the Repair failed and why Cassandra crashed in the first place. We only had default level logging