Re: Repair fails with java.io.IOError: java.io.EOFException

2011-07-26 Thread Sylvain Lebresne
 If they are and repair has completed use node tool cleanup to remove the
 data the node is no longer responsible. See bootstrap section above.

I've seen that said a few times so allow me to correct. Cleanup is useless after
a repair. 'nodetool cleanup' removes rows the node is not responsible anymore
and is thus useful only after operations that change the range a node is
responsible for (bootstrap, move, decommission). After a repair, you will need
compaction to kick in to see you disk usage come back to normal.

--
Sylvain

 Hope that helps.

 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 On 26 Jul 2011, at 12:44, Sameer Farooqui wrote:

 Looks like the repair finished successfully the second time. However, the
 cluster is still severely unbalanced. I was hoping the repair would balance
 the nodes. We're using random partitioner. One node has 900GB and others
 have 128GB, 191GB, 129GB, 257 GB, etc. The 900GB and the 646GB are just
 insanely high. Not sure why or how to troubleshoot.



 On Fri, Jul 22, 2011 at 1:28 PM, Sameer Farooqui cassandral...@gmail.com
 wrote:

 I don't see a JVM crashlog ( hs_err_pid[pid].log) in
 ~/brisk/resources/cassandra/bin or /tmp. So maybe JVM didn't crash?

 We're running a pretty up to date with Sun Java:

 ubuntu@ip-10-2-x-x:/tmp$ java -version
 java version 1.6.0_24
 Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
 Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

 I'm gonna restart the Repair process in a few more hours. If there are any
 additional debug or troubleshooting logs you'd like me to enable first,
 please let me know.

 - Sameer



 On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis jbel...@gmail.com wrote:

 Did you check for a JVM crash log?

 You should make sure you're running the latest Sun JVM, older versions
 and OpenJDK in particular are prone to segfaulting.

 On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui
 cassandral...@gmail.com wrote:
  We are starting Cassandra with brisk cassandra, so as a stand-alone
  process, not a service.
 
  The syslog on the node doesn't show anything regarding the Cassandra
  Java
  process around the time the last entries were made in the Cassandra
  system.log (2011-07-21 13:01:51):
 
  Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v
  debian-sa1  /dev/null  debian-sa1 1 1)
  Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v
  debian-sa1  /dev/null  debian-sa1 1 1)
  Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v
  debian-sa1  /dev/null  debian-sa1 1 1)
  Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source =
  /proc/kmsg started.
  Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software=rsyslogd
  swVersion=4.2.0 x-pid=663 x-info=http://www.rsyslog.com;]
  (re)start
 
 
  The last thing in the Cassandra log before INFO Logging initialized is:
 
   INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line
  128)
  GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max
  is
  4030726144
 
 
  I can start Repair again, but am worried that it will crash Cassandra
  again,
  so I want to turn on any debugging or helpful logs to diagnose the
  crash if
  it happens again.
 
 
  - Sameer
 
 
  On Thu, Jul 21, 2011 at 4:30 PM, aaron morton aa...@thelastpickle.com
  wrote:
 
  The default init.d script will direct std out/err to that file, how
  are
  you starting brisk / cassandra ?
  Check the syslog and other logs in /var/log to see if the OS killed
  cassandra.
  Also, what was the last thing in the casandra log before INFO [main]
  2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging
  initialised ?
 
  Cheers
 
  -
  Aaron Morton
  Freelance Cassandra Developer
  @aaronmorton
  http://www.thelastpickle.com
  On 22 Jul 2011, at 10:50, Sameer Farooqui wrote:
 
  Hey Aaron,
 
  I don't have any output.log files in that folder:
 
  ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra
  ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls
  system.log system.log.11  system.log.4  system.log.7
  system.log.1   system.log.2   system.log.5  system.log.8
  system.log.10  system.log.3   system.log.6  system.log.9
 
 
 
  On Thu, Jul 21, 2011 at 3:40 PM, aaron morton
  aa...@thelastpickle.com
  wrote:
 
  Check /var/log/cassandra/output.log (assuming the default init
  scripts)
  A
  -
  Aaron Morton
  Freelance Cassandra Developer
  @aaronmorton
  http://www.thelastpickle.com
  On 22 Jul 2011, at 10:13, Sameer Farooqui wrote:
 
  Hmm. Just looked at the log more closely.
 
  So, what actually happened is while Repair was running on this
  specific
  node, the Cassandra java process terminated itself automatically. The
  last
  entries in the log are:
 
   INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java
  (line
  128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888
  used; max
  is 

Re: Repair fails with java.io.IOError: java.io.EOFException

2011-07-26 Thread aaron morton
Was guessing something like a token move may have happened in the past. 

Good suggestion to also kick off a major compaction. I've seen that make a big 
difference even for apps that do not do deletes, but do do overwrites. 

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 26 Jul 2011, at 19:00, Sylvain Lebresne wrote:

 If they are and repair has completed use node tool cleanup to remove the
 data the node is no longer responsible. See bootstrap section above.
 
 I've seen that said a few times so allow me to correct. Cleanup is useless 
 after
 a repair. 'nodetool cleanup' removes rows the node is not responsible anymore
 and is thus useful only after operations that change the range a node is
 responsible for (bootstrap, move, decommission). After a repair, you will need
 compaction to kick in to see you disk usage come back to normal.
 
 --
 Sylvain
 
 Hope that helps.
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 On 26 Jul 2011, at 12:44, Sameer Farooqui wrote:
 
 Looks like the repair finished successfully the second time. However, the
 cluster is still severely unbalanced. I was hoping the repair would balance
 the nodes. We're using random partitioner. One node has 900GB and others
 have 128GB, 191GB, 129GB, 257 GB, etc. The 900GB and the 646GB are just
 insanely high. Not sure why or how to troubleshoot.
 
 
 
 On Fri, Jul 22, 2011 at 1:28 PM, Sameer Farooqui cassandral...@gmail.com
 wrote:
 
 I don't see a JVM crashlog ( hs_err_pid[pid].log) in
 ~/brisk/resources/cassandra/bin or /tmp. So maybe JVM didn't crash?
 
 We're running a pretty up to date with Sun Java:
 
 ubuntu@ip-10-2-x-x:/tmp$ java -version
 java version 1.6.0_24
 Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
 Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
 
 I'm gonna restart the Repair process in a few more hours. If there are any
 additional debug or troubleshooting logs you'd like me to enable first,
 please let me know.
 
 - Sameer
 
 
 
 On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis jbel...@gmail.com wrote:
 
 Did you check for a JVM crash log?
 
 You should make sure you're running the latest Sun JVM, older versions
 and OpenJDK in particular are prone to segfaulting.
 
 On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui
 cassandral...@gmail.com wrote:
 We are starting Cassandra with brisk cassandra, so as a stand-alone
 process, not a service.
 
 The syslog on the node doesn't show anything regarding the Cassandra
 Java
 process around the time the last entries were made in the Cassandra
 system.log (2011-07-21 13:01:51):
 
 Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v
 debian-sa1  /dev/null  debian-sa1 1 1)
 Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v
 debian-sa1  /dev/null  debian-sa1 1 1)
 Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v
 debian-sa1  /dev/null  debian-sa1 1 1)
 Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source =
 /proc/kmsg started.
 Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software=rsyslogd
 swVersion=4.2.0 x-pid=663 x-info=http://www.rsyslog.com;]
 (re)start
 
 
 The last thing in the Cassandra log before INFO Logging initialized is:
 
  INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line
 128)
 GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max
 is
 4030726144
 
 
 I can start Repair again, but am worried that it will crash Cassandra
 again,
 so I want to turn on any debugging or helpful logs to diagnose the
 crash if
 it happens again.
 
 
 - Sameer
 
 
 On Thu, Jul 21, 2011 at 4:30 PM, aaron morton aa...@thelastpickle.com
 wrote:
 
 The default init.d script will direct std out/err to that file, how
 are
 you starting brisk / cassandra ?
 Check the syslog and other logs in /var/log to see if the OS killed
 cassandra.
 Also, what was the last thing in the casandra log before INFO [main]
 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging
 initialised ?
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 On 22 Jul 2011, at 10:50, Sameer Farooqui wrote:
 
 Hey Aaron,
 
 I don't have any output.log files in that folder:
 
 ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra
 ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls
 system.log system.log.11  system.log.4  system.log.7
 system.log.1   system.log.2   system.log.5  system.log.8
 system.log.10  system.log.3   system.log.6  system.log.9
 
 
 
 On Thu, Jul 21, 2011 at 3:40 PM, aaron morton
 aa...@thelastpickle.com
 wrote:
 
 Check /var/log/cassandra/output.log (assuming the default init
 scripts)
 A
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 On 22 Jul 2011, at 10:13, Sameer Farooqui wrote:
 
 Hmm. Just looked at the log more 

Re: Repair fails with java.io.IOError: java.io.EOFException

2011-07-26 Thread Sameer Farooqui
Thanks for the info guys.

I'm running compaction on the two very highly loaded nodes now in hopes of
the data volume going down. But I'm skeptical because I don't see how it got
so unbalanced in the first place (all nodes were up while the writes were
being injected).

I should have an update tomorrow on whether compaction rebalanced the nodes.

The tokens are evenly distributed across the ring:

Address DC Rack Status State Load Owns Token
148873535527910577765226390751398592512
10.192.143.x DC1 RAC1 Up Normal 643.42 GB 12.50% 0
10.192.171.x DC1 RAC1 Up Normal 128.96 GB 6.25%
21267647932558653966460912964485513216
10.210.95.x DC1 RAC1 Up Normal 128.34 GB 12.50%
42535295865117307932921825928971026432
10.211.19.x DC1 RAC1 Up Normal 128.55 GB 6.25%
63802943797675961899382738893456539648
10.68.58.x DC1 RAC2 Up Normal 643.05 GB 12.50%
85070591730234615865843651857942052864
10.110.31.x DC1 RAC2 Up Normal 128.84 GB 6.25%
106338239662793269832304564822427566080
10.96.58.x DC1 RAC2 Up Normal 128.11 GB 12.50%
127605887595351923798765477786913079296
10.210.195.x DC1 RAC2 Up Normal 129.33 GB 6.25%
148873535527910577765226390751398592512
10.114.138.x DC2 RAC1 Up Normal 258.04 GB 6.25%
10633823966279326983230456482242756608
10.203.79.x DC2 RAC1 Up Normal 257.14 GB 6.25%
53169119831396634916152282411213783040
10.242.209.x DC2 RAC1 Up Normal 256.58 GB 6.25%
95704415696513942849074108340184809472
10.38.25.x DC2 RAC1 Up Normal 257.08 GB 6.25%
138239711561631250781995934269155835904

On Tue, Jul 26, 2011 at 1:59 AM, aaron morton aa...@thelastpickle.comwrote:

 Was guessing something like a token move may have happened in the past.

 Good suggestion to also kick off a major compaction. I've seen that make a
 big difference even for apps that do not do deletes, but do do overwrites.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 26 Jul 2011, at 19:00, Sylvain Lebresne wrote:

  If they are and repair has completed use node tool cleanup to remove the
  data the node is no longer responsible. See bootstrap section above.
 
  I've seen that said a few times so allow me to correct. Cleanup is
 useless after
  a repair. 'nodetool cleanup' removes rows the node is not responsible
 anymore
  and is thus useful only after operations that change the range a node is
  responsible for (bootstrap, move, decommission). After a repair, you will
 need
  compaction to kick in to see you disk usage come back to normal.
 
  --
  Sylvain
 
  Hope that helps.
 
  -
  Aaron Morton
  Freelance Cassandra Developer
  @aaronmorton
  http://www.thelastpickle.com
  On 26 Jul 2011, at 12:44, Sameer Farooqui wrote:
 
  Looks like the repair finished successfully the second time. However,
 the
  cluster is still severely unbalanced. I was hoping the repair would
 balance
  the nodes. We're using random partitioner. One node has 900GB and others
  have 128GB, 191GB, 129GB, 257 GB, etc. The 900GB and the 646GB are just
  insanely high. Not sure why or how to troubleshoot.
 
 
 
  On Fri, Jul 22, 2011 at 1:28 PM, Sameer Farooqui 
 cassandral...@gmail.com
  wrote:
 
  I don't see a JVM crashlog ( hs_err_pid[pid].log) in
  ~/brisk/resources/cassandra/bin or /tmp. So maybe JVM didn't crash?
 
  We're running a pretty up to date with Sun Java:
 
  ubuntu@ip-10-2-x-x:/tmp$ java -version
  java version 1.6.0_24
  Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
  Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
 
  I'm gonna restart the Repair process in a few more hours. If there are
 any
  additional debug or troubleshooting logs you'd like me to enable first,
  please let me know.
 
  - Sameer
 
 
 
  On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis jbel...@gmail.com
 wrote:
 
  Did you check for a JVM crash log?
 
  You should make sure you're running the latest Sun JVM, older versions
  and OpenJDK in particular are prone to segfaulting.
 
  On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui
  cassandral...@gmail.com wrote:
  We are starting Cassandra with brisk cassandra, so as a stand-alone
  process, not a service.
 
  The syslog on the node doesn't show anything regarding the Cassandra
  Java
  process around the time the last entries were made in the Cassandra
  system.log (2011-07-21 13:01:51):
 
  Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v
  debian-sa1  /dev/null  debian-sa1 1 1)
  Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v
  debian-sa1  /dev/null  debian-sa1 1 1)
  Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v
  debian-sa1  /dev/null  debian-sa1 1 1)
  Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source =
  /proc/kmsg started.
  Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software=rsyslogd
  swVersion=4.2.0 x-pid=663 x-info=http://www.rsyslog.com;]
  (re)start
 
 
  The last thing in the Cassandra log before INFO Logging initialized
 is:
 
   INFO 

Re: Repair fails with java.io.IOError: java.io.EOFException

2011-07-25 Thread Sameer Farooqui
Looks like the repair finished successfully the second time. However, the
cluster is still severely unbalanced. I was hoping the repair would balance
the nodes. We're using random partitioner. One node has 900GB and others
have 128GB, 191GB, 129GB, 257 GB, etc. The 900GB and the 646GB are just
insanely high. Not sure why or how to troubleshoot.



On Fri, Jul 22, 2011 at 1:28 PM, Sameer Farooqui cassandral...@gmail.comwrote:

 I don't see a JVM crashlog ( hs_err_pid[pid].log) in
 ~/brisk/resources/cassandra/bin or /tmp. So maybe JVM didn't crash?

 We're running a pretty up to date with Sun Java:

 ubuntu@ip-10-2-x-x:/tmp$ java -version
 java version 1.6.0_24
 Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
 Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

 I'm gonna restart the Repair process in a few more hours. If there are any
 additional debug or troubleshooting logs you'd like me to enable first,
 please let me know.

 - Sameer




 On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis jbel...@gmail.com wrote:

 Did you check for a JVM crash log?

 You should make sure you're running the latest Sun JVM, older versions
 and OpenJDK in particular are prone to segfaulting.

 On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui
 cassandral...@gmail.com wrote:
  We are starting Cassandra with brisk cassandra, so as a stand-alone
  process, not a service.
 
  The syslog on the node doesn't show anything regarding the Cassandra
 Java
  process around the time the last entries were made in the Cassandra
  system.log (2011-07-21 13:01:51):
 
  Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v
  debian-sa1  /dev/null  debian-sa1 1 1)
  Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v
  debian-sa1  /dev/null  debian-sa1 1 1)
  Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v
  debian-sa1  /dev/null  debian-sa1 1 1)
  Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source =
  /proc/kmsg started.
  Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software=rsyslogd
  swVersion=4.2.0 x-pid=663 x-info=http://www.rsyslog.com;]
 (re)start
 
 
  The last thing in the Cassandra log before INFO Logging initialized is:
 
   INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line
 128)
  GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max
 is
  4030726144
 
 
  I can start Repair again, but am worried that it will crash Cassandra
 again,
  so I want to turn on any debugging or helpful logs to diagnose the crash
 if
  it happens again.
 
 
  - Sameer
 
 
  On Thu, Jul 21, 2011 at 4:30 PM, aaron morton aa...@thelastpickle.com
  wrote:
 
  The default init.d script will direct std out/err to that file, how are
  you starting brisk / cassandra ?
  Check the syslog and other logs in /var/log to see if the OS killed
  cassandra.
  Also, what was the last thing in the casandra log before INFO [main]
  2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging
  initialised ?
 
  Cheers
 
  -
  Aaron Morton
  Freelance Cassandra Developer
  @aaronmorton
  http://www.thelastpickle.com
  On 22 Jul 2011, at 10:50, Sameer Farooqui wrote:
 
  Hey Aaron,
 
  I don't have any output.log files in that folder:
 
  ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra
  ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls
  system.log system.log.11  system.log.4  system.log.7
  system.log.1   system.log.2   system.log.5  system.log.8
  system.log.10  system.log.3   system.log.6  system.log.9
 
 
 
  On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.com
 
  wrote:
 
  Check /var/log/cassandra/output.log (assuming the default init
 scripts)
  A
  -
  Aaron Morton
  Freelance Cassandra Developer
  @aaronmorton
  http://www.thelastpickle.com
  On 22 Jul 2011, at 10:13, Sameer Farooqui wrote:
 
  Hmm. Just looked at the log more closely.
 
  So, what actually happened is while Repair was running on this
 specific
  node, the Cassandra java process terminated itself automatically. The
 last
  entries in the log are:
 
   INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java
 (line
  128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888
 used; max
  is 4030726144
   INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java
 (line
  128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688
 used; max
  is 4030726144
   INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java
 (line
  128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120
 used; max
  is 4030726144
   INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java
 (line
  128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368
 used; max
  is 4030726144
   INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java
 (line
  128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176
 used; max
  is 4030726144
   INFO [ScheduledTasks:1] 2011-07-21 

Re: Repair fails with java.io.IOError: java.io.EOFException

2011-07-25 Thread aaron morton
Background: http://wiki.apache.org/cassandra/Operations

Use node tool ring to check if the tokens are evenly distributed. If not then 
check the Load Balancing and Moving Nodes sections in the page above.

If they are and repair has completed use node tool cleanup to remove the data 
the node is no longer responsible. See bootstrap section above. 

Hope that helps. 


-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 26 Jul 2011, at 12:44, Sameer Farooqui wrote:

 Looks like the repair finished successfully the second time. However, the 
 cluster is still severely unbalanced. I was hoping the repair would balance 
 the nodes. We're using random partitioner. One node has 900GB and others have 
 128GB, 191GB, 129GB, 257 GB, etc. The 900GB and the 646GB are just insanely 
 high. Not sure why or how to troubleshoot.
 
 
 
 On Fri, Jul 22, 2011 at 1:28 PM, Sameer Farooqui cassandral...@gmail.com 
 wrote:
 I don't see a JVM crashlog ( hs_err_pid[pid].log) in 
 ~/brisk/resources/cassandra/bin or /tmp. So maybe JVM didn't crash?
 
 We're running a pretty up to date with Sun Java:
 
 ubuntu@ip-10-2-x-x:/tmp$ java -version
 java version 1.6.0_24
 Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
 Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
 
 I'm gonna restart the Repair process in a few more hours. If there are any 
 additional debug or troubleshooting logs you'd like me to enable first, 
 please let me know.
 
 - Sameer
 
 
 
 
 On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis jbel...@gmail.com wrote:
 Did you check for a JVM crash log?
 
 You should make sure you're running the latest Sun JVM, older versions
 and OpenJDK in particular are prone to segfaulting.
 
 On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui
 cassandral...@gmail.com wrote:
  We are starting Cassandra with brisk cassandra, so as a stand-alone
  process, not a service.
 
  The syslog on the node doesn't show anything regarding the Cassandra Java
  process around the time the last entries were made in the Cassandra
  system.log (2011-07-21 13:01:51):
 
  Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v
  debian-sa1  /dev/null  debian-sa1 1 1)
  Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v
  debian-sa1  /dev/null  debian-sa1 1 1)
  Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v
  debian-sa1  /dev/null  debian-sa1 1 1)
  Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source =
  /proc/kmsg started.
  Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software=rsyslogd
  swVersion=4.2.0 x-pid=663 x-info=http://www.rsyslog.com;] (re)start
 
 
  The last thing in the Cassandra log before INFO Logging initialized is:
 
   INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128)
  GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is
  4030726144
 
 
  I can start Repair again, but am worried that it will crash Cassandra again,
  so I want to turn on any debugging or helpful logs to diagnose the crash if
  it happens again.
 
 
  - Sameer
 
 
  On Thu, Jul 21, 2011 at 4:30 PM, aaron morton aa...@thelastpickle.com
  wrote:
 
  The default init.d script will direct std out/err to that file, how are
  you starting brisk / cassandra ?
  Check the syslog and other logs in /var/log to see if the OS killed
  cassandra.
  Also, what was the last thing in the casandra log before INFO [main]
  2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging
  initialised ?
 
  Cheers
 
  -
  Aaron Morton
  Freelance Cassandra Developer
  @aaronmorton
  http://www.thelastpickle.com
  On 22 Jul 2011, at 10:50, Sameer Farooqui wrote:
 
  Hey Aaron,
 
  I don't have any output.log files in that folder:
 
  ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra
  ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls
  system.log system.log.11  system.log.4  system.log.7
  system.log.1   system.log.2   system.log.5  system.log.8
  system.log.10  system.log.3   system.log.6  system.log.9
 
 
 
  On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.com
  wrote:
 
  Check /var/log/cassandra/output.log (assuming the default init scripts)
  A
  -
  Aaron Morton
  Freelance Cassandra Developer
  @aaronmorton
  http://www.thelastpickle.com
  On 22 Jul 2011, at 10:13, Sameer Farooqui wrote:
 
  Hmm. Just looked at the log more closely.
 
  So, what actually happened is while Repair was running on this specific
  node, the Cassandra java process terminated itself automatically. The last
  entries in the log are:
 
   INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line
  128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; 
  max
  is 4030726144
   INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line
  128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; 
  max
  is 4030726144
   

Re: Repair fails with java.io.IOError: java.io.EOFException

2011-07-22 Thread Sameer Farooqui
I don't see a JVM crashlog ( hs_err_pid[pid].log) in
~/brisk/resources/cassandra/bin or /tmp. So maybe JVM didn't crash?

We're running a pretty up to date with Sun Java:

ubuntu@ip-10-2-x-x:/tmp$ java -version
java version 1.6.0_24
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

I'm gonna restart the Repair process in a few more hours. If there are any
additional debug or troubleshooting logs you'd like me to enable first,
please let me know.

- Sameer



On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis jbel...@gmail.com wrote:

 Did you check for a JVM crash log?

 You should make sure you're running the latest Sun JVM, older versions
 and OpenJDK in particular are prone to segfaulting.

 On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui
 cassandral...@gmail.com wrote:
  We are starting Cassandra with brisk cassandra, so as a stand-alone
  process, not a service.
 
  The syslog on the node doesn't show anything regarding the Cassandra Java
  process around the time the last entries were made in the Cassandra
  system.log (2011-07-21 13:01:51):
 
  Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v
  debian-sa1  /dev/null  debian-sa1 1 1)
  Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v
  debian-sa1  /dev/null  debian-sa1 1 1)
  Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v
  debian-sa1  /dev/null  debian-sa1 1 1)
  Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source =
  /proc/kmsg started.
  Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software=rsyslogd
  swVersion=4.2.0 x-pid=663 x-info=http://www.rsyslog.com;] (re)start
 
 
  The last thing in the Cassandra log before INFO Logging initialized is:
 
   INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line
 128)
  GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max
 is
  4030726144
 
 
  I can start Repair again, but am worried that it will crash Cassandra
 again,
  so I want to turn on any debugging or helpful logs to diagnose the crash
 if
  it happens again.
 
 
  - Sameer
 
 
  On Thu, Jul 21, 2011 at 4:30 PM, aaron morton aa...@thelastpickle.com
  wrote:
 
  The default init.d script will direct std out/err to that file, how are
  you starting brisk / cassandra ?
  Check the syslog and other logs in /var/log to see if the OS killed
  cassandra.
  Also, what was the last thing in the casandra log before INFO [main]
  2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging
  initialised ?
 
  Cheers
 
  -
  Aaron Morton
  Freelance Cassandra Developer
  @aaronmorton
  http://www.thelastpickle.com
  On 22 Jul 2011, at 10:50, Sameer Farooqui wrote:
 
  Hey Aaron,
 
  I don't have any output.log files in that folder:
 
  ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra
  ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls
  system.log system.log.11  system.log.4  system.log.7
  system.log.1   system.log.2   system.log.5  system.log.8
  system.log.10  system.log.3   system.log.6  system.log.9
 
 
 
  On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.com
  wrote:
 
  Check /var/log/cassandra/output.log (assuming the default init scripts)
  A
  -
  Aaron Morton
  Freelance Cassandra Developer
  @aaronmorton
  http://www.thelastpickle.com
  On 22 Jul 2011, at 10:13, Sameer Farooqui wrote:
 
  Hmm. Just looked at the log more closely.
 
  So, what actually happened is while Repair was running on this specific
  node, the Cassandra java process terminated itself automatically. The
 last
  entries in the log are:
 
   INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line
  128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888
 used; max
  is 4030726144
   INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line
  128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688
 used; max
  is 4030726144
   INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line
  128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120
 used; max
  is 4030726144
   INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line
  128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368
 used; max
  is 4030726144
   INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line
  128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176
 used; max
  is 4030726144
   INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line
  128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600
 used; max
  is 4030726144
 
  When we came in this morning, nodetool ring from another node showed
 the
  1st node as down and OpsCenter also reported it as down.
 
  Next we ran sudo netstat -anp | grep 7199 from the 1st node to see
 the
  status of the Cassandra PID and it was not running.
 
  We then started Cassandra:
 
  INFO [main] 

Repair fails with java.io.IOError: java.io.EOFException

2011-07-21 Thread Sameer Farooqui
While running Repair on a 0.8.1 node, we got this error in the system.log:

ERROR [Thread-23] 2011-07-21 15:48:43,868 AbstractCassandraDaemon.java (line
113) Fatal exception in thread Thread[Thread-23,5,main]
java.io.IOError: java.io.EOFException
at
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66)

There's just a bunch of informational messages about Gossip before this.

Looks like the file or stream unexpectedly ended?
http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html

Is this a bug or something wrong in our environment?


- Sameer


Re: Repair fails with java.io.IOError: java.io.EOFException

2011-07-21 Thread Jonathan Ellis
Looks harmless to me.

On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui
cassandral...@gmail.com wrote:
 While running Repair on a 0.8.1 node, we got this error in the system.log:

 ERROR [Thread-23] 2011-07-21 15:48:43,868 AbstractCassandraDaemon.java (line
 113) Fatal exception in thread Thread[Thread-23,5,main]
 java.io.IOError: java.io.EOFException
 at
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78)
 Caused by: java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:375)
 at
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66)

 There's just a bunch of informational messages about Gossip before this.

 Looks like the file or stream unexpectedly ended?
 http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html

 Is this a bug or something wrong in our environment?


 - Sameer




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Repair fails with java.io.IOError: java.io.EOFException

2011-07-21 Thread Sameer Farooqui
Hmm. Just looked at the log more closely.

So, what actually happened is while Repair was running on this specific
node, the Cassandra java process terminated itself automatically. The last
entries in the log are:

 INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128)
GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is
4030726144
 INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line 128)
GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max is
4030726144
 INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line 128)
GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120 used; max is
4030726144
 INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line 128)
GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max is
4030726144
 INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line 128)
GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max is
4030726144
 INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128)
GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is
4030726144

When we came in this morning, nodetool ring from another node showed the 1st
node as down and OpsCenter also reported it as down.

Next we ran sudo netstat -anp | grep 7199 from the 1st node to see the
status of the Cassandra PID and it was not running.

We then started Cassandra:

INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78)
Logging initialized
 INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java (line 96)
Heap size: 3894411264/3894411264
 INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA mlockall
successful
 INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line 121)
Loading settings from
file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml


It was during this start process that the java.io.EOFException was seen, but
yes, like you said Jonathan, the Cassandra process started back up and
joined the ring.

We're now wondering why the Repair failed and why Cassandra crashed in the
first place. We only had default level logging enabled. Is there something
else I can check or that you suspect?

Should we turn the logging up to debug and retry the Repair?


- Sameer


On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis jbel...@gmail.com wrote:

 Looks harmless to me.

 On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui
 cassandral...@gmail.com wrote:
  While running Repair on a 0.8.1 node, we got this error in the
 system.log:
 
  ERROR [Thread-23] 2011-07-21 15:48:43,868 AbstractCassandraDaemon.java
 (line
  113) Fatal exception in thread Thread[Thread-23,5,main]
  java.io.IOError: java.io.EOFException
  at
 
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78)
  Caused by: java.io.EOFException
  at java.io.DataInputStream.readInt(DataInputStream.java:375)
  at
 
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66)
 
  There's just a bunch of informational messages about Gossip before this.
 
  Looks like the file or stream unexpectedly ended?
 
 http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html
 
  Is this a bug or something wrong in our environment?
 
 
  - Sameer
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



Re: Repair fails with java.io.IOError: java.io.EOFException

2011-07-21 Thread aaron morton
Check /var/log/cassandra/output.log (assuming the default init scripts)

A
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 22 Jul 2011, at 10:13, Sameer Farooqui wrote:

 Hmm. Just looked at the log more closely.
 
 So, what actually happened is while Repair was running on this specific node, 
 the Cassandra java process terminated itself automatically. The last entries 
 in the log are:
 
  INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128) 
 GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is 
 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line 128) 
 GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max is 
 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line 128) 
 GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120 used; max is 
 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line 128) 
 GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max is 
 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line 128) 
 GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max is 
 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) 
 GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 
 4030726144
  
 When we came in this morning, nodetool ring from another node showed the 1st 
 node as down and OpsCenter also reported it as down.
 
 Next we ran sudo netstat -anp | grep 7199 from the 1st node to see the 
 status of the Cassandra PID and it was not running.
 
 We then started Cassandra:
 
 INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) 
 Logging initialized
  INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java (line 96) 
 Heap size: 3894411264/3894411264
  INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA mlockall 
 successful
  INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line 121) 
 Loading settings from 
 file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml
 
 
 It was during this start process that the java.io.EOFException was seen, but 
 yes, like you said Jonathan, the Cassandra process started back up and joined 
 the ring. 
 
 We're now wondering why the Repair failed and why Cassandra crashed in the 
 first place. We only had default level logging enabled. Is there something 
 else I can check or that you suspect?
 
 Should we turn the logging up to debug and retry the Repair?
 
 
 - Sameer
 
 
 On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis jbel...@gmail.com wrote:
 Looks harmless to me.
 
 On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui
 cassandral...@gmail.com wrote:
  While running Repair on a 0.8.1 node, we got this error in the system.log:
 
  ERROR [Thread-23] 2011-07-21 15:48:43,868 AbstractCassandraDaemon.java (line
  113) Fatal exception in thread Thread[Thread-23,5,main]
  java.io.IOError: java.io.EOFException
  at
  org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78)
  Caused by: java.io.EOFException
  at java.io.DataInputStream.readInt(DataInputStream.java:375)
  at
  org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66)
 
  There's just a bunch of informational messages about Gossip before this.
 
  Looks like the file or stream unexpectedly ended?
  http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html
 
  Is this a bug or something wrong in our environment?
 
 
  - Sameer
 
 
 
 
 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com
 



Re: Repair fails with java.io.IOError: java.io.EOFException

2011-07-21 Thread Sameer Farooqui
Hey Aaron,

I don't have any output.log files in that folder:

ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra
ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls
system.log system.log.11  system.log.4  system.log.7
system.log.1   system.log.2   system.log.5  system.log.8
system.log.10  system.log.3   system.log.6  system.log.9



On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.comwrote:

 Check /var/log/cassandra/output.log (assuming the default init scripts)

 A
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 22 Jul 2011, at 10:13, Sameer Farooqui wrote:

 Hmm. Just looked at the log more closely.

 So, what actually happened is while Repair was running on this specific
 node, the Cassandra java process terminated itself automatically. The last
 entries in the log are:

  INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line
 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max
 is 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line
 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max
 is 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line
 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120 used; max
 is 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line
 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max
 is 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line
 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max
 is 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line
 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max
 is 4030726144

 When we came in this morning, nodetool ring from another node showed the
 1st node as down and OpsCenter also reported it as down.

 Next we ran sudo netstat -anp | grep 7199 from the 1st node to see the
 status of the Cassandra PID and it was not running.

 We then started Cassandra:

 INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78)
 Logging initialized
  INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java (line 96)
 Heap size: 3894411264/3894411264
  INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA mlockall
 successful
  INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line 121)
 Loading settings from
 file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml


 It was during this start process that the java.io.EOFException was seen,
 but yes, like you said Jonathan, the Cassandra process started back up and
 joined the ring.

 We're now wondering why the Repair failed and why Cassandra crashed in the
 first place. We only had default level logging enabled. Is there something
 else I can check or that you suspect?

 Should we turn the logging up to debug and retry the Repair?


 - Sameer


 On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis jbel...@gmail.comwrote:

 Looks harmless to me.

 On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui
 cassandral...@gmail.com wrote:
  While running Repair on a 0.8.1 node, we got this error in the
 system.log:
 
  ERROR [Thread-23] 2011-07-21 15:48:43,868 AbstractCassandraDaemon.java
 (line
  113) Fatal exception in thread Thread[Thread-23,5,main]
  java.io.IOError: java.io.EOFException
  at
 
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78)
  Caused by: java.io.EOFException
  at java.io.DataInputStream.readInt(DataInputStream.java:375)
  at
 
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66)
 
  There's just a bunch of informational messages about Gossip before this.
 
  Looks like the file or stream unexpectedly ended?
 
 http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html
 
  Is this a bug or something wrong in our environment?
 
 
  - Sameer
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com






Re: Repair fails with java.io.IOError: java.io.EOFException

2011-07-21 Thread aaron morton
The default init.d script will direct std out/err to that file, how are you 
starting brisk / cassandra ?

Check the syslog and other logs in /var/log to see if the OS killed cassandra. 

Also, what was the last thing in the casandra log before INFO [main] 2011-07-21 
15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialised ?


Cheers
 
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 22 Jul 2011, at 10:50, Sameer Farooqui wrote:

 Hey Aaron,
 
 I don't have any output.log files in that folder:
 
 ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra
 ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls
 system.log system.log.11  system.log.4  system.log.7
 system.log.1   system.log.2   system.log.5  system.log.8
 system.log.10  system.log.3   system.log.6  system.log.9
 
 
 
 On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.com wrote:
 Check /var/log/cassandra/output.log (assuming the default init scripts)
 
 A
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 22 Jul 2011, at 10:13, Sameer Farooqui wrote:
 
 Hmm. Just looked at the log more closely.
 
 So, what actually happened is while Repair was running on this specific 
 node, the Cassandra java process terminated itself automatically. The last 
 entries in the log are:
 
  INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128) 
 GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is 
 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line 128) 
 GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max is 
 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line 128) 
 GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120 used; max is 
 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line 128) 
 GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max is 
 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line 128) 
 GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max is 
 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) 
 GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 
 4030726144
  
 When we came in this morning, nodetool ring from another node showed the 1st 
 node as down and OpsCenter also reported it as down.
 
 Next we ran sudo netstat -anp | grep 7199 from the 1st node to see the 
 status of the Cassandra PID and it was not running.
 
 We then started Cassandra:
 
 INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) 
 Logging initialized
  INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java (line 96) 
 Heap size: 3894411264/3894411264
  INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA mlockall 
 successful
  INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line 121) 
 Loading settings from 
 file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml
 
 
 It was during this start process that the java.io.EOFException was seen, but 
 yes, like you said Jonathan, the Cassandra process started back up and 
 joined the ring. 
 
 We're now wondering why the Repair failed and why Cassandra crashed in the 
 first place. We only had default level logging enabled. Is there something 
 else I can check or that you suspect?
 
 Should we turn the logging up to debug and retry the Repair?
 
 
 - Sameer
 
 
 On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis jbel...@gmail.com wrote:
 Looks harmless to me.
 
 On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui
 cassandral...@gmail.com wrote:
  While running Repair on a 0.8.1 node, we got this error in the system.log:
 
  ERROR [Thread-23] 2011-07-21 15:48:43,868 AbstractCassandraDaemon.java 
  (line
  113) Fatal exception in thread Thread[Thread-23,5,main]
  java.io.IOError: java.io.EOFException
  at
  org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78)
  Caused by: java.io.EOFException
  at java.io.DataInputStream.readInt(DataInputStream.java:375)
  at
  org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66)
 
  There's just a bunch of informational messages about Gossip before this.
 
  Looks like the file or stream unexpectedly ended?
  http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html
 
  Is this a bug or something wrong in our environment?
 
 
  - Sameer
 
 
 
 
 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com
 
 
 



Re: Repair fails with java.io.IOError: java.io.EOFException

2011-07-21 Thread Sameer Farooqui
We are starting Cassandra with brisk cassandra, so as a stand-alone
process, not a service.

The syslog on the node doesn't show anything regarding the Cassandra Java
process around the time the last entries were made in the Cassandra
system.log (2011-07-21 13:01:51):

Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v
debian-sa1  /dev/null  debian-sa1 1 1)
Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v
debian-sa1  /dev/null  debian-sa1 1 1)
Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v
debian-sa1  /dev/null  debian-sa1 1 1)
Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source =
/proc/kmsg started.
Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software=rsyslogd
swVersion=4.2.0 x-pid=663 x-info=http://www.rsyslog.com;] (re)start


The last thing in the Cassandra log before INFO Logging initialized is:

 INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128)
GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is
4030726144


I can start Repair again, but am worried that it will crash Cassandra again,
so I want to turn on any debugging or helpful logs to diagnose the crash if
it happens again.


- Sameer


On Thu, Jul 21, 2011 at 4:30 PM, aaron morton aa...@thelastpickle.comwrote:

 The default init.d script will direct std out/err to that file, how are you
 starting brisk / cassandra ?

 Check the syslog and other logs in /var/log to see if the OS killed
 cassandra.

 Also, what was the last thing in the casandra log before INFO [main]
 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging
 initialised ?


 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 22 Jul 2011, at 10:50, Sameer Farooqui wrote:

 Hey Aaron,

 I don't have any output.log files in that folder:

 ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra
 ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls
 system.log system.log.11  system.log.4  system.log.7
 system.log.1   system.log.2   system.log.5  system.log.8
 system.log.10  system.log.3   system.log.6  system.log.9



 On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.comwrote:

 Check /var/log/cassandra/output.log (assuming the default init scripts)

 A
  -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 22 Jul 2011, at 10:13, Sameer Farooqui wrote:

 Hmm. Just looked at the log more closely.

 So, what actually happened is while Repair was running on this specific
 node, the Cassandra java process terminated itself automatically. The last
 entries in the log are:

  INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line
 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max
 is 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line
 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max
 is 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line
 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120 used; max
 is 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line
 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max
 is 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line
 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max
 is 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line
 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max
 is 4030726144

 When we came in this morning, nodetool ring from another node showed the
 1st node as down and OpsCenter also reported it as down.

 Next we ran sudo netstat -anp | grep 7199 from the 1st node to see the
 status of the Cassandra PID and it was not running.

 We then started Cassandra:

 INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78)
 Logging initialized
  INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java (line
 96) Heap size: 3894411264/3894411264
  INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA mlockall
 successful
  INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line 121)
 Loading settings from
 file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml


 It was during this start process that the java.io.EOFException was seen,
 but yes, like you said Jonathan, the Cassandra process started back up and
 joined the ring.

 We're now wondering why the Repair failed and why Cassandra crashed in the
 first place. We only had default level logging enabled. Is there something
 else I can check or that you suspect?

 Should we turn the logging up to debug and retry the Repair?


 - Sameer


 On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis jbel...@gmail.comwrote:

 Looks harmless to me.

 On Thu, Jul 21, 

Re: Repair fails with java.io.IOError: java.io.EOFException

2011-07-21 Thread Jonathan Ellis
Did you check for a JVM crash log?

You should make sure you're running the latest Sun JVM, older versions
and OpenJDK in particular are prone to segfaulting.

On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui
cassandral...@gmail.com wrote:
 We are starting Cassandra with brisk cassandra, so as a stand-alone
 process, not a service.

 The syslog on the node doesn't show anything regarding the Cassandra Java
 process around the time the last entries were made in the Cassandra
 system.log (2011-07-21 13:01:51):

 Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v
 debian-sa1  /dev/null  debian-sa1 1 1)
 Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v
 debian-sa1  /dev/null  debian-sa1 1 1)
 Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v
 debian-sa1  /dev/null  debian-sa1 1 1)
 Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source =
 /proc/kmsg started.
 Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software=rsyslogd
 swVersion=4.2.0 x-pid=663 x-info=http://www.rsyslog.com;] (re)start


 The last thing in the Cassandra log before INFO Logging initialized is:

  INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128)
 GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is
 4030726144


 I can start Repair again, but am worried that it will crash Cassandra again,
 so I want to turn on any debugging or helpful logs to diagnose the crash if
 it happens again.


 - Sameer


 On Thu, Jul 21, 2011 at 4:30 PM, aaron morton aa...@thelastpickle.com
 wrote:

 The default init.d script will direct std out/err to that file, how are
 you starting brisk / cassandra ?
 Check the syslog and other logs in /var/log to see if the OS killed
 cassandra.
 Also, what was the last thing in the casandra log before INFO [main]
 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging
 initialised ?

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 On 22 Jul 2011, at 10:50, Sameer Farooqui wrote:

 Hey Aaron,

 I don't have any output.log files in that folder:

 ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra
 ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls
 system.log system.log.11  system.log.4  system.log.7
 system.log.1   system.log.2   system.log.5  system.log.8
 system.log.10  system.log.3   system.log.6  system.log.9



 On Thu, Jul 21, 2011 at 3:40 PM, aaron morton aa...@thelastpickle.com
 wrote:

 Check /var/log/cassandra/output.log (assuming the default init scripts)
 A
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 On 22 Jul 2011, at 10:13, Sameer Farooqui wrote:

 Hmm. Just looked at the log more closely.

 So, what actually happened is while Repair was running on this specific
 node, the Cassandra java process terminated itself automatically. The last
 entries in the log are:

  INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line
 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max
 is 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line
 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max
 is 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line
 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 193120 used; max
 is 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line
 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max
 is 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line
 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max
 is 4030726144
  INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line
 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max
 is 4030726144

 When we came in this morning, nodetool ring from another node showed the
 1st node as down and OpsCenter also reported it as down.

 Next we ran sudo netstat -anp | grep 7199 from the 1st node to see the
 status of the Cassandra PID and it was not running.

 We then started Cassandra:

 INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line
 78) Logging initialized
  INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java (line
 96) Heap size: 3894411264/3894411264
  INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA
 mlockall successful
  INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line 121)
 Loading settings from
 file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml


 It was during this start process that the java.io.EOFException was seen,
 but yes, like you said Jonathan, the Cassandra process started back up and
 joined the ring.

 We're now wondering why the Repair failed and why Cassandra crashed in
 the first place. We only had default level logging