Re: Cassandra process exiting mysteriously

2014-08-26 Thread Or Sher
Hi Clint,
I think I have more or less found the reason for my problem. I doubt you have
the exact same problem, but here it is:

We're using Zabbix as our monitoring system, and it uses /usr/bin/at to
schedule its monitoring runs.
Every time the at command adds another scheduled task, it sends a kill
signal to the pid of atd, probably just to check that it's alive, not to
kill it.
Now, looking at the system call audit log, it seems that sometimes,
although the kill syscall targets one pid (the atd one), the signal is
actually delivered to our C* Java process.
I'm really starting to think it's some kind of Linux kernel bug.
BTW, atd was always stopped, so I'm not really sure yet whether it was part
of the problem or not.
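
For reference, the usual way to check that a pid is alive without disturbing it is to send signal 0, which delivers nothing and only performs the existence and permission checks. The sketch below is a minimal illustration in Python, assuming a Linux host with /proc mounted. The function names are hypothetical and this is not how at or atd are implemented. It also shows one way to confirm which program a pid belongs to before signaling it, since pids can be reused after a process exits. Whether pid reuse has anything to do with the behavior described above is not established by this thread.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Probe a pid with signal 0: no signal is delivered, only checks run."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False          # no such process
    except PermissionError:
        return True           # process exists but belongs to another user
    return True


def process_name(pid: int):
    """Return the command name from /proc/<pid>/comm (Linux only), else None."""
    try:
        with open(f"/proc/{pid}/comm") as f:
            return f.read().strip()
    except OSError:
        return None


def safe_to_signal(pid: int, expected_name: str) -> bool:
    """True only if the pid exists and currently belongs to the expected program."""
    return pid_is_alive(pid) and process_name(pid) == expected_name
```

A caller would check something like safe_to_signal(pid_from_pidfile, "atd") before sending a real signal, where pid_from_pidfile is whatever value was read from a pid file.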

HTH,
Or.



-- 
Or Sher


Re: Cassandra process exiting mysteriously

2014-08-13 Thread Or Sher
Will do the same!
Thanks,
Or.


-- 
Or Sher


Re: Cassandra process exiting mysteriously

2014-08-12 Thread Or Sher
Clint, did you find anything?
I just noticed it happens to us too on only one node in our CI cluster.
I don't think there is any special usage before it happens. The last line
in the log before the shutdown lines is at least an hour earlier.
We're using C* 2.0.9.


-- 
Or Sher


Re: Cassandra process exiting mysteriously

2014-08-12 Thread Clint Kelly
Hi Or,

For now I removed the test that was failing like this from our suite
and made a note to revisit it in a couple of weeks.  Unfortunately I
still don't know what the issue is.  I'll post here if I figure it out
(please do the same!).  My working hypothesis now is that we had some
kind of OOM problem.

Best regards,
Clint


Re: Cassandra process exiting mysteriously

2014-08-06 Thread Duncan Sands

Hi Clint,


INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,903 ThriftServer.java (line 141) Stop listening to thrift clients
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,920 Server.java (line 182) Stop listening for CQL clients
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,930 Gossiper.java (line 1279) Announcing shutdown
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:53,930 MessagingService.java (line 683) Waiting for messaging service to quiesce
INFO [ACCEPT-/127.0.0.10] 2014-08-05 19:14:53,931 MessagingService.java (line 923) MessagingService has terminated the accept() thread

Does anyone have any ideas about how to debug this?  Looking around on
google I found some threads suggesting that this could occur from an
OOM error 
(http://stackoverflow.com/questions/23755040/cassandra-exits-with-no-errors).


This doesn't look like an OOM to me.  If the kernel OOM killer kills Cassandra,
then Cassandra instantly vaporizes, and there will be nothing in the Cassandra
logs (though you will find information about the OOM in the system logs, e.g.
in dmesg).  The log snippet above shows an orderly shutdown, which is
completely different from an instant OOM kill.
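
As a rough illustration of the system-log check mentioned above, the following Python sketch scans kernel log text for OOM-killer entries. The regular expression targets the "Out of memory: Kill process ..." and "Out of memory: Killed process ..." wording that Linux kernels commonly print, but the exact wording varies by kernel version, so treat the pattern as an assumption to adjust. The sample text is made up for the example and is not taken from any machine in this thread.

```python
import re

# Matches e.g. "Out of memory: Kill process 4242 (java) ..." and the
# "Killed process" variant printed by newer kernels.
OOM_KILL = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return a list of (pid, command) pairs for OOM kills found in log_text."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_KILL.finditer(log_text)]


# Illustrative sample only, not real output from the thread.
sample = """\
[1234567.890] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[1234567.901] Out of memory: Kill process 4242 (java) score 912 or sacrifice child
[1234567.902] Killed process 4242 (java) total-vm:8123456kB, anon-rss:6012345kB
"""

print(find_oom_kills(sample))  # [(4242, 'java')]
```

In practice the text would come from dmesg or a syslog file. An empty result around the time of the shutdown would be consistent with the point above that the kernel OOM killer was not involved.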


Ciao, Duncan.


Re: Cassandra process exiting mysteriously

2014-08-06 Thread Clint Kelly
Hi Duncan,

Thanks for your help.

I am at a loss as to what is causing this process to stop then.  I
would not expect the Cassandra process to finish until my code calls
Process#destroy, but it seems to non-deterministically stop much
earlier sometimes.

FWIW I have seen failures on another machine this morning which also
look orderly.  These nodes never even get to the point where they
announce they are listening for CQL clients.

If anyone has any ideas on what to look for, I would really appreciate
it.  I will try turning logging up to DEBUG and see if that produces
any useful errors.

Best regards,
Clint





Re: Cassandra process exiting mysteriously

2014-08-06 Thread Robert Coli
On Wed, Aug 6, 2014 at 1:11 AM, Duncan Sands duncan.sa...@gmail.com wrote:

 this doesn't look like an OOM to me.  If the kernel OOM kills Cassandra
 then Cassandra instantly vaporizes, and there will be nothing in the
 Cassandra logs (you will find information about the OOM in the system logs
 though, eg in dmesg).  In the log snippet above you see an orderly
 shutdown, this is completely different to the instant OOM kill.


Not really.

https://issues.apache.org/jira/browse/CASSANDRA-7507

=Rob


Re: Cassandra process exiting mysteriously

2014-08-06 Thread Robert Coli
To be clear, there are two different OOMs here. I am talking about the JVM
OOM, not the system-level one. As CASSANDRA-7507 indicates, a JVM OOM does not
necessarily result in the Cassandra process dying, and can in fact trigger a
clean shutdown.

A system-level OOM will in fact send the equivalent of KILL, which will not
trigger the clean shutdown hook in Cassandra.

=Rob


Re: Cassandra process exiting mysteriously

2014-08-06 Thread Clint Kelly
Hi Rob,

Thanks for the clarification; this is really useful.  I'll run some
experiments to see if the problem is a JVM OOM on our build machine.

Best regards,
Clint


Cassandra process exiting mysteriously

2014-08-05 Thread Clint Kelly
Hi everyone,

For some integration tests, we start up a CassandraDaemon in a
separate process (using the Java 7 ProcessBuilder API).  All of my
integration tests run beautifully on my laptop, but one of them fails
on our Jenkins cluster.
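
When a test harness launches the server as a child process like this, recording how the child ended can help narrow down the cause. The sketch below is in Python rather than Java's ProcessBuilder, and the helper names describe_exit and run_and_report are hypothetical. It relies on Python's subprocess convention that a negative return code -N means the child was terminated by signal N. Reading an exit code above 128 as "128 + signal number" is only a common convention (shells use it, and a JVM that shuts down after SIGTERM typically exits with 143), so that branch is a hint, not proof.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a child's return code into a human-readable description."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        signum = -returncode            # subprocess: killed by this signal
    elif returncode > 128:
        signum = returncode - 128       # convention only, see note above
    else:
        return f"exited with error code {returncode}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = f"signal {signum}"
    if returncode < 0:
        return f"killed by {name}"
    return f"exit code {returncode}, possibly after handling {name}"


def run_and_report(cmd):
    """Run cmd to completion and describe how it ended."""
    proc = subprocess.Popen(cmd)
    proc.wait()
    return describe_exit(proc.returncode)


if __name__ == "__main__":
    print(run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

Logging this description next to the test failure would show whether the child exited on its own, was killed outright, or shut down in response to a signal from somewhere else.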

The failing integration test does around 10k writes to different rows
and then 10k reads.  After running some number of reads, the job dies
with this error:

com.datastax.driver.core.exceptions.NoHostAvailableException: All
host(s) tried for query failed (tried: /127.0.0.10:58209
(com.datastax.driver.core.exceptions.DriverException: Timeout during
read))

This error appears to have occurred because the Cassandra process has
stopped.  The logs for the Cassandra process show some warnings during
batch writes (the batches are too big), no activity for a few minutes
(I assume this is because all of the read operations were proceeding
smoothly), and then look like the following:

INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,903 ThriftServer.java (line 141) Stop listening to thrift clients
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,920 Server.java (line 182) Stop listening for CQL clients
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,930 Gossiper.java (line 1279) Announcing shutdown
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:53,930 MessagingService.java (line 683) Waiting for messaging service to quiesce
INFO [ACCEPT-/127.0.0.10] 2014-08-05 19:14:53,931 MessagingService.java (line 923) MessagingService has terminated the accept() thread

Does anyone have any ideas about how to debug this?  Looking around on
google I found some threads suggesting that this could occur from an
OOM error 
(http://stackoverflow.com/questions/23755040/cassandra-exits-with-no-errors).
Wouldn't such an error be logged, however?

The test that fails is a test of our MapReduce Hadoop InputFormat and
as such it does some pretty big queries across multiple rows (over a
range of partitioning key tokens).  The default fetch size I believe
is 5000 rows, and the values in the rows I am fetching are just simple
strings, so I would not think the amount of data in a single read
would be too big.

FWIW I don't see any log messages about garbage collection for at
least 3min before the process shuts down (and no GC messages after the
test stops doing writes and starts doing reads).

I'd greatly appreciate any help before my team kills me for breaking
our Jenkins build so consistently!  :)

Best regards,
Clint


Re: Cassandra process exiting mysteriously

2014-08-05 Thread Kevin Burton
If there is an OOM, it will be in the logs.


Re: Cassandra process exiting mysteriously

2014-08-05 Thread Clint Kelly
Hi Kevin,

Thanks for your reply.  That is what I assumed, but some of the posts
I read on Stack Overflow (e.g., the one that I referenced in my mail)
suggested otherwise.  I was just curious if others had experienced OOM
problems that weren't logged or if there were other common culprits.

Best regards,
Clint


