Re: Cassandra process exiting mysteriously
Hi Clint,

I think I have more or less found the reason for my problem. I doubt you have exactly the same problem, but here it is:

We're using Zabbix as our monitoring system, and it uses /usr/bin/at to schedule its monitoring runs. Every time the at command adds another scheduled task, it sends a kill signal to the pid of atd, probably just to check whether it's alive, not to kill it. Now, looking at the system call audit log, it seems that sometimes, although the kill syscall uses one pid (the atd one), it actually sends the kill to our C* Java process. I'm really starting to think it's some kind of Linux kernel bug.

BTW, atd was always stopped, so I'm not really sure yet whether it was part of the problem or not.

HTH,
Or.

On Wed, Aug 13, 2014 at 9:22 AM, Or Sher or.sh...@gmail.com wrote:

Will do the same!

Thanks,
Or.

On Tue, Aug 12, 2014 at 6:47 PM, Clint Kelly clint.ke...@gmail.com wrote:

Hi Or,

For now I removed the test that was failing like this from our suite and made a note to revisit it in a couple of weeks. Unfortunately I still don't know what the issue is. I'll post here if I figure it out (please do the same!). My working hypothesis now is that we had some kind of OOM problem.

Best regards,
Clint

On Tue, Aug 12, 2014 at 12:23 AM, Or Sher or.sh...@gmail.com wrote:

Clint, did you find anything? I just noticed it happens to us too, on only one node in our CI cluster. I don't think there is any special usage before it happens... The last line in the log before the shutdown lines is at least an hour earlier. We're using C* 2.0.9.

On Thu, Aug 7, 2014 at 12:49 AM, Clint Kelly clint.ke...@gmail.com wrote:

Hi Rob,

Thanks for the clarification; this is really useful. I'll run some experiments to see if the problem is a JVM OOM on our build machine.
Best regards,
Clint

On Wed, Aug 6, 2014 at 1:14 PM, Robert Coli rc...@eventbrite.com wrote:

On Wed, Aug 6, 2014 at 1:12 PM, Robert Coli rc...@eventbrite.com wrote:

On Wed, Aug 6, 2014 at 1:11 AM, Duncan Sands duncan.sa...@gmail.com wrote:

This doesn't look like an OOM to me. If the kernel OOM killer kills Cassandra, then Cassandra instantly vaporizes, and there will be nothing in the Cassandra logs (you will find information about the OOM in the system logs though, e.g. in dmesg). In the log snippet above you see an orderly shutdown, which is completely different from the instant OOM kill.

Not really. https://issues.apache.org/jira/browse/CASSANDRA-7507

To be clear, there are two different OOMs here; I am talking about the JVM OOM, not the system-level one. As CASSANDRA-7507 indicates, a JVM OOM does not necessarily result in the Cassandra process dying, and can in fact trigger a clean shutdown. A system-level OOM will in fact send the equivalent of KILL, which will not trigger the clean shutdown hook in Cassandra.

=Rob

-- Or Sher
-- Or Sher
-- Or Sher
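One way to test the hypothesis in this message is to check which process actually owns the pid that at would signal. The sketch below is only an illustration under stated assumptions: it assumes a Linux /proc layout, and the pid-file path and the helper names `read_pid_file` and `pid_belongs_to` are hypothetical, not taken from at or Zabbix. If atd is stopped but an old pid file is still around, the recorded pid may have been reused by another process, which could be one explanation for a signal aimed at atd reaching something else.

```python
from pathlib import Path


def read_pid_file(path):
    """Return the pid stored in a pid file, or None if it is missing or malformed."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def pid_belongs_to(pid, expected_name, proc_root="/proc"):
    """Return True only if <proc_root>/<pid>/comm names the expected program."""
    try:
        comm = Path(proc_root, str(pid), "comm").read_text().strip()
    except OSError:
        return False  # no such process, or no /proc on this system
    return comm == expected_name


if __name__ == "__main__":
    # Hypothetical pid-file location; adjust for your distribution.
    pid = read_pid_file("/var/run/atd.pid")
    if pid is None:
        print("no usable pid file")
    elif pid_belongs_to(pid, "atd"):
        print(f"pid {pid} is atd")
    else:
        print(f"pid {pid} is NOT atd: the pid file looks stale")
```

Running this on the affected node while atd is stopped would show whether the recorded pid currently belongs to some other process, such as the Cassandra JVM.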
Re: Cassandra process exiting mysteriously
Will do the same! Thanks, Or. On Tue, Aug 12, 2014 at 6:47 PM, Clint Kelly clint.ke...@gmail.com wrote: Hi Or, For now I removed the test that was failing like this from our suite and made a note to revisit it in a couple of weeks. Unfortunately I still don't know what the issue is. I'll post here if I figure out it (please do the same!). My working hypothesis now is that we had some kind of OOM problem. Best regards, Clint On Tue, Aug 12, 2014 at 12:23 AM, Or Sher or.sh...@gmail.com wrote: Clint, did you find anything? I just noticed it happens to us too on only one node in our CI cluster. I don't think there is a special usage before it happens... The last line in the log before the shutdown lines in at least an hour before.. We're using C* 2.0.9. On Thu, Aug 7, 2014 at 12:49 AM, Clint Kelly clint.ke...@gmail.com wrote: Hi Rob, Thanks for the clarification; this is really useful. I'll run some experiments to see if the problem is a JVM OOM on our build machine. Best regards, Clint On Wed, Aug 6, 2014 at 1:14 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, Aug 6, 2014 at 1:12 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, Aug 6, 2014 at 1:11 AM, Duncan Sands duncan.sa...@gmail.com wrote: this doesn't look like an OOM to me. If the kernel OOM kills Cassandra then Cassandra instantly vaporizes, and there will be nothing in the Cassandra logs (you will find information about the OOM in the system logs though, eg in dmesg). In the log snippet above you see an orderly shutdown, this is completely different to the instant OOM kill. Not really. https://issues.apache.org/jira/browse/CASSANDRA-7507 To be clear, there's two different OOMs here, I am talking about the JVM OOM, not system level. As CASSANDRA-7507 indicates, JVM OOM does not necessarily result in the cassandra process dying, and can in fact trigger clean shutdown. System level OOM will in fact send the equivalent of KILL, which will not trigger the clean shutdown hook in Cassandra. 
=Rob -- Or Sher -- Or Sher
Re: Cassandra process exiting mysteriously
Clint, did you find anything? I just noticed it happens to us too, on only one node in our CI cluster. I don't think there is any special usage before it happens... The last line in the log before the shutdown lines is at least an hour earlier. We're using C* 2.0.9.

On Thu, Aug 7, 2014 at 12:49 AM, Clint Kelly clint.ke...@gmail.com wrote:

Hi Rob,

Thanks for the clarification; this is really useful. I'll run some experiments to see if the problem is a JVM OOM on our build machine.

Best regards,
Clint

On Wed, Aug 6, 2014 at 1:14 PM, Robert Coli rc...@eventbrite.com wrote:

On Wed, Aug 6, 2014 at 1:12 PM, Robert Coli rc...@eventbrite.com wrote:

On Wed, Aug 6, 2014 at 1:11 AM, Duncan Sands duncan.sa...@gmail.com wrote:

This doesn't look like an OOM to me. If the kernel OOM killer kills Cassandra, then Cassandra instantly vaporizes, and there will be nothing in the Cassandra logs (you will find information about the OOM in the system logs though, e.g. in dmesg). In the log snippet above you see an orderly shutdown, which is completely different from the instant OOM kill.

Not really. https://issues.apache.org/jira/browse/CASSANDRA-7507

To be clear, there are two different OOMs here; I am talking about the JVM OOM, not the system-level one. As CASSANDRA-7507 indicates, a JVM OOM does not necessarily result in the Cassandra process dying, and can in fact trigger a clean shutdown. A system-level OOM will in fact send the equivalent of KILL, which will not trigger the clean shutdown hook in Cassandra.

=Rob

-- Or Sher
Re: Cassandra process exiting mysteriously
Hi Or,

For now I removed the test that was failing like this from our suite and made a note to revisit it in a couple of weeks. Unfortunately I still don't know what the issue is. I'll post here if I figure it out (please do the same!). My working hypothesis now is that we had some kind of OOM problem.

Best regards,
Clint

On Tue, Aug 12, 2014 at 12:23 AM, Or Sher or.sh...@gmail.com wrote:

Clint, did you find anything? I just noticed it happens to us too, on only one node in our CI cluster. I don't think there is any special usage before it happens... The last line in the log before the shutdown lines is at least an hour earlier. We're using C* 2.0.9.

On Thu, Aug 7, 2014 at 12:49 AM, Clint Kelly clint.ke...@gmail.com wrote:

Hi Rob,

Thanks for the clarification; this is really useful. I'll run some experiments to see if the problem is a JVM OOM on our build machine.

Best regards,
Clint

On Wed, Aug 6, 2014 at 1:14 PM, Robert Coli rc...@eventbrite.com wrote:

On Wed, Aug 6, 2014 at 1:12 PM, Robert Coli rc...@eventbrite.com wrote:

On Wed, Aug 6, 2014 at 1:11 AM, Duncan Sands duncan.sa...@gmail.com wrote:

This doesn't look like an OOM to me. If the kernel OOM killer kills Cassandra, then Cassandra instantly vaporizes, and there will be nothing in the Cassandra logs (you will find information about the OOM in the system logs though, e.g. in dmesg). In the log snippet above you see an orderly shutdown, which is completely different from the instant OOM kill.

Not really. https://issues.apache.org/jira/browse/CASSANDRA-7507

To be clear, there are two different OOMs here; I am talking about the JVM OOM, not the system-level one. As CASSANDRA-7507 indicates, a JVM OOM does not necessarily result in the Cassandra process dying, and can in fact trigger a clean shutdown. A system-level OOM will in fact send the equivalent of KILL, which will not trigger the clean shutdown hook in Cassandra.

=Rob

-- Or Sher
Re: Cassandra process exiting mysteriously
Hi Clint,

INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,903 ThriftServer.java (line 141) Stop listening to thrift clients
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,920 Server.java (line 182) Stop listening for CQL clients
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,930 Gossiper.java (line 1279) Announcing shutdown
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:53,930 MessagingService.java (line 683) Waiting for messaging service to quiesce
INFO [ACCEPT-/127.0.0.10] 2014-08-05 19:14:53,931 MessagingService.java (line 923) MessagingService has terminated the accept() thread

Does anyone have any ideas about how to debug this? Looking around on Google I found some threads suggesting that this could occur from an OOM error (http://stackoverflow.com/questions/23755040/cassandra-exits-with-no-errors).

This doesn't look like an OOM to me. If the kernel OOM killer kills Cassandra, then Cassandra instantly vaporizes, and there will be nothing in the Cassandra logs (you will find information about the OOM in the system logs though, e.g. in dmesg). In the log snippet above you see an orderly shutdown, which is completely different from the instant OOM kill.

Ciao,
Duncan.
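Duncan's point suggests a quick check: look in the kernel log for OOM-killer entries around the time Cassandra disappeared. This is a minimal sketch, assuming the kernel log text has already been captured (for example from dmesg) and that OOM kills show up as lines containing "Out of memory: Kill process <pid> (<name>)" or "Killed process <pid> (<name>)". The exact wording varies between kernel versions, `find_oom_kills` is a hypothetical helper, and the sample text is made up for illustration.

```python
import re

# Matches two phrasings commonly seen in kernel logs; wording differs by kernel version.
OOM_LINE = re.compile(r"(?:Out of memory: )?Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log_text):
    """Return a list of (pid, process_name) for every OOM-kill line found."""
    hits = []
    for line in kernel_log_text.splitlines():
        match = OOM_LINE.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


# Illustrative sample only, not taken from the thread.
SAMPLE = """\
[12345.678] Out of memory: Kill process 4242 (java) score 901 or sacrifice child
[12345.679] Killed process 4242 (java) total-vm:8123456kB, anon-rss:4000000kB, file-rss:0kB
[12400.000] eth0: link up
"""

if __name__ == "__main__":
    for pid, name in find_oom_kills(SAMPLE):
        print(f"OOM killer hit pid {pid} ({name})")
```

An empty result for the relevant time window would support the view that the kernel OOM killer was not involved.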
Re: Cassandra process exiting mysteriously
Hi Duncan,

Thanks for your help. I am at a loss as to what is causing this process to stop, then. I would not expect the Cassandra process to finish until my code calls Process#destroy, but it sometimes stops much earlier, non-deterministically.

FWIW I have seen failures on another machine this morning which also look orderly. These nodes never even get to the point where they announce they are listening for CQL clients.

If anyone has any ideas on what to look for, I would really appreciate it. I will try turning logging up to DEBUG and see if that produces any useful errors.

Best regards,
Clint

On Wed, Aug 6, 2014 at 1:11 AM, Duncan Sands duncan.sa...@gmail.com wrote:

Hi Clint,

INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,903 ThriftServer.java (line 141) Stop listening to thrift clients
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,920 Server.java (line 182) Stop listening for CQL clients
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,930 Gossiper.java (line 1279) Announcing shutdown
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:53,930 MessagingService.java (line 683) Waiting for messaging service to quiesce
INFO [ACCEPT-/127.0.0.10] 2014-08-05 19:14:53,931 MessagingService.java (line 923) MessagingService has terminated the accept() thread

Does anyone have any ideas about how to debug this? Looking around on Google I found some threads suggesting that this could occur from an OOM error (http://stackoverflow.com/questions/23755040/cassandra-exits-with-no-errors).

This doesn't look like an OOM to me. If the kernel OOM killer kills Cassandra, then Cassandra instantly vaporizes, and there will be nothing in the Cassandra logs (you will find information about the OOM in the system logs though, e.g. in dmesg). In the log snippet above you see an orderly shutdown, which is completely different from the instant OOM kill.

Ciao,
Duncan.
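When the child is launched from a test harness, its exit status is a cheap first clue about why it stopped. This is a rough sketch, assuming the common Unix convention that a status above 128 means "terminated by signal (status - 128)"; a JVM stopped by SIGTERM typically reports 143. `describe_exit` is a hypothetical helper, and the rule is only a heuristic, since a program may also choose such a code itself. Python's subprocess module reports signals as negative return codes instead, which the helper also handles.

```python
import signal


def describe_exit(status):
    """Turn a child exit status into a human-readable hint (heuristic only)."""
    if status < 0:
        # Python subprocess convention: -N means "killed by signal N".
        signum = -status
    elif status > 128:
        # Shell/JVM convention: 128 + N usually means "killed by signal N".
        signum = status - 128
    else:
        return f"exited normally with code {status}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = f"signal {signum}"
    return f"terminated by {name}"


if __name__ == "__main__":
    for status in (0, 1, 143, 137, -9):
        print(status, "->", describe_exit(status))
```

Logging the value from Process#waitFor or Process#exitValue in the harness and decoding it this way would at least show whether the process was signalled from outside.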
Re: Cassandra process exiting mysteriously
On Wed, Aug 6, 2014 at 1:11 AM, Duncan Sands duncan.sa...@gmail.com wrote:

This doesn't look like an OOM to me. If the kernel OOM killer kills Cassandra, then Cassandra instantly vaporizes, and there will be nothing in the Cassandra logs (you will find information about the OOM in the system logs though, e.g. in dmesg). In the log snippet above you see an orderly shutdown, which is completely different from the instant OOM kill.

Not really. https://issues.apache.org/jira/browse/CASSANDRA-7507

=Rob
Re: Cassandra process exiting mysteriously
On Wed, Aug 6, 2014 at 1:12 PM, Robert Coli rc...@eventbrite.com wrote:

On Wed, Aug 6, 2014 at 1:11 AM, Duncan Sands duncan.sa...@gmail.com wrote:

This doesn't look like an OOM to me. If the kernel OOM killer kills Cassandra, then Cassandra instantly vaporizes, and there will be nothing in the Cassandra logs (you will find information about the OOM in the system logs though, e.g. in dmesg). In the log snippet above you see an orderly shutdown, which is completely different from the instant OOM kill.

Not really. https://issues.apache.org/jira/browse/CASSANDRA-7507

To be clear, there are two different OOMs here; I am talking about the JVM OOM, not the system-level one. As CASSANDRA-7507 indicates, a JVM OOM does not necessarily result in the Cassandra process dying, and can in fact trigger a clean shutdown. A system-level OOM will in fact send the equivalent of KILL, which will not trigger the clean shutdown hook in Cassandra.

=Rob
Re: Cassandra process exiting mysteriously
Hi Rob, Thanks for the clarification; this is really useful. I'll run some experiments to see if the problem is a JVM OOM on our build machine. Best regards, Clint On Wed, Aug 6, 2014 at 1:14 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, Aug 6, 2014 at 1:12 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, Aug 6, 2014 at 1:11 AM, Duncan Sands duncan.sa...@gmail.com wrote: this doesn't look like an OOM to me. If the kernel OOM kills Cassandra then Cassandra instantly vaporizes, and there will be nothing in the Cassandra logs (you will find information about the OOM in the system logs though, eg in dmesg). In the log snippet above you see an orderly shutdown, this is completely different to the instant OOM kill. Not really. https://issues.apache.org/jira/browse/CASSANDRA-7507 To be clear, there's two different OOMs here, I am talking about the JVM OOM, not system level. As CASSANDRA-7507 indicates, JVM OOM does not necessarily result in the cassandra process dying, and can in fact trigger clean shutdown. System level OOM will in fact send the equivalent of KILL, which will not trigger the clean shutdown hook in Cassandra. =Rob
Cassandra process exiting mysteriously
Hi everyone,

For some integration tests, we start up a CassandraDaemon in a separate process (using the Java 7 ProcessBuilder API). All of my integration tests run beautifully on my laptop, but one of them fails on our Jenkins cluster.

The failing integration test does around 10k writes to different rows and then 10k reads. After running some number of reads, the job dies with this error:

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.10:58209 (com.datastax.driver.core.exceptions.DriverException: Timeout during read))

This error appears to have occurred because the Cassandra process has stopped. The logs for the Cassandra process show some warnings during batch writes (the batches are too big), no activity for a few minutes (I assume this is because all of the read operations were proceeding smoothly), and then look like the following:

INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,903 ThriftServer.java (line 141) Stop listening to thrift clients
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,920 Server.java (line 182) Stop listening for CQL clients
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,930 Gossiper.java (line 1279) Announcing shutdown
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:53,930 MessagingService.java (line 683) Waiting for messaging service to quiesce
INFO [ACCEPT-/127.0.0.10] 2014-08-05 19:14:53,931 MessagingService.java (line 923) MessagingService has terminated the accept() thread

Does anyone have any ideas about how to debug this? Looking around on Google I found some threads suggesting that this could occur from an OOM error (http://stackoverflow.com/questions/23755040/cassandra-exits-with-no-errors). Wouldn't such an error be logged, however?

The test that fails is a test of our MapReduce Hadoop InputFormat and as such it does some pretty big queries across multiple rows (over a range of partitioning key tokens). The default fetch size I believe is 5000 rows, and the values in the rows I am fetching are just simple strings, so I would not think the amount of data in a single read would be too big.

FWIW I don't see any log messages about garbage collection for at least 3min before the process shuts down (and no GC messages after the test stops doing writes and starts doing reads).

I'd greatly appreciate any help before my team kills me for breaking our Jenkins build so consistently! :)

Best regards,
Clint
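To pin down when the shutdown began and how long the node was quiet beforehand, the log timestamps can be compared directly. This is a minimal sketch, assuming the log line layout shown in the message (level, [thread], date and time with milliseconds after a comma). `parse` and `quiet_time_before_shutdown` are hypothetical helpers, and the first sample line is a placeholder invented for illustration; the other two are copied from the log excerpt.

```python
import re
from datetime import datetime

LINE = re.compile(
    r"^\s*(\w+)\s+\[([^\]]+)\]\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})"
)


def parse(line):
    """Return (thread_name, timestamp) for a log line, or None if it does not match."""
    m = LINE.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group(3), "%Y-%m-%d %H:%M:%S")
    return m.group(2), ts.replace(microsecond=int(m.group(4)) * 1000)


def quiet_time_before_shutdown(lines):
    """Return the gap between the last ordinary line and the first shutdown-hook line."""
    previous = None
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        thread, ts = parsed
        if thread == "StorageServiceShutdownHook":
            return None if previous is None else ts - previous
        previous = ts
    return None  # no shutdown-hook line seen


SAMPLE = [
    # Placeholder line, not from a real log.
    "INFO [SomeOtherThread] 2014-08-05 19:11:40,000 Example.java (line 1) placeholder",
    "INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,903 ThriftServer.java (line 141) Stop listening to thrift clients",
    "INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,920 Server.java (line 182) Stop listening for CQL clients",
]

if __name__ == "__main__":
    print("quiet for", quiet_time_before_shutdown(SAMPLE))
```

The timestamp of the first shutdown-hook line can then be matched against system logs, audit logs, or scheduled jobs on the build machine.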
Re: Cassandra process exiting mysteriously
If there is an OOM, it will be in the logs.

On Aug 5, 2014 8:17 PM, Clint Kelly clint.ke...@gmail.com wrote:

Hi everyone,

For some integration tests, we start up a CassandraDaemon in a separate process (using the Java 7 ProcessBuilder API). All of my integration tests run beautifully on my laptop, but one of them fails on our Jenkins cluster.

The failing integration test does around 10k writes to different rows and then 10k reads. After running some number of reads, the job dies with this error:

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.10:58209 (com.datastax.driver.core.exceptions.DriverException: Timeout during read))

This error appears to have occurred because the Cassandra process has stopped. The logs for the Cassandra process show some warnings during batch writes (the batches are too big), no activity for a few minutes (I assume this is because all of the read operations were proceeding smoothly), and then look like the following:

INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,903 ThriftServer.java (line 141) Stop listening to thrift clients
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,920 Server.java (line 182) Stop listening for CQL clients
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,930 Gossiper.java (line 1279) Announcing shutdown
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:53,930 MessagingService.java (line 683) Waiting for messaging service to quiesce
INFO [ACCEPT-/127.0.0.10] 2014-08-05 19:14:53,931 MessagingService.java (line 923) MessagingService has terminated the accept() thread

Does anyone have any ideas about how to debug this? Looking around on Google I found some threads suggesting that this could occur from an OOM error (http://stackoverflow.com/questions/23755040/cassandra-exits-with-no-errors). Wouldn't such an error be logged, however?
The test that fails is a test of our MapReduce Hadoop InputFormat and as such it does some pretty big queries across multiple rows (over a range of partitioning key tokens). The default fetch size I believe is 5000 rows, and the values in the rows I am fetching are just simple strings, so I would not think the amount of data in a single read would be too big. FWIW I don't see any log messages about garbage collection for at least 3min before the process shuts down (and no GC messages after the test stops doing writes and starts doing reads). I'd greatly appreciate any help before my team kills me for breaking our Jenkins build so consistently! :) Best regards, Clint
Re: Cassandra process exiting mysteriously
Hi Kevin,

Thanks for your reply. That is what I assumed, but some of the posts I read on Stack Overflow (e.g., the one that I referenced in my mail) suggested otherwise. I was just curious whether others had experienced OOM problems that weren't logged, or whether there were other common culprits.

Best regards,
Clint

On Tue, Aug 5, 2014 at 9:29 PM, Kevin Burton bur...@spinn3r.com wrote:

If there is an OOM, it will be in the logs.

On Aug 5, 2014 8:17 PM, Clint Kelly clint.ke...@gmail.com wrote:

Hi everyone,

For some integration tests, we start up a CassandraDaemon in a separate process (using the Java 7 ProcessBuilder API). All of my integration tests run beautifully on my laptop, but one of them fails on our Jenkins cluster.

The failing integration test does around 10k writes to different rows and then 10k reads. After running some number of reads, the job dies with this error:

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.10:58209 (com.datastax.driver.core.exceptions.DriverException: Timeout during read))

This error appears to have occurred because the Cassandra process has stopped.
The logs for the Cassandra process show some warnings during batch writes (the batches are too big), no activity for a few minutes (I assume this is because all of the read operations were proceeding smoothly), and then look like the following: INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,903 ThriftServer.java (line 141) Stop listening to thrift clients INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,920 Server.java (line 182) Stop listening for CQL clients INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,930 Gossiper.java (line 1279) Announcing shutdown INFO [StorageServiceShutdownHook] 2014-08-05 19:14:53,930 MessagingService.java (line 683) Waiting for messaging service to quiesce INFO [ACCEPT-/127.0.0.10] 2014-08-05 19:14:53,931 MessagingService.java (line 923) MessagingService has terminated the accept() thread Does anyone have any ideas about how to debug this? Looking around on google I found some threads suggesting that this could occur from an OOM error (http://stackoverflow.com/questions/23755040/cassandra-exits-with-no-errors). Wouldn't such an error be logged, however? The test that fails is a test of our MapReduce Hadoop InputFormat and as such it does some pretty big queries across multiple rows (over a range of partitioning key tokens). The default fetch size I believe is 5000 rows, and the values in the rows I am fetching are just simple strings, so I would not think the amount of data in a single read would be too big. FWIW I don't see any log messages about garbage collection for at least 3min before the process shuts down (and no GC messages after the test stops doing writes and starts doing reads). I'd greatly appreciate any help before my team kills me for breaking our Jenkins build so consistently! :) Best regards, Clint