Bootstrap stuck: vnode enabled 1.2.12
After our otherwise successful upgrade procedure to enable vnodes, one non-seed host ran into a hardware issue during bootstrap while we were adding new hosts back to the cluster. By the time the hardware issue was fixed a week later, all the other nodes had been added successfully, cleaned, and repaired. The disks on this node were untouched, and when the node was started back up it detected the interrupted bootstrap and attempted to bootstrap again. However, after ~24 hrs it was still stuck in the JOINING state according to nodetool netstats on that node, even though no streams were flowing to or from it. It also did not appear in nodetool status in any form (not even as JOINING). From a couple of observed thread dumps, the stack of the thread blocked during bootstrap is at [1].

Since the node wasn't making any progress, I ended up stopping Cassandra, cleaning up the data and commitlog directories, and attempting a fresh bootstrap. nodetool netstats immediately reported a whole bunch of streams queued up, and data started streaming to the node. The data directory quickly grew to 18 GB (the other nodes had ~25 GB, but we have a lot of data with low TTLs). However, the node ended up back in the earlier reported state: nodetool netstats has nothing queued but still reports JOINING, even though it's been 24 hrs. There are no other ERRORs in the logs, and new data written to the cluster makes it to this node just fine, triggering compactions etc. from time to time. Any help is appreciated.
Thanks, Arindam

[1] Thread dump

Thread 3708: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=156 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() @bci=1, line=811 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(int) @bci=55, line=969 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(int) @bci=24, line=1281 (Interpreted frame)
 - java.util.concurrent.CountDownLatch.await() @bci=5, line=207 (Interpreted frame)
 - org.apache.cassandra.dht.RangeStreamer.fetch() @bci=209, line=256 (Interpreted frame)
 - org.apache.cassandra.dht.BootStrapper.bootstrap() @bci=120, line=84 (Interpreted frame)
 - org.apache.cassandra.service.StorageService.bootstrap(java.util.Collection) @bci=172, line=978 (Interpreted frame)
 - org.apache.cassandra.service.StorageService.joinTokenRing(int) @bci=827, line=744 (Interpreted frame)
 - org.apache.cassandra.service.StorageService.initServer(int) @bci=363, line=585 (Interpreted frame)
 - org.apache.cassandra.service.StorageService.initServer() @bci=4, line=482 (Interpreted frame)
 - org.apache.cassandra.service.CassandraDaemon.setup() @bci=1069, line=348 (Interpreted frame)
 - org.apache.cassandra.service.CassandraDaemon.activate() @bci=59, line=447 (Interpreted frame)
 - org.apache.cassandra.service.CassandraDaemon.main(java.lang.String[]) @bci=3, line=490 (Interpreted frame)
Re: TimedOutException in Java but not in cqlsh
After a few tests, it does not depend on the query. Whatever CQL3 query I run, I always get the same exception. If someone sees something... -- Cyril SCETBON

On 13 Feb 2014, at 17:22, Cyril Scetbon cyril.scet...@free.fr wrote:

Hi, I get a weird issue with Cassandra 1.2.13. As written in the subject, a query executed by the CqlPagingRecordReader class raises a TimedOutException in Java, but I don't get any error when I use the same query in cqlsh. What's the difference between those two ways? Does cqlsh bypass some configuration compared to Java? You can find my sample code at http://pastebin.com/vbAFyAys (don't mind the way it's coded, it's just sample code). FYI, I can't reproduce it on another cluster. Here is the output of the two different ways (Java and cqlsh) I used: http://pastebin.com/umMNXJRw Thanks -- Cyril SCETBON
Re: TimedOutException in Java but not in cqlsh
Check the consistency level and socket timeout settings on the client side. -Vivek

On Fri, Feb 14, 2014 at 2:36 PM, Cyril Scetbon cyril.scet...@free.fr wrote: [quoted message snipped]
Re: TimedOutException in Java but not in cqlsh
Hi, Good advice. I found earlier this morning that it's related to the LOCAL_ONE consistency level. I'll check later whether it should raise an error in some cases. Thanks -- Cyril SCETBON

On 14 Feb 2014, at 10:12, Vivek Mishra mishra.v...@gmail.com wrote: Check for consistency level and socket timeout setting on client side. -Vivek [rest of quoted thread snipped]
Re: Bootstrap failure on C* 1.2.13
Hi Paulo, Did you find out how to fix this issue? I am experiencing the exact same issue after trying to help you on this exact subject a few days ago :). Config: 32 C* 1.2.11 nodes, vnodes enabled, RF=3, 1 DC, on AWS EC2 m1.xlarge. We added a few nodes (4) and it seems that this occurs on one node out of two...

INFO 12:52:16,889 Finished streaming session d5e4d014-9558-11e3-950d-cd6aba92807e from /xxx.xxx.xxx.xxx
java.lang.RuntimeException: Unable to fetch range [(20078703525355016727168231761171377180,20105424945623564908585534414693308183], (129753652951782325468767616123724624016,129754698153613057562227134647005586420], (449910615740630024413140540076738,4524540663392564361402125588359485564], (122461441134035840782923349842361962551,122462803389597917496737056756119104930], (107970238065835199457922160357012606207,107987706615224138615506976884972465320], (129754698153613057562227134647005586420,129760990520285412763184172827801136526], (38338043252657275110873170917842646549,38368318768493907804399955985800320618], (42022774431506526693485667522039962965,42053289032932587102300879230918436885], (66836265760288088017242608238099612345,66844191330959602627129212011239690831], (52540232739182066369547232798226785314,52559117354438503565212218200939569114], (145046787539667961591986998676504957238,145057153206926436867917708334845130444], (108279691586280658015556401795266720050,108305470056478513440634738885678702409], (40039571254531814244837067525035822613,40053379084508254942645157728035688263], (132027653159543236812527609067336099062,132029648290617316887203744857701890860], (52516518106546460227349801041398186304,52540232739182066369547232798226785314], (151797253868519929321029931533765036527,151828244658375264200603444399788004805], (145057153206926436867917708334845130444,145084033851007428646660791831082771964], (107963567982152736714636832273817259428,107970238065835199457922160357012606207]] for keyspace foo_bar from any hosts
 at 
org.apache.cassandra.dht.RangeStreamer.fetch(RangeStreamer.java:260)
 at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:84)
 at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:973)
 at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:740)
 at org.apache.cassandra.service.StorageService.initServer(StorageService.java:584)
 at org.apache.cassandra.service.StorageService.initServer(StorageService.java:481)
 at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
 at org.apache.cassandra.service.CassandraDaemon.init(CassandraDaemon.java:381)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.commons.daemon.support.DaemonLoader.load(DaemonLoader.java:212)
Cannot load daemon
Service exit with a return value of 3

Hope you'll be able to help me on this one :)

2014-02-07 19:24 GMT+01:00 Robert Coli rc...@eventbrite.com: On Fri, Feb 7, 2014 at 4:41 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: From the changelog: 1.2.15 * Move handling of migration event source to solve bootstrap race (CASSANDRA-6648). Maybe you should give this new version a try, if you suspect your issue is related to CASSANDRA-6648.

6648 appears to have been introduced in 1.2.14, by: https://issues.apache.org/jira/browse/CASSANDRA-6615 So it should only affect 1.2.14. =Rob
Exception in cassandra logs while processing the message
Hello, I am seeing the below exception in my Cassandra logs (/var/log/cassandra/system.log):

INFO [ScheduledTasks:1] 2014-02-13 13:13:57,641 GCInspector.java (line 119) GC for ParNew: 273 ms for 1 collections, 2319121816 used; max is 4456448000
INFO [ScheduledTasks:1] 2014-02-13 13:14:02,695 GCInspector.java (line 119) GC for ParNew: 214 ms for 1 collections, 2315368976 used; max is 4456448000
INFO [OptionalTasks:1] 2014-02-13 13:14:08,093 MeteredFlusher.java (line 64) flushing high-traffic column family CFS(Keyspace='comsdb', ColumnFamily='product_update') (estimated 213624220 bytes)
INFO [OptionalTasks:1] 2014-02-13 13:14:08,093 ColumnFamilyStore.java (line 626) Enqueuing flush of Memtable-product_update@1067619242(31239028/213625108 serialized/live bytes, 222393 ops)
INFO [FlushWriter:94] 2014-02-13 13:14:08,127 Memtable.java (line 400) Writing Memtable-product_update@1067619242(31239028/213625108 serialized/live bytes, 222393 ops)
INFO [ScheduledTasks:1] 2014-02-13 13:14:08,696 GCInspector.java (line 119) GC for ParNew: 214 ms for 1 collections, 2480175160 used; max is 4456448000
INFO [FlushWriter:94] 2014-02-13 13:14:10,836 Memtable.java (line 438) Completed flushing /cassandra1/data/comsdb/product_update/comsdb-product_update-ic-416-Data.db (15707248 bytes) for commitlog position ReplayPosition(segmentId=1391568233618, position=13712751)
ERROR [Thrift:13] 2014-02-13 13:15:45,694 CustomTThreadPoolServer.java (line 213) Thrift error occurred during processing of message.
org.apache.thrift.TException: Negative length: -2147418111
 at org.apache.thrift.protocol.TBinaryProtocol.checkReadLength(TBinaryProtocol.java:388)
 at org.apache.thrift.protocol.TBinaryProtocol.readBinary(TBinaryProtocol.java:363)
 at org.apache.cassandra.thrift.Cassandra$batch_mutate_args.read(Cassandra.java:20304)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:21)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
 at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:199)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:679)
ERROR [Thrift:103] 2014-02-13 13:21:25,719 CustomTThreadPoolServer.java (line 213) Thrift error occurred during processing of message.
org.apache.thrift.TException: Negative length: -2147418111

Below are the Cassandra version and Hector client version currently in use:

Cassandra version: 1.2.11
Hector client: 1.0-2

Any lead would be appreciated. We are planning to move to Cassandra 2.0 with the Java driver, but that may take some time; meanwhile we need to find the root cause and resolve this issue.

Regards, Ankit Tyagi
Expired column showing up
Hi, I am using Cassandra version 2.0.2. On a wide row (approx. 1 columns), I expire a few columns by setting a TTL of 1 second. At times these columns still show up during slice queries. When I have this issue, running count and get commands for that row using cassandra-cli gives different column counts. But once I run flush and compact, the issue goes away and the expired columns don't show up. Can someone provide some help on this issue? -- Regards, Mahesh Rajamani
Re: Expired column showing up
I am just learning, so I don't know the answer to your question, but what is the use case for a TTL of 1 second?

On Fri, Feb 14, 2014 at 6:45 AM, mahesh rajamani rajamani.mah...@gmail.com wrote: [quoted message snipped]
Re: Bootstrap failure on C* 1.2.13
Hello Alain, I solved this with a brute-force solution, but I didn't understand exactly what happened behind the scenes. What I did was:

a) removed the failed node from the ring with the unsafeAssassinate JMX operation.
b) this caused requests for that node's ranges to be routed to the following node, which didn't have the data, so in order to fix the problem I inserted a new dummy node with the same token as the failed node, but with auto_bootstrap=false.
c) after the node joined the ring again, I did a clean shutdown with:

nodetool -h localhost disablethrift
nodetool -h localhost disablegossip
sleep 10
nodetool -h localhost drain

d) restarted the bootstrap process again on the new node.

But in our case our cluster was not using vnodes, so this workaround will probably not work with vnodes, since you cannot specify the 256 tokens from the old node. This really seems like some kind of metadata inconsistency in gossip, so you should probably check whether nodetool gossipinfo shows a node that's not supposed to be in the ring, and unsafeAssassinate it. This post has more info about it: http://nartax.com/2012/09/assassinate-cassandra-node/ But be careful and know what you're doing, as this can be a dangerous operation. Good luck! Cheers, Paulo

On Fri, Feb 14, 2014 at 11:17 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi Paulo, Did you find out how to fix this issue? I am experiencing the exact same issue after trying to help you on this exact subject a few days ago :). Config: 32 C* 1.2.11 nodes, vnodes enabled, RF=3, 1 DC, on AWS EC2 m1.xlarge. We added a few nodes (4) and it seems that this occurs on one node out of two...
[quoted stack trace and earlier replies snipped]
Re: Expired column showing up
You should upgrade; Cassandra 2.0.2 is not the latest 2.0.x version. If you still have the problem after upgrading, report a bug.

On Fri, Feb 14, 2014 at 12:50 PM, Yogi Nerella ynerella...@gmail.com wrote: [quoted thread snipped]
Re: Intermittent long application pauses on nodes
Sorry, I have not had a chance to file a JIRA ticket, and we have not been able to resolve the issue. But since Joel mentioned that upgrading to Cassandra 2.0.x solved it for them, we may need to upgrade. We are currently on Java 1.7 and Cassandra 1.2.8.

On Thu, Feb 13, 2014 at 12:40 PM, Keith Wright kwri...@nanigans.com wrote: You're running 2.0.* in production? May I ask what C* version and OS? Any hardware details would be appreciated as well. Thx!

From: Joel Samuelsson samuelsson.j...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Thursday, February 13, 2014 at 11:39 AM
To: user@cassandra.apache.org
Subject: Re: Intermittent long application pauses on nodes

We have had similar issues, and upgrading C* to 2.0.x and Java to 1.7 seems to have helped.

2014-02-13, Keith Wright kwri...@nanigans.com: Frank, did you ever file a ticket for this issue or find the root cause? I believe we are seeing the same issues when attempting to bootstrap. Thanks

From: Robert Coli rc...@eventbrite.com
Reply-To: user@cassandra.apache.org
Date: Monday, February 3, 2014 at 6:10 PM
To: user@cassandra.apache.org
Subject: Re: Intermittent long application pauses on nodes

On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith belliottsm...@datastax.com wrote: It's possible that this is a JVM issue, but if so there may be some remedial action we can take anyway. There are some more flags we should add, but we can discuss that once you open a ticket. If you could include the strange JMX error as well, that might be helpful.

It would be appreciated if you could inform this thread of the JIRA ticket number, for the benefit of the community and google searchers. :) =Rob
Re: Expired column showing up
Hi Mahesh, is it possible that you are creating columns with a long TTL, then updating those columns with a smaller TTL? Kind regards, Christian

On Fri, Feb 14, 2014 at 3:45 PM, mahesh rajamani rajamani.mah...@gmail.com wrote: [quoted message snipped]
Re: Expired column showing up
It is my understanding that rows with TTLs don't mix well with rows that don't have TTLs, i.e. they should all have a TTL or all not have a TTL. That said, if you can create a small Java class (test case) that demonstrates the problem, I'm happy to try it out on 2.0.5. The code can be attached to a JIRA ticket if needed. __ Sent from iPhone

On 15 Feb 2014, at 1:45 am, mahesh rajamani rajamani.mah...@gmail.com wrote: [quoted message snipped]
Re: Bootstrap failure on C* 1.2.13
On Fri, Feb 14, 2014 at 10:08 AM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: But in our case, our cluster was not using VNodes, so this workaround will probably not work with VNodes, since you cannot specify the 256 tokens from the old node.

Sure you can, in a comma-delimited list. I plan to write a short blog post about this, but... I recommend that anyone using Cassandra, vnodes or not, always explicitly populate the initial_token line in cassandra.yaml. There are a number of cases where you will lose if you do not do so, and AFAICT no cases where you lose by doing so. If one is using vnodes and wants to do this, the process goes like:

1) set num_tokens to the desired number of vnodes
2) start node/bootstrap
3) use a one-liner like jeffj's:

nodetool info -T | grep ^Token | awk '{ print $3 }' | tr \\n , | sed -e 's/,$/\n/'

to get a comma-delimited list of the vnode tokens
4) insert this comma-delimited list into initial_token, and comment out num_tokens (though it is a NOOP)

=Rob
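As a sanity check, the pipeline from step 3 can be exercised against canned `nodetool info -T`-style output without a live node. The token values below are made up for illustration, and this variant strips the trailing comma instead of replacing it with a newline, so the result can drop straight into a yaml value:

```shell
# Turn `nodetool info -T`-style output into a comma-delimited
# initial_token list. The sample input stands in for a live node.
sample='Token            : 100
Token            : 200
Token            : 300'

tokens=$(printf '%s\n' "$sample" \
  | grep '^Token' \
  | awk '{ print $3 }' \
  | tr '\n' ',' \
  | sed -e 's/,$//')

echo "initial_token: $tokens"
# prints: initial_token: 100,200,300
```

The resulting comma-delimited string is what goes on the initial_token line in cassandra.yaml, as described above.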
Re: supervisord and cassandra
Hi, now using Oracle Java 7. I commented out the StringTableSize=103 line; same issue, but nothing in the log file now. If I start it from the command line, it works. Thanks

On Fri, Feb 14, 2014 at 9:48 AM, Michael Shuler mich...@pbandjelly.org wrote: On 02/13/2014 07:03 PM, David Montgomery wrote: I only added the -f flag after the first time it did not work. If I don't use the -f flag: cassandra_server:cassandra FATAL Exited too quickly (process log may have details)

From your original message: Unrecognized VM option 'StringTableSize=103' Could not create the Java virtual machine. Comment out the -XX:StringTableSize=103 line in conf/cassandra-env.sh and see what happens.

java version 1.7.0_25 OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.04.2) Use Oracle's JVM and see what happens. -- Michael
Re: supervisord and cassandra
On 02/14/2014 06:58 PM, David Montgomery wrote: [quoted message snipped]

What user are you running C* as when you run it from the command line? What user is running C* via supervisord? -- Michael
Re: supervisord and cassandra
On 02/14/2014 07:34 PM, Michael Shuler wrote: [earlier exchange snipped]

So you piqued my interest and I tried supervisord in a VM. I think you probably need to go hit up the supervisord community for some "how do I do this correctly" questions. Attached are a console log and the conf I used. Here's what I did:

- installed C* 2.0.5 with /var/{lib,log}/cassandra owned by my user, as usual
- verified C* runs fine from the command line
- killed C*
- installed the supervisor package and added the attached conf
- stopped/started supervisord to pick up the new conf
- C* is running fine and nodetool confirms it
- supervisorctl status shows ignorance of C* running (wrong config, I assume)
- stopped supervisord; C* still running (not sure if this is normal..)

I have never played with supervisord. It's interesting, but my guess is there is some additional magic needed from supervisor experts to help you with a properly behaving configuration. Good luck, and do report back with a good config for the archives! -- Kind regards, Michael

mshuler@debian:~$ ps axu|grep [j]ava
mshuler@debian:~$
mshuler@debian:~$ sudo invoke-rc.d supervisor start
Starting supervisor: supervisord.
mshuler@debian:~$
mshuler@debian:~$ ps axu|grep [j]ava
mshuler 5313 75.6 16.1 1053044 166152 ?
Sl 19:58 0:03 java -ea -javaagent:/opt/cassandra/bin/../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms501M -Xmx501M -Xmn100M -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=103 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:+UseCondCardMark -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dlog4j.configuration=log4j-server.properties -Dlog4j.defaultInitOverride=true -cp /opt/cassandra/bin/../conf:/opt/cassandra/bin/../build/classes/main:/opt/cassandra/bin/../build/classes/thrift:/opt/cassandra/bin/../lib/antlr-3.2.jar:/opt/cassandra/bin/../lib/apache-cassandra-2.0.5.jar:/opt/cassandra/bin/../lib/apache-cassandra-clientutil-2.0.5.jar:/opt/cassandra/bin/../lib/apache-cassandra-thrift-2.0.5.jar:/opt/cassandra/bin/../lib/commons-cli-1.1.jar:/opt/cassandra/bin/../lib/commons-codec-1.2.jar:/opt/cassandra/bin/../lib/commons-lang3-3.1.jar:/opt/cassandra/bin/../lib/compress-lzf-0.8.4.jar:/opt/cassandra/bin/../lib/concurrentlinkedhashmap-lru-1.3.jar:/opt/cassandra/bin/../lib/disruptor-3.0.1.jar:/opt/cassandra/bin/../lib/guava-15.0.jar:/opt/cassandra/bin/../lib/high-scale-lib-1.1.2.jar:/opt/cassandra/bin/../lib/jackson-core-asl-1.9.2.jar:/opt/cassandra/bin/../lib/jackson-mapper-asl-1.9.2.jar:/opt/cassandra/bin/../lib/jamm-0.2.5.jar:/opt/cassandra/bin/../lib/jbcrypt-0.3m.jar:/opt/cassandra/bin/../lib/jline-1.0.jar:/opt/cassandra/bin/../lib/json-simple-1.1.jar:/opt/cassandra/bin/../lib/libthrift-0.9.1.jar:/opt/cassandra/bin/../lib/log4j-1.2.16.jar:/opt/cassandra/bin/../lib/lz4-1.2.0.jar:/opt/cassandra/bin/../lib/metrics-core-2.2.0.jar:/opt/cassandra/bin/../lib/netty-3.6.6.Final.jar:/opt/cassandra/bin/../lib/reporter-config-2.1.0.jar:/opt/cassandra/bin/../lib/servlet
-api-2.5-20081211.jar:/opt/cassandra/bin/../lib/slf4j-api-1.7.2.jar:/opt/cassandra/bin/../lib/slf4j-log4j12-1.7.2.jar:/opt/cassandra/bin/../lib/snakeyaml-1.11.jar:/opt/cassandra/bin/../lib/snappy-java-1.0.5.jar:/opt/cassandra/bin/../lib/snaptree-0.1.jar:/opt/cassandra/bin/../lib/thrift-server-0.3.3.jar org.apache.cassandra.service.CassandraDaemon
mshuler@debian:~$
mshuler@debian:~$ sudo supervisorctl status
cassandra_server:cassandra FATAL Exited too quickly (process log may have details)
mshuler@debian:~$
mshuler@debian:~$ /opt/cassandra/bin/nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  114.22 KB  256     100.0%            3a7fde73-b1ca-4503-b5f2-b4cd9b41032c  rack1
mshuler@debian:~$
mshuler@debian:~$ killall java
mshuler@debian:~$
mshuler@debian:~$ ps axu|grep [j]ava
mshuler@debian:~$
mshuler@debian:~$ ps axu|grep [j]ava
mshuler@debian:~$
mshuler@debian:~$ date
Fri Feb 14 20:00:09 CST 2014
mshuler@debian:~$
mshuler@debian:~$ ps axu|grep [j]ava
mshuler@debian:~$
mshuler@debian:~$ cat /tmp/cassandra.*
mshuler@debian:~$
mshuler@debian:~$ grep cassandra /var/log/supervisor/supervisord.log
2014-02-14 19:54:11,521 WARN Included extra file
Re: supervisord and cassandra
On 02/14/2014 08:10 PM, Michael Shuler wrote: [step list snipped]

I missed a couple of steps in there:

- killed C*, and supervisor never restarted it
- restarted the supervisor service, which starts up C* fine (not in my console paste)
- stopped supervisord; C* still running (not sure if this is normal..) (not in my console paste)

Anyway, let us know how this works out! -- Michael
Re: supervisord and cassandra
On 02/14/2014 08:10 PM, Michael Shuler wrote: mshuler@debian:~$ sudo supervisorctl status cassandra_server:cassandra FATAL Exited too quickly (process log may have details)

I imagine the problems all stem from the fact that the initializing script (in my case, /opt/cassandra/bin/cassandra) is executed and is done once it has started C* ("Exited too quickly"), while the process that actually needs to be supervised is the java process (hence the ignorance that it is running, and the fact that killing it is not recognized). -- Michael
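This failure mode can be reproduced without Cassandra at all. The toy sketch below (the file paths and the sleep standing in for the JVM are made up for illustration) shows a launcher script that backgrounds its workload and exits immediately, which is exactly what a supervisor watching the launcher sees as a process that died too quickly:

```shell
# A launcher that backgrounds its workload, like bin/cassandra
# without -f: the script itself exits at once, so a supervisor
# watching it concludes the "service" died, while the real
# process lives on unsupervised.
cat > /tmp/toy_launcher.sh <<'EOF'
#!/bin/sh
sleep 30 &                      # stands in for the backgrounded JVM
echo $! > /tmp/toy_launcher.pid
exit 0                          # launcher done; nothing left to supervise
EOF
chmod +x /tmp/toy_launcher.sh

/tmp/toy_launcher.sh            # returns immediately
child=$(cat /tmp/toy_launcher.pid)
if kill -0 "$child" 2>/dev/null; then alive=yes; else alive=no; fi
echo "launcher exited; child still running: $alive"
kill "$child" 2>/dev/null       # clean up the toy workload
```

The fix, as the linked supervisor-users thread suggests, is to supervise a command that stays in the foreground rather than a launcher that daemonizes.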
Re: supervisord and cassandra
On 02/14/2014 08:27 PM, Michael Shuler wrote: [previous message quoted]

Yup. https://lists.supervisord.org/pipermail/supervisor-users/2012-December/001207.html -- Michael
Re: supervisord and cassandra
On 02/14/2014 08:32 PM, Michael Shuler wrote:
> I imagine the problems all stem from the fact that the launcher script exits as soon as it has started c*, while the process that actually needs to be supervised is the java process.
>
> Yup.
> https://lists.supervisord.org/pipermail/supervisor-users/2012-December/001207.html

(Self reply again..) With cassandra -f, which is not backgrounded in the exec line of the script, my conf works:

mshuler@debian:~$ sudo supervisorctl status
cassandra_server:cassandra    RUNNING    pid 2784, uptime 0:00:23
mshuler@debian:~$ pkill java
mshuler@debian:~$ ps axu | grep java
mshuler   2988  0.0  0.0    7828    876 pts/0  S+  20:45  0:00 grep java
mshuler@debian:~$ ps axu | grep java
mshuler   2989 18.1 16.4 1056412 168492 ?      Sl  20:45  0:04 java -ea -javaagent:/opt/cassandra/bin/../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities ...

and in the supervisor log:

2014-02-14 20:42:23,067 INFO daemonizing the supervisord process
2014-02-14 20:42:23,067 INFO supervisord started with pid 2777
2014-02-14 20:42:24,072 INFO spawned: 'cassandra' with pid 2784
2014-02-14 20:42:39,302 INFO success: cassandra entered RUNNING state, process has stayed up for > than 15 seconds (startsecs)
2014-02-14 20:45:57,131 INFO exited: cassandra (exit status 143; not expected)
2014-02-14 20:45:58,134 INFO spawned: 'cassandra' with pid 2989
2014-02-14 20:46:13,241 INFO success: cassandra entered RUNNING state, process has stayed up for > than 15 seconds (startsecs)

-- Michael
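[Editor's note] The conf attached to the original mail is not preserved in this archive. A minimal sketch of a supervisord program section consistent with this message (user, path, and startsecs taken from the thread; everything else is an assumption to adjust for your environment) might look like:

```ini
[program:cassandra]
; -f keeps the JVM in the foreground so it is the supervised process;
; without it the launcher forks and exits, and supervisord reports
; "Exited too quickly".
command=/opt/cassandra/bin/cassandra -f
user=mshuler
autostart=true
autorestart=true
startsecs=15
stdout_logfile=/var/log/cassandra/supervisor-stdout.log
stderr_logfile=/var/log/cassandra/supervisor-stderr.log
```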
Re: supervisord and cassandra
So.. see the rest of my replies for a working configuration, but I wanted to reply to your initial post. What problem are you trying to solve, and why do you think using supervisord to restart a failed c* node will help? You really don't want a node bouncing up and down. A dead or dying node should stay down until you can troubleshoot *why* it is dead or dying, and determine whether it should be replaced by a new node or has a repairable issue that will allow you to rejoin it to the ring.

-- Kind regards, Michael
Re: supervisord and cassandra
I had to give up on supervisor. I installed the deb package rather than installing from source; that worked, though. Thanks.

On Sat, Feb 15, 2014 at 10:10 AM, Michael Shuler mich...@pbandjelly.org wrote:
> On 02/14/2014 07:34 PM, Michael Shuler wrote:
>> On 02/14/2014 06:58 PM, David Montgomery wrote:
>>> Hi, I'm now using Oracle 7. I commented out the line StringTableSize=103; same issue, but nothing in the log file now. Starting from the command line works, though.
>>
>> What user are you running c* with when running from the command line? What user is running c* via supervisord?
>
> So you piqued my interest, and I tried supervisord in a VM. I think you probably need to go hit up the supervisord community for some "how do I do this correctly" questions. Attached a console log and the conf I used. Here's what I did:
> - installed c* 2.0.5 with /var/{lib,log}/cassandra owned by my user, as usual
> - verified c* runs fine from the command line
> - killed c*
> - installed the supervisor package and added the attached conf
> - stopped/started supervisord to pick up the new conf
> - c* is running fine and nodetool confirms
> - supervisorctl status shows no awareness that c* is running (wrong config, I assume)
> - stopped supervisord, c* still running (not sure if this is normal..)
>
> I have never played with supervisord. It's interesting, but my guess is there is some additional magic needed by supervisor experts to help you with a properly behaving configuration. Good luck, and do report back with a good config for the archives!
>
> -- Kind regards, Michael