Re: I have a deaf node?
I wouldn't worry unless it changes from deaf to deadbeef.

On Sun, Jun 1, 2014 at 11:34 PM, Tim Dunphy bluethu...@gmail.com wrote:

"This post should definitely make it to the hall of fame!! :)"

My proudest accomplishment on the list, heh.

On Sun, Jun 1, 2014 at 11:24 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote:

This post should definitely make it to the hall of fame!! :)

On Mon, Jun 2, 2014 at 12:05 AM, Tim Dunphy bluethu...@gmail.com wrote:

"That made my day. Not to worry though, unless you start seeing the number 23 in your host IDs."

Yeah man, glad to provide some comic relief to the list! ;)

On Sun, Jun 1, 2014 at 11:01 PM, Apostolis Xekoukoulotakis xekou...@gmail.com wrote:

That made my day. Not to worry though, unless you start seeing the number 23 in your host IDs.

On Jun 2, 2014 12:40 AM, Kevin Burton bur...@spinn3r.com wrote:

Could be worse... it could be under-caffeinated and say decafbad.

On Sat, May 31, 2014 at 10:45 AM, Tim Dunphy bluethu...@gmail.com wrote:

"I think the deaf thing is just the ending of the host ID in hexadecimal. It's an extraordinary coincidence that it ends with DEAF :D"

Hah.. yeah that thought did cross my mind. :)

On Sat, May 31, 2014 at 1:35 PM, DuyHai Doan doanduy...@gmail.com wrote:

I think the deaf thing is just the ending of the host ID in hexadecimal. It's an extraordinary coincidence that it ends with DEAF :D

On Sat, May 31, 2014 at 6:38 PM, Tim Dunphy bluethu...@gmail.com wrote:

I didn't realize Cassandra nodes could develop hearing problems. :) But I have a dead node in my cluster I would like to get rid of.

[root@beta:~] # nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load      Tokens  Owns   Host ID                               Rack
UN  10.10.1.94  199.6 KB  256     49.4%  fd2f76ae-8dcf-4e93-a37f-bf1e9088696e  rack1
DN  10.10.1.64  ?         256     50.6%  f2a48fc7-a362-43f5-9061-4bb3739fdeaf  rack1

I was just wondering what this could indicate, and whether it might mean more trouble than I bargained for in getting rid of it. I've made a couple of attempts to remove it so far, and I'm about to try again.

Thanks,
Tim

--
GPG me!! gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B

--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
Skype: burtonator
blog: http://burtonator.wordpress.com
Google+: https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are people.

--
Paulo Motta
Chaordic | Platform
www.chaordic.com.br
+55 48 3232.3200

--
David Daeschler
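As a postscript for anyone landing here with the same underlying problem (a down node to remove, DEAF-suffixed or not): on a vnode-era cluster like the one above, the dead node's Host ID is what `nodetool removenode` takes. Below is a minimal, hedged sketch for pulling the DN rows' Host IDs out of `nodetool status` output; the column position is an assumption based on the output format shown above, so eyeball the IDs before removing anything.

```shell
# find_down_nodes: read `nodetool status` output on stdin and print the
# Host ID of each DN (down) row. The Host ID is taken as the
# second-to-last column, which tolerates the Load column collapsing to
# a single "?" on a down node.
find_down_nodes() {
    awk '$1 == "DN" { print $(NF-1) }'
}

# Hedged usage (verify each ID by hand first!):
#   nodetool status | find_down_nodes
#   nodetool removenode f2a48fc7-a362-43f5-9061-4bb3739fdeaf
```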
Strange trouble with JNA on CentOS6
Good evening list,

I'm having a rather confusing issue with 4 new nodes that I put up. JNA doesn't seem to want to work in production even though it works on my test VM, which should have pretty much the same setup. I have created a series of setup scripts to bring up new nodes, and I am using a tried and tested configuration that has been running very well in production for a while. In all cases the following holds true:

OS: CentOS 6 with latest updates (2.6.32-431.17.1.el6.x86_64)
Java: Java(TM) SE Runtime Environment (build 1.6.0_30-b12)
Cassandra: 1.0.12
JNA: tried both jna-3.5.1.jar and jna-4.1.0.jar

On both my old production servers and my test setup (which I can recreate over and over and get the same results), startup yields:

CLibrary.java (line 109) JNA mlockall successful

However, for some reason on the NEW production boxes, I get:

Unable to link C library. Native methods will be disabled.

The JNA libs are exact copies and part of our setup scripts. There are only two major differences I can think of. The first is the hardware: all are 64-bit platforms, but the newer servers have 48 GB of RAM and newer processors than the older boxes and my test VM. I don't think that would have anything to do with it, but it's worth mentioning. Both production sets are Xeons, though. The second difference is that the new image would've been set up by Rackspace, while my test uses an older image and brings it up to date with yum. However, as part of setup I apply yum update to all of them, so they should be running at least the same versions of whatever they have in common.

Another interesting note: while looking for JNA on the disk, I also see what looks like an ELF binary in /tmp, named /tmp/jna-1073564104/jna4691935553862497129.tmp.

Thank you ahead of time for any help you may be able to offer.

--
David Daeschler
Re: Strange trouble with JNA on CentOS6
Figured this one out: in the new setup, /tmp was mounted noexec. So it looks like JNA was putting the native library there and then was unable to execute it. I hope this can help someone else.

On Mon, May 26, 2014 at 11:33 PM, David Daeschler david.daesch...@gmail.com wrote:

Good evening list, I'm having a rather confusing issue with 4 new nodes that I put up. JNA doesn't seem to want to work in production even though it works on my test VM that should have pretty much the same setup. [...] Another interesting note is that while looking for JNA on the disk, I also see what looks like an ELF binary in /tmp named /tmp/jna-1073564104/jna4691935553862497129.tmp. Thank you ahead of time for any help you may be able to offer.

--
David Daeschler
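Since threads like this usually end with someone else hitting the same wall: the root cause here is that JNA extracts its native library into /tmp (the .tmp ELF file mentioned above) and then cannot execute it when /tmp is mounted noexec. A small, hedged check follows; the /proc/mounts parsing and the `-Djna.tmpdir` workaround path are sketches to adapt, not guarantees about any particular system.

```shell
# has_noexec OPTS: succeed when the comma-separated mount option string
# OPTS (field 4 of a /proc/mounts line) contains "noexec".
has_noexec() {
    case ",$1," in
        *,noexec,*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Hedged usage against a live box:
#   opts=$(awk '$2 == "/tmp" { print $4 }' /proc/mounts)
#   if has_noexec "$opts"; then
#       echo "/tmp is noexec: JNA will fail to link its native library"
#   fi
# Workarounds: remount /tmp without noexec, or point JNA at an exec
# filesystem with -Djna.tmpdir=/var/lib/cassandra/jna (path is an example).
```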
Re: What % of cassandra developers are employed by Datastax?
Datastax has also gone far out of their way to support companies using Cassandra, regardless of whether it happens to be DSE or not. We are not part of any paid agreement with the company, and I had a Datastax employee sending me texts late on a weekend to help me through an issue I was having with my cluster. Actions like this demonstrate to me a commitment by Datastax to provide support and warm fuzzies to the entire community. Having them around has eased many of my concerns about running Cassandra on a limited small-business budget.

On Fri, May 23, 2014 at 1:42 PM, Robert Coli rc...@eventbrite.com wrote:

On Fri, May 16, 2014 at 10:56 AM, Kevin Burton bur...@spinn3r.com wrote:

"Perhaps because the developers are working on DSE :-P"

FWIW, and I am not necessarily known for being the biggest defender of Datastax and the relationship of their commercial interests to the architectural direction of Cassandra...

... Datastax contributes a huge amount of work back to Apache Cassandra by the simple method of operating on Cassandra issues raised by DSE customers inside the Apache JIRA. If they were not out there selling DSE to these (often quite large) customers, there would not necessarily be a commercial driver for fixing these Cassandra issues. The DSE features are mostly isolated enough from the main codebase, and from my understanding the teams are separated enough internally, that I'm pretty confident Datastax is a significant net positive for contribution to and promotion of Apache Cassandra.

=Rob

--
David Daeschler
Re: Hinted Handoff runs every ten minutes
Hi Steve,

Also confirming this. After having a node go down on Cassandra 1.0.8, there seems to be a hinted handoff between two of our 4 nodes every 10 minutes. Our setup also shows 0 rows. It does not appear to have any effect on the operation of the ring; it just fills up the log files.

- David

On Thu, Oct 18, 2012 at 2:10 PM, Stephen Pierce spie...@verifyle.com wrote:

I installed Cassandra on three nodes. I then ran a test suite against them to generate load. The test suite is designed to generate the same type of load that we plan to have in production. As one of many tests, I reset one of the nodes to check the failure/recovery modes. Cassandra worked just fine. I stopped the load generation, and got distracted with some other project/problem. A few days later, I noticed something strange on one of the nodes: hinted handoff starts every ten minutes, and while it seems to finish without any errors, it starts again ten minutes later. None of the nodes has any traffic, and hasn't for several days.

I checked the logs, and this goes back to the initial failure/recovery testing:

INFO [HintedHandoff:1] 2012-10-18 10:19:26,618 HintedHandOffManager.java (line 294) Started hinted handoff for token: 113427455640312821154458202477256070484 with IP: /192.168.128.136
INFO [HintedHandoff:1] 2012-10-18 10:19:26,779 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /192.168.128.136
INFO [HintedHandoff:1] 2012-10-18 10:29:26,622 HintedHandOffManager.java (line 294) Started hinted handoff for token: 113427455640312821154458202477256070484 with IP: /192.168.128.136
INFO [HintedHandoff:1] 2012-10-18 10:29:26,735 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /192.168.128.136
INFO [HintedHandoff:1] 2012-10-18 10:39:26,624 HintedHandOffManager.java (line 294) Started hinted handoff for token: 113427455640312821154458202477256070484 with IP: /192.168.128.136
INFO [HintedHandoff:1] 2012-10-18 10:39:26,751 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /192.168.128.136

The other nodes are happy and don't show this behavior. All the test data is readable, and everything is fine, but I'm curious why hinted handoff is running on one node all the time. I searched the bug database and found a bug that seems to have the same symptoms: https://issues.apache.org/jira/browse/CASSANDRA-3733. Although it's been marked fixed in 0.6, it describes my problem exactly. I'm running Cassandra 1.1.5 from Datastax on CentOS 6.0: http://rpm.datastax.com/community/noarch/apache-cassandra11-1.1.5-1.noarch.rpm

Is anyone else seeing this behavior? What can I do to provide more information?

Steve

--
David Daeschler
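If you want to confirm you are seeing the same repeating pattern (rather than occasional legitimate handoffs), a quick way is to count the zero-row completions per endpoint in the log. This is a hedged sketch; the log path and message text are taken from the excerpt above and may differ across versions.

```shell
# count_empty_handoffs: read a Cassandra system.log on stdin and print,
# for each endpoint, how many "Finished hinted handoff of 0 rows"
# completions appear. A large count against one endpoint matches the
# looping behavior described in this thread.
count_empty_handoffs() {
    awk '/Finished hinted handoff of 0 rows/ { n[$NF]++ }
         END { for (e in n) print n[e], e }'
}

# Typical use:
#   count_empty_handoffs < /var/log/cassandra/system.log
```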
Nodetool repair, exit code/status?
Hello. In the process of trying to streamline and provide better reporting for various data storage systems, I've realized that although we're verifying that nodetool repair runs, we're not verifying that it succeeds. I found a bug relating to the exit code for nodetool repair where, in some situations, there is no way to verify that the repair has completed successfully: https://issues.apache.org/jira/browse/CASSANDRA-2666

Is this still a problem? What is the best way to monitor the final status of the repair command to make sure all is well? Thank you ahead of time for any info.

- David
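One defensive pattern, pending a definitive answer, is to treat the exit code as necessary but not sufficient and also scan the log after the run. This is a hedged sketch: the failure strings and log path below are assumptions, not documented guarantees, so adapt them to what your Cassandra version actually emits.

```shell
# repair_looks_ok RC LOGFILE: succeed only if the nodetool exit code RC
# is zero AND the log contains no line matching a repair failure. The
# failure pattern is an assumption; tune it to your version's messages.
repair_looks_ok() {
    [ "$1" -eq 0 ] || return 1
    grep -Eqi 'repair.*(failed|error)' "$2" && return 1
    return 0
}

# Hedged usage:
#   nodetool repair -pr
#   repair_looks_ok $? /var/log/cassandra/system.log || echo "repair suspect" >&2
```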
Re: High CPU usage as of 8pm eastern time
More information for others that were affected. Our installation of Java:

[root@inv4 conf]# java -version
java version "1.6.0_30"
Java(TM) SE Runtime Environment (build 1.6.0_30-b12)
Java HotSpot(TM) 64-Bit Server VM (build 20.5-b03, mixed mode)
[root@inv4 conf]# uname -a
Linux inv4 2.6.32-220.4.2.el6.x86_64 #1 SMP Tue Feb 14 04:00:16 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux

Jonathan pointed out a Linux bug that may be related: https://issues.apache.org/jira/browse/CASSANDRA-4066

In my case only the Java process went nuts, as seems to be the case in many other reports:
https://bugzilla.mozilla.org/show_bug.cgi?id=769972
http://www.wired.com/wiredenterprise/2012/07/leap-second-bug-wreaks-havoc-with-java-linux/

I hope everyone got enough sleep!

- David

On Sun, Jul 1, 2012 at 4:49 AM, Hontvári József Levente hontv...@flyordie.com wrote:

Thank you for the mail. Same here, but I restarted the affected server before I noticed your mail. It affected both OpenJDK Java 6 (packaged with Ubuntu 10.04) and Oracle Java 7 processes. Ubuntu 32-bit servers had no issues, only a 64-bit machine. Likely it is related to the leap second introduced today.

On 2012.07.01. 5:11, Mina Naguib wrote:

Hi folks,

Our Cassandra (and other Java-based apps) started experiencing extremely high CPU usage as of 8pm eastern time (midnight UTC). The issue appears to be related to specific versions of java + linux + ntpd. There are many solutions floating around on IRC, Twitter, StackExchange, and the LKML. The simplest one that worked for us is simply to run this command on each affected machine:

date; date `date +%m%d%H%M%C%y.%S`; date;

CPU drop was instantaneous - there was no need to restart the server, ntpd, or any of the affected JVMs.
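For anyone puzzled by why that one-liner helps: the middle `date` invocation sets the system clock to its own current value (format MMDDhhmmCCYY.ss), and stepping the clock is what clears the kernel's stuck leap-second timer state. A hedged breakdown follows; only the timestamp construction is shown executing, since actually setting the clock requires root.

```shell
# Build the timestamp the one-liner feeds back into date(1):
# %m%d%H%M%C%y.%S => MMDDhhmmCCYY.ss, i.e. "now" in date's set-time format.
ts=$(date +%m%d%H%M%C%y.%S)
echo "$ts"

# As root, the original fix then steps the clock to that same value:
#   date "$ts"
# which is why no service or JVM restarts were needed.
```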
Re: cassandra halt after started minutes later
This looks like the problem a bunch of us were having yesterday, which isn't cleared without a reboot or a date command. It seems to be related to the leap second that was added between the 30th of June and the 1st of July. See the mailing list thread with subject "High CPU usage as of 8pm eastern time".

If you are still seeing high CPU usage and a stall after restarting Cassandra, and you are on Linux, try:

date; date `date +%m%d%H%M%C%y.%S`; date;

in a terminal and see if everything starts working again. I hope this helps.

--
David Daeschler

On Sun, Jul 1, 2012 at 11:33 AM, Yan Chunlu springri...@gmail.com wrote:

I adjusted the timezone of Java with -Duser.timezone, and the timezone of Cassandra is the same as the system's (Debian 6.0). After restarting Cassandra I found the following error message in the log file of node B; about 2 minutes later, node C stopped responding.

The error log of node B:

Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
    at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
    at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2877)
    at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

The log info on node C:

DEBUG [MutationStage:25] 2012-07-01 23:29:42,909 RowMutationVerbHandler.java (line 60) RowMutation(keyspace='spark', key='3937343836623538363837363135353264313339333463343532623634373131656462306139', modifications=[ColumnFamily(permacache [76616c7565:false:67906@1341156582948365,])]) applied. Sending response to 79529@/192.168.1.129
DEBUG [pool-2-thread-209] 2012-07-01 23:29:42,913 CassandraServer.java (line 523) insert
DEBUG [pool-2-thread-209] 2012-07-01 23:29:42,913 StorageProxy.java (line 172) Mutations/ConsistencyLevel are [RowMutation(keyspace='spark', key='636f6d6d656e74735f706172656e74735f32373232343938', modifications=[ColumnFamily(permacache [76616c7565:false:6@1341156582953843,])])]/QUORUM
DEBUG [pool-2-thread-209] 2012-07-01 23:29:42,913 StorageProxy.java (line 301) insert writing key 636f6d6d656e74735f706172656e74735f32373232343938 to /192.168.1.40
DEBUG [pool-2-thread-209] 2012-07-01 23:29:42,913 StorageProxy.java (line 301) insert writing key 636f6d6d656e74735f706172656e74735f32373232343938 to /192.168.1.129
DEBUG [Thread-8] 2012-07-01 23:29:42,913 IncomingTcpConnection.java (line 116) Version is now 3
DEBUG [RequestResponseStage:27] 2012-07-01 23:29:42,913 ResponseVerbHandler.java (line 44) Processing response on a callback from 50050@/192.168.1.129
DEBUG [Thread-12] 2012-07-01 23:29:42,914 IncomingTcpConnection.java (line 116) Version is now 3
DEBUG [RequestResponseStage:29] 2012-07-01 23:29:42,914 ResponseVerbHandler.java (line 44) Processing response on a callback from 50051@/192.168.1.40
DEBUG [Thread-11] 2012-07-01 23:29:42,939 IncomingTcpConnection.java (line 116) Version is now 3

On Sun, Jul 1, 2012 at 11:14 PM, Yan Chunlu springri...@gmail.com wrote:

I have a three-node cluster running 1.0.2. Today there was a very strange problem: suddenly two of the Cassandra nodes (let's say B and C) were using a lot of CPU; it turned out that for some reason the java binary just wouldn't run. I was using OpenJDK 1.6.0_18, so I switched to the Sun JDK, which works okay. After that, node A stopped working... same problem, so I installed the Sun JDK there too, and then it was okay. But minutes later B stopped working again: about 5-10 minutes after Cassandra started, it stopped responding to connections. I can't access 9160, and nodetool doesn't return either. I have turned on DEBUG and don't see much useful information; the last rows on node B are as below:

DEBUG [pool-2-thread-72] 2012-07-01 07:45:42,830 RowDigestResolver.java (line 65) resolving 2 responses
DEBUG [pool-2-thread-72] 2012-07-01 07:45:42,830 RowDigestResolver.java (line 106) digests verified
DEBUG [pool-2-thread-72] 2012-07-01 07:45:42,830 RowDigestResolver.java (line 110) resolve: 0 ms.
DEBUG [pool-2-thread-72] 2012-07-01 07:45:42,831 StorageProxy.java (line 694) Read: 5 ms.
DEBUG [Thread-8] 2012-07-01 07:45:42,831
Re: Oftopic: ksoftirqd after ddos take more cpu? as result cassandra latensy very high
Good afternoon,

This again looks like the leap second issue: the problem a bunch of us were having yesterday, which isn't cleared without a reboot or a date command. It seems to be related to the leap second that was added between the 30th of June and the 1st of July. See the mailing list thread with subject "High CPU usage as of 8pm eastern time".

If you are still seeing high CPU usage and a stall after restarting Cassandra, and you are on Linux, try:

date; date `date +%m%d%H%M%C%y.%S`; date;

in a terminal and see if everything starts working again. I hope this helps. Please spread the word if you see others having issues with unresponsive kernels/high CPU.

--
David Daeschler

On Sun, Jul 1, 2012 at 1:05 PM, ruslan usifov ruslan.usi...@gmail.com wrote:

Hello,

We were under a DDoS attack, and as a result we got high ksoftirqd activity, so Cassandra began answering very slowly. But when the DDoS was gone, the high ksoftirqd activity remained; it disappears when I stop the Cassandra daemon and reappears when I start the Cassandra daemon. The full resolution of the problem is a full reboot of the server. What could this be? (Why does ksoftirqd work so intensively when Cassandra is running? We disabled all working traffic to the cluster, but that didn't help, so it can't be due to heavy load.) And how can it be solved?

PS: OS ubuntu 10.0.4 (2.6.32.41), cassandra 1.0.10, java 1.6.32 (from Oracle)
Re: High CPU usage as of 8pm eastern time
YES! This happened to me as well. RUNNING THAT COMMAND FIXED THE PROBLEM! THANK YOU SO MUCH

On Sat, Jun 30, 2012 at 11:11 PM, Mina Naguib mina.nag...@bloomdigital.com wrote:

Hi folks,

Our Cassandra (and other Java-based apps) started experiencing extremely high CPU usage as of 8pm eastern time (midnight UTC). The issue appears to be related to specific versions of java + linux + ntpd. There are many solutions floating around on IRC, Twitter, StackExchange, and the LKML. The simplest one that worked for us is simply to run this command on each affected machine:

date; date `date +%m%d%H%M%C%y.%S`; date;

CPU drop was instantaneous - there was no need to restart the server, ntpd, or any of the affected JVMs.

--
David Daeschler
Re: nodetool repair -pr enough in this scenario?
Thank you for all the replies; it has been enlightening to read. I think I now have a better idea of repair, ranges, replicas, and how the data is distributed. It also seems that using -pr would be the best way to go in my scenario on 1.x+.

Thank you for all the feedback. Glad to see such an active community around Cassandra.

- David
nodetool repair -pr enough in this scenario?
Hello,

Currently I have a 4-node Cassandra cluster on CentOS 64. I have been running nodetool repair (no -pr option) on a weekly schedule like:

Host1: Tue, Host2: Wed, Host3: Thu, Host4: Fri

In this scenario, if I were to add the -pr option, would this still be sufficient to prevent forgotten deletes and properly maintain consistency?

Thank you,
- David
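For concreteness, the weekly rotation described above could be expressed as one cron entry per host. The hour, user, and paths here are illustrative assumptions, with -pr added as discussed in the replies; this is a config sketch, not a recommendation on timing.

```
# Hypothetical /etc/cron.d entry on each host (the day-of-week field
# is what differs per host):
# m  h  dom mon dow  user       command
0 2 * * 2  cassandra  /usr/bin/nodetool repair -pr    # Host1: Tuesday
0 2 * * 3  cassandra  /usr/bin/nodetool repair -pr    # Host2: Wednesday
0 2 * * 4  cassandra  /usr/bin/nodetool repair -pr    # Host3: Thursday
0 2 * * 5  cassandra  /usr/bin/nodetool repair -pr    # Host4: Friday
```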