Re: I have a deaf node?

2014-06-01 Thread David Daeschler
I wouldn't worry unless it changes from "deaf" to "deadbeef".


On Sun, Jun 1, 2014 at 11:34 PM, Tim Dunphy bluethu...@gmail.com wrote:

 This post should definitely make it to the hall of fame!! :)


 My proudest accomplishment on the list. heh


 On Sun, Jun 1, 2014 at 11:24 PM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

This post should definitely make it to the hall of fame!! :)


 On Mon, Jun 2, 2014 at 12:05 AM, Tim Dunphy bluethu...@gmail.com wrote:

 That made my day. Not to worry though, unless you start seeing the
 number 23 in your host IDs.


 Yeah man, glad to provide some comic relief to the list! ;)


 On Sun, Jun 1, 2014 at 11:01 PM, Apostolis Xekoukoulotakis 
 xekou...@gmail.com wrote:

 That made my day. Not to worry though, unless you start seeing the
 number 23 in your host IDs.
  On Jun 2, 2014 12:40 AM, Kevin Burton bur...@spinn3r.com wrote:

 could be worse… it could be under-caffeinated and say "decafbad"…


 On Sat, May 31, 2014 at 10:45 AM, Tim Dunphy bluethu...@gmail.com
 wrote:

 I think the "deaf" thing is just the ending of the host ID in
 hexadecimal. It's an extraordinary coincidence that it ends with DEAF :D


 Hah.. yeah that thought did cross my mind.  :)



 On Sat, May 31, 2014 at 1:35 PM, DuyHai Doan doanduy...@gmail.com
 wrote:

 I think the "deaf" thing is just the ending of the host ID in
 hexadecimal. It's an extraordinary coincidence that it ends with DEAF :D


 On Sat, May 31, 2014 at 6:38 PM, Tim Dunphy bluethu...@gmail.com
 wrote:

 I didn't realize cassandra nodes could develop hearing problems. :)


 But I have a dead node in my cluster I would like to get rid of.

 [root@beta:~] # nodetool status
 Datacenter: datacenter1
 =======================
 Status=Up/Down
 |/ State=Normal/Leaving/Joining/Moving
 --  Address     Load      Tokens  Owns   Host ID                               Rack
 UN  10.10.1.94  199.6 KB  256     49.4%  fd2f76ae-8dcf-4e93-a37f-bf1e9088696e  rack1
 DN  10.10.1.64  ?         256     50.6%  f2a48fc7-a362-43f5-9061-4bb3739fdeaf  rack1

 I was just wondering what this could indicate, and whether it might
 mean more trouble than I bargained for in getting rid of it.

 I've made a couple of attempts to get rid of this so far. I'm about
 to try again.
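
 For anyone hitting the same situation, a removal sketch (not from this
 thread; it assumes Cassandra 1.2+ with vnodes, which the 256 tokens in
 the status output suggest — older versions used "nodetool removetoken"
 instead):

```shell
# Remove a node stuck in DN state by its Host ID, i.e. the ID printed
# in the "nodetool status" output above.
nodetool removenode f2a48fc7-a362-43f5-9061-4bb3739fdeaf

# If the removal hangs, check progress; "force" is a last resort since
# it skips re-streaming the dead node's data to the remaining replicas.
nodetool removenode status
nodetool removenode force
```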

 Thanks
 Tim

 --
 GPG me!!

 gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B









 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations
 are people.








 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200








-- 
David Daeschler


Strange trouble with JNA on CentOS6

2014-05-26 Thread David Daeschler
Good evening list,

I'm having a rather confusing issue with 4 new nodes that I put up. JNA
doesn't seem to want to work in production even though it works on my test
VM that should have pretty much the same setup.

I have created a series of setup scripts to bring up new nodes. I am using
a tested and tried configuration that has been running very well in
production for a while and in all cases the following holds true:

OS: CentOS6 with latest updates 2.6.32-431.17.1.el6.x86_64
Java: Java(TM) SE Runtime Environment (build 1.6.0_30-b12)
Cassandra: 1.0.12
JNA: Tried both jna-3.5.1.jar, and jna-4.1.0.jar

In both my old production servers and my test setup (which I can recreate
over and over and get the same results), startup yields:

CLibrary.java (line 109) JNA mlockall successful

However, for some reason on the NEW production boxes, I get:

Unable to link C library. Native methods will be disabled


The JNA libs are exact copies and part of our setup scripts.

There are only 2 major differences I can think of. The first is the
hardware. All are 64-bit platforms, but the newer servers have 48 GB of RAM
and newer processors than the older boxes and my test VM. I don't think
that would have anything to do with it, but it's worth mentioning. Both
production sets are Xeons, though.

The second difference is that the new image would've been set up by
Rackspace while my test uses an older image and brings it up to date with
yum. However, as part of setup I apply yum update to all of them, so they
should be running at least the same versions of whatever they have in
common.

Another interesting note is that while looking for jna on the disk, I
also see what looks like an ELF binary in /tmp named
/tmp/jna-1073564104/jna4691935553862497129.tmp.


Thank you ahead of time for any help you may be able to offer.

-- 
David Daeschler


Re: Strange trouble with JNA on CentOS6

2014-05-26 Thread David Daeschler
Figured this one out..

In the new setup /tmp was mounted noexec, so it looks like JNA was
putting the native library there and then was unable to load it.

I hope this can help someone else.
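
A sketch of the two usual workarounds (paths and option names here are
illustrative assumptions, not from the original post; verify against your
own setup):

```shell
# Option 1 (assumes root): remount /tmp with exec so JNA can map the
# native library it unpacks there.
mount -o remount,exec /tmp

# Option 2: keep /tmp noexec and point the JVM/JNA at a directory that
# allows execution, e.g. by adding to cassandra-env.sh (the path is an
# example):
#   JVM_OPTS="$JVM_OPTS -Djava.io.tmpdir=/var/lib/cassandra/tmp"
#   JVM_OPTS="$JVM_OPTS -Djna.tmpdir=/var/lib/cassandra/tmp"
mkdir -p /var/lib/cassandra/tmp
```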



On Mon, May 26, 2014 at 11:33 PM, David Daeschler david.daesch...@gmail.com
 wrote:

 Good evening list,

 I'm having a rather confusing issue with 4 new nodes that I put up. JNA
 doesn't seem to want to work in production even though it works on my test
 VM that should have pretty much the same setup.

 I have created a series of setup scripts to bring up new nodes. I am using
 a tested and tried configuration that has been running very well in
 production for a while and in all cases the following holds true:

 OS: CentOS6 with latest updates 2.6.32-431.17.1.el6.x86_64
 Java: Java(TM) SE Runtime Environment (build 1.6.0_30-b12)
 Cassandra: 1.0.12
 JNA: Tried both jna-3.5.1.jar, and jna-4.1.0.jar

 In both my old production servers and my test setup (which I can recreate
 over and over and get the same results), startup yields:

 CLibrary.java (line 109) JNA mlockall successful

 However, for some reason on the NEW production boxes, I get:

 Unable to link C library. Native methods will be disabled


 The JNA libs are exact copies and part of our setup scripts.

 There are only 2 major differences I can think of. The first is the
 hardware. All are 64-bit platforms, but the newer servers have 48 GB of RAM
 and newer processors than the older boxes and my test VM. I don't think
 that would have anything to do with it, but it's worth mentioning. Both
 production sets are Xeons, though.

 The second difference is that the new image would've been set up by
 Rackspace while my test uses an older image and brings it up to date with
 yum. However, as part of setup I apply yum update to all of them, so they
 should be running at least the same versions of whatever they have in
 common.

 Another interesting note is that while looking for jna on the disk, I
 also see what looks like an ELF binary in /tmp named
 /tmp/jna-1073564104/jna4691935553862497129.tmp.


 Thank you ahead of time for any help you may be able to offer.

 --
 David Daeschler




-- 
David Daeschler


Re: What % of cassandra developers are employed by Datastax?

2014-05-23 Thread David Daeschler
Datastax have also gone far out of their way to support companies using
Cassandra, regardless of whether they are using DSE or not. We are not part
of any paid agreement with the company, and I had a Datastax employee
sending me texts late on a weekend to help me through an issue I was having
with my cluster.

Actions like this demonstrate to me a commitment by Datastax to provide
support and warm fuzzies to the entire community. Having them around has
eased many of my concerns for running Cassandra on a limited small business
budget.


On Fri, May 23, 2014 at 1:42 PM, Robert Coli rc...@eventbrite.com wrote:

 On Fri, May 16, 2014 at 10:56 AM, Kevin Burton bur...@spinn3r.com wrote:

 Perhaps because the developers are working on DSE :-P


 FWIW, and I am not necessarily known for being the biggest defender of
 Datastax and the relationship of their commercial interests to the
 architectural direction of Cassandra...

 ... Datastax contribute a huge amount of work back to Apache Cassandra by
 the simple method of operating on Cassandra issues raised by DSE customers
 inside of the Apache JIRA. If they are not out there selling DSE to these
 (often quite large) customers, there is not necessarily a commercial driver
 for fixing these Cassandra issues.

 The DSE features are mostly isolated enough from the main codebase, and
 from my understanding the teams are separated enough internally, that I'm
 pretty confident Datastax is a significant net positive for contribution to
 and promotion of Apache Cassandra.

 =Rob




-- 
David Daeschler


Re: Hinted Handoff runs every ten minutes

2012-10-18 Thread David Daeschler
Hi Steve,

Also confirming this. After having a node go down on Cassandra 1.0.8,
there seems to be a hinted handoff between two of our 4 nodes every 10
minutes. Ours also shows 0 rows. It does not appear to have any effect
on the operation of the ring; it just fills up the log files.

- David
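
An untested workaround sketch for stopping the empty 10-minute handoff
loop: delete the stale hints for the endpoint via JMX. The MBean and
operation names below are assumptions for 1.0/1.1-era clusters (confirm
them in jconsole first), and jmxterm is a third-party JMX CLI:

```shell
# Connect to the node's JMX port and invoke deleteHintsForEndpoint for
# the endpoint the handoffs keep targeting (IP taken from the logs).
java -jar jmxterm.jar -l localhost:7199 <<'EOF'
run -b org.apache.cassandra.db:type=HintedHandoffManager deleteHintsForEndpoint 192.168.128.136
EOF
```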



On Thu, Oct 18, 2012 at 2:10 PM, Stephen Pierce spie...@verifyle.com wrote:
 I installed Cassandra on three nodes. I then ran a test suite against them
 to generate load. The test suite is designed to generate the same type of
 load that we plan to have in production. As one of many tests, I reset one
 of the nodes to check the failure/recovery modes.  Cassandra worked just
 fine.



 I stopped the load generation, and got distracted with some other
 project/problem. A few days later, I noticed something strange on one of the
 nodes. On this node hinted handoff starts every ten minutes, and while it
 seems to finish without any errors, it will be started again in ten minutes.
 None of the nodes has any traffic, and hasn’t for several days. I checked
 the logs, and this goes back to the initial failure/recovery testing:



 INFO [HintedHandoff:1] 2012-10-18 10:19:26,618 HintedHandOffManager.java
 (line 294) Started hinted handoff for token:
 113427455640312821154458202477256070484 with IP: /192.168.128.136

 INFO [HintedHandoff:1] 2012-10-18 10:19:26,779 HintedHandOffManager.java
 (line 390) Finished hinted handoff of 0 rows to endpoint /192.168.128.136

 INFO [HintedHandoff:1] 2012-10-18 10:29:26,622 HintedHandOffManager.java
 (line 294) Started hinted handoff for token:
 113427455640312821154458202477256070484 with IP: /192.168.128.136

 INFO [HintedHandoff:1] 2012-10-18 10:29:26,735 HintedHandOffManager.java
 (line 390) Finished hinted handoff of 0 rows to endpoint /192.168.128.136

 INFO [HintedHandoff:1] 2012-10-18 10:39:26,624 HintedHandOffManager.java
 (line 294) Started hinted handoff for token:
 113427455640312821154458202477256070484 with IP: /192.168.128.136

 INFO [HintedHandoff:1] 2012-10-18 10:39:26,751 HintedHandOffManager.java
 (line 390) Finished hinted handoff of 0 rows to endpoint /192.168.128.136



 The other nodes are happy and don’t show this behavior. All the test data is
 readable, and everything is fine, but I’m curious why hinted handoff is
 running on one node all the time.



 I searched the bug database, and I found a bug that seems to have the same
 symptoms:

 https://issues.apache.org/jira/browse/CASSANDRA-3733

 Although it’s been marked fixed in 0.6, this describes my problem exactly.



 I’m running Cassandra 1.1.5 from Datastax on Centos 6.0:

 http://rpm.datastax.com/community/noarch/apache-cassandra11-1.1.5-1.noarch.rpm



 Is anyone else seeing this behavior? What can I do to provide more
 information?



 Steve





-- 
David Daeschler


Nodetool repair, exit code/status?

2012-10-08 Thread David Daeschler
Hello.

In the process of trying to streamline and provide better reporting
for various data storage systems, I've realized that although we're
verifying that nodetool repair runs, we're not verifying that it is
successful.

I found a bug relating to the exit code for nodetool repair, where, in
some situations, there is no way to verify the repair has completed
successfully: https://issues.apache.org/jira/browse/CASSANDRA-2666

Is this still a problem? What is the best way to monitor the final
status of the repair command to make sure all is well?
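
Absent a trustworthy exit code, one common workaround is to check both the
exit status and the server log after the run. A sketch (the keyspace name,
log path, and grep pattern are assumptions, not from this thread):

```shell
#!/bin/sh
# Fail loudly if the repair exits non-zero...
if ! nodetool repair -pr my_keyspace; then
    echo "nodetool repair exited non-zero" >&2
    exit 1
fi

# ...and also if the Cassandra log recorded a repair/streaming failure,
# since nodetool may exit 0 even when a repair session died.
if grep -Ei "repair.*(failed|error)" /var/log/cassandra/system.log; then
    echo "possible repair failure found in system.log" >&2
    exit 1
fi
echo "repair completed with no errors detected"
```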


Thank you ahead of time for any info.
- David


Re: High CPU usage as of 8pm eastern time

2012-07-01 Thread David Daeschler
More information for others that were affected.

Our installation of java:

[root@inv4 conf]# java -version
java version 1.6.0_30
Java(TM) SE Runtime Environment (build 1.6.0_30-b12)
Java HotSpot(TM) 64-Bit Server VM (build 20.5-b03, mixed mode)

[root@inv4 conf]# uname -a
Linux inv4 2.6.32-220.4.2.el6.x86_64 #1 SMP Tue Feb 14 04:00:16 GMT
2012 x86_64 x86_64 x86_64 GNU/Linux

Jonathan pointed out a Linux bug that may be related:
https://issues.apache.org/jira/browse/CASSANDRA-4066

In my case only the Java process went nuts, as seems to be the case in
many other reports:
https://bugzilla.mozilla.org/show_bug.cgi?id=769972
http://www.wired.com/wiredenterprise/2012/07/leap-second-bug-wreaks-havoc-with-java-linux/

I hope everyone got enough sleep!
- David


On Sun, Jul 1, 2012 at 4:49 AM, Hontvári József Levente
hontv...@flyordie.com wrote:
 Thank you for the mail. Same here, but I restarted the affected server
 before I noticed your mail.

 It affected both OpenJDK Java 6 (packaged with Ubuntu 10.04) and Oracle
 Java 7 processes. Ubuntu 32-bit servers had no issues; only a 64-bit
 machine was affected.

 Likely it is related to the leap second introduced today.


 On 2012.07.01. 5:11, Mina Naguib wrote:

 Hi folks

 Our cassandra (and other java-based apps) started experiencing extremely
 high CPU usage as of 8pm eastern time (midnight UTC).

 The issue appears to be related to specific versions of java + linux +
 ntpd

 There are many solutions floating around on IRC, twitter, stackexchange,
 LKML.

 The simplest one that worked for us is simply to run this command on each
 affected machine:

 date; date `date +%m%d%H%M%C%y.%S`; date;

 CPU drop was instantaneous - there was no need to restart the server,
 ntpd, or any of the affected JVMs.








Re: cassandra halt after started minutes later

2012-07-01 Thread David Daeschler
This looks like the problem a bunch of us were having yesterday, which
isn't cleared without a reboot or a date command. It seems to be
related to the leap second that was added between the 30th of June and
the 1st of July.

See the mailing list thread with subject "High CPU usage as of 8pm eastern time".

If you are still seeing high CPU usage and a stall after restarting
Cassandra, and you are on Linux, try:

date; date `date +%m%d%H%M%C%y.%S`; date;

in a terminal and see if everything starts working again.

I hope this helps.
-- 
David Daeschler



On Sun, Jul 1, 2012 at 11:33 AM, Yan Chunlu springri...@gmail.com wrote:
 I adjusted the timezone of Java with -Duser.timezone; the timezone of
 Cassandra is the same as the system's (Debian 6.0).

 After restarting Cassandra, I found the following error message in the
 log file of node B. About 2 minutes later, node C stopped responding.

 The error log of node B:

 Thrift transport error occurred during processing of message.
 org.apache.thrift.transport.TTransportException
 at
 org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
 at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
 at
 org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
 at
 org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
 at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
 at
 org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
 at
 org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
 at
 org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
 at
 org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2877)
 at
 org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)



 the log info in node C:


 DEBUG [MutationStage:25] 2012-07-01 23:29:42,909 RowMutationVerbHandler.java
 (line 60) RowMutation(keyspace='spark',
 key='3937343836623538363837363135353264313339333463343532623634373131656462306139',
 modifications=[ColumnFamily(permacache
 [76616c7565:false:67906@1341156582948365,])]) applied.  Sending response to
 79529@/192.168.1.129
 DEBUG [pool-2-thread-209] 2012-07-01 23:29:42,913 CassandraServer.java (line
 523) insert
 DEBUG [pool-2-thread-209] 2012-07-01 23:29:42,913 StorageProxy.java (line
 172) Mutations/ConsistencyLevel are [RowMutation(keyspace='spark',
 key='636f6d6d656e74735f706172656e74735f32373232343938',
 modifications=[ColumnFamily(permacache
 [76616c7565:false:6@1341156582953843,])])]/QUORUM
 DEBUG [pool-2-thread-209] 2012-07-01 23:29:42,913 StorageProxy.java (line
 301) insert writing key 636f6d6d656e74735f706172656e74735f32373232343938 to
 /192.168.1.40
 DEBUG [pool-2-thread-209] 2012-07-01 23:29:42,913 StorageProxy.java (line
 301) insert writing key 636f6d6d656e74735f706172656e74735f32373232343938 to
 /192.168.1.129
 DEBUG [Thread-8] 2012-07-01 23:29:42,913 IncomingTcpConnection.java (line
 116) Version is now 3
 DEBUG [RequestResponseStage:27] 2012-07-01 23:29:42,913
 ResponseVerbHandler.java (line 44) Processing response on a callback from
 50050@/192.168.1.129
 DEBUG [Thread-12] 2012-07-01 23:29:42,914 IncomingTcpConnection.java (line
 116) Version is now 3
 DEBUG [RequestResponseStage:29] 2012-07-01 23:29:42,914
 ResponseVerbHandler.java (line 44) Processing response on a callback from
 50051@/192.168.1.40
 DEBUG [Thread-11] 2012-07-01 23:29:42,939 IncomingTcpConnection.java (line
 116) Version is now 3



 On Sun, Jul 1, 2012 at 11:14 PM, Yan Chunlu springri...@gmail.com wrote:

 I have a three-node cluster running 1.0.2. Today there's a very strange
 problem: suddenly two of the Cassandra nodes (let's say B and C) were
 costing a lot of CPU, and it turned out that for some reason the java
 binary just wouldn't run. I am using OpenJDK 1.6.0_18, so I switched to
 the Sun JDK, which works okay.

 After that, node A stopped working... same problem, so I installed the
 Sun JDK, and then it was okay. But minutes later B stopped working
 again: about 5-10 minutes after Cassandra started, it stopped accepting
 connections. I can't access 9160 and nodetool doesn't return either.

 I have turned on DEBUG and don't see much useful information; the last
 rows on node B are as below:
 DEBUG [pool-2-thread-72] 2012-07-01 07:45:42,830 RowDigestResolver.java
 (line 65) resolving 2 responses
 DEBUG [pool-2-thread-72] 2012-07-01 07:45:42,830 RowDigestResolver.java
 (line 106) digests verified
 DEBUG [pool-2-thread-72] 2012-07-01 07:45:42,830 RowDigestResolver.java
 (line 110) resolve: 0 ms.
 DEBUG [pool-2-thread-72] 2012-07-01 07:45:42,831 StorageProxy.java (line
 694) Read: 5 ms.
 DEBUG [Thread-8] 2012-07-01 07:45:42,831

Re: Oftopic: ksoftirqd after ddos take more cpu? as result cassandra latensy very high

2012-07-01 Thread David Daeschler
Good afternoon,

This again looks like it could be the leap second issue:

This looks like the problem a bunch of us were having yesterday, which
isn't cleared without a reboot or a date command. It seems to be
related to the leap second that was added between the 30th of June and
the 1st of July.

See the mailing list thread with subject "High CPU usage as of 8pm eastern time".

If you are still seeing high CPU usage and a stall after restarting
Cassandra, and you are on Linux, try:

date; date `date +%m%d%H%M%C%y.%S`; date;

in a terminal and see if everything starts working again.

I hope this helps. Please spread the word if you see others having
issues with unresponsive kernels/high CPU.

-- 
David Daeschler



On Sun, Jul 1, 2012 at 1:05 PM, ruslan usifov ruslan.usi...@gmail.com wrote:
 Hello

 We were under a DDoS attack, and as a result we got high ksoftirqd
 activity, which made Cassandra answer very slowly. But after the DDoS
 was over, the high ksoftirqd activity persisted. It disappears when I
 stop the Cassandra daemon and comes back when I start it again; the
 only full resolution of the problem is a full reboot of the server.
 What could this be (why does ksoftirqd work so intensively while
 Cassandra is running? We disabled all working traffic to the cluster,
 but that didn't help, so it can't be due to heavy load), and how can we
 solve it?

 PS:
  OS ubuntu 10.0.4 (2.6.32.41)
  cassandra 1.0.10
  java 1.6.32 (from oracle)


Re: High CPU usage as of 8pm eastern time

2012-06-30 Thread David Daeschler
YES!

This happened to me as well. RUNNING THAT COMMAND FIXED THE PROBLEM!

THANK YOU SO MUCH



On Sat, Jun 30, 2012 at 11:11 PM, Mina Naguib
mina.nag...@bloomdigital.com wrote:

 Hi folks

 Our cassandra (and other java-based apps) started experiencing extremely high 
 CPU usage as of 8pm eastern time (midnight UTC).

 The issue appears to be related to specific versions of java + linux + ntpd

 There are many solutions floating around on IRC, twitter, stackexchange, LKML.

 The simplest one that worked for us is simply to run this command on each 
 affected machine:

 date; date `date +%m%d%H%M%C%y.%S`; date;

 CPU drop was instantaneous - there was no need to restart the server, ntpd, 
 or any of the affected JVMs.






-- 
David Daeschler


Re: nodetool repair -pr enough in this scenario?

2012-06-05 Thread David Daeschler
Thank you for all the replies. It has been enlightening to read. I think I
now have a better idea of repair, ranges, replicas and how the data is
distributed. It also seems that using -pr would be the best way to go in my
scenario with 1.x+


Thank you for all the feedback. Glad to see such an active community around
Cassandra.
- David


nodetool repair -pr enough in this scenario?

2012-06-04 Thread David Daeschler
Hello,

Currently I have a 4 node cassandra cluster on CentOS64. I have been
running nodetool repair (no -pr option) on a weekly schedule like:

Host1: Tue, Host2: Wed, Host3: Thu, Host4: Fri

In this scenario, if I were to add the -pr option, would this still be
sufficient to prevent forgotten deletes and properly maintain consistency?

Thank you,
- David
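
The staggered schedule above could be expressed as one crontab entry per
host (the times are illustrative; the constraint to keep in mind is that
with -pr every node's repair must complete within gc_grace_seconds):

```shell
# On Host1 (Tuesday 02:00); Host2 would use day-of-week 3, Host3 4, etc.
0 2 * * 2  nodetool repair -pr >> /var/log/cassandra/repair-cron.log 2>&1
```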