Re: cassandra 1.0.10 : Bootstrapping 7 node cluster to 14 nodes
My guess is that 108 has become a new replica for the data on 103, 104 and 107, which is decided by your per-keyspace replica placement strategy. When we bootstrap, we do not simply stream data from 102 to 108. Rather, we calculate all the ranges that 108 is responsible for; that is, we look from the perspective of the data instead of the node.

On Fri, Nov 2, 2012 at 12:41 AM, Brennan Saeta sa...@coursera.org wrote:
> The other nodes all have copies of the same data. To optimize performance, all of them stream different parts of the data, even though 102 has all the data that 108 needs. (I think. I'm not an expert.)
> -Brennan

On Thu, Nov 1, 2012 at 9:31 AM, Ramesh Natarajan rames...@gmail.com wrote:
> I am trying to bootstrap a cassandra 1.0.10 cluster from 7 nodes to 14 nodes. My seed nodes are 101, 102, 103 and 104. Here is my initial ring:
>
> Address        DC          Rack   Status State   Load     Owns    Token
>                                                                   145835300108973627198589117470757804908
> 192.168.1.101  datacenter1 rack1  Up     Normal  8.16 GB  14.29%  0
> 192.168.1.102  datacenter1 rack1  Up     Normal  8.68 GB  14.29%  24305883351495604533098186245126300818
> 192.168.1.103  datacenter1 rack1  Up     Normal  8.45 GB  14.29%  48611766702991209066196372490252601636
> 192.168.1.104  datacenter1 rack1  Up     Normal  8.16 GB  14.29%  72917650054486813599294558735378902454
> 192.168.1.105  datacenter1 rack1  Up     Normal  8.33 GB  14.29%  97223533405982418132392744980505203272
> 192.168.1.106  datacenter1 rack1  Up     Normal  8.71 GB  14.29%  121529416757478022665490931225631504090
> 192.168.1.107  datacenter1 rack1  Up     Normal  8.41 GB  14.29%  145835300108973627198589117470757804908
>
> I add a new node 108 with the initial_token between 101 and 102. After I start bootstrapping, I see the node is placed in the correct position in the ring:
>
> Address        DC          Rack   Status State    Load       Owns    Token
>                                                                      145835300108973627198589117470757804908
> 192.168.1.101  datacenter1 rack1  Up     Normal   8.16 GB    14.29%  0
> 192.168.1.108  datacenter1 rack1  Up     Joining  114.61 KB  7.14%   12152941675747802266549093122563150409
> 192.168.1.102  datacenter1 rack1  Up     Normal   8.68 GB    7.14%   24305883351495604533098186245126300818
> 192.168.1.103  datacenter1 rack1  Up     Normal   8.4 GB     14.29%  48611766702991209066196372490252601636
> 192.168.1.104  datacenter1 rack1  Up     Normal   8.15 GB    14.29%  72917650054486813599294558735378902454
> 192.168.1.105  datacenter1 rack1  Up     Normal   8.33 GB    14.29%  97223533405982418132392744980505203272
> 192.168.1.106  datacenter1 rack1  Up     Normal   8.71 GB    14.29%  121529416757478022665490931225631504090
> 192.168.1.107  datacenter1 rack1  Up     Normal   8.41 GB    14.29%  145835300108973627198589117470757804908
>
> What puzzles me is that when I look at netstats, I see nodes 107, 104 and 103 streaming data to 108. Can someone explain why this happens? I was under the impression that only node 102 needs to split its range and send data to 108. Am I missing something?
>
> Streaming from: /192.168.1.107
> Streaming from: /192.168.1.104
> Streaming from: /192.168.1.103
>
> Thanks
> Ramesh
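The behaviour Ramesh sees can be sketched with SimpleStrategy's placement rule. This is a toy model, not Cassandra's actual code, and RF=3 is an assumption (the thread never states the keyspace's replication factor): a node is a replica for RF ranges, and each of those ranges can be streamed from a different existing replica.

```python
# Toy model of SimpleStrategy replica placement (assumed RF=3; the thread does
# not state the keyspace's RF). It illustrates why a bootstrapping node receives
# streams from several existing nodes, not just its clockwise neighbor.

def replicas(sorted_nodes, range_index, rf):
    """SimpleStrategy: a range's replicas are the rf nodes found walking
    clockwise from the node that owns the range's end token."""
    n = len(sorted_nodes)
    return [sorted_nodes[(range_index + i) % n] for i in range(rf)]

# Ring order after 108 joins between 101 and 102 (tokens abbreviated to names).
nodes = ["101", "108", "102", "103", "104", "105", "106", "107"]
rf = 3

# Every range whose replica set contains 108 must be streamed to it.
for i in range(len(nodes)):
    reps = replicas(nodes, i, rf)
    if "108" in reps:
        print(f"range ending at node {nodes[i]}: replicas {reps}")
```

With RF=3 the new node ends up in the replica sets of three different ranges, each currently served by a different set of existing nodes, which is why several nodes show up in `nodetool netstats`.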
Re: repair, compaction, and tombstone rows
Hi Sylvain, might I ask why repair cannot simply ignore anything that is older than gc_grace? (like Aaron proposed) I agree that repair should not process any tombstones. But in my mind it sounds reasonable to make repair ignore timed-out data. Because the timestamp is created on the client, there is no reason to repair these, right?

We are using TTLs quite heavily, and I noticed that every repair increases the load of all nodes by 1-2 GB, where each node has about 20-30 GB of data. I don't know if this grows with the data volume. The data is mostly time-series data. I even noticed an increase when running two repairs directly after each other. So even when data was just repaired, there is still data being transferred. I assume this is due to some columns timing out within that timeframe and the entire row being repaired.

regards,
Christian

On Thu, Nov 1, 2012 at 9:43 AM, Sylvain Lebresne sylv...@datastax.com wrote:
> > Is this a feature or a bug?
>
> Neither, really. Repair doesn't do any gcable tombstone collection, and it would be really hard to change that (besides, it's not its job). So if, when you run repair, there are sstables with tombstones that could be collected but have not been yet, then yes, they will be streamed.
>
> Now the theory is that compaction will run often enough that gcable tombstones are collected in a reasonably timely fashion, so you will never have lots of such tombstones in general (making the fact that repair streams them largely irrelevant). That being said, in practice I don't doubt that there are a few scenarios like your own where this can still lead to too much useless work. I believe the main problem is that size-tiered compaction has a tendency not to compact the largest sstables very often, meaning that you could have large sstables with mostly gcable tombstones sitting around. In the upcoming Cassandra 1.2, https://issues.apache.org/jira/browse/CASSANDRA-3442 will fix that.
> Until then, if you are not afraid of a little bit of scripting, one option would be to run a small script before each repair that checks the creation time of your sstables. If an sstable is old enough (for some value of "old enough" that depends on the TTL you use on your columns), you may want to force a compaction of that sstable using the JMX call forceUserDefinedCompaction(). The goal is to get rid of a maximum of outdated tombstones before running the repair. (You could alternatively run a major compaction prior to the repair, but major compactions have a lot of nasty effects, so I wouldn't recommend that a priori.)
>
> --
> Sylvain
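The pre-repair script Sylvain describes could look roughly like this sketch. The data directory and the age threshold are assumptions, and the actual JMX invocation of forceUserDefinedCompaction() (e.g. via a JMX client) is left out; the sketch only selects the candidate sstables.

```python
# A hedged sketch of a pre-repair script: pick out sstables whose data files
# are old enough (relative to your longest TTL) to be full of gcable
# tombstones, as candidates for the JMX call forceUserDefinedCompaction().
# The data directory and threshold below are assumptions.
import glob
import os
import time

def sstables_older_than(paths_with_mtime, now, max_age_seconds):
    """Return the paths whose modification time is older than max_age_seconds."""
    return [p for p, mtime in paths_with_mtime if now - mtime > max_age_seconds]

if __name__ == "__main__":
    data_dir = "/var/lib/cassandra/data/mykeyspace"  # hypothetical data directory
    max_age = 10 * 24 * 3600                         # e.g. longer than your longest TTL
    paths = [(p, os.path.getmtime(p))
             for p in glob.glob(os.path.join(data_dir, "*-Data.db"))]
    for path in sstables_older_than(paths, time.time(), max_age):
        # Each of these is a candidate argument for forceUserDefinedCompaction().
        print(path)
```
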
Re: Multiple counters value after restart
I ran the same CQL query against my 3 nodes (after adding the third and repairing each of them).

On the new node:

cqlsh:mykeyspace> select '20121029#myevent' from 'mycf' where key = '887#day';
 20121029#myevent
------------------
             4983

On the 2 other (old) nodes:

cqlsh:mykeyspace> select '20121029#myevent' from 'mycf' where key = '887#day';
 20121029#myevent
------------------
             4254

And the value read at CL.QUORUM is 4943, which is the good value. How is it possible that QUORUM reads 4943 with only 1 node out of 3 answering that count? How could a new node get a value that none of the other existing nodes has? Is there a way to fix the data (isn't repair supposed to do it)?

Alain

2012/11/1 Alain RODRIGUEZ arodr...@gmail.com
> > Can you try it though, or run a repair?
>
> Repairing didn't help.
>
> > My first thought is to use QUORUM
>
> This fixes the problem. However, my data is probably still inconsistent, even if I now always read the same value. The point is that I can't handle a crash with CL.QUORUM; I can't even restart a node... I will add a third server. But isn't Cassandra supposed to handle a server crash? When a server crashes, I guess it doesn't drain first...
>
> > I was asking to understand how you did the upgrade.
>
> Ok. On my side I am just concerned about the possibility of using counters with CL.ONE and correctly handling a crash or restart without a drain.
>
> Alain
>
> 2012/11/1 aaron morton aa...@thelastpickle.com
> > > What CL are you using? I think this can be what causes the issue.
> >
> > > I'm writing and reading at CL ONE. I didn't drain before stopping Cassandra, and this may have produced a failure in the current counters (those which were being written when I stopped a server).
> >
> > My first thought is to use QUORUM. But with only two nodes it's hard to get strong consistency using QUORUM. Can you try it though, or run a repair?
> >
> > > But isn't Cassandra supposed to handle a server crash? When a server crashes I guess it doesn't drain first...
> >
> > I was asking to understand how you did the upgrade.
Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 1/11/2012, at 11:39 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:
> > What version of cassandra are you using?
>
> 1.1.2
>
> > Can you explain this further?
>
> I had an unexplained amount of reads (up to 1800 r/s and 90 Mo/s) on one server; the other was doing about 200 r/s and 5 Mo/s max. I fixed it by rebooting the server. This server is dedicated to Cassandra. I can't tell you more about it because I don't get it... But a simple Cassandra restart wasn't enough.
>
> > Was something writing to the cluster?
>
> Yes, we have some activity and perform about 600 w/s.
>
> > Did you drain for the upgrade?
>
> We upgraded a long time ago, and to 1.1.2. This warning is about version 1.1.6.
>
> > What changes did you make?
>
> In cassandra.yaml I just changed the compaction_throughput_mb_per_sec property to slow down my compactions a bit. I don't think the problem comes from there.
>
> > Are you saying that a particular counter column is giving different values for different reads?
>
> Yes, this is exactly what I was saying. Sorry if something is wrong with my English; it's not my mother tongue.
>
> > What CL are you using? I think this can be what causes the issue.
>
> I'm writing and reading at CL ONE. I didn't drain before stopping Cassandra, and this may have produced a failure in the current counters (those which were being written when I stopped a server). But isn't Cassandra supposed to handle a server crash? When a server crashes I guess it doesn't drain first...
>
> Thank you for your time Aaron, once again.
>
> Alain
>
> 2012/10/31 aaron morton aa...@thelastpickle.com
> > What version of cassandra are you using?
> >
> > > I finally restarted Cassandra. It didn't solve the problem, so I stopped Cassandra again on that node and restarted my ec2 server. This solved the issue (1800 r/s to 100 r/s).
> >
> > Can you explain this further? Was something writing to the cluster? Did you drain for the upgrade? https://github.com/apache/cassandra/blob/cassandra-1.1/NEWS.txt#L17
> >
> > > Today I changed my cassandra.yaml and restarted this same server to apply my conf.
> >
> > What changes did you make?
> >
> > > I just noticed that my homepage (which uses a Cassandra counter and refreshes every sec) shows me 4 different values: 2 of them repeatedly (5000 and 4000) and the 2 others some rare times (5500 and 3800).
> >
> > Are you saying that a particular counter column is giving different values for different reads? What CL are you using?
> >
> > Cheers
> > - Aaron Morton
> > Freelance Developer
> > @aaronmorton
> > http://www.thelastpickle.com
> >
> > On 31/10/2012, at 3:39 AM, Jason Wee peich...@gmail.com wrote:
> > > Maybe enable debug in log4j-server.properties and go through the log to see what actually happens?
> > >
> > > On Tue, Oct 30, 2012 at 7:31 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:
> > > > Hi, I have an issue with counters,
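A side note on the QUORUM suggestion in this thread: with Cassandra's standard quorum formula, a 2-node cluster at CL.QUORUM cannot tolerate any replica being down, which matches Alain's observation that he "can't even restart a node". A quick sketch of the arithmetic:

```python
# Why CL.QUORUM on a 2-replica setup cannot survive a crash or restart:
# quorum = floor(RF / 2) + 1, so with RF=2 both replicas must answer,
# while RF=3 tolerates one replica being down.

def quorum(rf):
    return rf // 2 + 1

for rf in (2, 3):
    q = quorum(rf)
    print(f"RF={rf}: quorum={q}, tolerates {rf - q} replica(s) down")
```

This is why adding a third node (RF=3) lets QUORUM reads and writes keep working through a single node restart.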
Re: repair, compaction, and tombstone rows
On Fri, Nov 2, 2012 at 2:46 AM, horschi hors...@gmail.com wrote:
> might I ask why repair cannot simply ignore anything that is older than gc-grace? (like Aaron proposed) I agree that repair should not process any tombstones or anything. But in my mind it sounds reasonable to make repair ignore timed-out data. Because the timestamp is created on the client, there is no reason to repair these, right?

IIRC, tombstone timestamps are written by the server at compaction time. Therefore if you have RF=X, you have X different timestamps relative to GCGraceSeconds. I believe there was another thread about two weeks ago in which Sylvain detailed the problems with what you are proposing, when someone else asked approximately the same question.

> I even noticed an increase when running two repairs directly after each other. So even when data was just repaired, there is still data being transferred. I assume this is due to some columns timing out within that timeframe and the entire row being repaired.

Merkle trees are an optimization; what they trade for this optimization is over-repair. (FWIW, I agree that, if possible, this particular case of over-repair would be nice to eliminate.)

=Rob

--
=Robert Coli
AIM&GTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb
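The over-repair Rob describes falls out of how Merkle trees compare data. A toy sketch (not Cassandra's actual tree code): the tree stores hashes of whole ranges of rows, so one differing column, such as a freshly expired TTL, changes the hash of its entire range, and the whole range is streamed rather than just the one row.

```python
# Toy illustration of Merkle-tree over-repair: a single differing column makes
# the hash of its whole range differ, so repair streams the entire range.
import hashlib

def leaf_hash(rows):
    """Hash all rows in a range together, the way a Merkle leaf summarizes them."""
    h = hashlib.sha256()
    for key, value in sorted(rows.items()):
        h.update(f"{key}={value}".encode())
    return h.hexdigest()

node_a = {"row1": "v1", "row2": "v2", "row3": "v3"}
node_b = dict(node_a, row2="v2-expired")  # one column differs after a TTL fires

# The leaves disagree, so repair streams every row in the range, not just row2.
print("range matches:", leaf_hash(node_a) == leaf_hash(node_b))
```
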
Re: repair, compaction, and tombstone rows
> IIRC, tombstone timestamps are written by the server, at compaction time. Therefore if you have RF=X, you have X different timestamps relative to GCGraceSeconds. I believe there was another thread about two weeks ago in which Sylvain detailed the problems with what you are proposing, when someone else asked approximately the same question.

Oh yes, I forgot about that thread. I assume you are talking about: http://grokbase.com/t/cassandra/user/12ab6pbs5n/unnecessary-tombstones-transmission-during-repair-process

I think these are multiple issues that correlate with each other:

1) Repair uses the local timestamp of DeletedColumns for Merkle tree calculation. This is what the other thread was about. Alexey claims that this was fixed by some other commit: https://issues.apache.org/jira/secure/attachment/12544204/CASSANDRA-4561-CS.patch But honestly, I don't see how this solves it. I do understand how Alexey's patch from a few messages earlier would solve it (by overriding the updateDigest method in DeletedColumn).

2) ExpiringColumns should not be used for Merkle tree calculation if they are timed out. I checked LazilyCompactedRow and saw that it does not exclude any timed-out columns: it loops over all columns and calls updateDigest on them, without any condition. Imho ExpiringColumn.updateDigest() should check its own isMarkedForDelete() first before making any digest changes. (We cannot simply call isMarkedForDelete from LazilyCompactedRow, because we don't want this behaviour for DeletedColumns.)

3) Cassandra should not create tombstones for expiring columns. I am not 100% sure, but it looks to me like Cassandra creates tombstones for expired ExpiringColumns. This makes me wonder whether we could delete expired columns directly. The digests for an ExpiringColumn and a DeletedColumn can never match, due to the different implementations, so there will always be a repair if compactions are not synchronous across nodes. Imho it should be valid to delete ExpiringColumns directly, because the TTL is given by the client and should pass on all nodes at the same time.

All together, this should reduce over-repair.

> Merkle trees are an optimization; what they trade for this optimization is over-repair. (FWIW, I agree that, if possible, this particular case of over-repair would be nice to eliminate.)

Of course, rather over-repair than corrupt something.
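Point 2 above can be illustrated with a small sketch (in Python, not Cassandra's Java; the column layout is invented): if expired columns are skipped when computing the digest, two replicas agree even when only one of them has already compacted the expired column away.

```python
# Sketch of the proposal that ExpiringColumn.updateDigest() should skip
# timed-out columns: with skipping enabled, a replica that still holds an
# expired column and one that has already compacted it away produce the same
# digest, so repair does not stream the row.
import hashlib

def digest(columns, now, skip_expired):
    """columns: list of (name, value, expires_at or None)."""
    h = hashlib.sha256()
    for name, value, expires_at in columns:
        if skip_expired and expires_at is not None and expires_at <= now:
            continue  # the proposed isMarkedForDelete()-style check
        h.update(f"{name}:{value}".encode())
    return h.hexdigest()

now = 1000
replica_a = [("c1", "v1", None), ("c2", "v2", 900)]  # c2's TTL passed, not compacted yet
replica_b = [("c1", "v1", None)]                     # c2 already compacted away

# Without the check the digests differ and the row gets repaired needlessly;
# with the check they match.
print("match without skip:", digest(replica_a, now, False) == digest(replica_b, now, False))
print("match with skip:   ", digest(replica_a, now, True) == digest(replica_b, now, True))
```
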
Re: distribution of token ranges with virtual nodes
On Fri, Nov 2, 2012 at 12:38 AM, Manu Zhang owenzhang1...@gmail.com wrote:
> > It splits into a contiguous range, because truly upgrading to vnode functionality is another step.
>
> That confuses me. As I understand it, there is no point in having 256 tokens on the same node if I don't commit the shuffle.

This isn't exactly true. By-partition operations (think repair, streaming, etc.) will be more reliable, in the sense that if they fail and need to be restarted, there is less that is lost and needs redoing. Also, if all you did was migrate from 1 token per node to 256 contiguous tokens per node, normal topology changes (bootstrapping new nodes, decommissioning old ones) would gradually work to redistribute the partitions. And, from a topology perspective, splitting the one partition into many contiguous partitions is a no-op; it's safe to do and there is no cost to speak of from a computational or IO perspective.

On the other hand, shuffling requires moving tokens around the cluster. If you completely randomize placement, it follows that you will need to relocate all of the cluster's data, so it's quite costly. It's also precedent-setting, and not thoroughly tested yet.

--
Eric Evans
Acunu | http://www.acunu.com | @acunu
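The contiguous split Eric describes as a topology no-op can be sketched like this (a toy model; the token values and the even-split rule are assumptions, not Cassandra's exact migration code): the node's single range is cut into 256 adjacent sub-ranges that together cover exactly the same tokens, so placement is unchanged, but later operations get smaller units to work with.

```python
# Toy sketch of splitting one node's token range into 256 contiguous
# sub-ranges. Coverage and ownership are unchanged; only the granularity
# of per-range operations (repair, streaming) improves.

def split_range(start, end, parts=256):
    """Split (start, end] into `parts` contiguous sub-ranges."""
    width = (end - start) // parts
    bounds = [start + i * width for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))

subranges = split_range(0, 2**127, 256)
print(len(subranges), "sub-ranges; first:", subranges[0], "last:", subranges[-1])
```
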
how to detect stream closed in twitter/cassandra api?
Hi all, I'm using the twitter/cassandra Ruby client, trying to pool a connection in a static variable:

@@client = Cassandra.new(keyspace, host,
                         :retries => retries,
                         :connect_timeout => connect_timeout,
                         :timeout => timeout,
                         :exception_classes => [])

but the connection returns a "stream closed" error after a while. Is there a method in the twitter/cassandra client to detect this state? Thank you.

Yuhan
Insert via CQL
Hi, any idea how to insert into a column family, for a column of type blob, via a CQL query?

-Vivek
Re: Insert via CQL
On Fri, Nov 2, 2012 at 8:09 PM, Vivek Mishra mishra.v...@gmail.com wrote:
> any idea, how to insert into a column family for a column of type blob via cql query?

Yes, most of them involve binary data that is hex-encoded ascii. :)

--
Eric Evans
Acunu | http://www.acunu.com | @acunu
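For illustration, a hedged sketch of building such an insert by hex-encoding the bytes first. The column family and column names below are invented, and the `0x...` literal is the CQL 3 blob form; the exact literal syntax depends on the CQL version in use.

```python
# Sketch: hex-encode raw bytes into the "hex-encoded ascii" form Eric mentions,
# then splice it into a CQL INSERT as a blob literal. Table and column names
# ('mycf', 'key', 'data') are hypothetical.
raw = b"hello blob"
hex_value = raw.hex()

cql = f"INSERT INTO mycf (key, data) VALUES ('row1', 0x{hex_value});"
print(cql)
```
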