Re: Does replicate_on_write=true imply that CL.QUORUM for reads is unnecessary?
This is incorrect. IMO that page is misleading. replicate_on_write should normally always be turned on, or the change will only be recorded on one node. Replicate-on-write is asynchronous with respect to the request and doesn't affect consistency level at all.

On Wed, May 29, 2013 at 7:32 PM, Andrew Bialecki andrew.biale...@gmail.com wrote: To answer my own question, directly from the docs: http://www.datastax.com/docs/1.0/configuration/storage_configuration#replicate-on-write. It appears the answer to this is: yes, CL.QUORUM isn't necessary for reads. Essentially, replicate_on_write sets the CL to ALL regardless of what you actually set it to (and for good reason).

On Wed, May 29, 2013 at 9:47 AM, Andrew Bialecki andrew.biale...@gmail.com wrote: Quick question about counter columns. In looking at the replicate_on_write setting, assuming you go with the default of true, my understanding is that it writes the increment to all replicas on any increment. If that's the case, doesn't that mean there's no point in using CL.QUORUM for reads, because all replicas have the same values? Similarly, what effect does read_repair_chance have on counter columns, since they shouldn't need to read repair on write? In anticipation of a possible answer - that both CL.QUORUM for reads and read_repair_chance only end up mattering for counter deletions - it's safe to only use CL.ONE and disable the read repair if we're never deleting counters. (And, of course, if we did start deleting counters, we'd need to revert those client and column family changes.)

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Read IO
Is this correct ? Yes, at least under optimal conditions and assuming a reasonably sized row. Things like read-ahead (at the kernel level) will play into it; and if your read (even if assumed to be small) straddles two pages you might or might not take another read depending on your kernel settings (typically trading pollution of page cache vs. number of I/O:s). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Simulating a failed node
Operation [158320] retried 10 times - error inserting key 0158320 ((UnavailableException)) This means that at the point where the thrift request to write data was handled, the co-ordinator node (the one your client is connected to) believed that, among the replicas responsible for the key, too many were down to satisfy the consistency level. The most likely causes are that you're in fact not using RF 2 (e.g., the RF is really 1 for the keyspace you're inserting into), or you're in fact not using ONE. I'm sure my naive setup is flawed in some way, but what I was hoping for was when the node went down it would fail to write to the downed node and instead write to one of the other nodes in the cluster. So question is why are writes failing even after a retry? It might be that the stress client doesn't pool connections. Writes always go to all responsible replicas that are up, and when enough return (according to consistency level), the insert succeeds. If replicas fail to respond you may get a TimeoutException. UnavailableException means it didn't even try, because it didn't have enough replicas to write to. (Note though: reads are a bit of a different story, and if you want to test behavior when nodes go down I suggest including them. See CASSANDRA-2540 and CASSANDRA-3927.) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
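The co-ordinator's availability check can be sketched as a tiny model (this is illustrative Python, not Cassandra's actual code; the function names and the simple RF/CL table are my own):

```python
# Illustrative model of the co-ordinator's availability check -- not
# Cassandra's actual code; names and the RF/CL table are made up.
def required_replicas(consistency_level, replication_factor):
    return {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }[consistency_level]

def check_available(live_replica_count, consistency_level, replication_factor):
    """Raise before even attempting the write if too few replicas are
    believed up -- this is the UnavailableException case."""
    needed = required_replicas(consistency_level, replication_factor)
    if live_replica_count < needed:
        raise RuntimeError(
            "UnavailableException: need %d live replicas, have %d"
            % (needed, live_replica_count))

# With RF=2 and CL.ONE, one live replica suffices; with RF=1, the only
# replica being down makes the key unavailable even at ONE.
check_available(1, "ONE", 2)
```

This also shows why a retry against the same co-ordinator doesn't help: the failure is based on the co-ordinator's view of which replicas are up, not on any attempted write.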
Re: Java 7 support?
FWIW, we're using openjdk7 on most of our clusters. For those where we are still on openjdk6, it's not because of an issue - just haven't gotten to rolling out the upgrade yet. We haven't had any issues that I recall with upgrading the JDK. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: nodetool cleanup
On Oct 22, 2012 11:54 AM, B. Todd Burruss bto...@gmail.com wrote: does nodetool cleanup perform a major compaction in the process of removing unwanted data? No.
Re: Why data tripled in size after repair?
It looks like what I need. A couple of questions: does it work with RandomPartitioner only? I use ByteOrderedPartitioner. I believe it should work with BOP, based on a cursory re-examination of the patch. I could be wrong. I don't see it as part of any release. Am I supposed to build my own version of Cassandra? It's in the 1.1 branch; I don't remember if it went into a release yet. If not, it'll be in the next 1.1.x release. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Why data tripled in size after repair?
What is strange: every time I run repair, the data takes almost 3 times more space - 270G; then I run compaction and get 100G back. https://issues.apache.org/jira/browse/CASSANDRA-2699 outlines the main issues with repair. In short, in your case the limited granularity of merkle trees is causing too much data to be streamed (effectively duplicate data). https://issues.apache.org/jira/browse/CASSANDRA-3912 may be a bandaid for you in that it allows granularity to be much finer, and the process to be more incremental. A 'nodetool compact' decreases disk space temporarily, as you have noticed, but it may also have a long-term negative effect on steady-state disk space usage depending on your workload. If you've got a workload that's not limited to insertions only (i.e., you have overwrites/deletes), a major compaction will tend to push steady-state disk space usage up - because you're creating a single sstable bigger than what would normally happen, and it takes more total disk space before it will be part of a compaction again. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: JVM 7, Cass 1.1.1 and G1 garbage collector
It is not a total waste, but practically your time is better spent in other places. The problem is that just about everything is a moving target: schema, request rate, hardware. Generally tuning nudges a couple of variables in one direction or the other and you see some decent returns. But each nudge takes a restart and a warm-up period, and with how Cassandra distributes requests you likely have to flip several nodes or all of them before you can see the change! By the time you do that it's probably a different day or week. Essentially, finding out if one setting is better than the other is like a 3-day test in production. Before C* I used to deal with this in Tomcat. Once in a while we would get a dev that read some article about tuning, something about a new JVM, or a collector. With bright-eyed enthusiasm they would want to try tuning our current cluster. They would spend a couple of days, measure something, and say it was good: lower memory usage. Meanwhile someone else would come to me and report higher 95th percentile response times. More short pauses, fewer long pauses; great taste, less filling. That's why blind blackbox testing isn't the way to go. Understanding what the application does, what the GC does, and the goals you have in mind is more fruitful. For example, are you trying to improve p99? Maybe you want to improve p999 at the cost of worse p99? What about failure modes (non-happy cases)? Perhaps you don't care about few-hundred-ms pauses but want to avoid full gc:s? There are lots of different goals one might have, and workloads. Testing is key, but only in combination with some directed choice of what to tweak. Especially since it's hard to test for the non-happy cases (e.g., a node takes a burst of traffic and starts promoting everything into old-gen prior to processing a request, resulting in a death spiral). G1 is the perfect example of a time suck.
Claims low pause latency for big heaps, and delivers something regarded by the Cassandra community (and HBase's as well) as working worse than CMS. If you spent 3 hours switching tuning knobs and analysing, that is 3 hours of your life you will never get back. This is similar to saying that someone told you to switch to CMS (or use some particular flag, etc.), you tried it, and it didn't have the result you expected. G1 and CMS have different trade-offs. Neither one will consistently result in better latencies across the board. It's all about the details. Better to let SUN and other people worry about tuning (at least from where I sit) They're not tuning. They are providing very general-purpose default behavior, including things that make *no* sense at all with Cassandra. For example, the default behavior with CMS is to try to make the marking phase run as late as possible so that it finishes just prior to heap exhaustion, in order to optimize for throughput; except that's not a good idea in many cases, because it exacerbates fragmentation problems in old-gen by pushing usage very high repeatedly, and it increases the chance of full gc because marking started too late (even if you don't hit promotion failures due to fragmentation). Sudden changes in workloads (e.g., compaction kicks in) also make it harder for CMS's mark-triggering heuristics to work well. As such, the default options for Cassandra use certain settings that diverge from the default behavior of the JVM, because Cassandra-in-general is a much more specific use-case than the completely general target audience of the JVM. Similarly, a particular cluster (with certain workloads/goals/etc.) is a yet more specific use-case than Cassandra-in-general and may be better served by settings that differ from those of default Cassandra.
But, I certainly agree with this (which I think roughly matches what you're saying): don't randomly pick options someone claims are good in a blog post and expect them to just make things better. If it were that easy, it would be the default behavior for obvious reasons. The reason it's not is likely that it depends on the situation. Further, even if you do play the lottery and win - if you don't know *why*, how are you able to extrapolate the behavior of the system with slightly changed workloads? It's very hard to blackbox-test GC settings, which is probably why GC tuning can be perceived as a useless game of whack-a-mole. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Invalid Counter Shard errors?
I don't understand what the three values in parentheses are exactly. I guess the last number is the count and the middle one is the number of increments, is that true? What is the first string (identical in all the errors)? It's (UUID, clock, increment). Very briefly, counter columns in Cassandra are made up of multiple shards. In the write path, a particular counter increment is executed by one leader, which is one of the replicas of the counter. The leader will increment its own value, read its own full value (this is why Replicate On Write has to do reads in the write path for counters) and replicate it to other nodes. The UUID roughly corresponds to a node in the cluster (UUID:s are sometimes refreshed, so it's not a strict correlation). The clock is supposed to be monotonically increasing for a given UUID. How can the last number (assuming it's the count) be negative knowing that I only sum positive numbers? I don't see a negative number in your paste? Another point is that the highest value seems to be *always* the good one (assuming this time that the middle number is the number of increments). DISCLAIMER: This is me responding off the cuff without digging into it further. It depends on the source of the problem. If the problem, as theorized in the ticket, is caused by non-clean shutdown of nodes, the expected result *should* be that we effectively lose counter increments. Given a particular leader among the replicas, suppose you increment counter C by N1, followed by an un-clean shutdown with the value never having been written to the commit log. On the next increment of C by N2, a counter shard would be generated whose value is base+N2 instead of base+N1 (assuming the memtable wasn't flushed and no other writes to the same counter column happened). When this gets replicated to other nodes, they would see a value based on N1 and a value based on N2, both with the same clock. They would choose the higher one.
In either case, as far as I can tell (off the top of my head), *some* counter increment is lost. The only way I can see (again, off the top of my head) the resulting value being correct is if the later increment (N2 in this case) somehow includes N1 as well (e.g., because it was generated by first reading the current counter value). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
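The lost-increment scenario above can be sketched as a toy shard reconciliation (a simplified model of my understanding, not Cassandra's actual merge code; the names and tuple layout are illustrative):

```python
# Toy model of counter shard reconciliation -- a simplified sketch of
# the behavior described above, not Cassandra's actual merge code.
# A shard is (leader_uuid, clock, value).
def merge_shards(a, b):
    """Higher clock wins; on a clock tie, keep the higher value."""
    assert a[0] == b[0], "shards must come from the same leader UUID"
    if a[1] != b[1]:
        return a if a[1] > b[1] else b
    return a if a[2] >= b[2] else b

# Unclean-shutdown scenario: an increment by N1=5 never reaches the
# commit log, then an increment by N2=3 produces a shard with the same
# clock for the same leader.
base = 100
s_n1 = ("leader-uuid", 7, base + 5)  # value based on N1
s_n2 = ("leader-uuid", 7, base + 3)  # value based on N2, N1 lost
winner = merge_shards(s_n1, s_n2)
# The higher value (base + 5) survives; the N2 increment is lost.
```

Either way the merge resolves, one of the two increments disappears, which is why the higher value "winning" does not make the resulting count correct.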
Re: Invalid Counter Shard errors?
The significance I think is: if it is indeed the case that the higher value is always *in fact* correct, I think that's inconsistent with the hypothesis that unclean shutdown is the sole cause of these problems - as long as the client is truly submitting non-idempotent counter increments without a read-before-write. As a side note: if you're using these counters for stuff like determining amounts of money to be paid by somebody, consider the non-idempotence of counter increments. Any write that increments a counter and fails with e.g. a timeout *MAY OR MAY NOT* have been applied, and cannot be safely retried. Cassandra counters are generally not useful if *strict* correctness is desired, for this reason. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: JVM 7, Cass 1.1.1 and G1 garbage collector
Generally tuning the garbage collector is a waste of time. Sorry, that's BS. It can be absolutely critical when done right, and only useless when done wrong. There's a spectrum in between. Just follow someone else's recommendation and use that. No, don't. Most recommendations out there are completely useless in the general case because someone did some very specific benchmark under very specific circumstances and then recommends some particular combination of options. In order to understand whether a particular recommendation applies to you, you need to know enough about your use-case that I suspect you're better off just reading up on the available options and figuring things out. Of course, randomly trying various different settings to see which seems to work well may be realistic - but you lose predictability (in the face of changing patterns of traffic, for example) if you don't know why it's behaving like it is. If you care about GC-related behavior, you want to understand how the application behaves, how the garbage collector behaves, and what your requirements are, and select settings based on those requirements and on how the application and GC behavior combine to produce emergent behavior. The best GC options may vary *wildly* depending on the nature of your cluster and your goals. There are also non-GC settings (in the specific case of Cassandra) that affect the interaction with the garbage collector, like whether you're using row/key caching, or things like phi conviction threshold and/or timeouts. It's very hard for anyone to give generalized recommendations. If it weren't, Cassandra would ship with The One True set of settings that are always the best and there would be no discussion. It's very unfortunate that GC in the freely available JVM:s is in this state, given that there exist known and working algorithms (and at least one practical implementation) that mostly avoid it. But it's the situation we're in.
The only way around it that I know of, if you're on Hotspot, is to have the application behave in such a way that it avoids the causes of un-predictable behavior w.r.t. GC, by being careful about its memory allocation and *retention* profile. For the specific case of avoiding *ever* seeing a full gc, it gets even more complex. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Changing bloom filter false positive ratio
I have a hunch that the SSTable selection based on the min and max keys in ColumnFamilyStore.markReferenced() means that a higher false positive rate has less of an impact. It's just a hunch, I've not tested it. For leveled compaction, yes. For non-leveled, I can't see how it would, since each sstable will effectively cover almost the entire range (you're effectively spraying random tokens at it, unless clients are writing data in md5 order). (Maybe it's different for ordered partitioning, though.) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: JVM 7, Cass 1.1.1 and G1 garbage collector
Relatedly, I'd love to learn how to reliably reproduce full GC pauses on C* 1.1+. Our full gc:s are typically not very frequent - a few days or even weeks in between, depending on the cluster. But it happens on several clusters; I'm guessing most (though I haven't done a systematic analysis). The only question is how often. But given the lack of handling of such failure modes, the effect on clients is huge. I recommend data reads by default to mitigate this and a slew of other sources of problems (and for counter increments, we're rolling out least-active-request routing). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: JVM 7, Cass 1.1.1 and G1 garbage collector
I was able to run IBM Java 7 with Cassandra (could not do it with 1.6 because of snappy). It has a new garbage collection policy (called balanced) that is good for very large heap sizes (over 8 GB), documented here, that looks promising for Cassandra. I have not tried it but I'd like to see it in action. FWIW, J9's balanced collector is very similar to G1 in its design. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: JVM 7, Cass 1.1.1 and G1 garbage collector
Our full gc:s are typically not very frequent. Few days or even weeks in between, depending on cluster. *PER NODE* that is. On a cluster of hundreds of nodes, that's pretty often (and all it takes is a single node). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: JVM 7, Cass 1.1.1 and G1 garbage collector
I am currently profiling a Cassandra 1.1.1 set up using G1 and JVM 7. It is my feeble attempt to reduce Full GC pauses. Has anyone had any experience with this? Anyone tried it? Have tried; for some workloads it's looking promising. This is without key cache and row cache, and with a pretty large young gen. The main thing you'll want to look for is whether your post-mixed-mode-collection heap usage remains stable or keeps growing. The main issue with G1 that causes fallbacks to full GC is regions becoming effectively uncollectable due to high remembered set scanning costs (driven by inter-region pointers). If you can avoid that, one might hope to avoid full gc:s altogether. The jury is still out on my side; but like I said, I've seen promising indications. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra 1.1.1 on Java 7
Has anyone tried running 1.1.1 on Java 7? Have been running jdk 1.7 on several clusters on 1.1 for a while now. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Invalid Counter Shard errors?
This problem is not new to 1.1. On Sep 6, 2012 5:51 AM, Radim Kolar h...@filez.com wrote: i would migrate to 1.0 because 1.1 is highly unstable.
Re: force gc?
I think I described the problem wrong :) I don't want to do Java's memory GC. I want to do cassandra's GC - that is, I want to really remove deleted rows from a column family and get my disc space back. I think that was clear from your post. I don't see a problem with your process. Setting gc grace to 0 and forcing compaction should indeed return you to the smallest possible on-disk size. Did you really not see a *decrease*, or are you just comparing the final size with that of PostgreSQL? Keep in mind that in many cases (especially if not using compression) the Cassandra on-disk format is not as compact as PostgreSQL's. For example, column names are duplicated in each row, and the row key is stored twice (once in the index, once in the data file). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: force gc?
I think that was clear from your post. I don't see a problem with your process. Setting gc grace to 0 and forcing compaction should indeed return you to the smallest possible on-disk size. (But may be unsafe as documented; can cause deleted data to pop back up, etc.) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: force gc?
Maybe there is some tool to analyze it? It would be great if I could somehow export each row of a column family into a separate file - so I could see their count and sizes. Is there any such tool? Or maybe you have some better thoughts... Use something like pycassa to non-obnoxiously iterate over all rows: for row_id, row in your_column_family.get_range(): https://github.com/pycassa/pycassa -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
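As a rough sketch of what the counting/sizing could look like: the summarize() helper below is generic, and the keyspace/column family names in the commented pycassa lines are made up (I haven't run this against a live cluster).

```python
# Sketch of summarizing row count and approximate per-row sizes.  The
# summarize() helper works on any iterable of rows; the commented
# pycassa lines show how you'd feed it real rows (keyspace/CF names
# are hypothetical).
def summarize(rows):
    """rows: iterable of (row_key, {column_name: value}) pairs.
    Returns (row_count, {row_key: approximate_size_in_bytes})."""
    sizes = {}
    for key, columns in rows:
        sizes[key] = sum(len(str(name)) + len(str(value))
                         for name, value in columns.items())
    return len(sizes), sizes

# Against a live cluster:
# import pycassa
# pool = pycassa.ConnectionPool('your_keyspace')
# cf = pycassa.ColumnFamily(pool, 'your_column_family')
# count, sizes = summarize(cf.get_range())
```

This avoids writing one file per row; you get the counts and sizes in memory and can sort or histogram them from there.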
Re: Memory Usage of a connection
Could these 500 connections/second cause (on average) 2600Mb memory usage per 2 seconds ~ 1300Mb/second, or for 1 connection around 2-3Mb? In terms of garbage generated, it's much less about the number of connections than about what you're doing with them. Are you, for example, requesting large amounts of data? Large or many columns (or both), etc.? Essentially all working data that your request touches is allocated on the heap and contributes to allocation rate and ParNew frequency. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: JMX(RMI) dynamic port allocation problem still exists?
I can recommend Jolokia highly for providing an HTTP/JSON interface to JMX (it can be trivially run in agent mode by just altering JVM args): http://www.jolokia.org/ -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Node forgets about most of its column families
I can confirm having seen this (no time to debug). One method of recovery is to jump the node back into the ring with auto_bootstrap set to false and an appropriate token set, after deleting system tables. That assumes you're willing to have the node take a few bad reads until you're able to run nodetool disablegossip and make other nodes not send requests to it. Disabling thrift would also be advised, or even firewalling it prior to restart. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Why so slow?
You're almost certainly using a client that doesn't set TCP_NODELAY on the thrift TCP socket. The Nagle algorithm is enabled, leading to 200 ms latency for each request, and thus 5 requests/second. http://en.wikipedia.org/wiki/Nagle's_algorithm -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
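For illustration, here's how a client disables Nagle on a plain Python socket - the kind of thing a thrift client library should be doing on the socket it opens to Cassandra (a generic sketch, not any particular client's code):

```python
import socket

# Generic sketch: disabling Nagle's algorithm on a client socket.
# Without TCP_NODELAY, small request writes can sit in the kernel
# buffer waiting for an ACK of the previous segment, which caps a
# synchronous request/response client at a handful of requests/second.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Verify the option took effect (non-zero means Nagle is disabled).
assert sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```

If your client library exposes the underlying socket, checking this one option is usually the fastest way to confirm or rule out Nagle as the cause of a suspiciously round 200 ms per-request delay.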
Re: nodetool repair uses insane amount of disk space
How come a node would consume 5x its normal data size during the repair process? https://issues.apache.org/jira/browse/CASSANDRA-2699 It's likely a variation based on how out of sync you happen to be, and whether you have a neighbor that's also been repaired and bloated up already. My setup is kind of strange in that it's only about 80-100GB of data on a 35 node cluster, with 2 data centers and 3 racks, however the rack assignments are unbalanced. One data center has 8 nodes, and the other data center is split into 2 racks with one rack of 9 nodes, and the other with 18 nodes. However, within each rack, the tokens are distributed equally. It's a long sad story about how we ended up this way, but it basically boils down to having to utilize existing resources to resolve a production issue. https://issues.apache.org/jira/browse/CASSANDRA-3810 In terms of DCs: different DC:s are effectively independent of each other in terms of replica placement, so there is no need or desire for two DC:s to be symmetrical. The racks are important, though, if you are trying to take advantage of racks being somewhat independent failure domains (for reasons outlined in 3810 above). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: 0.8 -- 1.1 Upgrade: Any Issues?
We currently have a 0.8 production cluster that I would like to upgrade to 1.1. Are there any know compatibility or upgrade issues between 0.8 and 1.1? Can a rolling upgrade be done or is it all-or-nothing? If you have lots of keys: https://issues.apache.org/jira/browse/CASSANDRA-3820 -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Invalid Counter Shard errors?
We're running a three node cluster of cassandra 1.1 servers, originally 1.0.7, and immediately after the upgrade the error logs of all three servers began filling up with the following message: The message you are receiving is new, but the problem it identifies is not. The checking for this condition, and the logging, was added so that certain kinds of counter corruption would be self-healed eventually instead of remaining forever incorrect. Likely nothing is wrong that wasn't wrong before; you're just seeing it being logged now. And I can confirm having seen this on 1.1, so the root cause remains unknown as far as I can tell (I had previously hoped the root cause was thread-unsafe shard merging, or one of the other counter-related issues fixed during the 0.8 run). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: TTL 3 hours + GC grace 0
I am using TTL 3 hours and GC grace 0 for a CF. I have a normal CF that has records with TTL 3 hours and I don't send any delete request. I just wonder if using GC grace 0 will cause any problem except extra Memory/IO/network load. I know that gc grace is for not transferring deleted records after a down node comes back. So I assumed that transferring expired records will not cause any problem. Do you have any idea? Thank you! If you do not perform any deletes at all, a GC grace of 0 should be fine. But if you don't, GC grace should not really be relevant either, so I suggest leaving it high in case you do start doing deletes. Columns with TTL:s will disappear regardless of GC grace. If you do decide to run with a short GC grace, be aware of the consequences: http://wiki.apache.org/cassandra/Operations#Dealing_with_the_consequences_of_nodetool_repair_not_running_within_GCGraceSeconds -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: OOM opening bloom filter
How did this bloom filter get too big? Bloom filters grow with the number of row keys you have. It is natural that they grow bigger over time. The question is whether there is something wrong with this node (for example, lots of sstables and disk space used due to compaction not running, etc.) or whether your cluster is simply increasing its use of row keys over time. You'd want graphs to be able to see the trends. If you don't have them, I'd start by comparing this node with other nodes in the cluster and figure out whether there is a very significant difference or not. In any case, a bigger heap will allow you to start up again. But you should definitely make sure you know what's going on (natural growth of data vs. some problem) if you want to avoid problems in the future. If it is legitimate use of memory, you *may*, depending on your workload, want to adjust target bloom filter false positive rates: https://issues.apache.org/jira/browse/CASSANDRA-3497 -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
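For a back-of-the-envelope feel for the sizes involved, the standard Bloom filter formula gives the approximate optimal bit count for a given key count and target false positive rate (a generic sketch; the key count below is illustrative, not taken from any particular node):

```python
import math

# Back-of-the-envelope Bloom filter sizing using the standard formula
# m = -n * ln(p) / (ln 2)^2, where n is the number of keys and p the
# target false positive rate.  The numbers here are hypothetical.
def bloom_filter_bits(n_keys, false_positive_rate):
    return -n_keys * math.log(false_positive_rate) / (math.log(2) ** 2)

n = 100_000_000  # hypothetical row count on one node
one_percent = bloom_filter_bits(n, 0.01)   # ~9.6 bits/key
ten_percent = bloom_filter_bits(n, 0.10)   # ~4.8 bits/key
# Relaxing the target false positive rate from 1% to 10% roughly
# halves the filter's memory footprint.
```

This is why CASSANDRA-3497's tunable false positive rate matters: filter size is linear in the row key count, so a steadily growing key population translates directly into steadily growing heap usage at startup.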
Re: how to increase compaction rate?
multithreaded_compaction: false Set to true. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: LeveledCompaction and/or SnappyCompressor causing memory pressure during repair
However, when I run a repair, my CMS usage graph no longer shows sudden drops but rather gradual slopes, and only manages to clear around 300MB each GC. This seems to occur on 2 other nodes in my cluster around the same time; I assume this is because they're the replicas (we use 3 replicas). ParNew collections look about the same on my graphs with or without repair running, so no trouble there as far as I can tell. I don't know why leveled/snappy would affect it, but disregarding that, I would have suggested that you are seeing additional heap usage because long-running repairs retain sstables and delay their unload/removal (index sampling/bloom filters filling your heap). If it really only happens for leveled/snappy, however, I don't know what the cause might be. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Frequency of Flushing in 1.0
if a node goes down, it will take longer for commitlog replay. commit log replay time is insignificant. most time during node startup is wasted on index sampling. Index sampling here runs for about 15 minutes. Depends entirely on your situation. If you have few keys and lots of writes, index sampling will be insignificant. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Analysis of performance benchmarking - unexpected results
1. Changing consistency level configurations from Write.ALL + Read.ONE to Write.ALL + Read.ALL increases write latency (expected) and decreases read latency (unexpected). When you tested at CL.ONE, was read repair turned on? The two ways I can think of right now by which read latency might improve are:

* You're benchmarking at saturation (rather than at reasonable capacity), and the decreased overall throughput (thus load) causes better latencies.
* No (or low) read repair could lead to a different selection of read endpoints by the co-ordinator, such as a node never getting the chance to be snitched close due to lack of traffic.

In addition, I should mention that it *would* be highly expected to see better latencies at ALL if https://issues.apache.org/jira/browse/CASSANDRA-2540 were done (same with QUORUM). 2. Changing from a single-region Cassandra cluster to a multi-region Cassandra cluster on EC2 significantly increases write latency (approx 2x, again expected) but slightly decreases read latency (approx -10%, again unexpected). Are all benchmarking clients in a single region? I could imagine some kind of snitch effect here, possibly. But not if your benchmarking clients are connecting to random co-ordinators (assuming the inter-region latency is in fact worse ;)). Something is highly likely bogus, IMO. If you have significantly higher latencies across regions, there's just no way CL.ALL would get you lower latencies than CL.ONE. Even if the clients selected hosts randomly across the entire cluster, all requests that happen to go to a region-local co-ordinator should see better latencies. That's ASSUMING you've set up the multi-region cluster as multi-DC in Cassandra. If not, Cassandra would not have knowledge about topology and relative latency other than what is driven by traffic, and I could imagine this happening if read repair were turned completely off. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: keycache persisted to disk ?
I actually have the opposite 'problem'. I have a pair of servers that have been static since mid last week, but have seen performance vary significantly (x10) for exactly the same query. I hypothesised it was various caches, so I shut down Cassandra, flushed the O/S buffer cache and then brought it back up. The performance wasn't significantly different to the pre-flush performance. I don't get this thread at all :) Why would restarting with clean caches be expected to *improve* performance? And why is key cache loading involved, other than to delay start-up and hopefully pre-populate caches for better (not worse) performance? If you want to figure out why queries seem to be slow relative to normal, you'll need to monitor the behavior of the nodes. Look at disk I/O statistics primarily (everyone reading this who runs Cassandra and isn't intimately familiar with iostat -x -k 1 should go and read up on it right away; make sure you understand the utilization and average queue size columns), CPU usage, whether compaction is happening, etc. One easy way to see sudden bursts of poor behavior is to be heavily reliant on cache, and then have sudden decreases in performance due to compaction evicting data from page cache while also generating more I/O. But that's total speculation. It is also the case that you cannot expect consistent performance on EC2, and that might be it. But my #1 advice: log into the node while it is being slow, and observe. Figure out what the bottleneck is. iostat, top, nodetool tpstats, nodetool netstats, nodetool compactionstats. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: keycache persisted to disk ?
Yep - I've been looking at these - I don't see anything in iostat/dstat etc. that points strongly to a problem. There is quite a bit of I/O load, but it looks roughly uniform on slow and fast instances of the queries. The last compaction ran 4 days ago - which was before I started seeing variable performance [snip] I know why it is slow - it's clearly I/O bound. I am trying to hunt down why it is sometimes much faster even though I have (tried) to replicate the same conditions What does "clearly I/O bound" mean, and what is "quite a bit of I/O load"? In general, if you have queries that come in at some rate that is determined by outside sources (rather than by the time the last query took to execute), you will typically either get more queries than your cluster can take, or fewer. If fewer, there is a non-trivially sized grey area where the overall I/O throughput needed is lower than that available, but the closer you are to capacity the more often requests have to wait for other I/O to complete, for purely statistical reasons. If you're running close to maximum capacity, it would be expected that the variation in query latency is high. That said, if you're seeing consistently bad latencies for a while where you sometimes see consistently good latencies, that sounds different but would hopefully be observable somehow. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: keycache persisted to disk ?
I'm making an assumption . . . I don't yet know enough about cassandra to prove they are in the cache. I have my keycache set to 2 million, and am only querying ~900,000 keys. so after the first time I'm assuming they are in the cache. Note that the key cache only caches the index positions in the data file, and not the actual data. The key cache will only ever eliminate the I/O that would have been required to lookup the index entry; it doesn't help to eliminate seeking to get the data (but as usual, it may still be in the operating system page cache). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: keycache persisted to disk ?
For one thing, what does ReadStage's pending look like if you repeatedly run nodetool tpstats on these nodes? If you're simply bottlenecking on I/O on reads, that is the easiest and most direct way to observe it empirically. If you're saturated, you'll see active close to maximum at all times, and pending racking up consistently. If you're just close, you'll likely see spikes sometimes. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: keycache persisted to disk ?
What is your total data size (nodetool info/nodetool ring) per node, your heap size, and the amount of memory on the system? -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: keycache persisted to disk ?
the servers spending 50% of the time in io-wait Note that I/O wait is not necessarily a good indicator, depending on the situation. In particular if you have multiple drives, I/O wait can mostly be ignored. Similarly if you have non-trivial CPU usage in addition to disk I/O, it is also not a good indicator. I/O wait is essentially giving you the amount of time CPUs spend doing nothing because the only processes that would otherwise be runnable are waiting on disk I/O. But even a single process waiting on disk I/O can produce lots of I/O wait, even if you have 24 drives. The per-disk % utilization is generally a much better indicator (assuming no hardware RAID device, and assuming no SSD), along with the average queue size. In general, if you have queries that come in at some rate that is determined by outside sources (rather than by the time the last query took to execute), That's an interesting approach - is that likely to give close to optimal performance ? I just mean that it all depends on the situation. If you have, for example, some N number of clients that are doing work as fast as they can, bottlenecking only on Cassandra, you're essentially saturating the Cassandra cluster no matter what (until the client/network becomes a bottleneck). Under such conditions (saturation) you should generally never expect good latencies. For most non-batch-job production use-cases, you tend to have incoming requests driven by something external such as user behavior or automated systems not related to the Cassandra cluster. In these cases, you tend to have a certain amount of incoming requests at any given time that you must serve within a reasonable time frame, and that's where the question comes in of how much I/O you're doing in relation to maximum. For good latencies, you always want to be significantly below maximum - particularly when platter-based disk I/O is involved.
That may well explain it - I'll have to think about what that means for our use case as load will be extremely bursty To be clear though, even your typical un-bursty load is still bursty once you look at it at sufficient resolution, unless you have something specifically ensuring that it is entirely smooth. A completely random distribution over time, for example, would look very even on almost any graph you can imagine unless you have sub-second resolution, but would still exhibit un-evenness and have an effect on latency. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
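The "grey area" below saturation can be illustrated with a textbook M/M/1 queue. This is a deliberately crude stand-in for a single disk; the 100-IOPS service rate is an assumption for illustration, not a number from the thread:

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean response time of an M/M/1 queue: service time plus queueing delay."""
    assert arrival_rate < service_rate, "at/after saturation the queue grows without bound"
    return 1.0 / (service_rate - arrival_rate)

# A disk doing 100 random IOPS (10 ms average service time):
for utilization in (0.5, 0.8, 0.95):
    ms = mm1_response_time(utilization * 100, 100) * 1000
    print(f"{utilization:.0%} utilized -> {ms:.0f} ms mean response time")
```

Mean response time goes from 20 ms at 50% utilization to 200 ms at 95%, which is the statistical waiting-on-other-I/O effect described above: latency degrades long before throughput runs out.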
Re: keycache persisted to disk ?
Yep, the readstage is backlogging consistently - but the thing I am trying to explain is why it is good sometimes in an environment that is pretty well controlled - other than being on EC2 So pending is constantly &gt; 0? What are the clients? Is it batch jobs or something similar where there is a feedback mechanism implicit in that the higher latencies of the cluster are slowing down the clients, thus reaching an equilibrium? Or are you just teetering on the edge, dropping requests constantly? Under typical live-traffic conditions, you never want to be running with read stage pending backing up constantly. If on the other hand these are batch jobs where throughput is the concern, it's not relevant. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: keycache persisted to disk ?
2 Node cluster, 7.9GB of RAM (ec2 m1.large), RF=2, 11GB per node, Quorum reads, 122 million keys, heap size is 1867M (default from the AMI I am running), I'm reading about 900k keys Ok, so basically a very significant portion of the data fits in page cache, but not all. As I was just going through cfstats - I noticed something I don't understand: Key cache capacity: 906897 Key cache size: 906897 I set the key cache to 2 million; it's somehow got to a rather odd number You're on 1.0+? Nowadays there is code to actively make caches smaller if Cassandra detects that you seem to be running low on heap. Watch cassandra.log for messages to that effect (don't remember the exact message right now). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Multiple data center nodetool ring output display 0% owns
There was a thread on this a couple days ago -- short answer, the 'owns %' column is effectively incorrect when you're using multiple DCs. If you had all 3 servers in 1 DC, since server YYY has token 1 and server XXX has token 0, then server XXX would truly 'own' 0% (actually, 1/(2^128) :) ), and depending on your replication factor might have no data (if replication were 1). It's also incorrect for rack awareness if your topology is such that the rack awareness changes ownership (see https://issues.apache.org/jira/browse/CASSANDRA-3810). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra 1.0.6 multi data center question
It is expected that the schema is replicated everywhere, but *data* won't be in the DC with 0 replicas. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra 1.0.6 multi data center question
Thanks for the reply. But I can see the data setting inserted in DC1 in DC2. So that means data also getting replicated to DC2. data setting inserted? You should not be seeing data in the DC with 0 replicas. What does data setting inserted mean? Do you have sstables on disk with data for the keyspace? -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra 1.0.6 multi data center question
Again the *schema* gets propagated and the keyspace will exist everywhere. You should just have exactly zero amount of data for the keyspace in the DC w/o replicas. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: read-repair?
sorry to be dense, but which is it? do i get the old version or the new version? or is it indeterminate? Indeterminate, depending on which nodes happen to be participating in the read. Eventually you should get the new version, unless the node that took the new version permanently crashed with data loss prior to the data making it elsewhere. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Can you query Cassandra while it's doing major compaction
If every node in the cluster is running major compaction, would it be able to answer any read request? And is it wise to write anything to a cluster while it's doing major compaction? Compaction is something that is supposed to be continuously running in the background. As noted, it will have a performance impact in that it (1) generates I/O, (2) leads to cache eviction, and (if you're CPU bound rather than disk bound) (3) adds CPU load. But there is no intention that clients should have to care about compaction; it's to be viewed as a background operation that is continuously happening. A good rule of thumb is that an individual node should be able to handle your traffic while doing compaction; you don't want to be in the position where you're just barely dealing with the traffic, only for a node doing compaction to be unable to handle it. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: read-repair?
I have RF=3, my row/column lives on 3 nodes right? if (for some reason, eg a timed-out write at quorum) node 1 has a 'new' version of the row/column (eg clock = 10), but nodes 2 and 3 have 'old' versions (clock = 5), when i try to read my row/column at quorum, what do i get back? You either get back the new version or the old version, depending on whether node 1 participated in the read. In your scenario, the previous write at quorum failed (since it only made it to one node), so this is not a violation of the contract. Once node 2 and/or 3 return their response, read repair (if it is active) will cause a re-read and reconciliation, followed by a row mutation being sent to the nodes to correct the column. do i get the clock 5 version because that is what the quorum agrees on, and No; a quorum of nodes is waited for, and the newest column wins. This accomplishes the reads-see-writes invariant. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
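The newest-column-wins reconciliation from the scenario above can be sketched like this. The dict-based "replicas" are hypothetical; real Cassandra compares column timestamps the same way:

```python
def reconcile(responses):
    """Coordinator-side resolution: the column with the newest clock wins."""
    return max(responses, key=lambda r: r["clock"])

# Node 1 took the (failed-at-quorum) write; nodes 2 and 3 did not.
replicas = {
    1: {"clock": 10, "value": "new"},
    2: {"clock": 5, "value": "old"},
    3: {"clock": 5, "value": "old"},
}

# A quorum is any 2 of the 3 replicas, so the answer depends on participation:
print(reconcile([replicas[1], replicas[2]])["value"])  # new
print(reconcile([replicas[2], replicas[3]])["value"])  # old
```

Hence "indeterminate": whether you see clock 10 or clock 5 depends entirely on whether node 1 is among the replicas that answered, not on any majority vote over values.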
Re: how to delete data with level compaction
I'm using level compaction and I have about 200GB compressed in my largest CFs. The disks are getting full. This is time-series data so I want to drop data that is a couple of months old. It's pretty easy for me to iterate through the relevant keys and delete the rows. But will that do anything? I currently have the majority of sstables at generation 4. Deleting rows will initially just create a ton of tombstones. For them to actually free up significant space they need to get promoted to gen 4 and cause a compaction there, right? nodetool compact doesn't do anything with level compaction, it seems. Am I doomed? (Ok, I'll whip out my CC and order more disk ;-) There is a delay before you will see the effects of deletions in terms of disk space, yes. However, this is not normally a problem because you effectively reach a steady state of disk usage. It only becomes a problem if you're almost *entirely* full and are trying to delete data in a panic. How far away are you from entirely full? Are you just worried about the future or are you about to run out of disk space right now? -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: how to delete data with level compaction
I'm at 80%, so not quite panic yet ;-) I'm wondering, in the steady state, how much of the space used will contain deleted data. That depends entirely on your workload, including: * How big the data that you are deleting is in relation to the size of tombstones * How long the average piece of data lives before being deleted * How much other data never being deleted you have in relation to the data that gets deleted * How it varies over time and where you are in your cycle I can't offer any mathematics that will give you the actual answer for leveled compaction (and I don't have a good feel for it as I haven't used leveled compaction in production systems yet). I strongly recommend graphing disk space for your particular workload/application and seeing how it behaves over time. And *have alerts* on disk space running out. Have a good amount of margin. Less so with leveled compaction than size-tiered compaction, but still important. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: ideal cluster size
Thanks for the responses! We'll definitely go for powerful servers to reduce the total count. Beyond a dozen servers there really doesn't seem to be much point in trying to increase count anymore for Just be aware that if big servers imply *lots* of data (especially in relation to memory size), it's not necessarily the best trade-off. Consider the time it takes to do repairs, streaming, node start-up, etc. If it's only about CPU resources then bigger nodes probably make more sense if the h/w is cost effective. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Garbage collection freezes cassandra node
Thanks for this very helpful info. It is indeed a production site which I cannot easily upgrade. I will try the various gc knobs and post any positive results. *IF* your data size, or at least hot set, is small enough that you're not extremely reliant on the current size of page cache, and in terms of short-term relief, I recommend: * Significantly increasing the heap size. Like double it or more. * Decrease the occupancy trigger such that it kicks in around the time it already does (in terms of amount of heap usage used). * Increase the young generation size (to lessen promotion into old-gen). Experiment on a single node, making sure you're not causing too much disk I/O by stealing memory otherwise used by page cache. Once you have something that works you might try slowly going back down. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: delay in data deleting in cassadra
The problem occurs when this thread is invoked for the second time. In that step , it returns some of data that i already deleted in the third step of the previous cycle. In order to get a guarantee about a subsequent read seeing a write, you must read and write at QUORUM (or LOCAL_QUORUM if it's only within a DC). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
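The underlying rule is just quorum intersection: a read is guaranteed to see a prior write exactly when the read and write replica sets must overlap. A minimal check:

```python
def read_sees_write(read_replicas, write_replicas, replication_factor):
    """Read-your-writes is guaranteed iff the replica sets must
    intersect: R + W > N."""
    return read_replicas + write_replicas > replication_factor

RF = 3
QUORUM = RF // 2 + 1          # 2 for RF=3

print(read_sees_write(QUORUM, QUORUM, RF))  # QUORUM reads + QUORUM writes: guaranteed
print(read_sees_write(1, 1, RF))            # ONE + ONE: a read may miss the write
```

This is why deleting at ONE and then reading at ONE (as in the cycle described above) can legitimately return already-deleted data: the read may land entirely on replicas that have not yet seen the tombstone.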
Re: Garbage collection freezes cassandra node
On node 172.16.107.46, I see the following: 21:53:27.192+0100: 1335393.834: [GC 1335393.834: [ParNew (promotion failed): 319468K-&gt;324959K(345024K), 0.1304456 secs]1335393.964: [CMS: 6000844K-&gt;3298251K(8005248K), 10.8526193 secs] 6310427K-&gt;3298251K(8350272K), [CMS Perm : 26355K-&gt;26346K(44268K)], 10.9832679 secs] [Times: user=11.15 sys=0.03, real=10.98 secs] 21:53:38,174 GC for ConcurrentMarkSweep: 10856 ms for 1 collections, 3389079904 used; max is 8550678528 I have not yet tested the -XX:+DisableExplicitGC switch. Is the right thing to do to decrease the CMSInitiatingOccupancyFraction setting? * Increasing the total heap size can definitely help; the only kink is that if you need to increase the heap size unacceptably much, it is not helpful. * Decreasing the occupancy trigger can help, yes, but you will get very much diminishing returns as your trigger fraction approaches the actual live size of data on the heap. * I just re-checked your original message - you're on Cassandra 0.7? I *strongly* suggest upgrading to 1.x. In general that holds true, but also specifically relating to this are significant improvements in memory allocation behavior that significantly reduce the probability and/or frequency of promotion failures and full GCs. * Increasing the size of the young generation can help by causing less promotion to old-gen (see the cassandra.in.sh script or equivalent for Windows). * Increasing the number of parallel threads used by CMS can help CMS complete its marking phase quicker, but at the cost of a greater impact on the mutator (Cassandra). I think the most important thing is - upgrade to 1.x before you run these benchmarks. Particularly detailed tuning of GC issues is pretty useless on 0.7 given the significant changes in 1.0. Don't even bother spending time on this until you're on 1.0, unless this is about a production cluster that you cannot upgrade for some reason. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
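A quick way to pull stop-the-world pauses like the one quoted out of a -XX:+PrintGCDetails log. The regex is a sketch matched to that specific line shape, not a general GC-log parser, and the embedded log line is a reconstruction of the one above:

```python
import re

log_line = ("1335393.834: [GC 1335393.834: [ParNew (promotion failed): "
            "319468K->324959K(345024K), 0.1304456 secs]"
            "1335393.964: [CMS: 6000844K->3298251K(8005248K), 10.8526193 secs] "
            "[Times: user=11.15 sys=0.03, real=10.98 secs]")

promotion_failed = "promotion failed" in log_line
# 'real' is wall-clock time: the actual length of the stop-the-world pause.
real_pause = float(re.search(r"real=([\d.]+) secs", log_line).group(1))

if promotion_failed and real_pause > 1.0:
    print(f"promotion failure -> {real_pause:.2f}s stop-the-world pause")
```

Grepping for "promotion failed" and "concurrent mode failure" and plotting the real= times is usually enough to tell CMS fallbacks apart from ordinary short ParNew pauses.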
Re: ideal cluster size
We're embarking on a project where we estimate we will need on the order of 100 cassandra nodes. The data set is perfectly partitionable, meaning we have no queries that need to have access to all the data at once. We expect to run with RF=2 or =3. Is there some notion of ideal cluster size? Or perhaps asked differently, would it be easier to run one large cluster or would it be easier to run a bunch of, say, 16 node clusters? Everything we've done to date has fit into 4-5 node clusters. Certain things certainly become harder with many nodes just due to the sheer number: increased need to automate administrative tasks, etc. But mostly, this would apply equally to e.g. 10 clusters of 10 nodes, as it does to one cluster of 100 nodes. I'd prefer running one cluster unless there is a specific reason to do otherwise, just because it means you have one thing to keep track of both mentally and in terms of e.g. monitoring/alerting, instead of having another level of grouping applied to your hosts. I can't think of significant benefits to small clusters that still hold true when you have many of them, as opposed to a correspondingly big single cluster. It is probably more useful to try to select hardware such that you have a greater number of smaller nodes, than it is to focus on node count (although once you start reaching the few-hundreds level you're entering territory with less actual real-life production testing). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cannot start cassandra node anymore
(too much to quote) Looks to me like this should be fixable by removing hints from the node in question (I don't remember whether this is a bug that's been identified and fixed or not). I may be wrong because I'm just basing this on a quick look at the stack trace but it seems to me there are hints on the node, for other nodes, that contain writes to a deleted column family. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Consistency Level
Thanks for the response Peter! I checked everything and it looks good to me. I am stuck with this for almost 2 days now. Has anyone had this issue? While it is certainly possible that you're running into a bug, it seems unlikely to me since it is the kind of bug that would affect almost anyone if it is failing with Unavailable due to unrelated (not in replica sets) nodes being down. Can you please post back with (1) the ring layout ('nodetool ring'), and (2) the exact row key that you're testing with? You might also want to run with DEBUG level (modify log4j-server.properties at the top); the strategy (assuming you are using NetworkTopologyStrategy) will log selected endpoints, so you can confirm that it's indeed picking the endpoints you think it should based on getendpoints. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Consistency Level
exception May not be enough replicas present to handle consistency level Check for mistakes in using getendpoints. Cassandra says Unavailable when there are not enough replicas *IN THE REPLICA SET FOR THE ROW KEY* to satisfy the consistency level. I tried to read data using cassandra-cli but I am getting null. This is just cassandra-cli quirkiness IIRC; I think you get null on exceptions. With consistency level ONE, I would assume that with just one node up and running (of course the one that has the data) I should get my data back. But this is not happening. 1 node among the ones in the replica set of your row has to be up. Will the read repair happen automatically even if I read and write using the consistency level ONE? Yes, assuming it's turned on. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: better anti OOM
I will investigate the situation more closely using gc via jconsole, but isn't the bloom filter for a new sstable entirely in memory? On disk there are only 2 files, Index and Data. -rw-r--r-- 1 root wheel 1388969984 Dec 27 09:25 sipdb-tmp-hc-4634-Index.db -rw-r--r-- 1 root wheel 10965221376 Dec 27 09:25 sipdb-tmp-hc-4634-Data.db The bloom filter can be that big. I have experience that if I trigger major compaction on a 180 GB CF (Compacted row mean size: 130) it will OOM the node after 10 seconds, so I am sure that compaction eats memory pretty well. Yes you're right, you'll definitely spike in memory usage by whatever amount corresponds to index sampling/BF for the thing being compacted. This can be mitigated by never running full compactions (i.e., not running 'nodetool compact'), but won't be gone altogether. Also, if your version doesn't yet have https://issues.apache.org/jira/browse/CASSANDRA-2466 applied, another side-effect is that the sudden large allocations for bloom filters can cause promotion failures even if there is free memory. yes, it prints messages like heap is almost full and after some time it usually OOMs during large compaction. Ok, in that case it seems even more clear that you simply need a larger heap. How large are the bloom filters in total? I.e., the sizes of the *-Filter.db files. In general, don't expect to be able to run at close to heap capacity; there *will* be spikes. In this particular case, leveled compaction in 1.0 should mitigate the effect quite significantly since it only compacts up to 10% of the data set at a time, so memory usage should be considerably more even (as will disk space usage be). That would allow you to run a bit closer to heap capacity than regular compaction. Also, consider tweaking compaction throughput settings to control the rate of allocation generated during a compaction, even if you don't need it for disk I/O purposes. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
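For a sense of scale, the textbook bloom filter sizing formula can be applied here. The 1.4 billion row count is a guess for illustration only (180 GB of small rows), and 0.000744 is the Cassandra default FP rate cited later in this archive:

```python
import math

def bloom_filter_bytes(num_keys, fp_rate):
    """Optimal bloom filter size: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -num_keys * math.log(fp_rate) / (math.log(2) ** 2)
    return bits / 8

# Hypothetical: 1.4 billion rows at a 0.000744 false-positive rate.
gib = bloom_filter_bytes(1_400_000_000, 0.000744) / 2**30
print(f"~{gib:.1f} GiB of bloom filter")
```

At roughly 15 bits per key for that FP rate, a CF with enough small rows can easily carry a multi-gigabyte bloom filter, which makes the sudden compaction-time allocation spike described above plausible.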
Re: index sampling
on a node with 300m rows (small node), there will be 585937 index sample entries with 512 sampling. Let's say 100 bytes per entry; this will be ~56 MB. Bloom filters are 884 MB. With the default sampling of 128, sampled entries will use a significant amount of node memory. Index sampling should be reworked like bloom filters to avoid allocating one large array per sstable. The hadoop mapfile uses a sampling of 128 by default too, and it reads the entire mapfile index into memory. The index summary does have an ArrayList which will be backed by an array which could become large; however, larger than that array (which is going to be 1 object reference per sample, or 1-2 taking into account internal growth of the array list) will be the overhead of the objects in the array (regular Java objects). This is also why it is non-trivial to report on the data size. it should be clearly documented in http://wiki.apache.org/cassandra/LargeDataSetConsiderations - that bloom filters + index sampling will be responsible for most memory used by the node. Caching itself has minimal use on a large data set used for OLAP. I added some information at the end. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
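The arithmetic, with the ~100-bytes-per-entry figure from the question made explicit (it is a rough guess at Java per-object overhead, not a measured constant):

```python
def index_sample_entries(total_rows, index_interval):
    """Roughly one in-memory index sample per index_interval rows."""
    return total_rows // index_interval

rows = 300_000_000
for interval in (128, 512):
    entries = index_sample_entries(rows, interval)
    mb = entries * 100 / 2**20   # assumed 100 bytes per entry
    print(f"interval {interval}: {entries:,} entries, ~{mb:.0f} MB")
```

So going from the default interval of 128 to 512 cuts the sample count (and thus this memory) by 4x, at the cost of scanning up to 4x more index entries per uncached lookup.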
Re: will compaction delete empty rows after all columns expired?
Compaction should delete empty rows once gc_grace_seconds is passed, right? Yes. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: will compaction delete empty rows after all columns expired?
Compaction should delete empty rows once gc_grace_seconds is passed, right? Yes. But just to be extra clear: Data will not actually be removed until the row in question participates in a compaction. Compactions will not be actively triggered by Cassandra for tombstone-processing reasons. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: improving cassandra-vs-mongodb-vs-couchdb-vs-redis
Also when comparing these technologies very subtle differences in design have profound effects on operation and performance. Thus someone trying to paper over 6 technologies and compare them with a few bullet points is really doing the world an injustice. +1. Same goes for 99% of all benchmarks ever published on the internet... -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Restart for change of endpoint_snitch ?
If I change endpoint_snitch from SimpleSnitch to PropertyFileSnitch, does it require restart of cassandra on that node ? Yes. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: reported bloom filter FP ratio
but reported ratio is Bloom Filter False Ratio: 0.00495 which is higher than my computed ratio 0.000145. If that were true, then the reported ratio should be lower than the one I computed from CF reads, because there are more reads to sstables than to the CF. The ratio is the ratio of false positives to true positives *per sstable*. It's not the number of false positives in each sstable *per CF read*. Thus, there is no expectation of higher vs. lower, and the magnitude of the discrepancy is easily explained by the fact that you only have 10 false positives. That's not a statistically significant sample set. from investigation of bloom filter FP ratio it seems that the default bloom filter FP ratio (soon user configurable) should be higher. HBase defaults to 1%; Cassandra defaults to 0.000744. Bloom filters are using quite a bit of memory now. I don't understand how you reached that conclusion. There is a direct trade-off between memory use and false positive hit rate, yes. That does not mean that HBase's 1% is magically the correct choice. I definitely think it should be tweakable (and IIRC there's work happening on a JIRA to make this an option now), but a 1% false positive hit rate will be completely unacceptable in some circumstances. In others, perfectly acceptable due to the decrease in memory use and few reads. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: reported bloom filter FP ratio
I don't understand how you reached that conclusion. On my nodes most memory is consumed by bloom filters. Also 1.0 creates The point is that just because that's the problem you have, doesn't mean the default is wrong, since it quite clearly depends on use-case. If your relative number of rows is low compared to the cost of sustaining a read-heavy workload, the trade-off is different. Cassandra does not measure memory used by index sampling yet; I suspect that it will be memory hungry too and can be safely lowered by default. I see very little difference by changing index sampling from 64 to 512. Bloom filters and index sampling are the two major contributors to memory use that scale with the number of rows (and thus typically with data size). This is known. Index sampling can indeed be significant. The default is 128 though, not 64. Here again it's a matter of trade-offs; 512 may have worked for you, but it doesn't mean it's an appropriate default (I am not arguing for 128 either, I am just saying that it's more complex than observing that in your particular case you didn't see a problem with 512). Part of the trade-off is additional CPU usage implied in streaming and deserializing a larger amount of data per average sstable index read; part of the trade-off is also effects on I/O; a sparser index sampling could result in a higher number of seeks per index lookup. The basic problem with Cassandra daily administration which I am currently solving is that memory consumption grows with your dataset size. I don't really like this design - you put more data in and the cluster can OOM. This makes Cassandra a suboptimal solution for use in data archiving. It will get better once tunable bloom filters are committed. That is a good reason for both to be configurable IMO. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: better anti OOM
If node is low on memory 0.95+ heap used it can do: 1. stop repair 2. stop largest compaction 3. reduce number of compaction slots 4. switch compaction to single threaded flushing largest memtable/ cache reduce is not enough Note that the emergency flushing is just a stop-gap. You should run with appropriately sized heaps under normal conditions; the emergency flushing stuff is intended to mitigate the effects of having a too small heap size; it is not expected to avoid completely the detrimental effects. Also note that things like compaction does not normally contribute significantly to the live size on your heap, but it typically does contribute to allocation rate which can cause promotion failures or concurrent mode failures if your heap size is too small and/or concurrent mark/sweep settings not aggressive enough. Aborting compaction wouldn't really help anything other than short-term avoiding a fallback to full GC. I suggest you describe exactly what the problem is you have and why you think stopping compaction/repair is the appropriate solution. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: better anti OOM
I suggest you describe exactly what the problem is you have and why you think stopping compaction/repair is the appropriate solution. compacting a 41.7 GB CF with about 200 million rows adds ~600 MB to heap, node logs messages like: I don't know what you are basing that on. It seems unlikely to me that the working set of a compaction is 600 MB. However, it may very well be that the allocation rate is such that it contributes to an additional 600 MB average heap usage after a CMS phase has completed. After node boot Heap Memory (MB) : 1157.98 / 1985.00 disabled gossip + thrift, only compaction running Heap Memory (MB) : 1981.00 / 1985.00 Using nodetool info to monitor heap usage is not really useful unless done continuously over time, observing the free heap after CMS phases have completed. Regardless, the heap is always expected to grow in usage to the occupancy trigger which kick-starts CMS. That said, 1981/1985 does indicate a non-desirable state for Cassandra, but it does not mean that compaction is using 600 MB as such (in terms of live set). You might say that it implies &gt;= 600 MB extra heap required at your current heap size and GC settings. If you want to understand what's happening I suggest attaching with visualvm/jconsole and looking at the GC behavior, and running with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps. When attached with visualvm/jconsole you can hit "perform gc" and see how far it drops, to judge what the actual live set is. Also, you say it's "pretty dead". What exactly does that mean? Does it OOM? I suspect you're just seeing fallbacks to full GC and long pauses because you're allocating and promoting to old-gen fast enough that CMS is just not keeping up, rather than it having to do with memory use per se.
In your case, I suspect you simply need to run with a bigger heap or reconfigure CMS to use additional threads for concurrent marking (-XX:ParallelCMSThreads=XXX - try XXX = number of CPU cores for example in this case). Alternatively, a larger young gen to avoid so much getting promoted during compaction. But really, in short: The easiest fix is probably to increase the heap size. I know this e-mail doesn't begin to explain details but it's such a long story. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: reported bloom filter FP ratio
Read Count: 68844 [snip] why is the reported bloom filter FP ratio not computed like this: 10/68844.0 = 0.00014525594096798558? Because the read count is the total number of reads to the CF, while the bloom filter is per sstable. The number of individual reads to sstables will be higher than the number of reads to the CF (unless you happen to have exactly one sstable or no rows ever span sstables). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
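The accounting difference described above can be sketched numerically. The per-read sstable count here is an assumption for illustration; it is not from the original question.

```python
# Why the naive CF-level calculation under-reports the denominator:
# each CF read may consult the bloom filter of several sstables, so
# the FP ratio is relative to sstable reads, not CF reads.
cf_reads = 68844                 # total reads to the CF (from cfstats)
false_positives = 10
sstables_per_read = 3            # assumed average; illustrative only

naive_ratio = false_positives / cf_reads            # what the question computed
sstable_reads = cf_reads * sstables_per_read
reported_ratio = false_positives / sstable_reads    # closer to what Cassandra reports

# With more sstable reads in the denominator, the reported ratio is lower.
assert reported_ratio < naive_ratio
```

With exactly one sstable and no rows spanning sstables, the two denominators coincide and the naive calculation would match.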
Re: Routine nodetool repair
One other thing to consider is: are you creating a few very large rows? You can check the min, max and average row size using nodetool cfstats. Normally I agree, but assuming the two-node cluster has RF 2 it would actually not matter ;) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra stress test and max vs. average read/write latency.
Thanks for your input. Can you tell me more about what we should be looking for in the gc log? We've already got the gc logging turned on, and we've already done the plotting to show that in most cases the outliers are happening periodically (with a period of 10s of seconds to a few minutes, depending on load and tuning) Are you measuring writes or reads? If writes, https://issues.apache.org/jira/browse/CASSANDRA-1991 is still relevant I think (sorry no progress from my end on that one). Also, I/O scheduling issues can easily cause problems with the commit log latency (on fsync()). Try switching to periodic commit log mode and see if it helps, just to eliminate that (if you're not already in periodic; if so, try upping the interval). For reads, I am generally unaware of much aside from GC and legitimate jitter (scheduling/disk I/O etc) that would generate outliers. At least that I can think of off hand... And w.r.t. the GC log - yeah, correlating in time is one thing. Another thing is to confirm what kind of GC pauses you're seeing. Generally you want to be seeing lots of ParNew:s of shorter duration, and those are tweakable by changing the young generation size. The other thing is to make sure CMS is not failing (promotion failure/concurrent mode failure) and falling back to a stop-the-world serial compacting GC of the entire heap. You might also use -XX:+PrintGCApplicationStoppedTime to get a more obvious and greppable report for each pause, regardless of type/cause. I've tried to correlate the times of the outliers with messages either in the system log or the gc log. There seems to be some (but not complete) correlation between the outliers and system log messages about memtable flushing. I cannot find anything in the gc log that seems to be an obvious problem, or that matches up with the times of the outliers. 
And these are still the very extreme (500+ ms and such) outliers that you're seeing w/o GC correlation? Off the top of my head, that seems very unexpected (assuming a non-saturated system) and would definitely invite investigation IMO. If you're willing to start iterating with the source code I'd start bisecting down the call stack and see where it's happening. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Routine nodetool repair
Could the lack of routine repair be why nodetool ring reports: node(1) Load - 78.24 MB and node(2) Load - 67.21 MB? The load span between the two nodes has been increasing ever so slowly... No. Generally there will be a variation in load depending on what state compaction happens to be in on the given node (I am assuming you're not using leveled compaction). That is in addition to any imbalance that might result from your population of data in the cluster. Running repair can affect the live size, but *lack* of repair won't cause a live size divergence. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Garbage collection freezes cassandra node
During the garbage collections, Cassandra freezes for about ten seconds. I observe the following log entries: “GC for ConcurrentMarkSweep: 11597 ms for 1 collections, 1887933144 used; max is 8550678528” Ok, first off: Are you certain that it is actually pausing, or are you assuming that due to the log entry above? Because the log entry in no way indicates a 10 second pause; it only indicates that CMS took 10 seconds - which is entirely expected, and most of CMS is concurrent and implies only short pauses. A full pause can happen, but that log entry is expected and is not in and of itself indicative of a stop-the-world 10 second pause. It is fully expected using the CMS collector that you'll have a sawtooth pattern as young gen is being collected, and then a sudden drop as CMS does its job concurrently without pausing the application for a long period of time. I will second the recommendation to run with -XX:+DisableExplicitGC (or -XX:+ExplicitGCInvokesConcurrent) to eliminate that as a source. I would also run with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps and report back the results (i.e., the GC log around the time of the pause). Your graph is looking very unusual for CMS. It's possible that everything is as it otherwise should be and CMS is kicking in too late, but I am kind of skeptical towards that given the extremely smooth look of your graph. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Garbage collection freezes cassandra node
I should add: If you are indeed actually pausing due to promotion failed or concurrent mode failure (which you will see in the GC log if you enable it with the options I suggested), the first thing I would try to mitigate is: * Decrease the occupancy trigger (search for occupancy) of CMS to a lower percentage, making the concurrent mark phase start earlier. * Increase heap size significantly (probably not necessary based on your graph, but for good measure). If it then goes away, report back and we can perhaps figure out details. There are other things that can be done. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
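The "search the GC log" advice in the two messages above can be sketched as a simple scan. The log lines below are made-up, abbreviated examples of the HotSpot CMS log format, not real output:

```python
# Scan a CMS GC log for the stop-the-world fallbacks discussed above:
# "promotion failed", "concurrent mode failure", and serial "Full GC".
gc_log = [
    "2011-12-06T10:00:01.123+0000: 12.345: [GC [ParNew: ...] 0.04 secs]",
    "2011-12-06T10:00:09.456+0000: 20.678: [GC [ParNew (promotion failed): ...]"
    " [CMS: ...concurrent mode failure...] 9.87 secs]",
    "2011-12-06T10:00:30.000+0000: 41.000: [Full GC [CMS: ...] 11.5 secs]",
]

MARKERS = ("promotion failed", "concurrent mode failure", "Full GC")
bad = [line for line in gc_log if any(m in line for m in MARKERS)]

# Two of the three entries indicate a stop-the-world event; frequent
# ParNew collections (the first line) are the expected, short pauses.
assert len(bad) == 2
```

Frequent ParNew lines are normal; it is the other two patterns that correspond to the long pauses being discussed.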
Re: Cassandra stress test and max vs. average read/write latency.
I'm trying to understand if this is expected or not, and if there is Without careful tuning, outliers around a couple of hundred ms are definitely expected in general (not *necessarily*, depending on workload) as a result of garbage collection pauses. The impact will be worsened a bit if you are running under high CPU load (or even maxing it out with stress) because post-pause, if you are close to max CPU usage you will take considerably longer to catch up. Personally, I would just log each response time and feed it to gnuplot or something. It should be pretty obvious whether or not the latencies are due to periodic pauses. If you are concerned with eliminating or reducing outliers, I would: (1) Make sure that when you're benchmarking, that you're putting Cassandra under a reasonable amount of load. Latency benchmarks are usually useless if you're benchmarking against a saturated system. At least, start by achieving your latency goals at 25% or less CPU usage, and then go from there if you want to up it. (2) One can affect GC pauses, but it's non-trivial to eliminate the problem completely. For example, the length of frequent young-gen pauses can typically be decreased by decreasing the size of the young generation, leading to more frequent shorter GC pauses. But that instead causes more promotion into the old generation, which will result in more frequent very long pauses (relative to normal; they would still be infrequent relative to young gen pauses) - IF your workload is such that you are suffering from fragmentation and eventually seeing Cassandra fall back to full compacting GC:s (stop-the-world) for the old generation. I would start by adjusting young gen so that your frequent pauses are at an acceptable level, and then see whether or not you can sustain that in terms of old-gen. 
Start with this in any case: Run Cassandra with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
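The point about pauses and outliers above is easy to demonstrate: a handful of GC-pause-sized responses barely move the average but completely dominate the max. The numbers here are made up for illustration:

```python
# Why max latency and average latency tell different stories:
# a few ~500 ms pauses are invisible in the mean but own the max.
import random

random.seed(1)
latencies_ms = [random.uniform(1, 5) for _ in range(10_000)]  # normal requests
latencies_ms += [500.0] * 5                                   # a few pause-sized outliers

avg = sum(latencies_ms) / len(latencies_ms)
worst = max(latencies_ms)

assert avg < 10        # the mean stays low despite the outliers
assert worst == 500.0  # the max is entirely outlier-driven
```

This is why logging each response time and plotting it (as suggested above) makes periodic pauses obvious where averages hide them.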
Re: CPU bound workload
I would first eliminate or confirm any GC hypothesis by running all nodes with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps. Is full GC not being logged through GCInspector with the defaults? The GC inspector tries its best, but it's polling. Unfortunately the JDK provides no way to properly monitor for GC events within the Java application. The GC inspector can miss a GC. Also, the GC inspector only tells you time + type of GC; a GC log will provide all sorts of details. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Crazy compactionstats
Exception in thread main java.io.IOException: Repair command #1: some repair session(s) failed (see log for details). For why repair failed, you unfortunately need to look at the logs, as it suggests. I still see pending tasks in nodetool compactionstats, and their number goes into hundreds which I haven't seen before. What's going on? The compactions pending count is not very useful. It just says how many tasks are pending that MIGHT compact. Typically it will either be 0, or it will be steadily increasing while compactions are happening until suddenly snapping back to 0 again once compactions catch up. Whether or not non-zero is a problem depends on the Cassandra version, how many concurrent compactors you are running, and your column families/data sizes/flushing speeds etc. (Sorry, kind of a long story) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: cassandra in production environment
We are currently testing cassandra in RHEL 6.1 64 bit environment running on ESXi 5.0 and are experiencing issues with data file corruptions. If you are using linux for production environment can you please share which OS/version you are using? It would probably be a good idea if you could be a bit more specific about the nature of the corruption and the observations made, and the version of Cassandra you are using. As for production envs; lots of people are bound to use various environments in production; I suppose the only interesting bit would be if someone uses RHEL 6.1 specifically? I mean I can say that I've run Cassandra on Debian Squeeze in production, but that doesn't really help you ;) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Consistence for node shutdown and startup
How about the conflict data when the two node on line separately. How it synchronized by two nodes when they both on line finally? Briefly, it's based on timestamp conflict resolution. This may be a good resource: http://www.datastax.com/docs/1.0/dml/about_writes#about-transactions -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
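The timestamp-based conflict resolution mentioned above can be sketched in a few lines. The function and column representation here are illustrative, not Cassandra's actual API:

```python
# Minimal sketch of "last write wins" column reconciliation: each
# column carries a client-supplied timestamp, and on read/repair the
# column with the highest timestamp wins on every replica.
def reconcile(col_a, col_b):
    """Each column is a (value, timestamp) tuple; newest timestamp wins."""
    return col_a if col_a[1] >= col_b[1] else col_b

# Two nodes accepted conflicting writes while separated:
node1_version = ("red", 100)   # written at t=100 on node 1
node2_version = ("blue", 205)  # written at t=205 on node 2

# Once both are back online, both converge on the newer write.
winner = reconcile(node1_version, node2_version)
assert winner == ("blue", 205)
```

Because the rule is deterministic, both nodes converge to the same value regardless of the order in which they exchange data.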
Re: read/write counts
1) Are KS level counts and CF level counts for whole cluster or just for an individual node? Individual node. Also note that the CF level counts will refer to local reads/writes submitted to the node, while the statistics you get from StorageProxy (in JMX) are for requests routed. In general, you will see a magnification by a factor of RF on the local statistics (in aggregate) relative to the StorageProxy stats. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
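The "magnification by a factor of RF" above is simple arithmetic, sketched here with made-up numbers:

```python
# Each coordinator-level (StorageProxy) write fans out to RF local
# writes, one per replica. So summing per-node CF-level write counts
# across the cluster over-counts client requests by a factor of RF.
rf = 3
coordinator_writes = 1000                       # StorageProxy-level requests
local_writes_aggregate = coordinator_writes * rf  # sum of per-node CF counts

assert local_writes_aggregate == 3000
# Dividing the aggregate local count by RF recovers the request count.
assert local_writes_aggregate // rf == coordinator_writes
```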
Re: Meaning of values in tpstats
With the slicing, I'm not sure off the top of my head. I'm sure someone else can chime in. For e.g. a multi-get, they end up as independent tasks. So if I multiget 10 keys, they are fetched in //, consolidated by the coordinator and then sent back ? Took me a while to figure out that // == parallel :) I'm pretty sure (but not entirely, I'd have to check the code) that the request is forwarded as one request to the necessary node(s); what I was saying rather was that the individual gets get queued up as individual tasks to be executed internally in the different stages. That does lead to parallelism locally on the node (subject to the concurrent readers setting). Agreed, I followed someone's suggestion some time ago to reduce my batch sizes and it has helped tremendously. I'm now doing multigetslices in batches of 512 instead of 5000 and I find I no longer have Pendings up so high. The most I see now is a couple hundred. In general, the best balance will depend on the situation. For example the benefit of batching increases as the latency to the cluster (and within it) increases, and the negative effects increase as you have higher demands of low latency on other traffic to the cluster. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: ParNew and caching
After re-reading my post, what I meant to say is that I switched from the Serializing cache provider to the ConcurrentLinkedHash cache provider and then saw better performance, but still far worse than no caching at all: - no caching at all : 25-30ms - with Serializing provider : 1300+ms - with Concurrent provider : 500ms 100% cache hit rate. ParNew is the only stat that I see out of line, so it seems like there is still a lot of copying In general, if you want to get to the bottom of this stuff and you think GC is involved, always run with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps so that the GC activity can be observed. 1300+ ms should not be GC unless you are having fallbacks to full GC:s (would be possible to see with gc logging) and it should definitely be possible to avoid full gc:s being extremely common (but eliminating them entirely may not be possible). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: CPU bound workload
I've got a batch process running every so often that issues a bunch of counter increments. I have noticed that when this process runs without being throttled it will raise the CPU to 80-90% utilization on the nodes handling the requests. This in turn causes timeouts and general lag on queries running on the cluster. This much is entirely expected. If you are not bottlenecking anywhere else and saturating the cluster, you will be bound by it, and it will affect the latency of other traffic, no matter how fast or slow Cassandra is. You do say nodes handling the requests. Two things to always keep in mind are to (1) spread the requests evenly across all members of the cluster, and (2) if you are doing a lot of work per row key, spread it around and be concurrent so that you're not hitting a single row at a time, which will be under the responsibility of a single set of RF nodes (you want to put load on the entire cluster evenly if you want to maximize throughput). Is there anything that can be done to increase the throughput, I've been looking on the wiki and the mailing list and didn't find any optimization suggestions (apart from spreading the load on more nodes). Cluster is 5 node, BOP, RF=3, AMD opteron 4174 CPU (6 x 2.3 Ghz cores), Gigabit ethernet, RAID-0 SATA2 disks For starters, what *is* the throughput? How many counter mutations are you submitting per second? -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: CPU bound workload
Counter increment is a special case in cassandra because they incur a local read before write. Normal column writes do not do this. So counter writes are intensive. If possible batch up the increments for fewer rpc calls and fewer reads. Note though that the CPU usage impact of this should be limited, in comparison to the impact when your reads end up going down to disk. I.e., the most important performance characteristic for people to consider is that counter writes may need to read from disk, contrary to all other writes in Cassandra which only imply sequential I/O done asynchronously. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra behavior too fragile?
Thing is, why is it so easy for the repair process to break? OK, I admit I'm not sure why nodes are reported as dead once in a while, but it's absolutely certain that they simply don't fall off the edge, are knocked out for 10 min or anything like that. Why is there no built-in tolerance/retry mechanism so that a node that may seem silent for a minute can be contacted later, or, better yet, a different node with a relevant replica is contacted? As was evident from some presentations at Cassandra-NYC yesterday, failed compactions and repairs are a major problem for a number of users. The cluster can quickly become unusable. I think it would be a good idea to build more robustness into these procedures, I am trying to argue for removing the failure-detector-kills-repair in https://issues.apache.org/jira/browse/CASSANDRA-3569, but I don't know whether that will happen since there is opposition. However, that only fixes the particular issue you are having right now. There are significant problems with repair, and the answer to why there is no retry is probably because it takes non-trivial amounts of work to make the current repair process be fault-tolerant in the face of TCP connections dying. Personally, my pet ticket to fix repair once and for all is https://issues.apache.org/jira/browse/CASSANDRA-2699 which should, at least as I envisioned it, fix a lot of problems, including making it much much much more robust to transient failures (it would just automatically be robust without specific code necessary to deal with it, because repair work would happen piecemeal and incrementally in a repeating fashion anyway). Nodes could basically be going up and down in any wild haywire mode and things would just automatically continue to work in the background. Repair would become irrelevant to cluster maintenance, and you wouldn't really have to think about whether or not someone is repairing. You would also not have to think about repair vs. 
gc grace time because it would all just sit there and work without intervention. It's a pretty big ticket though and not something I'm gonna be working on in my spare time, so I don't know whether or when I would actually work on that ticket (depends on priorities). I have the ideas but I can't promise to fix it :) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra not suitable?
I'm quite desperate about Cassandra's performance in our production cluster. We have 8 real-HW nodes, 32core CPU, 32GB memory, 4 disks in raid10, cassandra 0.8.8, RF=3 and Hadoop. We have four keyspaces; one is the large one, it has 2 CFs, one is kind of index, the other holds data. There are about 7 million rows, mean row size is 7kB. We run several mapreduce tasks, most of them just read from Cassandra and write to hdfs, but one fetches rows from Cassandra, computes something and writes it back; for each row we compute three new json values, about 1kB each (they get overwritten next round). We got lots and lots of Timeout exceptions, LiveSSTablesCount over 100. Repair doesn't finish even in 24 hours, reading from the other keyspaces timeouts as well. We set compaction_throughput_mb_per_sec: 0 but it didn't help. Exactly why you're seeing timeouts would depend on quite a few factors. In general however, my observation is that you have ~ 100 gig per node on nodes with 32 gigs of memory, *AND* you say you're running map reduce jobs. In general, I would expect that any performance problems you have are probably due to cache misses and simply bottlenecking on disk I/O. What to do about it depends very much on the situation and it's difficult to give a concrete suggestion without more context. Some things that might mitigate effects include using row cache for the hot data set (if you have a very small hot data set that should work well since the row cache is unaffected by e.g. mapreduce jobs), selecting a different compaction strategy (leveled can be better, depending), running map reduce on a separate DC that takes writes but is separated from the live cluster that takes reads (unless you're only doing batch requests). But those are just some random things thrown in the air; do not take that as concrete suggestions for your particular case. 
The key is understanding the access pattern, what the bottlenecks are, in combination with how the database works - and figure out what the most cost-effective solution is. Note that if you're bottlenecking on disk I/O, it's not surprising at all that repairing ~ 100 gigs of data takes more than 24 hours. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Repair failure under 0.8.6
I capped heap and the error is still there. So I keep seeing node dead messages even when I know the nodes were OK. Where and how do I tweak timeouts? You can increase phi_convict_threshold in the configuration. However, I would rather want to find out why they are being marked as down to begin with. In a healthy situation, especially if you are not putting extreme load on the cluster, there is very little reason for hosts to be marked as down unless there's some bug somewhere. Is this cluster under constant traffic? Are you seeing slow requests from the point of view of the client (indicating that some requests are routed to nodes that are temporarily inaccessible)? With respect to GC, I would recommend running with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps and then looking at the system log. A fallback to full GC should be findable by grepping for Full. Also, is this a problem with one specific host, or is it happening to all hosts every now and then? And I mean either the host being flagged as down, or the host that is flagging others as down. As for uncapped heap: Generally a larger heap is not going to make it more likely to fall back to full GC; usually the opposite is true. However, a larger heap can make some of the non-full GC pauses longer, depending. In either case, running with the above GC options will give you specific information on GC pauses and should allow you to rule that out (or not). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Repair failure under 0.8.6
I will try to increase phi_convict -- I will just need to restart the cluster after the edit, right? You will need to restart the nodes for which you want the phi convict threshold to be different. You might want to do it on e.g. half of the cluster for A/B testing. I do recall that I see nodes temporarily marked as down, only to pop up later. I recommend grepping through the logs on all the nodes (e.g., cat /var/log/cassandra/cassandra.log | grep UP | wc -l). That should tell you quickly whether they all seem to be seeing roughly as many node flaps, or whether some particular node or set of nodes is/are over-represented. Next, look at the actual nodes flapping (remove wc -l) and see if all nodes are flapping or if it is a single node, or a subset of the nodes (e.g., sharing a switch perhaps). In the current situation, there is no load on the cluster at all, outside maintenance like the repair. Ok. So what I'm getting at then is that there may be real legitimate connectivity problems that you aren't noticing in any other way since you don't have active traffic to the cluster. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
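The grep-and-count idea above can also be done in a few lines of Python. The log lines below are made-up approximations of the Gossiper's UP/dead messages:

```python
# Count UP/dead flap messages per node to see whether one node (or a
# subset, e.g. sharing a switch) is over-represented in the flapping.
from collections import Counter
import re

log = [
    "INFO Gossiper InetAddress /10.0.0.1 is now UP",
    "INFO Gossiper InetAddress /10.0.0.2 is now dead.",
    "INFO Gossiper InetAddress /10.0.0.2 is now UP",
    "INFO Gossiper InetAddress /10.0.0.2 is now dead.",
    "INFO Gossiper InetAddress /10.0.0.2 is now UP",
]

flaps = Counter(
    m.group(1)
    for line in log
    if (m := re.search(r"/(\S+) is now (UP|dead)", line))
)

# 10.0.0.2 flaps far more than its peers -> investigate that node/link.
assert flaps["10.0.0.2"] == 4
assert flaps["10.0.0.1"] == 1
```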
Re: Repair failure under 0.8.6
As a side effect of the failed repair (so it seems) the disk usage on the affected node prevents compaction from working. It still works on the remaining nodes (we have 3 total). Is there a way to scrub the extraneous data? This is one of the reasons why killing an in-process repair is a bad thing :( If you do not have enough disk space for any kind of compaction to work, then no, unfortunately there is no easy way to get rid of the data. You can go to extra trouble such as moving the entire node to some other machine (e.g. firewalled from the cluster) with more disk and run compaction there and then move it back, but that is kind of painful to do. Another option is to decommission the node and replace it. However, be aware that (1) that leaves the ring with less capacity for a while, and (2) when you decommission, the data you stream from that node to others would be artificially inflated due to the repair so there is some risk of infecting the other nodes with a large data set. I should mention that if you have no traffic running against the cluster, one way is to just remove all the data and then run repair afterwards. But that implies that you're trusting that (1) no reads are going to the cluster (else you might serve reads based on missing data) and (2) that you are comfortable with loss of the data on the node. (2) might be okay if you're e.g. writing at QUORUM at all times and have RF = 3 (basically, this is as if the node would have been lost due to hardware breakage). A faster way to reconstruct the node would be to delete the data from your keyspaces (except the system keyspace), start the node (now missing data), and run 'nodetool rebuild' after https://issues.apache.org/jira/browse/CASSANDRA-3483 is done. The patch attached to that ticket should work for 0.8.6 I suspect (but no guarantees). This also assumes you have no reads running against the cluster. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Repair failure under 0.8.6
I don't quite understand how Cassandra declared a node dead (in the below). Was it a timeout? How do I fix that? I was about to respond to say that repair doesn't fail just due to failure detection, but this appears to have been broken by CASSANDRA-2433 :( Unless there is a subtle bug, the exception you're seeing should be indicative that it really was considered Down by the node. You might grep the log for references to the node in question (UP or DOWN) to confirm. The question is why though. I would check if the node has maybe automatically restarted, or went into full GC, etc. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Repair failure under 0.8.6
Filed https://issues.apache.org/jira/browse/CASSANDRA-3569 to fix it so that streams don't die due to conviction. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Decommission a node causing high IO usage on 2 other nodes
What may be causing high IO wait on 2.new and 3.new? Do you have RF=3? The problem with 'nodetool ring' in terms of interpreting those '0%' is that it does not take RF and replication strategy into account. If you have RF=3, whatever data has its primary range assigned to node N will also be on N+1 and N+2. Additionally, whatever data has its primary range assigned to N-1 and N-2 will also have a copy on N. So despite all the old nodes showing up as 0% in nodetool ring, they are in fact responsible for data. And you would expect to see nodes N+2 and N-2 be the two old nodes affected by decommissioning node N. (Unless I'm tripping myself up somewhere now...) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
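The ring arithmetic above can be sketched with a toy SimpleStrategy-style placement (node names, ring order, and RF=3 are assumptions for illustration):

```python
# With RF=3, a key's replicas are its primary node plus the next
# RF-1 nodes walking clockwise around the ring.
def replicas(primary_index, ring, rf=3):
    return [ring[(primary_index + i) % len(ring)] for i in range(rf)]

ring = ["n1", "n2", "n3", "n4", "n5"]

# Data whose primary range belongs to n3 also lives on n4 and n5 (N+1, N+2):
assert replicas(2, ring) == ["n3", "n4", "n5"]

# Conversely, n3 holds copies of the primary ranges of n1 and n2
# (N-2, N-1) in addition to its own - so a "0%" owner is not empty:
holders = [ring[i] for i in range(len(ring)) if "n3" in replicas(i, ring)]
assert holders == ["n1", "n2", "n3"]
```

This is why decommissioning a node involves nodes up to two positions away on the ring even when nodetool ring shows them owning 0%.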
Re: Pending ReadStage is exploding on only one node
I'm measuring a high load value on a few nodes during the update process (which is normal), but one node keeps the high load after the process for a long time. I would say that either the reading that you do is overloading that one node and other traffic is getting piled up as a result, or you're stomping on page cache by reading a lot from that one node (e.g. using CL.ONE) and you're then seeing readstage backed up until the page cache or row cache is warm again. In general, unless you're running at close to full CPU capacity it sounds like you're completely disk bound, and that'll show up as a huge amount of pending ReadStage. iostat -x -k 1 should confirm it. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Local quorum reads
Keyspace: NetworkTopologyStrategy, 1DC, RF=2 1 node goes down and we cannot read from the ring. My expectation is that LOCAL_QUORUM dictates that it will return a record once a majority (N/2 +1) of replicas reports back. Yes, the RF and the number of hosts that are up within the replica set of the row in question is what matters. 2/2 + 1 = 2 I originally thought N=3 for 3 nodes but someone has corrected me on this that it's the replicas (2) but when I use the cli, setConsistencyLevel AS LOCAL_QUORUM nothing comes back. Set it back to ONE I can see the data Is there something I'm missing? I don't know, but I am guessing this is just a poor UI issue. The CLI is probably getting an Unavailable exception and just giving you nothing instead of reporting a problem, but that is speculation on my part. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Local quorum reads
My expectation is that LOCAL_QUORUM dictates that it will return a record once a majority (N/2 +1) of replicas reports back. Yes, the RF and the number of hosts that are up within the replica set of the row in question is what matters. And note that this is for fundamental reasons. A node that is not part of the replica set (and is not expected to have any data for the row) cannot usefully participate in the read. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
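The quorum arithmetic from the two messages above, as a quick sanity check:

```python
# Quorum is computed over the replication factor, not the cluster size:
# floor(RF/2) + 1 replicas must respond.
def quorum(rf):
    return rf // 2 + 1

assert quorum(2) == 2   # RF=2: BOTH replicas must answer
assert quorum(3) == 2   # RF=3: one replica can be down

# With RF=2 and one replica down, only 1 of the required 2 is alive,
# so a LOCAL_QUORUM read is unavailable (a proper error, not an
# empty result). CL.ONE still succeeds with the single live replica.
alive = 1
assert alive < quorum(2)
assert alive >= 1  # CL.ONE is satisfiable
```

This is why RF=2 gives no fault tolerance at quorum: the quorum of 2 equals the full replica set, while RF=3 tolerates one replica down.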
Re: Local quorum reads
No it's not just the cli tool, our app has the same issue coming back with read issues. You are supposed to not be able to read it. But you should be getting a proper error, not an empty result. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Random access
you can access cassandra data at any node and keys can be accessed at random. Including individual columns in a row. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Off-heap caching through ByteBuffer.allocateDirect when JNA not available ?
I would like to know it also - actually it should be similar, plus there are no dependencies on sun.misc packages. I don't remember the discussion, but I assume the reason is that allocateDirect() is not explicitly freeable; the memory is only released once the buffer object is garbage collected. This is enforced by the API in order to enable safe use of allocated memory without it being possible to use to break out of the JVM sandbox. JNA or misc.unsafe allows explicit freeing (at the cost of application bugs maybe segfaulting the JVM or causing other side-effects; i.e., breaking out of the managed runtime sandbox). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Mass deletion -- slowing down
Deletions in Cassandra imply the use of tombstones (see http://wiki.apache.org/cassandra/DistributedDeletes) and under some circumstances reads can turn O(n) with respect to the number of columns deleted, depending. It sounds like this is what you're seeing. For example, suppose you're inserting a range of columns into a row, deleting it, and inserting another non-overlapping subsequent range. Repeat that a bunch of times. In terms of what's stored in Cassandra for the row you now have: tomb tomb tomb tomb actual data If you then do something like a slice on that row with the end-points being such that they include all the tombstones, Cassandra essentially has to read through and process all those tombstones (for the PostgreSQL aware: this is similar to the effect you can get if implementing e.g. a FIFO queue, where MIN(pos) turns O(n) with respect to the number of deleted entries until the last vacuum; improved in modern versions). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
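The O(n) tombstone effect described above can be sketched directly. The row layout and column names here are illustrative, not Cassandra internals:

```python
# A row where repeated insert-then-delete cycles left a long run of
# tombstones before the live data: a slice from the start of the row
# must step over every tombstone before reaching a live column.
row = [("col%05d" % i, "TOMBSTONE") for i in range(10_000)]   # deleted columns
row += [("col%05d" % i, "live") for i in range(10_000, 10_010)]  # actual data

def slice_first_live(row):
    """Return the first live column name and how many cells were scanned."""
    scanned = 0
    for name, value in row:  # columns are stored in sorted order
        scanned += 1
        if value != "TOMBSTONE":
            return name, scanned
    return None, scanned

name, scanned = slice_first_live(row)
assert name == "col10000"
assert scanned == 10_001  # cost is O(n) in the tombstones passed over
```

Narrowing the slice's start point past the deleted range (or letting the tombstones expire past gc_grace and be compacted away) avoids the scan.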