Write timeout on other nodes when joining a new node (in new DC)
Hi all,

we are experiencing strange behavior when we try to bootstrap a new node. The problem is that the Recent Write Latency goes to 2s on all the other Cassandra nodes (which are receiving user traffic), which corresponds to our setting of "write_request_timeout_in_ms: 2000".

We use Cassandra 2.0.10 and are trying to convert to vnodes and increase the replication factor. So we are adding a new node in a new DC (marked as DCXA) as the only node in that DC, with replication factor 3. The reason for the higher RF is that we will be converting another 2 existing servers to the new DC (vnodes) and we want them to get all the data.

The replication settings look like this:

ALTER KEYSPACE slw WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DC0': '1', 'DC1': '1', 'DC2': '1', 'DC3': '1', 'DC4': '1', 'DC5': '1',
  'DC0A': '3', 'DC1A': '3', 'DC2A': '3', 'DC3A': '3', 'DC4A': '3', 'DC5A': '3'
};

We were adding the nodes to DC0A->DC4A without any effect on the existing nodes (DCX without A). When we try to add DC5A, the above-mentioned problem happens, 100% reproducibly.

I tried to increase the number of concurrent writers from 32 to 128 on the old nodes, and also tried to increase the number of flush writers, both with no effect. The strange thing is that the load, CPU usage, GC, network throughput - everything is fine on the old nodes which are reporting 2s of write latency. Nodetool tpstats does not show any blocked/pending operations.

I think I must be hitting some limit (because of the overall number of replicas?) somewhere.

Any input would be greatly appreciated.

Thanks
Jirka H.
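In case it helps anyone debugging a similar topology: one quick sanity check is to ask Cassandra which replicas a given key maps to under the new settings and in which DCs they sit. The table and key below are placeholders, only the keyspace name is from the post:

    # which nodes own this key in keyspace slw? (table/key are hypothetical)
    nodetool getendpoints slw some_table some_key
    # cross-check per-DC ownership for the same keyspace
    nodetool status slw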
Re: Write timeout on other nodes when joining a new node (in new DC)
Hi all,

so after deep investigation, we found out that we are hitting this problem: https://issues.apache.org/jira/browse/CASSANDRA-8058

Jiri Horky
Re: Availability testing of Cassandra nodes
Hi Jack,

it seems there is some misunderstanding. There are two things. One is that Cassandra works for the application, which may (and should) be true even if some of the nodes are actually down. The other is that even in this case you want to be notified that there are faulty Cassandra nodes. I am trying to tackle the latter case; I am not having issues with how client-side load balancing works.

Jirka H.

On 04/09/2015 07:15 AM, Ajay wrote: Adding the Java driver forum. We would like to know more on this as well. - Ajay

On Wed, Apr 8, 2015 at 8:15 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote: Just a couple of quick comments: 1. The driver is supposed to be doing availability and load balancing already. 2. If your cluster is lightly loaded, it isn't necessary to be so precise with load balancing. 3. If your cluster is heavily loaded, it won't help. The solution is to expand your cluster so that precise balancing of requests (beyond what the driver does) is not required. Is there anything special about your use case that you feel is worth the extra treatment? If you are having problems with the driver balancing requests and properly detecting available nodes, or see some room for improvement, make sure to file the issues so that they can be fixed. -- Jack Krupansky
Availability testing of Cassandra nodes
Hi all,

we are thinking about how to best proceed with availability testing of Cassandra nodes. It is becoming more and more apparent that it is a rather complex task. We thought that we should try to read and write to each Cassandra node, to a monitoring keyspace, with a unique value with a low TTL. This helps to find an issue, but it also triggers flapping of unaffected hosts, as the key of the value which is being inserted sometimes belongs to an affected host and sometimes not. Now, we could calculate the right value to insert so we can be sure it will hit the host we are connecting to, but then you have the replication factor and consistency level, so you cannot be really sure that it actually tests the ability of the given host to write values.

So we ended up thinking that the best approach is to connect to each individual host, read some system keyspace (which might be on a different disk drive...), which should be local, and then check several JMX values that could indicate an error, plus JVM statistics (full heap, GC overhead). Moreover, we will also monitor our applications that are using Cassandra (mostly with the DataStax driver) and try to get failed-node information from them.

How do others do this kind of testing?

Jirka H.
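A minimal sketch of the per-host probe described above, using the DataStax Java driver's WhiteListPolicy to pin the connection to a single node and read a table that is local to it. The host/port handling and the boolean result are assumptions; the JMX checks mentioned above would live alongside this:

    import java.net.InetSocketAddress
    import scala.collection.JavaConverters._
    import com.datastax.driver.core.Cluster
    import com.datastax.driver.core.policies.{RoundRobinPolicy, WhiteListPolicy}

    // Probe a single Cassandra node: connect only to that host and read a table
    // that is local to it, so the check does not depend on other replicas.
    def probeNode(host: String, port: Int = 9042): Boolean = {
      val whiteList = List(new InetSocketAddress(host, port)).asJava
      val cluster = Cluster.builder()
        .addContactPoint(host)
        .withLoadBalancingPolicy(new WhiteListPolicy(new RoundRobinPolicy(), whiteList))
        .build()
      try {
        val session = cluster.connect()
        // system.local is always present and owned by the node itself
        session.execute("SELECT release_version FROM system.local").one() != null
      } catch {
        case _: Exception => false
      } finally {
        cluster.close()
      }
    }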
Re: High GC activity on node with 4TB of data
Number of cores: 2x 6 cores x 2 (HT). I do agree with you that the hardware is certainly oversized for just one Cassandra instance, but we got a very good price since we ordered several tens of the same nodes for a different project. That's why we use them for multiple Cassandra instances.

Jirka H.

On 02/12/2015 04:18 PM, Eric Stevens wrote:
> "each node has 256G of memory, 24x1T drives, 2x Xeon CPU"
> I don't have first hand experience running Cassandra on such massive hardware, but it strikes me that these machines are dramatically oversized to be good candidates for Cassandra (though I wonder how many cores are in those CPUs; I'm guessing closer to 18 than 2 based on the other hardware). A larger cluster of smaller hardware would be a much better shape for Cassandra. Or several clusters of smaller hardware since you're running multiple instances on this hardware - best practices have one instance per host no matter the hardware size.
Re: How to speed up SELECT * query in Cassandra
Well, I have always wondered how Cassandra can be used in a Hadoop-like environment where you basically need to do a full table scan. I have to say that our experience is that Cassandra is perfect for writing and for reading specific values by key, but definitely not for reading all of the data out of it. Some of our projects found out that doing that with a non-trivial amount of data in a timely manner is close to impossible in many situations. We are slowly moving to storing the data in HDFS and possibly reprocessing them on a daily basis for such use cases (statistics). This is nothing against Cassandra; it cannot be perfect for everything. But I am really interested in how it can work well with Spark/Hadoop, where you basically need to read all the data as well (as far as I understand it).

Jirka H.

On 02/11/2015 01:51 PM, DuyHai Doan wrote: "The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra." Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such a statement?

On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) <mvallemil...@bloomberg.net> wrote: "cassandra makes a very poor datawarehouse or long term time series store" Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information; it's only not good for storing temporary data. Netflix has a large amount of data and it's all stored in Cassandra, AFAIK. "The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra." I am not sure about the current state of Spark support for Cassandra, but I guess if you create a map reduce job, the intermediate map results will still be stored in HDFS, as happens with Hadoop, is this right? I think the problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard part Spark or Hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map/reduce job used a temporary CF in Cassandra to store intermediate results?

From: user@cassandra.apache.org Subject: Re: How to speed up SELECT * query in Cassandra
I use spark with cassandra, and you don't need DSE. I see a lot of people ask this same question below (how do I get a lot of data out of cassandra?), and my question is always, why aren't you updating both places at once? For example, we use hadoop and cassandra in conjunction with each other; we use a message bus to store every event in both, aggregate in both, but only keep current data in cassandra (cassandra makes a very poor datawarehouse or long term time series store) and then use services to process queries that merge data from hadoop and cassandra. Also, spark on hdfs gives more flexibility in terms of large datasets and performance. The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra. -- Colin Clark, +1 612 859 6129, Skype colin.p.clark

On Feb 11, 2015, at 4:49 AM, Jens Rantil <jens.ran...@tink.se> wrote: On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) <mvallemil...@bloomberg.net> wrote: "If you use Cassandra enterprise, you can use hive, AFAIK." Even better, you can use Spark/Shark with DSE.
Cheers, Jens -- Jens Rantil, Backend engineer, Tink AB (jens.ran...@tink.se, www.tink.se)
Re: How to speed up SELECT * query in Cassandra
Hi,

here are some snippets of code in Scala which should get you started.

Jirka H.

    // Page through the results: each iteration issues the next token-range query.
    // ("loop" is a helper not shown here; it repeatedly applies this function,
    // feeding the last row of the previous page back in as lastRow.)
    loop { lastRow =>
      val query = lastRow match {
        case Some(row) => nextPageQuery(row, upperLimit)
        case None      => initialQuery(lowerLimit)
      }
      session.execute(query).all
    }

    private def nextPageQuery(row: Row, upperLimit: String): String = {
      val tokenPart = "token(%s) > token(0x%s) and token(%s) < %s"
        .format(rowKeyName, hex(row.getBytes(rowKeyName)), rowKeyName, upperLimit)
      basicQuery.format(tokenPart)
    }

    private def initialQuery(lowerLimit: String): String = {
      val tokenPart = "token(%s) >= %s".format(rowKeyName, lowerLimit)
      basicQuery.format(tokenPart)
    }

    // Split the partitioner's token space into "concurrency" equally sized ranges,
    // either from an explicitly given token range or from the whole token space.
    private def calculateRanges: (BigDecimal, BigDecimal, IndexedSeq[(BigDecimal, BigDecimal)]) = {
      tokenRange match {
        case Some((start, end)) =>
          Logger.info("Token range given: {}",
            "<" + start.underlying.toPlainString + ", " + end.underlying.toPlainString + ">")
          val tokenSpaceSize = end - start
          val rangeSize = tokenSpaceSize / concurrency
          val ranges = for (i <- 0 until concurrency)
            yield (start + (i * rangeSize), start + ((i + 1) * rangeSize))
          (tokenSpaceSize, rangeSize, ranges)
        case None =>
          val tokenSpaceSize = partitioner.max - partitioner.min
          val rangeSize = tokenSpaceSize / concurrency
          val ranges = for (i <- 0 until concurrency)
            yield (partitioner.min + (i * rangeSize), partitioner.min + ((i + 1) * rangeSize))
          (tokenSpaceSize, rangeSize, ranges)
      }
    }

    private val basicQuery = {
      "select %s, %s, %s, writetime(%s) from %s where %s%s limit %d%s".format(
        rowKeyName,
        columnKeyName,
        columnValueName,
        columnValueName,
        columnFamily,
        "%s", // template: the token condition is substituted in later
        whereCondition,
        pageSize,
        if (cqlAllowFiltering) " allow filtering" else ""
      )
    }

    case object Murmur3 extends Partitioner {
      override val min = BigDecimal(-2).pow(63)
      override val max = BigDecimal(2).pow(63) - 1
    }

    case object Random extends Partitioner {
      override val min = BigDecimal(0)
      override val max = BigDecimal(2).pow(127) - 1
    }

On 02/11/2015 02:21 PM, Ja Sam wrote:
> Your answer looks very promising. How do you calculate start and stop?
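To tie the snippets together, one way to fan the calculated ranges out in parallel would be plain Scala futures; scanRange below is a hypothetical wrapper around the paging loop above, one call per token sub-range:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global

    // One future per token sub-range; each worker pages through its own slice only.
    val (_, _, ranges) = calculateRanges
    val workers = ranges.map { case (start, end) =>
      Future(scanRange(start, end)) // hypothetical helper wrapping the paging loop above
    }
    Await.result(Future.sequence(workers), Duration.Inf)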
Re: High GC activity on node with 4TB of data
Hi Chris,

On 02/09/2015 04:22 PM, Chris Lohfink wrote:
> "- number of tombstones - how can I reliably find it out?"
> https://github.com/spotify/cassandra-opstools
> https://github.com/cloudian/support-tools

thanks.

> If not getting much compression it may be worth trying to disable it; it may contribute but it's very unlikely that it's the cause of the gc pressure itself. 7000 sstables but STCS? Sounds like compactions couldn't keep up. Do you have a lot of pending compactions (nodetool)? You may want to increase your compaction throughput (nodetool) to see if you can catch up a little; it would cause a lot of heap overhead to do reads with that many. May even need to take more drastic measures if it can't catch back up.

I am sorry, I was wrong. We actually do use LCS (the switch was done recently). There are almost no pending compactions. We have increased the sstable size to 768M, so it should help as well.

> May also be good to check `nodetool cfstats` for very wide partitions.

There are basically none, this is fine. It seems that the problem really comes from having so much data in so many sstables, so the org.apache.cassandra.io.compress.CompressedRandomAccessReader classes consume more memory than 0.75*HEAP_SIZE, which triggers the CMS over and over. We have turned off the compression and so far the situation seems to be fine.

Cheers
Jirka H.

> There's a good chance if under load and you have over 8gb heap your GCs could use tuning. The bigger the nodes the more manual tweaking it will require to get the most out of them. https://issues.apache.org/jira/browse/CASSANDRA-8150 also has some ideas.
> Chris
Re: How to speed up SELECT * query in Cassandra
The fastest way I am aware of is to do the queries in parallel against multiple Cassandra nodes and make sure that you only ask them for keys they are responsible for. Otherwise, the node needs to resend your query, which is much slower and creates unnecessary objects (and thus GC pressure). You can manually take advantage of the token range information if the driver does not take this into account for you. Then, you can play with the concurrency and batch size of a single query against one node. Basically, what you/the driver should do is to transform the query into a series of SELECT * FROM TABLE WHERE TOKEN IN (start, stop) queries. I will need to look up the actual code, but the idea should be clear :)

Jirka H.

On 02/11/2015 11:26 AM, Ja Sam wrote:
> Is there a simple way (or even a complicated one) to speed up a SELECT * FROM [table] query? I need to get all rows from one table every day. I split the tables and create one for each day, but the query is still quite slow (200 million records). I was thinking about running this query in parallel, but I don't know if it is possible.
Re: High GC activity on node with 4TB of data
Hi all,

thank you all for the info. To answer the questions:

- we have 2 DCs with 5 nodes in each; each node has 256G of memory, 24x 1T drives, 2x Xeon CPU
- there are multiple Cassandra instances running for different projects; the node itself is powerful enough
- there are 2 keyspaces, one with 3 replicas per DC, one with 1 replica per DC (because of the amount of data and because it serves more or less as a cache)
- there are about 4k/s Request-response, 3k/s Read and 2k/s Mutation requests - the numbers are sums over all nodes
- we use STCS (LCS would be quite IO-heavy for this amount of data)
- number of tombstones - how can I reliably find this out?
- the biggest CF (3.6T per node) has 7000 sstables

Now, I understand that the best practice for Cassandra is to run with the minimum heap size that is sufficient, which in this case we thought was about 12G - there is always 8G consumed by the SSTable readers. Also, I thought that a high number of tombstones creates pressure in the new space (which can then cause pressure in the old space as well), but this is not what we are seeing. We see continuous GC activity in the Old generation only.

Also, I noticed that the biggest CF has a compression factor of 0.99, which basically means that the data come in compressed already. Do you think that turning off the compression should help with memory consumption?

Also, I think that tuning CMSInitiatingOccupancyFraction=75 might help here, as it seems that 8G is something that Cassandra needs for bookkeeping this amount of data, and that this was slightly above the 75% limit, which triggered the CMS again and again.

I will definitely have a look at the presentation.

Regards
Jiri Horky

On 02/08/2015 10:32 PM, Mark Reddy wrote:
> Hey Jiri,
> While I don't have any experience running 4TB nodes (yet), I would recommend taking a look at a presentation by Aaron Morton on large nodes: http://planetcassandra.org/blog/cassandra-community-webinar-videoslides-large-nodes-with-cassandra-by-aaron-morton/ to see if you can glean anything from that. I would note that at the start of his talk he mentions that in version 1.2 we can now talk about nodes around 1 - 3 TB in size, so if you are storing anything more than that you are getting into very specialised use cases.
> If you could provide us with some more information about your cluster setup (No. of CFs, read/write patterns, do you delete / update often, etc.) that may help in getting you to a better place.
> Regards,
> Mark
> On 8 February 2015 at 21:10, Kevin Burton <bur...@spinn3r.com> wrote:
>> Do you have a lot of individual tables? Or lots of small compactions? I think the general consensus is that (at least for Cassandra), 8GB heaps are ideal. If you have lots of small tables it's a known anti-pattern (I believe) because the Cassandra internals could do a better job on handling the in-memory metadata representation. I think this has been improved in 2.0 and 2.1 though, so the fact that you're on 1.2.18 could exacerbate the issue. You might want to consider an upgrade (though that has its own issues as well).
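For reference, here is roughly how the heap and CMS settings discussed in this thread map to cassandra-env.sh; this is a sketch of the knobs only, not a recommendation (the 12G/2G values are the ones quoted above):

    # cassandra-env.sh -- heap sizing and CMS trigger discussed in this thread
    MAX_HEAP_SIZE="12G"
    HEAP_NEWSIZE="2G"
    # CMS starts a concurrent cycle once the old generation passes this occupancy;
    # with ~8G of permanently live SSTable-reader data on a 12G heap, the 75%
    # threshold is crossed almost immediately, hence the back-to-back CMS cycles.
    JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC -XX:+UseParNewGC"
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly"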
High GC activity on node with 4TB of data
Hi all,

we are seeing quite high GC pressure (in old space, by the CMS GC algorithm) on a node with 4TB of data. It runs C* 1.2.18 with 12G of heap memory (2G for new space). The node runs fine for a couple of days, then the GC activity starts to rise and reaches about 15% of the C* activity, which causes dropped messages and other problems.

Taking a look at a heap dump, there is about 8G used by SSTableReader classes in org.apache.cassandra.io.compress.CompressedRandomAccessReader.

Is this something expected and we have just reached the limit of how much data a single Cassandra instance can handle, or is it possible to tune it better?

Regards
Jiri Horky
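If anyone wants to reproduce the heap inspection mentioned above, the dump can be taken with standard JDK tooling (the PID and output path are placeholders) and then opened in a heap analyzer such as Eclipse MAT:

    # on-demand heap dump of the running Cassandra JVM
    jmap -dump:live,format=b,file=/tmp/cassandra-heap.hprof <cassandra_pid>
    # alternatively, let the JVM write one automatically when it runs out of memory
    # by adding -XX:+HeapDumpOnOutOfMemoryError to the JVM options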
Re: Node down during move
Hi,

just a follow-up. We have seen this behavior multiple times now. It seems that the receiving node loses connectivity to the cluster and thus thinks that it is the sole online node, whereas the rest of the cluster thinks that it is the only offline node, really just after the streaming is over. I am not sure what causes this, but it is reproducible. A restart of the affected node helps. We have 3 datacenters (RF=1 for each datacenter) where we are moving the tokens. This happens in only one of them.

Regards
Jiri Horky
Node down during move
Hi list,

we added a new node to an existing 8-node cluster with C* 1.2.9 without vnodes, and because we are almost totally out of space, we are shuffling the tokens of one node after another (not in parallel). During one of these move operations, the receiving node died and thus the streaming failed:

WARN [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,227 StorageService.java (line 3703) Streaming to /X.Y.Z.18 failed
INFO [RMI TCP Connection(12940)-X.Y.Z.17] 2014-12-19 19:25:56,233 ColumnFamilyStore.java (line 629) Enqueuing flush of Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
INFO [FlushWriter:3772] 2014-12-19 19:25:56,238 Memtable.java (line 461) Writing Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
ERROR [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,246 CassandraDaemon.java (line 192) Exception in thread Thread[Streaming to /X.Y.Z.18:2,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Broken pipe
at com.google.common.base.Throwables.propagate(Throwables.java:160)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)

After a restart of the receiving node, we tried to perform the move again, but it failed with:

Exception in thread main java.io.IOException: target token 113427455640312821154458202477256070486 is already owned by another node.
at org.apache.cassandra.service.StorageService.move(StorageService.java:2930)

So we tried to move it with a token just 1 higher, to trigger the movement. This didn't move anything, but finished successfully:

INFO [Thread-5520] 2014-12-19 20:00:24,689 StreamInSession.java (line 199) Finished streaming session 4974f3c0-87b1-11e4-bf1b-97d9ac6bd256 from /X.Y.Z.18

Now, it is quite improbable that the first streaming actually completed and the node died just after copying everything, as the ERROR was the last message about streaming in the logs. Is there any way to make sure the data really have been moved, so that running nodetool cleanup is safe?

Thank you.
Jiri Horky
Fail to reconnect to other nodes after intermittent network failure
$Segment.lockedGetOrLoad(LocalCache.java:2337) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2252) ... 20 more Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses. at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:105) at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:930) at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:815) at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:140) at org.apache.cassandra.auth.Auth.selectUser(Auth.java:245) ... 29 more These exception stopped to appear at 2014-08-01 08:59:17,48 (and did not reapper after that), and cassandra seemed to be happy as it marked the other two nodes as up for a while. But after a few tens of seconds, it marked them as DOWN again and kept it in this state until restart three days later, even the network connectivity was stable: INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809) InetAddress /77.234.42.20 is now UP INFO [HintedHandoff:2] 2014-08-01 09:02:37,305 HintedHandOffManager.java (line 296) Started hinted handoff for host: 9252f37c-1c9a-418b-a49f-6065511946e4 with IP: /77.234.42.20 INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809) InetAddress /77.234.44.20 is now UP INFO [HintedHandoff:1] 2014-08-01 09:02:37,308 HintedHandOffManager.java (line 296) Started hinted handoff for host: 97b1943a-3689-4e4a-a39d-d5a11c0cc309 with IP: /77.234.44.20 INFO [BatchlogTasks:1] 2014-08-01 09:02:45,724 ColumnFamilyStore.java (line 633) Enqueuing flush of Memtable-batchlog@1311733948(239547/247968 serialized/live bytes, 32 ops) INFO [FlushWriter:1221] 2014-08-01 09:02:45,725 Memtable.java (line 398) Writing Memtable-batchlog@1311733948(239547/247968 serialized/live bytes, 32 ops) INFO [FlushWriter:1221] 2014-08-01 09:02:45,738 Memtable.java (line 443) Completed flushing; nothing needed to be retained. Commitlog position was ReplayPosition(segmentId=1403712545417, position=20482758) INFO [HintedHandoff:1] 2014-08-01 09:02:47,312 HintedHandOffManager.java (line 427) Timed out replaying hints to /77.234.44.20; aborting (0 delivered) INFO [HintedHandoff:2] 2014-08-01 09:02:47,312 HintedHandOffManager.java (line 427) Timed out replaying hints to /77.234.42.20; aborting (0 delivered) INFO [GossipTasks:1] 2014-08-01 09:02:59,690 Gossiper.java (line 823) InetAddress /77.234.44.20 is now DOWN INFO [GossipTasks:1] 2014-08-01 09:03:08,693 Gossiper.java (line 823) InetAddress mia10.ff.avast.com/77.234.42.20 is now DOWN After one day, the node started to discards hints it saved for the other nodes. WARN [OptionalTasks:1] 2014-08-02 09:09:10,277 HintedHandoffMetrics.java (line 78) /77.234.44.20 has 9039 dropped hints, because node is down past configured hint window. WARN [OptionalTasks:1] 2014-08-02 09:09:10,290 HintedHandoffMetrics.java (line 78) /77.234.42.20 has 8816 dropped hints, because node is down past configured hint window. The nodetool status marked the other two nodes as down. After restart, it reconnected to the cluster without any problem. What puzzles me is the fact that the authentization apparently started to work after the network recovered but the exchange of data did not. I would like to understand what could caused the problems and how to avoid them in the future. Any pointers would be appreciated. Regards Jiri Horky
Re: Fail to reconnect to other nodes after intermittent network failure
OK, ticket 7696 [1] created.

Jiri Horky

[1] https://issues.apache.org/jira/browse/CASSANDRA-7696

On 08/05/2014 07:57 PM, Robert Coli wrote:
> On Tue, Aug 5, 2014 at 5:48 AM, Jiri Horky <ho...@avast.com> wrote:
>> What puzzles me is the fact that the authentication apparently started to work after the network recovered but the exchange of data did not. I would like to understand what could have caused the problems and how to avoid them in the future.
> Very few people use SSL and very few people use auth, you have probably hit an edge case. I would file a JIRA with the details you described above.
> =Rob
Re: upgrade from cassandra 1.2.3 -> 1.2.13 + start using SSL
Thank you both for the answers!

Jiri Horky

On 01/10/2014 02:52 AM, Aaron Morton wrote:
> We avoid mixing versions for a long time, but we always upgrade one node and check the application is happy before proceeding, e.g. wait for 30 minutes before upgrading the others. If you snapshot before upgrading, and have to roll back after 30 minutes, you can roll back to the snapshot and use repair to fix the data on disk. Hope that helps.
> - Aaron Morton, New Zealand, @aaronmorton
> Co-Founder & Principal Consultant, Apache Cassandra Consulting, http://www.thelastpickle.com
> On 9/01/2014, at 7:24 am, Robert Coli <rc...@eventbrite.com> wrote:
>> On Wed, Jan 8, 2014 at 1:17 AM, Jiri Horky <ho...@avast.com> wrote:
>>> I am specifically interested in whether it is possible to upgrade just one node and keep it running like that for some time, i.e. if the gossip protocol is compatible in both directions. We are a bit afraid to upgrade all nodes to 1.2.13 at once in case we would need to roll back.
>> This is not officially supported. It will probably work for these particular versions, but it is not recommended. The most serious potential issue is an inability to replace the new node if it fails. There's also the problem of not being able to repair until you're back on the same versions. And other, similar, undocumented edge cases...
>> =Rob
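A minimal sketch of the snapshot step Aaron mentions, so a node can be rolled back to its pre-upgrade state (the keyspace name and tag are placeholders):

    # take a named snapshot on the node before upgrading it
    nodetool snapshot -t pre-1.2.13-upgrade my_keyspace
    # the snapshot ends up under <data_dir>/<keyspace>/<cf>/snapshots/<tag>/
    # and can be removed later with:  nodetool clearsnapshot -t pre-1.2.13-upgrade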
upgrade from cassandra 1.2.3 -> 1.2.13 + start using SSL
Hi all,

I would appreciate advice on whether it is a good idea to upgrade from Cassandra 1.2.3 to 1.2.13 and how to best proceed. The particular cluster consists of 3 nodes (each one in a different DC, holding 1 replica) with relatively low traffic and a 10GB load per node.

I am specifically interested in whether it is possible to upgrade just one node and keep it running like that for some time, i.e. whether the gossip protocol is compatible in both directions. We are a bit afraid to upgrade all nodes to 1.2.13 at once in case we would need to roll back. I know that the sstable format changed in 1.2.5, so in case we need to roll back, the newly written data would need to be synchronized from the old servers.

Also, after the migration to 1.2.13, we would like to start using node-to-node encryption. I imagine that you need to configure it on all nodes at once, so it would require a small outage.

Thank you in advance
Jiri Horky
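For the node-to-node encryption part, the relevant cassandra.yaml section looks roughly like the sketch below (keystore paths and passwords are placeholders); every node needs its keystore/truststore in place before internode_encryption is switched away from none:

    # cassandra.yaml -- internode (node-to-node) encryption, sketch only
    server_encryption_options:
        internode_encryption: all          # none | all | dc | rack
        keystore: /etc/cassandra/conf/.keystore
        keystore_password: changeme
        truststore: /etc/cassandra/conf/.truststore
        truststore_password: changeme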
Re: Cass 2.0.0: Extensive memory allocation when row_cache enabled
Hi, On 11/12/2013 05:29 AM, Aaron Morton wrote: Are you doing large slices or do could you have a lot of tombstones on the rows ? don't really know - how can I monitor that? For tombstones, do you do a lot of deletes ? Also in v2.0.2 cfstats has this Average live cells per slice (last five minutes): 0.0 Average tombstones per slice (last five minutes): 0.0 For large slices you need to check your code. e.g. do you anything that reads lots of columns or very large columns or lets the user select how many columns to read? The org.apache.cassandra.db.ArrayBackedSortedColumns in the trace back is used during reads (.e.g. org.apache.cassandra.db.filter.SliceQueryFilter) thanks for explanation, will try to provide some figures (but unfortunately not from the 2.0.2). You probably want the heap to be 4G to 8G in size, 10G will encounter longer pauses. Also the size of the new heap may be too big depending on the number of cores. I would recommend trying 800M I tried to decrease it first to 384M then to 128M with no change in the behaviour. I don't really care extra memory overhead of the cache - to be able to actual point to it with objects, but I don't really see the reason why it should create/delete those many objects so quickly. Not sure what you changed to 384M. Sorry for the confusion. I meant to say that I tried to decrease row cache size to 384M and then to 128M and the GC times did not change at all (still ~30% of the time). Shows the heap growing very quickly. This could be due to wide reads or a high write throughput. Well, both prg01 and prg02 receive the same load which is about ~150-250 (during peak) read requests per seconds and 100-160 write requests per second. The only with heap growing rapidly and GC kicking in is on nodes with row cache enabled. This sounds like on a row cache miss cassandra is reading the whole row, which happens to be a wide row. I would also guess some writes are going to the rows and they are getting invalidated out of the row cache. The row cache is not great for rows the update frequently and/or wide rows. How big are the rows ? use nodetool cfstats and nodetool cfhistograms. I will get in touch with the developers and take the data from cf* commands in a few days (I am out of office for some days). Thanks for the pointers, will get in touch. Cheers Jiri Horky
Re: Cass 2.0.0: Extensive memory allocation when row_cache enabled
Hi,

On 11/07/2013 05:18 AM, Aaron Morton wrote:

    Class Name | Shallow Heap | Retained Heap
    -------------------------------------------------------------------------------------------------------
    java.nio.HeapByteBuffer @ 0x7806a0848 | 48 | 80
    '- name org.apache.cassandra.db.Column @ 0x7806424e8 | 32 | 112
       |- [338530] java.lang.Object[540217] @ 0x57d62f560 Unreachable | 2,160,888 | 2,160,888
       |- [338530] java.lang.Object[810325] @ 0x591546540 | 3,241,320 | 7,820,328
       |  '- elementData java.util.ArrayList @ 0x75e8424c0 | 24 | 7,820,352
       |     |- list org.apache.cassandra.db.ArrayBackedSortedColumns$SlicesIterator @ 0x5940e0b18 | 48 | 128
       |     |  '- val$filteredIter org.apache.cassandra.db.filter.SliceQueryFilter$1 @ 0x5940e0b48 | 32 | 7,820,568
       |     |     '- val$iter org.apache.cassandra.db.filter.QueryFilter$2 @ 0x5940e0b68 Unreachable | 24 | 7,820,592
       |     |- this$0, parent java.util.ArrayList$SubList @ 0x5940e0bb8 | 40 | 40
       |     |  '- this$1 java.util.ArrayList$SubList$1 @ 0x5940e0be0 | 40 | 80
       |     |     '- currentSlice org.apache.cassandra.db.ArrayBackedSortedColumns$SlicesIterator @ 0x5940e0b18 | 48 | 128
       |     |        '- val$filteredIter org.apache.cassandra.db.filter.SliceQueryFilter$1 @ 0x5940e0b48 | 32 | 7,820,568
       |     |           '- val$iter org.apache.cassandra.db.filter.QueryFilter$2 @ 0x5940e0b68 Unreachable | 24 | 7,820,592
       |     |- columns org.apache.cassandra.db.ArrayBackedSortedColumns @ 0x5b0a33488 | 32 | 56
       |     |  '- val$cf org.apache.cassandra.db.filter.SliceQueryFilter$1 @ 0x5940e0b48 | 32 | 7,820,568
       |     |     '- val$iter org.apache.cassandra.db.filter.QueryFilter$2 @ 0x5940e0b68 Unreachable | 24 | 7,820,592
       |     '- Total: 3 entries | |
       |- [338530] java.lang.Object[360145] @ 0x7736ce2f0 Unreachable | 1,440,600 | 1,440,600
       '- Total: 3 entries | |

> Are you doing large slices or could you have a lot of tombstones on the rows?

don't really know - how can I monitor that? We have disabled the row cache on one node to see the difference. Please see the attached plots from VisualVM, I think that the effect is quite visible.

> The default row cache is off the JVM heap, have you changed to the ConcurrentLinkedHashCacheProvider?

Answered by Chris already :) No.

> One way the SerializingCacheProvider could impact GC is if the CF takes a lot of writes. The SerializingCacheProvider invalidates the row when it is written to and has to read the entire row and serialise it on a cache miss.
>
> -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms10G -Xmx10G -Xmn1024M -XX:+HeapDumpOnOutOfMemoryError
>
> You probably want the heap to be 4G to 8G in size, 10G will encounter longer pauses. Also the size of the new heap may be too big depending on the number of cores. I would recommend trying 800M.

I tried to decrease it first to 384M and then to 128M with no change in the behaviour. I don't really mind the extra memory overhead of the cache itself - being able to actually point to it with objects - but I don't really see the reason why it should create/delete that many objects so quickly.

(attachment: prg01.visual.vm.png)

> Shows the heap growing very quickly. This could be due to wide reads or a high write throughput.

Well, both prg01 and prg02 receive the same load, which is about ~150-250 (during peak) read requests per second and 100-160 write requests per second. The only one with the heap growing rapidly and GC kicking in is the node with the row cache enabled.

> Hope that helps.

Thank you!
Jiri Horky
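Following the nodetool cfstats / cfhistograms pointer that comes up in this thread, this is roughly how one would check how wide the cached rows actually are (keyspace/CF names are placeholders):

    # per-CF statistics, including compacted row mean/maximum size and sstable count
    nodetool cfstats
    # distribution of row sizes and column counts for one CF
    nodetool cfhistograms my_keyspace my_cf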
Cassandra 1.2.9 - 2.0.0 update can't
Hi,

I know it is Not Supported and also Not Advised, but currently we have half of our Cassandra cluster on 1.2.9 and the other half on 2.0.0, as we got stuck during an upgrade. We know already that it was not a very wise decision, but that's what we've got now. I have basically two questions.

First, it would really help us if there was a way to make requests to the newer Cassandra nodes work with data on the older ones. We get "Caused by: com.datastax.driver.core.exceptions.UnavailableException: Not enough replica available for query at consistency ONE (1 required but only 0 alive)", which is surely caused by the fact that the newer nodes see the older ones as DOWN in nodetool ring. We work around this problem by connecting to the older nodes, which see the whole cluster as online, but it would help us in deploying a newer version of our application if we could connect to 2.0.0 as well.

Second, to avoid any future problems when upgrading, is there any documentation on what is incompatible between the 1.2.x line and 2.0.x, what is not supposed to work together, and what the recommended steps are? I don't see any warning regarding incompatibility of the gossip (?) protocol between 2.0.0 and 1.2.9 causing the described problem in the official "Changes impacting upgrade" documentation.

Thank you
Jiri Horky
Re: Recompacting all sstables
Hi, On 11/01/2013 09:15 PM, Robert Coli wrote: On Fri, Nov 1, 2013 at 12:47 PM, Jiri Horky ho...@avast.com mailto:ho...@avast.com wrote: since we upgraded half of our Cassandra cluster to 2.0.0 and we use LCS, we hit CASSANDRA-6284 bug. 1) Why upgrade a cluster to 2.0.0? Hopefully not a production cluster? [1] I think you already guessed the answer :) It is a production cluster, we needed some features (particularly, compare and set) only present in 2.0 because of the applications. Besides, somebody had to discover the regression, right? :) Thanks for the link. 3) What do you mean by upgraded half of our Cassandra cluster? That is Not Supported and also Not Advised... for example, before the streaming change in 2.x line, a cluster in such a state may be unable to have nodes added, removed or replaced. We are in the middle of the migration from 1.2.9 to 2.0 when we are also upgrading our application which can only be run against 2.0 due to various technical details. It is rather hard to explain, but we hoped it will last just for few days and it is definitely not the status we wanted to keep. Since we hit the bug, we got stalled in the middle of the migration. So the question. What is the best way to recompact all the sstables so the data in one sstables within a level would contain more or less the right portion of the data ... Based on documentation, I can only think of switching to SizeTiered compaction, doing major compaction and then switching back to LCS. That will work, though be aware of the implication of CASSANDRA-6092 [2]. Briefly, if the CF in question is not receiving write load, you will be unable to promote your One Big SSTable from L0 to L1. In that case, you might want to consider running sstable_split (and then restarting the node) in order to split your One Big SSTable into two or more smaller ones. Hmm, thinking about it a bit more, I am unsure this will actually help. If I understand things correctly, assuming uniform distribution of new received keys in L0 (ensured by RandomPartitioner), in order for LCS to work optimally, I need: a) get uniform distribution of keys across sstables in one level, i.e. in every level each sstable will cover more or less the same range of keys b) sstables in each level should cover almost whole space of keys the node is responsible for c) propagate sstables to higher levels in uniform fashion, e.g. round-robin or random (over time, the probability of choosing an sstables as candidate should be the same for all sstables in the level) By splitting the sorted Big SStable, I will get a bunch of non-overlapping sstables. So I will surely achieve a). Point c) is fixed by the patch. But what about b)? It probably depends on order of compaction across levels, i.e. whether the compactions in various levels are being run in parallel and interleaved or not. In case it compacts all the tables from one level and only after that starts to compact sstables in higher level etc, one will end up in very similar situation as caused by the referenced bug (because of round robin fashion of choosing candidates), i.e. having the biggest keys in L1 and smallest keys in the highest level. So in this case, it would actually not help at all. Does it make sense or am I completely wrong? :) BTW: Not very though-out idea, but wouldn't it actually be better to select candidates completely randomly? Cheers Jiri Horky =Rob [1] https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/ [2] https://issues.apache.org/jira/browse/CASSANDRA-6092
Recompacting all sstables
Hi all,

since we upgraded half of our Cassandra cluster to 2.0.0 and we use LCS, we hit the CASSANDRA-6284 bug. So basically all data in sstables created after the upgrade are wrongly (non-uniformly within compaction levels) distributed. This causes a huge overhead when compacting new sstables (see the bug for the details). After applying the patch, the distribution of the data within a level is supposed to recover itself over time, but we would like not to wait a month or so until it gets better.

So the question: what is the best way to recompact all the sstables so that the data in one sstable within a level would contain more or less the right portion of the data, in other words, so that keys would be uniformly distributed across sstables within a level? (E.g. assuming a total token range for a node of 1..10000, and given that L2 should contain 100 sstables, all sstables within L2 should cover a range of ~100 tokens.)

Based on the documentation, I can only think of switching to SizeTiered compaction, doing a major compaction, and then switching back to LCS.

Thanks in advance
Jiri Horky
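For reference, the switch-compact-switch sequence mentioned above would look roughly like this (the keyspace/table names are placeholders, and the reply earlier in this thread discusses its caveats, e.g. CASSANDRA-6092):

    -- switch the CF to size-tiered compaction
    ALTER TABLE my_ks.my_cf WITH compaction = { 'class': 'SizeTieredCompactionStrategy' };
    -- then run a major compaction on every node:  nodetool compact my_ks my_cf
    -- and finally switch back to leveled compaction
    ALTER TABLE my_ks.my_cf WITH compaction = { 'class': 'LeveledCompactionStrategy' };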