Retrieving all row keys of a CF
We have a column family that has about 800K rows and on average about a million columns per row. I am interested in getting all the row keys in this column family, and I am using the following Astyanax code snippet to do this. This query never finishes (I ran it for 2 days but it did not complete). The same query does work against CFs that have far fewer columns. This leads me to believe that there might be an API that retrieves just the row keys and does not depend on the number of columns in the CF. Any suggestions are appreciated. I am running Cassandra 2.0.9 on a 4-node cluster.

keyspace.prepareQuery(this.wideRowTables.get(group))
    .setConsistencyLevel(ConsistencyLevel.CL_QUORUM)
    .getAllRows()
    .setRowLimit(1000)
    .setRepeatLastToken(false)
    .withColumnRange(new RangeBuilder().setLimit(1).build())
    .executeWithCallback(new RowCallback<String, T>() {
        @Override
        public boolean failure(ConnectionException e) {
            return true;
        }

        @Override
        public void success(Rows<String, T> rows) {
            // iterating over rows here
        }
    });
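One column-count-independent alternative, if the CF is reachable through CQL, is to page over just the partition keys with the native protocol instead of walking rows with a column range. A minimal sketch with the DataStax Java Driver; the contact point, keyspace, table, and key column name are assumptions here, and SELECT DISTINCT on the partition key needs Cassandra 2.0+, which matches 2.0.9:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class RowKeyDump {
    public static void main(String[] args) {
        // Hypothetical contact point, keyspace, table, and key column names.
        Cluster cluster = Cluster.builder().addContactPoint("10.10.20.15").build();
        Session session = cluster.connect("my_keyspace");

        // DISTINCT on the partition key returns one row per partition,
        // regardless of how many columns each partition holds.
        Statement stmt = new SimpleStatement("SELECT DISTINCT key FROM my_table");
        stmt.setFetchSize(1000); // page through results instead of loading all keys at once

        for (Row row : session.execute(stmt)) {
            System.out.println(row.getString("key")); // assumes a text-typed row key
        }
        cluster.close();
    }
}

The driver pages transparently as the result set is iterated, so memory use stays bounded by the fetch size rather than by the total key count.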
ConnectionException while trying to connect with Astyanax over Java driver
All, I am trying to use the new Astyanax over Java driver to connect to Cassandra version 1.2.12. The following settings are turned on in cassandra.yaml:

start_rpc: true
native_transport_port: 9042
start_native_transport: true

Code to connect:

final Supplier<List<Host>> hostSupplier = new Supplier<List<Host>>() {
    @Override
    public List<Host> get() {
        List<Host> hosts = new ArrayList<Host>();
        for (String hostPort : StringUtil.getSetFromDelimitedString(seedHosts, ",")) {
            String[] pair = hostPort.split(":");
            Host host = new Host(pair[0], Integer.valueOf(pair[1]).intValue());
            host.setRack("rack1");
            hosts.add(host);
        }
        return hosts;
    }
};

// get keyspace
AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
    .forCluster(clusterName)
    .forKeyspace(keyspace)
    .withHostSupplier(hostSupplier)
    .withAstyanaxConfiguration(
        new AstyanaxConfigurationImpl()
            .setDiscoveryType(NodeDiscoveryType.DISCOVERY_SERVICE)
            .setDiscoveryDelayInSeconds(6)
            .setCqlVersion("3.0.0")
            .setTargetCassandraVersion("1.2.12"))
    .withConnectionPoolConfiguration(
        new JavaDriverConfigBuilder().withPort(9042).build())
    .buildKeyspace(CqlFamilyFactory.getInstance());
context.start();

Exception in Cassandra server logs:

WARN [New I/O server boss #1 ([id: 0x6815d6c5, /0.0.0.0:9042])] 2014-10-06 11:11:37,826 Slf4JLogger.java (line 82) Failed to accept a connection.
java.lang.NoSuchMethodError: org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder.<init>(IZ)V
    at org.apache.cassandra.transport.Frame$Decoder.<init>(Frame.java:147)
    at org.apache.cassandra.transport.Server$PipelineFactory.getPipeline(Server.java:232)
    at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.registerAcceptedChannel(NioServerSocketPipelineSink.java:276)
    at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:246)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)

I also tried the Java Driver 2.1.1 directly, but I see a NoHostAvailableException, and I suspect the underlying reason is the same as when connecting with the Astyanax Java driver.
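For what it's worth, a bare-bones connectivity check with the plain DataStax Java Driver can help confirm whether the failure is on the server side rather than in the Astyanax configuration; if the Netty NoSuchMethodError above fires while the server builds its connection pipeline, even this minimal client will fail at connect time. A sketch, with the contact point as a placeholder:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Metadata;

public class NativeTransportSmokeTest {
    public static void main(String[] args) {
        // Hypothetical contact point; 9042 matches native_transport_port in cassandra.yaml.
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.10.20.15")
                .withPort(9042)
                .build();
        try {
            Metadata metadata = cluster.getMetadata(); // forces the control connection to be opened
            for (Host host : metadata.getAllHosts()) {
                System.out.printf("%s  dc=%s  rack=%s%n",
                        host.getAddress(), host.getDatacenter(), host.getRack());
            }
        } finally {
            cluster.close();
        }
    }
}

If this fails with the same "Failed to accept a connection" warning on the server, the problem is independent of the Astyanax layer.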
Re: ConnectionException while trying to connect with Astyanax over Java driver
That exception is on the Cassandra server and not on the client.

On Mon, Oct 6, 2014 at 2:10 PM, DuyHai Doan doanduy...@gmail.com wrote: java.lang.NoSuchMethodError - jar dependency issue, probably. Did you try to create an issue on the Astyanax github repo?

On Mon, Oct 6, 2014 at 6:01 PM, Ruchir Jha ruchir@gmail.com wrote: All, I am trying to use the new Astyanax over Java driver to connect to Cassandra version 1.2.12. The following settings are turned on in cassandra.yaml:

start_rpc: true
native_transport_port: 9042
start_native_transport: true

Code to connect:

final Supplier<List<Host>> hostSupplier = new Supplier<List<Host>>() {
    @Override
    public List<Host> get() {
        List<Host> hosts = new ArrayList<Host>();
        for (String hostPort : StringUtil.getSetFromDelimitedString(seedHosts, ",")) {
            String[] pair = hostPort.split(":");
            Host host = new Host(pair[0], Integer.valueOf(pair[1]).intValue());
            host.setRack("rack1");
            hosts.add(host);
        }
        return hosts;
    }
};

// get keyspace
AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
    .forCluster(clusterName)
    .forKeyspace(keyspace)
    .withHostSupplier(hostSupplier)
    .withAstyanaxConfiguration(
        new AstyanaxConfigurationImpl()
            .setDiscoveryType(NodeDiscoveryType.DISCOVERY_SERVICE)
            .setDiscoveryDelayInSeconds(6)
            .setCqlVersion("3.0.0")
            .setTargetCassandraVersion("1.2.12"))
    .withConnectionPoolConfiguration(
        new JavaDriverConfigBuilder().withPort(9042).build())
    .buildKeyspace(CqlFamilyFactory.getInstance());
context.start();

Exception in Cassandra server logs:

WARN [New I/O server boss #1 ([id: 0x6815d6c5, /0.0.0.0:9042])] 2014-10-06 11:11:37,826 Slf4JLogger.java (line 82) Failed to accept a connection.
java.lang.NoSuchMethodError: org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder.<init>(IZ)V
    at org.apache.cassandra.transport.Frame$Decoder.<init>(Frame.java:147)
    at org.apache.cassandra.transport.Server$PipelineFactory.getPipeline(Server.java:232)
    at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.registerAcceptedChannel(NioServerSocketPipelineSink.java:276)
    at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:246)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)

I also tried the Java Driver 2.1.1 directly, but I see a NoHostAvailableException, and I suspect the underlying reason is the same as when connecting with the Astyanax Java driver.
OpsCenter_rollups*
Hi, I see a lot of activity around the OpsCenter_rollups CFs in the logs. Why is there so much OpsCenter work happening? Is there a way to disable it, and what's the impact? Ruchir.
Re: Compression during bootstrap
On Wednesday, August 13, 2014, Robert Coli wrote: On Wed, Aug 13, 2014 at 5:53 AM, Ruchir Jha ruchir@gmail.com wrote: We are adding nodes currently and it seems like compression is falling behind. I judge that by the fact that the new node, which has a 4.5T disk, fills up to 100% while it is bootstrapping. Can we avoid this problem with the LZ4 compressor because of better compression, or do we just need a bigger disk? 2TB per node is a lot of data. 4.5 would be a huge amount of data. Sure. Do you mean we should have started adding nodes before we got here? Do you mean compaction is falling behind? Do you setcompactionthroughput 0 while bootstrapping new nodes? I did a nodetool getcompactionthroughput and I got 0 MB/s. It seems like that just disables compaction throttling, which seems like a good thing in my scenario. Is that correct? I don't think compression is involved here? Why do you think it does? This is why I thought compression is involved: http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_about_config_compress_c.html Side note: we also increased the number of concurrent compactors from 3 to 10, because we had a lot of idle CPU lying around, but that's not helping; every time we start bootstrapping we still hit 4.5 TB and then we run out of disk space. =Rob
Compression during bootstrap
Hello, We currently are on C* 1.2 and are using the SnappyCompressor for all our CFs. Total data size is 24 TB, and it's a 12-node cluster, so the average node size is 2 TB. We are adding nodes currently and it seems like compression is falling behind. I judge that by the fact that the new node, which has a 4.5T disk, fills up to 100% while it is bootstrapping. Can we avoid this problem with the LZ4 compressor because of better compression, or do we just need a bigger disk? The reason we started with 4.5 TB was that we assumed a new node would not need more than 2 times the average data size while bootstrapping. Is that a weak assumption? Ruchir.
Re: Node bootstrap
Still having issues with node bootstrapping. The new node just died, because it Full Gced, the nodes it had actual streams with noticed its down. After the full gc finished the new node printed this log : ERROR 02:52:36,259 Stream failed because /10.10.20.35 died or was restarted/removed (streams may still be active in background, but further streams won't be started) Here 10.10.20.35 is an existing node, the new guy was streaming from. A similar log was printed for every other node on the cluster. Why did the new node just exit after the FGC pause? We have heap dumps enabled on Full GC's and this are the top offenders on the new node. A new entry that I noticed is the CompressionMetaData chunks. Anything I can do to optimize that? num #instances #bytes class name -- 1: 42508421 4818885752 [B 2: 65860543 3161306064 java.nio.HeapByteBuffer 3: 124361093 2984666232 org.apache.cassandra.io.compress.CompressionMetadata$Chunk 4: 29745665 1427791920 edu.stanford.ppl.concurrent.SnapTreeMap$Node 5: 29810362 953931584 org.apache.cassandra.db.Column 6: 31623 498012768 [Lorg.apache.cassandra.io.compress.CompressionMetadata$Chunk; On Tue, Aug 5, 2014 at 2:59 PM, Ruchir Jha ruchir@gmail.com wrote: Also, right now the top command shows that we are at 500-700% CPU, and we have 23 total processors, which means we have a lot of idle CPU left over, so throwing more threads at compaction and flush should alleviate the problem? On Tue, Aug 5, 2014 at 2:57 PM, Ruchir Jha ruchir@gmail.com wrote: Right now, we have 6 flush writers and compaction_throughput_mb_per_sec is set to 0, which I believe disables throttling. Also, Here is the iostat -x 5 5 output: Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 10.00 1450.35 50.79 55.92 9775.97 12030.14 204.34 1.56 14.62 1.05 11.21 dm-0 0.00 0.003.59 18.82 166.52 150.35 14.14 0.44 19.49 0.54 1.22 dm-1 0.00 0.002.325.3718.5642.98 8.00 0.76 98.82 0.43 0.33 dm-2 0.00 0.00 162.17 5836.66 32714.46 47040.87 13.30 5.570.90 0.06 36.00 sdb 0.40 4251.90 106.72 107.35 23123.61 35204.09 272.46 4.43 20.68 1.29 27.64 avg-cpu: %user %nice %system %iowait %steal %idle 14.64 10.751.81 13.500.00 59.29 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 15.40 1344.60 68.80 145.60 4964.80 11790.40 78.15 0.381.80 0.80 17.10 dm-0 0.00 0.00 43.00 1186.20 2292.80 9489.60 9.59 4.883.90 0.09 11.58 dm-1 0.00 0.001.600.0012.80 0.00 8.00 0.03 16.00 2.00 0.32 dm-2 0.00 0.00 197.20 17583.80 35152.00 140664.00 9.89 2847.50 109.52 0.05 93.50 sdb 13.20 16552.20 159.00 742.20 32745.60 129129.60 179.6272.88 66.01 1.04 93.42 avg-cpu: %user %nice %system %iowait %steal %idle 15.51 19.771.975.020.00 57.73 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 16.20 523.40 60.00 285.00 5220.80 5913.60 32.27 0.250.72 0.60 20.86 dm-0 0.00 0.000.801.4032.0011.20 19.64 0.013.18 1.55 0.34 dm-1 0.00 0.001.600.0012.80 0.00 8.00 0.03 21.00 2.62 0.42 dm-2 0.00 0.00 339.40 5886.80 66219.20 47092.80 18.20 251.66 184.72 0.10 63.48 sdb 1.00 5025.40 264.20 209.20 60992.00 50422.40 235.35 5.98 40.92 1.23 58.28 avg-cpu: %user %nice %system %iowait %steal %idle 16.59 16.342.039.010.00 56.04 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 5.40 320.00 37.40 159.80 2483.20 3529.60 30.49 0.100.52 0.39 7.76 dm-0 0.00 0.000.203.60 1.6028.80 8.00 0.000.68 0.68 0.26 dm-1 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-2 0.00 0.00 287.20 13108.20 53985.60 104864.00 11.86 869.18 48.82 0.06 76.96 sdb 
5.20 12163.40 238.20 532.00 51235.20 93753.60 188.2521.46 23.75 0.97 75.08 On Tue, Aug 5, 2014 at 1:55 PM, Mark Reddy mark.re...@boxever.com wrote: Hi Ruchir, With the large number of blocked flushes and the number of pending compactions would still indicate IO contention. Can you post the output of 'iostat -x 5 5
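As a rough way to read the CompressionMetadata numbers in the heap histogram above, the byte and instance counts can simply be divided out. A small sketch of that arithmetic; the instance and byte figures come straight from the histogram, while the 64 KB figure is only the default chunk_length_kb and is an assumption about this cluster:

public class CompressionChunkArithmetic {
    public static void main(String[] args) {
        // Instance and byte counts for CompressionMetadata$Chunk from the histogram above.
        long chunkInstances = 124_361_093L;
        long chunkBytes     = 2_984_666_232L;

        // Heap cost per retained Chunk object.
        System.out.println("bytes per Chunk: " + (chunkBytes / chunkInstances)); // 24

        // Assuming the default chunk_length_kb of 64, each chunk entry describes 64 KB of
        // uncompressed data, so if every retained entry were distinct they would cover roughly:
        long chunkLength = 64L * 1024;
        double tb = (double) chunkInstances * chunkLength / (1L << 40);
        System.out.printf("data described at 64 KB/chunk: ~%.1f TB%n", tb);
        // Larger chunk_length_kb values mean proportionally fewer chunk entries per sstable,
        // at the cost of coarser-grained reads.
    }
}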
Re: Node bootstrap
Thanks Patricia for your response! On the new node, I just see a lot of the following: INFO [FlushWriter:75] 2014-08-05 09:53:04,394 Memtable.java (line 400) Writing Memtable INFO [CompactionExecutor:3] 2014-08-05 09:53:11,132 CompactionTask.java (line 262) Compacted 12 sstables to so basically it is just busy flushing, and compacting. Would you have any ideas on why the 2x disk space blow up. My understanding was that if initial_token is left empty on the new node, it just contacts the heaviest node and bisects its token range. And the heaviest node is around 2.1 TB, and the new node is already at 4 TB. Could this be because compaction is falling behind? Ruchir On Mon, Aug 4, 2014 at 7:23 PM, Patricia Gorla patri...@thelastpickle.com wrote: Ruchir, What exactly are you seeing in the logs? Are you running major compactions on the new bootstrapping node? With respect to the seed list, it is generally advisable to use 3 seed nodes per AZ / DC. Cheers, On Mon, Aug 4, 2014 at 11:41 AM, Ruchir Jha ruchir@gmail.com wrote: I am trying to bootstrap the thirteenth node in a 12 node cluster where the average data size per node is about 2.1 TB. The bootstrap streaming has been going on for 2 days now, and the disk size on the new node is already above 4 TB and still going. Is this because the new node is running major compactions while the streaming is going on? One thing that I noticed that seemed off was the seeds property in the yaml of the 13th node comprises of 1..12. Where as the seeds property on the existing 12 nodes consists of all the other nodes except the thirteenth node. Is this an issue? Any other insight is appreciated? Ruchir. -- Patricia Gorla @patriciagorla Consultant Apache Cassandra Consulting http://www.thelastpickle.com http://thelastpickle.com
Re: Node bootstrap
Yes num_tokens is set to 256. initial_token is blank on all nodes including the new one. On Tue, Aug 5, 2014 at 10:03 AM, Mark Reddy mark.re...@boxever.com wrote: My understanding was that if initial_token is left empty on the new node, it just contacts the heaviest node and bisects its token range. If you are using vnodes and you have num_tokens set to 256 the new node will take token ranges dynamically. What is the configuration of your other nodes, are you setting num_tokens or initial_token on those? Mark On Tue, Aug 5, 2014 at 2:57 PM, Ruchir Jha ruchir@gmail.com wrote: Thanks Patricia for your response! On the new node, I just see a lot of the following: INFO [FlushWriter:75] 2014-08-05 09:53:04,394 Memtable.java (line 400) Writing Memtable INFO [CompactionExecutor:3] 2014-08-05 09:53:11,132 CompactionTask.java (line 262) Compacted 12 sstables to so basically it is just busy flushing, and compacting. Would you have any ideas on why the 2x disk space blow up. My understanding was that if initial_token is left empty on the new node, it just contacts the heaviest node and bisects its token range. And the heaviest node is around 2.1 TB, and the new node is already at 4 TB. Could this be because compaction is falling behind? Ruchir On Mon, Aug 4, 2014 at 7:23 PM, Patricia Gorla patri...@thelastpickle.com wrote: Ruchir, What exactly are you seeing in the logs? Are you running major compactions on the new bootstrapping node? With respect to the seed list, it is generally advisable to use 3 seed nodes per AZ / DC. Cheers, On Mon, Aug 4, 2014 at 11:41 AM, Ruchir Jha ruchir@gmail.com wrote: I am trying to bootstrap the thirteenth node in a 12 node cluster where the average data size per node is about 2.1 TB. The bootstrap streaming has been going on for 2 days now, and the disk size on the new node is already above 4 TB and still going. Is this because the new node is running major compactions while the streaming is going on? One thing that I noticed that seemed off was the seeds property in the yaml of the 13th node comprises of 1..12. Where as the seeds property on the existing 12 nodes consists of all the other nodes except the thirteenth node. Is this an issue? Any other insight is appreciated? Ruchir. -- Patricia Gorla @patriciagorla Consultant Apache Cassandra Consulting http://www.thelastpickle.com http://thelastpickle.com
Re: Node bootstrap
Also not sure if this is relevant but just noticed the nodetool tpstats output: Pool NameActive Pending Completed Blocked All time blocked FlushWriter 0 0 1136 0 512 Looks like about 50% of flushes are blocked. On Tue, Aug 5, 2014 at 10:14 AM, Ruchir Jha ruchir@gmail.com wrote: Yes num_tokens is set to 256. initial_token is blank on all nodes including the new one. On Tue, Aug 5, 2014 at 10:03 AM, Mark Reddy mark.re...@boxever.com wrote: My understanding was that if initial_token is left empty on the new node, it just contacts the heaviest node and bisects its token range. If you are using vnodes and you have num_tokens set to 256 the new node will take token ranges dynamically. What is the configuration of your other nodes, are you setting num_tokens or initial_token on those? Mark On Tue, Aug 5, 2014 at 2:57 PM, Ruchir Jha ruchir@gmail.com wrote: Thanks Patricia for your response! On the new node, I just see a lot of the following: INFO [FlushWriter:75] 2014-08-05 09:53:04,394 Memtable.java (line 400) Writing Memtable INFO [CompactionExecutor:3] 2014-08-05 09:53:11,132 CompactionTask.java (line 262) Compacted 12 sstables to so basically it is just busy flushing, and compacting. Would you have any ideas on why the 2x disk space blow up. My understanding was that if initial_token is left empty on the new node, it just contacts the heaviest node and bisects its token range. And the heaviest node is around 2.1 TB, and the new node is already at 4 TB. Could this be because compaction is falling behind? Ruchir On Mon, Aug 4, 2014 at 7:23 PM, Patricia Gorla patri...@thelastpickle.com wrote: Ruchir, What exactly are you seeing in the logs? Are you running major compactions on the new bootstrapping node? With respect to the seed list, it is generally advisable to use 3 seed nodes per AZ / DC. Cheers, On Mon, Aug 4, 2014 at 11:41 AM, Ruchir Jha ruchir@gmail.com wrote: I am trying to bootstrap the thirteenth node in a 12 node cluster where the average data size per node is about 2.1 TB. The bootstrap streaming has been going on for 2 days now, and the disk size on the new node is already above 4 TB and still going. Is this because the new node is running major compactions while the streaming is going on? One thing that I noticed that seemed off was the seeds property in the yaml of the 13th node comprises of 1..12. Where as the seeds property on the existing 12 nodes consists of all the other nodes except the thirteenth node. Is this an issue? Any other insight is appreciated? Ruchir. -- Patricia Gorla @patriciagorla Consultant Apache Cassandra Consulting http://www.thelastpickle.com http://thelastpickle.com
Re: Node bootstrap
nodetool status: Datacenter: datacenter1 === Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN 10.10.20.27 1.89 TB256 25.4% 76023cdd-c42d-4068-8b53-ae94584b8b04 rack1 UN 10.10.20.62 1.83 TB256 25.5% 84b47313-da75-4519-94f3-3951d554a3e5 rack1 UN 10.10.20.47 1.87 TB256 24.7% bcd51a92-3150-41ae-9c51-104ea154f6fa rack1 UN 10.10.20.45 1.7 TB 256 22.6% 8d6bce33-8179-4660-8443-2cf822074ca4 rack1 UN 10.10.20.15 1.86 TB256 24.5% 01a01f07-4df2-4c87-98e9-8dd38b3e4aee rack1 UN 10.10.20.31 1.87 TB256 24.9% 1435acf9-c64d-4bcd-b6a4-abcec209815e rack1 UN 10.10.20.35 1.86 TB256 25.8% 17cb8772-2444-46ff-8525-33746514727d rack1 UN 10.10.20.51 1.89 TB256 25.0% 0343cd58-3686-465f-8280-56fb72d161e2 rack1 UN 10.10.20.19 1.91 TB256 25.5% 30ddf003-4d59-4a3e-85fa-e94e4adba1cb rack1 UN 10.10.20.39 1.93 TB256 26.0% b7d44c26-4d75-4d36-a779-b7e7bdaecbc9 rack1 UN 10.10.20.52 1.81 TB256 25.4% 6b5aca07-1b14-4bc2-a7ba-96f026fa0e4e rack1 UN 10.10.20.22 1.89 TB256 24.8% 46af9664-8975-4c91-847f-3f7b8f8d5ce2 rack1 Note: The new node is not part of the above list. nodetool compactionstats: pending tasks: 1649 compaction typekeyspace column family completed total unit progress Compaction iprod customerorder 1682804084 17956558077 bytes 9.37% Compactionprodgatecustomerorder 1664239271 1693502275 bytes98.27% Compaction qa_config_bkupfixsessionconfig_hist 2443 27253 bytes 8.96% Compactionprodgatecustomerorder_hist 1770577280 5026699390 bytes35.22% Compaction iprodgatecustomerorder_hist 2959560205312350192622 bytes 0.95% On Tue, Aug 5, 2014 at 11:37 AM, Mark Reddy mark.re...@boxever.com wrote: Yes num_tokens is set to 256. initial_token is blank on all nodes including the new one. Ok so you have num_tokens set to 256 for all nodes with initial_token commented out, this means you are using vnodes and the new node will automatically grab a list of tokens to take over responsibility for. Pool NameActive Pending Completed Blocked All time blocked FlushWriter 0 0 1136 0 512 Looks like about 50% of flushes are blocked. This is a problem as it indicates that the IO system cannot keep up. Just ran this on the new node: nodetool netstats | grep Streaming from | wc -l 10 This is normal as the new node will most likely take tokens from all nodes in the cluster. Sorry for the multiple updates, but another thing I found was all the other existing nodes have themselves in the seeds list, but the new node does not have itself in the seeds list. Can that cause this issue? Seeds are only used when a new node is bootstrapping into the cluster and needs a set of ips to contact and discover the cluster, so this would have no impact on data sizes or streaming. In general it would be considered best practice to have a set of 2-3 seeds from each data center, with all nodes having the same seed list. What is the current output of 'nodetool compactionstats'? Could you also paste the output of nodetool status keyspace? Mark On Tue, Aug 5, 2014 at 3:59 PM, Ruchir Jha ruchir@gmail.com wrote: Sorry for the multiple updates, but another thing I found was all the other existing nodes have themselves in the seeds list, but the new node does not have itself in the seeds list. Can that cause this issue? On Tue, Aug 5, 2014 at 10:30 AM, Ruchir Jha ruchir@gmail.com wrote: Just ran this on the new node: nodetool netstats | grep Streaming from | wc -l 10 Seems like the new node is receiving data from 10 other nodes. Is that expected in a vnodes enabled environment? Ruchir. 
On Tue, Aug 5, 2014 at 10:21 AM, Ruchir Jha ruchir@gmail.com wrote: Also not sure if this is relevant but just noticed the nodetool tpstats output: Pool NameActive Pending Completed Blocked All time blocked FlushWriter 0 0 1136 0 512 Looks like about 50% of flushes are blocked. On Tue, Aug 5, 2014 at 10:14 AM, Ruchir Jha ruchir@gmail.com wrote: Yes num_tokens is set to 256. initial_token is blank on all nodes including the new one. On Tue, Aug 5, 2014 at 10:03 AM, Mark Reddy mark.re...@boxever.com wrote: My understanding was that if initial_token is left empty on the new node, it just contacts the heaviest node and bisects its token range. If you are using vnodes and you have num_tokens set to 256 the new node will take token ranges dynamically
Re: Node bootstrap
Also Mark to your comment on my tpstats output, below is my iostat output, and the iowait is at 4.59%, which means no IO pressure, but we are still seeing the bad flush performance. Should we try increasing the flush writers? Linux 2.6.32-358.el6.x86_64 (ny4lpcas13.fusionts.corp) 08/05/2014 _x86_64_(24 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 5.80 10.250.654.590.00 78.72 Device:tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 103.83 9630.62 11982.60 3231174328 4020290310 dm-0 13.57 160.1781.12 53739546 27217432 dm-1 7.5916.9443.775682200 14686784 dm-2 5792.76 32242.66 45427.12 10817753530 15241278360 sdb 206.09 22789.19 33569.27 7646015080 11262843224 On Tue, Aug 5, 2014 at 12:13 PM, Ruchir Jha ruchir@gmail.com wrote: nodetool status: Datacenter: datacenter1 === Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN 10.10.20.27 1.89 TB256 25.4% 76023cdd-c42d-4068-8b53-ae94584b8b04 rack1 UN 10.10.20.62 1.83 TB256 25.5% 84b47313-da75-4519-94f3-3951d554a3e5 rack1 UN 10.10.20.47 1.87 TB256 24.7% bcd51a92-3150-41ae-9c51-104ea154f6fa rack1 UN 10.10.20.45 1.7 TB 256 22.6% 8d6bce33-8179-4660-8443-2cf822074ca4 rack1 UN 10.10.20.15 1.86 TB256 24.5% 01a01f07-4df2-4c87-98e9-8dd38b3e4aee rack1 UN 10.10.20.31 1.87 TB256 24.9% 1435acf9-c64d-4bcd-b6a4-abcec209815e rack1 UN 10.10.20.35 1.86 TB256 25.8% 17cb8772-2444-46ff-8525-33746514727d rack1 UN 10.10.20.51 1.89 TB256 25.0% 0343cd58-3686-465f-8280-56fb72d161e2 rack1 UN 10.10.20.19 1.91 TB256 25.5% 30ddf003-4d59-4a3e-85fa-e94e4adba1cb rack1 UN 10.10.20.39 1.93 TB256 26.0% b7d44c26-4d75-4d36-a779-b7e7bdaecbc9 rack1 UN 10.10.20.52 1.81 TB256 25.4% 6b5aca07-1b14-4bc2-a7ba-96f026fa0e4e rack1 UN 10.10.20.22 1.89 TB256 24.8% 46af9664-8975-4c91-847f-3f7b8f8d5ce2 rack1 Note: The new node is not part of the above list. nodetool compactionstats: pending tasks: 1649 compaction typekeyspace column family completed total unit progress Compaction iprod customerorder 1682804084 17956558077 bytes 9.37% Compactionprodgatecustomerorder 1664239271 1693502275 bytes98.27% Compaction qa_config_bkupfixsessionconfig_hist 2443 27253 bytes 8.96% Compactionprodgatecustomerorder_hist 1770577280 5026699390 bytes35.22% Compaction iprodgatecustomerorder_hist 2959560205312350192622 bytes 0.95% On Tue, Aug 5, 2014 at 11:37 AM, Mark Reddy mark.re...@boxever.com wrote: Yes num_tokens is set to 256. initial_token is blank on all nodes including the new one. Ok so you have num_tokens set to 256 for all nodes with initial_token commented out, this means you are using vnodes and the new node will automatically grab a list of tokens to take over responsibility for. Pool NameActive Pending Completed Blocked All time blocked FlushWriter 0 0 1136 0 512 Looks like about 50% of flushes are blocked. This is a problem as it indicates that the IO system cannot keep up. Just ran this on the new node: nodetool netstats | grep Streaming from | wc -l 10 This is normal as the new node will most likely take tokens from all nodes in the cluster. Sorry for the multiple updates, but another thing I found was all the other existing nodes have themselves in the seeds list, but the new node does not have itself in the seeds list. Can that cause this issue? Seeds are only used when a new node is bootstrapping into the cluster and needs a set of ips to contact and discover the cluster, so this would have no impact on data sizes or streaming. 
In general it would be considered best practice to have a set of 2-3 seeds from each data center, with all nodes having the same seed list. What is the current output of 'nodetool compactionstats'? Could you also paste the output of nodetool status keyspace? Mark On Tue, Aug 5, 2014 at 3:59 PM, Ruchir Jha ruchir@gmail.com wrote: Sorry for the multiple updates, but another thing I found was all the other existing nodes have themselves in the seeds list, but the new node does not have itself in the seeds list. Can that cause this issue? On Tue, Aug 5, 2014 at 10:30 AM, Ruchir Jha ruchir@gmail.com wrote: Just ran this on the new node: nodetool netstats | grep Streaming from | wc -l 10 Seems like
Re: Node bootstrap
Right now, we have 6 flush writers and compaction_throughput_mb_per_sec is set to 0, which I believe disables throttling. Also, Here is the iostat -x 5 5 output: Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 10.00 1450.35 50.79 55.92 9775.97 12030.14 204.34 1.56 14.62 1.05 11.21 dm-0 0.00 0.003.59 18.82 166.52 150.3514.14 0.44 19.49 0.54 1.22 dm-1 0.00 0.002.325.3718.5642.98 8.00 0.76 98.82 0.43 0.33 dm-2 0.00 0.00 162.17 5836.66 32714.46 47040.8713.30 5.570.90 0.06 36.00 sdb 0.40 4251.90 106.72 107.35 23123.61 35204.09 272.46 4.43 20.68 1.29 27.64 avg-cpu: %user %nice %system %iowait %steal %idle 14.64 10.751.81 13.500.00 59.29 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 15.40 1344.60 68.80 145.60 4964.80 11790.4078.15 0.381.80 0.80 17.10 dm-0 0.00 0.00 43.00 1186.20 2292.80 9489.60 9.59 4.883.90 0.09 11.58 dm-1 0.00 0.001.600.0012.80 0.00 8.00 0.03 16.00 2.00 0.32 dm-2 0.00 0.00 197.20 17583.80 35152.00 140664.00 9.89 2847.50 109.52 0.05 93.50 sdb 13.20 16552.20 159.00 742.20 32745.60 129129.60 179.62 72.88 66.01 1.04 93.42 avg-cpu: %user %nice %system %iowait %steal %idle 15.51 19.771.975.020.00 57.73 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 16.20 523.40 60.00 285.00 5220.80 5913.6032.27 0.250.72 0.60 20.86 dm-0 0.00 0.000.801.4032.0011.2019.64 0.013.18 1.55 0.34 dm-1 0.00 0.001.600.0012.80 0.00 8.00 0.03 21.00 2.62 0.42 dm-2 0.00 0.00 339.40 5886.80 66219.20 47092.8018.20 251.66 184.72 0.10 63.48 sdb 1.00 5025.40 264.20 209.20 60992.00 50422.40 235.35 5.98 40.92 1.23 58.28 avg-cpu: %user %nice %system %iowait %steal %idle 16.59 16.342.039.010.00 56.04 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 5.40 320.00 37.40 159.80 2483.20 3529.6030.49 0.100.52 0.39 7.76 dm-0 0.00 0.000.203.60 1.6028.80 8.00 0.000.68 0.68 0.26 dm-1 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-2 0.00 0.00 287.20 13108.20 53985.60 104864.00 11.86 869.18 48.82 0.06 76.96 sdb 5.20 12163.40 238.20 532.00 51235.20 93753.60 188.25 21.46 23.75 0.97 75.08 On Tue, Aug 5, 2014 at 1:55 PM, Mark Reddy mark.re...@boxever.com wrote: Hi Ruchir, With the large number of blocked flushes and the number of pending compactions would still indicate IO contention. Can you post the output of 'iostat -x 5 5' If you do in fact have spare IO, there are several configuration options you can tune such as increasing the number of flush writers and compaction_throughput_mb_per_sec Mark On Tue, Aug 5, 2014 at 5:22 PM, Ruchir Jha ruchir@gmail.com wrote: Also Mark to your comment on my tpstats output, below is my iostat output, and the iowait is at 4.59%, which means no IO pressure, but we are still seeing the bad flush performance. Should we try increasing the flush writers? 
Linux 2.6.32-358.el6.x86_64 (ny4lpcas13.fusionts.corp) 08/05/2014 _x86_64_(24 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 5.80 10.250.654.590.00 78.72 Device:tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 103.83 9630.62 11982.60 3231174328 4020290310 dm-0 13.57 160.1781.12 53739546 27217432 dm-1 7.5916.9443.775682200 14686784 dm-2 5792.76 32242.66 45427.12 10817753530 15241278360 sdb 206.09 22789.19 33569.27 7646015080 11262843224 On Tue, Aug 5, 2014 at 12:13 PM, Ruchir Jha ruchir@gmail.com wrote: nodetool status: Datacenter: datacenter1 === Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN 10.10.20.27 1.89 TB256 25.4% 76023cdd-c42d-4068-8b53-ae94584b8b04 rack1 UN 10.10.20.62 1.83 TB256 25.5% 84b47313-da75-4519-94f3-3951d554a3e5 rack1 UN 10.10.20.47 1.87 TB256 24.7% bcd51a92-3150-41ae-9c51
Re: Node bootstrap
Also, right now the top command shows that we are at 500-700% CPU, and we have 23 total processors, which means we have a lot of idle CPU left over, so throwing more threads at compaction and flush should alleviate the problem? On Tue, Aug 5, 2014 at 2:57 PM, Ruchir Jha ruchir@gmail.com wrote: Right now, we have 6 flush writers and compaction_throughput_mb_per_sec is set to 0, which I believe disables throttling. Also, Here is the iostat -x 5 5 output: Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 10.00 1450.35 50.79 55.92 9775.97 12030.14 204.34 1.56 14.62 1.05 11.21 dm-0 0.00 0.003.59 18.82 166.52 150.3514.14 0.44 19.49 0.54 1.22 dm-1 0.00 0.002.325.3718.5642.98 8.00 0.76 98.82 0.43 0.33 dm-2 0.00 0.00 162.17 5836.66 32714.46 47040.8713.30 5.570.90 0.06 36.00 sdb 0.40 4251.90 106.72 107.35 23123.61 35204.09 272.46 4.43 20.68 1.29 27.64 avg-cpu: %user %nice %system %iowait %steal %idle 14.64 10.751.81 13.500.00 59.29 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 15.40 1344.60 68.80 145.60 4964.80 11790.4078.15 0.381.80 0.80 17.10 dm-0 0.00 0.00 43.00 1186.20 2292.80 9489.60 9.59 4.883.90 0.09 11.58 dm-1 0.00 0.001.600.0012.80 0.00 8.00 0.03 16.00 2.00 0.32 dm-2 0.00 0.00 197.20 17583.80 35152.00 140664.00 9.89 2847.50 109.52 0.05 93.50 sdb 13.20 16552.20 159.00 742.20 32745.60 129129.60 179.6272.88 66.01 1.04 93.42 avg-cpu: %user %nice %system %iowait %steal %idle 15.51 19.771.975.020.00 57.73 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 16.20 523.40 60.00 285.00 5220.80 5913.6032.27 0.250.72 0.60 20.86 dm-0 0.00 0.000.801.4032.0011.2019.64 0.013.18 1.55 0.34 dm-1 0.00 0.001.600.0012.80 0.00 8.00 0.03 21.00 2.62 0.42 dm-2 0.00 0.00 339.40 5886.80 66219.20 47092.8018.20 251.66 184.72 0.10 63.48 sdb 1.00 5025.40 264.20 209.20 60992.00 50422.40 235.35 5.98 40.92 1.23 58.28 avg-cpu: %user %nice %system %iowait %steal %idle 16.59 16.342.039.010.00 56.04 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 5.40 320.00 37.40 159.80 2483.20 3529.6030.49 0.100.52 0.39 7.76 dm-0 0.00 0.000.203.60 1.6028.80 8.00 0.000.68 0.68 0.26 dm-1 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-2 0.00 0.00 287.20 13108.20 53985.60 104864.00 11.86 869.18 48.82 0.06 76.96 sdb 5.20 12163.40 238.20 532.00 51235.20 93753.60 188.25 21.46 23.75 0.97 75.08 On Tue, Aug 5, 2014 at 1:55 PM, Mark Reddy mark.re...@boxever.com wrote: Hi Ruchir, With the large number of blocked flushes and the number of pending compactions would still indicate IO contention. Can you post the output of 'iostat -x 5 5' If you do in fact have spare IO, there are several configuration options you can tune such as increasing the number of flush writers and compaction_throughput_mb_per_sec Mark On Tue, Aug 5, 2014 at 5:22 PM, Ruchir Jha ruchir@gmail.com wrote: Also Mark to your comment on my tpstats output, below is my iostat output, and the iowait is at 4.59%, which means no IO pressure, but we are still seeing the bad flush performance. Should we try increasing the flush writers? 
Linux 2.6.32-358.el6.x86_64 (ny4lpcas13.fusionts.corp) 08/05/2014 _x86_64_(24 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 5.80 10.250.654.590.00 78.72 Device:tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 103.83 9630.62 11982.60 3231174328 4020290310 dm-0 13.57 160.1781.12 53739546 27217432 dm-1 7.5916.9443.775682200 14686784 dm-2 5792.76 32242.66 45427.12 10817753530 15241278360 sdb 206.09 22789.19 33569.27 7646015080 11262843224 On Tue, Aug 5, 2014 at 12:13 PM, Ruchir Jha ruchir@gmail.com wrote: nodetool status: Datacenter: datacenter1 === Status=Up/Down |/ State
Node bootstrap
I am trying to bootstrap the thirteenth node into a 12-node cluster where the average data size per node is about 2.1 TB. The bootstrap streaming has been going on for 2 days now, and the disk usage on the new node is already above 4 TB and still growing. Is this because the new node is running major compactions while the streaming is going on? One thing that I noticed that seemed off: the seeds property in the yaml of the 13th node comprises nodes 1..12, whereas the seeds property on the existing 12 nodes consists of all the other nodes except the thirteenth node. Is this an issue? Any other insight is appreciated. Ruchir.
Full GC in cassandra
Really curious to know what's causing the spike in Columns and DeletedColumns below:

2014-07-28T09:30:27.471-0400: 127335.928: [Full GC 127335.928: [Class Histogram:
 num     #instances         #bytes  class name
----------------------------------------------
   1:     132626060     6366050880  java.nio.HeapByteBuffer
   2:      28194918     3920045528  [B
   3:      78124737     3749987376  edu.stanford.ppl.concurrent.SnapTreeMap$Node
   4:      67650128     2164804096  org.apache.cassandra.db.Column
   5:      16315310      522089920  org.apache.cassandra.db.DeletedColumn
   6:          6818      392489608  [I
   7:       2844374      273059904  edu.stanford.ppl.concurrent.CopyOnWriteManager$COWEpoch
   8:       5727000      229080000  java.util.TreeMap$Entry
   9:        767742      182921376  [J
  10:       2932832      140775936  edu.stanford.ppl.concurrent.SnapTreeMap$RootHolder
  11:       2844375       91020000  edu.stanford.ppl.concurrent.CopyOnWriteManager$Latch
  12:       4145131       66322096  java.util.concurrent.atomic.AtomicReference
  13:        437874       64072392  [C
  14:       2660844       63860256  java.util.concurrent.ConcurrentSkipListMap$Node
  15:          4920       62849864  [[B
  16:       1632063       52226016  edu.stanford.ppl.concurrent.SnapTreeMap
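This is not an explanation of the spike itself, but the histogram can be divided out to see roughly what each live cell costs on the heap. A sketch of that arithmetic, using only the numbers above (so the per-object sizes are approximations of this particular heap, not general constants):

public class MemtableColumnOverhead {
    public static void main(String[] args) {
        // Instance and byte counts from the Full GC class histogram above.
        long columns       = 67_650_128L;  long columnBytes  = 2_164_804_096L; // db.Column
        long tombstones    = 16_315_310L;                                      // db.DeletedColumn
        long snapTreeNodes = 78_124_737L;  long nodeBytes    = 3_749_987_376L; // SnapTreeMap$Node
        long byteBuffers   = 132_626_060L; long bufferBytes  = 6_366_050_880L; // HeapByteBuffer

        long cells = columns + tombstones; // DeletedColumn is counted separately from Column

        // Fixed per-cell overhead, before the byte[] payloads ([B in the histogram):
        // one Column object, roughly one SnapTree node, and one to two ByteBuffers (name, value).
        System.out.printf("Column: %d B, SnapTreeMap$Node: %d B, HeapByteBuffer: %d B%n",
                columnBytes / columns, nodeBytes / snapTreeNodes, bufferBytes / byteBuffers);
        System.out.printf("ByteBuffers per cell: %.2f%n", (double) byteBuffers / cells);
        System.out.printf("tombstone fraction of cells: %.1f%%%n", 100.0 * tombstones / cells);
    }
}

With these figures the tombstones (DeletedColumn) make up roughly a fifth of all cells retained on the heap at the time of the Full GC.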
Re: Full GC in cassandra
Also we do subsequent updates (atleat 4) for each piece of data that we write. On Mon, Jul 28, 2014 at 10:36 AM, Ruchir Jha ruchir@gmail.com wrote: Doing about 5K writes / second. Avg Data Size = 1.6 TB / node. Total Data Size = 21 TB. And this is the nodetool cfstats output for one of our busiest column families: SSTable count: 10 Space used (live): 43239294899 Space used (total): 43239419603 SSTable Compression Ratio: 0.2954468408497778 Number of Keys (estimate): 63729152 Memtable Columns Count: 1921620 Memtable Data Size: 257680020 Memtable Switch Count: 9 Read Count: 6167 Read Latency: NaN ms. Write Count: 770984 Write Latency: 0.098 ms. Pending Tasks: 0 Bloom Filter False Positives: 370 Bloom Filter False Ratio: 0.0 Bloom Filter Space Used: 80103200 Compacted row minimum size: 180 Compacted row maximum size: 3311 Compacted row mean size: 2631 Average live cells per slice (last five minutes): 73.0 Average tombstones per slice (last five minutes): 13.0 On Mon, Jul 28, 2014 at 10:14 AM, Mark Reddy mark.re...@boxever.com wrote: What is your data size and number of columns in Cassandra. Do you do many deletions? On Mon, Jul 28, 2014 at 2:53 PM, Ruchir Jha ruchir@gmail.com wrote: Really curious to know what's causing the spike in Columns and DeletedColums below : 2014-07-28T09:30:27.471-0400: 127335.928: [Full GC 127335.928: [Class Histogram: num #instances #bytes class name -- 1: 132626060 6366050880 java.nio.HeapByteBuffer 2: 28194918 3920045528 [B 3: 78124737 3749987376 edu.stanford.ppl.concurrent.SnapTreeMap$Node * 4: 67650128 2164804096 2164804096 org.apache.cassandra.db.Column* * 5: 16315310 522089920 org.apache.cassandra.db.DeletedColumn* 6: 6818 392489608 [I 7: 2844374 273059904 edu.stanford.ppl.concurrent.CopyOnWriteManager$COWEpoch 8: 5727000 22908 java.util.TreeMap$Entry 9:767742 182921376 [J 10: 2932832 140775936 edu.stanford.ppl.concurrent.SnapTreeMap$RootHolder 11: 2844375 9102 edu.stanford.ppl.concurrent.CopyOnWriteManager$Latch 12: 4145131 66322096 java.util.concurrent.atomic.AtomicReference 13:437874 64072392 [C 14: 2660844 63860256 java.util.concurrent.ConcurrentSkipListMap$Node 15: 4920 62849864 [[B 16: 1632063 52226016 edu.stanford.ppl.concurrent.SnapTreeMap
Re: UnavailableException
Mark, Here you go: *NodeTool status:* Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN 10.10.20.15 1.62 TB256 8.1% 01a01f07-4df2-4c87-98e9-8dd38b3e4aee rack1 UN 10.10.20.19 1.66 TB256 8.3% 30ddf003-4d59-4a3e-85fa-e94e4adba1cb rack1 UN 10.10.20.35 1.62 TB256 9.0% 17cb8772-2444-46ff-8525-33746514727d rack1 UN 10.10.20.31 1.64 TB256 8.3% 1435acf9-c64d-4bcd-b6a4-abcec209815e rack1 UN 10.10.20.52 1.59 TB256 9.1% 6b5aca07-1b14-4bc2-a7ba-96f026fa0e4e rack1 UN 10.10.20.27 1.66 TB256 7.7% 76023cdd-c42d-4068-8b53-ae94584b8b04 rack1 UN 10.10.20.22 1.66 TB256 8.9% 46af9664-8975-4c91-847f-3f7b8f8d5ce2 rack1 UN 10.10.20.39 1.68 TB256 8.0% b7d44c26-4d75-4d36-a779-b7e7bdaecbc9 rack1 UN 10.10.20.45 1.49 TB256 7.7% 8d6bce33-8179-4660-8443-2cf822074ca4 rack1 UN 10.10.20.47 1.64 TB256 7.9% bcd51a92-3150-41ae-9c51-104ea154f6fa rack1 UN 10.10.20.62 1.59 TB256 8.2% 84b47313-da75-4519-94f3-3951d554a3e5 rack1 UN 10.10.20.51 1.66 TB256 8.9% 0343cd58-3686-465f-8280-56fb72d161e2 rack1 *Astyanax Connection Settings:* seeds :12 maxConns :16 maxConnsPerHost:16 connectTimeout :2000 socketTimeout :6 maxTimeoutCount:16 maxBlockedThreadsPerHost:16 maxOperationsPerConnection:16 DiscoveryType: RING_DESCRIBE ConnectionPoolType: TOKEN_AWARE DefaultReadConsistencyLevel: CL_QUORUM DefaultWriteConsistencyLevel: CL_QUORUM On Fri, Jul 11, 2014 at 5:04 PM, Mark Reddy mark.re...@boxever.com wrote: Can you post the output of nodetool status and your Astyanax connection settings? On Fri, Jul 11, 2014 at 9:06 PM, Ruchir Jha ruchir@gmail.com wrote: This is how we create our keyspace. We just ran this command once through a cqlsh session on one of the nodes, so don't quite understand what you mean by check that your DC names match up CREATE KEYSPACE prod WITH replication = { 'class': 'NetworkTopologyStrategy', 'datacenter1': '3' }; On Fri, Jul 11, 2014 at 3:48 PM, Chris Lohfink clohf...@blackbirdit.com wrote: What replication strategy are you using? 
if using NetworkTopolgyStrategy double check that your DC names match up (case sensitive) Chris On Jul 11, 2014, at 9:38 AM, Ruchir Jha ruchir@gmail.com wrote: Here's the complete stack trace: com.netflix.astyanax.connectionpool.exceptions.TokenRangeOfflineException: TokenRangeOfflineException: [host=ny4lpcas5.fusionts.corp(10.10.20.47):9160, latency=22784(42874), attempts=3]UnavailableException() at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:165) at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65) at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28) at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:151) at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:69) at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:256) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.executeOperation(ThriftKeyspaceImpl.java:485) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.access$000(ThriftKeyspaceImpl.java:79) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1.execute(ThriftKeyspaceImpl.java:123) Caused by: UnavailableException() at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20841) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:964) at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:950) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1$1.internalExecute(ThriftKeyspaceImpl.java:129) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1$1.internalExecute(ThriftKeyspaceImpl.java:126) at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:60) ... 12 more On Fri, Jul 11, 2014 at 9:11 AM, Prem Yadav ipremya...@gmail.com wrote: Please post the full exception. On Fri, Jul 11, 2014 at 1:50 PM, Ruchir Jha ruchir@gmail.com wrote: We have a 12 node cluster and we are consistently seeing this exception being thrown during peak write traffic. We have a replication factor of 3 and a write consistency level of QUORUM. Also note there is no unusual Or Full GC activity during this time. Appreciate any help. Sent from my iPhone
Re: UnavailableException
Yes the line is : Datacenter: datacenter1 which matches with my create keyspace command. As for the NodeDiscoveryType, we will follow it but I don't believe it to be the root of my issue here because the nodes start up atleast 6 hours before the UnavailableException and as far as adding nodes is concerned we would only do it after hours. On Mon, Jul 14, 2014 at 2:34 PM, Chris Lohfink clohf...@blackbirdit.com wrote: If you list all 12 nodes in seeds list, you can try using NodeDiscoveryType.NONE instead of RING_DESCRIBE. Its been recommended that way by some anyway so if you add nodes to cluster your app wont start using it until all bootstrapping and everythings settled down. Chris On Jul 14, 2014, at 12:04 PM, Ruchir Jha ruchir@gmail.com wrote: Mark, Here you go: *NodeTool status:* Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN 10.10.20.15 1.62 TB256 8.1% 01a01f07-4df2-4c87-98e9-8dd38b3e4aee rack1 UN 10.10.20.19 1.66 TB256 8.3% 30ddf003-4d59-4a3e-85fa-e94e4adba1cb rack1 UN 10.10.20.35 1.62 TB256 9.0% 17cb8772-2444-46ff-8525-33746514727d rack1 UN 10.10.20.31 1.64 TB256 8.3% 1435acf9-c64d-4bcd-b6a4-abcec209815e rack1 UN 10.10.20.52 1.59 TB256 9.1% 6b5aca07-1b14-4bc2-a7ba-96f026fa0e4e rack1 UN 10.10.20.27 1.66 TB256 7.7% 76023cdd-c42d-4068-8b53-ae94584b8b04 rack1 UN 10.10.20.22 1.66 TB256 8.9% 46af9664-8975-4c91-847f-3f7b8f8d5ce2 rack1 UN 10.10.20.39 1.68 TB256 8.0% b7d44c26-4d75-4d36-a779-b7e7bdaecbc9 rack1 UN 10.10.20.45 1.49 TB256 7.7% 8d6bce33-8179-4660-8443-2cf822074ca4 rack1 UN 10.10.20.47 1.64 TB256 7.9% bcd51a92-3150-41ae-9c51-104ea154f6fa rack1 UN 10.10.20.62 1.59 TB256 8.2% 84b47313-da75-4519-94f3-3951d554a3e5 rack1 UN 10.10.20.51 1.66 TB256 8.9% 0343cd58-3686-465f-8280-56fb72d161e2 rack1 *Astyanax Connection Settings:* seeds :12 maxConns :16 maxConnsPerHost:16 connectTimeout :2000 socketTimeout :6 maxTimeoutCount:16 maxBlockedThreadsPerHost:16 maxOperationsPerConnection:16 DiscoveryType: RING_DESCRIBE ConnectionPoolType: TOKEN_AWARE DefaultReadConsistencyLevel: CL_QUORUM DefaultWriteConsistencyLevel: CL_QUORUM On Fri, Jul 11, 2014 at 5:04 PM, Mark Reddy mark.re...@boxever.com wrote: Can you post the output of nodetool status and your Astyanax connection settings? On Fri, Jul 11, 2014 at 9:06 PM, Ruchir Jha ruchir@gmail.com wrote: This is how we create our keyspace. We just ran this command once through a cqlsh session on one of the nodes, so don't quite understand what you mean by check that your DC names match up CREATE KEYSPACE prod WITH replication = { 'class': 'NetworkTopologyStrategy', 'datacenter1': '3' }; On Fri, Jul 11, 2014 at 3:48 PM, Chris Lohfink clohf...@blackbirdit.com wrote: What replication strategy are you using? 
if using NetworkTopolgyStrategy double check that your DC names match up (case sensitive) Chris On Jul 11, 2014, at 9:38 AM, Ruchir Jha ruchir@gmail.com wrote: Here's the complete stack trace: com.netflix.astyanax.connectionpool.exceptions.TokenRangeOfflineException: TokenRangeOfflineException: [host=ny4lpcas5.fusionts.corp(10.10.20.47):9160, latency=22784(42874), attempts=3]UnavailableException() at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:165) at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65) at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28) at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:151) at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:69) at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:256) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.executeOperation(ThriftKeyspaceImpl.java:485) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.access$000(ThriftKeyspaceImpl.java:79) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1.execute(ThriftKeyspaceImpl.java:123) Caused by: UnavailableException() at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20841) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:964) at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:950) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1$1.internalExecute
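For reference, a rough sketch of what an Astyanax setup with NodeDiscoveryType.NONE and a static seed list, as suggested above, could look like. The cluster name, seed addresses, and pool size are placeholders, and the consistency levels simply mirror the settings quoted above; this is a sketch, not a drop-in replacement for the existing configuration:

import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ConsistencyLevel;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class StaticSeedContext {
    public static Keyspace connect() {
        // Placeholder seed list; with NodeDiscoveryType.NONE all 12 nodes would be listed here.
        String seeds = "10.10.20.15:9160,10.10.20.19:9160,10.10.20.22:9160";

        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                .forCluster("cluster") // placeholder
                .forKeyspace("prod")
                .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                        .setDiscoveryType(NodeDiscoveryType.NONE) // no ring describe; use the seed list as-is
                        .setDefaultReadConsistencyLevel(ConsistencyLevel.CL_QUORUM)
                        .setDefaultWriteConsistencyLevel(ConsistencyLevel.CL_QUORUM))
                .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("pool")
                        .setPort(9160)
                        .setMaxConnsPerHost(16)
                        .setSeeds(seeds))
                .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        return context.getClient();
    }
}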
UnavailableException
We have a 12 node cluster and we are consistently seeing this exception being thrown during peak write traffic. We have a replication factor of 3 and a write consistency level of QUORUM. Also note there is no unusual or Full GC activity during this time. Appreciate any help. Sent from my iPhone
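For context on when this exception fires: with RF=3 and QUORUM, the coordinator requires floor(3/2)+1 = 2 replicas of the key's range to be seen as up, and it throws UnavailableException before attempting the write when fewer are. A trivial sketch of that arithmetic:

public class QuorumMath {
    // Replicas that must be alive for a QUORUM read or write to be attempted.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    public static void main(String[] args) {
        int rf = 3;
        System.out.println("RF=" + rf + " -> quorum=" + quorum(rf)
                + ", tolerates " + (rf - quorum(rf)) + " unavailable replica(s)");
        // RF=3 -> quorum=2, tolerates 1 unavailable replica(s)
    }
}

So seeing this during peak write traffic suggests the coordinator considered at least two of the three replicas for some ranges unavailable (or overloaded enough to be marked down) at that moment.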
Re: UnavailableException
Here's the complete stack trace: com.netflix.astyanax.connectionpool.exceptions.TokenRangeOfflineException: TokenRangeOfflineException: [host=ny4lpcas5.fusionts.corp(10.10.20.47):9160, latency=22784(42874), attempts=3]UnavailableException() at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:165) at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65) at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28) at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:151) at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:69) at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:256) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.executeOperation(ThriftKeyspaceImpl.java:485) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.access$000(ThriftKeyspaceImpl.java:79) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1.execute(ThriftKeyspaceImpl.java:123) Caused by: UnavailableException() at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20841) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:964) at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:950) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1$1.internalExecute(ThriftKeyspaceImpl.java:129) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1$1.internalExecute(ThriftKeyspaceImpl.java:126) at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:60) ... 12 more On Fri, Jul 11, 2014 at 9:11 AM, Prem Yadav ipremya...@gmail.com wrote: Please post the full exception. On Fri, Jul 11, 2014 at 1:50 PM, Ruchir Jha ruchir@gmail.com wrote: We have a 12 node cluster and we are consistently seeing this exception being thrown during peak write traffic. We have a replication factor of 3 and a write consistency level of QUORUM. Also note there is no unusual Or Full GC activity during this time. Appreciate any help. Sent from my iPhone
Re: UnavailableException
This is how we create our keyspace. We just ran this command once through a cqlsh session on one of the nodes, so don't quite understand what you mean by check that your DC names match up CREATE KEYSPACE prod WITH replication = { 'class': 'NetworkTopologyStrategy', 'datacenter1': '3' }; On Fri, Jul 11, 2014 at 3:48 PM, Chris Lohfink clohf...@blackbirdit.com wrote: What replication strategy are you using? if using NetworkTopolgyStrategy double check that your DC names match up (case sensitive) Chris On Jul 11, 2014, at 9:38 AM, Ruchir Jha ruchir@gmail.com wrote: Here's the complete stack trace: com.netflix.astyanax.connectionpool.exceptions.TokenRangeOfflineException: TokenRangeOfflineException: [host=ny4lpcas5.fusionts.corp(10.10.20.47):9160, latency=22784(42874), attempts=3]UnavailableException() at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:165) at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65) at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28) at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:151) at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:69) at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:256) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.executeOperation(ThriftKeyspaceImpl.java:485) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.access$000(ThriftKeyspaceImpl.java:79) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1.execute(ThriftKeyspaceImpl.java:123) Caused by: UnavailableException() at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20841) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:964) at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:950) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1$1.internalExecute(ThriftKeyspaceImpl.java:129) at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1$1.internalExecute(ThriftKeyspaceImpl.java:126) at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:60) ... 12 more On Fri, Jul 11, 2014 at 9:11 AM, Prem Yadav ipremya...@gmail.com wrote: Please post the full exception. On Fri, Jul 11, 2014 at 1:50 PM, Ruchir Jha ruchir@gmail.com wrote: We have a 12 node cluster and we are consistently seeing this exception being thrown during peak write traffic. We have a replication factor of 3 and a write consistency level of QUORUM. Also note there is no unusual Or Full GC activity during this time. Appreciate any help. Sent from my iPhone
Re: TTransportException (java.net.SocketException: Broken pipe)
We have these precise settings but are still seeing the broken pipe exception in our gc logs. Any clues? Sent from my iPhone On Jul 8, 2014, at 1:17 PM, Bhaskar Singhal bhaskarsing...@yahoo.com wrote: Thanks Mark. Yes the 1024 is the limit. I haven't changed it as per the recommended production settings. But I am wondering why does Cassandra need to keep 3000+ commit log segment files open? Regards, Bhaskar On Tuesday, 8 July 2014 1:50 PM, Mark Reddy mark.re...@boxever.com wrote: Hi Bhaskar, Can you check your limits using 'ulimit -a'? The default is 1024, which needs to be increased if you have not done so already. Here you will find a list of recommended production settings: http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html Mark On Tue, Jul 8, 2014 at 5:30 AM, Bhaskar Singhal bhaskarsing...@yahoo.com wrote: Hi, I am using Cassandra 2.0.7 (with default settings and 16GB heap on quad core ubuntu server with 32gb ram) and trying to ingest 1MB values using cassandra-stress. It works fine for a while(1600secs) but after ingesting around 120GB data, I start getting the following error: Operation [70668] retried 10 times - error inserting key 0070668 ((TTransportException): java.net.SocketException: Broken pipe) The cassandra server is still running but in the system.log I see the below mentioned errors. ERROR [COMMIT-LOG-ALLOCATOR] 2014-07-07 22:39:23,617 CassandraDaemon.java (line 198) Exception in thread Thread[COMMIT-LOG-ALLOCATOR,5,main] java.lang.NoClassDefFoundError: org/apache/cassandra/db/commitlog/CommitLog$4 at org.apache.cassandra.db.commitlog.CommitLog.handleCommitError(CommitLog.java:374) at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:116) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.ClassNotFoundException: org.apache.cassandra.db.commitlog.CommitLog$4 at java.net.URLClassLoader$1.run(URLClassLoader.java:363) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 4 more Caused by: java.io.FileNotFoundException: /path/2.0.7/cassandra/build/classes/main/org/apache/cassandra/db/commitlog/CommitLog$4.class (Too many open files) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.init(FileInputStream.java:146) at sun.misc.URLClassPath$FileLoader$1.getInputStream(URLClassPath.java:1086) at sun.misc.Resource.cachedInputStream(Resource.java:77) at sun.misc.Resource.getByteBuffer(Resource.java:160) at java.net.URLClassLoader.defineClass(URLClassLoader.java:436) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) ... 
10 more ERROR [FlushWriter:7] 2014-07-07 22:39:24,924 CassandraDaemon.java (line 198) Exception in thread Thread[FlushWriter:7,5,main] FSWriteError in /cassandra/data4/Keyspace1/Standard1/Keyspace1-Standard1-tmp-jb-593-Filter.db at org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.close(SSTableWriter.java:475) at org.apache.cassandra.io.util.FileUtils.closeQuietly(FileUtils.java:212) at org.apache.cassandra.io.sstable.SSTableWriter.abort(SSTableWriter.java:301) at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:417) at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:350) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.io.FileNotFoundException: /cassandra/data4/Keyspace1/Standard1/Keyspace1-Standard1-tmp-jb-593-Filter.db (Too many open files) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.init(FileOutputStream.java:221) at java.io.FileOutputStream.init(FileOutputStream.java:110) at org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.close(SSTableWriter.java:466) ... 9 more There are around 9685 open files by the Cassandra server process (using lsof),
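One quick way to see the file-descriptor limits a JVM actually received (as opposed to what an interactive 'ulimit -a' reports for your own shell) is the platform OperatingSystemMXBean, which also backs the java.lang:type=OperatingSystem attributes visible over JMX on the Cassandra process. A small standalone sketch; run inside any JVM, it reports that JVM's own limits, so it is mainly useful for sanity-checking how limits.conf is being applied:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

import com.sun.management.UnixOperatingSystemMXBean;

public class FdLimitCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            // Open vs. maximum file descriptors for this JVM process.
            System.out.println("open fds: " + unix.getOpenFileDescriptorCount());
            System.out.println("max fds:  " + unix.getMaxFileDescriptorCount());
        } else {
            System.out.println("not a Unix-like platform, no fd counts available");
        }
    }
}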
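Editor's note: since the underlying failure in this thread is file-descriptor exhaustion ("Too many open files"), it can help to watch the server's descriptor usage over time rather than checking ulimit once. The following is a minimal sketch, not part of the original thread; it assumes a HotSpot/OpenJDK JVM on a Unix-like OS, where the standard OperatingSystemMXBean can be cast to com.sun.management.UnixOperatingSystemMXBean. The same counters are also exposed over JMX on the running Cassandra process under java.lang:type=OperatingSystem.

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdUsage {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
            // Descriptors currently held by this JVM versus the 'ulimit -n' ceiling.
            System.out.println("open fds: " + unixOs.getOpenFileDescriptorCount()
                    + " / max: " + unixOs.getMaxFileDescriptorCount());
        } else {
            System.out.println("Not a Unix-like platform; use 'ulimit -a' or lsof instead.");
        }
    }
}

Run inside (or attached to) the Cassandra JVM, this reflects the server process itself; run standalone it only reports the monitoring JVM, so the JMX attribute is usually the more practical route.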
Re: clearing tombstones?
I tried to do this; however, the doubling in disk space is not temporary as you state in your note. What am I missing?

On Fri, Apr 11, 2014 at 10:44 AM, William Oberman ober...@civicscience.com wrote:

So, if I was impatient and just wanted to make this happen now, I could:
1.) Change GCGraceSeconds of the CF to 0
2.) Run nodetool compact (*)
3.) Change GCGraceSeconds of the CF back to 10 days
Since I have ~900M tombstones, even if I miss a few due to impatience, I don't care *that* much, as I could re-run my clean-up tool against the now much smaller CF.
(*) A long, long time ago I seem to recall reading advice about never running nodetool compact, but I can't remember why. Is there any bad long-term consequence? Short term there are several:
- a heavy operation
- temporary 2x disk space
- one big SSTable afterwards
But moving forward, everything is OK, right? CommitLog/MemTable -> SSTables, minor compactions that merge SSTables, etc. The only flaw I can think of is that it will take forever until the minor compactions build up enough SSTables to consider including the big SSTable in a compaction, making it likely I'll have to self-manage compactions.

On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy mark.re...@boxever.com wrote:

Correct, a tombstone will only be removed after the gc_grace period has elapsed. The default value is set to 10 days, which allows a great deal of time for consistency to be achieved prior to deletion. If you are operationally confident that you can achieve consistency via anti-entropy repairs within a shorter period, you can always reduce that 10-day interval. Mark

On Fri, Apr 11, 2014 at 3:16 PM, William Oberman ober...@civicscience.com wrote:

I'm seeing a lot of articles about a dependency between removing tombstones and GCGraceSeconds, which might be my problem (I just checked, and this CF has GCGraceSeconds of 10 days).

On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli tbarbu...@gmail.com wrote:

Compaction should take care of it; for me it never worked, so I run nodetool compact on every node; that does it.

2014-04-11 16:05 GMT+02:00 William Oberman ober...@civicscience.com:

I'm wondering what will clear tombstoned rows? nodetool cleanup, nodetool repair, or time (as in just waiting)? I had a CF that was more or less storing session information. After some time, we decided that one piece of this information was pointless to track (and was 90%+ of the columns, and in 99% of those cases was ALL of the columns for a row). I wrote a process to remove all of those columns (which, again, in the vast majority of cases had the effect of removing the whole row). This CF had ~1 billion rows, so I expect to be left with ~100M rows. After I did this mass delete, everything was the same size on disk (which I expected, knowing how tombstoning works). It wasn't 100% clear to me what to poke to cause compactions to clear the tombstones. First I tried nodetool cleanup on a candidate node, but afterwards the disk usage was the same. Then I tried nodetool repair on that same node, but again the disk usage is still the same. The CF has no snapshots. So, am I misunderstanding something? Is there another operation to try? Do I have to just wait? I've only done cleanup/repair on one node. Do I have to run one or the other over all nodes to clear tombstones? Cassandra 1.2.15 if it matters. Thanks! will
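Editor's note: for anyone scripting the gc_grace workaround described above, the schema change itself can be driven from the Java driver, while the compaction step still has to be run out of band with nodetool compact on each node. The sketch below is illustrative only and not from the thread; it assumes DataStax Java driver 2.x and a hypothetical CQL-visible table ks.sessions.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class GcGraceToggle {
    // Illustrative only: call lowerGcGrace() first, then run
    // 'nodetool compact ks sessions' on every node, and only then restoreGcGrace().
    private final Session session;

    GcGraceToggle(Session session) {
        this.session = session;
    }

    void lowerGcGrace() {
        // Allow the next compaction to drop tombstones immediately.
        session.execute("ALTER TABLE ks.sessions WITH gc_grace_seconds = 0");
    }

    void restoreGcGrace() {
        // Back to the 10-day default once the compaction has finished.
        session.execute("ALTER TABLE ks.sessions WITH gc_grace_seconds = 864000");
    }

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        GcGraceToggle toggle = new GcGraceToggle(cluster.connect());
        toggle.lowerGcGrace();
        // ... run nodetool compact on all nodes here, then:
        toggle.restoreGcGrace();
        cluster.close();
    }
}

The trade-off is the one already discussed in the thread: while gc_grace_seconds is 0, a tombstone can be compacted away before it has reached every replica, so deleted data can resurface if a node was down during that window.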
Re: Cassandra JVM GC defaults question
Lowering CMSInitiatingOccupancyFraction below 75 will lead to more GC interference and will impact write performance. If you're not sensitive to this impact, your expectation is correct; however, make sure your flush_largest_memtables_at is always set to less than or equal to the occupancy fraction.

On 4/23/14, Ken Hancock ken.hanc...@schange.com wrote:

I'm in the process of trying to tune the GC, and I'm far from an expert in this area, so I'm hoping someone can tell me whether I'm out in left field or on track. Cassandra's default GC settings are (abbreviated):
+UseConcMarkSweepGC
CMSInitiatingOccupancyFraction=75
+UseCMSInitiatingOccupancyOnly
Also in cassandra.yaml: flush_largest_memtables_at: 0.75
Since the new heap is relatively small, if I'm understanding this correctly CMS will normally not kick in until the old generation is at roughly 75% occupancy (75% of the heap size minus new, with new being relatively small compared to the overall heap). With these two settings so close together, both would seem to trigger at nearly the same point, which might be undesirable, as the flushing would also create more GC pressure (in addition to FlushWriter blocking if multiple tables are queued for flushing because of this). Clearly more heap will give us more peak running room, but would lowering CMSInitiatingOccupancyFraction also help, at the expense of some added CPU for more frequent, smaller collections? Mikio Braun's blog had some interesting tests in this area: http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html
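Editor's note: while tuning, the effective values of these flags can be read back from the running JVM rather than inferred from cassandra-env.sh. This is a hedged sketch, not from the thread; it assumes a HotSpot JVM, where com.sun.management.HotSpotDiagnosticMXBean exposes VM options (flush_largest_memtables_at is a cassandra.yaml setting, not a JVM flag, so it has to be checked separately).

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;

public class GcFlagCheck {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Print the values the JVM actually ended up with for the flags discussed above.
        String[] flags = { "UseConcMarkSweepGC",
                           "CMSInitiatingOccupancyFraction",
                           "UseCMSInitiatingOccupancyOnly" };
        for (String flag : flags) {
            VMOption opt = hotspot.getVMOption(flag);
            System.out.println(opt.getName() + " = " + opt.getValue()
                    + " (origin: " + opt.getOrigin() + ")");
        }
    }
}

Attached to the Cassandra process over JMX (the same bean is registered as com.sun.management:type=HotSpotDiagnostic), this shows whether an edited cassandra-env.sh actually took effect after a restart.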
GC histogram analysis
Hi, I am trying to investigate ParNew promotion failures happening routinely in production. As part of this exercise, I enabled -XX:+PrintClassHistogramBeforeFullGC and saw the following output. As you can see, there are a huge number of Column, ExpiringColumn and DeletedColumn instances before the GC ran, and these counts drop significantly right after GC. Why are there so many expiring and deleted columns?

*Before GC:*
 num     #instances         #bytes  class name
----------------------------------------------
   1:     113539896     5449915008  java.nio.*HeapByteBuffer*
   2:      15979061     2681431488  [B
   3:      36364545     1745498160  edu.stanford.ppl.concurrent.SnapTreeMap$Node
   4:      23583282      754665024  org.apache.cassandra.db.*Column*
   5:       8745428      209890272  java.util.concurrent.ConcurrentSkipListMap$Node
   6:       5062619      202504760  org.apache.cassandra.db.*ExpiringColumn*
   7:         45261      198998216  [I
   8:       1801535      172947360  edu.stanford.ppl.concurrent.CopyOnWriteManager$COWEpoch
   9:       1473677      169570040  [J
  10:       4713304      113119296  java.lang.Double
  11:       3246729      103895328  org.apache.cassandra.db.*DeletedColumn*

*After GC:*
 num     #instances         #bytes  class name
----------------------------------------------
   1:      11807204     1505962728  [B
   2:      12525536      601225728  java.nio.*HeapByteBuffer*
   3:       8839073      424275504  edu.stanford.ppl.concurrent.SnapTreeMap$Node
   4:       8194496      262223872  org.apache.cassandra.db.*Column*
   ... (rows 5-16 truncated in the original; a KeyCacheKey entry is among them)
  17:        432119       17284760  org.apache.cassandra.db.*ExpiringColumn*
   ... (rows 18-20 truncated in the original)
  21:        351096       11235072  org.apache.cassandra.db.*DeletedColumn*
Re: GC histogram analysis
No, we don't.

Sent from my iPhone

On Apr 16, 2014, at 9:21 AM, Mark Reddy mark.re...@boxever.com wrote:

Do you delete and/or set TTLs on your data?

On Wed, Apr 16, 2014 at 2:14 PM, Ruchir Jha ruchir@gmail.com wrote:

Hi, I am trying to investigate ParNew promotion failures happening routinely in production. As part of this exercise, I enabled -XX:+PrintClassHistogramBeforeFullGC and saw the following output. As you can see, there are a huge number of Column, ExpiringColumn and DeletedColumn instances before the GC ran, and these counts drop significantly right after GC. Why are there so many expiring and deleted columns?

Before GC:
 num     #instances         #bytes  class name
----------------------------------------------
   1:     113539896     5449915008  java.nio.HeapByteBuffer
   2:      15979061     2681431488  [B
   3:      36364545     1745498160  edu.stanford.ppl.concurrent.SnapTreeMap$Node
   4:      23583282      754665024  org.apache.cassandra.db.Column
   5:       8745428      209890272  java.util.concurrent.ConcurrentSkipListMap$Node
   6:       5062619      202504760  org.apache.cassandra.db.ExpiringColumn
   7:         45261      198998216  [I
   8:       1801535      172947360  edu.stanford.ppl.concurrent.CopyOnWriteManager$COWEpoch
   9:       1473677      169570040  [J
  10:       4713304      113119296  java.lang.Double
  11:       3246729      103895328  org.apache.cassandra.db.DeletedColumn

After GC:
 num     #instances         #bytes  class name
----------------------------------------------
   1:      11807204     1505962728  [B
   2:      12525536      601225728  java.nio.HeapByteBuffer
   3:       8839073      424275504  edu.stanford.ppl.concurrent.SnapTreeMap$Node
   4:       8194496      262223872  org.apache.cassandra.db.Column
   ... (rows 5-16 truncated in the original; a KeyCacheKey entry is among them)
  17:        432119       17284760  org.apache.cassandra.db.ExpiringColumn
   ... (rows 18-20 truncated in the original)
  21:        351096       11235072  org.apache.cassandra.db.DeletedColumn
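Editor's note on why Mark asks: in the 1.2/2.0 storage engine a column carrying a TTL is represented on the heap as an org.apache.cassandra.db.ExpiringColumn and a deletion as a DeletedColumn (a tombstone), whether the cell is sitting in a memtable or being read or compacted, so large counts of those classes normally point at TTL or delete traffic. The snippet below is a hedged illustration, not from the thread; it assumes the DataStax Java driver and a hypothetical table ks.events.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class TtlAndTombstoneWrites {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("ks");
        // A TTL'd insert is held as an ExpiringColumn until it expires.
        session.execute("INSERT INTO events (id, payload) VALUES (1, 'x') USING TTL 86400");
        // A delete writes a tombstone, which shows up on the heap as a DeletedColumn.
        session.execute("DELETE FROM events WHERE id = 2");
        cluster.close();
    }
}

If, as Ruchir says, the application issues neither deletes nor TTLs, then another writer to the cluster, or a table-level default TTL where the version supports it, would be the next thing to rule out.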