If your partitions go above ~1GB, the primary symptom you'll see is a LOT of garbage created on reads (CASSANDRA-9754 details this).
As redesigning a data model is often expensive (engineering time, reloading data, etc.), one workaround is to tune your JVM to better handle situations where you create a lot of trash. One method that can help is to use a much larger eden size than the default – up to 50% of your total heap size. For example, if you were using an 8G heap and 2G eden, going to 3G or 4G eden (new heap size in cassandra-env.sh) MAY work better for you if you're reading from large partitions (it can also crash your server in some cases, so TEST IT IN A LAB FIRST).

- Jeff

From: Alexander Dejanovski <a...@thelastpickle.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, October 27, 2016 at 2:13 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Tools to manage repairs

The "official" recommendation would be 100MB, but it's hard to give a precise answer. Keeping it under a GB seems like a good target. A few patches are pushing the limits of partition sizes, so we may soon be more comfortable with big partitions.

Cheers

On Thu, Oct 27, 2016 at 21:28, Vincent Rischmann <m...@vrischmann.me> wrote:

Yeah, that particular table is badly designed; I intend to fix it when the roadmap allows us to :)

What is the recommended maximum partition size?

Thanks for all the information.

On Thu, Oct 27, 2016, at 08:14 PM, Alexander Dejanovski wrote:

3.3GB is already too high, and it surely doesn't help compactions perform well. I know changing a data model is no easy thing to do, but you should try to do something here.

Anticompaction is a special type of compaction, and if an sstable is being anticompacted, any attempt to run a validation compaction on it will fail, telling you that an sstable cannot be part of two repair sessions at the same time. Incremental repair must therefore be run one node at a time, waiting for anticompactions to end before moving from one node to the next.
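Jeff's eden-size suggestion earlier in the thread would translate to something like the following in cassandra-env.sh. These are example values only, not a recommendation, and as he says: test it in a lab first.

```shell
# cassandra-env.sh -- example values only, TEST IN A LAB FIRST.
# Total heap stays at 8G; eden (the "new" generation) is raised
# from ~2G to 4G, i.e. 50% of the heap, to absorb read garbage
# from large partitions.
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="4G"
```

With CMS (the 2.1-era default collector), a larger eden means young collections happen less often but each one scans more; whether that nets out positive depends on your read patterns, hence the lab warning.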
Be mindful of running incremental repair on a regular basis once you've started, as you'll have two separate pools of sstables (repaired and unrepaired) that won't get compacted together, which could be a problem if you want tombstones to be purged efficiently.

Cheers,

On Thu, Oct 27, 2016 at 17:57, Vincent Rischmann <m...@vrischmann.me> wrote:

Ok, I think we'll give incremental repairs a try on a limited number of CFs first, and if it goes well we'll progressively switch more CFs to incremental.

I'm not sure I understand the problem with anticompaction and validation running concurrently. As far as I can tell, right now when a CF is repaired (either via Reaper or via nodetool) there may be compactions running at the same time. In fact, it happens very often. Is that a problem?

As for big partitions, the biggest one we have is around 3.3GB. Some less big partitions are around 500MB and less.

On Thu, Oct 27, 2016, at 05:37 PM, Alexander Dejanovski wrote:

Oh right, that's what they advise :)

I'd say that you should skip the full repair phase in the migration procedure, as that will obviously fail, and just mark all sstables as repaired (skip steps 1, 2 and 6). You can't do better anyway, so take a leap of faith there.

Intensity is already very low, and 10000 segments is a whole lot for 9 nodes; you should not need that many.

You can definitely pick which CFs you'll run incremental repair on, and still run full repair on the rest. If you pick our Reaper fork, watch out for the schema changes that add the incremental repair fields, and I do not advise running incremental repair without it, otherwise you might have issues with anticompaction and validation compactions running concurrently from time to time.

One last thing: can you check if you have particularly big partitions in the CFs that fail to get repaired? You can run nodetool cfhistograms to check that.

Cheers,

On Thu, Oct 27, 2016 at 5:24 PM Vincent Rischmann <m...@vrischmann.me> wrote:

Thanks for the response.
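(Checking partition sizes the way Alexander suggests above would look like the following on 2.1; keyspace and table names are placeholders.)

```shell
# Per-table percentiles, including a "Partition Size" column in bytes:
nodetool cfhistograms my_keyspace my_table

# cfstats also reports "Compacted partition maximum bytes" per table:
nodetool cfstats my_keyspace.my_table
```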
We do break up repairs between tables, and we also tried our best to have no overlap between repair runs. Each repair has 10000 segments (a purely arbitrary number that seemed to help at the time). Some runs have an intensity of 0.4, some as low as 0.05. Still, sometimes one particular app (which does a lot of read/modify/write batches at QUORUM) gets slowed down to the point where we have to stop the repair run.

But more annoyingly, for the last 2 to 3 weeks, as I said, runs stop progressing after some time. Every time I restart Reaper, it starts repairing correctly again, up until it gets stuck. I have no idea why that happens now, but it means I have to babysit Reaper, and it's becoming annoying.

Thanks for the suggestion about incremental repairs. It would probably be a good thing, but I think it's a little challenging to set up. Right now, running a full repair of all keyspaces (via nodetool repair) would take a lot of time, probably 5 days or more; we were never able to run one to completion. I'm not sure it's a good idea to disable autocompaction for that long, but maybe I'm wrong. Is it possible to use incremental repairs on some column families only?

On Thu, Oct 27, 2016, at 05:02 PM, Alexander Dejanovski wrote:

Hi Vincent,

most people handle repair with:
- pain (running nodetool commands by hand)
- cassandra_range_repair: https://github.com/BrianGallew/cassandra_range_repair
- Spotify Reaper
- and the OpsCenter repair service, for DSE users

Reaper is a good option, I think, and you should stick to it. If it cannot do the job here then no other tool will.
You have several options from here:
- Try to break up your repairs table by table and see which ones actually get stuck.
- Check your logs for any repair/streaming errors.
- Avoid repairing everything: you may have expendable tables, or TTL-only tables with no deletes that are accessed with QUORUM CL only.
- Try to relieve repair pressure in Reaper by lowering the repair intensity (on the tables that get stuck).
- Try adding steps to your repair process by using a higher segment count in Reaper (on the tables that get stuck).
- And lastly, you can turn to incremental repair.

As you're familiar with Reaper already, you might want to take a look at our Reaper fork that handles incremental repair: https://github.com/thelastpickle/cassandra-reaper

If you go down that way, make sure you first mark all sstables as repaired before you run your first incremental repair, otherwise you'll end up in anticompaction hell (a bad, bad place): https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
Even if people say that's not necessary anymore, it will save you from a very bad first experience with incremental repair.

Furthermore, make sure you run repair daily after your first incremental repair run, in order to keep each repair small.

Cheers,

On Thu, Oct 27, 2016 at 4:27 PM Vincent Rischmann <m...@vrischmann.me> wrote:

Hi,

we have two Cassandra 2.1.15 clusters at work and are having some trouble with repairs. Each cluster has 9 nodes, and the amount of data is not gigantic, but some column families have 300+GB of data.

We tried to use `nodetool repair` for these tables, but at the time we tested it, it loaded the whole cluster too much and impacted our production apps. Next we found https://github.com/spotify/cassandra-reaper , tried it, and had some success until recently. For the last 2 to 3 weeks it has never completed a repair run, somehow deadlocking itself.
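A rough sketch of the "mark all sstables as repaired" step from the migration procedure linked above, plus a per-table incremental repair afterwards. Paths, keyspace, and table names are placeholders; flag names are from the 2.1-era tools, so verify them against your version, and note that sstablerepairedset must be run while the node is stopped.

```shell
# With the node STOPPED: collect the table's sstables and mark them
# as repaired (placeholder data path; adapt keyspace/table).
find /var/lib/cassandra/data/my_keyspace/my_table-*/ -name "*Data.db" \
  > /tmp/sstables.txt
sstablerepairedset --really-set --is-repaired -f /tmp/sstables.txt

# After restarting the node, subsequent incremental repairs can be
# run per table (incremental repair is parallel in 2.1):
nodetool repair -par -inc my_keyspace my_table
```

This matches Alexander's advice to skip the initial full-repair steps of the documented procedure and go straight to marking sstables, accepting the small window of unrepaired data that implies.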
I know DSE includes a repair service, but I'm wondering: how do other Cassandra users manage repairs?

Vincent.

--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com