Hi Vincent, it is very clear, thanks for all the info.

I would not stick with G1 in your case, as it requires much more heap to perform correctly (>24GB). CMS/ParNew should be much more efficient here, and I would go with some settings I usually apply on big workloads : 16GB heap / 6GB new gen / MaxTenuringThreshold = 5.

Large partitions are indeed putting pressure on your heap, and so are tombstones. One of your queries is particularly problematic :

    SELECT app, platform, slug, partition, user_id, attributes, state, timezone, version
    FROM table WHERE app = ? AND platform = ? AND slug = ? AND partition = ? LIMIT ?

Although you're using the LIMIT clause, Cassandra will read the whole partition, merge it in memory and only then apply the LIMIT. Check this blog post for more detailed info : http://thelastpickle.com/blog/2017/03/07/The-limit-clause-in-cassandra-might-not-work-as-you-think.html
This can lead you to read the whole 450MB and all the tombstones even though you're only targeting a few rows in the partition. Large partitions also create heap pressure during compactions, which will issue warnings in the logs (look for "large partition").

You should remove the delete/insert logged batch, as it spreads over multiple partitions, which is bad for many reasons. It gives you no real atomicity, only the guarantee that if one query succeeds, the rest of the queries will eventually succeed (and that can take some time, leaving the cluster in an inconsistent state in the meantime). Logged batches also carry a lot of overhead, one part of it being that the queries are first written to the batchlog table, replicated to 2 other nodes, and then deleted once the batch has completed. You'd better turn those into async queries with an external retry mechanism (see the sketch further down).

Tuning the GC should help you cope with your data modeling issues. For safety reasons, change the GC settings on one canary node only, then observe and compare its behavior over a full day. If the results are satisfying, generalize to the rest of the cluster. The node needs to go through peak load so you can be sure the new settings are actually fixing your issues.
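
To make the GC change concrete, here is roughly what I'd put in cassandra-env.sh on 2.2 (assuming you set your GC flags there). The heap, new gen and tenuring values are the ones I mentioned above; the survivor ratio and CMS occupancy values are just the stock defaults, so treat this as a starting point and validate it against full GC logs on the canary node :

    MAX_HEAP_SIZE="16G"
    HEAP_NEWSIZE="6G"

    # Drop the G1 flags (-XX:+UseG1GC, -XX:G1RSetUpdatingPauseTimePercent,
    # -XX:MaxGCPauseMillis, -XX:InitiatingHeapOccupancyPercent) and use CMS/ParNew instead:
    JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
    JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
    JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
    JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
    JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=5"
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
    # Keep this one, it helps CMS as well:
    JVM_OPTS="$JVM_OPTS -XX:+ParallelRefProcEnabled"

Enabling -XX:+PrintGCDetails and -XX:+PrintTenuringDistribution on the canary will tell you whether the 6GB new gen and a tenuring threshold of 5 actually keep your read workload out of the old gen.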
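
And to illustrate what I mean by turning the logged batch into async queries with retries, here's a rough sketch using the DataStax Java driver 3.x. I'm guessing at your client language, and the table/column names simply mirror the simplified schema you posted, so adapt everything to your actual code :

    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.Statement;

    public class SlugUpdater {
        private final Session session;
        private final PreparedStatement deleteStmt;
        private final PreparedStatement insertStmt;

        public SlugUpdater(Session session) {
            this.session = session;
            // Table and column names mirror the simplified schema from this thread.
            this.deleteStmt = session.prepare(
                "DELETE FROM table WHERE app = ? AND platform = ? AND slug = ? AND partition = ? AND user_id = ?");
            this.insertStmt = session.prepare(
                "INSERT INTO table (app, platform, slug, partition, user_id, attributes, state, timezone, version) "
                + "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)");
        }

        // Move a user from one slug to another without a logged batch:
        // fire the delete and the insert as two independent queries and retry each one on its own.
        public void moveUser(String app, String platform, String oldSlug, String newSlug,
                             int partition, String userId, Row current) {
            BoundStatement delete = deleteStmt.bind(app, platform, oldSlug, partition, userId);
            BoundStatement insert = insertStmt.bind(app, platform, newSlug, partition, userId,
                    current.getBytes("attributes"), current.getInt("state"),
                    current.getString("timezone"), current.getInt("version"));
            executeWithRetry(delete, 3);
            executeWithRetry(insert, 3);
        }

        private void executeWithRetry(Statement statement, int maxAttempts) {
            for (int attempt = 1; ; attempt++) {
                try {
                    // executeAsync + getUninterruptibly behaves like a plain execute() here;
                    // in real code you would keep the futures and wait on them in bulk.
                    session.executeAsync(statement).getUninterruptibly();
                    return;
                } catch (RuntimeException e) {
                    if (attempt >= maxAttempts) {
                        // After a few attempts, give up here and let the caller hand the
                        // statement to an external retry mechanism (persistent queue, replay log, ...).
                        throw e;
                    }
                }
            }
        }
    }

The point is that the delete and the insert are sent independently, each one is retried on its own, and anything that still fails after a few attempts goes to whatever external retry mechanism you choose instead of the batchlog.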
Cheers,

On Tue, Jun 6, 2017 at 4:22 PM Vincent Rischmann <m...@vrischmann.me> wrote:

> Hi Alexander.
>
> Yeah, the minor GCs I see are usually around 300ms but sometimes jumping to 1s or even more.
>
> Hardware specs are:
> - 8 core CPUs
> - 32 GB of RAM
> - 4 SSDs in hardware Raid 0, around 3TB of space per node
>
> GC settings: -Xmx12G -Xms12G -XX:+UseG1GC -XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=70 -XX:ParallelGCThreads=8 -XX:ConcGCThreads=8 -XX:+ParallelRefProcEnabled
>
> According to the graphs, there is approximately one Young GC every 10s or so, and almost no Full GCs (for example the last one was 2h45 after the previous one).
>
> Computed from the log files, the average Young GC seems to be around 280ms and the max is 2.5s. The average Full GC seems to be around 4.6s and the max is 5.3s. I only computed this on one node but the problem occurs on every node as far as I can see.
>
> I'm open to tuning the GC, I stuck with the defaults (that I think I saw in the cassandra conf, I'm not sure).
>
> The number of SSTables looks ok, p75 is at 4 (as is the max for that matter). Partition size is a problem, yeah: this particular table from which we read a lot has a max partition size of 450 MB. I've known about this problem for a long time actually, we already did a bunch of work reducing partition size I think a year ago, but this particular table is tricky to change.
>
> One thing to note about this table is that we do a ton of DELETEs regularly (that we can't really stop doing without completely redesigning the table), so we have a ton of tombstones too. We have a lot of warnings about the tombstone threshold when we do our selects (things like "Read 2001 live and 2528 tombstone cells"). I suppose this could be a factor ?
>
> Each query reads from a single partition key, yes, but as said we issue a lot of them at the same time.
>
> The table looks like this (simplified):
>
> CREATE TABLE table (
>     app text,
>     platform text,
>     slug text,
>     partition int,
>     user_id text,
>     attributes blob,
>     state int,
>     timezone text,
>     version int,
>     PRIMARY KEY ((app, platform, slug, partition), user_id)
> ) WITH CLUSTERING ORDER BY (user_id ASC)
>
> And the main queries are:
>
> SELECT app, platform, slug, partition, user_id, attributes, state, timezone, version
> FROM table WHERE app = ? AND platform = ? AND slug = ? AND partition = ? LIMIT ?
>
> SELECT app, platform, slug, partition, user_id, attributes, state, timezone, version
> FROM table WHERE app = ? AND platform = ? AND slug = ? AND partition = ? AND user_id >= ? LIMIT ?
>
> partition is basically an integer that goes from 0 to 15, and we always select the 16 partitions in parallel.
>
> Note that we write constantly to this table, to update some fields and insert the user into the new "slug" (a slug is an amalgamation of different parameters like state, timezone etc. that allows us to efficiently query all users from a particular "app" with a given "slug". At least that's the idea; as seen here it causes us some trouble).
>
> And yes, we do use batches to write this data, this is how we process each user update:
> - SELECT from a "master" slug to get the fields we need
> - from that, compute a list of slugs the user had and a list of slugs the user should have (for example if he changes timezone we have to update the slug)
> - delete the user from the slug he shouldn't be in and insert the user where he should be.
> The last part, delete/insert, is done in a logged batch.
>
> I hope it's relatively clear.
>
> On Tue, Jun 6, 2017, at 02:46 PM, Alexander Dejanovski wrote:
>
> Hi Vincent,
>
> dropped messages are indeed common in case of long GC pauses. Having 4s to 6s pauses is not normal and is the sign of an unhealthy cluster. Minor GCs are usually faster but you can have long ones too.
>
> If you can share your hardware specs along with your current GC settings (CMS or G1, heap size, young gen size) and a distribution of GC pauses (rate of minor GCs, average and max duration of GCs), we could try to help you tune your heap settings. You can activate full GC logging, which could help in fine tuning MaxTenuringThreshold and survivor space sizing.
>
> You should also check for max partition sizes and the number of SSTables accessed per read. Run nodetool cfstats/cfhistograms on your tables to get both. p75 should be less than or equal to 4 in number of SSTables, and you shouldn't have partitions over... let's say 300 MB. Partitions > 1GB are a critical problem to address.
>
> Other things to consider are :
> Do you read from a single partition for each query ?
> Do you use collections that could spread over many SSTables ?
> Do you use batches for writes (although your problem doesn't seem to be write related) ?
> Can you share the queries from your scheduled selects and the data model ?
> Cheers,
>
> On Tue, Jun 6, 2017 at 2:33 PM Vincent Rischmann <m...@vrischmann.me> wrote:
>
> Hi,
>
> we have a cluster of 11 nodes running Cassandra 2.2.9 where we regularly get READ messages dropped:
>
> > READ messages were dropped in last 5000 ms: 974 for internal timeout and 0 for cross node timeout
>
> Looking at the logs, some are logged at the same time as Old Gen GCs. These GCs all take around 4 to 6s to run. To me, it's "normal" that these could cause reads to be dropped. However, we also have reads dropped without Old Gen GCs occurring, only Young Gen.
>
> I'm wondering if anyone has a good way of determining what the _root_ cause could be. Up until now, the only way we managed to decrease the load on our cluster was by guessing some stuff, trying it out and being lucky, essentially. I'd love a way to make sure what the problem is before tackling it. Doing schema changes is not a problem, but changing stuff blindly is not super efficient :)
>
> What I do see in the logs is that these drops happen almost exclusively when we do a lot of SELECTs. The times logged almost always correspond to times when our scheduled SELECTs are happening. That narrows the scope a little, but still.
>
> Anyway, I'd appreciate any information about troubleshooting this scenario.
> Thanks.
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com

--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com