Thanks Alexander for the help, lots of good info in there. I'll try to switch back to CMS and see how it fares.
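For reference, here's roughly what I'm thinking of putting in cassandra-env.sh on a single canary node first, following your 16GB heap / 6GB new gen / MaxTenuringThreshold = 5 numbers (the SurvivorRatio and CMS occupancy values below are just my starting guesses, not something you recommended):

    -Xms16G -Xmx16G -Xmn6G
    -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
    -XX:MaxTenuringThreshold=5 -XX:SurvivorRatio=8
    -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
    -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled

I'll also look at replacing the delete/insert logged batch with plain async writes plus a bounded retry, something along these lines with the DataStax Java driver 3.x (just a sketch: the statements are placeholders for our real prepared statements, and a real version would push writes that keep failing onto a retry queue instead of throwing):

    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.Statement;

    public class SlugUpdater {

        private static final int MAX_ATTEMPTS = 3;

        // Execute a statement asynchronously, retrying a few times on failure.
        static void executeWithRetry(Session session, Statement stmt) {
            for (int attempt = 1; ; attempt++) {
                ResultSetFuture future = session.executeAsync(stmt);
                try {
                    future.getUninterruptibly();
                    return;
                } catch (RuntimeException e) {
                    if (attempt >= MAX_ATTEMPTS) {
                        // Give up and surface the error so the caller can
                        // queue the write for a later retry.
                        throw e;
                    }
                }
            }
        }

        // Previously: BEGIN BATCH <delete from old slug> <insert into new slug> APPLY BATCH.
        // Now: two independent writes, each retried on its own.
        void moveUser(Session session, Statement deleteFromOldSlug, Statement insertIntoNewSlug) {
            executeWithRetry(session, deleteFromOldSlug);
            executeWithRetry(session, insertIntoNewSlug);
        }
    }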
On Tue, Jun 6, 2017, at 05:06 PM, Alexander Dejanovski wrote:
> Hi Vincent,
>
> it is very clear, thanks for all the info.
>
> I would not stick with G1 in your case, as it requires much more heap to
> perform correctly (>24GB). CMS/ParNew should be much more efficient here,
> and I would go with some settings I usually apply on big workloads:
> 16GB heap / 6GB new gen / MaxTenuringThreshold = 5
>
> Large partitions are indeed putting pressure on your heap, and so are
> tombstones. One of your queries is particularly problematic:
>
> SELECT app, platform, slug, partition, user_id, attributes, state, timezone, version
> FROM table WHERE app = ? AND platform = ? AND slug = ? AND partition = ? LIMIT ?
>
> Although you're using the LIMIT clause, it will read the whole partition,
> merge it in memory and only then apply the LIMIT. Check this blog post for
> more detailed info:
> http://thelastpickle.com/blog/2017/03/07/The-limit-clause-in-cassandra-might-not-work-as-you-think.html
>
> This can lead you to read the whole 450MB and all the tombstones even
> though you're only targeting a few rows in the partition. Large partitions
> also create heap pressure during compactions, which will issue warnings in
> the logs (look for "large partition").
>
> You should remove the delete/insert logged batch as it spreads over
> multiple partitions, which is bad for many reasons. It gives you no real
> atomicity, just the guarantee that if one query succeeds, then the rest of
> the queries will eventually succeed (and that could possibly take some
> time, leaving the cluster in an inconsistent state in the meantime).
> Logged batches have a lot of overhead, one part of which is a write of the
> queries to the batchlog table, which is replicated to 2 other nodes and
> then deleted after the batch has completed. You'd be better off turning
> those into async queries with an external retry mechanism.
>
> Tuning the GC should help you cope with your data modeling issues.
>
> For safety reasons, only change the GC settings on one canary node, then
> observe and compare its behavior over a full day. If the results are
> satisfying, generalize to the rest of the cluster. You need to experience
> peak load to make sure the new settings are fixing your issues.
>
> Cheers,
>
> On Tue, Jun 6, 2017 at 4:22 PM Vincent Rischmann <m...@vrischmann.me> wrote:
>> Hi Alexander.
>>
>> Yeah, the minor GCs I see are usually around 300ms, but sometimes they
>> jump to 1s or even more.
>>
>> Hardware specs are:
>> - 8 core CPUs
>> - 32 GB of RAM
>> - 4 SSDs in hardware RAID 0, around 3TB of space per node
>>
>> GC settings: -Xmx12G -Xms12G -XX:+UseG1GC -XX:G1RSetUpdatingPauseTimePercent=5
>> -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=70
>> -XX:ParallelGCThreads=8 -XX:ConcGCThreads=8 -XX:+ParallelRefProcEnabled
>>
>> According to the graphs, there is approximately one Young GC every 10s or
>> so, and almost no Full GCs (for example the last one was 2h45 after the
>> previous one).
>>
>> Computed from the log files, the average Young GC seems to be around
>> 280ms and the max is 2.5s. The average Full GC seems to be around 4.6s
>> and the max is 5.3s. I only computed this on one node, but the problem
>> occurs on every node as far as I can see.
>>
>> I'm open to tuning the GC; I stuck with the defaults (that I think I saw
>> in the Cassandra conf, I'm not sure).
>>
>> Number of SSTables looks ok, p75 is at 4 (as is the max for that matter).
>>
>> Partition size is a problem, yeah: this particular table from which we
>> read a lot has a max partition size of 450 MB. I've known about this
>> problem for a long time actually; we already did a bunch of work reducing
>> partition sizes, I think a year ago, but this particular table is tricky
>> to change.
>>
>> One thing to note about this table is that we do a ton of DELETEs
>> regularly (which we can't really stop doing short of completely
>> redesigning the table), so we have a ton of tombstones too. We get a lot
>> of warnings about the tombstone threshold when we do our selects (things
>> like "Read 2001 live and 2528 tombstone cells"). I suppose this could be
>> a factor?
>>
>> Each query reads from a single partition key, yes, but as I said we issue
>> a lot of them at the same time.
>>
>> The table looks like this (simplified):
>>
>> CREATE TABLE table (
>>     app text,
>>     platform text,
>>     slug text,
>>     partition int,
>>     user_id text,
>>     attributes blob,
>>     state int,
>>     timezone text,
>>     version int,
>>     PRIMARY KEY ((app, platform, slug, partition), user_id)
>> ) WITH CLUSTERING ORDER BY (user_id ASC)
>>
>> And the main queries are:
>>
>> SELECT app, platform, slug, partition, user_id, attributes, state, timezone, version
>> FROM table WHERE app = ? AND platform = ? AND slug = ? AND partition = ? LIMIT ?
>>
>> SELECT app, platform, slug, partition, user_id, attributes, state, timezone, version
>> FROM table WHERE app = ? AND platform = ? AND slug = ? AND partition = ? AND user_id >= ? LIMIT ?
>>
>> partition is basically an integer that goes from 0 to 15, and we always
>> select the 16 partitions in parallel.
>>
>> Note that we write constantly to this table, to update some fields and to
>> insert the user into the new "slug" (a slug is an amalgamation of
>> different parameters like state, timezone, etc. that allows us to
>> efficiently query all users of a particular "app" with a given "slug".
>> At least that's the idea; as seen here it causes us some trouble).
>>
>> And yes, we do use batches to write this data. This is how we process
>> each user update:
>> - SELECT from a "master" slug to get the fields we need
>> - from that, compute a list of slugs the user had and a list of slugs the
>>   user should have (for example if he changes timezone we have to update
>>   the slug)
>> - delete the user from the slug he shouldn't be in and insert the user
>>   where he should be
>> The last part, the delete/insert, is done in a logged batch.
>>
>> I hope it's relatively clear.
>>
>> On Tue, Jun 6, 2017, at 02:46 PM, Alexander Dejanovski wrote:
>>> Hi Vincent,
>>>
>>> dropped messages are indeed common in case of long GC pauses. Having 4s
>>> to 6s pauses is not normal and is the sign of an unhealthy cluster.
>>> Minor GCs are usually faster, but you can have long ones too.
>>>
>>> If you can share your hardware specs along with your current GC settings
>>> (CMS or G1, heap size, young gen size) and a distribution of GC pauses
>>> (rate of minor GCs, average and max duration of GCs), we could try to
>>> help you tune your heap settings. You can activate full GC logging,
>>> which could help in fine tuning MaxTenuringThreshold and survivor space
>>> sizing.
>>>
>>> You should also check max partition sizes and the number of SSTables
>>> accessed per read. Run nodetool cfstats/cfhistograms on your tables to
>>> get both. The p75 should be less than or equal to 4 SSTables, and you
>>> shouldn't have partitions over... let's say 300 MB.
>>> Partitions > 1GB are a critical problem to address.
>>>
>>> Other things to consider are:
>>> - Do you read from a single partition for each query?
>>> - Do you use collections that could spread over many SSTables?
>>> - Do you use batches for writes (although your problem doesn't seem to
>>>   be write related)?
>>> - Can you share the queries from your scheduled selects, and the data
>>>   model?
>>>
>>> Cheers,
>>>
>>> On Tue, Jun 6, 2017 at 2:33 PM Vincent Rischmann <m...@vrischmann.me> wrote:
>>>> Hi,
>>>>
>>>> we have a cluster of 11 nodes running Cassandra 2.2.9 where we
>>>> regularly get READ messages dropped:
>>>>
>>>> > READ messages were dropped in last 5000 ms: 974 for internal timeout
>>>> > and 0 for cross node timeout
>>>>
>>>> Looking at the logs, some are logged at the same time as Old Gen GCs.
>>>> These GCs all take around 4 to 6s to run. To me, it's "normal" that
>>>> these could cause reads to be dropped. However, we also have reads
>>>> dropped without Old Gen GCs occurring, only Young Gen.
>>>>
>>>> I'm wondering if anyone has a good way of determining what the _root_
>>>> cause could be. Up until now, the only way we managed to decrease load
>>>> on our cluster was essentially by guessing some stuff, trying it out
>>>> and being lucky. I'd love a way to make sure what the problem is before
>>>> tackling it. Doing schema changes is not a problem, but changing stuff
>>>> blindly is not super efficient :)
>>>>
>>>> What I do see in the logs is that these drops happen almost exclusively
>>>> when we do a lot of SELECTs. The times logged almost always correspond
>>>> to times when our scheduled SELECTs are happening. That narrows the
>>>> scope a little, but still.
>>>>
>>>> Anyway, I'd appreciate any information about troubleshooting this
>>>> scenario. Thanks.
>>>
>>> --
>>> -----------------
>>> Alexander Dejanovski
>>> France
>>> @alexanderdeja
>>>
>>> Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com