Hi Alexander.

Yeah, the minor GCs I see are usually around 300ms, but they sometimes jump to 1s or even more.

Hardware specs:
- 8-core CPUs
- 32 GB of RAM
- 4 SSDs in hardware RAID 0, around 3 TB of space per node

GC settings:
-Xmx12G -Xms12G -XX:+UseG1GC -XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=70
-XX:ParallelGCThreads=8 -XX:ConcGCThreads=8 -XX:+ParallelRefProcEnabled

According to the graphs, there is approximately one Young GC every 10s or so, and almost no Full GCs (for example, the last one was 2h45 after the previous one). Computed from the log files, the average Young GC seems to be around 280ms and the max is 2.5s. The average Full GC seems to be around 4.6s and the max is 5.3s. I only computed this on one node, but the problem occurs on every node as far as I can see.

I'm open to tuning the GC; I stuck with the defaults (which I think I saw in the Cassandra conf, I'm not sure).

The number of SSTables looks OK: p75 is at 4 (as is the max, for that matter).

Partition size is a problem, yeah: this particular table, from which we read a lot, has a max partition size of 450 MB. I've known about this problem for a long time actually; we already did a bunch of work reducing partition sizes about a year ago, but this particular table is tricky to change.

One thing to note about this table is that we regularly do a ton of DELETEs (which we can't really stop doing short of completely redesigning the table), so we have a ton of tombstones too. We get a lot of warnings about the tombstone threshold when we run our SELECTs (things like "Read 2001 live and 2528 tombstone cells"). I suppose this could be a factor?

Each query reads from a single partition key, yes, but as I said, we issue a lot of them at the same time. The table looks like this (simplified):
CREATE TABLE table (
    app        text,
    platform   text,
    slug       text,
    partition  int,
    user_id    text,
    attributes blob,
    state      int,
    timezone   text,
    version    int,
    PRIMARY KEY ((app, platform, slug, partition), user_id)
) WITH CLUSTERING ORDER BY (user_id ASC);

And the main queries are:

SELECT app, platform, slug, partition, user_id, attributes, state, timezone, version
FROM table
WHERE app = ? AND platform = ? AND slug = ? AND partition = ?
LIMIT ?;

SELECT app, platform, slug, partition, user_id, attributes, state, timezone, version
FROM table
WHERE app = ? AND platform = ? AND slug = ? AND partition = ? AND user_id >= ?
LIMIT ?;

partition is basically an integer that goes from 0 to 15, and we always select the 16 partitions in parallel.

Note that we write constantly to this table, to update some fields and to insert the user into the new "slug" (a slug is an amalgamation of different parameters like state, timezone, etc. that allows us to efficiently query all users of a particular "app" with a given "slug" — at least that's the idea; as seen here, it causes us some trouble).

And yes, we do use batches to write this data. This is how we process each user update:
- SELECT from a "master" slug to get the fields we need
- from that, compute the list of slugs the user had and the list of slugs the user should have (for example, if he changes timezone we have to update the slug)
- delete the user from the slugs he shouldn't be in and insert the user where he should be

The last part, the delete/insert, is done in a logged batch.

I hope it's relatively clear.

On Tue, Jun 6, 2017, at 02:46 PM, Alexander Dejanovski wrote:
> Hi Vincent,
>
> dropped messages are indeed common in case of long GC pauses.
> Having 4s to 6s pauses is not normal and is the sign of an unhealthy
> cluster.
> Minor GCs are usually faster, but you can have long ones too.
>
> If you can share your hardware specs along with your current GC
> settings (CMS or G1, heap size, young gen size) and a distribution of
> GC pauses (rate of minor GCs, average and max duration of GCs), we
> could try to help you tune your heap settings. You can activate full
> GC logging, which could help in fine tuning MaxTenuringThreshold and
> survivor space sizing.
>
> You should also check for max partition sizes and the number of
> SSTables accessed per read. Run nodetool cfstats/cfhistograms on your
> tables to get both. p75 should be less than or equal to 4 in number of
> SSTables, and you shouldn't have partitions over... let's say 300 MB.
> Partitions over 1 GB are a critical problem to address.
>
> Other things to consider are:
> - Do you read from a single partition for each query?
> - Do you use collections that could spread over many SSTables?
> - Do you use batches for writes (although your problem doesn't seem
>   to be write related)?
> - Can you share the queries from your scheduled selects and the data
>   model?
>
> Cheers,
>
> On Tue, Jun 6, 2017 at 2:33 PM Vincent Rischmann
> <m...@vrischmann.me> wrote:
>> Hi,
>>
>> we have a cluster of 11 nodes running Cassandra 2.2.9 where we
>> regularly get READ messages dropped:
>>
>> > READ messages were dropped in last 5000 ms: 974 for internal
>> > timeout and 0 for cross node timeout
>>
>> Looking at the logs, some are logged at the same time as Old Gen GCs.
>> These GCs all take around 4 to 6s to run. To me, it's "normal" that
>> these could cause reads to be dropped. However, we also have reads
>> dropped without Old Gen GCs occurring, only Young Gen.
>>
>> I'm wondering if anyone has a good way of determining what the _root_
>> cause could be. Up until now, the only way we managed to decrease the
>> load on our cluster was by guessing some stuff, trying it out and
>> being lucky, essentially.
>> I'd love a way to make sure what the problem is before tackling it.
>> Doing schema changes is not a problem, but changing stuff blindly is
>> not super efficient :)
>>
>> What I do see in the logs is that these happen almost exclusively
>> when we do a lot of SELECTs. The times logged almost always
>> correspond to times when our scheduled SELECTs are happening. That
>> narrows the scope a little, but still.
>>
>> Anyway, I'd appreciate any information about troubleshooting this
>> scenario.
>>
>> Thanks.
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com

Links:
1. http://www.thelastpickle.com/
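As an aside, the slug-reconciliation step Vincent describes (compute the slugs the user had versus the slugs he should have, then DELETE/INSERT the difference in a logged batch) boils down to a set difference. Here is a minimal sketch in Python — the slug values, the `diff_slugs` helper, and the statement text are hypothetical illustrations, not the actual application code:

```python
# Sketch of the per-user update flow described above (hypothetical data):
# 1. read the user's current slugs from the "master" slug,
# 2. compute the slugs the user should have,
# 3. diff the two sets to get the DELETEs and INSERTs for the logged batch.

def diff_slugs(current_slugs, desired_slugs):
    """Return (to_delete, to_insert) as sorted lists of slug values."""
    current = set(current_slugs)
    desired = set(desired_slugs)
    return sorted(current - desired), sorted(desired - current)

# Example: the user changed timezone, so the timezone-derived slug changes.
to_delete, to_insert = diff_slugs(
    ["state:active|tz:utc", "state:active|tz:cet"],   # slugs the user had
    ["state:active|tz:utc", "state:active|tz:pst"],   # slugs the user should have
)

# Each entry in to_delete becomes a DELETE and each entry in to_insert an
# INSERT against the (app, platform, slug, partition) partition key, all
# grouped into a single logged BATCH.
print(to_delete)  # ['state:active|tz:cet']
print(to_insert)  # ['state:active|tz:pst']
```

Since the DELETE and the INSERT target different partitions (the slug is part of the partition key), a logged batch here buys atomicity across partitions at the cost of the batch-log overhead — which, combined with the constant DELETE churn, also explains the tombstone volume on this table.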