Hi Alexander.

Yeah, the minor GCs I see are usually around 300ms but sometimes jump
to 1s or even more.
Hardware specs are:
  - 8-core CPUs
  - 32 GB of RAM
  - 4 SSDs in hardware RAID 0, around 3 TB of space per node
 
GC settings:
    -Xmx12G -Xms12G -XX:+UseG1GC -XX:G1RSetUpdatingPauseTimePercent=5
    -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=70
    -XX:ParallelGCThreads=8 -XX:ConcGCThreads=8 -XX:+ParallelRefProcEnabled
According to the graphs, there is roughly one Young GC every 10s or so,
and almost no Full GCs (for example, the last one happened 2h45 after
the previous one).
Computed from the log files, the average Young GC pause seems to be
around 280ms and the max is 2.5s. The average Full GC pause seems to be
around 4.6s and the max is 5.3s.
I only computed this on one node but the problem occurs on every node as
far as I can see.
I'm open to tuning the GC; so far I've stuck with the defaults (which I
think come from the Cassandra config, but I'm not sure).
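I'll start by activating full GC logging as you suggest; I assume
something like this in cassandra-env.sh is what you mean (standard
Java 8 flag names as far as I know, and the log path is just an
example):

    -Xloggc:/var/log/cassandra/gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintHeapAtGC
    -XX:+PrintTenuringDistribution
    -XX:+PrintGCApplicationStoppedTime
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=10
    -XX:GCLogFileSize=10M

PrintTenuringDistribution in particular should give us the data for the
MaxTenuringThreshold and survivor space sizing you mentioned.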
The number of SSTables looks OK: p75 is at 4 (as is the max, for that
matter). Partition size is a problem, yeah: this particular table, which
we read from a lot, has a max partition size of 450 MB. I've known about
this problem for a long time actually; we did a bunch of work reducing
partition sizes about a year ago, but this particular table is tricky to
change.
One thing to note about this table is that we regularly do a ton of
DELETEs (which we can't really stop doing short of completely
redesigning the table), so we have a ton of tombstones too. We get a lot
of warnings about the tombstone threshold when we do our selects (things
like "Read 2001 live and 2528 tombstone cells"). I suppose this could be
a factor?
Each query reads from a single partition key, yes, but as mentioned we
issue a lot of them at the same time.
The table looks like this (simplified):

CREATE TABLE table (
    app text,
    platform text,
    slug text,
    partition int,
    user_id text,
    attributes blob,
    state int,
    timezone text,
    version int,
    PRIMARY KEY ((app, platform, slug, partition), user_id)
) WITH CLUSTERING ORDER BY (user_id ASC)

And the main queries are:

    SELECT app, platform, slug, partition, user_id, attributes, state, timezone, version
    FROM table
    WHERE app = ? AND platform = ? AND slug = ? AND partition = ?
    LIMIT ?

    SELECT app, platform, slug, partition, user_id, attributes, state, timezone, version
    FROM table
    WHERE app = ? AND platform = ? AND slug = ? AND partition = ? AND user_id >= ?
    LIMIT ?
partition is basically an integer that goes from 0 to 15, and we always
select the 16 partitions in parallel.
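To make that concrete, the read fan-out is essentially this (a
simplified sketch using the DataStax Java driver, not our actual code;
the class name, keyspace and contact point are made up):

    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import java.util.ArrayList;
    import java.util.List;

    public class SlugFanOut {
        // One async query per partition bucket (0..15), all in flight at once.
        static List<ResultSetFuture> fetchSlugPage(Session session, PreparedStatement select,
                                                   String app, String platform, String slug, int limit) {
            List<ResultSetFuture> futures = new ArrayList<>();
            for (int partition = 0; partition < 16; partition++) {
                futures.add(session.executeAsync(select.bind(app, platform, slug, partition, limit)));
            }
            return futures;
        }

        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("my_keyspace");
            PreparedStatement select = session.prepare(
                "SELECT app, platform, slug, partition, user_id, attributes, state, timezone, version "
                + "FROM table WHERE app = ? AND platform = ? AND slug = ? AND partition = ? LIMIT ?");

            for (ResultSetFuture future : fetchSlugPage(session, select, "someapp", "ios", "someslug", 2000)) {
                ResultSet rows = future.getUninterruptibly(); // waits for that partition's page
                // process rows...
            }
            cluster.close();
        }
    }
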
Note that we write constantly to this table, to update some fields or to
insert the user into a new "slug" (a slug is an amalgamation of
different parameters like state, timezone, etc. that lets us efficiently
query all users of a particular "app" with a given "slug". At least
that's the idea; as seen here, it causes us some trouble).
And yes, we do use batches to write this data. This is how we process
each user update:
  - SELECT from a "master" slug to get the fields we need
  - from that, compute the list of slugs the user had and the list of
    slugs the user should have (for example, if he changes timezone we
    have to update the slug)
  - delete the user from the slug he shouldn't be in and insert him into
    the slug he should be in
The last part, the delete/insert, is done in a logged batch.
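Concretely, that last step is essentially this (again a simplified
sketch with the Java driver, not our exact code; the prepared statement
names and their CQL are just for illustration):

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;
    import java.nio.ByteBuffer;

    class SlugMove {
        // Assumed prepared statements:
        //   deleteFromSlug: DELETE FROM table WHERE app = ? AND platform = ? AND slug = ?
        //                   AND partition = ? AND user_id = ?
        //   insertIntoSlug: INSERT INTO table (app, platform, slug, partition, user_id,
        //                   attributes, state, timezone, version) VALUES (?,?,?,?,?,?,?,?,?)
        static void moveUser(Session session,
                             PreparedStatement deleteFromSlug, PreparedStatement insertIntoSlug,
                             String app, String platform, int partition, String userId,
                             String oldSlug, String newSlug,
                             ByteBuffer attributes, int state, String timezone, int version) {
            // Logged batch: if any part of it gets applied, Cassandra ensures the
            // rest is eventually applied too.
            BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
            batch.add(deleteFromSlug.bind(app, platform, oldSlug, partition, userId));
            batch.add(insertIntoSlug.bind(app, platform, newSlug, partition, userId,
                                          attributes, state, timezone, version));
            session.execute(batch);
        }
    }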

I hope it's relatively clear.

On Tue, Jun 6, 2017, at 02:46 PM, Alexander Dejanovski wrote:
> Hi Vincent, 
> 
> dropped messages are indeed common in case of long GC pauses. 
> Having 4s to 6s pauses is not normal and is the sign of an unhealthy
> cluster. Minor GCs are usually faster but you can have long ones too.
> 
> If you can share your hardware specs along with your current GC
> settings (CMS or G1, heap size, young gen size) and a distribution of
> GC pauses (rate of minor GCs, average and max duration of GCs) we
> could try to help you tune your heap settings.
> You can activate full GC logging which could help in fine tuning
> MaxTenuringThreshold and survivor space sizing.
> 
> You should also check for max partition sizes and number of SSTables
> accessed per read. Run nodetool cfstats/cfhistograms on your tables to
> get both. p75 should be less or equal to 4 in number of SSTables and
> you shouldn't have partitions over... let's say 300 MBs. Partitions >
> 1GB are a critical problem to address.
> 
> Other things to consider are : 
> Do you read from a single partition for each query ? 
> Do you use collections that could spread over many SSTables ? 
> Do you use batches for writes (although your problem doesn't seem to
> be write related) ?
> Can you share the queries from your scheduled selects and the
> data model ?
> 
> Cheers,
> 
> 
> On Tue, Jun 6, 2017 at 2:33 PM Vincent Rischmann
> <m...@vrischmann.me> wrote:
>> 
>> Hi,
>> 
>> we have a cluster of 11 nodes running Cassandra 2.2.9 where we
>> regularly get READ messages dropped:
>> 
>> > READ messages were dropped in last 5000 ms: 974 for internal
>> > timeout and 0 for cross node timeout
>> 
>> Looking at the logs, some are logged at the same time as Old Gen GCs.
>> These GCs all take around 4 to 6s to run. To me, it's "normal" that
>> these could cause reads to be dropped.
>> However, we also have reads dropped without Old Gen GCs occurring,
>> only Young Gen.
>> 
>> I'm wondering if anyone has a good way of determining what the _root_
>> cause could be. Up until now, the only way we managed to decrease
>> load on our cluster was by guessing some stuff, trying it out and
>> being lucky, essentially. I'd love a way to make sure what the
>> problem is before tackling it. Doing schema changes is not a problem,
>> but changing stuff blindly is not super efficient :)
>> 
>> What I do see in the logs, is that these happen almost exclusively
>> when we do a lot of SELECTs. The times logged almost always correspond
>> to times when our scheduled SELECTs are happening. That narrows the
>> scope a little, but still.
>> 
>> Anyway, I'd appreciate any information about troubleshooting this
>> scenario.
>> Thanks.
> -- 
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
> 
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com

