Thanks Alexander for the help, lots of good info in there.

I'll try to switch back to CMS and see how it fares.


On Tue, Jun 6, 2017, at 05:06 PM, Alexander Dejanovski wrote:
> Hi Vincent,
> 
> it is very clear, thanks for all the info.
> 
> I would not stick with G1 in your case, as it requires much more heap
> to perform correctly (>24GB). CMS/ParNew should be much more efficient
> here, and I would go with some settings I usually apply on big
> workloads: 16GB heap / 6GB new gen / MaxTenuringThreshold = 5
> 
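> For reference, on a stock install that would translate to something
> along these lines in cassandra-env.sh / the JVM options (the standard
> CMS flag set; exact values to be validated on a canary node first):
> 
>     -Xms16G -Xmx16G -Xmn6G
>     -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>     -XX:+CMSParallelRemarkEnabled
>     -XX:SurvivorRatio=8
>     -XX:MaxTenuringThreshold=5
>     -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
>     -XX:+ParallelRefProcEnabled
> 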
> Large partitions are indeed putting pressure on your heap, and
> tombstones as well.
> 
> One of your queries is particularly problematic:
> 
>     SELECT app, platform, slug, partition, user_id, attributes,
>            state, timezone, version
>     FROM table
>     WHERE app = ? AND platform = ? AND slug = ? AND partition = ?
>     LIMIT ?
> 
> Although you're using the LIMIT clause, it will read the whole
> partition, merge it in memory, and only then apply the LIMIT. Check
> this blog post for more detailed info:
> http://thelastpickle.com/blog/2017/03/07/The-limit-clause-in-cassandra-might-not-work-as-you-think.html
> This can lead you to read the whole 450MB and all the tombstones even
> though you're only targeting a few rows in the partition.
> 
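> If you want to experiment on the read path, your second query (the one
> bounded on user_id) is the shape to lean on, since it can resume from a
> clustering position instead of always asking for the head of the
> partition. Purely as an illustration, a rough sketch (Java driver 3.x,
> placeholder names and slice size, untested):
> 
>     import com.datastax.driver.core.*;
> 
>     // Sketch only: walking one (app, platform, slug, partition) in bounded
>     // slices by resuming from the last user_id seen, using the user_id >= ?
>     // query from this thread. Keyspace/table names and slice size are
>     // placeholders.
>     class SliceReader {
>         private final Session session;
>         private final PreparedStatement slice;
> 
>         SliceReader(Session session) {
>             this.session = session;
>             this.slice = session.prepare(
>                 "SELECT app, platform, slug, partition, user_id, attributes, "
>               + "state, timezone, version "
>               + "FROM ks.tbl "
>               + "WHERE app = ? AND platform = ? AND slug = ? AND partition = ? "
>               + "AND user_id >= ? LIMIT ?");
>         }
> 
>         void readPartition(String app, String platform, String slug, int partition) {
>             String lastUserId = "";     // empty string sorts before any real user_id
>             final int sliceSize = 500;  // arbitrary
>             while (true) {
>                 ResultSet rs = session.execute(
>                     slice.bind(app, platform, slug, partition, lastUserId, sliceSize));
>                 int seen = 0;
>                 for (Row row : rs) {
>                     lastUserId = row.getString("user_id");
>                     seen++;
>                     // ... process the row; note that '>=' re-reads the boundary
>                     // row on the next slice, so dedupe it or switch to '>'
>                 }
>                 if (seen < sliceSize) {
>                     break;              // fewer rows than requested: end of partition
>                 }
>             }
>         }
>     }
> 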
> Large partitions are also creating heap pressure during compactions,
> which will issue warnings in the logs (look for "large partition").
> 
> You should remove the delete/insert logged batch, as it spreads over
> multiple partitions, which is bad for many reasons. It gives you no
> real atomicity, just the guarantee that if one query succeeds, then
> the rest of the queries will eventually succeed (and that could take
> some time, leaving the cluster in an inconsistent state in the
> meantime). Logged batches also have a lot of overhead, one part of
> which is writing the queries to the batchlog table, which is
> replicated to 2 other nodes and then deleted after the batch has
> completed.
> 
> You'd be better off turning those into async queries with an external
> retry mechanism.
> 
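> As a rough illustration only (Java driver 3.x; the retry policy here is
> deliberately naive, adapt the error handling and backoff to your
> needs), each statement of the former batch could be fired like this:
> 
>     import com.datastax.driver.core.*;
>     import com.google.common.util.concurrent.FutureCallback;
>     import com.google.common.util.concurrent.Futures;
> 
>     // Sketch: execute a single statement asynchronously and retry it a few
>     // times on failure, instead of wrapping it in a logged batch.
>     class AsyncWriter {
>         private final Session session;
>         private static final int MAX_ATTEMPTS = 3;
> 
>         AsyncWriter(Session session) {
>             this.session = session;
>         }
> 
>         // usage: executeWithRetry(deleteStmt.bind(...), 1) and
>         //        executeWithRetry(insertStmt.bind(...), 1)
>         void executeWithRetry(final Statement stmt, final int attempt) {
>             Futures.addCallback(session.executeAsync(stmt), new FutureCallback<ResultSet>() {
>                 @Override
>                 public void onSuccess(ResultSet rs) {
>                     // write applied at the statement's consistency level
>                 }
> 
>                 @Override
>                 public void onFailure(Throwable t) {
>                     if (attempt < MAX_ATTEMPTS) {
>                         executeWithRetry(stmt, attempt + 1);  // naive immediate retry
>                     } else {
>                         // give up: log and/or push to a persistent retry queue
>                     }
>                 }
>             });
>         }
>     }
> 
> Each statement is fired and retried independently; a persistent retry
> queue on your side then plays the role the batchlog was playing.
> 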
> Tuning the GC should help you cope with your data modeling issues.
> 
> For safety reasons, only change the GC settings on one canary node,
> then observe and compare its behavior over a full day. If the results
> are satisfying, generalize to the rest of the cluster. You need to
> experience peak load to make sure the new settings are fixing your
> issues.
> 
> Cheers,
> 
> 
> 
> On Tue, Jun 6, 2017 at 4:22 PM Vincent Rischmann
> <m...@vrischmann.me> wrote:
>> 
>> Hi Alexander.
>> 
>> Yeah, the minor GCs I see are usually around 300ms but sometimes
>> jumping to 1s or even more.
>> 
>> Hardware specs are:
>>   - 8 core CPUs
>>   - 32 GB of RAM
>>   - 4 SSDs in hardware RAID 0, around 3TB of space per node
>>  
>> GC settings:
>> 
>>     -Xmx12G -Xms12G -XX:+UseG1GC
>>     -XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=200
>>     -XX:InitiatingHeapOccupancyPercent=70
>>     -XX:ParallelGCThreads=8 -XX:ConcGCThreads=8
>>     -XX:+ParallelRefProcEnabled
>> 
>> According to the graphs, there is approximately one Young GC every
>> 10s or so, and almost no Full GCs (for example, the last one was 2h45
>> after the previous one).
>> 
>> Computed from the log files, the average Young GC seems to be around
>> 280ms and the max is 2.5s. The average Full GC seems to be around
>> 4.6s and the max is 5.3s.
>> I only computed this on one node but the problem occurs on every node
>> as far as I can see.
>> 
>> I'm open to tuning the GC; I stuck with the defaults (which I think
>> came from the Cassandra config, but I'm not sure).
>> 
>> The number of SSTables looks OK, p75 is at 4 (as is the max, for that
>> matter). Partition size is a problem, yeah: this particular table
>> from which we read a lot has a max partition size of 450 MB. I've
>> known about this problem for a long time actually; we already did a
>> bunch of work reducing partition sizes about a year ago, but this
>> particular table is tricky to change.
>> 
>> One thing to note about this table is that we do a ton of DELETEs
>> regularly (which we can't really stop doing short of completely
>> redesigning the table), so we have a ton of tombstones too. We get a
>> lot of warnings about the tombstone threshold when we do our selects
>> (things like "Read 2001 live and 2528 tombstone cells"). I suppose
>> this could be a factor?
>> 
>> Each query reads from a single partition key, yes, but as I said we
>> issue a lot of them at the same time.
>> 
>> The table looks like this (simplified):
>> 
>> CREATE TABLE table (
>>     app text,
>>     platform text,
>>     slug text,
>>     partition int,
>>     user_id text,
>>     attributes blob,
>>     state int,
>>     timezone text,
>>     version int,
>>     PRIMARY KEY ((app, platform, slug, partition), user_id)
>> ) WITH CLUSTERING ORDER BY (user_id ASC)
>> 
>> And the main queries are:
>> 
>>     SELECT app, platform, slug, partition, user_id, attributes,
>>            state, timezone, version
>>     FROM table
>>     WHERE app = ? AND platform = ? AND slug = ? AND partition = ?
>>     LIMIT ?
>> 
>>     SELECT app, platform, slug, partition, user_id, attributes,
>>            state, timezone, version
>>     FROM table
>>     WHERE app = ? AND platform = ? AND slug = ? AND partition = ?
>>           AND user_id >= ?
>>     LIMIT ?
>> 
>> partition is basically an integer that goes from 0 to 15, and we
>> always select the 16 partitions in parallel.
>> 
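>> Roughly, the fan-out looks like this (Java driver 3.x, simplified;
>> the prepared statement here stands in for the first SELECT above, and
>> names are placeholders):
>> 
>>     import java.util.ArrayList;
>>     import java.util.List;
>>     import com.datastax.driver.core.*;
>>     import com.google.common.util.concurrent.Futures;
>> 
>>     // All 16 partition reads are issued at once and merged client-side.
>>     class SlugReader {
>>         private final Session session;
>>         private final PreparedStatement select;  // the first SELECT above, prepared
>> 
>>         SlugReader(Session session, PreparedStatement select) {
>>             this.session = session;
>>             this.select = select;
>>         }
>> 
>>         List<Row> readAllPartitions(String app, String platform, String slug,
>>                                     int limit) throws Exception {
>>             List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
>>             for (int partition = 0; partition < 16; partition++) {
>>                 futures.add(session.executeAsync(
>>                     select.bind(app, platform, slug, partition, limit)));
>>             }
>>             List<Row> rows = new ArrayList<Row>();
>>             // block until every partition has answered, then merge the rows
>>             for (ResultSet rs : Futures.allAsList(futures).get()) {
>>                 for (Row row : rs) {
>>                     rows.add(row);
>>                 }
>>             }
>>             return rows;
>>         }
>>     }
>> 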
>> Note that we write constantly to this table, to update some fields or
>> insert the user into a new "slug" (a slug is an amalgamation of
>> different parameters like state, timezone, etc. that allows us to
>> efficiently query all users from a particular "app" with a given
>> "slug". At least that's the idea; as seen here, it causes us some
>> trouble).
>> 
>> And yes, we do use batches to write this data. This is how we process
>> each user update:
>>   - SELECT from a "master" slug to get the fields we need
>>   - from that, compute a list of slugs the user had and a list of
>>     slugs the user should have (for example, if he changes timezone
>>     we have to update the slug)
>>   - delete the user from the slug he shouldn't be in and insert the
>>     user where he should be
>> The last part, the delete/insert, is done in a logged batch.
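>> 
>> Roughly (Java driver 3.x, simplified, with already-bound statements
>> standing in for our real ones):
>> 
>>     import com.datastax.driver.core.*;
>> 
>>     // The last step today: one logged batch carrying the DELETE from the
>>     // old slug and the INSERT into the new slug.
>>     class SlugMover {
>>         void moveUser(Session session,
>>                       BoundStatement deleteFromOldSlug,
>>                       BoundStatement insertIntoNewSlug) {
>>             BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
>>             batch.add(deleteFromOldSlug);
>>             batch.add(insertIntoNewSlug);
>>             session.execute(batch);
>>         }
>>     }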
>> 
>> I hope it's relatively clear.
>> 
>> On Tue, Jun 6, 2017, at 02:46 PM, Alexander Dejanovski wrote:
>>> Hi Vincent, 
>>> 
>>> Dropped messages are indeed common in case of long GC pauses.
>>> Having 4s to 6s pauses is not normal and is the sign of an
>>> unhealthy cluster. Minor GCs are usually faster but you can have
>>> long ones too.
>>> 
>>> If you can share your hardware specs along with your current GC
>>> settings (CMS or G1, heap size, young gen size) and a distribution
>>> of GC pauses (rate of minor GCs, average and max duration of GCs),
>>> we could try to help you tune your heap settings.
>>> You can activate full GC logging, which could help in fine-tuning
>>> MaxTenuringThreshold and survivor space sizing.
>>> 
>>> You should also check for max partition sizes and the number of
>>> SSTables accessed per read. Run nodetool cfstats/cfhistograms on
>>> your tables to get both. p75 should be less than or equal to 4 in
>>> number of SSTables, and you shouldn't have partitions over... let's
>>> say 300 MB. Partitions > 1GB are a critical problem to address.
>>> 
>>> Other things to consider are:
>>> Do you read from a single partition for each query?
>>> Do you use collections that could spread over many SSTables?
>>> Do you use batches for writes (although your problem doesn't seem to
>>> be write-related)?
>>> Can you share the queries from your scheduled selects and the data
>>> model?
>>> 
>>> Cheers,
>>> 
>>> 
>>> On Tue, Jun 6, 2017 at 2:33 PM Vincent Rischmann <m...@vrischmann.me>
>>> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> we have a cluster of 11 nodes running Cassandra 2.2.9 where we
>>>> regularly get READ messages dropped:
>>>> 
>>>> > READ messages were dropped in last 5000 ms: 974 for internal
>>>> > timeout and 0 for cross node timeout
>>>> 
>>>> Looking at the logs, some are logged at the same time as Old Gen
>>>> GCs. These GCs all take around 4 to 6s to run. To me, it's "normal"
>>>> that these could cause reads to be dropped.
>>>> However, we also have reads dropped without Old Gen GCs occurring,
>>>> only Young Gen.
>>>> 
>>>> I'm wondering if anyone has a good way of determining what the
>>>> _root_ cause could be. Up until now, the only way we managed to
>>>> decrease load on our cluster was essentially by guessing some
>>>> stuff, trying it out, and being lucky. I'd love a way to be sure
>>>> what the problem is before tackling it. Doing schema changes is not
>>>> a problem, but changing stuff blindly is not super efficient :)
>>>> 
>>>> What I do see in the logs is that these happen almost exclusively
>>>> when we do a lot of SELECTs. The times logged almost always
>>>> correspond to times when our scheduled SELECTs are happening. That
>>>> narrows the scope a little, but still.
>>>> 
>>>> Anyway, I'd appreciate any information about troubleshooting this
>>>> scenario.
>>>> Thanks.
>>> -- 
>>> -----------------
>>> Alexander Dejanovski
>>> France
>>> @alexanderdeja
>>> 
>>> Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>> 
> -- 
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
> 
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com

