Thanks Alexander for the help, lots of good info in there. I'll try to switch back to CMS and see how it fares.
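For reference, here's roughly what I'm thinking of putting in cassandra-env.sh on a single canary node first, following your 16GB heap / 6GB new gen / MaxTenuringThreshold = 5 numbers (the SurvivorRatio and CMS occupancy values below are just my starting guesses, not something you recommended):

    -Xms16G -Xmx16G -Xmn6G
    -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
    -XX:MaxTenuringThreshold=5 -XX:SurvivorRatio=8
    -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
    -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled

I'll also look at replacing the delete/insert logged batch with plain async writes plus a bounded retry, something along these lines with the DataStax Java driver 3.x (just a sketch: the statements are placeholders for our real prepared statements, and a real version would push writes that keep failing onto a retry queue instead of throwing):

    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.Statement;

    public class SlugUpdater {

        private static final int MAX_ATTEMPTS = 3;

        // Execute a statement asynchronously, retrying a few times on failure.
        static void executeWithRetry(Session session, Statement stmt) {
            for (int attempt = 1; ; attempt++) {
                ResultSetFuture future = session.executeAsync(stmt);
                try {
                    future.getUninterruptibly();
                    return;
                } catch (RuntimeException e) {
                    if (attempt >= MAX_ATTEMPTS) {
                        // Give up and surface the error so the caller can
                        // queue the write for a later retry.
                        throw e;
                    }
                }
            }
        }

        // Previously: BEGIN BATCH <delete from old slug> <insert into new slug> APPLY BATCH.
        // Now: two independent writes, each retried on its own.
        void moveUser(Session session, Statement deleteFromOldSlug, Statement insertIntoNewSlug) {
            executeWithRetry(session, deleteFromOldSlug);
            executeWithRetry(session, insertIntoNewSlug);
        }
    }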
On Tue, Jun 6, 2017, at 05:06 PM, Alexander Dejanovski wrote:
> Hi Vincent,
>
> it is very clear, thanks for all the info.
>
> I would not stick with G1 in your case, as it requires much more heap to
> perform correctly (>24GB). CMS/ParNew should be much more efficient here,
> and I would go with some settings I usually apply on big workloads:
> 16GB heap / 6GB new gen / MaxTenuringThreshold = 5
>
> Large partitions are indeed putting pressure on your heap, and so are
> tombstones. One of your queries is particularly problematic:
>
> SELECT app, platform, slug, partition, user_id, attributes, state, timezone, version
> FROM table WHERE app = ? AND platform = ? AND slug = ? AND partition = ? LIMIT ?
>
> Although you're using the LIMIT clause, it will read the whole partition,
> merge it in memory and only then apply the LIMIT. Check this blog post for
> more detailed info:
> http://thelastpickle.com/blog/2017/03/07/The-limit-clause-in-cassandra-might-not-work-as-you-think.html
>
> This can lead you to read the whole 450MB and all the tombstones even
> though you're only targeting a few rows in the partition. Large partitions
> also create heap pressure during compactions, which will issue warnings in
> the logs (look for "large partition").
>
> You should remove the delete/insert logged batch as it spreads over
> multiple partitions, which is bad for many reasons. It gives you no real
> atomicity, just the guarantee that if one query succeeds, then the rest of
> the queries will eventually succeed (and that could possibly take some
> time, leaving the cluster in an inconsistent state in the meantime).
> Logged batches have a lot of overhead, one part of which is a write of the
> queries to the batchlog table, which is replicated to 2 other nodes and
> then deleted after the batch has completed. You'd be better off turning
> those into async queries with an external retry mechanism.
>
> Tuning the GC should help you cope with your data modeling issues.
>
> For safety reasons, only change the GC settings on one canary node, then
> observe and compare its behavior over a full day. If the results are
> satisfying, generalize to the rest of the cluster. You need to experience
> peak load to make sure the new settings are fixing your issues.
>
> Cheers,
>
> On Tue, Jun 6, 2017 at 4:22 PM Vincent Rischmann <m...@vrischmann.me> wrote:
>> Hi Alexander.
>>
>> Yeah, the minor GCs I see are usually around 300ms, but sometimes they
>> jump to 1s or even more.
>>
>> Hardware specs are:
>> - 8 core CPUs
>> - 32 GB of RAM
>> - 4 SSDs in hardware RAID 0, around 3TB of space per node
>>
>> GC settings: -Xmx12G -Xms12G -XX:+UseG1GC -XX:G1RSetUpdatingPauseTimePercent=5
>> -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=70
>> -XX:ParallelGCThreads=8 -XX:ConcGCThreads=8 -XX:+ParallelRefProcEnabled
>>
>> According to the graphs, there is approximately one Young GC every 10s or
>> so, and almost no Full GCs (for example the last one was 2h45 after the
>> previous one).
>>
>> Computed from the log files, the average Young GC seems to be around
>> 280ms and the max is 2.5s. The average Full GC seems to be around 4.6s
>> and the max is 5.3s. I only computed this on one node, but the problem
>> occurs on every node as far as I can see.
>>
>> I'm open to tuning the GC; I stuck with the defaults (that I think I saw
>> in the Cassandra conf, I'm not sure).
>>
>> Number of SSTables looks ok, p75 is at 4 (as is the max for that matter).
>>
>> Partition size is a problem, yeah: this particular table from which we
>> read a lot has a max partition size of 450 MB. I've known about this
>> problem for a long time actually; we already did a bunch of work reducing
>> partition sizes, I think a year ago, but this particular table is tricky
>> to change.
>>
>> One thing to note about this table is that we do a ton of DELETEs
>> regularly (which we can't really stop doing short of completely
>> redesigning the table), so we have a ton of tombstones too. We get a lot
>> of warnings about the tombstone threshold when we do our selects (things
>> like "Read 2001 live and 2528 tombstone cells"). I suppose this could be
>> a factor?
>>
>> Each query reads from a single partition key, yes, but as I said we issue
>> a lot of them at the same time.
>>
>> The table looks like this (simplified):
>>
>> CREATE TABLE table (
>>     app text,
>>     platform text,
>>     slug text,
>>     partition int,
>>     user_id text,
>>     attributes blob,
>>     state int,
>>     timezone text,
>>     version int,
>>     PRIMARY KEY ((app, platform, slug, partition), user_id)
>> ) WITH CLUSTERING ORDER BY (user_id ASC)
>>
>> And the main queries are:
>>
>> SELECT app, platform, slug, partition, user_id, attributes, state, timezone, version
>> FROM table WHERE app = ? AND platform = ? AND slug = ? AND partition = ? LIMIT ?
>>
>> SELECT app, platform, slug, partition, user_id, attributes, state, timezone, version
>> FROM table WHERE app = ? AND platform = ? AND slug = ? AND partition = ? AND user_id >= ? LIMIT ?
>>
>> partition is basically an integer that goes from 0 to 15, and we always
>> select the 16 partitions in parallel.
>>
>> Note that we write constantly to this table, to update some fields and to
>> insert the user into the new "slug" (a slug is an amalgamation of
>> different parameters like state, timezone, etc. that allows us to
>> efficiently query all users of a particular "app" with a given "slug".
>> At least that's the idea; as seen here it causes us some trouble).
>>
>> And yes, we do use batches to write this data. This is how we process
>> each user update:
>> - SELECT from a "master" slug to get the fields we need
>> - from that, compute a list of slugs the user had and a list of slugs the
>>   user should have (for example if he changes timezone we have to update
>>   the slug)
>> - delete the user from the slug he shouldn't be in and insert the user
>>   where he should be
>> The last part, the delete/insert, is done in a logged batch.
>>
>> I hope it's relatively clear.
>>
>> On Tue, Jun 6, 2017, at 02:46 PM, Alexander Dejanovski wrote:
>>> Hi Vincent,
>>>
>>> dropped messages are indeed common in case of long GC pauses. Having 4s
>>> to 6s pauses is not normal and is the sign of an unhealthy cluster.
>>> Minor GCs are usually faster, but you can have long ones too.
>>>
>>> If you can share your hardware specs along with your current GC settings
>>> (CMS or G1, heap size, young gen size) and a distribution of GC pauses
>>> (rate of minor GCs, average and max duration of GCs), we could try to
>>> help you tune your heap settings. You can activate full GC logging,
>>> which could help in fine tuning MaxTenuringThreshold and survivor space
>>> sizing.
>>>
>>> You should also check max partition sizes and the number of SSTables
>>> accessed per read. Run nodetool cfstats/cfhistograms on your tables to
>>> get both. The p75 should be less than or equal to 4 SSTables, and you
>>> shouldn't have partitions over... let's say 300 MB.
>>> Partitions > 1GB are a critical problem to address.
>>>
>>> Other things to consider are:
>>> - Do you read from a single partition for each query?
>>> - Do you use collections that could spread over many SSTables?
>>> - Do you use batches for writes (although your problem doesn't seem to
>>>   be write related)?
>>> - Can you share the queries from your scheduled selects, and the data
>>>   model?
>>>
>>> Cheers,
>>>
>>> On Tue, Jun 6, 2017 at 2:33 PM Vincent Rischmann <m...@vrischmann.me> wrote:
>>>> Hi,
>>>>
>>>> we have a cluster of 11 nodes running Cassandra 2.2.9 where we
>>>> regularly get READ messages dropped:
>>>>
>>>> > READ messages were dropped in last 5000 ms: 974 for internal timeout
>>>> > and 0 for cross node timeout
>>>>
>>>> Looking at the logs, some are logged at the same time as Old Gen GCs.
>>>> These GCs all take around 4 to 6s to run. To me, it's "normal" that
>>>> these could cause reads to be dropped. However, we also have reads
>>>> dropped without Old Gen GCs occurring, only Young Gen.
>>>>
>>>> I'm wondering if anyone has a good way of determining what the _root_
>>>> cause could be. Up until now, the only way we managed to decrease load
>>>> on our cluster was essentially by guessing some stuff, trying it out
>>>> and being lucky. I'd love a way to make sure what the problem is before
>>>> tackling it. Doing schema changes is not a problem, but changing stuff
>>>> blindly is not super efficient :)
>>>>
>>>> What I do see in the logs is that these drops happen almost exclusively
>>>> when we do a lot of SELECTs. The times logged almost always correspond
>>>> to times when our scheduled SELECTs are happening. That narrows the
>>>> scope a little, but still.
>>>>
>>>> Anyway, I'd appreciate any information about troubleshooting this
>>>> scenario. Thanks.
>>>
>>> --
>>> -----------------
>>> Alexander Dejanovski
>>> France
>>> @alexanderdeja
>>>
>>> Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com