Alain,

I really appreciate your answers! A little typo doesn't change the
valuable content! I will definitely give your GC settings a shot and come
back with my findings.
Right now I have 6 nodes up and running and everything looks good so far
(at least much better).

I agree, the hardware I am using is quite old, but rather than experimenting
with new hardware combinations (on prod) I decided to play it safe and scale
horizontally with the hardware we have tested. I'm preparing to migrate
inside a VPC and I'd like to deploy on i3.xlarge instances, possibly
Multi-AZ.

Speaking of EBS: I ran a quick I/O test on an m3.xlarge + SSD + EBS (400
PIOPS). The SSD looks great for commitlogs; for EBS I might need more
guidance. I certainly gain in terms of random I/O, but I'd like to hear where
you stand on io1 (PIOPS) vs regular gp2. Or better: what are your guidelines
when using EBS?
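
For context, the quick random-read test was something along these lines (just a
sketch; the device path, size and parameters are placeholders rather than the
exact command I ran):

```
fio --name=randread --filename=/mnt/cassandra/fio.test --size=4G \
    --rw=randread --bs=4k --direct=1 --ioengine=libaio \
    --iodepth=32 --runtime=60 --time_based --group_reporting
```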

Thanks!

PS: I definitely owe you a coffee, actually much more than that!

On Thu, Jul 19, 2018 at 6:24 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:

> Ah excuse my confusion. I now understand I guide you through changing the
>> throughput when you wanted to change the compaction throughput.
>
>
>
> Wow, I meant to say "I guided you through changing the compaction
> throughput when you wanted to change the number of concurrent compactors."
>
> I should not answer messages before waking up fully...
>
> :)
>
> C*heers,
> -----------------------
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2018-07-19 14:07 GMT+01:00 Alain RODRIGUEZ <arodr...@gmail.com>:
>
>> Ah excuse my confusion. I now understand I guide you through changing the
>> throughput when you wanted to change the compaction throughput.
>>
>> I also found some commands I ran in the past using jmxterm. As mentioned
>> by Chris (and thanks, Chris, for answering the question properly), the
>> 'max' can never be lower than the 'core'.
>>
>> Use JMXTERM to REDUCE the concurrent compactors:
>>
>> ```
>> # If we currently have more than 2 threads, lower 'core' first, then 'max':
>> echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 2" \
>>   | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
>> echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 2" \
>>   | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
>> ```
>>
>> Use JMXTERM to INCREASE the concurrent compactors:
>>
>> ```
>> # If we currently have fewer than 6 threads, raise 'max' first, then 'core':
>> echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 6" \
>>   | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
>> echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 6" \
>>   | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
>> ```
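>>
>> To double-check the change took effect, the same bean can be read back (a
>> quick sketch, assuming the same jmxterm jar path):
>>
>> ```
>> echo "get -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads" \
>>   | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
>> echo "get -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads" \
>>   | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
>> ```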
>>
>> Some comments about the information you shared, as you said, 'thinking
>> out loud' :):
>>
>> *About the hardware*
>>
>> I remember using the 'm1.xlarge' :). They are not that recent. It is
>> probably worth reconsidering this hardware choice and migrating to newer
>> hardware (m5/r4 + EBS gp2, or i3 with ephemeral storage). You should be able
>> to reduce the number of nodes and make it cost-equivalent (or maybe slightly
>> more expensive, but then it works properly). I once moved from a lot of these
>> nodes (80ish) to a few i2 instances (5 - 15? I don't remember). Latency went
>> from 20 ms to 3 - 5 ms (and was improved later on). Also, using the right
>> hardware for your case should spare you and your team some headaches. I
>> started with t1.micro in prod and went all the way up (m1.small, m1.medium,
>> ...). It's good for learning, not for business.
>>
>> In particular, these do not work well together:
>>
>>> my instances are still on magnetic drives
>>>
>>
>> with
>>
>> most tables on LCS
>>
>> frequent r/w pattern
>>>
>>
>> Having some SSDs here (EBS gp2, or even better i3 with NVMe disks) would most
>> probably help reduce the latency. I would also pick an instance with
>> more memory (30 GB would probably be more comfortable). The more memory,
>> the better you can tune the JVM and the more page caching can be
>> done (thus avoiding some disk reads). Given the number of nodes you use,
>> it's hard to keep the cost low while making this change. When the cluster
>> grows you might want to consider changing the instance type again; for now
>> you could just take an r4.xlarge + EBS gp2 volume, which comes with 30+ GB of
>> memory and the same number of CPUs (or more), and see how many nodes are
>> needed. It might be slightly more expensive, but I really believe it could
>> do some good.
>>
>> As a medium-term solution, I think you might be really happy with a
>> change of this kind.
>>
>> *About DTCS/TWCS?*
>>
>>> - few tables with DTCS
>>>    - need to upgrade to 3.0.8 for TWCS
>>
>> Indeed, switching from DTCS to TWCS can be a real relief for a
>> cluster. You should not have to wait for the upgrade to 3.0.8 to use TWCS,
>> though I must say I am not too sure about 3.0.x (x < 8) versions. Maybe
>> giving http://thelastpickle.com/blog/2017/01/10/twcs-part2.html a try with
>> https://github.com/jeffjirsa/twcs/tree/cassandra-3.0.0 is easier for you?
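>>
>> If you go the jar route, the switch itself is just a table property change. A
>> rough sketch from cqlsh (keyspace/table names are placeholders, and the fully
>> qualified class name of the backport is my assumption, so check the repo's
>> README; after an upgrade to a version that ships TWCS, the class is simply
>> 'TimeWindowCompactionStrategy'):
>>
>> ```
>> # Hypothetical keyspace/table; try it on one table first and watch compactions.
>> cqlsh 127.0.0.1 -e "ALTER TABLE my_keyspace.my_timeseries WITH compaction = {'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': '1'};"
>> ```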
>>
>> *Garbage Collection?*
>>
>> That being said, the CPU load is really high, and I suspect Garbage
>> Collection is taking a lot of time on the nodes of this cluster. It is
>> probably not helping the CPUs either. This might even be the biggest pain
>> point for this cluster.
>>
>> Would you like to try the following settings on a canary node and see
>> how it goes? These settings are quite arbitrary; with the gc.log I could be
>> more precise about what I believe is a correct setting.
>>
>> GC Type: CMS
>> Heap: 8 GB (could be bigger, but we are limited by the 15 GB in total).
>> New_heap: 2 - 4 GB (maybe experiment with both values)
>> TenuringThreshold: 15 (instead of 1, which is definitely too small and
>> tends to let short-lived objects be promoted to the old gen)
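>>
>> As JVM flags, that would look roughly like the sketch below (where each flag
>> lives, cassandra-env.sh or conf/jvm.options, depends on your packaging; the
>> values are starting points for a canary node, not a recommendation):
>>
>> ```
>> # cassandra-env.sh style (shell); equivalent flags can go into conf/jvm.options.
>> MAX_HEAP_SIZE="8G"
>> HEAP_NEWSIZE="2G"    # try 2G and then 4G on the canary and compare gc.log
>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC -XX:+UseParNewGC"
>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=15"
>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=4"   # optionally 2-4 instead of the default 8
>> ```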
>>
>> For those settings, I do not trust the Cassandra defaults in most cases.
>> New_heap_size should be 25-50% of the heap (and not derived from the number
>> of CPU cores). Also, below a 16 GB heap I never got better results with G1GC
>> than with CMS. But I must say I have fought a lot with CMS in the past to
>> tune it nicely, while I have not played much with G1GC.
>>
>> These (or similar) settings worked for several clusters with heavy read
>> patterns. On the mailing list I recently explained to someone else my
>> understanding of the JVM and GC, and there is also a blog post my colleague
>> Jon wrote here: http://thelastpickle.com/blog/2018/04/11/gc-tuning.html. I
>> believe he suggested a slightly different tuning.
>> If none of this helps, please send the gc.log file over, with and
>> without this change, and we can have a look at what is going on.
>> SurvivorRatio can also be moved down to 2 or 4 if you want to play around
>> and check the difference.
>>
>> Make sure to use a canary node first: there is no universally 'good'
>> configuration here, it really depends on the workload, and the settings
>> above could harm the cluster.
>>
>> I think we can get more out of these instances. Nonetheless, after adding a
>> few more nodes, scaling up the instance type instead of the number of nodes,
>> to get SSDs and a bit more memory, will make things smoother and probably
>> cheaper as well at some point.
>>
>>
>>
>>
>> 2018-07-18 17:27 GMT+01:00 Riccardo Ferrari <ferra...@gmail.com>:
>>
>>> Chris,
>>>
>>> Thank you for the MBean reference.
>>>
>>> On Wed, Jul 18, 2018 at 6:26 PM, Riccardo Ferrari <ferra...@gmail.com>
>>> wrote:
>>>
>>>> Alain, thank you for the email. I really, really appreciate it!
>>>>
>>>> I am actually trying to remove disk I/O from the suspect list, thus
>>>> I want to reduce the number of concurrent compactors. I'll give
>>>> throughput a shot.
>>>> No, I don't have a long list of pending compactions; however, my
>>>> instances are still on magnetic drives and can't really afford a high
>>>> number of compactors.
>>>>
>>>> We started to have slowdowns and most likely we were undersized; new
>>>> features are coming in and I want to be ready for them.
>>>> *About the issue:*
>>>>
>>>>
>>>>    - High system load on cassandra nodes. This means top showing
>>>>    6.0/12.0 on a 4 vcpu instance (!)
>>>>
>>>>
>>>>    - CPU is high:
>>>>          - Dynatrace says 50%
>>>>          - top easily goes to 80%
>>>>       - Network around 30Mb (according to Dynatrace)
>>>>       - Disks:
>>>>          - ~40 iops
>>>>          - high latency: ~20ms (min 8 max 50!)
>>>>          - negligible iowait
>>>>          - testing an empty instance with fio I get 1200 r_iops / 400
>>>>          w_iops
>>>>
>>>>
>>>>    - Clients timeout
>>>>       - mostly when reading
>>>>       - few cases when writing
>>>>    - Slowly growing number of "All time blocked" for Native-Transport-Requests
>>>>       - small numbers: hundreds vs millions of successfully served
>>>>       requests
>>>>
>>>> The system:
>>>>
>>>>    - Cassandra 3.0.6
>>>>       - most tables on LCS
>>>>          - frequent r/w pattern
>>>>       - few tables with DTCS
>>>>          - need to upgrade to 3.0.8 for TWCS
>>>>          - mostly TS data, stream write / batch read
>>>>       - All our keyspaces have RF: 3
>>>>
>>>>
>>>>    - All nodes on the same AZ
>>>>    - m1.xlarge
>>>>    - 4x420 GB drives (ephemeral storage) configured in striping (RAID 0)
>>>>       - 4 vcpu
>>>>       - 15GB ram
>>>>    - workload:
>>>>       - Java applications:
>>>>          - mostly feeding Cassandra, writing incoming data
>>>>       - Apache Spark applications:
>>>>          - batch processes reading from and writing back to C* or other
>>>>          systems
>>>>          - not co-located
>>>>
>>>> So far my effort has gone into growing the ring to better distribute the
>>>> load and decrease the pressure, including:
>>>>
>>>>    - Increasing the node count from 3 to 5 (6th node joining)
>>>>    - JVM memory "optimization":
>>>>       - heaps were set by the default script to something a bit smaller
>>>>       than 4GB, with CMS GC
>>>>          - GC pressure was high / long GC pauses
>>>>          - clients were suffering from read timeouts
>>>>       - increased the heap, still using CMS:
>>>>          - very long GC pauses
>>>>          - not much tuning around CMS
>>>>       - switched to G1 and forced a 6/7GB heap on each node using roughly
>>>>       the suggested settings
>>>>          - much more stable, generally < 300ms
>>>>          - I still have long pauses from time to time (mostly around
>>>>          1200ms, sometimes 3000ms on some nodes)
>>>>
>>>> *Thinking out loud:*
>>>> Things are much better; however, I still see high CPU usage, especially
>>>> when Spark kicks in, even though the Spark jobs are very small in terms of
>>>> resources (a single worker with very limited parallelism).
>>>>
>>>> On LCS tables cfstats reports single-digit read latencies and generally
>>>> 0.x ms write latencies (as of today).
>>>> On DTCS tables I have 0.x ms write latency but still double-digit read
>>>> latency; I guess I should spend some time tuning that, or upgrade and
>>>> move away from DTCS :(
>>>> Yes, Spark reads mostly from DTCS tables.
>>>>
>>>> It is still kind of common to have dropped READ, HINT and MUTATION messages.
>>>>
>>>>    - not on all nodes
>>>>    - this generally happens on node restart
>>>>
>>>>
>>>> On a side note, I tried installing libjemalloc1 from the Ubuntu repo
>>>> (mixed 14.04 and 16.04) with terrible results: much slower instance
>>>> startup and responsiveness. How could that be?
>>>>
>>>> Once everything has stabilized I'll prepare our move to a VPC and
>>>> possibly upgrade to i3 instances. Any comment on the hardware side? Are
>>>> 4 cores still reasonable hardware?
>>>>
>>>> Best,
>>>>
>>>> On Tue, Jul 17, 2018 at 9:18 PM, Alain RODRIGUEZ <arodr...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello Riccardo,
>>>>>
>>>>> I noticed I have been writing a novel to answer a simple couple of
>>>>> questions again ¯\_(ツ)_/¯. So here is a short answer, in case that's
>>>>> what you were looking for :). Also, there is a warning that it might be
>>>>> counter-productive and stress the cluster even more to increase the
>>>>> compaction throughput. There is more information below ('about the 
>>>>> issue').
>>>>>
>>>>> *tl;dr*:
>>>>>
>>>>> What about using 'nodetool setcompactionthroughput XX' instead? It
>>>>> should be available there.
>>>>>
>>>>> In the same way, 'nodetool getcompactionthroughput' gives you the
>>>>> current value. Be aware that a change made through JMX/nodetool is
>>>>> *not* permanent: you still need to update the cassandra.yaml file.
>>>>>
>>>>> If you really want to use the MBean through JMX, because using
>>>>> 'nodetool' is too easy (or for any other reason :p):
>>>>>
>>>>> MBean: org.apache.cassandra.service.StorageServiceMBean
>>>>> Attribute: CompactionThroughputMbPerSec
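>>>>>
>>>>> For example, a quick sketch (the value 32 MB/s is arbitrary, and the JMX
>>>>> object name is my assumption of where that attribute is registered):
>>>>>
>>>>> ```
>>>>> nodetool getcompactionthroughput
>>>>> nodetool setcompactionthroughput 32
>>>>> # or via JMX directly with jmxterm:
>>>>> echo "set -b org.apache.cassandra.db:type=StorageService CompactionThroughputMbPerSec 32" \
>>>>>   | java -jar jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
>>>>> ```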
>>>>>
>>>>> *Long story*, with the "how to", since I went through this search
>>>>> myself; I did not know where this MBean was.
>>>>>
>>>>> Can someone point me to the right mbean?
>>>>>> I can not really find good docs about mbeans (or tools ...)
>>>>>
>>>>>
>>>>> I am not sure about the doc, but you can use jmxterm (
>>>>> http://wiki.cyclopsgroup.org/jmxterm/download.html).
>>>>>
>>>>> To replace the doc I use CCM (https://github.com/riptano/ccm) +
>>>>> jconsole to find the mbeans locally:
>>>>>
>>>>> * Add loopback addresses for ccm (see the readme file)
>>>>> * Then, create the cluster: 'ccm create Cassandra-3-0-6 -v 3.0.6 -n
>>>>> 3 -s'
>>>>> * Start jconsole using the right pid: 'jconsole $(ccm node1 show |
>>>>> grep pid | cut -d "=" -f 2)'
>>>>> * Explore MBeans, try to guess where this could be (and discover other
>>>>> funny stuff in there :)).
>>>>>
>>>>> I must admit I did not find it this way using C* 3.0.6 and jconsole.
>>>>> I looked at the code: I locally used C* 3.0.6 and ran 'grep -RiI
>>>>> CompactionThroughput', with this result:
>>>>> https://gist.github.com/arodrime/f9591e4bdd2b1367a496447cdd959006
>>>>>
>>>>> With this I could find the right MBean; the only code documentation
>>>>> that is always up to date is the code itself, I am afraid:
>>>>>
>>>>> './src/java/org/apache/cassandra/service/StorageServiceMBean.java:
>>>>> public void setCompactionThroughputMbPerSec(int value);'
>>>>>
>>>>> Note that the research in the code also leads to nodetool ;-).
>>>>>
>>>>> I could finally find the MBean in the 'jconsole' too:
>>>>> https://cdn.pbrd.co/images/HuUya3x.png (not sure how long this link
>>>>> will live).
>>>>>
>>>>> jconsole also lets you see which attributes can be set and which
>>>>> cannot.
>>>>>
>>>>> You can now find any other MBean you would need I hope :).
>>>>>
>>>>>
>>>>> see if it helps when the system is under stress
>>>>>
>>>>>
>>>>> *About the issue*
>>>>>
>>>>> You don't say exactly what you are observing; what is that "stress"?
>>>>> How is it impacting the cluster?
>>>>>
>>>>> I ask because I am afraid this change might not help and might even be
>>>>> counter-productive. Even though having SSTables nicely compacted makes a
>>>>> huge difference at read time, if that's already the case for you and
>>>>> the data is already nicely compacted, this change won't help. It
>>>>> might even make things slightly worse if the current bottleneck is disk
>>>>> IO during a stress period, as the compactors would increase their disk
>>>>> read throughput and thus possibly compete with read requests for disk
>>>>> throughput.
>>>>>
>>>>> If you have a similar number of sstables on all nodes, not many
>>>>> compactions pending (nodetool compactionstats -H) and read operations are
>>>>> hitting a small number of sstables (nodetool tablehistograms), then you
>>>>> probably don't need to increase the compaction speed.
>>>>>
>>>>> Let's say that the compaction throughput is not often the cause of
>>>>> stress during peak hours, nor a direct way to make things 'faster'.
>>>>> Generally when compaction goes wrong, the number of sstables goes
>>>>> *through* the roof. If you have a chart showing the number of sstables,
>>>>> you can see this really well.
>>>>>
>>>>> Of course, if you feel you are in this case, increasing the compaction
>>>>> throughput will definitely help, provided the cluster also has spare disk
>>>>> throughput.
>>>>>
>>>>> To check what's wrong, if you believe it's something different, here
>>>>> are some useful commands:
>>>>>
>>>>> - nodetool tpstats (check for pending/blocked/dropped threads there)
>>>>> - check WARNs and ERRORs in the logs (i.e. grep -e "WARN" -e "ERROR"
>>>>> /var/log/cassandra/system.log)
>>>>> - Check local latencies (nodetool tablestats /
>>>>> nodetool tablehistograms) and compare them to the client request latency.
>>>>> At the node level, reads should probably be single-digit milliseconds,
>>>>> rather close to 1 ms with SSDs, and writes most probably below a
>>>>> millisecond (it depends on the data size too, etc.).
>>>>> - Trace a query during this period and see what takes time (for example
>>>>> from 'cqlsh' - 'TRACING ON; SELECT ...')
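>>>>>
>>>>> Roughly the same checks as a copy-paste sketch (the keyspace/table names
>>>>> are placeholders, and the log path depends on your install):
>>>>>
>>>>> ```
>>>>> nodetool tpstats
>>>>> grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log | tail -n 50
>>>>> nodetool tablestats my_keyspace.my_table
>>>>> nodetool tablehistograms my_keyspace my_table
>>>>> ```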
>>>>>
>>>>> You can also analyze the *Garbage Collection* activity. As Cassandra
>>>>> uses the JVM, a badly tuned GC will induce long pauses. Depending on the
>>>>> workload, and I must say for most of the clusters I work on, the default
>>>>> tuning is not that good and can keep servers busy 10-15% of the time with
>>>>> stop-the-world GC.
>>>>> You might find this post by my colleague Jon about GC tuning for
>>>>> Apache Cassandra interesting:
>>>>> http://thelastpickle.com/blog/2018/04/11/gc-tuning.html. Reducing GC
>>>>> pressure is a very common way to optimize a Cassandra cluster and adapt
>>>>> it to your workload/hardware.
>>>>>
>>>>> C*heers,
>>>>> -----------------------
>>>>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>>>>> France / Spain
>>>>>
>>>>> The Last Pickle - Apache Cassandra Consulting
>>>>> http://www.thelastpickle.com
>>>>>
>>>>>
>>>>> 2018-07-17 17:23 GMT+01:00 Riccardo Ferrari <ferra...@gmail.com>:
>>>>>
>>>>>> Hi list,
>>>>>>
>>>>>> Cassandra 3.0.6
>>>>>>
>>>>>> I'd like to test the change of concurrent compactors to see if it
>>>>>> helps when the system is under stress.
>>>>>>
>>>>>> Can someone point me to the right mbean?
>>>>>> I can not really find good docs about mbeans (or tools ...)
>>>>>>
>>>>>> Any suggestion much appreciated, best
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
