Hi all,
We have a 12-node Kafka cluster and are seeing roughly 90% CPU usage on
every node. The details are below. We need help figuring out what is
causing this and how to fix it.
*Cluster:*
Kafka version: 2.3.0
Number of brokers in cluster: 12
Node type: 4 vCores 32GB mem
Network In: 10Mbps per broker
Network Out: 16Mbps per broker
Topics: 10 (approximately)
Partitions: 20 (max); some topics have fewer
Replication Factor: 3
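For a sense of scale, here is a rough upper bound on the number of partition replicas each broker hosts. This is only a sketch: it assumes every topic has the maximum of 20 partitions and that replicas are spread evenly across brokers, neither of which has been verified.

```python
# Rough upper bound on partition replicas hosted per broker.
# Assumptions (not verified): every topic has the maximum partition
# count, and replicas are evenly distributed across the brokers.
topics = 10              # approximate
max_partitions = 20      # per topic, upper bound
replication_factor = 3
brokers = 12

total_replicas = topics * max_partitions * replication_factor
replicas_per_broker = total_replicas / brokers
print(total_replicas, replicas_per_broker)
```

Under those assumptions that is at most 600 replicas in total, or about 50 per broker.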
*CPU Usage:*
[image: image.png]
*VMStat:*
[root]# vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- --system--- ------cpu-----
 r  b  swpd   free  buff    cache   si   so    bi    bo    in    cs us sy id wa st
 8  0     0 234444 19064 24046980    0    0    17  2026     1     3 38 33 28  0  1
 7  0     0 256444 19036 24023880    0    0   768     0 64027 22708 44 40 16  0  1
 7  0     0 245356 19052 24034560    0    0   256   472 63509 23276 44 39 17  0  1
 7  0     0 235096 19052 24046616    0    0     0     0 62277 22516 46 38 15  0  1
 8  0     0 260548 19036 24020084    0    0   516 49888 62364 22894 43 38 18  0  1
 5  0     0 249232 19036 24030924    0    0   512     0 61022 24589 41 39 20  0  1
 6  0     0 238072 19036 24042512    0    0  1024     0 63358 23063 44 38 17  0  0
 5  0     0 262904 19052 24017972    0    0     0   440 63078 23499 46 37 17  0  1
 7  0     0 250324 19052 24030008    0    0     0     0 64615 22617 48 38 14  0  1
 6  0     0 237920 19052 24042372    0    0  1024 48900 63223 23029 42 40 18  0  1
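To put numbers on the vmstat samples, here is a small sketch that averages the nine steady-state rows. The first row is skipped because vmstat's first line reports averages since boot. The values are copied from the output above; the variable names are only for illustration.

```python
# Steady-state vmstat samples (first, since-boot row skipped).
# Each tuple is (in, cs, us, sy, id) copied from the output above.
samples = [
    (64027, 22708, 44, 40, 16),
    (63509, 23276, 44, 39, 17),
    (62277, 22516, 46, 38, 15),
    (62364, 22894, 43, 38, 18),
    (61022, 24589, 41, 39, 20),
    (63358, 23063, 44, 38, 17),
    (63078, 23499, 46, 37, 17),
    (64615, 22617, 48, 38, 14),
    (63223, 23029, 42, 40, 18),
]

def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

avg_in = mean(s[0] for s in samples)   # interrupts per second
avg_cs = mean(s[1] for s in samples)   # context switches per second
avg_us = mean(s[2] for s in samples)   # % user CPU
avg_sy = mean(s[3] for s in samples)   # % system (kernel) CPU
avg_id = mean(s[4] for s in samples)   # % idle

# Fraction of busy CPU time that is spent in the kernel.
sys_share = avg_sy / (avg_us + avg_sy)

print(f"in/s={avg_in:.0f} cs/s={avg_cs:.0f}")
print(f"us={avg_us:.1f}% sy={avg_sy:.1f}% id={avg_id:.1f}%")
print(f"kernel share of busy CPU={sys_share:.0%}")
```

On these samples, system time is close to user time: a bit under half of the busy CPU is spent in the kernel, with roughly 63k interrupts and 23k context switches per second.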
*IO Stat:*
[root]# iostat -m
Linux 4.14.72-73.55.amzn2.x86_64 (loc-kafka11.internal.dnaspaces.io)  01/02/2020  _x86_64_  (4 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          38.11    0.00   33.09    0.11    0.61   28.08
Device:       tps  MB_read/s  MB_wrtn/s   MB_read    MB_wrtn
xvda         2.36       0.01       0.01     26760      43360
nvme0n1      0.00       0.00       0.00         2          0
xvdf        70.95       0.06       7.67    185908   25205338
*Top Kafka broker threads:*
[image: image.png]
*Top 3:*
"data-plane-kafka-network-thread-10-ListenerName(PLAINTEXT)-PLAINTEXT-0" #60 prio=5 os_prio=0 tid=0x00007f8b1ab56000 nid=0x581f runnable [0x00007f8a886ce000]
"data-plane-kafka-network-thread-10-ListenerName(PLAINTEXT)-PLAINTEXT-2" #62 prio=5 os_prio=0 tid=0x00007f8b1ab59000 nid=0x5821 runnable [0x00007f8a6aefd000]
"data-plane-kafka-network-thread-10-ListenerName(PLAINTEXT)-PLAINTEXT-1" #61 prio=5 os_prio=0 tid=0x00007f8b1ab57800 nid=0x5820 runnable [0x00007f8a885cd000]
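The nid values in a Java thread dump are the native thread IDs in hex, while top -H lists thread IDs in decimal. A minimal sketch of the conversion used to match the two; the dictionary keys are shortened labels for illustration, not the full thread names:

```python
# nid values copied from the thread dump above. The keys are just the
# trailing part of each network thread name, used as short labels.
nids = {
    "PLAINTEXT-0": "0x581f",
    "PLAINTEXT-2": "0x5821",
    "PLAINTEXT-1": "0x5820",
}

# int(text, 16) accepts the "0x" prefix, giving the decimal thread ID
# as shown in the PID column of `top -H`.
tids = {name: int(nid, 16) for name, nid in nids.items()}

for name, tid in tids.items():
    print(name, tid)
```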
Neither GC nor disk I/O looks like the problem: iowait is about 0.1%, and
the busiest threads are the broker's network threads, not GC threads.
Thanks