The controller is not running on this node. We found that one of the newly added producers was publishing only to a specific partition hosted on this node, which explains why only this node showed the high utilization. The publisher was running in an async thread and invoking send() asynchronously, but the messages never showed up in the Kafka consumer; it only resulted in high CPU utilization. Even though the producer was disabled, the async threads were still running, and the CPU normalized only after the application hosting the producer was restarted.
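For reference, here is a minimal sketch of the send patterns involved: the fire-and-forget send() we had, a callback-based send that would at least have surfaced delivery errors, and the blocking send(message).get() described below. The broker address, topic name, and payload are placeholders, not the actual values from our setup.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSendSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");      // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("my-new-topic", "key", "value");        // placeholder topic

            // 1. Fire-and-forget: send() only enqueues the record in the producer's
            //    buffer; if the application thread or the producer goes away before
            //    the batch is flushed, nothing reaches the broker and no error is
            //    ever reported to the caller.
            producer.send(record);

            // 2. Async with a callback: still non-blocking, but delivery failures
            //    (timeouts, broker errors) are surfaced instead of silently lost.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("wrote to %s-%d@%d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });

            // 3. Synchronous: block until the broker acknowledges; this is the
            //    send(message).get() pattern that worked for us.
            RecordMetadata meta = producer.send(record).get();
            System.out.println("acked at offset " + meta.offset());
        }   // try-with-resources closes the producer, which flushes buffered records
    }
}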
These were spread across two data centers, and I am not sure whether the latency caused the async thread to complete before the Kafka producer thread could write the message to the broker. When we did a synchronous send(message).get(), it worked fine. We were only publishing about 500 messages every 15 minutes, so I am not sure why it drove the CPU utilization up without any consumption even happening on the Kafka broker. We are rewriting the producer implementation; it seems an async send() invoked from an async Java thread does not work in networks with low latency.

On Sun, Aug 5, 2018 at 10:16 AM, Manjunath N <manj...@gmail.com> wrote:

> After you deleted the topic, was it a clean delete? Did you verify in
> ZooKeeper and in the Kafka logs directory? If not, you may need to do
> some cleanup if there are inconsistencies between the Kafka logs dir and
> ZooKeeper.
> Did you try to move the replica assignment to different machines for this
> topic and see if it behaves the same way on other machines for this
> particular topic?
> Check the number of open file handles before you start writing to this
> topic and after you kick off the writer/producer on all the replica
> machines for this topic.
> Check how many log files are being created for the partition segments in
> the Kafka logs dir.
> Is there anything in the log files to trace back this behavior? If you
> could check and share any errors or warnings, it will help.
>
>
> > On Aug 5, 2018, at 4:13 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > bq. only one specific node is showing this issue
> >
> > Is the controller running on this node? Updating the metrics is expensive.
> >
> > Cheers
> >
> > On Sat, Aug 4, 2018 at 3:00 PM Abhijith Sreenivasan <
> > abhijithonl...@gmail.com> wrote:
> >
> >> Hello
> >>
> >> We are seeing high CPU usage for the Kafka process. I am using version
> >> 0.11. It has 5 topics, one of which was newly created. We attempted to
> >> publish messages to this new topic; they did not show up in the
> >> consumer, but there were no errors on the publisher end. Not sure why
> >> the messages did not show up in the consumer.
> >>
> >> This ran for a couple of days (30K messages) when we noticed 100%+ CPU
> >> usage. We tried deleting the topic (the deletion config is enabled);
> >> it was marked for deletion, but after that usage rose to 240%+. We
> >> restarted the process many times and disabled the publisher/producer,
> >> but it made no difference. After some time (1 or 2 hours) we get a
> >> "Too many open files" error and the process shuts down.
> >>
> >> We have 3 nodes with Kafka and 3 other nodes running ZK, but only one
> >> specific node is showing this issue (the one where the new topic's
> >> partition is present).
> >>
> >> Still debugging, and this is a prod environment.. please help!
> >>
> >> Thanks,
> >> Abhi
> >>
> >> top - 17:47:43 up 289 days, 18:54, 2 users, load average: 2.65, 2.72, 2.52
> >> Tasks: 144 total, 1 running, 143 sleeping, 0 stopped, 0 zombie
> >> %Cpu(s): 37.5 us, 19.4 sy, 0.0 ni, 37.0 id, 0.0 wa, 0.0 hi, 5.4 si, 0.7 st
> >> KiB Mem : 16266464 total, 1431916 free, 5769976 used, 9064572 buff/cache
> >> KiB Swap: 0 total, 0 free, 0 used. 9230548 avail Mem
> >>
> >>   PID USER  PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
> >> 32058 root  20   0 5898348 1.078g 15548 S 253.0  6.9  99:03.68 java
> >>    10 root  20   0       0      0     0 S   0.3  0.0 921:42.77 rcu_sched
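One follow-up on the suggestion above about verifying the delete: a minimal sketch of checking the topic's state with the 0.11 AdminClient, to confirm whether the deleted topic is really gone and which brokers still host its partitions. The broker address and topic name here are placeholders.

import java.util.Collections;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class TopicStateSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // A topic that was only "marked for deletion" will still show up here.
            Set<String> topics = admin.listTopics().names().get();
            System.out.println("topics known to the cluster: " + topics);

            // Describe the suspect topic to see which brokers lead and replicate
            // its partitions (i.e. whether they all sit on the overloaded node).
            TopicDescription desc = admin
                    .describeTopics(Collections.singletonList("my-new-topic"))   // placeholder
                    .all().get().get("my-new-topic");
            desc.partitions().forEach(p ->
                    System.out.printf("partition %d leader=%s replicas=%s%n",
                            p.partition(), p.leader(), p.replicas()));
        }
    }
}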