System: jdk-7u76 (Oracle) on Ubuntu 14.04.1 LTS (Trusty Tahr), kernel 3.13.0-44-generic
Kafka Version: 0.8.1.2
Broker Cfg: Three brokers
Zookeeper Cfg: Three nodes
Replication Factor: 2
Client Machine Configuration:
- Memory: 4G
- CPU: 2 (Xeon @ 2.50GHz)
- Java args: -server -Xmx2G -Xms2G -XX:+HeapDumpOnOutOfMemoryError -Djava.awt.headless=true -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -Dorg.jboss.netty.epollBugWorkaround -Xrunhprof:cpu=samples,depth=30,file=/tmp/java.hprof

Folks,

I'd appreciate your help in understanding high CPU usage by our Kafka clients. We are running four simple Java consumers within the same Java process. One of these consumers uses ten streams (off 50 topics/partitions) consumed by ten worker threads; the remaining three consumers (each on four independent topics/partitions) draw one stream apiece, each fed into its own worker thread. The overall messaging traffic is very low (around two messages/sec). The process runs fine for a while, but after a few minutes of normal operation it escalates into a high-CPU mode.
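For context, each consumer is wired up via the standard 0.8 high-level consumer API, roughly like the sketch below (the ZooKeeper connect string, group id, and topic name are placeholders, not our real values; the full connector configuration is listed further down):

-----------------------------8<-----------------------------
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class ConsumerWiring {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181"); // placeholder
        props.put("group.id", "example-group");                       // placeholder
        // ... plus the connector configuration shown further down ...

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // Ask for ten streams on the topic; each stream is drained by one worker.
        Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
        topicCountMap.put("example-topic", 10); // placeholder topic

        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(topicCountMap);

        ExecutorService workers = Executors.newFixedThreadPool(10);
        for (final KafkaStream<byte[], byte[]> stream : streams.get("example-topic")) {
            workers.submit(new Runnable() {
                public void run() {
                    ConsumerIterator<byte[], byte[]> it = stream.iterator();
                    // hasNext() blocks until a message arrives (or throws
                    // ConsumerTimeoutException once consumer.timeout.ms elapses).
                    while (it.hasNext()) {
                        byte[] payload = it.next().message();
                        // ... process payload ...
                    }
                }
            });
        }
    }
}
-------------------------------->8------------------------------------------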
Following is a 10-minute hprof trace showing where the CPU is going:

-----------------------------8<-----------------------------
CPU SAMPLES BEGIN (total = 1426307) Fri May 1 18:34:37 2015
rank  self    accum   count    trace    method
1     50.26%  50.26%  716803   306229   sun.nio.ch.EPollArrayWrapper.epollWait (Kafka consumer; see TRACE 306229 below)
2     22.27%  72.53%  317632   302942   sun.nio.ch.EPollArrayWrapper.epollWait (Netty related; see TRACE 302942 below)
3     21.85%  94.38%  311678   302855   sun.nio.ch.EPollArrayWrapper.epollWait (ZooKeeper client; see TRACE 302855 below)
4     2.78%   97.16%  39709    302957   sun.nio.ch.EPollArrayWrapper.epollWait
5     2.78%   99.95%  39706    304668   sun.nio.ch.EPollArrayWrapper.epollWait

TRACE 306229:
    sun.nio.ch.EPollArrayWrapper.epollWait(EPollArrayWrapper.java:Unknown line)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:221)
    sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
    java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
    kafka.utils.Utils$.read(Utils.scala:375)
    kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
    kafka.network.Receive$class.readCompletely(Transmission.scala:56)
    kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
    kafka.network.BlockingChannel.receive(BlockingChannel.scala:100)
    kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:73)
    kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:71)
    kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:109)
    kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
    kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
    kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
    kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
    kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
    kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
    kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
    kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
    kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:96)
    kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
    kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)

TRACE 302855:
    sun.nio.ch.EPollArrayWrapper.epollWait(EPollArrayWrapper.java:Unknown line)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1134)

TRACE 302942:
    sun.nio.ch.EPollArrayWrapper.epollWait(EPollArrayWrapper.java:Unknown line)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    org.jboss.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:68)
    org.jboss.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:415)
    org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
    org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)
-------------------------------->8------------------------------------------

We are using the following connector configuration:

-------------------------8<-----------------------------
"socket.timeout.ms" = "60000"
"socket.receive.buffer.bytes" = "65536"
"fetch.message.max.bytes" = "1000000"
"auto.commit.enable" = "true"
"auto.commit.interval.ms" = "60000"
"queued.max.message.chunks" = "10"
"rebalance.max.retries" = "10"
"fetch.min.bytes" = "1"
"fetch.wait.max.ms" = "50000"
"rebalance.backoff.ms" = "2000"
"refresh.leader.backoff.ms" = "200"
"auto.offset.reset" = "smallest"
"consumer.timeout.ms" = "60000"
"zookeeper.session.timeout.ms" = "6000"
"zookeeper.connection.timeout.ms" = "6000"
"zookeeper.sync.time.ms" = "2000"
------------------------->8----------------------------
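For completeness, this is how those settings get handed to the consumer; a minimal sketch (again, the zookeeper.connect and group.id values are placeholders):

-------------------------8<-----------------------------
import java.util.Properties;
import kafka.consumer.ConsumerConfig;

public class ConnectorProps {
    static ConsumerConfig buildConfig() {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181"); // placeholder
        props.put("group.id", "example-group");                       // placeholder
        props.put("socket.timeout.ms", "60000");
        props.put("socket.receive.buffer.bytes", "65536");
        props.put("fetch.message.max.bytes", "1000000");
        props.put("auto.commit.enable", "true");
        props.put("auto.commit.interval.ms", "60000");
        props.put("queued.max.message.chunks", "10");
        props.put("rebalance.max.retries", "10");
        props.put("fetch.min.bytes", "1");
        // Broker holds each fetch open up to 50s waiting for fetch.min.bytes.
        props.put("fetch.wait.max.ms", "50000");
        props.put("rebalance.backoff.ms", "2000");
        props.put("refresh.leader.backoff.ms", "200");
        props.put("auto.offset.reset", "smallest");
        // With this set, ConsumerIterator.hasNext() throws
        // ConsumerTimeoutException after 60s without a message.
        props.put("consumer.timeout.ms", "60000");
        props.put("zookeeper.session.timeout.ms", "6000");
        props.put("zookeeper.connection.timeout.ms", "6000");
        props.put("zookeeper.sync.time.ms", "2000");
        return new ConsumerConfig(props);
    }
}
------------------------->8----------------------------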
My guess is that we are missing or misusing some configuration setting, which is driving up the CPU, but we have not been able to pinpoint it. I would appreciate any pointers.

Thanks a lot,
jsh