When you see this happening (on broker 4 in this instance), can you confirm the Kafka process handle limit?
cat /proc/<pid>/limits On Thu, Apr 10, 2014 at 09:20:51AM +0530, Arya Ketan wrote: > *Issue : *Kafka cluster goes to an un-responsive state after some time with > producers getting Socket time-outs on every request made. > > *Kafka Version* - 0.8.1 > > *Machines* : VMs , 2 cores 8gb RAM, linux , 3 node cluster. > > ulimit -a > core file size (blocks, -c) 0 > data seg size (kbytes, -d) unlimited > scheduling priority (-e) 0 > file size (blocks, -f) unlimited > pending signals (-i) 16382 > max locked memory (kbytes, -l) 64 > max memory size (kbytes, -m) unlimited > open files (-n) 10240 > pipe size (512 bytes, -p) 8 > POSIX message queues (bytes, -q) 819200 > real-time priority (-r) 0 > stack size (kbytes, -s) 8192 > cpu time (seconds, -t) unlimited > max user processes (-u) unlimited > virtual memory (kbytes, -v) unlimited > file locks (-x) unlimited > > *Broker Properties :* > > 3 topics, 5 partitions, 5 i/o , network threads. > socket.send.buffer.bytes=1048576 > socket.receive.buffer.bytes=1048576 > socket.request.max.bytes=104857600 > zookeeper.connection.timeout.ms=1000000 > No replication amongst brokers. > > *Producer Properties * - > sync producer, > required.ack = -1 ( Waits for ack from broker, most durable). > kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000 > (requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400 > (requested 102400). > > *NO Consumers* > > *QPS From producer to broker* .. > From 4 boxes, every 10-15 seconds, a batch of 200 messages are sent. Each > Message size is about 600 bytes > > From 1 box, every 2-3 seconds , a batch of 400 messages are sent. Each > message size is about 1 kb. > > *Exception seen on producer after 9-10 hours of seamless message send.* > > Exception > kafka.common.FailedToSendMessageException: Failed to send messages after 3 > tries. > > ERROR : kafka.producer.async.DefaultEventHandler - Failed to send requests > for topics user with correlation ids in [25617,25932] > > WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][] > kafka.producer.async.DefaultEventHandler - Failed to send producer request > with correlation id 25931 to broker 4 with data for partitions > [userservice,2] > java.net.SocketTimeoutException > at > sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201) > at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86) > at > java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221) > at kafka.utils.Utils$.read(Utils.scala:375) > > > *No of Open FDs in one of the broker machines*- 999 > > *Netstat of kafka broker :* > > Lot of sockets in CLOSE_WAIT with high receive queue nos. > > Sample : > tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:43628 CLOSE_WAIT > tcp6 198549 0 cp-analytics-kafka:9092 w3-web2.:52064 CLOSE_WAIT > tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:60668 CLOSE_WAIT > tcp6 44 0 cp-analytics-kafka:9092 user-svc4.w3-ops-:44909 > CLOSE_WAIT > tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:44970 CLOSE_WAIT > tcp6 209647 0 cp-analytics-kafka:9092 w3-web2.:54775 CLOSE_WAIT > tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:35412 CLOSE_WAIT > tcp6 200731 0 cp-analytics-kafka:9092 w3-web2.:43435 CLOSE_WAIT > tcp6 196401 0 cp-analytics-kafka:9092 w3-web2.:51673 CLOSE_WAIT > tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:59574 CLOSE_WAIT > tcp6 34133 0 cp-analytics-kafka:9092 user-svc5.w3-ops-:36767 > CLOSE_WAIT > tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:41807 CLOSE_WAIT > tcp6 208136 0 cp-analytics-kafka:9092 w3-web2.:33401 CLOSE_WAIT > tcp6 44 0 cp-analytics-kafka:9092 user-svc5.w3-ops-:43154 > CLOSE_WAIT > tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:42944 CLOSE_WAIT > tcp6 199587 0 cp-analytics-kafka:9092 w3-web2.:47651 CLOSE_WAIT > tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:41392 CLOSE_WAIT > tcp6 44 0 cp-analytics-kafka:9092 user-svc6.w3-ops-:39080 > CLOSE_WAIT > tcp6 45 0 cp-analytics-kafka:9092 w3-web2.:43379 CLOSE_WAIT > tcp6 196401 0 cp-analytics-kafka:9092 w3-web2.:49943 CLOSE_WAIT > tcp6 186606 0 cp-analytics-kafka:9092 w3-web2.:40921 CLOSE_WAIT > tcp6 44 0 cp-analytics-kafka:9092 user-svc5.w3-ops-:32971 > CLOSE_WAIT > tcp6 196856 0 cp-analytics-kafka:9092 w3-web2.:38064 CLOSE_WAIT > tcp6 44 0 cp-analytics-kafka:9092 user-svc3.w3-ops-:46069 > CLOSE_WAIT > > *On producer boxes :* > > Lot of FIN_WAIT 1 ,2 with high send queues. > tcp6 0 44 user-svc6.w3.nm.f:38114 cp-analytics-kafka:9092 > FIN_WAIT1 > tcp6 0 44 user-svc6.w3.nm.f:37938 cp-analytics-kafka:9092 > FIN_WAIT1 > tcp6 0 0 user-svc6.w3.nm.f:59956 cp-analytics-kafka:9092 > FIN_WAIT2 > tcp6 0 0 user-svc6.w3.nm.f:60047 cp-analytics-kafka:9092 > FIN_WAIT2 > tcp6 0 60308 user-svc6.w3.nm.f:38010 cp-analytics-kafka:9092 > FIN_WAIT1 > tcp6 0 44 user-svc6.w3.nm.f:33129 cp-analytics-kafka:9092 > FIN_WAIT1 > tcp6 0 0 user-svc6.w3.nm.f:60143 cp-analytics-kafka:9092 > ESTABLISHED > tcp6 0 32819 user-svc6.w3.nm.f:33198 cp-analytics-kafka:9092 > FIN_WAIT1 > tcp6 0 44 user-svc6.w3.nm.f:33314 cp-analytics-kafka:9092 > FIN_WAIT1 > > tcp6 0 192584 w3-web2.:37797 cp-analytics-kafka:9092 ESTABLISHED > > tcp6 0 0 user-svc5.w3.nm.f:37790 cp-analytics-kafka:9092 > FIN_WAIT2 > tcp6 0 63018 user-svc5.w3.nm.f:46990 cp-analytics-kafka:9092 > FIN_WAIT1 > tcp6 0 63018 user-svc5.w3.nm.f:47506 cp-analytics-kafka:9092 > FIN_WAIT1 > tcp6 0 0 user-svc5.w3.nm.f:37928 cp-analytics-kafka:9092 > FIN_WAIT2 > tcp6 0 44 user-svc5.w3.nm.f:40756 cp-analytics-kafka:9092 > FIN_WAIT1 > tcp6 0 44 user-svc5.w3.nm.f:47403 cp-analytics-kafka:9092 > FIN_WAIT1 > tcp6 0 28894 user-svc5.w3.nm.f:40319 cp-analytics-kafka:9092 > FIN_WAIT1 > tcp6 0 28893 user-svc5.w3.nm.f:40921 cp-analytics-kafka:9092 > ESTABLISHED > tcp6 0 44 user-svc5.w3.nm.f:46920 cp-analytics-kafka:9092 > FIN_WAIT1 > > > *Broker Logs : * > In server.log, mostly INFO [2014-04-10 03:01:42,230] > [kafka-processor-9092-0][] kafka.network.Processor - Closing socket > connection to /10.65.40.137. > > EXCEPT in 1 machine : > > INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][] > kafka.network.Processor - Closing socket connection to /10.65.40.137. > > No logs in-between. First fatal exception at 07.11. > > FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][] > kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error writing > to highwatermark file: > java.io.FileNotFoundException: > /var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many open > files) > at java.io.FileOutputStream.open(Native Method) > at java.io.FileOutputStream.<init>(FileOutputStream.java:194) > at java.io.FileOutputStream.<init>(FileOutputStream.java:145) > at java.io.FileWriter.<init>(FileWriter.java:73) > at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37) > at > kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447) > at > kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444) > at scala.collection.immutable.Map$Map1.foreach(Map.scala:105) > at > kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444) > at > kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94) > at kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > at > java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][] kafka.network.Acceptor - > Error in acceptor > java.io.IOException: Too many open files > at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) > at > sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152) > at kafka.network.Acceptor.accept(SocketServer.scala:200) > at kafka.network.Acceptor.run(SocketServer.scala:154) > at java.lang.Thread.run(Thread.java:662) > ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][] > kafka.utils.KafkaScheduler - Uncaught exception in scheduled task > 'kafka-recovery-point-checkpoint' > java.io.FileNotFoundException: > /var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many > open files) > > > *No EXCEPTIONS in any other logs in any other broker machine..* > > Attached is the thread dump from leader.
