Hi Guozhang, I just have 1 producer client per producer machine. The producer is in a singleton scope.
Is there a possibility to close producer sockets by force, or to use a producer socket pool?

--
Arya

On Thu, Apr 10, 2014 at 11:38 AM, Guozhang Wang <[email protected]> wrote:

> Hello Arya,
>
> The broker seems dead due to too many open file handles, which are likely
> due to too many open sockets. How many producer clients do you have on
> these 5 machines, and could you check if there is any socket leak?
>
> Guozhang
>
>
> On Wed, Apr 9, 2014 at 8:50 PM, Arya Ketan <[email protected]> wrote:
>
> > *Issue:* Kafka cluster goes to an unresponsive state after some time,
> > with producers getting socket timeouts on every request made.
> >
> > *Kafka version* - 0.8.1
> >
> > *Machines*: VMs, 2 cores, 8 GB RAM, Linux, 3-node cluster.
> >
> > ulimit -a
> > core file size          (blocks, -c) 0
> > data seg size           (kbytes, -d) unlimited
> > scheduling priority             (-e) 0
> > file size               (blocks, -f) unlimited
> > pending signals                 (-i) 16382
> > max locked memory       (kbytes, -l) 64
> > max memory size         (kbytes, -m) unlimited
> > open files                      (-n) 10240
> > pipe size            (512 bytes, -p) 8
> > POSIX message queues     (bytes, -q) 819200
> > real-time priority              (-r) 0
> > stack size              (kbytes, -s) 8192
> > cpu time               (seconds, -t) unlimited
> > max user processes              (-u) unlimited
> > virtual memory          (kbytes, -v) unlimited
> > file locks                      (-x) unlimited
> >
> > *Broker properties:*
> >
> > 3 topics, 5 partitions, 5 I/O and network threads.
> > socket.send.buffer.bytes=1048576
> > socket.receive.buffer.bytes=1048576
> > socket.request.max.bytes=104857600
> > zookeeper.connection.timeout.ms=1000000
> > No replication amongst brokers.
> >
> > *Producer properties* -
> > sync producer,
> > request.required.acks = -1 (waits for ack from broker, most durable).
> > kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000
> > (requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400
> > (requested 102400).
> >
> > *NO consumers*
> >
> > *QPS from producer to broker* ..
> > From 4 boxes, every 10-15 seconds, a batch of 200 messages is sent. Each
> > message is about 600 bytes.
> >
> > From 1 box, every 2-3 seconds, a batch of 400 messages is sent. Each
> > message is about 1 KB.
> >
> > *Exception seen on producer after 9-10 hours of seamless message sending:*
> >
> > kafka.common.FailedToSendMessageException: Failed to send messages after 3
> > tries.
> >
> > ERROR kafka.producer.async.DefaultEventHandler - Failed to send requests
> > for topics user with correlation ids in [25617,25932]
> >
> > WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][]
> > kafka.producer.async.DefaultEventHandler - Failed to send producer request
> > with correlation id 25931 to broker 4 with data for partitions
> > [userservice,2]
> > java.net.SocketTimeoutException
> >     at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
> >     at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
> >     at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
> >     at kafka.utils.Utils$.read(Utils.scala:375)
> >
> > *No. of open FDs on one of the broker machines* - 999
> >
> > *Netstat on the kafka broker:*
> >
> > Lots of sockets in CLOSE_WAIT with high receive-queue numbers.
> >
> > Sample:
> > tcp6      45      0  cp-analytics-kafka:9092  w3-web2.:43628           CLOSE_WAIT
> > tcp6  198549      0  cp-analytics-kafka:9092  w3-web2.:52064           CLOSE_WAIT
> > tcp6      45      0  cp-analytics-kafka:9092  w3-web2.:60668           CLOSE_WAIT
> > tcp6      44      0  cp-analytics-kafka:9092  user-svc4.w3-ops-:44909  CLOSE_WAIT
> > tcp6      45      0  cp-analytics-kafka:9092  w3-web2.:44970           CLOSE_WAIT
> > tcp6  209647      0  cp-analytics-kafka:9092  w3-web2.:54775           CLOSE_WAIT
> > tcp6      45      0  cp-analytics-kafka:9092  w3-web2.:35412           CLOSE_WAIT
> > tcp6  200731      0  cp-analytics-kafka:9092  w3-web2.:43435           CLOSE_WAIT
> > tcp6  196401      0  cp-analytics-kafka:9092  w3-web2.:51673           CLOSE_WAIT
> > tcp6      45      0  cp-analytics-kafka:9092  w3-web2.:59574           CLOSE_WAIT
> > tcp6   34133      0  cp-analytics-kafka:9092  user-svc5.w3-ops-:36767  CLOSE_WAIT
> > tcp6      45      0  cp-analytics-kafka:9092  w3-web2.:41807           CLOSE_WAIT
> > tcp6  208136      0  cp-analytics-kafka:9092  w3-web2.:33401           CLOSE_WAIT
> > tcp6      44      0  cp-analytics-kafka:9092  user-svc5.w3-ops-:43154  CLOSE_WAIT
> > tcp6      45      0  cp-analytics-kafka:9092  w3-web2.:42944           CLOSE_WAIT
> > tcp6  199587      0  cp-analytics-kafka:9092  w3-web2.:47651           CLOSE_WAIT
> > tcp6      45      0  cp-analytics-kafka:9092  w3-web2.:41392           CLOSE_WAIT
> > tcp6      44      0  cp-analytics-kafka:9092  user-svc6.w3-ops-:39080  CLOSE_WAIT
> > tcp6      45      0  cp-analytics-kafka:9092  w3-web2.:43379           CLOSE_WAIT
> > tcp6  196401      0  cp-analytics-kafka:9092  w3-web2.:49943           CLOSE_WAIT
> > tcp6  186606      0  cp-analytics-kafka:9092  w3-web2.:40921           CLOSE_WAIT
> > tcp6      44      0  cp-analytics-kafka:9092  user-svc5.w3-ops-:32971  CLOSE_WAIT
> > tcp6  196856      0  cp-analytics-kafka:9092  w3-web2.:38064           CLOSE_WAIT
> > tcp6      44      0  cp-analytics-kafka:9092  user-svc3.w3-ops-:46069  CLOSE_WAIT
> >
> > *On producer boxes:*
> >
> > Lots of sockets in FIN_WAIT1/FIN_WAIT2 with high send queues.
> > tcp6       0     44  user-svc6.w3.nm.f:38114  cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0     44  user-svc6.w3.nm.f:37938  cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0      0  user-svc6.w3.nm.f:59956  cp-analytics-kafka:9092  FIN_WAIT2
> > tcp6       0      0  user-svc6.w3.nm.f:60047  cp-analytics-kafka:9092  FIN_WAIT2
> > tcp6       0  60308  user-svc6.w3.nm.f:38010  cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0     44  user-svc6.w3.nm.f:33129  cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0      0  user-svc6.w3.nm.f:60143  cp-analytics-kafka:9092  ESTABLISHED
> > tcp6       0  32819  user-svc6.w3.nm.f:33198  cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0     44  user-svc6.w3.nm.f:33314  cp-analytics-kafka:9092  FIN_WAIT1
> >
> > tcp6       0 192584  w3-web2.:37797           cp-analytics-kafka:9092  ESTABLISHED
> >
> > tcp6       0      0  user-svc5.w3.nm.f:37790  cp-analytics-kafka:9092  FIN_WAIT2
> > tcp6       0  63018  user-svc5.w3.nm.f:46990  cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0  63018  user-svc5.w3.nm.f:47506  cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0      0  user-svc5.w3.nm.f:37928  cp-analytics-kafka:9092  FIN_WAIT2
> > tcp6       0     44  user-svc5.w3.nm.f:40756  cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0     44  user-svc5.w3.nm.f:47403  cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0  28894  user-svc5.w3.nm.f:40319  cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0  28893  user-svc5.w3.nm.f:40921  cp-analytics-kafka:9092  ESTABLISHED
> > tcp6       0     44  user-svc5.w3.nm.f:46920  cp-analytics-kafka:9092  FIN_WAIT1
> >
> > *Broker logs:*
> >
> > In server.log, mostly:
> > INFO [2014-04-10 03:01:42,230] [kafka-processor-9092-0][]
> > kafka.network.Processor - Closing socket connection to /10.65.40.137.
> >
> > Except on 1 machine:
> >
> > INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][]
> > kafka.network.Processor - Closing socket connection to /10.65.40.137.
> >
> > No logs in between. First fatal exception at 07:11.
> >
> > FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][]
> > kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error writing
> > to highwatermark file:
> > java.io.FileNotFoundException:
> > /var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many open files)
> >     at java.io.FileOutputStream.open(Native Method)
> >     at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
> >     at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
> >     at java.io.FileWriter.<init>(FileWriter.java:73)
> >     at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
> >     at kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447)
> >     at kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444)
> >     at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
> >     at kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
> >     at kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94)
> >     at kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
> >     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> >     at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> >     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
> >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >     at java.lang.Thread.run(Thread.java:662)
> >
> > ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][] kafka.network.Acceptor
> > - Error in acceptor
> > java.io.IOException: Too many open files
> >     at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> >     at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
> >     at kafka.network.Acceptor.accept(SocketServer.scala:200)
> >     at kafka.network.Acceptor.run(SocketServer.scala:154)
> >     at java.lang.Thread.run(Thread.java:662)
> >
> > ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][]
> > kafka.utils.KafkaScheduler - Uncaught exception in scheduled task
> > 'kafka-recovery-point-checkpoint'
> > java.io.FileNotFoundException:
> > /var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many open files)
> >
> > *No EXCEPTIONS in any logs on any other broker machine.*
> >
> > Attached is the thread dump from the leader.
>
> --
> Guozhang
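[Editor's note] For anyone chasing a similar pile-up: a quick way to tally the connection states shown in the netstat dumps above is a small awk filter. This is a sketch; it assumes the usual `netstat -tan` column layout (protocol, recv-q, send-q, local address, foreign address, state) and the broker port 9092 from this thread.

```shell
# state_counts: tally the state column (6th field) of netstat-style input,
# restricted to lines mentioning the broker port 9092.
state_counts() {
  grep ':9092' | awk '{c[$6]++} END {for (s in c) print s, c[s]}'
}

# Usage: netstat -tan | state_counts
```

A large and growing CLOSE_WAIT count on the broker side, as in the dump above, generally means the peer has closed its end but the server process never called close() on its socket, so the file descriptor stays held.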
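[Editor's note] Since the fatal errors above are all file-descriptor exhaustion against the 10240 `ulimit -n`, here is a minimal sketch for watching a process's descriptor count over time. It assumes a Linux /proc filesystem; the pgrep pattern for finding the broker JVM is an assumption, not something from this thread.

```shell
# fd_count: print how many file descriptors a process currently holds.
# Compare against `ulimit -n` (10240 in this thread) to see how close
# the broker is to "Too many open files".
fd_count() {
  ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Example (the pgrep pattern for the broker JVM is hypothetical):
# fd_count "$(pgrep -f kafka.Kafka | head -n 1)"
```

Raising the limit (e.g. a `nofile` entry in /etc/security/limits.conf) only buys time if sockets are leaking; the CLOSE_WAIT pile-up itself still needs fixing on the application side.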
