By the way, we may have found the issue...
On going through the thread dump, we found that 4-5 threads were blocked on
log4j's callAppenders() and 2-3 threads were in the IN_NATIVE state while
trying to write logs to disk. The network threads were therefore blocked on
the log4j threads, hanging the Kafka broker. Also, our log4j file had a
console appender that writes to STDOUT and STDERR, and we suspected the
threads were getting blocked while writing to stdout/stderr. So we removed
the console appender and started writing to a file instead. It has been
stable for quite some time (fingers crossed).

*Process limits:*
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            10485760     unlimited    bytes
Max core file size        0            unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             1031009      1031009      processes
Max open files            1024         1024         files
Max locked memory         65536        65536        bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       1031009      1031009      signals
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us

Arya

On Fri, Apr 11, 2014 at 5:50 AM, Joel Koshy <[email protected]> wrote:

> When you see this happening (on broker 4 in this instance), can you
> confirm the Kafka process handle limit?
>
> cat /proc/<pid>/limits
>
> On Thu, Apr 10, 2014 at 09:20:51AM +0530, Arya Ketan wrote:
> > *Issue:* Kafka cluster goes into an unresponsive state after some time,
> > with producers getting socket timeouts on every request made.
> >
> > *Kafka Version* - 0.8.1
> >
> > *Machines*: VMs, 2 cores, 8 GB RAM, Linux, 3-node cluster.
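For reference, the appender change described above amounts to something like
the following in log4j.properties (a sketch only; the appender name, file
path, and layout pattern here are illustrative, not our actual config):

```properties
# Sketch of the change: replace the ConsoleAppender with a rolling file
# appender so broker threads never block on STDOUT/STDERR.
log4j.rootLogger=INFO, kafkaAppender

log4j.appender.kafkaAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.kafkaAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.kafkaAppender.File=/var/log/kafka/server.log
log4j.appender.kafkaAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.kafkaAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
```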
> > ulimit -a
> > core file size          (blocks, -c) 0
> > data seg size           (kbytes, -d) unlimited
> > scheduling priority             (-e) 0
> > file size               (blocks, -f) unlimited
> > pending signals                 (-i) 16382
> > max locked memory       (kbytes, -l) 64
> > max memory size         (kbytes, -m) unlimited
> > open files                      (-n) 10240
> > pipe size            (512 bytes, -p) 8
> > POSIX message queues     (bytes, -q) 819200
> > real-time priority              (-r) 0
> > stack size              (kbytes, -s) 8192
> > cpu time               (seconds, -t) unlimited
> > max user processes              (-u) unlimited
> > virtual memory          (kbytes, -v) unlimited
> > file locks                      (-x) unlimited
> >
> > *Broker Properties:*
> >
> > 3 topics, 5 partitions, 5 I/O and 5 network threads.
> > socket.send.buffer.bytes=1048576
> > socket.receive.buffer.bytes=1048576
> > socket.request.max.bytes=104857600
> > zookeeper.connection.timeout.ms=1000000
> > No replication amongst brokers.
> >
> > *Producer Properties:*
> >
> > Sync producer,
> > required.ack = -1 (waits for ack from broker, most durable).
> > kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000
> > (requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400
> > (requested 102400).
> >
> > *NO Consumers*
> >
> > *QPS from producer to broker:*
> >
> > From 4 boxes, every 10-15 seconds, a batch of 200 messages is sent. Each
> > message size is about 600 bytes.
> >
> > From 1 box, every 2-3 seconds, a batch of 400 messages is sent. Each
> > message size is about 1 kb.
> >
> > *Exception seen on producer after 9-10 hours of seamless message sends:*
> >
> > kafka.common.FailedToSendMessageException: Failed to send messages after
> > 3 tries.
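Worth noting: the `ulimit -a` output above reflects a login shell, not
necessarily the limits the broker JVM was actually started with, which is
exactly the discrepancy `cat /proc/<pid>/limits` exposes. A quick sketch for
comparing the two (assumes the broker's main class is `kafka.Kafka`; it falls
back to the current shell's PID purely so the commands stay runnable when no
broker process is found):

```shell
# Resolve the broker's PID; fall back to this shell's own PID just so the
# commands below remain runnable when no broker process exists.
PID=$(pgrep -f kafka.Kafka 2>/dev/null | head -n 1)
PID=${PID:-$$}

grep 'Max open files' "/proc/$PID/limits"  # limit the process actually has
ls "/proc/$PID/fd" | wc -l                 # file descriptors currently open
ulimit -n                                  # soft limit of this shell only
```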
> > ERROR: kafka.producer.async.DefaultEventHandler - Failed to send requests
> > for topics user with correlation ids in [25617,25932]
> >
> > WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][]
> > kafka.producer.async.DefaultEventHandler - Failed to send producer request
> > with correlation id 25931 to broker 4 with data for partitions
> > [userservice,2]
> > java.net.SocketTimeoutException
> >     at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
> >     at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
> >     at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
> >     at kafka.utils.Utils$.read(Utils.scala:375)
> >
> > *No. of open FDs on one of the broker machines:* 999
> >
> > *Netstat of kafka broker:*
> >
> > Lots of sockets in CLOSE_WAIT with high receive-queue numbers.
> >
> > Sample:
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43628           CLOSE_WAIT
> > tcp6  198549      0 cp-analytics-kafka:9092 w3-web2.:52064           CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:60668           CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc4.w3-ops-:44909  CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:44970           CLOSE_WAIT
> > tcp6  209647      0 cp-analytics-kafka:9092 w3-web2.:54775           CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:35412           CLOSE_WAIT
> > tcp6  200731      0 cp-analytics-kafka:9092 w3-web2.:43435           CLOSE_WAIT
> > tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:51673           CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:59574           CLOSE_WAIT
> > tcp6   34133      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:36767  CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41807           CLOSE_WAIT
> > tcp6  208136      0 cp-analytics-kafka:9092 w3-web2.:33401           CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:43154  CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:42944           CLOSE_WAIT
> > tcp6  199587      0 cp-analytics-kafka:9092 w3-web2.:47651           CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41392           CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc6.w3-ops-:39080  CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43379           CLOSE_WAIT
> > tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:49943           CLOSE_WAIT
> > tcp6  186606      0 cp-analytics-kafka:9092 w3-web2.:40921           CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:32971  CLOSE_WAIT
> > tcp6  196856      0 cp-analytics-kafka:9092 w3-web2.:38064           CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc3.w3-ops-:46069  CLOSE_WAIT
> >
> > *On producer boxes:*
> >
> > Lots of sockets in FIN_WAIT1/2 with high send queues.
> > tcp6       0     44 user-svc6.w3.nm.f:38114 cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0     44 user-svc6.w3.nm.f:37938 cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0      0 user-svc6.w3.nm.f:59956 cp-analytics-kafka:9092  FIN_WAIT2
> > tcp6       0      0 user-svc6.w3.nm.f:60047 cp-analytics-kafka:9092  FIN_WAIT2
> > tcp6       0  60308 user-svc6.w3.nm.f:38010 cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0     44 user-svc6.w3.nm.f:33129 cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0      0 user-svc6.w3.nm.f:60143 cp-analytics-kafka:9092  ESTABLISHED
> > tcp6       0  32819 user-svc6.w3.nm.f:33198 cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0     44 user-svc6.w3.nm.f:33314 cp-analytics-kafka:9092  FIN_WAIT1
> >
> > tcp6       0 192584 w3-web2.:37797          cp-analytics-kafka:9092  ESTABLISHED
> >
> > tcp6       0      0 user-svc5.w3.nm.f:37790 cp-analytics-kafka:9092  FIN_WAIT2
> > tcp6       0  63018 user-svc5.w3.nm.f:46990 cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0  63018 user-svc5.w3.nm.f:47506 cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0      0 user-svc5.w3.nm.f:37928 cp-analytics-kafka:9092  FIN_WAIT2
> > tcp6       0     44 user-svc5.w3.nm.f:40756 cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0     44 user-svc5.w3.nm.f:47403 cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0  28894 user-svc5.w3.nm.f:40319 cp-analytics-kafka:9092  FIN_WAIT1
> > tcp6       0  28893 user-svc5.w3.nm.f:40921 cp-analytics-kafka:9092  ESTABLISHED
> > tcp6       0     44 user-svc5.w3.nm.f:46920 cp-analytics-kafka:9092  FIN_WAIT1
> >
> > *Broker Logs:*
> >
> > In server.log, mostly:
> > INFO [2014-04-10 03:01:42,230] [kafka-processor-9092-0][]
> > kafka.network.Processor - Closing socket connection to /10.65.40.137.
> >
> > Except on 1 machine:
> >
> > INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][]
> > kafka.network.Processor - Closing socket connection to /10.65.40.137.
> >
> > No logs in between. First fatal exception at 07:11.
> >
> > FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][]
> > kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error writing
> > to highwatermark file:
> > java.io.FileNotFoundException:
> > /var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many open files)
> >     at java.io.FileOutputStream.open(Native Method)
> >     at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
> >     at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
> >     at java.io.FileWriter.<init>(FileWriter.java:73)
> >     at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
> >     at kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447)
> >     at kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444)
> >     at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
> >     at kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
> >     at kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94)
> >     at kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
> >     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> >     at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> >     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
> >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >     at java.lang.Thread.run(Thread.java:662)
> > ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][] kafka.network.Acceptor -
> > Error in acceptor
> > java.io.IOException: Too many open files
> >     at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> >     at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
> >     at kafka.network.Acceptor.accept(SocketServer.scala:200)
> >     at kafka.network.Acceptor.run(SocketServer.scala:154)
> >     at java.lang.Thread.run(Thread.java:662)
> > ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][]
> > kafka.utils.KafkaScheduler - Uncaught exception in scheduled task
> > 'kafka-recovery-point-checkpoint'
> > java.io.FileNotFoundException:
> > /var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many
> > open files)
> >
> > *No EXCEPTIONS in any other logs on any other broker machine.*
> >
> > Attached is the thread dump from the leader.
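For anyone hitting the same symptoms, the CLOSE_WAIT build-up shown in the
netstat sample above is easy to watch for with a one-liner over netstat
output (broker port 9092 assumed; netstat flags vary slightly by distro):

```shell
# Count broker-side sockets per TCP state on port 9092. A steadily growing
# CLOSE_WAIT figure means peers have closed but the broker never has,
# leaking one file descriptor per socket until the FD limit is hit.
netstat -tan 2>/dev/null | awk '$4 ~ /:9092$/ { n[$6]++ } END { for (s in n) print s, n[s] }'
```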
