Interesting - can you share those thread dumps? Also, your process file handle limit is extremely low (and different from what appears to be the default for your user id in your earlier message). You should bump it up to at least a few tens of thousands.
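For reference, a quick way to check and raise that limit on Linux; the user name, value, and pgrep pattern below are placeholders, not taken from this thread:

```shell
# Soft limit for open files in the current shell
ulimit -n

# Limit of the running broker process (it can differ from your shell's,
# e.g. if the broker was started by an init script). The pgrep pattern
# is a placeholder; match it to your actual launch command:
#   cat /proc/$(pgrep -f kafka.Kafka)/limits | grep 'open files'

# To raise it persistently, add lines like these to
# /etc/security/limits.conf (user and value are placeholders), then
# restart the broker from a fresh login session:
#   kafka  soft  nofile  32768
#   kafka  hard  nofile  32768
```

Note that the per-process limit in /proc/&lt;pid&gt;/limits is what matters to the broker, not what `ulimit -a` reports in an interactive shell.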
Joel

On Fri, Apr 11, 2014 at 08:13:34AM +0530, Arya Ketan wrote:
> By the way, we may have found the issue.
>
> Going through the thread dump, we found that 4-5 threads were blocked on
> log4j callAppenders and 2-3 threads were in IN_NATIVE state while trying
> to write logs to disk. The network threads were therefore blocked on the
> log4j threads, hanging the Kafka broker.
>
> Also, our log4j file had a console appender writing to STDOUT and STDERR,
> and we suspected the threads were getting blocked on those writes. So we
> removed the console appender and started writing to a file instead. It has
> been stable for quite some time (fingers crossed).
>
> *Process limits:*
> Max cpu time              unlimited    unlimited    seconds
> Max file size             unlimited    unlimited    bytes
> Max data size             unlimited    unlimited    bytes
> Max stack size            10485760     unlimited    bytes
> Max core file size        0            unlimited    bytes
> Max resident set          unlimited    unlimited    bytes
> Max processes             1031009      1031009      processes
> Max open files            1024         1024         files
> Max locked memory         65536        65536        bytes
> Max address space         unlimited    unlimited    bytes
> Max file locks            unlimited    unlimited    locks
> Max pending signals       1031009      1031009      signals
> Max msgqueue size         819200       819200       bytes
> Max nice priority         0            0
> Max realtime priority     0            0
> Max realtime timeout      unlimited    unlimited    us
>
> Arya
>
> On Fri, Apr 11, 2014 at 5:50 AM, Joel Koshy <[email protected]> wrote:
>
> > When you see this happening (on broker 4 in this instance), can you
> > confirm the Kafka process handle limit?
> >
> >     cat /proc/<pid>/limits
> >
> > On Thu, Apr 10, 2014 at 09:20:51AM +0530, Arya Ketan wrote:
> > > *Issue:* The Kafka cluster goes into an unresponsive state after some
> > > time, with producers getting socket timeouts on every request.
> > >
> > > *Kafka version:* 0.8.1
> > >
> > > *Machines:* VMs, 2 cores, 8 GB RAM, Linux, 3-node cluster.
> > > ulimit -a
> > > core file size          (blocks, -c) 0
> > > data seg size           (kbytes, -d) unlimited
> > > scheduling priority             (-e) 0
> > > file size               (blocks, -f) unlimited
> > > pending signals                 (-i) 16382
> > > max locked memory       (kbytes, -l) 64
> > > max memory size         (kbytes, -m) unlimited
> > > open files                      (-n) 10240
> > > pipe size            (512 bytes, -p) 8
> > > POSIX message queues     (bytes, -q) 819200
> > > real-time priority              (-r) 0
> > > stack size              (kbytes, -s) 8192
> > > cpu time               (seconds, -t) unlimited
> > > max user processes              (-u) unlimited
> > > virtual memory          (kbytes, -v) unlimited
> > > file locks                      (-x) unlimited
> > >
> > > *Broker properties:*
> > > 3 topics, 5 partitions, 5 I/O and network threads.
> > > socket.send.buffer.bytes=1048576
> > > socket.receive.buffer.bytes=1048576
> > > socket.request.max.bytes=104857600
> > > zookeeper.connection.timeout.ms=1000000
> > > No replication amongst brokers.
> > >
> > > *Producer properties:*
> > > Sync producer,
> > > required.ack = -1 (waits for ack from broker, most durable).
> > > kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000
> > > (requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400
> > > (requested 102400).
> > >
> > > *No consumers.*
> > >
> > > *QPS from producer to broker:*
> > > From 4 boxes, every 10-15 seconds, a batch of 200 messages is sent.
> > > Each message is about 600 bytes.
> > > From 1 box, every 2-3 seconds, a batch of 400 messages is sent.
> > > Each message is about 1 KB.
> > >
> > > *Exception seen on producer after 9-10 hours of seamless sends:*
> > > kafka.common.FailedToSendMessageException: Failed to send messages
> > > after 3 tries.
> > > ERROR: kafka.producer.async.DefaultEventHandler - Failed to send
> > > requests for topics user with correlation ids in [25617,25932]
> > >
> > > WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][]
> > > kafka.producer.async.DefaultEventHandler - Failed to send producer
> > > request with correlation id 25931 to broker 4 with data for partitions
> > > [userservice,2]
> > > java.net.SocketTimeoutException
> > >     at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
> > >     at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
> > >     at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
> > >     at kafka.utils.Utils$.read(Utils.scala:375)
> > >
> > > *Number of open FDs on one of the broker machines:* 999
> > >
> > > *netstat on the Kafka broker:*
> > >
> > > Lots of sockets in CLOSE_WAIT with high receive-queue values.
> > >
> > > Sample:
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43628          CLOSE_WAIT
> > > tcp6  198549      0 cp-analytics-kafka:9092 w3-web2.:52064          CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:60668          CLOSE_WAIT
> > > tcp6      44      0 cp-analytics-kafka:9092 user-svc4.w3-ops-:44909 CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:44970          CLOSE_WAIT
> > > tcp6  209647      0 cp-analytics-kafka:9092 w3-web2.:54775          CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:35412          CLOSE_WAIT
> > > tcp6  200731      0 cp-analytics-kafka:9092 w3-web2.:43435          CLOSE_WAIT
> > > tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:51673          CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:59574          CLOSE_WAIT
> > > tcp6   34133      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:36767 CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41807          CLOSE_WAIT
> > > tcp6  208136      0 cp-analytics-kafka:9092 w3-web2.:33401          CLOSE_WAIT
> > > tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:43154 CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:42944          CLOSE_WAIT
> > > tcp6  199587      0 cp-analytics-kafka:9092 w3-web2.:47651          CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41392          CLOSE_WAIT
> > > tcp6      44      0 cp-analytics-kafka:9092 user-svc6.w3-ops-:39080 CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43379          CLOSE_WAIT
> > > tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:49943          CLOSE_WAIT
> > > tcp6  186606      0 cp-analytics-kafka:9092 w3-web2.:40921          CLOSE_WAIT
> > > tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:32971 CLOSE_WAIT
> > > tcp6  196856      0 cp-analytics-kafka:9092 w3-web2.:38064          CLOSE_WAIT
> > > tcp6      44      0 cp-analytics-kafka:9092 user-svc3.w3-ops-:46069 CLOSE_WAIT
> > >
> > > *On the producer boxes:*
> > >
> > > Lots of sockets in FIN_WAIT1/FIN_WAIT2 with high send-queue values.
> > > tcp6       0     44 user-svc6.w3.nm.f:38114 cp-analytics-kafka:9092 FIN_WAIT1
> > > tcp6       0     44 user-svc6.w3.nm.f:37938 cp-analytics-kafka:9092 FIN_WAIT1
> > > tcp6       0      0 user-svc6.w3.nm.f:59956 cp-analytics-kafka:9092 FIN_WAIT2
> > > tcp6       0      0 user-svc6.w3.nm.f:60047 cp-analytics-kafka:9092 FIN_WAIT2
> > > tcp6       0  60308 user-svc6.w3.nm.f:38010 cp-analytics-kafka:9092 FIN_WAIT1
> > > tcp6       0     44 user-svc6.w3.nm.f:33129 cp-analytics-kafka:9092 FIN_WAIT1
> > > tcp6       0      0 user-svc6.w3.nm.f:60143 cp-analytics-kafka:9092 ESTABLISHED
> > > tcp6       0  32819 user-svc6.w3.nm.f:33198 cp-analytics-kafka:9092 FIN_WAIT1
> > > tcp6       0     44 user-svc6.w3.nm.f:33314 cp-analytics-kafka:9092 FIN_WAIT1
> > >
> > > tcp6       0 192584 w3-web2.:37797          cp-analytics-kafka:9092 ESTABLISHED
> > >
> > > tcp6       0      0 user-svc5.w3.nm.f:37790 cp-analytics-kafka:9092 FIN_WAIT2
> > > tcp6       0  63018 user-svc5.w3.nm.f:46990 cp-analytics-kafka:9092 FIN_WAIT1
> > > tcp6       0  63018 user-svc5.w3.nm.f:47506 cp-analytics-kafka:9092 FIN_WAIT1
> > > tcp6       0      0 user-svc5.w3.nm.f:37928 cp-analytics-kafka:9092 FIN_WAIT2
> > > tcp6       0     44 user-svc5.w3.nm.f:40756 cp-analytics-kafka:9092 FIN_WAIT1
> > > tcp6       0     44 user-svc5.w3.nm.f:47403 cp-analytics-kafka:9092 FIN_WAIT1
> > > tcp6       0  28894 user-svc5.w3.nm.f:40319 cp-analytics-kafka:9092 FIN_WAIT1
> > > tcp6       0  28893 user-svc5.w3.nm.f:40921 cp-analytics-kafka:9092 ESTABLISHED
> > > tcp6       0     44 user-svc5.w3.nm.f:46920 cp-analytics-kafka:9092 FIN_WAIT1
> > >
> > > *Broker logs:*
> > >
> > > In server.log, mostly:
> > > INFO [2014-04-10 03:01:42,230] [kafka-processor-9092-0][]
> > > kafka.network.Processor - Closing socket connection to /10.65.40.137.
> > >
> > > EXCEPT on 1 machine:
> > >
> > > INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][]
> > > kafka.network.Processor - Closing socket connection to /10.65.40.137.
> > >
> > > No logs in between. First fatal exception at 07:11:
> > >
> > > FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][]
> > > kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error
> > > writing to highwatermark file:
> > > java.io.FileNotFoundException:
> > > /var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many open files)
> > >     at java.io.FileOutputStream.open(Native Method)
> > >     at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
> > >     at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
> > >     at java.io.FileWriter.<init>(FileWriter.java:73)
> > >     at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
> > >     at kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447)
> > >     at kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444)
> > >     at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
> > >     at kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
> > >     at kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94)
> > >     at kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
> > >     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> > >     at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> > >     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> > >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> > >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
> > >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
> > >     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> > >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> > >     at java.lang.Thread.run(Thread.java:662)
> > > ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][] kafka.network.Acceptor -
> > > Error in acceptor
> > > java.io.IOException: Too many open files
> > >     at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> > >     at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
> > >     at kafka.network.Acceptor.accept(SocketServer.scala:200)
> > >     at kafka.network.Acceptor.run(SocketServer.scala:154)
> > >     at java.lang.Thread.run(Thread.java:662)
> > > ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][]
> > > kafka.utils.KafkaScheduler - Uncaught exception in scheduled task
> > > 'kafka-recovery-point-checkpoint'
> > > java.io.FileNotFoundException:
> > > /var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many open files)
> > >
> > > *No exceptions in any other logs on any other broker machine.*
> > >
> > > Attached is the thread dump from the leader.
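The checks reported in the thread above (open-FD count and CLOSE_WAIT backlog) can be scripted roughly as follows; the pid variable and the port are illustrative, not taken from the poster's setup:

```shell
#!/bin/sh
# Sketch of the diagnostics discussed above. BROKER_PID and port 9092
# are placeholders; point them at your actual broker process.
BROKER_PID=${BROKER_PID:-$$}   # falls back to this shell for illustration

# Number of file descriptors the process currently holds open
ls /proc/"$BROKER_PID"/fd | wc -l

# Sockets on the broker port stuck in CLOSE_WAIT (remote side closed,
# but the local process never called close() - a typical FD-leak symptom).
# ss is the modern replacement for the netstat output quoted above.
ss -tan state close-wait "( sport = :9092 )" 2>/dev/null | wc -l
```

Comparing the FD count against the `open files` value in /proc/&lt;pid&gt;/limits shows how close the process is to the "Too many open files" failure seen in the broker logs.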

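The workaround described in the first quoted message (dropping the ConsoleAppender in favor of a file appender) would look roughly like this in log4j 1.x properties; the appender name, file path, and layout here are illustrative sketches, not the poster's actual configuration:

```properties
# Before (suspected culprit): root logger writing to the console.
# If stdout is a pipe whose reader stalls, these writes can block inside
# the synchronized callAppenders call, matching the blocked threads above.
# log4j.rootLogger=INFO, stdout
# log4j.appender.stdout=org.apache.log4j.ConsoleAppender
# log4j.appender.stdout.layout=org.apache.log4j.PatternLayout

# After: write to a daily-rolling file instead (path is a placeholder)
log4j.rootLogger=INFO, kafkaAppender
log4j.appender.kafkaAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.kafkaAppender.DatePattern='.'yyyy-MM-dd
log4j.appender.kafkaAppender.File=/var/log/kafka/server.log
log4j.appender.kafkaAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.kafkaAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
```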