Hmm, without knowing the client IP, it's hard to tell whether those are
from replica fetcher threads or not. Are most of those connections in the
ESTABLISHED state?
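One quick way to check (a generic sketch, not something from this thread; assumes the broker listens on port 9092 as in the lsof output quoted below) is to group the broker's connections by TCP state:

```shell
#!/bin/sh
# Count connections touching the broker port, grouped by TCP state.
# $4 is the local-address column of `netstat -nat`; $6 is the state.
netstat -nat | awk '$4 ~ /:9092$/ {print $6}' | sort | uniq -c | sort -rn
```

Note that the "can't identify protocol" fds will not show up here at all, since netstat only reads the /proc/net tables.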

Thanks,

Jun


On Tue, Jan 21, 2014 at 8:06 AM, Ahmy Yulrizka <a...@yulrizka.com> wrote:

> these are the lines I copied from the lsof output:
>
> ...
> java      11818      kafka   98u     sock                0,7       0t0  615628183 can't identify protocol
> java      11818      kafka   99u     IPv4          615077352       0t0  TCP somedomain.com:9092->121-123-123-123.someprovider.net:37547 (CLOSE_WAIT)
> java      11818      kafka  100u     IPv4          615077353       0t0  TCP somedomain.com:9092->121-123-123-123.someprovider.net:37553 (ESTABLISHED)
> java      11818      kafka  101u     sock                0,7       0t0  615628184 can't identify protocol
> java      11818      kafka  102u     sock                0,7       0t0  615628185 can't identify protocol
> java      11818      kafka  103u     sock                0,7       0t0  615628186 can't identify protocol
> ...
>
> As you can see from the output, I can see the connection state on some
> of the TCP entries, but the sock entries only give "can't identify
> protocol", so I cannot tell where those sockets originate or connect to.
>
> I also could not see those connections when I ran `netstat -nat`.
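That matches how "sock ... can't identify protocol" entries usually behave: the kernel has already torn the connection down, so it is gone from /proc/net/tcp (which is all netstat reads), but the process still holds the fd. One way to confirm this (a sketch, using inode 615628183 from the lsof output above) is to look the inode up in the protocol tables:

```shell
#!/bin/sh
# If an lsof "sock" inode appears in none of the /proc/net tables, the fd
# is an orphan: the socket is dead but was never close()d by the process.
INODE=615628183   # NODE column from the lsof output above
if grep -q "$INODE" /proc/net/tcp /proc/net/tcp6 /proc/net/udp 2>/dev/null; then
    echo "inode $INODE still has a protocol table entry"
else
    echo "inode $INODE not in any protocol table (orphaned fd)"
fi
```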
>
> --
> Ahmy Yulrizka
> http://ahmy.yulrizka.com
> @yulrizka
>
>
> On Tue, Jan 21, 2014 at 4:42 PM, Jun Rao <jun...@gmail.com> wrote:
>
> > What state are those sockets in (ESTABLISHED, CLOSE_WAIT, etc.)? Also,
> > from the IP, can you tell whether those sockets are from clients or from
> > the replica fetchers in the brokers?
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Tue, Jan 21, 2014 at 3:29 AM, Ahmy Yulrizka <a...@yulrizka.com> wrote:
> >
> > > We are running 3 Kafka nodes, which serve 4 partitions.
> > > We have been experiencing weird behavior during network outages.
> > >
> > > It has happened twice in the last couple of days. The previous outage
> > > took down the whole cluster, while in this one only 2 out of 3 nodes
> > > survived: 1 node became the leader of all partitions, and the other
> > > node is in the ISR of only 1 partition (out of 4).
> > >
> > > My best guess is that when the network went down, the brokers could
> > > not connect to each other to replicate and kept opening sockets
> > > without closing them. But I'm not entirely sure about this.
> > >
> > > Is there any way to mitigate the problem? Or are there any
> > > configuration options to stop this from happening again?
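(For anyone landing on this thread later: the usual stopgap, an assumption on my part rather than anything confirmed here, and not a fix for the leak itself, is to raise the broker's file-descriptor limit, e.g. in /etc/security/limits.conf, assuming the broker runs as user kafka:)

```
kafka  soft  nofile  65536
kafka  hard  nofile  65536
```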
> > >
> > >
> > > The java/kafka process opens too many socket file descriptors.
> > > Running `lsof -a -p 11818` yields thousands of lines like this:
> > >
> > > ...
> > > java    11818 kafka 3059u  sock                0,7       0t0 615637305 can't identify protocol
> > > java    11818 kafka 3060u  sock                0,7       0t0 615637306 can't identify protocol
> > > java    11818 kafka 3061u  sock                0,7       0t0 615637307 can't identify protocol
> > > java    11818 kafka 3062u  sock                0,7       0t0 615637308 can't identify protocol
> > > java    11818 kafka 3063u  sock                0,7       0t0 615637309 can't identify protocol
> > > java    11818 kafka 3064u  sock                0,7       0t0 615637310 can't identify protocol
> > > java    11818 kafka 3065u  sock                0,7       0t0 615637311 can't identify protocol
> > > ...
> > >
> > > I verified that the open sockets did not close when I repeated the
> > > command after 2 minutes.
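A quick way to watch a leak like this (a generic sketch; 11818 is the broker pid from the lsof command above) is to compare the process's open-fd count against its limit:

```shell
#!/bin/sh
# Compare how many fds a process holds against its "Max open files" limit.
# Pass the broker pid (11818 in this thread); defaults to the current shell.
PID=${1:-$$}
grep 'Max open files' "/proc/$PID/limits"
echo "open fds: $(ls "/proc/$PID/fd" | wc -l)"
```

If the count keeps climbing toward the limit while the sock entries pile up, the fds are leaking rather than being reused.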
> > >
> > >
> > > And the Kafka log on the broken node shows lots of errors like this:
> > >
> > > [2014-01-21 04:21:48,819]  64573925 [kafka-acceptor] ERROR kafka.network.Acceptor  - Error in acceptor
> > > java.io.IOException: Too many open files
> > >         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> > >         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:165)
> > >         at kafka.network.Acceptor.accept(SocketServer.scala:200)
> > >         at kafka.network.Acceptor.run(SocketServer.scala:154)
> > >         at java.lang.Thread.run(Thread.java:701)
> > > [2014-01-21 04:21:48,819]  64573925 [kafka-acceptor] ERROR kafka.network.Acceptor  - Error in acceptor
> > > java.io.IOException: Too many open files
> > >         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> > >         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:165)
> > >         at kafka.network.Acceptor.accept(SocketServer.scala:200)
> > >         at kafka.network.Acceptor.run(SocketServer.scala:154)
> > >         at java.lang.Thread.run(Thread.java:701)
> > > [2014-01-21 04:21:48,811]  64573917 [ReplicaFetcherThread-0-1] INFO  kafka.consumer.SimpleConsumer  - Reconnect due to socket error: null
> > > [2014-01-21 04:21:48,819]  64573925 [ReplicaFetcherThread-0-1] WARN  kafka.server.ReplicaFetcherThread  - [ReplicaFetcherThread-0-1], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 74930218; ClientId: ReplicaFetcherThread-0-1; ReplicaId: 2; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [some-topic,0] -> PartitionFetchInfo(959825,1048576),[some-topic,3] -> PartitionFetchInfo(551546,1048576)
> > > java.net.SocketException: Too many open files
> > >         at sun.nio.ch.Net.socket0(Native Method)
> > >         at sun.nio.ch.Net.socket(Net.java:156)
> > >         at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:102)
> > >         at sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:55)
> > >         at java.nio.channels.SocketChannel.open(SocketChannel.java:122)
> > >         at kafka.network.BlockingChannel.connect(BlockingChannel.scala:48)
> > >         at kafka.consumer.SimpleConsumer.connect(SimpleConsumer.scala:44)
> > >         at kafka.consumer.SimpleConsumer.reconnect(SimpleConsumer.scala:57)
> > >         at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:79)
> > >         at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:71)
> > >         at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:110)
> > >         at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:110)
> > >         at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:110)
> > >         at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
> > >         at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:109)
> > >         at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:109)
> > >         at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:109)
> > >         at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
> > >         at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:108)
> > >         at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:94)
> > >         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:86)
> > >         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
> > >
> > >
> > > --
> > > Ahmy Yulrizka
> > > http://ahmy.yulrizka.com
> > > @yulrizka
> > >
> >
>
