Is it possible that a max.poll.records set too high can cause this instability?
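Something like this is what I have in mind, a rough sketch only (assuming Streams 0.10.2 forwards consumer configs placed in the same Properties; the value 100 is a placeholder, not a recommendation):

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class PollRecordsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ABC");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092,kafka-2:9092");
        // Cap how many records one poll() hands to a stream thread, so the
        // gap between successive poll() calls stays short. 100 is a
        // placeholder value, not a recommendation.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);
        // ... build the topology and pass props to new KafkaStreams(...) as usual
    }
}

The idea: fewer records per poll() keeps the time between poll() calls short, so the consumer is less likely to exceed its poll deadline and drop out of the group.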
On Wed, Jul 12, 2017 at 8:43 AM, Pierre Coquentin <pierre.coquen...@gmail.com> wrote:

> It was on our test environment, and nothing was running when the incident
> occurred.
> In the server log we have a bunch of:
> [2017-07-11 11:52:15,330] WARN Attempting to send response via channel for
> which there is no open connection, connection id 0 (kafka.network.Processor)
> But the times don't match, so I don't know whether it's correlated or not.
>
> On Tue, Jul 11, 2017 at 1:08 PM, John Yost <hokiege...@gmail.com> wrote:
>
>> Hi Pierre,
>>
>> Do your brokers remain responsive? In other words, do you see any other
>> symptoms, such as decreased write or read throughput, that may indicate
>> long GC pauses or heavy load on your ZooKeeper cluster, as evidenced by
>> SocketTimeoutExceptions on the Kafka and/or ZooKeeper side?
>>
>> --John
>>
>> On Tue, Jul 11, 2017 at 6:15 AM, Pierre Coquentin <pierre.coquen...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > We are using Kafka 0.10.2 with 2 brokers and 2 application nodes, each
>> > running 6 consumers (all in one group). Recently both nodes were
>> > disconnected simultaneously and began retrying indefinitely to connect
>> > to the coordinator. For now, restarting the nodes solves the problem,
>> > but it happens again a few hours later.
>> > In the application log we see a lot of:
>> > 11.07.2017 06:47:08,905 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:631]
>> > Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for group ABC
>> > 11.07.2017 06:47:09,007 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:586]
>> > Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group ABC.
>> > 11.07.2017 06:47:09,008 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:420]
>> > (Re-)joining group ABC
>> > 11.07.2017 06:47:09,274 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:631]
>> > Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for group ABC
>> > 11.07.2017 06:47:09,375 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:586]
>> > Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group ABC.
>> > 11.07.2017 06:47:09,375 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:420]
>> > (Re-)joining group ABC
>> > 11.07.2017 06:47:10,820 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:631]
>> > Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for group ABC
>> > 11.07.2017 06:47:10,921 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:586]
>> > Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group ABC.
>> > 11.07.2017 06:47:10,922 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:420]
>> > (Re-)joining group ABC
>> >
>> > There is nothing in the logs of the brokers.
>> > We have no problem contacting the coordinator from either node. Could
>> > periodic network instability be what leads to these infinite retries?
>> > Could this problem be related to
>> > https://issues.apache.org/jira/browse/KAFKA-5464?
>> >
>> > Here is the configuration of the Streams application (most options are
>> > the defaults):
>> > application.id = ABC
>> > application.server =
>> > bootstrap.servers = [kafka-1:9092, kafka-2:9092]
>> > buffered.records.per.partition = 1000
>> > cache.max.bytes.buffering = 10485760
>> > client.id =
>> > commit.interval.ms = 30000
>> > connections.max.idle.ms = 540000
>> > key.serde = class org.apache.kafka.common.serialization.Serdes$StringSerde
>> > metadata.max.age.ms = 300000
>> > num.standby.replicas = 0
>> > num.stream.threads = 6
>> > partition.grouper = class org.apache.kafka.streams.processor.DefaultPartitionGrouper
>> > poll.ms = 100
>> > receive.buffer.bytes = 32768
>> > reconnect.backoff.ms = 50
>> > replication.factor = 1
>> > request.timeout.ms = 40000
>> > retry.backoff.ms = 100
>> > rocksdb.config.setter = null
>> > security.protocol = PLAINTEXT
>> > send.buffer.bytes = 131072
>> > state.cleanup.delay.ms = 60000
>> > state.dir = null
>> > timestamp.extractor = class org.apache.kafka.streams.processor.FailOnInvalidTimestamp
>> > value.serde = class com.sigfox.kafka.serde.AvroStreamRecordSerde
>> > windowstore.changelog.additional.retention.ms = 86400000
>> > zookeeper.connect = zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181
>> >
>> > Any thoughts?
>> > Regards,
>> >
>> > Pierre
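One more thought: the config above has request.timeout.ms = 40000, and the session/heartbeat settings look like the consumer defaults. If a smaller max.poll.records changes nothing, the timeouts behind the "Marking the coordinator ... dead" decision might be worth raising, to rule out slow processing or GC pauses. A rough sketch, values are placeholders only, not recommendations:

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class CoordinatorTimeoutSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ABC");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092,kafka-2:9092");
        // Give a slow or GC-paused member more time before the group
        // coordinator declares it dead. Placeholder value only.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30000);
        // Heartbeats must arrive well inside the session timeout.
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 3000);
        // Keep the client-side request timeout above the session timeout so
        // a JoinGroup request is not timed out locally first.
        props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);
        // ... build the topology and pass props to new KafkaStreams(...) as usual
    }
}

Note the ordering constraint: heartbeat.interval.ms well below session.timeout.ms, and request.timeout.ms above session.timeout.ms, otherwise the client can time out its own JoinGroup request and restart the rejoin loop.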