Is it possible that a max.poll.records value that is too high can cause this instability?
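
In case you want to experiment with that, here is a minimal sketch (the class name and the values are placeholders, not recommendations) of how max.poll.records, together with max.poll.interval.ms, could be overridden for the consumers embedded in a Streams application; consumer-level keys set on the Streams properties are passed through to the internal consumers:

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class PollTuningSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ABC");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092,kafka-2:9092");

        // Placeholder value: cap how many records a single poll() returns, so one
        // batch cannot take longer to process than the poll interval allows.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);

        // Placeholder value: maximum time allowed between poll() calls before the
        // consumer is considered failed and the group rebalances.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000);

        // Streams forwards consumer-level keys set here to its embedded consumers.
        StreamsConfig config = new StreamsConfig(props);
        System.out.println(config.originals());
    }
}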

On Wed, Jul 12, 2017 at 8:43 AM, Pierre Coquentin <
pierre.coquen...@gmail.com> wrote:

> It was on our test environment and nothing was running when the incident
> occurred.
> In the server log we have a bunch of:
> [2017-07-11 11:52:15,330] WARN Attempting to send response via channel for which there is no open connection, connection id 0 (kafka.network.Processor)
> But the timestamps don't match, so I don't know whether it's correlated or not.
>
> On Tue, Jul 11, 2017 at 1:08 PM, John Yost <hokiege...@gmail.com> wrote:
>
>> Hi Pierre,
>>
>> Do your brokers remain responsive? In other words, do you see any other
>> symptoms, such as decreased write or read throughput, that may indicate
>> long GC pauses or heavy load on your ZooKeeper cluster, as evidenced by
>> SocketTimeoutExceptions on the Kafka and/or ZooKeeper side?
>>
>> --John
>>
>> On Tue, Jul 11, 2017 at 6:15 AM, Pierre Coquentin <
>> pierre.coquen...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > We are using Kafka 0.10.2 with 2 brokers and 2 application nodes, each
>> > running 6 consumers (all in one group). Recently both nodes were
>> > disconnected simultaneously and retried connecting to the coordinator
>> > indefinitely. For now, restarting the nodes solves the problem, but it
>> > recurs a few hours later.
>> > In the application log we see a lot of:
>> > 11.07.2017 06:47:08,905 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:631] Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for group ABC
>> > 11.07.2017 06:47:09,007 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:586] Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group ABC.
>> > 11.07.2017 06:47:09,008 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:420] (Re-)joining group ABC
>> > 11.07.2017 06:47:09,274 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:631] Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for group ABC
>> > 11.07.2017 06:47:09,375 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:586] Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group ABC.
>> > 11.07.2017 06:47:09,375 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:420] (Re-)joining group ABC
>> > 11.07.2017 06:47:10,820 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:631] Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for group ABC
>> > 11.07.2017 06:47:10,921 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:586] Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group ABC.
>> > 11.07.2017 06:47:10,922 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:420] (Re-)joining group ABC
>> >
>> > There is nothing in the brokers' logs.
>> > We have no problem contacting the coordinator from either node. Could
>> > periodic network instability be causing these infinite retries?
>> > Could this problem be related to
>> > https://issues.apache.org/jira/browse/KAFKA-5464 ?
>> >
>> > Here is the configuration of the Stream (most options are the defaults):
>> >         application.id = ABC
>> >         application.server =
>> >         bootstrap.servers = [kafka-1:9092, kafka-2:9092]
>> >         buffered.records.per.partition = 1000
>> >         cache.max.bytes.buffering = 10485760
>> >         client.id =
>> >         commit.interval.ms = 30000
>> >         connections.max.idle.ms = 540000
>> >         key.serde = class org.apache.kafka.common.serialization.Serdes$StringSerde
>> >         metadata.max.age.ms = 300000
>> >         num.standby.replicas = 0
>> >         num.stream.threads = 6
>> >         partition.grouper = class org.apache.kafka.streams.processor.DefaultPartitionGrouper
>> >         poll.ms = 100
>> >         receive.buffer.bytes = 32768
>> >         reconnect.backoff.ms = 50
>> >         replication.factor = 1
>> >         request.timeout.ms = 40000
>> >         retry.backoff.ms = 100
>> >         rocksdb.config.setter = null
>> >         security.protocol = PLAINTEXT
>> >         send.buffer.bytes = 131072
>> >         state.cleanup.delay.ms = 60000
>> >         state.dir = null
>> >         timestamp.extractor = class org.apache.kafka.streams.processor.FailOnInvalidTimestamp
>> >         value.serde = class com.sigfox.kafka.serde.AvroStreamRecordSerde
>> >         windowstore.changelog.additional.retention.ms = 86400000
>> >         zookeeper.connect = zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181
>> >
>> >
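
For the discussion, here is a minimal sketch (class name and values are placeholders, not tuned recommendations) of how the coordinator-related timeouts of the embedded consumers could be raised on top of the configuration listed above; whether raising them actually stops the dead/rediscovered loop is uncertain:

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class CoordinatorTimeoutSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ABC");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092,kafka-2:9092");

        // The listing above shows request.timeout.ms = 40000; a timed-out request to
        // the coordinator is one reason the client marks it dead. Placeholder value.
        props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);

        // Session timeout and heartbeat interval control how quickly the group
        // coordinator declares a member gone; placeholder values for illustration.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30000);
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 10000);

        // Constructing the StreamsConfig validates the combined settings.
        new StreamsConfig(props);
    }
}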
>> > Any thoughts?
>> > Regards,
>> >
>> > Pierre
>> >
>>
>
>
