[jira] [Commented] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
[ https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992119#comment-16992119 ] ASF GitHub Bot commented on KAFKA-9212:
---
hachikuji commented on pull request #7805: KAFKA-9212; Ensure LeaderAndIsr state updated in controller context during reassignment
URL: https://github.com/apache/kafka/pull/7805

> Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
> --
>
> Key: KAFKA-9212
> URL: https://issues.apache.org/jira/browse/KAFKA-9212
> Project: Kafka
> Issue Type: Bug
> Components: consumer, offset manager
> Affects Versions: 2.3.0, 2.3.1
> Environment: Linux
> Reporter: Yannick
> Assignee: Jason Gustafson
> Priority: Blocker
> Fix For: 2.4.0, 2.3.2
>
> When running the Kafka Connect S3 sink connector (Confluent 5.3.0), after one broker was restarted (the leaderEpoch was updated at this point), the Connect worker crashed with the following error:
>
> [2019-11-19 16:20:30,097] ERROR [Worker clientId=connect-1, groupId=connect-ls] Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder:253)
> org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by times in 30003ms
>
> After investigation, it appears the consumer got fenced while sending ListOffsetRequest in a loop and then timed out, as follows:
>
> [2019-11-19 16:20:30,020] DEBUG [Consumer clientId=consumer-3, groupId=connect-ls] Sending ListOffsetRequest (type=ListOffsetRequest, replicaId=-1, partitionTimestamps={connect_ls_config-0={timestamp: -1, maxNumOffsets: 1, currentLeaderEpoch: Optional[1]}}, isolationLevel=READ_UNCOMMITTED) to broker kafka6.fra2.internal:9092 (id: 4 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher:905)
> [2019-11-19 16:20:30,044] DEBUG [Consumer clientId=consumer-3, groupId=connect-ls] Attempt to fetch offsets for partition connect_ls_config-0 failed due to FENCED_LEADER_EPOCH, retrying. (org.apache.kafka.clients.consumer.internals.Fetcher:985)
>
> The above happens multiple times until the timeout.
>
> According to the debug logs, the consumer always gets a leaderEpoch of 1 for this topic when starting up:
>
> [2019-11-19 13:27:30,802] DEBUG [Consumer clientId=consumer-3, groupId=connect-ls] Updating last seen epoch from null to 1 for partition connect_ls_config-0 (org.apache.kafka.clients.Metadata:178)
>
> But according to our broker logs, the leaderEpoch should be 2, as follows:
>
> [2019-11-18 14:19:28,988] INFO [Partition connect_ls_config-0 broker=4] connect_ls_config-0 starts at Leader Epoch 2 from offset 22. Previous Leader Epoch was: 1 (kafka.cluster.Partition)
>
> This makes it impossible to restart the worker, as it will always get fenced and then finally time out.
>
> It is also impossible to consume with a 2.3 kafka-console-consumer, as follows:
>
> kafka-console-consumer --bootstrap-server BOOTSTRAPSERVER:9092 --topic connect_ls_config --from-beginning
>
> The above will just hang forever (which is not expected, because there is data), and we can see these debug messages:
>
> [2019-11-19 22:17:59,124] DEBUG [Consumer clientId=consumer-1, groupId=console-consumer-3844] Attempt to fetch offsets for partition connect_ls_config-0 failed due to FENCED_LEADER_EPOCH, retrying. (org.apache.kafka.clients.consumer.internals.Fetcher)
>
> Interesting fact: if we subscribe the same way with kafkacat (1.5.0), we can consume without problems (it must be that the way kafkacat consumes ignores FENCED_LEADER_EPOCH):
>
> kafkacat -b BOOTSTRAPSERVER:9092 -t connect_ls_config -o beginning
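For context, the lookup that fails here is an ordinary offset query. The following is a minimal sketch (not the Connect worker's actual code) of a consumer call that exercises the same ListOffsets path; the bootstrap address and topic are the placeholders from the report:

{code:java}
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class OffsetLookupRepro {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder address taken from the report above.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "BOOTSTRAPSERVER:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("connect_ls_config", 0);
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // endOffsets issues a ListOffsetRequest (timestamp -1, as in the DEBUG
            // line above) carrying the currentLeaderEpoch cached from the last
            // Metadata response. If that cached epoch is stale, the broker answers
            // FENCED_LEADER_EPOCH, the Fetcher refreshes metadata and retries, and
            // if the metadata never catches up the call ends in a TimeoutException.
            Map<TopicPartition, Long> end = consumer.endOffsets(Collections.singleton(tp));
            System.out.println(end);
        }
    }
}
{code}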
[jira] [Commented] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
[ https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991826#comment-16991826 ] ASF GitHub Bot commented on KAFKA-9212:
---
hachikuji commented on pull request #7805: KAFKA-9212; Ensure LeaderAndIsr state updated in controller context during reassignment
URL: https://github.com/apache/kafka/pull/7805

This is a cherry-pick of https://github.com/apache/kafka/commit/5d0cb1419cd1f1cdfb7bc04ed4760d5a0eae0aa1. The main differences are 1) leader epoch validation is unconditionally disabled, and 2) the test case has been refactored due to the absence of the reassignment admin APIs.
[jira] [Commented] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
[ https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991737#comment-16991737 ] ASF GitHub Bot commented on KAFKA-9212:
---
hachikuji commented on pull request #7800: KAFKA-9212; Ensure LeaderAndIsr state updated in controller context during reassignment (#7795)
URL: https://github.com/apache/kafka/pull/7800
[jira] [Commented] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
[ https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991075#comment-16991075 ] ASF GitHub Bot commented on KAFKA-9212:
---
ijuma commented on pull request #7800: KAFKA-9212; Ensure LeaderAndIsr state updated in controller context during reassignment (#7795)
URL: https://github.com/apache/kafka/pull/7800

KIP-320 improved fetch semantics by adding leader epoch validation. This relies on reliable propagation of leader epoch information from the controller. Unfortunately, we have encountered a bug during partition reassignment in which the leader epoch in the controller context does not get properly updated. This causes UpdateMetadata requests to be sent with stale epoch information, which results in the metadata caches on the brokers falling out of sync.

This bug has existed for a long time, but it is only a problem due to the new epoch validation done by the client. Because the client includes the stale leader epoch in its requests, the leader rejects them, yet the stale metadata cache on the brokers prevents the consumer from getting the latest epoch. Hence the consumer cannot make progress while a reassignment is ongoing.

Although it is straightforward to fix this problem in the controller for the new releases (which this patch does), it is not so easy to fix older brokers, which means new clients could still encounter brokers with this bug. To address this problem, this patch also modifies the client to treat the leader epoch returned from the Metadata response as "unreliable" if it comes from an older version of the protocol. In this case the client will discard the returned epoch, and it won't be included in any requests.

Also, note that the correct epoch is still forwarded to replicas in the LeaderAndIsr request, so this bug does not affect replication.

Reviewers: Jun Rao, Stanislav Kozlovski, Ismael Juma
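As a rough illustration of the client-side mitigation described above, here is a sketch using hypothetical names (these are not Kafka's actual classes, and the version threshold is an assumption about when Metadata responses began carrying reliably propagated epochs):

{code:java}
import java.util.Optional;

// Sketch of the idea: only trust a leader epoch from a Metadata response when
// the broker spoke a protocol version assumed to propagate epochs reliably;
// otherwise drop it, so later ListOffsets/Fetch requests omit the epoch and
// the broker skips the fencing check instead of answering FENCED_LEADER_EPOCH.
final class MetadataEpochGuard {
    // Hypothetical threshold: the first Metadata version treated as reliable.
    private static final short FIRST_RELIABLE_METADATA_VERSION = 9;

    static Optional<Integer> reliableEpoch(short usedMetadataVersion, int epochFromResponse) {
        if (usedMetadataVersion < FIRST_RELIABLE_METADATA_VERSION || epochFromResponse < 0) {
            return Optional.empty(); // discard a potentially stale or absent epoch
        }
        return Optional.of(epochFromResponse);
    }
}
{code}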
[jira] [Commented] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
[ https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16990971#comment-16990971 ] ASF GitHub Bot commented on KAFKA-9212:
---
ijuma commented on pull request #7795: KAFKA-9212; Ensure LeaderAndIsr state updated in controller context during reassignment
URL: https://github.com/apache/kafka/pull/7795
[jira] [Commented] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
[ https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16990283#comment-16990283 ] ASF GitHub Bot commented on KAFKA-9212:
---
hachikuji commented on pull request #7795: KAFKA-9212; Update LeaderAndIsr state in controller context after reassignment
URL: https://github.com/apache/kafka/pull/7795

KIP-320 improved fetch semantics by adding leader epoch validation. This relies on reliable propagation of leader epoch information from the controller. Unfortunately, we have encountered a bug during partition reassignment in which the leader epoch in the controller context does not get properly updated. This causes UpdateMetadata requests to be sent with stale epoch information, which results in the metadata caches on the brokers falling out of sync.

This bug has existed for a long time, but it is only a problem due to the new epoch validation done by the client. Because the client includes the stale leader epoch in its requests, the leader rejects them, but the stale metadata cache on the brokers prevents the consumer from getting the latest epoch. Hence the consumer cannot make progress while a reassignment is ongoing.

Although it is straightforward to fix this problem in the controller for the new releases (which is what this patch does), it is not so easy to fix older brokers, which means new clients could still encounter brokers with this bug. To address this problem, this patch also modifies the client to treat the leader epoch returned from the Metadata response as "unreliable" if it comes from an older version of the protocol. In this case the client will discard the returned epoch, and it won't be included in any requests.

Also, note that the correct epoch is still forwarded to replicas in the LeaderAndIsr request, so this bug does not affect replication.
[jira] [Commented] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
[ https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989491#comment-16989491 ] Jason Gustafson commented on KAFKA-9212:
[~mjasc...@twilio.com] Really appreciate the extra detail. I was able to reproduce this on trunk following your instructions. What I see is the controller sending a stale epoch in the UPDATE_METADATA request which follows the initiation of the reassignment. I will work on a patch to fix the controller, and I will try to make the case for including it in the 2.4.0 release. Note that fixing this does require a broker upgrade. Until a patch is available, probably the best option is to use the 2.2 or lower clients.
[jira] [Commented] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
[ https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989287#comment-16989287 ] Michael Jaschob commented on KAFKA-9212:
Chiming in here, I believe we've experienced the same error. I've been able to reproduce the behavior quite simply, as follows:
- 3-broker cluster (running Apache Kafka 2.3.1)
- one partition with replica assignment (0, 1, 2)
- booted a fourth broker (id 3)
- initiated partition reassignment from (0, 1, 2) to (0, 1, 2, 3) with a very low throttle (for testing)

As soon as the reassignment begins, a 2.3.0 console consumer simply hangs when started. A 1.1.1 consumer does not have any issues. I see this in the leader broker's request logs:
{code:java}
[2019-12-05 16:38:36,790] DEBUG Completed request:RequestHeader(apiKey=LIST_OFFSETS, apiVersion=5, clientId=consumer-1, correlationId=1529) -- {replica_id=-1,isolation_level=0,topics=[{topic=DataPlatform.CGSynthTests,partitions=[{partition=0,current_leader_epoch=0,timestamp=-1}]}]},response:{throttle_time_ms=0,responses=[{topic=DataPlatform.CGSynthTests,partition_responses=[{partition=0,error_code=74,timestamp=-1,offset=-1,leader_epoch=-1}]}]} from connection 172.22.15.67:9092-172.22.23.98:46974-9;totalTime:0.27,requestQueueTime:0.044,localTime:0.185,remoteTime:0.0,throttleTime:0.036,responseQueueTime:0.022,sendTime:0.025,securityProtocol:PLAINTEXT,principal:User:data-pipeline-monitor,listener:PLAINTEXT (kafka.request.logger)
{code}
Note the FENCED_LEADER_EPOCH error code (74) on the LIST_OFFSETS response, as in the original report. Once the reassignment completes, the 2.3.1 console consumer starts working. I've also tried a different reassignment, (0, 1, 2) -> (3, 1, 2), with the same results.

Where we stand right now is that we can't initiate partition reassignments in our production cluster without paralyzing a Spark application (which uses the 2.3.0 client libs under the hood). Downgrading the Kafka client libs there isn't possible since they are part of the Spark assembly.

Any pointers on what the issue might be here? I'm struggling to understand the bug, because it seems like any partition reassignment breaks LIST_OFFSETS requests from 2.3 clients, and that seems too severe a problem to have gone unnoticed for so long. Even ideas for a workaround would help here, since we don't see a path to doing partition reassignments without causing a production incident right now.
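As a side note on driving this repro: on 2.3 the reassignment had to be initiated via kafka-reassign-partitions.sh, since the reassignment admin APIs did not exist yet. On 2.4+ brokers, roughly the same reassignment can be initiated from the Admin API (KIP-455); a minimal sketch, with a placeholder bootstrap address and the throttle step omitted:

{code:java}
import java.util.Arrays;
import java.util.Collections;
import java.util.Optional;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class ReassignRepro {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // Grow the replica set from (0, 1, 2) to (0, 1, 2, 3), mirroring the
            // reassignment described above; while it is in flight, an affected 2.3
            // consumer's LIST_OFFSETS requests fail with error_code=74
            // (FENCED_LEADER_EPOCH) against unpatched brokers.
            TopicPartition tp = new TopicPartition("DataPlatform.CGSynthTests", 0);
            admin.alterPartitionReassignments(Collections.singletonMap(
                    tp, Optional.of(new NewPartitionReassignment(Arrays.asList(0, 1, 2, 3)))))
                .all().get();
        }
    }
}
{code}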
[jira] [Commented] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
[ https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16982412#comment-16982412 ] Yannick commented on KAFKA-9212:
Yeah, the topic is correctly replicated according to the metadata output from tools like kafkacat.

As of today, we have downgraded our clients to 2.2.1 to avoid being stuck in this fencing loop (the 2.3 client handles FENCED_LEADER_EPOCH). We restarted the 3 brokers (rolling restart) and still have discrepancies between those checkpoint files, as follows:

Broker ID 4:
cat /var/lib/kafka/logs/connect_ls_config-0/leader-epoch-checkpoint
0
2
0 0
6 22

Broker ID 1:
cat /var/lib/kafka/logs/connect_ls_config-0/leader-epoch-checkpoint
0
2
0 0
5 22

Broker ID 3:
cat /var/lib/kafka/logs/connect_ls_config-0/leader-epoch-checkpoint
0
1
0 0

Regarding the dump of this topic, here it is (there is just one .log file for all brokers; I cannot show the content using print-data-log as it might contain sensitive info):

Broker ID 1:
/opt/kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files /var/lib/kafka/logs/connect_ls_config-0/.log
Dumping /var/lib/kafka/logs/connect_ls_config-0/.log
Starting offset: 0
baseOffset: 0 lastOffset: 0 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 0 CreateTime: 1573660711038 size: 962 magic: 2 compresscodec: NONE crc: 1786879997 isvalid: true
baseOffset: 1 lastOffset: 1 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 962 CreateTime: 1573660712089 size: 1009 magic: 2 compresscodec: NONE crc: 1230182444 isvalid: true
baseOffset: 2 lastOffset: 3 count: 2 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 1971 CreateTime: 1573660712091 size: 1957 magic: 2 compresscodec: NONE crc: 2419651795 isvalid: true
baseOffset: 4 lastOffset: 4 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 3928 CreateTime: 1573660712611 size: 89 magic: 2 compresscodec: NONE crc: 3321423372 isvalid: true
baseOffset: 5 lastOffset: 5 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 4017 CreateTime: 1573751698440 size: 962 magic: 2 compresscodec: NONE crc: 704355531 isvalid: true
baseOffset: 6 lastOffset: 6 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 4979 CreateTime: 1573751699462 size: 1009 magic: 2 compresscodec: NONE crc: 1489459952 isvalid: true
baseOffset: 7 lastOffset: 8 count: 2 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 5988 CreateTime: 1573751699463 size: 1957 magic: 2 compresscodec: NONE crc: 657348671 isvalid: true
baseOffset: 9 lastOffset: 9 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 7945 CreateTime: 1573751699985 size: 89 magic: 2 compresscodec: NONE crc: 1825092385 isvalid: true
baseOffset: 10 lastOffset: 11 count: 2 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 8034 CreateTime: 1573828311242 size: 104 magic: 2 compresscodec: NONE crc: 3533917687 isvalid: true
baseOffset: 12 lastOffset: 12 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 8138 CreateTime: 1573828467292 size: 953 magic: 2 compresscodec: NONE crc: 232359935 isvalid: true
baseOffset: 13 lastOffset: 13 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 9091 CreateTime: 1573828467807 size: 1000 magic: 2 compresscodec: NONE crc: 1484213287 isvalid: true
baseOffset: 14 lastOffset: 15 count: 2 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 10091 CreateTime: 1573828467808 size: 1939 magic: 2 compresscodec: NONE crc: 49865436 isvalid: true
baseOffset: 16 lastOffset: 16 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 12030 CreateTime: 1573828468331 size: 94 magic: 2 compresscodec: NONE crc: 1480833250 isvalid: true
baseOffset: 17 lastOffset:
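For anyone inspecting these files themselves: a minimal reader sketch, assuming the standard leader-epoch-checkpoint layout (line 1 is the format version, line 2 is the entry count, and each following line is "<leaderEpoch> <startOffset>"):

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class EpochCheckpointReader {
    record EpochEntry(int epoch, long startOffset) {}

    static List<EpochEntry> read(Path checkpoint) throws IOException {
        List<String> lines = Files.readAllLines(checkpoint);
        int version = Integer.parseInt(lines.get(0).trim()); // expected: 0
        int count = Integer.parseInt(lines.get(1).trim());   // declared entry count
        List<EpochEntry> entries = new ArrayList<>(count);
        for (int i = 2; i < 2 + count; i++) {
            String[] parts = lines.get(i).trim().split("\\s+");
            entries.add(new EpochEntry(Integer.parseInt(parts[0]), Long.parseLong(parts[1])));
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // Under this layout, broker 4's file above parses to
        // [(epoch 0, offset 0), (epoch 6, offset 22)], while broker 3's parses to
        // [(epoch 0, offset 0)] -- i.e. the replicas disagree about epoch history.
        read(Path.of("/var/lib/kafka/logs/connect_ls_config-0/leader-epoch-checkpoint"))
                .forEach(System.out::println);
    }
}
{code}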
[jira] [Commented] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
[ https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16982018#comment-16982018 ] Jason Gustafson commented on KAFKA-9212:
[~Lambruschi] The output from the leader epoch checkpoints is curious. The contents should match for all replicas. Is replication working for this partition? I'd suggest using `bin/kafka-dump-log.sh` to dump the log contents for each `.log` file for that partition and see if they match. It would also be useful to see the broker config.
[jira] [Commented] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
[ https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979153#comment-16979153 ] Yannick commented on KAFKA-9212:
Here are the leader-epoch-checkpoint files on each broker (3 in total, with IDs 1, 3 and 4):

Broker ID 4 (the current partition leader during the issue):
cat /var/lib/kafka/logs/connect_ls_config-0/leader-epoch-checkpoint
0
2
0 0
2 22

Broker ID 1:
cat /var/lib/kafka/logs/connect_ls_config-0/leader-epoch-checkpoint
0
1
0 0

Broker ID 3:
cat /var/lib/kafka/logs/connect_ls_config-0/leader-epoch-checkpoint
0
1
0 0

And the config topic comes from the Kafka Connect worker's default creation (compacted topic):

Topic:connect_ls_config PartitionCount:1 ReplicationFactor:3 Configs:min.insync.replicas=2,cleanup.policy=compact,segment.bytes=1073741824,max.message.bytes=3000
[jira] [Commented] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
[ https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978305#comment-16978305 ] Yannick commented on KAFKA-9212:
As expected, when we use a 2.2.1 API client (kafka-console-consumer), this works fine, surely because it's ignoring FENCED_LEADER_EPOCH answers.