[ https://issues.apache.org/jira/browse/KAFKA-10689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227077#comment-17227077 ]
A. Sophie Blee-Goldman commented on KAFKA-10689:
------------------------------------------------

It's a pretty obnoxious bug, since the application stays stuck in REBALANCING while StreamThreads slowly drop out of the group one-by-one as the current group leader gets stuck and a new rebalance has to be triggered. Meanwhile we don't log anything within this loop so it's impossible to know what happened based on the logs.

Ideally we would just limit the number of iterations and shut down the application if we can't seem to figure out the number of partitions for some reason. Unfortunately, given the random way that setRepartitionTopicMetadataNumberOfPartitions walks through the topology and the lack of a ceiling on topological cycles/complexity, it's not immediately obvious how (or if) we can pick a limit on the number of necessary iterations.

Still, we can probably improve the current situation and do better than just silently looping forever. One simple option would be to just start logging a warning once we're past some large iteration number. Another option is to keep track of the set of repartition topics whose partitions are still unknown, and if this set fails to change over one full iteration of the outer `topicGroups.values()` loop, then break out and shut down the application. This seems pretty airtight, although obviously a bit more complicated than just logging a warning at high iteration count. The logging is probably more than sufficient for a user to debug their application, but also a worse user experience.

> Assignor can't determine number of partitions on FJK with upstream windowed repartition
> ---------------------------------------------------------------------------------------
>
>                 Key: KAFKA-10689
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10689
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.5.0
>            Reporter: A. Sophie Blee-Goldman
>            Priority: Major
>             Fix For: 2.8.0, 2.7.1
>
>
> Due to a minor logical gap in how windowed repartition sink nodes are written
> to the topology, they are never added to the official map of sink topics
> tracked by the InternalTopologyBuilder. This makes it impossible to determine
> the number of partitions of downstream repartition topics in
> StreamsPartitionAssignor#setRepartitionTopicMetadataNumberOfPartitions,
> causing the assignor to loop infinitely in this method.
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
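
For illustration, the second option from the comment above (track the set of repartition topics whose partition counts are still unknown, and bail out if one full pass leaves that set unchanged) might look roughly like the sketch below. This is a simplified, hypothetical model and not the actual StreamsPartitionAssignor code: the `RepartitionResolver` class, the `resolve` method, and the `upstreams` map are invented names, and it assumes a repartition topic simply takes the maximum partition count among whichever of its upstream topics are already known.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model; NOT the real StreamsPartitionAssignor logic.
final class RepartitionResolver {

    /**
     * @param known     partition counts that are already known (e.g. source topics)
     * @param upstreams for each repartition topic, the topics that feed into it
     * @return partition counts for all topics, including the resolved repartition topics
     * @throws IllegalStateException if a full pass over the unresolved topics makes no progress
     */
    static Map<String, Integer> resolve(Map<String, Integer> known,
                                        Map<String, List<String>> upstreams) {
        Map<String, Integer> counts = new HashMap<>(known);
        Set<String> unresolved = new HashSet<>(upstreams.keySet());
        unresolved.removeAll(counts.keySet());

        while (!unresolved.isEmpty()) {
            // Snapshot of the unresolved set at the start of this pass.
            Set<String> before = new HashSet<>(unresolved);

            for (String topic : before) {
                // Assumption: use the max over upstream topics whose count is known.
                Integer max = null;
                for (String upstream : upstreams.get(topic)) {
                    Integer n = counts.get(upstream);
                    if (n != null && (max == null || n > max)) {
                        max = n;
                    }
                }
                if (max != null) {
                    counts.put(topic, max);
                    unresolved.remove(topic);
                }
            }

            // Topics are only ever removed, so an unchanged set means no progress:
            // fail loudly instead of silently looping forever.
            if (unresolved.equals(before)) {
                throw new IllegalStateException(
                    "Unable to determine number of partitions for repartition topics: "
                        + unresolved);
            }
        }
        return counts;
    }
}
```

In this sketch the exception message names the topics that could not be resolved, so a topic fed only by an untracked sink (the situation described in the issue) would show up in the error rather than causing a silent hang. The warning-at-high-iteration-count option could be layered on top by counting passes of the `while` loop.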