[ 
https://issues.apache.org/jira/browse/KAFKA-10689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227077#comment-17227077
 ] 

A. Sophie Blee-Goldman commented on KAFKA-10689:
------------------------------------------------

It's a pretty obnoxious bug, since the application stays stuck in REBALANCING 
while StreamThreads slowly drop out of the group one by one as the current 
group leader gets stuck and a new rebalance has to be triggered. Meanwhile, we 
don't log anything within this loop, so it's impossible to know what happened 
based on the logs.

Ideally we would just limit the number of iterations and shut down the 
application if we can't seem to figure out the number of partitions for some 
reason. Unfortunately, given the random way that 
setRepartitionTopicMetadataNumberOfPartitions walks through the topology and 
the lack of a ceiling on topological cycles/complexity, it's not immediately 
obvious how (or if) we can pick a limit on the number of necessary iterations. 

Still, we can probably improve the current situation and do better than just 
silently looping forever. One simple option would be to just start logging a 
warning once we're past some large iteration number.

Another option is to keep track of the set of repartition topics whose 
partitions are still unknown, and if this set fails to change over one full 
iteration of the outer `topicGroups.values()` loop, then break out and shut 
down the application. This seems pretty airtight, although obviously a bit more 
complicated than just logging a warning at a high iteration count. Logging alone 
is probably sufficient for a user to debug their application, but it's a worse 
user experience than shutting the application down.
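
To make the second option concrete, here is a rough, self-contained sketch of 
the "no progress over one full pass" check. It is not the actual 
StreamsPartitionAssignor code: the class RepartitionResolver, the resolve 
method, and the simplified model (each repartition topic inherits the largest 
known partition count among its upstream topics) are all assumptions made for 
illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for the assignor's resolution loop.
public class RepartitionResolver {

    // upstreams: repartition topic -> topics it inherits its partition count from
    // known:     topics whose partition count is already known (e.g. source topics)
    static Map<String, Integer> resolve(Map<String, List<String>> upstreams,
                                        Map<String, Integer> known) {
        Map<String, Integer> counts = new HashMap<>(known);

        // Repartition topics whose partition count is still unknown.
        Set<String> unresolved = new HashSet<>(upstreams.keySet());
        unresolved.removeAll(counts.keySet());

        while (!unresolved.isEmpty()) {
            // Snapshot the unknown set before this pass so we can detect progress.
            Set<String> before = new HashSet<>(unresolved);

            for (String topic : before) {
                Integer max = null;
                for (String upstream : upstreams.get(topic)) {
                    Integer count = counts.get(upstream);
                    if (count != null && (max == null || count > max)) {
                        max = count;
                    }
                }
                // Resolve as soon as any upstream count is known (assumed rule).
                if (max != null) {
                    counts.put(topic, max);
                    unresolved.remove(topic);
                }
            }

            // A full pass that resolved nothing can never make progress later,
            // so fail loudly instead of looping forever.
            if (unresolved.equals(before)) {
                throw new IllegalStateException(
                    "Unable to determine number of partitions for: " + unresolved);
            }
        }
        return counts;
    }
}
```

Since each pass either shrinks the unknown set or throws, the loop runs at 
most once per repartition topic, which sidesteps the problem of picking an 
iteration limit up front.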

> Assignor can't determine number of partitions on FJK with upstream windowed 
> repartition
> ---------------------------------------------------------------------------------------
>
>                 Key: KAFKA-10689
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10689
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.5.0
>            Reporter: A. Sophie Blee-Goldman
>            Priority: Major
>             Fix For: 2.8.0, 2.7.1
>
>
> Due to a minor logical gap in how windowed repartition sink nodes are written 
> to the topology, they are never added to the official map of sink topics 
> tracked by the InternalTopologyBuilder. This makes it impossible to determine 
> the number of partitions of downstream repartition topics in 
> StreamsPartitionAssignor#setRepartitionTopicMetadataNumberOfPartitions, 
> causing the assignor to loop infinitely in this method. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
