Frank,

> I don't think forcing the API users to introduce the nonce is desirable.

I agree. That is why the nonce is a workaround, and not a proper solution.
It's something to alleviate the symptoms in the short term until a bugfix &
upgrade can fix it.

> Have you had any ideas on how this can be implemented within Kafka Connect
> itself so that it works as expected for all users?

I have not looked into solutions in enough depth to recommend one. If I had,
the PR would be open :)

> We tried adding tasks to trigger a propagation of the task configs
> (increased from 36 to 40 tasks) however that did not unblock it. So
> triggering this code path did not seem to work:

You may be affected by _another_ bug, specific to MM2, which is preventing
tasks from being reconfigured:
https://issues.apache.org/jira/browse/KAFKA-10586
You can see evidence for this in the DistributedHerder ERROR log "Request to
leader to reconfigure connector tasks failed". A fix for this is in-flight
already.
It appears that Strimzi is using the kafka-mirror-maker.sh script, which is
affected:
https://github.com/strimzi/strimzi-kafka-operator/blob/97b48461d724a9c59505a9ad31b3d184476a83d7/docker-images/kafka-based/kafka/scripts/kafka_mirror_maker_run.sh#L121

> Are there any other (not overly-verbose) classes you recommend we enable
> DEBUG logging on

I think you've covered the interesting ones. You can also look and see if the
Mirror*Connector classes are behaving themselves, but it doesn't appear that
the reconfiguration code path has any relevant logs.

> Also, would making the inconsistent connectors (assuming they're being
> identified as such by Kafka Connect when this happens) through an API call
> also make sense so that this can be detected/monitored more easily?

Unless we have evidence that the config topic being in inconsistent state (B)
is a common problem, I don't think adding monitorability for it has a high
enough ROI to be implemented.
If you feel strongly about it, then you can consider opening a KIP to describe
the public interface changes and how those interfaces would be used for
monitoring.

Inconsistent state (A), however, seems very common. I've seen it in production
myself; it's implicated in KAFKA-9228 and KAFKA-10586, and it is clearly
causing disturbance to real users. Fixing the conditions which lead to state
(A) is what I'm most interested in seeing, and should be prioritized first
because it's what is going to have the highest ROI.

Right now you can find connectors in inconsistent state (A) with the following:
* You can hand-inspect the task configs with the
  `GET /connectors/{connector}/tasks-config` endpoint since 2.8.0. This does
  not work for dedicated MM2 (right now) for precisely the same reason that
  KAFKA-10586 occurs: the REST API isn't running.
* For mechanical alerting, you can read the config topic and track the offset
  for the most recent connector config and compare it with the offset for the
  most recent task config. This depends on internal implementation though, and
  isn't supported across releases.

I think a way of detecting state (A) via the REST API would be a valuable
addition that could get accepted, if someone is willing to do the legwork to
design, propose and implement it. It would be valuable even without any bugs
present, as the connectors have to transit through state (A) on each
reconfiguration.
We can look into this after getting some tactical fixes in place to avoid the
long-term state (A).

Thanks,
Greg Harris

On Wed, Feb 15, 2023 at 9:34 AM Frank Grimes <frankgrime...@yahoo.com.invalid> wrote:

> So we've just hit this issue again just with the MM2 connector and trying
> to add a new mirrored topic.
> We're running MirrorMaker 2 in Strimzi. i.e.
> "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector"
> We have 6 worker nodes.
> We changed the config to add a new mirror topic. i.e.
> append a new topic to the MirrorSourceConnector's "topics" config.
> The MM2 config topic reflects the change, as does viewing the config using
> Kowl UI.
> However, no tasks run to mirror the newly-added topic.
> We also do not see any updates on the MM2 status topic for the mirroring of
> that newly-added topic.
>
> We tried adding tasks to trigger a propagation of the task configs
> (increased from 36 to 40 tasks) however that did not unblock it.
> So triggering this code path did not seem to work:
> https://github.com/apache/kafka/blob/8cb0a5e9d3441962896b79163d141607e94d9b54/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1569-L1572
>
> Only restarting the workers seemed to unblock the propagation of the new
> task config for the new mirrored topic.
>
> Hopefully this can help us narrow things down a bit...
>
> In the meantime we've since enabled the following DEBUG logging in
> production to try to get more hints the next time this happens:
>
> log4j.logger.org.apache.kafka.connect.storage.KafkaConfigBackingStore: DEBUG
> log4j.logger.org.apache.kafka.connect.runtime.distributed.DistributedHerder: DEBUG
>
> Perhaps that will show us if it's at all related MM2 config topic
> compaction and/or connectors in inconsistent state.
>
> Are there any other (not overly-verbose) classes you recommend we enable
> DEBUG logging on?
>
> Also, would making the inconsistent connectors (assuming they're being
> identified as such by Kafka Connect when this happens) through an API call
> also make sense so that this can be detected/monitored more easily?
>
> Thanks!
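
For the mechanical alerting idea in the reply above (compare the offset of the
most recent connector config with the offset of the most recent task config),
a minimal sketch in Python might look like the following. It assumes the
config topic has already been read into (offset, key, value) tuples by a
consumer of your choice, and that keys use the "connector-<name>" and
"task-<name>-<index>" prefixes. Those prefixes are internal implementation
details, so treat them as an assumption that may not hold across releases.
The function name `find_stale_connectors` is hypothetical, and the sketch will
also flag connectors that legitimately have no task configs.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed key prefixes in the Connect config topic (internal, unsupported).
CONNECTOR_PREFIX = "connector-"
TASK_PREFIX = "task-"

Record = Tuple[int, str, Optional[bytes]]  # (offset, key, value)


def find_stale_connectors(records: Iterable[Record]) -> List[str]:
    """Return connectors whose newest connector config record has a higher
    offset than their newest task config record (candidate state (A))."""
    latest_connector: Dict[str, int] = {}
    latest_task: Dict[str, int] = {}
    for offset, key, value in records:
        if key.startswith(CONNECTOR_PREFIX):
            name = key[len(CONNECTOR_PREFIX):]
            if value is None:
                # Tombstone: the connector was deleted, so stop tracking it.
                latest_connector.pop(name, None)
                latest_task.pop(name, None)
            else:
                latest_connector[name] = max(offset, latest_connector.get(name, -1))
        elif key.startswith(TASK_PREFIX) and value is not None:
            # Task keys look like "task-<connector>-<index>"; drop the index.
            name, _, index = key[len(TASK_PREFIX):].rpartition("-")
            if name and index.isdigit():
                latest_task[name] = max(offset, latest_task.get(name, -1))
    return sorted(
        name
        for name, offset in latest_connector.items()
        if latest_task.get(name, -1) < offset
    )
```

Because connectors pass through state (A) briefly on every reconfiguration, an
alert built on this would probably want to fire only when a connector stays in
the returned list for longer than some grace period.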