Re: Kafka Connect ClusterConfigState.inconsistentConnectors() not handled by distributed Worker?

Greg Harris Wed, 15 Feb 2023 10:31:45 -0800

Frank,

> I don't think forcing the API users to introduce the nonce is desirable.

I agree. That is why the nonce is a workaround, and not a proper solution.
It's something to alleviate the symptoms in the short-term until a bugfix &
upgrade can fix it.

> Have you had any ideas on how this can be implemented within Kafka
> Connect itself so that it works as expected for all users?

I have not looked into solutions in enough depth to recommend one. If I
had, the PR would be open :)

> We tried adding tasks to trigger a propagation of the task configs
> (increased from 36 to 40 tasks) however that did not unblock it. So
> triggering this code path did not seem to work:

You may be affected by _another_ bug which is preventing tasks from being
reconfigured which is specific to MM2:
https://issues.apache.org/jira/browse/KAFKA-10586
You can see evidence for this in the DistributedHerder ERROR log "Request
to leader to reconfigure connector tasks failed". A fix for this is
in-flight already.
It appears that Strimzi is using the kafka-mirror-maker.sh script, which is
affected:
https://github.com/strimzi/strimzi-kafka-operator/blob/97b48461d724a9c59505a9ad31b3d184476a83d7/docker-images/kafka-based/kafka/scripts/kafka_mirror_maker_run.sh#L121

> Are there any other (not overly-verbose) classes you recommend we enable
> DEBUG logging on?

I think you've covered the interesting ones. You can also look and see if
the Mirror*Connector classes are behaving themselves, but it doesn't appear
that the reconfiguration code path has any relevant logs.

> Also, would making the inconsistent connectors (assuming they're being
> identified as such by Kafka Connect when this happens) available through an
> API call also make sense so that this can be detected/monitored more easily?

Unless we have evidence that the config topic being in inconsistent state
(B) is a common problem, I don't think adding monitorability for it has a
high enough ROI to be implemented.
If you feel strongly about it, then you can consider opening a KIP to
describe the public interface changes and how those interfaces would be
used for monitoring.

Inconsistent state (A), however, seems very common. I've seen it in
production myself, it's implicated in KAFKA-9228 and KAFKA-10586, and it is
clearly causing disturbance to real users.
Fixing the conditions which lead to state (A) is what I'm most interested
in seeing, and should be prioritized first because it's what is going to
have the highest ROI.

Right now you can find connectors in inconsistent state (A) with the
following:
* You can hand-inspect the task configs with the
`GET /connectors/{connector}/tasks-config` endpoint since 2.8.0. This does not
work for dedicated MM2 (right now) for precisely the same reason that
KAFKA-10586 occurs: the REST API isn't running.
* For mechanical alerting, you can read the config topic and track the
offset for the most recent connector config and compare it with the offset
for the most recent task config. This depends on internal implementation
though, and isn't supported across releases.
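
As a rough illustration of the second option, here is a minimal Python sketch
of the offset comparison. It assumes the config topic keys follow the current
internal layout (`connector-<name>`, `task-<name>-<index>`, `commit-<name>`),
which is not a public contract and may change between releases. The function
name `find_stale_connectors` and the `(offset, key, has_value)` record shape
are made up for this example, and reading the records from the topic with a
real consumer is left out.

```python
from typing import Dict, Iterable, List, Tuple

# Key prefixes assumed from Connect's internal config topic layout.
# These are implementation details, not a supported interface.
CONNECTOR_PREFIX = "connector-"
TASK_PREFIX = "task-"
COMMIT_PREFIX = "commit-"

# Hypothetical record shape for this sketch: (offset, key, has_value).
Record = Tuple[int, str, bool]


def find_stale_connectors(records: Iterable[Record]) -> List[str]:
    """Return connectors whose newest connector config record is newer than
    their newest task config or task commit record."""
    last_connector: Dict[str, int] = {}
    last_task: Dict[str, int] = {}

    for offset, key, has_value in records:
        if key.startswith(CONNECTOR_PREFIX):
            name = key[len(CONNECTOR_PREFIX):]
            if has_value:
                last_connector[name] = max(offset, last_connector.get(name, -1))
            else:
                # A tombstone means the connector was deleted; stop tracking it.
                last_connector.pop(name, None)
                last_task.pop(name, None)
        elif key.startswith(COMMIT_PREFIX):
            name = key[len(COMMIT_PREFIX):]
            last_task[name] = max(offset, last_task.get(name, -1))
        elif key.startswith(TASK_PREFIX):
            # Task keys end in "-<index>"; connector names may contain hyphens.
            name, _, index = key[len(TASK_PREFIX):].rpartition("-")
            if name and index.isdigit():
                last_task[name] = max(offset, last_task.get(name, -1))

    return sorted(
        name for name, offset in last_connector.items()
        if offset > last_task.get(name, -1)
    )
```

A connector is reported when its latest config record sits at a higher offset
than its latest task or commit record. Connectors pass through this state
briefly on every reconfiguration, so an alert built on it should probably only
fire when the condition persists. It may also over-report if a connector
config change results in no new task records being written.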

I think a way of detecting state (A) via the REST API would be a valuable
addition that could get accepted, if someone is willing to do the legwork
to design, propose and implement it.
It would be valuable even without any bugs present, as the connectors have
to transit through state (A) on each reconfiguration. We can look into this
after getting some tactical fixes in place to avoid the long-term state (A).

Thanks,
Greg Harris

On Wed, Feb 15, 2023 at 9:34 AM Frank Grimes
<frankgrime...@yahoo.com.invalid> wrote:

> So we've just hit this issue again just with the MM2 connector and trying
> to add a new mirrored topic.
> We're running MirrorMaker 2 in Strimzi, i.e.
> "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector"
> We have 6 worker nodes.
> We changed the config to add a new mirror topic, i.e. append a new
> topic to the MirrorSourceConnector's "topics" config.
> The MM2 config topic reflects the change, as does viewing the config using
> Kowl UI.
> However, no tasks run to mirror the newly-added topic.
> We also do not see any updates on the MM2 status topic for the mirroring of
> that newly-added topic.
> We tried adding tasks to trigger a propagation of the task configs
> (increased from 36 to 40 tasks) however that did not unblock it. So
> triggering this code path did not seem to work:
> https://github.com/apache/kafka/blob/8cb0a5e9d3441962896b79163d141607e94d9b54/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1569-L1572
> Only restarting the workers seemed to unblock the propagation of the new
> task config for the new mirrored topic.
> Hopefully this can help us narrow things down a bit...
> In the meantime we've since enabled the following DEBUG logging in
> production to try to get more hints the next time this happens:
> log4j.logger.org.apache.kafka.connect.storage.KafkaConfigBackingStore: DEBUG
> log4j.logger.org.apache.kafka.connect.runtime.distributed.DistributedHerder: DEBUG
> Perhaps that will show us if it's at all related to MM2 config topic
> compaction and/or connectors in inconsistent state.
> Are there any other (not overly-verbose) classes you recommend we enable
> DEBUG logging on?
> Also, would making the inconsistent connectors (assuming they're being
> identified as such by Kafka Connect when this happens) available through an
> API call also make sense so that this can be detected/monitored more easily?
> Thanks!
>
>
