Stephen,

Cool, that is a *lot* of connectors!

Regarding rebalances, the reason this happens is that Kafka Connect is
trying to keep the total work of the cluster balanced across the workers.
If you add/remove connectors or the number of workers changes, we need to
go through another round of deciding where that work is done. This is
accomplished by having the workers coordinate through Kafka's group
coordination protocol and perform a rebalance. It is very similar to how
consumer rebalances work -- the members all "rejoin" the group, one of
them figures out how to assign the work, and then everyone gets their
assignments and restarts work.
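
To make that analogy concrete, here is a rough consumer-side sketch
(illustrative only -- this is not the herder's actual code, and the broker
address, group id, and topic name are just placeholders) of the same
revoke/commit and assign/resume cycle using ConsumerRebalanceListener:

    import java.util.Collection;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class RebalanceAnalogy {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // placeholder broker
            props.put("group.id", "rebalance-analogy");        // placeholder group id
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("example-topic"),
                new ConsumerRebalanceListener() {
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                        // Stop work and commit offsets before the group reassigns partitions.
                        consumer.commitSync();
                    }
                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                        // Resume work on whatever this member now owns.
                    }
                });

            while (true) {
                consumer.poll(100);  // polling keeps this member in the group
            }
        }
    }

The Connect workers go through essentially the same dance, just with
connector and task assignments instead of topic partitions.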

The way this works today is global -- everyone has to stop work, commit
offsets, then go through the assignment process, and finally restart work.
That's why you're seeing everything stop, then restart.

We know this will eventually become a scalability limit, and we've talked
about other approaches that avoid stopping everything. There isn't
currently a JIRA with more details & ideas, but
https://issues.apache.org/jira/browse/KAFKA-5505 is filed for the general
issue. We haven't committed to any specific approach, but I've thought
through this a bit and have some ideas for making the process more
incremental, so that we don't have to stop *everything* during a single
rebalance; instead we'd accept the cost of some follow-up rebalances in
order to make each iteration faster/cheaper.

I'm not sure yet when we'll get these updates in. One other thing to
consider is whether you can use fewer connectors at a time. One of our
goals was to encourage broad copying by default; fewer connectors/tasks
doesn't necessarily solve your problem, but depending on the connectors
you're using it may reduce the time spent stopping/starting tasks during
the rebalance and alleviate the problem.
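
For example, if many of those connectors are the Confluent JDBC source
pointed at different tables in the same database, a single connector
config along these lines (the connection URL, table names, and topic
prefix are placeholders -- adjust for your setup) can cover several
tables at once via table.whitelist:

    {
      "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
      "connection.url": "jdbc:postgresql://dbhost:5432/mydb",
      "mode": "bulk",
      "table.whitelist": "orders,customers,invoices",
      "topic.prefix": "mydb-",
      "tasks.max": "3"
    }

You'd PUT that to /connectors/<name>/config the same way you do today;
fewer connectors also means fewer config PUTs triggering rebalances
during a deploy.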

-Ewen

On Thu, Jul 20, 2017 at 8:01 AM, Stephen Durfey <sjdur...@gmail.com> wrote:

> I'm seeing some behavior with the DistributedHerder that I am trying to
> understand. I'm working on setting up a cluster of Kafka Connect nodes and
> have a relatively large number of connectors to submit to it (392
> connectors right now that will soon become over 1100). For deployment I am
> using Chef, and having it PUT connector configs at deployment time so I can
> create/update any connectors.
>
> Every time I PUT a new connector config to the worker, it appears to
> initiate an assignment rebalance. I believe this is only happening when
> submitting a new connector. This is causing all existing and running
> connectors to stop and restart. My logs end up being flooded with
> exceptions from the source jdbc task with sql connections being closed and
> wakeup exceptions in my sink tasks when committing offsets. This causes
> issues beyond having to wait for a rebalance as restarting the jdbc
> connectors causes them to re-pull all data, since they are using bulk mode.
> Everything eventually settles down and all the connectors finish
> successfully, but each PUT takes progressively longer waiting for a
> rebalance to finish.
>
> If I simply restart the worker nodes and let them only instantiate
> connectors that have already been successfully submitted everything starts
> up fine. So, this is only an issue when submitting new connectors over the
> REST endpoint.
>
> So, I'm trying to understand why submitting a new connector causes the
> rebalancing, but also if there is a better way to deploy the connector
> configs in distributed mode?
>
> Thanks,
>
> Stephen
>
