Stephen,

Cool, that is a *lot* of connectors!
Regarding rebalances, the reason this happens is that Kafka Connect is trying to keep the total work of the cluster balanced across the workers. If you add or remove connectors, or the number of workers changes, then we need to go through another round of deciding where that work is done. This is accomplished by having the workers coordinate through Kafka's group coordination protocol by performing a rebalance. It is very similar to how consumer rebalances work -- the members all "rejoin" the group, one of them figures out how to assign the work, and then everyone gets their assignments and restarts work. The way this works today is global -- everyone has to stop work, commit offsets, then go through the assignment process, and finally restart work. That's why you're seeing everything stop and then restart.

We know this will eventually become a scalability limit, and we've talked about other approaches that avoid stopping everything. There isn't currently a JIRA with more details and ideas, but https://issues.apache.org/jira/browse/KAFKA-5505 is filed for the general issue. We haven't committed to any specific approach, but I've thought this through a bit and have some ideas for making the process more incremental, so that we don't have to stop *everything* during a single rebalance and instead accept the cost of some subsequent rebalances to make each iteration faster/cheaper. I'm not sure yet when we'll get these updates in.

One other thing to consider is whether it is possible to use fewer connectors at a time. One of our goals was to encourage broad copying by default; fewer connectors/tasks doesn't necessarily solve your problem, but depending on the connectors you're using it could reduce the time spent stopping and starting tasks during each rebalance and alleviate the problem.

-Ewen

On Thu, Jul 20, 2017 at 8:01 AM, Stephen Durfey <sjdur...@gmail.com> wrote:

> I'm seeing some behavior with the DistributedHerder that I am trying to understand. I'm working on setting up a cluster of Kafka Connect nodes and have a relatively large number of connectors to submit to it (392 connectors right now that will soon become over 1100). For deployment I am using Chef, which PUTs the connector configs at deployment time so I can create or update any connectors.
>
> Every time I PUT a new connector config to the worker, it appears to initiate an assignment rebalance. I believe this only happens when submitting a new connector. It causes all existing and running connectors to stop and restart. My logs end up flooded with exceptions from the source JDBC tasks, with SQL connections being closed, and wakeup exceptions in my sink tasks when committing offsets. This causes issues beyond having to wait for a rebalance, since restarting the JDBC connectors causes them to re-pull all data because they are using bulk mode. Everything eventually settles down and all the connectors finish successfully, but each PUT takes progressively longer waiting for a rebalance to finish.
>
> If I simply restart the worker nodes and let them only instantiate connectors that have already been successfully submitted, everything starts up fine. So this is only an issue when submitting new connectors over the REST endpoint.
>
> So, I'm trying to understand why submitting a new connector causes the rebalancing, but also whether there is a better way to deploy the connector configs in distributed mode.
>
> Thanks,
>
> Stephen
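
[Editor's note: for readers landing on this thread later, below is a minimal sketch of the kind of idempotent deployment PUT described in the quoted message, using the Connect REST API's PUT /connectors/{name}/config endpoint (which creates the connector if it doesn't exist and updates it otherwise). The worker URL, connector name, and the JDBC settings are placeholders for illustration, not values from the thread.]

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConnectorDeploy {
    public static void main(String[] args) throws Exception {
        // Assumed worker REST endpoint and connector name (placeholders).
        String worker = "http://localhost:8083";
        String name = "example-jdbc-source";

        // Example config for a JDBC source in bulk mode; all values are
        // hypothetical and would come from your deployment tooling (e.g. Chef).
        String config = "{"
                + "\"connector.class\": \"io.confluent.connect.jdbc.JdbcSourceConnector\","
                + "\"connection.url\": \"jdbc:postgresql://db-host:5432/example\","
                + "\"mode\": \"bulk\","
                + "\"topic.prefix\": \"example-\","
                + "\"tasks.max\": \"1\""
                + "}";

        // PUT /connectors/{name}/config is idempotent: it creates the
        // connector if missing and updates it if it already exists.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(worker + "/connectors/" + name + "/config"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(config))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // A successful call that introduces a *new* connector name still
        // triggers a cluster-wide rebalance, as described in the reply above.
        System.out.println(response.statusCode() + " " + response.body());
    }
}

As the reply notes, each PUT that adds a brand-new connector name kicks off a full stop-the-world rebalance, so reducing the number of distinct connectors (or how often new ones are introduced) is what actually limits the stop/restart churn.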