Btw, if you can share, I'd be curious what connectors you're using and why you need so many. I'd be interested in whether a modification to the connectors could also simplify things for you.
-Ewen

On Wed, Jul 26, 2017 at 12:33 AM, Ewen Cheslack-Postava <e...@confluent.io> wrote:

> Stephen,
>
> Cool, that is a *lot* of connectors!
>
> Regarding rebalances, the reason this happens is that Kafka Connect is trying to keep the total work of the cluster balanced across the workers. If you add or remove connectors, or the number of workers changes, then we need to go through another round of deciding where that work is done. This is accomplished by having the workers coordinate through Kafka's group coordination protocol by performing a rebalance. It is very similar to how consumer rebalances work -- the members all "rejoin" the group, one of them figures out how to assign the work, and then everyone gets their assignments and restarts work.
>
> The way this works today is global -- everyone has to stop work, commit offsets, then start the process where work is assigned, and finally restart work. That's why you're seeing everything stop, then restart.
>
> We know this will eventually become a scalability limit. We've talked about other approaches that avoid stopping everything. There isn't currently a JIRA with more details and ideas, but https://issues.apache.org/jira/browse/KAFKA-5505 is filed for the general issue. We haven't committed to any specific approach, but I've thought through this a bit and have some ideas around how we could make the process more incremental, so that we don't have to stop *everything* during a single rebalance, instead accepting the cost of some subsequent rebalances in order to make each iteration faster and cheaper.
>
> I'm not sure when we'll get these updates in yet. One other thing to consider is whether it is possible to use fewer connectors at a time. One of our goals was to encourage broad copying by default; fewer connectors/tasks doesn't necessarily solve your problem, but depending on the connectors you're using it might reduce the time spent stopping and starting tasks during the rebalance and alleviate the problem.
>
> -Ewen
>
> On Thu, Jul 20, 2017 at 8:01 AM, Stephen Durfey <sjdur...@gmail.com> wrote:
>
>> I'm seeing some behavior with the DistributedHerder that I am trying to understand. I'm setting up a cluster of Kafka Connect nodes and have a relatively large number of connectors to submit to it (392 connectors right now, soon to become over 1100). For deployment I am using Chef, which PUTs the connector configs at deployment time so I can create or update any connectors.
>>
>> Every time I PUT a new connector config to the worker, it appears to initiate an assignment rebalance. I believe this only happens when submitting a new connector. It causes all existing, running connectors to stop and restart. My logs end up flooded with exceptions from the source JDBC tasks about SQL connections being closed, and wakeup exceptions in my sink tasks when committing offsets. This causes issues beyond having to wait for a rebalance, because restarting the JDBC connectors causes them to re-pull all data, since they are using bulk mode. Everything eventually settles down and all the connectors finish successfully, but each PUT takes progressively longer waiting for a rebalance to finish.
>>
>> If I simply restart the worker nodes and let them only instantiate connectors that have already been successfully submitted, everything starts up fine.
>> So, this is only an issue when submitting new connectors over the REST endpoint.
>>
>> I'm trying to understand why submitting a new connector causes the rebalancing, but also whether there is a better way to deploy the connector configs in distributed mode.
>>
>> Thanks,
>>
>> Stephen
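For reference, here is a minimal sketch of the kind of deployment step described above: idempotently PUTting each connector config to the Connect REST API, where PUT /connectors/<name>/config creates the connector if it doesn't exist or updates it if it does. The worker URL and the one-JSON-file-per-connector layout are assumptions for illustration, not part of the original setup.

```python
# Minimal sketch: create/update connectors via the Kafka Connect REST API.
# PUT /connectors/<name>/config is idempotent: it creates the connector if
# missing, or updates its config if it already exists. Note that submitting
# a *new* connector triggers a cluster-wide rebalance, as discussed above.
import json
import pathlib

import requests

CONNECT_URL = "http://localhost:8083"  # hypothetical worker address


def put_connector(name: str, config: dict) -> None:
    """Create or update a single connector; raises on HTTP errors."""
    resp = requests.put(
        f"{CONNECT_URL}/connectors/{name}/config",
        headers={"Content-Type": "application/json"},
        data=json.dumps(config),
    )
    resp.raise_for_status()


if __name__ == "__main__":
    # Assumed layout: one JSON file per connector under configs/, e.g.
    # configs/my-jdbc-source.json containing the flat config map
    # ({"connector.class": ..., "mode": "bulk", ...}).
    for path in sorted(pathlib.Path("configs").glob("*.json")):
        put_connector(path.stem, json.loads(path.read_text()))
```

Because each new connector submission triggers a stop-the-world rebalance in the current protocol, a script like this will see each successive PUT wait longer, which matches the behavior described in the thread.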