Btw, if you can share, I would be curious which connectors you're using and
why you need so many. I'd be interested in whether a modification to the
connectors could also simplify things for you.

-Ewen

On Wed, Jul 26, 2017 at 12:33 AM, Ewen Cheslack-Postava <e...@confluent.io>
wrote:

> Stephen,
>
> Cool, that is a *lot* of connectors!
>
> Regarding rebalances, the reason this happens is that Kafka Connect is
> trying to keep the total work of the cluster balanced across the workers.
> If you add/remove connectors or the number of workers changes, then we
> need to go through another round of deciding where that work is done. The
> way this is accomplished is by having the workers coordinate through
> Kafka's group coordination protocol by performing a rebalance. This is
> very similar to how consumer rebalances work -- the members all "rejoin"
> the group, one member figures out how to assign the work, and then
> everyone gets their assignments and restarts work.
>
> The way this works today is global -- everyone has to stop work, commit
> offsets, then start the process where work is assigned, and finally restart
> work. That's why you're seeing everything stop, then restart.
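>
> You can actually watch this happen through the status REST API. A minimal
> sketch (Python with the requests library; the worker URL and connector
> name are placeholders) that polls a connector's status -- around a
> rebalance you may see the task states change (e.g. to UNASSIGNED) before
> they come back as RUNNING:
>
>     import time
>     import requests  # third-party HTTP client
>
>     CONNECT_URL = "http://localhost:8083"   # assumed worker address
>     CONNECTOR = "my-jdbc-source"            # placeholder connector name
>
>     while True:
>         status = requests.get(
>             "%s/connectors/%s/status" % (CONNECT_URL, CONNECTOR)).json()
>         # Print the connector state plus each task's state.
>         tasks = ", ".join("%s=%s" % (t["id"], t["state"])
>                           for t in status["tasks"])
>         print(status["connector"]["state"], tasks)
>         time.sleep(2)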
>
> We know this will eventually become a scalability limit. We've talked
> about other approaches that avoid stopping everything. There isn't
> currently a JIRA with more details and ideas, but
> https://issues.apache.org/jira/browse/KAFKA-5505 is filed for the general
> issue. We haven't committed to any specific approach, but I've thought
> through this a bit and have some ideas for making the process more
> incremental, so that we don't have to stop *everything* during a single
> rebalance, instead accepting the cost of some subsequent rebalances in
> order to make each iteration faster/cheaper.
>
> I'm not sure when we'll get these updates in yet. One other thing to
> consider is whether it is possible to use fewer connectors at a time. One
> of our goals was to encourage broad copying by default; fewer
> connectors/tasks doesn't necessarily solve your problem, but depending on
> the connectors you're using it could reduce the time spent
> stopping/starting tasks during each rebalance and alleviate the issue.
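>
> For example, if those are per-table JDBC source connectors, it may be
> possible to collapse many of them into a single connector that covers
> several tables. A rough sketch of such a config, assuming the Confluent
> JDBC source connector (connection details and table/topic names are
> placeholders), expressed as the Python dict you would submit to the REST
> API:
>
>     # One JDBC source connector covering several tables instead of one
>     # connector per table.
>     consolidated_config = {
>         "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
>         "connection.url": "jdbc:postgresql://db.example.com/mydb",
>         "mode": "bulk",
>         "table.whitelist": "orders,customers,line_items",
>         "topic.prefix": "mydb-",
>         "tasks.max": "8",
>     }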
>
> -Ewen
>
> On Thu, Jul 20, 2017 at 8:01 AM, Stephen Durfey <sjdur...@gmail.com>
> wrote:
>
>> I'm seeing some behavior with the DistributedHerder that I am trying to
>> understand. I'm setting up a cluster of Kafka Connect nodes and have a
>> relatively large number of connectors to submit to it (392 connectors
>> right now, soon to become over 1100). For deployment I am using Chef,
>> which PUTs the connector configs at deployment time so I can create or
>> update any connectors.
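>>
>> The deployment essentially boils down to something like this (Python
>> with the requests library; the worker URL and connector names are
>> placeholders), one PUT per connector config:
>>
>>     import json
>>     import requests  # third-party HTTP client
>>
>>     CONNECT_URL = "http://connect-worker-1.example.com:8083"  # placeholder
>>
>>     def put_connector(name, config):
>>         # PUT /connectors/{name}/config creates the connector if it
>>         # doesn't exist yet and updates its config otherwise.
>>         resp = requests.put(
>>             "%s/connectors/%s/config" % (CONNECT_URL, name),
>>             headers={"Content-Type": "application/json"},
>>             data=json.dumps(config))
>>         resp.raise_for_status()
>>
>>     # Chef calls this once per connector at deploy time (392 today,
>>     # over 1100 soon).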
>>
>> Every time I PUT a new connector config to a worker it appears to
>> initiate an assignment rebalance. I believe this only happens when
>> submitting a new connector. It causes all existing, running connectors
>> to stop and restart. My logs end up flooded with exceptions from the
>> JDBC source tasks (SQL connections being closed) and wakeup exceptions
>> in my sink tasks when committing offsets. This causes issues beyond
>> having to wait for a rebalance, as restarting the JDBC connectors makes
>> them re-pull all data, since they are using bulk mode.
>>
>> Everything eventually settles down and all the connectors finish
>> successfully, but each PUT takes progressively longer waiting for a
>> rebalance to finish.
>>
>> If I simply restart the worker nodes and let them only instantiate
>> connectors that have already been successfully submitted, everything
>> starts up fine. So, this is only an issue when submitting new connectors
>> over the REST endpoint.
>>
>> So, I'm trying to understand why submitting a new connector causes the
>> rebalancing, and also whether there is a better way to deploy connector
>> configs in distributed mode.
>>
>> Thanks,
>>
>> Stephen
>>
>
>
