Thanks a lot for the guidance, I think we’ll go ahead with one cluster. I just need to figure out how our CD pipeline can talk to our connect cluster securely (because it’ll need direct access to perform API calls).
Lastly, a question or maybe a piece of feedback… is it not possible to specify the key serializer and deserializer as part of the rest api job config? The issue is that sometimes our data is avro, sometimes it’s json. And it seems I’d need two separate clusters for that? On 6 January 2017 at 1:54:10 pm, Ewen Cheslack-Postava (e...@confluent.io) wrote: On Thu, Jan 5, 2017 at 3:12 PM, Stephane Maarek < steph...@simplemachines.com.au> wrote: > Hi, > > We like to operate in micro-services (dockerize and ship everything on ecs) > and I was wondering which approach was preferred. > We have one kafka cluster, one zookeeper cluster, etc, but when it comes to > kafka connect I have some doubts. > > Is it better to have one big kafka connect with multiple nodes, or many > small kafka connect clusters or standalone, for each connector / etl ? > You can do any of these, and it may depend on how you do orchestration/deployment. We built Connect to support running one big cluster running a bunch of connectors. It balances work automatically and provides a way to control scale up/down via increased parallelism. This means we don't need to make any assumptions about how you deploy, how you handle elastically scaling your clusters, etc. But if you run in an environment and have the tooling in place to do that already, you can also opt to run many smaller clusters and use that tooling to scale up/down. In that case you'd just make sure there were enough tasks for each connector so that when you scale the # of workers for a cluster up the rebalancing of work would ensure there was enough tasks for every worker to remain occupied. The main drawback of doing this is that Connect uses a few topics to for configs, status, and offsets and you need these to be unique per cluster. This means you'll have 3N more topics. If you're running a *lot* of connectors, that could eventually become a problem. It also means you have that many more worker configs to handle, clusters to monitor, etc. And deploying a connector no longer becomes as simple as just making a call to the service's REST API since there isn't a single centralized service. The main benefits I can think of are a) if you already have preferred tooling for handling elasticity and b) better resource isolation between connectors (i.e. an OOM error in one connector won't affect any other connectors). For standalone mode, we'd generally recommend only using it when distributed mode doesn't make sense, e.g. for log file collection. Other than that, having the fault tolerance and high availability of distributed mode is preferred. On your specific points: > > The issues I’m trying to address are : > - Integration with our CI/CD pipeline > I'm not sure anything about Connect affects this. Is there a specific concern you have about the CI/CD pipeline & Connect? > - Efficient resources utilisation > Putting all the connectors into one cluster will probably result in better resource utilization unless you're already automatically tracking usage and scaling appropriately. The reason is that if you use a bunch of small clusters, you're now stuck trying to optimize N uses. Since Connect can already (roughly) balance work, putting all the work into one cluster and having connect split it up means you just need to watch utilization of the nodes in that one cluster and scale up or down as appropriate. > - Easily add new jar files that connectors depend on with minimal downtime > This one is a bit interesting. You shouldn't have any downtime adding jars in the sense that you can do rolling bounces of Connect. The one caveat is that the current limitation for how it rebalances work involves halting work for all connectors/tasks, doing the rebalance, and then starting them up again. We plan to improve this, but the timeframe for it is still uncertain. Usually these rebalance steps should be pretty quick. The main reason this can be a concern is that halting some connectors could take some time (e.g. because they need to fully flush their data). This means the period of time your connectors are not processing data during one of those rebalances is controlled by the "worst" connector. I would recommend trying a single cluster but monitoring whether you see stalls due to rebalances. If you do, then moving to multiple clusters might make sense. (This also, obviously, depends a lot on your SLA for data delivery.) > - Monitoring operations > Multiple clusters definitely seems messier and more complicated for this. There will be more workers in a single cluster, but it's a single service you need to monitor and maintain. Hope that helps! -Ewen > > Thanks for your guidance > > Regards, > Stephane >