Hi all, I was wondering if anybody here has experience designing and operating complex multi-datacenter/multi-cluster Kafka deployments, in which data must flow to and from several distinct Kafka clusters with richer semantics than what MirrorMaker provides, and would be willing to share it.
The general, very sensible consensus is that producers should publish to a local Kafka cluster. But if that data is produced in multiple datacenters, and must be consumed in multiple datacenters as well, then you need to implement data routing and filtering to organise your pipeline.

Imagine the following scenario, with three datacenters A, B and C. Data of the same kind is being produced, to the same topic, in all three datacenters. Datacenters A and B both have consumers that want all the data generated in all three datacenters, but C is only interested in a subset of what is produced in A and B (according to some specific filters, for example).

This means you have data flowing in both directions between each pair of datacenters. You need some kind of source-based filtering to prevent data from bouncing back and forth ad vitam aeternam, as well as something to implement the custom filtering logic where needed, which also means you'd need to wrap all data in an envelope that records where it was originally published.

Is this kind of deployment pretty common in the industry/among the users of Kafka? I haven't found much online that would help in putting together this type of architecture. Is it basically roll-your-own, with something similar to MirrorMaker that chains a consumer, a filtering component and a producer, and a couple of these placed in each direction between each pair of clusters (rough sketch in the P.S. below)? It ultimately boils down to pretty simple "routing" of data, just in a more complex manner than having all data flow to a single sink location.

Let me know what you folks think! TIA,

/Max

--
Maxime Petazzoni
Sr. Platform Engineer
m 408.310.0595
www.turn.com
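P.S. To make the roll-your-own idea concrete, here is a minimal sketch of one replication "leg", written against the modern kafka-clients consumer/producer API. Everything specific in it is made up for illustration: the "origin-dc" header name, the topic and bootstrap addresses, and the FilteringReplicator class itself. It also assumes Kafka 0.11+ for record headers; on older clusters the origin would have to live in a payload envelope instead. Loop prevention comes from each leg only forwarding records that originated in the cluster it reads from.

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.function.Predicate;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class FilteringReplicator {

    static final String ORIGIN_HEADER = "origin-dc";

    /** Producers in each DC stamp every record with its origin. */
    static ProducerRecord<String, String> stamped(
            String topic, String dc, String key, String value) {
        ProducerRecord<String, String> r = new ProducerRecord<>(topic, key, value);
        r.headers().add(ORIGIN_HEADER, dc.getBytes(StandardCharsets.UTF_8));
        return r;
    }

    /**
     * One replication leg: consume from sourceBootstrap, forward to
     * destBootstrap. Runs forever; run one instance per direction
     * between each pair of clusters.
     */
    public static void run(String sourceDc, String sourceBootstrap,
                           String destBootstrap, String topic,
                           Predicate<ConsumerRecord<String, String>> filter) {
        Properties c = new Properties();
        c.put("bootstrap.servers", sourceBootstrap);
        // Each leg needs its own group so legs reading the same source
        // cluster do not share committed offsets.
        c.put("group.id", "replicator-" + sourceDc + "-to-" + destBootstrap);
        c.put("key.deserializer",
              "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer",
              "org.apache.kafka.common.serialization.StringDeserializer");

        Properties p = new Properties();
        p.put("bootstrap.servers", destBootstrap);
        p.put("key.serializer",
              "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer",
              "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c);
             KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            consumer.subscribe(Collections.singletonList(topic));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Source-based filtering: only forward records that were
                    // originally produced in the DC we are reading from, so a
                    // mirrored copy is never mirrored again (no A->B->A loop).
                    Header origin = record.headers().lastHeader(ORIGIN_HEADER);
                    if (origin == null || !sourceDc.equals(
                            new String(origin.value(), StandardCharsets.UTF_8))) {
                        continue;
                    }
                    // Destination-specific filtering, e.g. C's subset.
                    if (!filter.test(record)) {
                        continue;
                    }
                    // Forward with headers intact so the origin survives.
                    producer.send(new ProducerRecord<>(topic, null,
                            record.key(), record.value(), record.headers()));
                }
            }
        }
    }
}

Between A and B you'd run one leg in each direction with a pass-everything predicate (r -> true); the legs feeding C would carry C's subset filter instead. With the consumer's default auto-commit this gives at-least-once delivery, so downstream consumers would need to tolerate the occasional duplicate.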