One thing to keep in mind is that the 18.5 Gbps number is across the entire Kafka infrastructure at LinkedIn, not a single cluster. For a single cluster, I’d say our peak right now is 4.7 Gbps inbound. However, because Kafka is horizontally scalable, you can build a cluster to support what you’re looking for.
At the most basic level, if we have gigabit network interfaces, 16 Gbps requires at least 16 brokers. This assumes a replication factor of 1, which is almost certainly not what you want, as there is no protection within the cluster against a node failure. If you have RF=2, every bit written into your cluster by producers generates a bit outbound from a broker for replication, and a bit inbound on the broker it's being replicated to. This means your effective inbound throughput is halved, and you need at least 32 brokers. This is simplistic: it doesn't take into account performance, or retention on disk, both of which you'll need to think about. You can't run the network interfaces saturated all the time, but you can get pretty close without having performance problems.

You also need to be concerned about your consumers here. If you have a single consumer, there's no real complication: outbound maps directly to inbound. But I (for example) have 3 consumers (plus the replicas) on average for any given message. That means I need 4 times as much outbound bandwidth as inbound. If my producers write in at 250 Mbps, I'm saturating that 1 Gbps network link on the other side. (Both calculations are worked through in the sketch after this message.)

As far as laying out the brokers, this is where things get a lot more interesting. Ideally, you want the cluster to span as many failure domains as possible, so you want to spread it out across lots of racks (to minimize the impact of a power failure or a single switch failure, assuming a top-of-rack switch). That means you have to take into account the cross-talk between the brokers for replication. Spine-and-leaf network topologies work well here because they don't require quite so many cross-connects between the racks to get big pipes.

Really, you're going to have to work with your network engineers to figure out what you have to work with, and plan within that. It's hard to stand on the outside and give you a solid plan, because I don't know what else is going on on your networks, where your producers are, what your consumers are doing, or what your performance needs look like.

-Todd

On Fri, May 6, 2016 at 4:07 AM, Andrew Backhouse <backhouseand...@gmail.com> wrote:

> Hello,
>
> I trust you are well.
>
> There's a wealth of great articles and presentations relating to the Kafka
> logic and design; thank you for these. But I'm unable to find information
> relating to the network infrastructure that underpins Kafka. What I am
> trying to understand, to put my Network Engineers at ease when it comes to
> sizing our Kafka cluster, is as follows:
>
> Looking at an example deployment, e.g. LinkedIn and their Kafka deployment,
> it processes 18.5 Gbit/sec inbound.
>
> What design considerations were made to:
>
> 1. validate the network can support such throughput (inbound,
>    inter-cluster & outbound)
> 2. identify what network infrastructure config/layout can be used or
>    avoided to optimise for this level of traffic
>
> To help your answer, we're looking at potentially 16 Gbit/sec inbound,
> which concerns our network team.
>
> If you can please share pointers to existing materials or specific details
> of your deployment, that would be great.
>
> Regards,
>
> Andrew

--
Todd Palino
Staff Site Reliability Engineer
Data Infrastructure Streaming
linkedin.com/in/toddpalino
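
To make the arithmetic in the message concrete, here is a minimal back-of-the-envelope sizing sketch in Python. It only encodes the replication and consumer fan-out math described above; the function names, the 80% NIC-utilization ceiling, and the example figures are illustrative assumptions, not details of LinkedIn's deployment or tooling.

    import math

    def min_brokers(inbound_gbps, nic_gbps=1.0, replication_factor=2,
                    utilization=0.8):
        """Minimum broker count so that producer traffic plus replication
        traffic fits on the brokers' NICs under a utilization ceiling."""
        # Each produced bit is re-written (RF - 1) times, so the cluster's
        # aggregate inbound NIC traffic is RF times the producer traffic.
        effective_inbound = inbound_gbps * replication_factor
        usable_per_broker = nic_gbps * utilization  # don't run NICs saturated
        return math.ceil(effective_inbound / usable_per_broker)

    def outbound_gbps(inbound_gbps, consumers=1, replication_factor=2):
        """Cluster-wide outbound traffic: each message goes out once per
        consumer, plus once to each of the (RF - 1) replicas."""
        return inbound_gbps * (consumers + (replication_factor - 1))

    # 16 Gbps inbound on 1 Gbps NICs with RF=2: 32 brokers at full
    # saturation, 40 once you leave ~20% headroom on each NIC.
    print(min_brokers(16.0))                 # -> 40

    # Todd's example: 3 consumers plus one replica means 4x outbound, so
    # 250 Mbps produced saturates a 1 Gbps link on the way out.
    print(outbound_gbps(0.25, consumers=3))  # -> 1.0

None of this accounts for retention on disk, partition skew, or rack-aware replica placement, which is why the message points you at your own network engineers for the final plan.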