We don’t do this on the Kafka side, but for a different system with similar distribution problems we manually maintain a map of “hot” keys. On the Kafka side, we distribute keys evenly across our largest-volume topic, then squash the data and repartition it on a skewed key. The resulting skew is insignificant compared to that largest-volume topic, so we tend not to care about it.
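
We never did ship the custom partitioner, but roughly, plugging a hot-key map into the producer would look something like the sketch below. This is just illustrative -- the class, the "hot.keys" config name, and the random-scatter policy are all made up, not what we actually run:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Illustrative only: spread a hand-maintained set of "hot" keys across all
// partitions, and hash everything else the normal way.
public class HotKeyAwarePartitioner implements Partitioner {

    private final Set<String> hotKeys = new HashSet<>();

    @Override
    public void configure(Map<String, ?> configs) {
        // "hot.keys" is a made-up config name: a comma-separated list of keys
        // someone has manually flagged as hot.
        Object hot = configs.get("hot.keys");
        if (hot != null) {
            for (String k : hot.toString().split(",")) {
                hotKeys.add(k.trim());
            }
        }
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null || (key != null && hotKeys.contains(key.toString()))) {
            // Hot (or null) keys get scattered so no single partition eats the load.
            return ThreadLocalRandom.current().nextInt(numPartitions);
        }
        // Everything else: plain deterministic hash on the serialized key.
        return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions);
    }

    @Override
    public void close() {}
}

You’d wire it in with the producer’s partitioner.class config and keep the hot key list somewhere you can update without a redeploy.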
Wes

> On May 4, 2016, at 2:57 PM, Srikanth <srikanth...@gmail.com> wrote:
>
> Yeah, fixed slicing may help. I'll put more thought into this.
> You had mentioned that you didn't put custom partitioner into production.
> Would you mind sharing how you worked around this currently?
>
> Srikanth
>
> On Tue, May 3, 2016 at 5:43 PM, Wesley Chow <w...@chartbeat.com> wrote:
>
>>> Upload to S3 is partitioned by the "key" field. I.e., one folder per key. It
>>> does offset management to make sure offset commit is in sync with S3 upload.
>>
>> We do this in several spots and I wish we had built our system in such a
>> way that we could just open source it. I’m sure many people have solved
>> this repeatedly. We’ve had significant disk performance issues when the
>> number of keys is large (40,000-ish in our case) -- you can’t be expected to
>> open a file per key. That’s why something like the fixed slicing strategy I
>> described can make a big difference.
>>
>> Wes
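
P.S. Since fixed slicing keeps coming up: the gist is that you hash each key into a bounded number of slices, so the S3 uploader only ever has a fixed number of files open no matter how many keys show up. A rough sketch (class name and path layout are invented, not our actual code):

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative sketch of fixed slicing: every key maps deterministically onto
// one of numSlices buckets, so an uploader holds at most numSlices files open
// instead of one per key.
public final class FixedSlicer {

    private final int numSlices;

    public FixedSlicer(int numSlices) {
        this.numSlices = numSlices;
    }

    // The same key always lands in the same slice, so a reader that only cares
    // about key K still has to scan just one slice.
    public int sliceFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numSlices);
    }

    // e.g. mytopic/slice=0017/part-00042.gz -- the layout here is invented.
    public String slicePrefix(String topic, String key) {
        return String.format("%s/slice=%04d", topic, sliceFor(key));
    }
}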