Understood. Thanks.

On Wed, May 4, 2016 at 3:47 PM, Wesley Chow <w...@chartbeat.com> wrote:
> We don't do this on the Kafka side, but for a different system that has
> similar distribution problems we manually maintain a map of "hot" keys. On
> the Kafka side, we distribute keys with an even distribution in our
> largest-volume topic, and then squash the data and repartition based on a
> skewed key. The resulting skew is insignificant compared to our
> largest-volume topic, so we tend not to care.
>
> Wes
>
>> On May 4, 2016, at 2:57 PM, Srikanth <srikanth...@gmail.com> wrote:
>>
>> Yeah, fixed slicing may help. I'll put more thought into this.
>> You had mentioned that you didn't put the custom partitioner into
>> production. Would you mind sharing how you work around this currently?
>>
>> Srikanth
>>
>> On Tue, May 3, 2016 at 5:43 PM, Wesley Chow <w...@chartbeat.com> wrote:
>>
>>>> Upload to S3 is partitioned by the "key" field, i.e., one folder per
>>>> key. It does offset management to make sure the offset commit is in
>>>> sync with the S3 upload.
>>>
>>> We do this in several spots, and I wish we had built our system in such
>>> a way that we could just open source it. I'm sure many people have
>>> solved this repeatedly. We've had significant disk performance issues
>>> when the number of keys is large (40,000-ish in our case) -- you can't
>>> be expected to open a file per key. That's why something like the fixed
>>> slicing strategy I described can make a big difference.
>>>
>>> Wes
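
For readers following the thread: the "map of hot keys" idea Wes describes
could be expressed as a custom Kafka Partitioner along the lines of the
sketch below. This is a minimal illustration, not Chartbeat's actual code;
the HotKeyPartitioner name, the hard-coded HOT_KEYS set, and the choice to
spread hot keys randomly across all partitions are assumptions. Non-hot keys
fall through to the same murmur2 hashing Kafka's default partitioner uses.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ThreadLocalRandom;

    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.PartitionInfo;
    import org.apache.kafka.common.utils.Utils;

    public class HotKeyPartitioner implements Partitioner {

        // Manually maintained set of known-hot keys. In a real deployment
        // this would presumably be loaded from config or an external store.
        private static final Set<String> HOT_KEYS =
                new HashSet<>(Arrays.asList("key-a", "key-b"));

        @Override
        public void configure(Map<String, ?> configs) {}

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
            int numPartitions = partitions.size();

            if (keyBytes == null || HOT_KEYS.contains(String.valueOf(key))) {
                // Spread unkeyed and hot-key records across all partitions,
                // trading per-key locality for even load.
                return ThreadLocalRandom.current().nextInt(numPartitions);
            }
            // Everything else: standard murmur2 hash, like the default
            // partitioner, so a key always maps to the same partition.
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }

        @Override
        public void close() {}
    }

A producer would opt in with the "partitioner.class" config property; the
downside, as the thread notes, is that consumers can no longer assume all
records for a hot key live in one partition, hence the later squash-and-
repartition step.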
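The "fixed slicing" strategy Wes refers to (hashing many keys into a fixed
number of output files, so you never hold 40,000 files open at once) might
look something like the sketch below. NUM_SLICES, the FixedSliceWriter name,
and the tab-separated file layout are illustrative assumptions, not details
from the thread.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;

    public class FixedSliceWriter implements AutoCloseable {

        // Fixed slice count, independent of how many distinct keys exist.
        private static final int NUM_SLICES = 64;

        private final Map<Integer, BufferedWriter> writers = new HashMap<>();
        private final Path baseDir;

        public FixedSliceWriter(Path baseDir) {
            this.baseDir = baseDir;
        }

        // Deterministic key -> slice mapping, so a given key always lands
        // in the same file and readers can locate it by rehashing.
        static int sliceFor(String key) {
            return Math.floorMod(key.hashCode(), NUM_SLICES);
        }

        public void write(String key, String record) throws IOException {
            int slice = sliceFor(key);
            BufferedWriter w = writers.get(slice);
            if (w == null) {
                // At most NUM_SLICES files are ever open, regardless of
                // how many distinct keys flow through.
                Path file = baseDir.resolve("slice-" + slice + ".log");
                w = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
                writers.put(slice, w);
            }
            w.write(key + "\t" + record);
            w.newLine();
        }

        @Override
        public void close() throws IOException {
            for (BufferedWriter w : writers.values()) {
                w.close();
            }
        }
    }

Because the key-to-slice mapping is deterministic, the S3 layout can stay
queryable per key (rehash the key to find its slice) while disk and file-
handle pressure stays bounded by the slice count rather than the key count.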