We don’t do this on the Kafka side, but for a different system with similar 
distribution problems we manually maintain a map of “hot” keys. On the Kafka 
side, we distribute keys evenly in our largest-volume topic, then squash the 
data and repartition it based on a skewed key. The resulting skew is small 
enough relative to that largest-volume topic that we tend not to care.
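
In case a concrete sketch helps, here is roughly what a hot-key-aware
partitioner could look like against Kafka's Partitioner interface. This is
not our production code; the hot-key set and the spread factor are
placeholders, and it assumes keyed records:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class HotKeyAwarePartitioner implements Partitioner {

    // Manually maintained set of known hot keys (placeholder values).
    private static final Set<String> HOT_KEYS = new HashSet<>();
    static {
        HOT_KEYS.add("example-hot-key-1");
        HOT_KEYS.add("example-hot-key-2");
    }

    // How many partitions a hot key gets fanned out over (arbitrary choice).
    private static final int SPREAD = 8;

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // Assumes keyed records; a real partitioner would also handle null keys.
        int base = Math.floorMod(Utils.murmur2(keyBytes), numPartitions);
        if (key instanceof String && HOT_KEYS.contains(key)) {
            // Fan a hot key out over SPREAD consecutive partitions so no
            // single partition absorbs its full volume.
            return (base + ThreadLocalRandom.current().nextInt(SPREAD)) % numPartitions;
        }
        return base;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}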

Wes


> On May 4, 2016, at 2:57 PM, Srikanth <srikanth...@gmail.com> wrote:
> 
> Yeah, fixed slicing may help. I'll put more thought into this.
> You had mentioned that you didn't put the custom partitioner into production.
> Would you mind sharing how you're working around this currently?
> 
> Srikanth
> 
> On Tue, May 3, 2016 at 5:43 PM, Wesley Chow <w...@chartbeat.com> wrote:
> 
>>> 
>>> Upload to S3 is partitioned by the "key" field, i.e., one folder per key.
>>> It does offset management to make sure the offset commit is in sync with
>>> the S3 upload.
>> 
>> We do this in several spots and I wish we had built our system in such a
>> way that we could just open source it. I’m sure many people have solved
>> this repeatedly. We’ve had significant disk performance issues when the
>> number of keys is large (40,000-ish in our case) — you can’t be expected to
>> open a file per key. That’s why something like the fixed slicing strategy I
>> described can make a big difference.
>> 
>> Wes
>> 
>> 
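
Re the fixed slicing strategy mentioned above, a rough sketch of the idea
(the class and numbers are made up for illustration, not our actual code):
hash each key into one of a fixed number of slices, so the uploader keeps a
bounded number of files open instead of one per key.

public final class FixedSlicer {

    private final int numSlices;

    public FixedSlicer(int numSlices) {
        this.numSlices = numSlices;
    }

    // Deterministically map a key to a slice index in [0, numSlices).
    public int sliceFor(String key) {
        return Math.floorMod(key.hashCode(), numSlices);
    }

    // Example S3 prefix per slice instead of per key, e.g. "topic/slice=0042/".
    public String prefixFor(String topic, String key) {
        return String.format("%s/slice=%04d/", topic, sliceFor(key));
    }
}

With 256 slices, for example, the writer never has more than 256 files open,
and anything that needs a single key filters within that key's slice on read.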
