Re: [DISCUSS] streaming shuffle to improve data clustering and tame small files problem

2023-01-30 Thread Jark Wu
Thanks, Steven, for starting this discussion. As I suggested in the previous thread, this can be a joint effort beneficial for various projects. I would also like to hear opinions from @Jingsong Li, who is maintaining Flink Table Store.

Best,
Jark

On Tue, 31 Jan 2023 at 08:46, Steven Wu wrote:

[DISCUSS] streaming shuffle to improve data clustering and tame small files problem

2023-01-30 Thread Steven Wu
Hi,

We had a proposal to add a streaming shuffling stage in the Flink Iceberg sink to improve data clustering and tame the small files problem [1]. Here are a couple of common use cases.

* Event-time-partitioned table, where we can run into the small files problem due to skewed and long-tail