Hi Yingjie,

Thanks for your explanation. I have no more questions. +1

On Tue, Dec 14, 2021 at 3:31 PM Yingjie Cao <kevin.ying...@gmail.com> wrote:
>
> Hi Jingsong,
>
> Thanks for your feedback.
>
> >>> My question is, what is the maximum parallelism a job can have with the
> >>> default configuration? (Does this break the out-of-the-box experience?)
>
> Yes, you are right, these two options are related to network memory and
> framework off-heap memory. Generally, these changes will not break the
> out-of-the-box experience, but in some extreme cases, for example when there
> are too many ResultPartitions per task, users may need to increase the
> network memory to avoid the "insufficient network buffer" error. For
> framework off-heap memory, I believe users do not need to change the default
> value.
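>
> For reference, a minimal sketch of what such a tuning change could look like
> in flink-conf.yaml (the concrete values below are only an illustration of
> which knobs to turn, not a recommendation):
>
>     # give the network stack a larger share of TaskManager memory
>     taskmanager.memory.network.fraction: 0.2
>     taskmanager.memory.network.max: 2gb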
>
> In fact, I have a basic goal when changing these config values: when running
> TPC-DS at medium parallelism with the default values, all queries must pass
> without any error and, at the same time, the performance should improve. I
> think if we achieve this goal, most common use cases can be covered.
>
> Currently, with the default configuration, the number of exclusive buffers
> required at the input gate side still depends on the parallelism (though
> since 1.14 we can decouple the network buffer consumption from parallelism
> by setting a config value, at a slight performance cost for streaming jobs),
> which means that the default configuration cannot support very large
> parallelism. Roughly, I would say the default values can support jobs with a
> parallelism of several hundred.
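>
> As a rough back-of-envelope illustration (assuming the default 2 exclusive
> buffers per channel, 8 floating buffers per gate and 32 KB network buffers;
> the numbers are only meant to show the scaling):
>
>     buffers per input gate ~= parallelism * 2 + 8
>     e.g. parallelism 500  ->  ~1008 buffers  ->  ~32 MB for a single gate
>
> So a few input gates at a parallelism of several hundred already consume a
> large part of the default network memory budget.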
>
> >>> I do feel that this correspondence is a bit difficult to control at the 
> >>> moment, and it would be best if a rough table could be provided.
>
> I think this is a good suggestion; we can provide such a rough table in the
> documentation.
>
> Best,
> Yingjie
>
> On Tue, Dec 14, 2021 at 2:39 PM Jingsong Li <jingsongl...@gmail.com> wrote:
>>
>> Hi Yingjie,
>>
>> +1 for this FLIP. I'm pretty sure this will greatly improve the ease of use
>> of batch jobs.
>>
>> Looks like "taskmanager.memory.framework.off-heap.batch-shuffle.size"
>> and "taskmanager.network.sort-shuffle.min-buffers" are related to
>> network memory and framework.off-heap.size.
>>
>> My question is, what is the maximum parallelism a job can have with
>> the default configuration? (Does this break the out-of-the-box experience?)
>>
>> Roughly how much network memory and framework.off-heap.size is required
>> for a given parallelism with the default configuration?
>>
>> I do feel that this correspondence is a bit difficult to control at
>> the moment, and it would be best if a rough table could be provided.
>>
>> Best,
>> Jingsong
>>
>> On Tue, Dec 14, 2021 at 2:16 PM Yingjie Cao <kevin.ying...@gmail.com> wrote:
>> >
>> > Hi Jiangang,
>> >
>> > Thanks for your suggestion.
>> >
>> > >>> The config can affect the memory usage. Will the related memory 
>> > >>> configs be changed?
>> >
>> > I think we will not change the default network memory settings. My best
>> > expectation is that the default values can work for most cases (though
>> > maybe not optimally), and for other cases, users may need to tune the
>> > memory settings.
>> >
>> > >>> Can you share the TPC-DS results for the different configs? Even
>> > >>> though we change the default values, different users may still need
>> > >>> to tune them, and this experience can help a lot.
>> >
>> > I did not keep all the previous TPC-DS results, but from what I have, I
>> > can tell that on HDD, always using the sort-shuffle is a good choice. For
>> > small jobs, using sort-shuffle may not bring much performance gain; this
>> > may be because all shuffle data can be cached in memory (page cache),
>> > which is the case if the cluster has enough resources. However, if the
>> > whole cluster is under heavy load or you are running large-scale jobs,
>> > the performance of those small jobs can also be influenced. For
>> > large-scale jobs, the configurations suggested to be tuned are
>> > taskmanager.network.sort-shuffle.min-buffers and
>> > taskmanager.memory.framework.off-heap.batch-shuffle.size; you can
>> > increase these values for large-scale batch jobs.
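>> >
>> > For example, for a large-scale batch job one might set something like the
>> > following in flink-conf.yaml (the values are only illustrative and depend
>> > on the available TaskManager memory):
>> >
>> >     taskmanager.network.sort-shuffle.min-buffers: 2048
>> >     taskmanager.memory.framework.off-heap.batch-shuffle.size: 256m
>> >     # the batch-shuffle read buffers are taken from framework off-heap
>> >     # memory, so it may need to grow accordingly
>> >     taskmanager.memory.framework.off-heap.size: 512m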
>> >
>> > BTW, I am still running TPC-DS tests these days and can share the results
>> > soon.
>> >
>> > Best,
>> > Yingjie
>> >
>> > On Fri, Dec 10, 2021 at 6:30 PM 刘建刚 <liujiangangp...@gmail.com> wrote:
>> >>
>> >> Glad to see this suggestion. In our tests, we found that for small jobs
>> >> the changed configs cannot improve the performance much, just as in your
>> >> tests. I have some suggestions:
>> >>
>> >> The config can affect the memory usage. Will the related memory configs 
>> >> be changed?
>> >> Can you share the TPC-DS results for the different configs? Even though
>> >> we change the default values, different users may still need to tune
>> >> them, and this experience can help a lot.
>> >>
>> >> Best,
>> >> Liu Jiangang
>> >>
>> >> On Fri, Dec 10, 2021 at 5:20 PM Yun Gao <yungao...@aliyun.com.invalid> wrote:
>> >>>
>> >>> Hi Yingjie,
>> >>>
>> >>> Many thanks for drafting the FLIP and initiating the discussion!
>> >>>
>> >>> May I double-check one point about
>> >>> taskmanager.network.sort-shuffle.min-parallelism: since other frameworks
>> >>> like Spark use sort-based shuffle for all cases, is our current
>> >>> situation still different from theirs?
>> >>>
>> >>> Best,
>> >>> Yun
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> ------------------------------------------------------------------
>> >>> From: Yingjie Cao <kevin.ying...@gmail.com>
>> >>> Send Time: 2021 Dec. 10 (Fri.) 16:17
>> >>> To: dev <d...@flink.apache.org>; user <u...@flink.apache.org>; user-zh
>> >>> <user-zh@flink.apache.org>
>> >>> Subject: Re: [DISCUSS] Change some default config values of blocking
>> >>> shuffle
>> >>>
>> >>> Hi dev & users:
>> >>>
>> >>> I have created a FLIP [1] for it; feedback is highly appreciated.
>> >>>
>> >>> Best,
>> >>> Yingjie
>> >>>
>> >>> [1] 
>> >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-199%3A+Change+some+default+config+values+of+blocking+shuffle+for+better+usability
>> >>> On Fri, Dec 3, 2021 at 5:02 PM Yingjie Cao <kevin.ying...@gmail.com> wrote:
>> >>>
>> >>> Hi dev & users,
>> >>>
>> >>> We propose to change some default config values of blocking shuffle to
>> >>> improve the out-of-the-box user experience (this does not influence
>> >>> streaming). The default values we want to change are as follows:
>> >>>
>> >>> 1. Data compression
>> >>> (taskmanager.network.blocking-shuffle.compression.enabled): Currently,
>> >>> the default value is 'false'. Usually, data compression can reduce both
>> >>> disk and network IO, which is good for performance. At the same time, it
>> >>> can save storage space. We propose to change the default value to 'true'.
>> >>>
>> >>> 2. Default shuffle implementation
>> >>> (taskmanager.network.sort-shuffle.min-parallelism): Currently, the
>> >>> default value is 'Integer.MAX_VALUE', which means that by default, Flink
>> >>> jobs will always use hash-shuffle. In fact, for high parallelism,
>> >>> sort-shuffle is better for both stability and performance. So we propose
>> >>> to reduce the default value to a proper smaller one, for example, 128.
>> >>> (We tested 128, 256, 512 and 1024 with the TPC-DS benchmark and 128 was
>> >>> the best one.)
>> >>>
>> >>> 3. Read buffer of sort-shuffle
>> >>> (taskmanager.memory.framework.off-heap.batch-shuffle.size): Currently,
>> >>> the default value is '32M'. Previously, when choosing the default value,
>> >>> both '32M' and '64M' were OK in our tests, and we chose the smaller one
>> >>> to be cautious. However, it was recently reported on the mailing list
>> >>> that the default value is not enough, which caused a buffer request
>> >>> timeout issue. We have already created a ticket to improve the behavior.
>> >>> At the same time, we propose to increase this default value to '64M',
>> >>> which can also help.
>> >>>
>> >>> 4. Sort buffer size of sort-shuffle
>> >>> (taskmanager.network.sort-shuffle.min-buffers): Currently, the default
>> >>> value is '64', which means 64 network buffers (32 KB per buffer by
>> >>> default). This default value is quite modest and can limit performance.
>> >>> We propose to increase it to a larger value, for example, 512 (with this
>> >>> value, the default TM and network buffer configuration can still serve
>> >>> more than 10 result partitions concurrently). A combined example of
>> >>> these proposed defaults is sketched below.
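>> >>>
>> >>> Expressed as flink-conf.yaml entries, the proposed defaults would
>> >>> roughly correspond to the following (shown only to make the proposal
>> >>> concrete; with the new defaults nothing would need to be set
>> >>> explicitly):
>> >>>
>> >>>     taskmanager.network.blocking-shuffle.compression.enabled: true
>> >>>     taskmanager.network.sort-shuffle.min-parallelism: 128
>> >>>     taskmanager.memory.framework.off-heap.batch-shuffle.size: 64m
>> >>>     taskmanager.network.sort-shuffle.min-buffers: 512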
>> >>>
>> >>> We have already tested these default values together with the TPC-DS
>> >>> benchmark in a cluster, and both the performance and stability improved
>> >>> a lot. These changes can help improve the out-of-the-box experience of
>> >>> blocking shuffle. What do you think about these changes? Are there any
>> >>> concerns? If there are no objections, I will make these changes soon.
>> >>>
>> >>> Best,
>> >>> Yingjie
>> >>>
>>
>>
>> --
>> Best, Jingsong Lee



-- 
Best, Jingsong Lee
