[ https://issues.apache.org/jira/browse/FLINK-33879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weijie Guo closed FLINK-33879.
------------------------------
    Resolution: Fixed

master (1.19) via 879509d7ca886f8f0ed4dd966e859d3c2a5aa231.

> Hybrid Shuffle may stop working for a while during redistribution
> -----------------------------------------------------------------
>
>                 Key: FLINK-33879
>                 URL: https://issues.apache.org/jira/browse/FLINK-33879
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>            Reporter: Jiang Xin
>            Assignee: Jiang Xin
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.19.0
>
>
> Currently, the Hybrid Shuffle can work with the memory tier and disk tier
> together. However, in the following scenario the result partition would stop
> working.
> Suppose we have a shuffle task with 2 sub-partitions. The LocalBufferPool has
> 15 buffers, so the memory tier can use at most 15-(2*(2+1)+1) = 8 buffers
> according to `TieredStorageMemoryManagerImpl#getMaxNonReclaimableBuffers`. If
> the memory tier uses up all 8 buffers and the input channel consumes them
> very slowly because of a problem such as an unstable network, the disk tier
> can still work with 1 reserved buffer. However, if a redistribution happens
> at this point and the pool size is decreased to less than 8, the
> BufferAccumulator cannot request buffers anymore, and thus the result
> partition stops working until the buffers in the memory tier are recycled.
> The goal is to keep the result partition working with the disk tier and
> writing the shuffle data to disk, so that once the input channel is ready,
> the data on the disk can be consumed immediately.
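The buffer arithmetic in the description can be checked with a small sketch. This is only an illustrative model, not Flink's implementation: the class `HybridShuffleBufferSketch` and both methods are hypothetical, the reserved count `2*(2+1)+1` is copied verbatim from the issue without interpreting its terms, and the post-redistribution pool size of 7 is an assumed example of "less than 8".

```java
// Illustrative model of the scenario in FLINK-33879; not Flink code.
// All names here are hypothetical and exist only for this sketch.
final class HybridShuffleBufferSketch {

    /** Buffers the memory tier may hold: pool size minus buffers reserved elsewhere. */
    static int maxNonReclaimableBuffers(int poolSize, int reservedBuffers) {
        return poolSize - reservedBuffers;
    }

    /** Simplified check: a request can succeed only while fewer buffers are held than the pool allows. */
    static boolean canRequestBuffer(int poolSize, int buffersInUse) {
        return buffersInUse < poolSize;
    }

    public static void main(String[] args) {
        int reserved = 2 * (2 + 1) + 1; // 7, expression taken verbatim from the issue
        int memoryTierMax = maxNonReclaimableBuffers(15, reserved);
        System.out.println(memoryTierMax); // prints 8

        // The memory tier holds all 8 buffers and the consumer is slow.
        int heldByMemoryTier = memoryTierMax;

        // Before redistribution the pool has 15 buffers, so requests still succeed.
        System.out.println(canRequestBuffer(15, heldByMemoryTier)); // prints true

        // After redistribution the pool shrinks below 8 (7 is an assumed value).
        System.out.println(canRequestBuffer(7, heldByMemoryTier)); // prints false
    }
}
```

In this model the last check stays false until the memory tier recycles buffers, which corresponds to the stall the issue describes.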