You are right. I've checked the overall stage metrics, and it looks like the largest shuffle write is over 9G. That partition completed successfully, but its spilled file can't be removed until all the others have finished. It's very likely caused by a stupid mistake in my design: a lookup table grows constantly inside a loop, and every union with a new increment causes both RDDs to be reshuffled, with the partitioner reverting to None. This can never be efficient with the existing API.
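For illustration, here is a minimal Scala sketch of the pattern I mean (the names, key type, and increment source are hypothetical, not the actual job):

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    object GrowingLookupTable {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("growing-lookup").setMaster("local[*]"))

        // Hypothetical increments; in the real job each one is produced
        // by a loop iteration.
        val increments: Seq[RDD[(Long, String)]] =
          (1 to 5).map(i => sc.parallelize(Seq((i.toLong, s"value-$i"))))

        // Start with a hash-partitioned lookup table.
        var lookup: RDD[(Long, String)] =
          sc.emptyRDD[(Long, String)].partitionBy(new HashPartitioner(8))

        for (inc <- increments) {
          // union() of RDDs that don't share a partitioner yields a
          // UnionRDD whose partitioner is None, so any later join or
          // cogroup must reshuffle the entire accumulated table.
          lookup = lookup.union(inc)
        }

        println(s"partitioner after unions: ${lookup.partitioner}") // None

        sc.stop()
      }
    }

Calling partitionBy again after each union would restore the partitioner, but it forces a full shuffle of the ever-growing table on every iteration, which is exactly the inefficiency above.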