[ https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Feichi Feng resolved HUDI-724.
------------------------------
    Resolution: Fixed

> Parallelize GetSmallFiles For Partitions
> ----------------------------------------
>
>                 Key: HUDI-724
>                 URL: https://issues.apache.org/jira/browse/HUDI-724
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Writer Core
>            Reporter: Feichi Feng
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: gap.png, nogapAfterImprovement.png
>
>   Original Estimate: 48h
>          Time Spent: 40m
>  Remaining Estimate: 47h 20m
>
> When writing data, a gap was observed between Spark stages. Tracking down
> where the time was spent on the Spark driver showed that it goes to the
> get-small-files operation for partitions.
> When creating the UpsertPartitioner and assigning insert records, the code
> uses a plain for-loop to get the list of small files for every partition
> that the load is going to write to. This is very slow when there are many
> partitions to go through. While the operation is running in the Spark
> driver process, all worker nodes sit idle waiting for tasks.
> The partitions don't affect each other, so the get-small-files operations
> can be parallelized. The change I made is to pass the JavaSparkContext to
> the UpsertPartitioner, create an RDD for the partitions, and so distribute
> the get-small-files operations across multiple tasks.
>
> Screenshots attached for:
>   the gap without the improvement
>   the Spark stage with the improvement (no gap)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
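
The parallelization described in this issue can be illustrated with a small, self-contained sketch. It is not the Hudi code: Java parallel streams stand in for the Spark RDD so the example runs without a cluster, and `SmallFilesSketch`, `getSmallFiles`, and the partition paths are hypothetical names chosen for illustration. The point is only that the per-partition lookups are independent, so the for-loop can be replaced by a parallel map that yields the same result.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class SmallFilesSketch {

    // Hypothetical stand-in for the per-partition lookup. A real lookup would
    // list the files in the partition and keep those below a size threshold.
    static List<String> getSmallFiles(String partitionPath) {
        return Arrays.asList(partitionPath + "/small-file-1",
                             partitionPath + "/small-file-2");
    }

    // Sequential version: one lookup after another, as in the plain for-loop
    // the issue describes running on the driver.
    static Map<String, List<String>> sequential(List<String> partitionPaths) {
        Map<String, List<String>> result = new HashMap<>();
        for (String partitionPath : partitionPaths) {
            result.put(partitionPath, getSmallFiles(partitionPath));
        }
        return result;
    }

    // Parallel version: each lookup depends only on its own partition path,
    // so the lookups can run concurrently. Assumes the paths are distinct,
    // because Collectors.toMap rejects duplicate keys.
    static Map<String, List<String>> parallel(List<String> partitionPaths) {
        return partitionPaths.parallelStream()
                .collect(Collectors.toMap(Function.identity(),
                                          SmallFilesSketch::getSmallFiles));
    }

    public static void main(String[] args) {
        List<String> partitionPaths =
                Arrays.asList("2020/03/01", "2020/03/02", "2020/03/03");
        Map<String, List<String>> a = sequential(partitionPaths);
        Map<String, List<String>> b = parallel(partitionPaths);
        System.out.println("results match: " + a.equals(b));
    }
}
```

In the change described above, the same shape is obtained by creating an RDD over the partition paths from the JavaSparkContext, so the lookups run as tasks on the workers rather than in a loop on the driver.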