[ https://issues.apache.org/jira/browse/HUDI-315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yanjia Gary Li reassigned HUDI-315:
-----------------------------------

    Assignee: Yanjia Gary Li

> Reimplement statistics/workload profile collected during writes using Spark
> 2.x custom accumulators
> ---------------------------------------------------------------------------
>
>                 Key: HUDI-315
>                 URL: https://issues.apache.org/jira/browse/HUDI-315
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Writer Core
>            Reporter: Vinoth Chandar
>            Assignee: Yanjia Gary Li
>            Priority: Major
>
> https://medium.com/@shrechak/leveraging-custom-accumulators-in-apache-spark-2-0-f4fef23f19f1
>
> In Hudi, there are two places where we need to obtain statistics on the
> input data:
> - HoodieBloomIndex: for knowing which partitions need to be loaded and
>   checked against (whether this is still needed with the timeline server
>   enabled is a separate question)
> - Workload profile: to get a sense of the number of updates and inserts to
>   each partition/file group
> Both of these issue their own groupBy or shuffle computation today. This
> can be avoided using an accumulator.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
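As a rough illustration of the idea in the description, the sketch below models the contract of a Spark 2.x custom accumulator (isZero/copy/reset/add/merge/value, per `org.apache.spark.util.AccumulatorV2`) that counts inserts and updates per partition as records are written, so no separate groupBy/shuffle is needed. It is plain Java so the merge logic is self-contained; in a real job the class would extend `AccumulatorV2` and be registered with `sparkContext.register(...)`. All names here (e.g. `WorkloadStatAccumulator`) are hypothetical, not Hudi's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-partition insert/update counter following the
// AccumulatorV2 contract. Each executor task calls add(...) once per
// record on its local copy; the driver merges the partial maps, so the
// statistics are gathered as a side effect of the write itself.
public class WorkloadStatAccumulator {
    // partitionPath -> {insertCount, updateCount}
    private final Map<String, long[]> stats = new HashMap<>();

    public boolean isZero() {
        return stats.isEmpty();
    }

    // Spark calls copy() to give each task its own instance.
    public WorkloadStatAccumulator copy() {
        WorkloadStatAccumulator c = new WorkloadStatAccumulator();
        stats.forEach((k, v) -> c.stats.put(k, v.clone()));
        return c;
    }

    public void reset() {
        stats.clear();
    }

    // Called per record on the executor; purely local, no shuffle.
    public void add(String partitionPath, boolean isUpdate) {
        long[] counts = stats.computeIfAbsent(partitionPath, k -> new long[2]);
        counts[isUpdate ? 1 : 0]++;
    }

    // Called on the driver to combine partial results from tasks.
    public void merge(WorkloadStatAccumulator other) {
        other.stats.forEach((k, v) -> {
            long[] counts = stats.computeIfAbsent(k, x -> new long[2]);
            counts[0] += v[0];
            counts[1] += v[1];
        });
    }

    public Map<String, long[]> value() {
        return stats;
    }
}
```

The driver would then read `value()` to decide, for example, which partitions HoodieBloomIndex must check and how work is packed into file groups, without issuing a second pass over the input.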