Andrew Ray created SPARK-13749:
----------------------------------

             Summary: Faster pivot implementation for many distinct values with 
two phase aggregation
                 Key: SPARK-13749
                 URL: https://issues.apache.org/jira/browse/SPARK-13749
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Andrew Ray


The existing implementation of pivot translates into a single aggregation with 
one aggregate per distinct pivot value. When the number of distinct pivot 
values is large (say 1000+) this can get extremely slow since each input value 
gets evaluated on every aggregate even though it only affects the value of one 
of them.

I'm proposing an alternate strategy for when there are 10+ (somewhat arbitrary 
threshold) distinct pivot values. We do two phases of aggregation. In the first 
we group by the grouping columns plus the pivot column and perform the 
specified aggregations (one or sometimes more). In the second aggregation we 
group by the grouping columns and use the new (non public) PivotFirst aggregate 
that rearranges the outputs of the first aggregation into an array indexed by 
the pivot value. Finally we do a project to extract the array entries into the 
appropriate output column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to