Tim Armstrong created IMPALA-6226: ------------------------------------- Summary: Analytic planner misses some opportunities to merge sort groups Key: IMPALA-6226 URL: https://issues.apache.org/jira/browse/IMPALA-6226 Project: IMPALA Issue Type: Improvement Components: Frontend Affects Versions: Impala 2.10.0 Reporter: Tim Armstrong Priority: Minor
{noformat} create table testrn (g int, v int, r int); insert into testrn (g,v,r) select 1,1,1 union select 1,2,2 union select 2,1,3 union select 2,2,4; select g,first_value(v) over(partition by g order by r),row_number() over (partition by g order by r desc) from testrn; {noformat} The above analytical function query produces the following plan: {noformat} Max Per-Host Resource Reservation: Memory=20.00MB Per-Host Resource Estimates: Memory=52.00MB WARNING: The following tables are missing relevant table and/or column statistics. default.testrn F02:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1 | Per-Host Resources: mem-estimate=0B mem-reservation=0B PLAN-ROOT SINK | mem-estimate=0B mem-reservation=0B | 06:EXCHANGE [UNPARTITIONED] | mem-estimate=0B mem-reservation=0B | tuple-ids=6,3 row-size=24B cardinality=unavailable | F01:PLAN FRAGMENT [HASH(g)] hosts=1 instances=1 Per-Host Resources: mem-estimate=20.00MB mem-reservation=20.00MB 04:ANALYTIC | functions: row_number() | partition by: g | order by: r DESC | window: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW | mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB | tuple-ids=6,3 row-size=24B cardinality=unavailable | 03:SORT | order by: g ASC NULLS FIRST, r DESC | mem-estimate=6.00MB mem-reservation=6.00MB spill-buffer=2.00MB | tuple-ids=6 row-size=16B cardinality=unavailable | 02:ANALYTIC | functions: first_value(v) | partition by: g | order by: r ASC | window: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW | mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB | tuple-ids=4,2 row-size=16B cardinality=unavailable | 01:SORT | order by: g ASC NULLS FIRST, r ASC | mem-estimate=6.00MB mem-reservation=6.00MB spill-buffer=2.00MB | tuple-ids=4 row-size=12B cardinality=unavailable | 05:EXCHANGE [HASH(g)] | mem-estimate=0B mem-reservation=0B | tuple-ids=0 row-size=12B cardinality=unavailable | F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1 Per-Host Resources: mem-estimate=32.00MB mem-reservation=0B 00:SCAN HDFS [default.testrn, RANDOM] partitions=1/1 files=1 size=24B stats-rows=unavailable extrapolated-rows=disabled table stats: rows=unavailable size=unavailable column stats: unavailable mem-estimate=32.00MB mem-reservation=0B tuple-ids=0 row-size=12B cardinality=unavailable {noformat} I believe we could avoid sorting twice by rewriting first_value() into last_value(). I suspect there are a few other cases where the analytic functions can be "flipped" like this to handle a different sort order. The solution probably involves adding some extra cases to this logic: https://github.com/apache/incubator-impala/blob/5e9b4e2/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java#L729 to detect when the expressions are identical up to the ASC/DESC and when the function can be "flipped" to produce the correct result. -- This message was sent by Atlassian JIRA (v6.4.14#64029)