Tim Armstrong created IMPALA-6226:
-------------------------------------

             Summary: Analytic planner misses some opportunities to merge sort 
groups
                 Key: IMPALA-6226
                 URL: https://issues.apache.org/jira/browse/IMPALA-6226
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 2.10.0
            Reporter: Tim Armstrong
            Priority: Minor


{noformat}
create table testrn (g int, v int, r int);
insert into testrn (g,v,r) select 1,1,1 union select 1,2,2 union select 2,1,3 
union select 2,2,4;
select g,first_value(v) over(partition by g order by r),row_number() over 
(partition by g order by r desc) from testrn;
{noformat}

The above analytical function query produces the following plan:
{noformat}
Max Per-Host Resource Reservation: Memory=20.00MB
Per-Host Resource Estimates: Memory=52.00MB
WARNING: The following tables are missing relevant table and/or column 
statistics.
default.testrn

F02:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
|  Per-Host Resources: mem-estimate=0B mem-reservation=0B
PLAN-ROOT SINK
|  mem-estimate=0B mem-reservation=0B
|
06:EXCHANGE [UNPARTITIONED]
|  mem-estimate=0B mem-reservation=0B
|  tuple-ids=6,3 row-size=24B cardinality=unavailable
|
F01:PLAN FRAGMENT [HASH(g)] hosts=1 instances=1
Per-Host Resources: mem-estimate=20.00MB mem-reservation=20.00MB
04:ANALYTIC
|  functions: row_number()
|  partition by: g
|  order by: r DESC
|  window: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
|  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB
|  tuple-ids=6,3 row-size=24B cardinality=unavailable
|
03:SORT
|  order by: g ASC NULLS FIRST, r DESC
|  mem-estimate=6.00MB mem-reservation=6.00MB spill-buffer=2.00MB
|  tuple-ids=6 row-size=16B cardinality=unavailable
|
02:ANALYTIC
|  functions: first_value(v)
|  partition by: g
|  order by: r ASC
|  window: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
|  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB
|  tuple-ids=4,2 row-size=16B cardinality=unavailable
|
01:SORT
|  order by: g ASC NULLS FIRST, r ASC
|  mem-estimate=6.00MB mem-reservation=6.00MB spill-buffer=2.00MB
|  tuple-ids=4 row-size=12B cardinality=unavailable
|
05:EXCHANGE [HASH(g)]
|  mem-estimate=0B mem-reservation=0B
|  tuple-ids=0 row-size=12B cardinality=unavailable
|
F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
Per-Host Resources: mem-estimate=32.00MB mem-reservation=0B
00:SCAN HDFS [default.testrn, RANDOM]
   partitions=1/1 files=1 size=24B
   stats-rows=unavailable extrapolated-rows=disabled
   table stats: rows=unavailable size=unavailable
   column stats: unavailable
   mem-estimate=32.00MB mem-reservation=0B
   tuple-ids=0 row-size=12B cardinality=unavailable
{noformat}

I believe we could avoid sorting twice by rewriting first_value() into 
last_value(). I suspect there are a few other cases where the analytic 
functions can be "flipped" like this to handle a different sort order.

The solution probably involves adding some extra cases to this logic: 
https://github.com/apache/incubator-impala/blob/5e9b4e2/fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java#L729
 to detect when the expressions are identical up to the ASC/DESC and when the 
function can be "flipped" to produce the correct result.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to