Balazs Jeszenszky created IMPALA-7653:
-----------------------------------------
             Summary: Improve accuracy of incremental stats cardinality estimation
                 Key: IMPALA-7653
                 URL: https://issues.apache.org/jira/browse/IMPALA-7653
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 3.0
            Reporter: Balazs Jeszenszky


Currently, the operators in the subquery of a compute [incremental] stats statement estimate cardinality from combined selectivities, as they do for any other query. This affects the aggregation in particular. For example, note the expected cardinality of the aggregation in this subquery:

{code}
F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=4
Per-Host Resources: mem-estimate=305.20GB mem-reservation=136.00MB
01:AGGREGATE [STREAMING]
|  output: [...]
|  group by: col_a, col_b, col_c
|  mem-estimate=76.21GB mem-reservation=34.00MB spill-buffer=2.00MB
|  tuple-ids=1 row-size=104.83KB cardinality=693000
|
00:SCAN HDFS [default.test, RANDOM]
   partitions=1/554 files=1 size=109.65MB
   stats-rows=1506374 extrapolated-rows=disabled
   table stats: rows=821958291 size=unavailable
   column stats: all
   mem-estimate=88.00MB mem-reservation=0B
   tuple-ids=0 row-size=2.06KB cardinality=1506374
{code}

This plan was generated by running compute incremental stats on a single partition, so the aggregation actually outputs a single row rather than the estimated 693000. Because the intermediate rows are so wide (104.83KB each), such overestimates lead to bloated memory estimates.

Since the number of partitions to be updated is known at plan time, Impala could use it to set the aggregation's cardinality.
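
To illustrate the proposal, here is a minimal sketch of capping the aggregation's cardinality by the number of partitions being updated. It is not Impala code: the function names and parameters are hypothetical, and it assumes that the stats aggregation groups by the partition columns (so it emits at most one row per updated partition) and that the plan's KB unit is 1024 bytes.

```python
from typing import Optional


def cap_incremental_stats_agg_cardinality(
    selectivity_estimate: Optional[int],
    num_partitions_to_update: int,
) -> int:
    """Bound the stats aggregation's cardinality by the partition count.

    Hypothetical helper, not part of Impala. Assumes the aggregation groups
    by the partition columns, so it emits at most one row per partition.
    """
    if num_partitions_to_update < 0:
        raise ValueError("num_partitions_to_update must be non-negative")
    if selectivity_estimate is None:
        # No usable estimate: the partition count is still a valid upper bound.
        return num_partitions_to_update
    return min(selectivity_estimate, num_partitions_to_update)


def intermediate_bytes(cardinality: int, row_size_bytes: float) -> float:
    """Rough size of the intermediate rows: row count times row width."""
    return cardinality * row_size_bytes


if __name__ == "__main__":
    row_size = 104.83 * 1024  # row-size=104.83KB from the plan (assumes KB = 1024 bytes)
    before = 693000           # cardinality=693000 from the plan
    after = cap_incremental_stats_agg_cardinality(before, 1)  # partitions=1/554

    print("capped cardinality:", after)
    print("before: %.2f GiB" % (intermediate_bytes(before, row_size) / 1024**3))
    print("after:  %.2f KiB" % (intermediate_bytes(after, row_size) / 1024))
```

The rows-times-width product is only a rough proxy and is not the formula Impala uses for mem-estimate, but it shows why a cardinality of 693000 on rows this wide dominates the estimate, while a cardinality of 1 would not.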