Re: Impala fails to scan only the correct base table when a query is submitted to the view.

Tim Armstrong Tue, 08 Dec 2020 09:22:51 -0800

You're seeing the effects of one optimization where scans with trivially
false predicates or contradictory predicates are eliminated. That
optimization can deduce contradictions between equality predicates but
doesn't do more complex reasoning about whether ranges specified by
inequality predicates overlap.


Partition pruning is the optimization that is generally more important and
works with a much bigger set of predicates. I don't know what your scenario
is, but I've seen in the past that use cases with views like yours tend to
also be partitioned on the columns where there's significant opportunity
for data elimination.
E.g.

create view test_view as
select * from functional.alltypes
where year >= 2010
union all
select * from functional.alltypestiny
where year <= 2009

[localhost.EXAMPLE.COM:21050] functional> explain select * from test_view
where year < 2010;
Query: explain select * from test_view where year < 2010
+---------------------------------------------------------------------------------+
| Explain String
       |
+---------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=8.00KB Threads=2
       |
| Per-Host Resource Estimates: Memory=16MB
       |
| Codegen disabled by planner
      |
|
      |
| PLAN-ROOT SINK
       |
| |
      |
| 00:UNION
       |
| |  pass-through-operands: all
      |
| |  row-size=89B cardinality=8
      |
| |
      |
| |--02:SCAN HDFS [functional.alltypestiny]
      |
| |     partition predicates: `year` <= 2009, functional.alltypestiny.year
< 2010 |
| |     HDFS partitions=4/4 files=4 size=460B
      |
| |     row-size=89B cardinality=8
       |
| |
      |
| 01:SCAN HDFS [functional.alltypes]
       |
|    partition predicates: `year` >= 2010, functional.alltypes.year < 2010
       |
|    partitions=0/24 files=0 size=0B
       |
|    row-size=89B cardinality=0
      |
+---------------------------------------------------------------------------------+

Note that there are 0 partitions.

If the values aren't present in a particular table as well, other forms of
data elimination at runtime can be very effective at avoiding scanning data
- https://blog.cloudera.com/faster-performance-for-selective-queries/,
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/.
Those won't show up in the plan.

On Tue, Dec 8, 2020 at 7:35 AM Vibhath Ileperuma <
vibhatharunapr...@gmail.com> wrote:

> Hi Team,
>
>
>
> When I’m evaluating the execution of queries submitted to a view, I
> noticed the following issue.
>
>
>
> *Problem:*
>
> Impala fails to scan only the correct base table when a query is submitted
> to the view.
>
> Consider following Example
>
> Let’s say we create two tables and a view as follows.
>
> *CREATE* *TABLE* TEST_TABLE_1(
>
> COL1 *BIGINT*,
>
> COL_RANGE *BIGINT*
>
> )
>
>
>
> *CREATE* *TABLE* TEST_TABLE_2(
>
> COL1 *BIGINT*,
>
> COL_RANGE *BIGINT*
>
> )
>
>
>
> *CREATE* *VIEW* TEST_VIEW *AS*
>
> *SELECT* COL1, COL_RANGE *FROM* TEST_TABLE_1 *WHERE* COL_RANGE <= 5
>
> *UNION* *ALL*
>
> *SELECT* COL1, COL_RANGE *FROM* TEST_TABLE_2 *WHERE* COL_RANGE > 5
>
>
>
> If we submit a select query with only one equal condition for ‘COL_RANGE’,
> impala scans only the correct base table.
>
> *SELECT* * *FROM* TEST_VIEW *WHERE* COL_RANGE = 2
>
> Operator              #Hosts  Avg Time  Max Time  #Rows  Est. #Rows  Peak
> Mem  Est. Peak Mem  Detail
>
>
> -------------------------------------------------------------------------------------------------------------------
>
> F02:ROOT                   1   0.000ns
> 0.000ns                            0              0
>
> 03:EXCHANGE                1   0.000ns   0.000ns      1           1  16.00
> KB       16.00 KB  UNPARTITIONED
>
> F01:EXCHANGE SENDER        1   0.000ns   0.000ns                      7.52
> KB              0
>
> 00:UNION                   1   0.000ns   0.000ns      1
> 1         0              0
>
> 01:SCAN LOCAL              1   6.999ms   6.999ms      1           1  49.00
> KB       32.00 MB  default.test_table_1
>
>
>
> *SELECT* * *FROM* TEST_VIEW *WHERE* COL_RANGE = 7
>
> Operator              #Hosts   Avg Time   Max Time  #Rows  Est. #Rows  Peak 
> Mem  Est. Peak Mem  Detail
>
> ---------------------------------------------------------------------------------------------------------------------
>
> F02:ROOT                   1    0.000ns    0.000ns                            
> 0              0
>
> 03:EXCHANGE                1    0.000ns    0.000ns      1           1  16.00 
> KB       16.00 KB  UNPARTITIONED
>
> F01:EXCHANGE SENDER        1  999.987us  999.987us                      7.52 
> KB              0
>
> 00:UNION                   1    0.000ns    0.000ns      1           1         
> 0              0
>
> 02:SCAN LOCAL              1    4.999ms    4.999ms      1           1  49.00 
> KB       32.00 MB  default.test_table_2
>
>
>
>
>
> But if we submit a query with a range condition, it always scan both the
> tables.
>
> *SELECT* * *FROM* TEST_VIEW *WHERE* COL_RANGE < 4
>
> Operator              #Hosts  Avg Time  Max Time  #Rows  Est. #Rows  Peak Mem 
>  Est. Peak Mem  Detail
>
> -------------------------------------------------------------------------------------------------------------------
>
> F03:ROOT                   1   0.000ns   0.000ns                            0 
>              0
>
> 03:EXCHANGE                1   0.000ns   0.000ns      2           2  16.00 KB 
>       16.00 KB  UNPARTITIONED
>
> F02:EXCHANGE SENDER        1   0.000ns   0.000ns                      7.52 KB 
>              0
>
> 00:UNION                   1   0.000ns   0.000ns      2           2         0 
>              0
>
> |--02:SCAN LOCAL           1   2.999ms   2.999ms      0           1  41.00 KB 
>       32.00 MB  default.test_table_2
>
> 01:SCAN LOCAL              1   4.999ms   4.999ms      2           1  49.00 KB 
>       32.00 MB  default.test_table_1
>
>
>
> *SELECT* * *FROM* TEST_VIEW *WHERE* COL_RANGE > 5
>
> Operator              #Hosts  Avg Time  Max Time  #Rows  Est. #Rows  Peak Mem 
>  Est. Peak Mem  Detail
>
> -------------------------------------------------------------------------------------------------------------------
>
> F03:ROOT                   1   0.000ns   0.000ns                            0 
>              0
>
> 03:EXCHANGE                1   0.000ns   0.000ns      3           2  16.00 KB 
>       16.00 KB  UNPARTITIONED
>
> F02:EXCHANGE SENDER        1   1.000ms   1.000ms                      7.52 KB 
>              0
>
> 00:UNION                   1   0.000ns   0.000ns      3           2         0 
>              0
>
> |--02:SCAN LOCAL           1   5.000ms   5.000ms      3           1  49.00 KB 
>       32.00 MB  default.test_table_2
>
> 01:SCAN LOCAL              1   7.000ms   7.000ms      0           1  41.00 KB 
>       32.00 MB  default.test_table_1
>
>
>
>
>
>
>
> I would be really grateful if you could check on this and advise how to
> overcome this issue.
>
> Thanks & Regards
>
> *Vibhath Ileperuma*
>
>
>

Re: Impala fails to scan only the correct base table when a query is submitted to the view.

Reply via email to