You're seeing the effects of one optimization where scans with trivially false predicates or contradictory predicates are eliminated. That optimization can deduce contradictions between equality predicates but doesn't do more complex reasoning about whether ranges specified by inequality predicates overlap.
Partition pruning is the optimization that is generally more important and works with a much bigger set of predicates. I don't know what your scenario is, but I've seen in the past that use cases with views like yours tend to also be partitioned on the columns where there's significant opportunity for data elimination. E.g. create view test_view as select * from functional.alltypes where year >= 2010 union all select * from functional.alltypestiny where year <= 2009 [localhost.EXAMPLE.COM:21050] functional> explain select * from test_view where year < 2010; Query: explain select * from test_view where year < 2010 +---------------------------------------------------------------------------------+ | Explain String | +---------------------------------------------------------------------------------+ | Max Per-Host Resource Reservation: Memory=8.00KB Threads=2 | | Per-Host Resource Estimates: Memory=16MB | | Codegen disabled by planner | | | | PLAN-ROOT SINK | | | | | 00:UNION | | | pass-through-operands: all | | | row-size=89B cardinality=8 | | | | | |--02:SCAN HDFS [functional.alltypestiny] | | | partition predicates: `year` <= 2009, functional.alltypestiny.year < 2010 | | | HDFS partitions=4/4 files=4 size=460B | | | row-size=89B cardinality=8 | | | | | 01:SCAN HDFS [functional.alltypes] | | partition predicates: `year` >= 2010, functional.alltypes.year < 2010 | | partitions=0/24 files=0 size=0B | | row-size=89B cardinality=0 | +---------------------------------------------------------------------------------+ Note that there are 0 partitions. If the values aren't present in a particular table as well, other forms of data elimination at runtime can be very effective at avoiding scanning data - https://blog.cloudera.com/faster-performance-for-selective-queries/, https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/. Those won't show up in the plan. On Tue, Dec 8, 2020 at 7:35 AM Vibhath Ileperuma < vibhatharunapr...@gmail.com> wrote: > Hi Team, > > > > When I’m evaluating the execution of queries submitted to a view, I > noticed the following issue. > > > > *Problem:* > > Impala fails to scan only the correct base table when a query is submitted > to the view. > > Consider following Example > > Let’s say we create two tables and a view as follows. > > *CREATE* *TABLE* TEST_TABLE_1( > > COL1 *BIGINT*, > > COL_RANGE *BIGINT* > > ) > > > > *CREATE* *TABLE* TEST_TABLE_2( > > COL1 *BIGINT*, > > COL_RANGE *BIGINT* > > ) > > > > *CREATE* *VIEW* TEST_VIEW *AS* > > *SELECT* COL1, COL_RANGE *FROM* TEST_TABLE_1 *WHERE* COL_RANGE <= 5 > > *UNION* *ALL* > > *SELECT* COL1, COL_RANGE *FROM* TEST_TABLE_2 *WHERE* COL_RANGE > 5 > > > > If we submit a select query with only one equal condition for ‘COL_RANGE’, > impala scans only the correct base table. > > *SELECT* * *FROM* TEST_VIEW *WHERE* COL_RANGE = 2 > > Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak > Mem Est. Peak Mem Detail > > > ------------------------------------------------------------------------------------------------------------------- > > F02:ROOT 1 0.000ns > 0.000ns 0 0 > > 03:EXCHANGE 1 0.000ns 0.000ns 1 1 16.00 > KB 16.00 KB UNPARTITIONED > > F01:EXCHANGE SENDER 1 0.000ns 0.000ns 7.52 > KB 0 > > 00:UNION 1 0.000ns 0.000ns 1 > 1 0 0 > > 01:SCAN LOCAL 1 6.999ms 6.999ms 1 1 49.00 > KB 32.00 MB default.test_table_1 > > > > *SELECT* * *FROM* TEST_VIEW *WHERE* COL_RANGE = 7 > > Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak > Mem Est. Peak Mem Detail > > --------------------------------------------------------------------------------------------------------------------- > > F02:ROOT 1 0.000ns 0.000ns > 0 0 > > 03:EXCHANGE 1 0.000ns 0.000ns 1 1 16.00 > KB 16.00 KB UNPARTITIONED > > F01:EXCHANGE SENDER 1 999.987us 999.987us 7.52 > KB 0 > > 00:UNION 1 0.000ns 0.000ns 1 1 > 0 0 > > 02:SCAN LOCAL 1 4.999ms 4.999ms 1 1 49.00 > KB 32.00 MB default.test_table_2 > > > > > > But if we submit a query with a range condition, it always scan both the > tables. > > *SELECT* * *FROM* TEST_VIEW *WHERE* COL_RANGE < 4 > > Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem > Est. Peak Mem Detail > > ------------------------------------------------------------------------------------------------------------------- > > F03:ROOT 1 0.000ns 0.000ns 0 > 0 > > 03:EXCHANGE 1 0.000ns 0.000ns 2 2 16.00 KB > 16.00 KB UNPARTITIONED > > F02:EXCHANGE SENDER 1 0.000ns 0.000ns 7.52 KB > 0 > > 00:UNION 1 0.000ns 0.000ns 2 2 0 > 0 > > |--02:SCAN LOCAL 1 2.999ms 2.999ms 0 1 41.00 KB > 32.00 MB default.test_table_2 > > 01:SCAN LOCAL 1 4.999ms 4.999ms 2 1 49.00 KB > 32.00 MB default.test_table_1 > > > > *SELECT* * *FROM* TEST_VIEW *WHERE* COL_RANGE > 5 > > Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem > Est. Peak Mem Detail > > ------------------------------------------------------------------------------------------------------------------- > > F03:ROOT 1 0.000ns 0.000ns 0 > 0 > > 03:EXCHANGE 1 0.000ns 0.000ns 3 2 16.00 KB > 16.00 KB UNPARTITIONED > > F02:EXCHANGE SENDER 1 1.000ms 1.000ms 7.52 KB > 0 > > 00:UNION 1 0.000ns 0.000ns 3 2 0 > 0 > > |--02:SCAN LOCAL 1 5.000ms 5.000ms 3 1 49.00 KB > 32.00 MB default.test_table_2 > > 01:SCAN LOCAL 1 7.000ms 7.000ms 0 1 41.00 KB > 32.00 MB default.test_table_1 > > > > > > > > I would be really grateful if you could check on this and advise how to > overcome this issue. > > Thanks & Regards > > *Vibhath Ileperuma* > > >