Bruce Robbins created SPARK-28156:
-------------------------------------

             Summary: Join plan sometimes does not use cached query
                 Key: SPARK-28156
                 URL: https://issues.apache.org/jira/browse/SPARK-28156
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.3, 2.3.3, 3.0.0
            Reporter: Bruce Robbins


I came across a case where a cached query is referenced on both sides of a 
join, but the InMemoryRelation is inserted on only one side. This case occurs 
only when the cached query uses a (Hive-style) view.

Consider this example:
{noformat}
// create the data
val df1 = Seq.tabulate(10) { x => (x, x + 1, x + 2, x + 3) }.toDF("a", "b", 
"c", "d")
df1.write.mode("overwrite").format("orc").saveAsTable("table1")
sql("drop view if exists table1_vw")
sql("create view table1_vw as select * from table1")

// create the cached query
val cacheddataDf = sql("""
select a, b, c, d
from table1_vw
""")

import org.apache.spark.storage.StorageLevel.DISK_ONLY
cacheddataDf.createOrReplaceTempView("cacheddata")
cacheddataDf.persist(DISK_ONLY)

// main query
val queryDf = sql(s"""
select leftside.a, leftside.b
from cacheddata leftside
join cacheddata rightside
on leftside.a = rightside.a
""")

queryDf.explain(true)
{noformat}
Note that the optimized plan does not use an InMemoryRelation for the right 
side, but instead just uses a Relation:
{noformat}
Project [a#45, b#46]
+- Join Inner, (a#45 = a#37)
   :- Project [a#45, b#46]
   :  +- Filter isnotnull(a#45)
   :     +- InMemoryRelation [a#45, b#46, c#47, d#48], StorageLevel(disk, 1 
replicas)
   :           +- *(1) FileScan orc default.table1[a#37,b#38,c#39,d#40] 
Batched: true, DataFilters: [], Format: ORC, Location: 
InMemoryFileIndex[file:/Users/brobbins/github/spark_upstream/spark-warehouse/table1],
 PartitionFilters: [], PushedFilters: [], ReadSchema: 
struct<a:int,b:int,c:int,d:int>
   +- Project [a#37]
      +- Filter isnotnull(a#37)
         +- Relation[a#37,b#38,c#39,d#40] orc

{noformat}
The fragment does not match the cached query because AliasViewChild adds an 
extra projection under the View on the right side (see #2 below).

AliasViewChild adds the extra projection because the exprIds in the View's 
output appears to have been renamed by Analyzer$ResolveReferences (#1 below). I 
have not yet looked at why.
{noformat}
-
-
-
   +- SubqueryAlias `rightside`
      +- SubqueryAlias `cacheddata`
         +- Project [a#73, b#74, c#75, d#76]
            +- SubqueryAlias `default`.`table1_vw`
(#1) ->        +- View (`default`.`table1_vw`, [a#73,b#74,c#75,d#76])
(#2) ->           +- Project [cast(a#45 as int) AS a#73, cast(b#46 as int) AS 
b#74, cast(c#47 as int) AS c#75, cast(d#48 as int) AS d#76]
                     +- Project [cast(a#37 as int) AS a#45, cast(b#38 as int) 
AS b#46, cast(c#39 as int) AS c#47, cast(d#40 as int) AS d#48]
                        +- Project [a#37, b#38, c#39, d#40]
                           +- SubqueryAlias `default`.`table1`
                              +- Relation[a#37,b#38,c#39,d#40] orc

{noformat}
In a larger query (where cachedata may be referred on either side only 
indirectly), this phenomenon can create certain oddities, as the fragment is 
not replaced with InMemoryRelation, and the fragment is present when the plan 
is optimized as a whole.

In Spark 2.1.3, Spark uses InMemoryRelation on both sides.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to