RaviShankar KS created SPARK-10967:
--------------------------------------

             Summary: Incorrect Join behavior in filter conditions
                 Key: SPARK-10967
                 URL: https://issues.apache.org/jira/browse/SPARK-10967
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 1.4.1
         Environment: Ubuntu on AWS
            Reporter: RaviShankar KS
            Assignee: Josh Rosen
             Fix For: 1.5.0


According to the [Hive Language 
Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] 
for UNION ALL:

{quote}
The number and names of columns returned by each select_statement have to be 
the same. Otherwise, a schema error is thrown.
{quote}

Spark SQL silently swallows the error when the tables combined with UNION 
ALL have the same number of columns but different column names.

Reproducible example:

{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"

def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note the first column is named category in one table and cat in the other
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"cat" : "A", "num" : 5}""")

//  +--------+---+
//  |category|num|
//  +--------+---+
//  |       A|  5|
//  |       A|  5|
//  +--------+---+
//
//  Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}
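Until this is fixed, a defensive guard can catch the mismatch before the query runs. The helper below is a hypothetical sketch, not part of Spark's API; it compares plain column-name sequences, so with a live context one would pass {{df.schema.fieldNames.toSeq}} for each side:

{code}
// Hypothetical guard (not part of Spark): fail fast when the column names
// of the two UNION ALL branches differ. It is a pure function over name
// sequences, so it can be exercised without a SparkContext; in a real
// session, pass ctx.table("test_one").schema.fieldNames.toSeq per side.
def assertSameColumns(left: Seq[String], right: Seq[String]): Unit =
  require(left == right,
    s"UNION ALL schema mismatch: [${left.mkString(", ")}] vs [${right.mkString(", ")}]")

assertSameColumns(Seq("category", "num"), Seq("category", "num")) // passes
// assertSameColumns(Seq("category", "num"), Seq("cat", "num"))
//   would throw IllegalArgumentException
{code}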

When the number of columns differs, Spark can even mix values of different datatypes in the same result column.

Reproducible example (requires a new spark-shell session):

{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"

def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note that test_another is missing the category column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"num" : 5}""")

//  +--------+
//  |category|
//  +--------+
//  |       A|
//  |       5| 
//  +--------+
//
//  Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}
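The same kind of guard can cover the column-count and datatype cases as well. Again a hypothetical sketch rather than a Spark API: it compares (name, typeName) pairs, which with a live context one could build from {{df.schema.fields.map(f => (f.name, f.dataType.simpleString))}}:

{code}
// Hypothetical guard (not part of Spark): check arity, names, and types of
// the two branches before issuing the UNION ALL. Works on plain
// (name, typeName) pairs, so it needs no SparkContext.
def assertUnionCompatible(left: Seq[(String, String)],
                          right: Seq[(String, String)]): Unit = {
  require(left.length == right.length,
    s"column counts differ: ${left.length} vs ${right.length}")
  require(left == right,
    s"column names/types differ: $left vs $right")
}

assertUnionCompatible(
  Seq(("category", "string"), ("num", "bigint")),
  Seq(("category", "string"), ("num", "bigint"))) // passes
// Passing Seq(("num", "bigint")) on the right would throw: column counts differ
{code}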

At other times, when the schemas are complex, Spark SQL produces a misleading 
error about an unresolved Union operator:

{code}
scala> ctx.sql("""select * from view_clicks
     | union all
     | select * from view_clicks_aug
     | """)
15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks
union all
select * from view_clicks_aug
15/08/11 02:40:25 INFO ParseDriver: Parse Completed
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO audit: ugi=ubuntu        ip=unknown-ip-addr      
cmd=get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO audit: ugi=ubuntu        ip=unknown-ip-addr      
cmd=get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
tbl=view_clicks_aug
15/08/11 02:40:25 INFO audit: ugi=ubuntu        ip=unknown-ip-addr      
cmd=get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
tbl=view_clicks_aug
15/08/11 02:40:25 INFO audit: ugi=ubuntu        ip=unknown-ip-addr      
cmd=get_table : db=default tbl=view_clicks_aug
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
        at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
        at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
        at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:126)
        at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:98)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
        at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
        at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:42)
        at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931)
        at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
        at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
        at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:755){code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
