[jira] [Commented] (SPARK-11447) Null comparison requires type information but type extraction fails for complex types

2015-11-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005799#comment-15005799
 ] 

Apache Spark commented on SPARK-11447:
--

User 'kevinyu98' has created a pull request for this issue:
https://github.com/apache/spark/pull/9720
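The PR body is not reproduced in this digest. Purely as an illustration of the 
direction such a fix could take (a hypothetical sketch, not the contents of 
pull request 9720): instead of promoting a String/Null comparison to 
DoubleType, a coercion rule could cast the untyped null to the other 
operand's type.

// Hypothetical sketch only -- not the actual change in PR 9720.
import org.apache.spark.sql.catalyst.expressions.{Cast, EqualNullSafe, Expression}
import org.apache.spark.sql.types.NullType

def coerceNullSafeEquality(e: Expression): Expression = e match {
  // Cast the NullType side to the other side's type, so no lossy
  // String -> Double promotion is ever introduced.
  case EqualNullSafe(l, r) if r.dataType == NullType =>
    EqualNullSafe(l, Cast(r, l.dataType))
  case EqualNullSafe(l, r) if l.dataType == NullType =>
    EqualNullSafe(Cast(l, r.dataType), r)
  case other => other
}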

> Null comparison requires type information but type extraction fails for 
> complex types
> --------------------------------------------------------------------------
>
> Key: SPARK-11447
> URL: https://issues.apache.org/jira/browse/SPARK-11447
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Kapil Singh
>
> While comparing a Column to a null literal, the comparison works only if the 
> type of the null literal matches the type of the Column it is being compared 
> to. Example Scala code (can be run from the spark shell):
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.expressions._
> val inputRowsData = Seq(Seq("abc"), Seq(null), Seq("xyz"))
> val inputRows = for (seq <- inputRowsData) yield Row.fromSeq(seq)
> val dfSchema = StructType(Seq(StructField("column", StringType, true)))
> val df = sqlContext.createDataFrame(sc.makeRDD(inputRows), dfSchema)
> // DOESN'T WORK
> val filteredDF = df.filter(df("column") <=> (new Column(Literal(null))))
> // WORKS
> val filteredDF = df.filter(df("column") <=> (new Column(Literal.create(null, 
> SparkleFunctions.dataType(df("column"))))))
> Why should type information be required for a null comparison? Even if it is 
> required, it is not always possible to extract type information from complex 
> types (e.g. StructType). The following Scala code (can be run from the spark 
> shell) throws org.apache.spark.sql.catalyst.analysis.UnresolvedException:
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.expressions._
> val inputRowsData = Seq(Seq(Row.fromSeq(Seq("abc", "def"))), 
> Seq(Row.fromSeq(Seq(null, "123"))), Seq(Row.fromSeq(Seq("ghi", "jkl"))))
> val inputRows = for (seq <- inputRowsData) yield Row.fromSeq(seq)
> val dfSchema = StructType(Seq(StructField("column", 
> StructType(Seq(StructField("p1", StringType, true), StructField("p2", 
> StringType, true))), true)))
> val df = sqlContext.createDataFrame(sc.makeRDD(inputRows), dfSchema)
> val filteredDF = df.filter(df("column")("p1") <=> (new 
> Column(Literal.create(null, SparkleFunctions.dataType(df("column")("p1"))))))
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: column#0[p1]
>   at org.apache.spark.sql.catalyst.analysis.UnresolvedExtractValue.dataType(unresolved.scala:243)
>   at org.apache.spark.sql.ArithmeticFunctions$class.dataType(ArithmeticFunctions.scala:76)
>   at org.apache.spark.sql.SparkleFunctions$.dataType(SparkleFunctions.scala:14)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
>   at $iwC$$iwC$$iwC.<init>(<console>:63)
>   at $iwC$$iwC.<init>(<console>:65)
>   at $iwC.<init>(<console>:67)
>   at <init>(<console>:69)
>   at .<init>(<console>:73)
>   at .<clinit>(<console>)
>   at .<init>(<console>:7)
>   at .<clinit>(<console>)
>   at $print(<console>)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
>   at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>   at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>   at ...

[jira] [Commented] (SPARK-11447) Null comparison requires type information but type extraction fails for complex types

2015-11-10 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1490#comment-1490
 ] 

kevin yu commented on SPARK-11447:
--

Hi Kapil: I have a possible fix ready; now I am working on the test case.
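
For reference, the shape such a test could take (a hypothetical sketch, not 
the test from the PR; df is the one built in the issue description above):

// Hypothetical test sketch: with the fix, a null-safe comparison
// against an untyped null should keep only the null row.
val rows = df.filter(df("column") <=> (new Column(Literal(null)))).collect()
assert(rows.toSeq == Seq(Row(null)))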

Kevin



[jira] [Commented] (SPARK-11447) Null comparison requires type information but type extraction fails for complex types

2015-11-05 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992271#comment-14992271
 ] 

kevin yu commented on SPARK-11447:
--

Hello Kapil: Thanks a lot. I am looking into it now. Kevin


[jira] [Commented] (SPARK-11447) Null comparison requires type information but type extraction fails for complex types

2015-11-03 Thread Kapil Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14987096#comment-14987096
 ] 

Kapil Singh commented on SPARK-11447:
-

On second look, I seem to have identified the issue. Take a look at lines 
283-286 here:
https://github.com/apache/spark/blob/a01cbf5daac148f39cd97299780f542abc41d1e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala

If one of the types in a BinaryComparison is StringType and the other is 
NullType, this rule forces DoubleType onto the StringType side while the 
analyzed plan is computed. Later, when this cast is executed (lines 340-343 in 
https://github.com/apache/spark/blob/a01cbf5daac148f39cd97299780f542abc41d1e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala),
the conversion to double fails whenever the string is not actually a number, 
and a default null value is assigned to the result. This manifests as the 
null comparison evaluating to true for all string values.
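
A quick way to see this effect from the shell (illustrative, not part of the 
original comment; df is the single-StringType-column DataFrame from the issue 
description above):

// Casting a non-numeric string to double yields null, so after the
// forced DoubleType coercion both sides of <=> are null and every row
// satisfies the comparison.
df.selectExpr("column", "cast(column AS double) AS as_double").show()
// "abc" -> null, "xyz" -> null; null <=> null is true for every row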


[jira] [Commented] (SPARK-11447) Null comparison requires type information but type extraction fails for complex types

2015-11-02 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985569#comment-14985569
 ] 

kevin yu commented on SPARK-11447:
--

Hello Kapil:

When you say it doesn't work, do you mean that you got an exception?

I tried Spark 1.5, and this Scala code works for me.
Can you verify which Spark version you are running? I saw you put 1.5.1 there.

// DOESN'T WORK
val filteredDF = df.filter(df("column") <=> (new Column(Literal(null))))

I ran this in my spark shell on the latest version:

scala> val filteredDF = df.filter(df("column") <=> (new Column(Literal(null))))
filteredDF: org.apache.spark.sql.DataFrame = [column: string]
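
One way to check whether the statement above merely resolves, rather than 
behaving correctly (a sketch, assuming the df from the issue description 
above), is to print the analyzed plan and the result:

val filteredDF = df.filter(df("column") <=> (new Column(Literal(null))))
filteredDF.explain(true)  // extended = true also prints the analyzed plan
filteredDF.show()         // per Kapil's reply below, this returns every row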

thanks.




[jira] [Commented] (SPARK-11447) Null comparison requires type information but type extraction fails for complex types

2015-11-02 Thread Kapil Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14986552#comment-14986552
 ] 

Kapil Singh commented on SPARK-11447:
-

Hi Kevin,

My bad, I should have been more explicit. The code runs, but the result is 
incorrect:

val filteredDF = df.filter(df("column") <=> (new Column(Literal(null))))
filteredDF.show

gives:

+------+
|column|
+------+
|   abc|
|  null|
|   xyz|
+------+

which is not correct. The result should include only the row with the null 
value.

In the second version, where I'm passing type information:

val filteredDF = df.filter(df("column") <=> (new Column(Literal.create(null, 
SparkleFunctions.dataType(df("column"))))))
filteredDF.show

the result is:

+------+
|column|
+------+
|  null|
+------+

which is correct. Here SparkleFunctions.dataType is just a method I defined to 
get type information from Column.expr outside the "org.apache.spark.sql" 
package:

package org.apache.spark.sql

import org.apache.spark.sql.types.DataType

object SparkleFunctions {
  def dataType(c: Column): DataType = c.expr.dataType
}
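
If modifying package internals is undesirable, the same type information can 
be looked up in the DataFrame's schema using only the public API (a sketch; 
df and the schemas are the ones from the issue description above):

import org.apache.spark.sql.types.{DataType, StructType}

// Top-level column: the type is available directly from the schema.
val topLevelType: DataType = df.schema("column").dataType

// Nested field: descend into the StructType by field name instead of
// resolving df("column")("p1"), which is what throws UnresolvedException.
val nestedType: DataType = df.schema("column").dataType
  .asInstanceOf[StructType].apply("p1").dataType

// Either type can then back a typed null literal:
//   Literal.create(null, nestedType)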
