[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level primitive values

2017-11-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19707


---




[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level primitive values

2017-11-09 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19707#discussion_r150069049
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -1408,6 +1409,23 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
       checkDataset(ds, SpecialCharClass("1", "2"))
     }
   }
+
+  test("SPARK-22472: add null check for top-level primitive values") {
+    // If the primitive values are from Option, we need to do runtime null check.
+    val ds = Seq(Some(1), None).toDS().as[Int]
+    intercept[NullPointerException](ds.collect())
+    val e = intercept[SparkException](ds.map(_ * 2).collect())
+    assert(e.getCause.isInstanceOf[NullPointerException])
+
+    withTempPath { path =>
+      Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
--- End diff --

Thanks!


---




[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level primitive values

2017-11-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19707#discussion_r150064768
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -1408,6 +1409,23 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
       checkDataset(ds, SpecialCharClass("1", "2"))
     }
   }
+
+  test("SPARK-22472: add null check for top-level primitive values") {
+    // If the primitive values are from Option, we need to do runtime null check.
+    val ds = Seq(Some(1), None).toDS().as[Int]
+    intercept[NullPointerException](ds.collect())
+    val e = intercept[SparkException](ds.map(_ * 2).collect())
+    assert(e.getCause.isInstanceOf[NullPointerException])
+
+    withTempPath { path =>
+      Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
--- End diff --

It doesn't matter; I just need a DataFrame. I can even use `Seq(Some(1), None).toDF`, but using parquet is more convincing.


---




[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level primitive values

2017-11-09 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19707#discussion_r150027756
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -1408,6 +1409,23 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
       checkDataset(ds, SpecialCharClass("1", "2"))
     }
   }
+
+  test("SPARK-22472: add null check for top-level primitive values") {
+    // If the primitive values are from Option, we need to do runtime null check.
+    val ds = Seq(Some(1), None).toDS().as[Int]
+    intercept[NullPointerException](ds.collect())
+    val e = intercept[SparkException](ds.map(_ * 2).collect())
+    assert(e.getCause.isInstanceOf[NullPointerException])
+
+    withTempPath { path =>
+      Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
--- End diff --

Is this PR orthogonal to the data source format?
Could you test more data sources, like `JSON`, here?
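A minimal sketch of what a JSON-based variant of the new test could look like (hypothetical, not part of this PR; note that JSON schema inference maps integers to `LongType`, so the sketch reads the column back as `Long`):

```scala
withTempPath { path =>
  // Same round-trip as the Parquet case above, but through the JSON data source.
  Seq(new Integer(1), null).toDF("i").write.json(path.getCanonicalPath)
  // JSON inference yields LongType for integer values, so read back as Long.
  val ds = spark.read.json(path.getCanonicalPath).as[Long]
  // With the new check, the null row surfaces as an NPE instead of a default value.
  intercept[NullPointerException](ds.collect())
}
```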


---




[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level primitive values

2017-11-09 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19707#discussion_r150026840
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala ---
@@ -134,7 +134,13 @@ object ScalaReflection extends ScalaReflection {
     val tpe = localTypeOf[T]
     val clsName = getClassNameFromType(tpe)
     val walkedTypePath = s"""- root class: "$clsName"""" :: Nil
-    deserializerFor(tpe, None, walkedTypePath)
+    val expr = deserializerFor(tpe, None, walkedTypePath)
+    val Schema(_, nullable) = schemaFor(tpe)
+    if (nullable) {
+      expr
+    } else {
+      AssertNotNull(expr, walkedTypePath)
+    }
--- End diff --

Hi, @cloud-fan.
It looks great. Can we add a test case in `ScalaReflectionSuite`, too?
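A minimal sketch of such a `ScalaReflectionSuite` case, based only on the change above (the exact imports and assertion style used in the suite may differ):

```scala
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull

test("SPARK-22472: top-level non-nullable deserializer gets a null check") {
  // Int is a non-nullable primitive, so its deserializer should be wrapped in AssertNotNull.
  assert(ScalaReflection.deserializerFor[Int].isInstanceOf[AssertNotNull])
  // java.lang.Integer is nullable, so no extra wrapping is expected.
  assert(!ScalaReflection.deserializerFor[java.lang.Integer].isInstanceOf[AssertNotNull])
}
```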


---




[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level primitive values

2017-11-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19707#discussion_r150024246
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -1408,6 +1409,23 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
       checkDataset(ds, SpecialCharClass("1", "2"))
     }
   }
+
+  test("SPARK-22472: add null check for top-level primitive values") {
+    // If the primitive values are from Option, we need to do runtime null check.
+    val ds = Seq(Some(1), None).toDS().as[Int]
+    intercept[NullPointerException](ds.collect())
+    val e = intercept[SparkException](ds.map(_ * 2).collect())
+    assert(e.getCause.isInstanceOf[NullPointerException])
+
+    withTempPath { path =>
+      Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
--- End diff --

Not a big deal, but `toDF("i")` is more explicit about the column name.


---




[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level primitive values

2017-11-09 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19707#discussion_r150015018
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -1408,6 +1409,23 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
       checkDataset(ds, SpecialCharClass("1", "2"))
     }
   }
+
+  test("SPARK-22472: add null check for top-level primitive values") {
+    // If the primitive values are from Option, we need to do runtime null check.
+    val ds = Seq(Some(1), None).toDS().as[Int]
+    intercept[NullPointerException](ds.collect())
+    val e = intercept[SparkException](ds.map(_ * 2).collect())
+    assert(e.getCause.isInstanceOf[NullPointerException])
+
+    withTempPath { path =>
+      Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
--- End diff --

nit: `toDF()` also works.


---




[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level primitive values

2017-11-09 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/19707

[SPARK-22472][SQL] add null check for top-level primitive values

## What changes were proposed in this pull request?

One powerful feature of `Dataset` is that we can easily map SQL rows to Scala/Java objects and do runtime null checks automatically.

For example, let's say we have a parquet file with schema `<a: int, b: string>`, and we have a `case class Data(a: Int, b: String)`. Users can easily read this parquet file into `Data` objects, and Spark will throw an NPE if column `a` has null values.

However, the null check is left out for top-level primitive values. For example, let's say we have a parquet file with schema `<v: int>`, and we read it into Scala `Int`. If column `v` has null values, we will get some weird results:
```
scala> val ds = spark.read.parquet(...).as[Int]

scala> ds.show()
+----+
|v   |
+----+
|null|
|1   |
+----+

scala> ds.collect
res0: Array[Int] = Array(0, 1)

scala> ds.map(_ * 2).show
+-----+
|value|
+-----+
|-2   |
|2    |
+-----+
```

This is because Spark internally uses special default values for primitive types, but it never expects users to see or operate on these default values directly.

This PR adds a null check for top-level primitive values.
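With the check in place, the problematic reads above fail fast instead of silently producing default values. A sketch of the expected behavior, mirroring the new test:

```scala
// If the primitive values come from Option, the null check now happens at runtime.
val ds = Seq(Some(1), None).toDS().as[Int]
intercept[NullPointerException](ds.collect())

// Inside a job, the NPE is wrapped in a SparkException.
val e = intercept[SparkException](ds.map(_ * 2).collect())
assert(e.getCause.isInstanceOf[NullPointerException])
```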

## How was this patch tested?

A new test in `DatasetSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark bug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19707.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19707


commit dad50806b27a40ed1112d8ee29b3bd5c60164170
Author: Wenchen Fan 
Date:   2017-11-09T13:39:10Z

add null check for top-level primitive values




---
