[GitHub] spark pull request #20929: [SPARK-23772][SQL] Provide an option to ignore co...

2018-06-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20929


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20929: [SPARK-23772][SQL] Provide an option to ignore co...

2018-05-28 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/20929#discussion_r191119148
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2408,4 +2408,24 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
       Row(badJson))
   }
+
+  test("SPARK-23772 ignore column of all null values or empty array during schema inference") {
+    withTempPath { tempDir =>
+      val path = tempDir.getAbsolutePath
+      Seq(
+        """{"a":null, "b":[null, null], "c":null, "d":[[], [null]], "e":{}}""",
+        """{"a":null, "b":[null], "c":[], "d": [null, []], "e":{}}""",
+        """{"a":null, "b":[], "c":[], "d": null, "e":null}""")
--- End diff --

ok


---




[GitHub] spark pull request #20929: [SPARK-23772][SQL] Provide an option to ignore co...

2018-05-28 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/20929#discussion_r191118958
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -379,6 +379,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * that should be used for parsing.
    * `samplingRatio` (default is 1.0): defines fraction of input JSON objects used
    * for schema inferring.
+   * `dropFieldIfAllNull` (default `false`): whether to ignore column of all null values or
--- End diff --

ok


---




[GitHub] spark pull request #20929: [SPARK-23772][SQL] Provide an option to ignore co...

2018-05-28 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/20929#discussion_r191118916
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2408,4 +2408,24 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
       Row(badJson))
   }
+
+  test("SPARK-23772 ignore column of all null values or empty array during schema inference") {
+    withTempPath { tempDir =>
+      val path = tempDir.getAbsolutePath
+      Seq(
+        """{"a":null, "b":[null, null], "c":null, "d":[[], [null]], "e":{}}""",
+        """{"a":null, "b":[null], "c":[], "d": [null, []], "e":{}}""",
+        """{"a":null, "b":[], "c":[], "d": null, "e":null}""")
+        .toDS().write.mode("overwrite").text(path)
--- End diff --

ok


---




[GitHub] spark pull request #20929: [SPARK-23772][SQL] Provide an option to ignore co...

2018-05-23 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20929#discussion_r190447608
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2408,4 +2408,24 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
       Row(badJson))
   }
+
+  test("SPARK-23772 ignore column of all null values or empty array during schema inference") {
+    withTempPath { tempDir =>
+      val path = tempDir.getAbsolutePath
+      Seq(
+        """{"a":null, "b":[null, null], "c":null, "d":[[], [null]], "e":{}}""",
+        """{"a":null, "b":[null], "c":[], "d": [null, []], "e":{}}""",
+        """{"a":null, "b":[], "c":[], "d": null, "e":null}""")
+        .toDS().write.mode("overwrite").text(path)
+      val df = spark.read.format("json")
+        .option("dropFieldIfAllNull", true)
+        .load(path)
+      val expectedSchema = new StructType()
+        .add("a", NullType).add("b", NullType).add("c", NullType).add("d", NullType)
+        .add("e", NullType)
+      assert(df.schema === expectedSchema)
--- End diff --

No, there's no explicit preference between them; opinions differ even among the committers. It's fine to use either one.


---




[GitHub] spark pull request #20929: [SPARK-23772][SQL] Provide an option to ignore co...

2018-05-23 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20929#discussion_r190447504
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -379,6 +379,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * that should be used for parsing.
    * `samplingRatio` (default is 1.0): defines fraction of input JSON objects used
    * for schema inferring.
+   * `dropFieldIfAllNull` (default `false`): whether to ignore column of all null values or
--- End diff --

Python side too.


---




[GitHub] spark pull request #20929: [SPARK-23772][SQL] Provide an option to ignore co...

2018-05-23 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20929#discussion_r190399084
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2408,4 +2408,24 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
       Row(badJson))
   }
+
+  test("SPARK-23772 ignore column of all null values or empty array during schema inference") {
+    withTempPath { tempDir =>
+      val path = tempDir.getAbsolutePath
+      Seq(
+        """{"a":null, "b":[null, null], "c":null, "d":[[], [null]], "e":{}}""",
+        """{"a":null, "b":[null], "c":[], "d": [null, []], "e":{}}""",
+        """{"a":null, "b":[], "c":[], "d": null, "e":null}""")
+        .toDS().write.mode("overwrite").text(path)
--- End diff --

Do you need the `overwrite` mode?


---




[GitHub] spark pull request #20929: [SPARK-23772][SQL] Provide an option to ignore co...

2018-05-23 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20929#discussion_r190401748
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2408,4 +2408,24 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
       Row(badJson))
   }
+
+  test("SPARK-23772 ignore column of all null values or empty array during schema inference") {
+    withTempPath { tempDir =>
+      val path = tempDir.getAbsolutePath
+      Seq(
+        """{"a":null, "b":[null, null], "c":null, "d":[[], [null]], "e":{}}""",
+        """{"a":null, "b":[null], "c":[], "d": [null, []], "e":{}}""",
+        """{"a":null, "b":[], "c":[], "d": null, "e":null}""")
--- End diff --

Could you add a test for the case when `dropFieldIfAllNull` is set to `true` but not all values in a column are `null`?
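For reference, here is a minimal, self-contained Python sketch of the behavior such a test would exercise. This is not Spark's actual inference code; the helper names and the exact drop rule are illustrative, assuming a field is dropped only when its value in every sampled record is null, an empty array/object, or an array containing nothing but such values:

```python
import json

def all_null_or_empty(value):
    """True if a value carries no data: null, an empty object,
    or an array whose elements are themselves all null/empty."""
    if value is None:
        return True
    if isinstance(value, list):
        return all(all_null_or_empty(v) for v in value)
    if isinstance(value, dict):
        return len(value) == 0
    return False

def infer_kept_fields(json_lines):
    """Return the top-level field names that would survive schema
    inference when all-null/empty fields are dropped."""
    records = [json.loads(line) for line in json_lines]
    fields = sorted({k for r in records for k in r})
    return [f for f in fields
            if not all(all_null_or_empty(r.get(f)) for r in records)]

lines = [
    '{"a":null, "b":[null, null], "c":"x"}',  # "c" has a real value
    '{"a":null, "b":[], "c":null}',
]
print(infer_kept_fields(lines))  # prints ['c']
```

The point of the requested test is the last case: `"c"` is `null` in one record but not in all of them, so it must be kept.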


---




[GitHub] spark pull request #20929: [SPARK-23772][SQL] Provide an option to ignore co...

2018-05-23 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20929#discussion_r190397868
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -379,6 +379,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * that should be used for parsing.
    * `samplingRatio` (default is 1.0): defines fraction of input JSON objects used
    * for schema inferring.
+   * `dropFieldIfAllNull` (default `false`): whether to ignore column of all null values or
--- End diff --

How about [DataStreamReader](https://github.com/maropu/spark/blob/SPARK-23772/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala#L277)? I guess the description should be added to it too.


---




[GitHub] spark pull request #20929: [SPARK-23772][SQL] Provide an option to ignore co...

2018-05-23 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20929#discussion_r190400420
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2408,4 +2408,24 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
       Row(badJson))
   }
+
+  test("SPARK-23772 ignore column of all null values or empty array during schema inference") {
+    withTempPath { tempDir =>
+      val path = tempDir.getAbsolutePath
+      Seq(
+        """{"a":null, "b":[null, null], "c":null, "d":[[], [null]], "e":{}}""",
+        """{"a":null, "b":[null], "c":[], "d": [null, []], "e":{}}""",
+        """{"a":null, "b":[], "c":[], "d": null, "e":null}""")
+        .toDS().write.mode("overwrite").text(path)
+      val df = spark.read.format("json")
+        .option("dropFieldIfAllNull", true)
+        .load(path)
+      val expectedSchema = new StructType()
+        .add("a", NullType).add("b", NullType).add("c", NullType).add("d", NullType)
+        .add("e", NullType)
+      assert(df.schema === expectedSchema)
--- End diff --

It seems `DefaultEquality` is used here, which applies `==`. Is there any reason to use `===` instead of just `==`?


---
