[GitHub] spark pull request #20929: [SPARK-23772][SQL] Provide an option to ignore co...
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/20929

---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/20929#discussion_r191119148

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2408,4 +2408,24 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
       Row(badJson))
   }
+
+  test("SPARK-23772 ignore column of all null values or empty array during schema inference") {
+    withTempPath { tempDir =>
+      val path = tempDir.getAbsolutePath
+      Seq(
+        """{"a":null, "b":[null, null], "c":null, "d":[[], [null]], "e":{}}""",
+        """{"a":null, "b":[null], "c":[], "d": [null, []], "e":{}}""",
+        """{"a":null, "b":[], "c":[], "d": null, "e":null}""")
--- End diff --

ok
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/20929#discussion_r191118958

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -379,6 +379,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * that should be used for parsing.
    * `samplingRatio` (default is 1.0): defines fraction of input JSON objects used
    * for schema inferring.
+   * `dropFieldIfAllNull` (default `false`): whether to ignore column of all null values or
--- End diff --

ok
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/20929#discussion_r191118916

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2408,4 +2408,24 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
       Row(badJson))
   }
+
+  test("SPARK-23772 ignore column of all null values or empty array during schema inference") {
+    withTempPath { tempDir =>
+      val path = tempDir.getAbsolutePath
+      Seq(
+        """{"a":null, "b":[null, null], "c":null, "d":[[], [null]], "e":{}}""",
+        """{"a":null, "b":[null], "c":[], "d": [null, []], "e":{}}""",
+        """{"a":null, "b":[], "c":[], "d": null, "e":null}""")
+        .toDS().write.mode("overwrite").text(path)
--- End diff --

ok
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20929#discussion_r190447608

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2408,4 +2408,24 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
       Row(badJson))
   }
+
+  test("SPARK-23772 ignore column of all null values or empty array during schema inference") {
+    withTempPath { tempDir =>
+      val path = tempDir.getAbsolutePath
+      Seq(
+        """{"a":null, "b":[null, null], "c":null, "d":[[], [null]], "e":{}}""",
+        """{"a":null, "b":[null], "c":[], "d": [null, []], "e":{}}""",
+        """{"a":null, "b":[], "c":[], "d": null, "e":null}""")
+        .toDS().write.mode("overwrite").text(path)
+      val df = spark.read.format("json")
+        .option("dropFieldIfAllNull", true)
+        .load(path)
+      val expectedSchema = new StructType()
+        .add("a", NullType).add("b", NullType).add("c", NullType).add("d", NullType)
+        .add("e", NullType)
+      assert(df.schema === expectedSchema)
--- End diff --

No, there's no explicit preference between them since the preferences are divided even among committers. It's fine to use either of them.
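Aside from the quoted diff, the semantics under discussion can be sketched without Spark. The helper below is a hypothetical plain-Python illustration of one reading of "ignore column of all null values or empty array": a top-level field survives schema inference only if at least one record holds a value that is not null, not an empty array/object, and not an array nested from only such values. It is not Spark's actual inference code, and the extra fourth record is invented to show the keep case.

```python
import json

def kept_fields(records):
    """Return the top-level field names that would survive inference
    when fields that are effectively null in every record are dropped."""
    def is_effectively_null(v):
        # Nulls, empty arrays/objects, and arrays/objects containing
        # only effectively-null values all count as "no data".
        if v is None:
            return True
        if isinstance(v, (list, dict)):
            vals = v.values() if isinstance(v, dict) else v
            return all(is_effectively_null(x) for x in vals)
        return False

    kept = set()
    for rec in records:
        for field, value in rec.items():
            if not is_effectively_null(value):
                kept.add(field)
    return kept

lines = [
    '{"a":null, "b":[null, null], "c":null, "d":[[], [null]], "e":{}}',
    '{"a":null, "b":[null], "c":[], "d": [null, []], "e":{}}',
    '{"a":null, "b":[], "c":[], "d": null, "e":null}',
    '{"a":1}',  # hypothetical extra record: "a" now has a real value
]
kept = kept_fields(json.loads(line) for line in lines)
print(sorted(kept))  # only "a" survives; b-e hold no data in any record
```

This also covers the mixed case a reviewer asks for below: with the fourth record present, `a` is kept even though it is null in the other three records.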
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20929#discussion_r190447504

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -379,6 +379,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * that should be used for parsing.
    * `samplingRatio` (default is 1.0): defines fraction of input JSON objects used
    * for schema inferring.
+   * `dropFieldIfAllNull` (default `false`): whether to ignore column of all null values or
--- End diff --

Python side too.
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20929#discussion_r190399084

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2408,4 +2408,24 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
       Row(badJson))
   }
+
+  test("SPARK-23772 ignore column of all null values or empty array during schema inference") {
+    withTempPath { tempDir =>
+      val path = tempDir.getAbsolutePath
+      Seq(
+        """{"a":null, "b":[null, null], "c":null, "d":[[], [null]], "e":{}}""",
+        """{"a":null, "b":[null], "c":[], "d": [null, []], "e":{}}""",
+        """{"a":null, "b":[], "c":[], "d": null, "e":null}""")
+        .toDS().write.mode("overwrite").text(path)
--- End diff --

Do you need the `overwrite` mode?
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20929#discussion_r190401748

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2408,4 +2408,24 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
       Row(badJson))
   }
+
+  test("SPARK-23772 ignore column of all null values or empty array during schema inference") {
+    withTempPath { tempDir =>
+      val path = tempDir.getAbsolutePath
+      Seq(
+        """{"a":null, "b":[null, null], "c":null, "d":[[], [null]], "e":{}}""",
+        """{"a":null, "b":[null], "c":[], "d": [null, []], "e":{}}""",
+        """{"a":null, "b":[], "c":[], "d": null, "e":null}""")
--- End diff --

Could you add a test where `dropFieldIfAllNull` is set to `true` but not all values in a column are `null`s?
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20929#discussion_r190397868

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -379,6 +379,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * that should be used for parsing.
    * `samplingRatio` (default is 1.0): defines fraction of input JSON objects used
    * for schema inferring.
+   * `dropFieldIfAllNull` (default `false`): whether to ignore column of all null values or
--- End diff --

How about [DataStreamReader](https://github.com/maropu/spark/blob/SPARK-23772/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala#L277)? I guess the description should be added to it too.
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20929#discussion_r190400420

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2408,4 +2408,24 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
       Row(badJson))
   }
+
+  test("SPARK-23772 ignore column of all null values or empty array during schema inference") {
+    withTempPath { tempDir =>
+      val path = tempDir.getAbsolutePath
+      Seq(
+        """{"a":null, "b":[null, null], "c":null, "d":[[], [null]], "e":{}}""",
+        """{"a":null, "b":[null], "c":[], "d": [null, []], "e":{}}""",
+        """{"a":null, "b":[], "c":[], "d": null, "e":null}""")
+        .toDS().write.mode("overwrite").text(path)
+      val df = spark.read.format("json")
+        .option("dropFieldIfAllNull", true)
+        .load(path)
+      val expectedSchema = new StructType()
+        .add("a", NullType).add("b", NullType).add("c", NullType).add("d", NullType)
+        .add("e", NullType)
+      assert(df.schema === expectedSchema)
--- End diff --

It seems `DefaultEquality` is used here, which applies `==`. Are there any reasons for `===` instead of just `==`?