[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23132 > mind fixing PR description accordingly? @HyukjinKwon fixed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r240041107

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---

@@ -347,17 +347,28 @@ class JacksonParser(
       schema: StructType,
       fieldConverters: Array[ValueConverter]): InternalRow = {
     val row = new GenericInternalRow(schema.length)
+    var badRecordException: Option[Throwable] = None
+
     while (nextUntil(parser, JsonToken.END_OBJECT)) {
       schema.getFieldIndex(parser.getCurrentName) match {
         case Some(index) =>
-          row.update(index, fieldConverters(index).apply(parser))
-
+          try {
+            row.update(index, fieldConverters(index).apply(parser))
+          } catch {
+            case NonFatal(e) =>
+              badRecordException = badRecordException.orElse(Some(e))
+              parser.skipChildren()
+          }
         case None =>
           parser.skipChildren()
       }
     }
-    row
+    if (badRecordException.isEmpty) {
+      row
+    } else {
+      throw BadRecordException(
+        () => UTF8String.EMPTY_UTF8, () => Some(row), badRecordException.get)

--- End diff --

I added a separate exception
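The pattern in the diff above (convert every field, remember only the first failure, and still hand back the partially filled row) can be sketched outside Spark in plain Scala. This is only an illustrative sketch: the `Converter` type and all names here are stand-ins, not Spark's API.

```scala
import scala.util.control.NonFatal

// Hypothetical stand-in for Spark's per-field value converters; each may throw.
type Converter = String => Any

// Convert every field, keep partial results, and remember only the first
// failure: badRecordException.orElse(Some(e)) keeps the earliest exception.
def convertRow(
    values: Seq[String],
    converters: Seq[Converter]): (Array[Any], Option[Throwable]) = {
  val row = new Array[Any](values.length)
  var badRecordException: Option[Throwable] = None
  values.indices.foreach { i =>
    try {
      row(i) = converters(i)(values(i))
    } catch {
      case NonFatal(e) =>
        badRecordException = badRecordException.orElse(Some(e))
        row(i) = null // failed field stays null, as in PERMISSIVE mode
    }
  }
  (row, badRecordException)
}

val converters: Seq[Converter] = Seq(_.toDouble, _.toInt, identity)
val (row, err) = convertRow(Seq("1.5", "not-an-int", "ok"), converters)
assert(row(0) == 1.5 && row(1) == null && row(2) == "ok")
assert(err.isDefined) // first failure is preserved for the caller to rethrow
```

In the actual PR, the preserved row and the first exception are wrapped into a `BadRecordException`-like carrier so the caller can both report the bad record and emit the partial result.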
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23201#discussion_r240038837

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---

@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
           DecimalType(bigDecimal.precision, bigDecimal.scale)
         }
         decimalTry.getOrElse(StringType)
-      case VALUE_STRING => StringType
+      case VALUE_STRING =>
+        val stringValue = parser.getText

--- End diff --

I didn't mean type inference in partition values, but you are probably right: we should follow the same logic for schema inference in datasources and in partition value types. Just wondering how it works now. This code: https://github.com/apache/spark/blob/5a140b7844936cf2b65f08853b8cfd8c499d4f13/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L474-L482 and this: https://github.com/apache/spark/blob/f982ca07e80074bdc1e3b742c5e21cf368e4ede2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L163 can use different timestamp patterns. Or is it supposed to work only with default settings? Maybe `inferPartitionColumnValue` should ask the datasource to infer date/timestamp types?
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23201#discussion_r240031238

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---

@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
           DecimalType(bigDecimal.precision, bigDecimal.scale)
         }
         decimalTry.getOrElse(StringType)
-      case VALUE_STRING => StringType
+      case VALUE_STRING =>
+        val stringValue = parser.getText

--- End diff --

Yes, date parsing can be less strict in that case, but if we prefer `TimestampType` over `DateType` for similar date and timestamp patterns, we will consume more memory. Also, we can lift `DateType` to `TimestampType` during schema inference, but the opposite way is impossible.
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r240030751

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/TestJsonData.scala ---

@@ -229,6 +229,11 @@ private[json] trait TestJsonData {
       """{"date": "27/10/2014 18:30"}""" ::
       """{"date": "28/01/2016 20:00"}""" :: Nil))(Encoders.STRING)

+  def badRecords: Dataset[String] =

--- End diff --

moved
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r240029821 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -35,7 +35,9 @@ displayTitle: Spark SQL Upgrading Guide - Since Spark 3.0, CSV datasource uses java.time API for parsing and generating CSV content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. - - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, the returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully. + + - In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, the returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully. --- End diff -- `from_csv` was added recently. It didn't exist in 2.4
[GitHub] spark issue #23257: [SPARK-26310][SQL] Verify applicability of JSON options
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23257 @cloud-fan What do you think of the PR?
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r240014440 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -37,6 +37,8 @@ displayTitle: Spark SQL Upgrading Guide - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully. + - In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully. --- End diff -- Fixed for CSV too
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r240006479

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---

@@ -2563,4 +2563,18 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       assert(!files.exists(_.getName.endsWith("json")))
     }
   }
+
+  test("return partial result for bad records") {
+    val schema = new StructType()
+      .add("a", DoubleType)
+      .add("b", ArrayType(IntegerType))
+      .add("c", StringType)
+      .add("_corrupt_record", StringType)

--- End diff --

replaced
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23196 @srowen @HyukjinKwon @gatorsmile Could you take a look at the PR?
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r24694

--- Diff: docs/sql-migration-guide-upgrade.md --- @@ -37,6 +37,8 @@ displayTitle: Spark SQL Upgrading Guide - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully. + - In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully. + --- End diff --

@cloud-fan This PR proposes changes similar to https://github.com/apache/spark/pull/23120. Could you take a look at it?

> For such behavior change, shall we add a config to roll back to previous behavior?

I don't think it makes sense to introduce a global SQL config for this particular case. The risk of breaking user apps is low because app logic cannot be based only on the presence of all nulls in a row: all nulls do not distinguish bad JSON records from well-formed ones. From my point of view, a note in the migration guide is enough.
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23201#discussion_r24411

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---

@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
           DecimalType(bigDecimal.precision, bigDecimal.scale)
         }
         decimalTry.getOrElse(StringType)
-      case VALUE_STRING => StringType
+      case VALUE_STRING =>
+        val stringValue = parser.getText

--- End diff --

The order can matter if you have the same (or a similar) pattern for dates and timestamps. `DateType` can be preferable because it requires less memory. It seems reasonable to move from `DateType` to `TimestampType` during schema inference, since the opposite move is impossible without losing info.
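The widening direction discussed above (lift `DateType` to `TimestampType` when rows disagree, never the reverse) can be modeled with a toy type lattice in plain Scala. This is a hedged sketch of the idea only; the type names below are stand-ins, not Spark's `DataType` hierarchy:

```scala
// Toy model of the inference lattice: DateType can be widened to
// TimestampType while inferring a schema, but not the other way round.
sealed trait InferredType
case object DateT extends InferredType
case object TimestampT extends InferredType
case object StringT extends InferredType

// When two rows disagree, pick the narrowest type that can hold both values.
def compatibleType(a: InferredType, b: InferredType): InferredType = (a, b) match {
  case (x, y) if x == y => x
  case (DateT, TimestampT) | (TimestampT, DateT) => TimestampT // lossless lift
  case _ => StringT // no lossless widening: fall back to string
}

assert(compatibleType(DateT, DateT) == DateT)
assert(compatibleType(DateT, TimestampT) == TimestampT)
assert(compatibleType(TimestampT, StringT) == StringT)
```

Preferring `DateT` until a timestamp value forces the lift is what saves memory: a date needs only days-since-epoch, while a timestamp stores microseconds.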
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r24119

--- Diff: docs/sql-migration-guide-upgrade.md --- @@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0. + - Since Spark 3.0, JSON datasource uses java.time API for parsing and generating JSON content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. --- End diff --

I added an example of when there is a difference, and updated the migration guide.
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r239998397

--- Diff: docs/sql-migration-guide-upgrade.md --- @@ -37,6 +37,8 @@ displayTitle: Spark SQL Upgrading Guide - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully. + - In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully. + --- End diff --

In the `PERMISSIVE` mode, there is no way. But even at the moment (without the PR) you cannot distinguish a row produced from a bad record from a row produced from a JSON object whose fields are all `null`. A row with all nulls cannot by itself indicate a bad record; an additional flag is needed, and a `null` or non-`null` value in the corrupt column plays that role.
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r239998126

--- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala ---

@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
   override def beforeAll() {
     super.beforeAll()
     TestHive.setCacheTables(true)
-    // Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*)
-    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+    // Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*)

--- End diff --

Tests passed with the new parser. I reverted all the settings for `HiveCompatibilitySuite`.
[GitHub] spark issue #23257: [SPARK-26310][SQL] Verify applicability of JSON options
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23257 @gatorsmile Please, have a look at this PR.
[GitHub] spark pull request #23257: [SPARK-26310][SQL] Verify applicability of JSON o...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23257 [SPARK-26310][SQL] Verify applicability of JSON options

## What changes were proposed in this pull request?

In the PR, I propose additional verification of JSON options, in particular, detection of `read` options used in `write` and vice versa. The verification can be disabled by the new SQL config `spark.sql.verifyDataSourceOptions` (`true` by default). These changes should prevent misuse of JSON options and improve the usage experience of the built-in datasource.

## How was this patch tested?

Added a new test to `JacksonGeneratorSuite` that checks how a read option is handled if it is passed on write. The test also checks that the new SQL config can disable the verification.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 sep-options-in-save-load Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23257.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23257

commit edbe0b7e7f29631f6b92255a7b561c916a6c39cc Author: Maxim Gekk Date: 2018-12-07T17:00:11Z Introducing JSONOptionsInWrite

commit af1507093f42a206c2c22f6c40cf4c43290244b8 Author: Maxim Gekk Date: 2018-12-07T20:34:24Z Test for options verification
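The verification the PR describes (rejecting read options passed on write, and vice versa, unless a config disables the check) can be sketched in plain Scala. The option names and the function below are illustrative assumptions, not the PR's actual implementation:

```scala
// Hypothetical split of JSON options by where they apply; the real sets
// live in Spark's JSONOptions / JSONOptionsInWrite classes.
val readOnlyOptions = Set("mode", "columnNameOfCorruptRecord")
val writeOnlyOptions = Set("compression")

// Reject options that are not applicable to the current operation.
// `enabled` models the spark.sql.verifyDataSourceOptions config.
def verifyOptions(
    options: Map[String, String],
    isRead: Boolean,
    enabled: Boolean = true): Unit = {
  if (enabled) {
    val misused =
      if (isRead) options.keySet.intersect(writeOnlyOptions)
      else options.keySet.intersect(readOnlyOptions)
    require(misused.isEmpty, s"Options not applicable: ${misused.mkString(", ")}")
  }
}

verifyOptions(Map("mode" -> "PERMISSIVE"), isRead = true)                    // read option on read: ok
verifyOptions(Map("mode" -> "PERMISSIVE"), isRead = false, enabled = false)  // check disabled: ok
val failed = scala.util.Try(verifyOptions(Map("mode" -> "PERMISSIVE"), isRead = false))
assert(failed.isFailure) // read option passed on write is rejected
```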
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23253#discussion_r239853384

--- Diff: docs/sql-migration-guide-upgrade.md --- @@ -37,6 +37,8 @@ displayTitle: Spark SQL Upgrading Guide - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully. + - In Spark version 2.4 and earlier, JSON datasource and JSON functions like `from_json` convert a bad JSON record to a row with all `null`s in the PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if some of JSON column values were parsed and converted to desired types successfully. + --- End diff --

If the row was produced from a bad JSON record, the bad record is placed in the corrupt column; otherwise the corrupt column contains `null`.
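The indicator described above (the corrupt column, not the null-ness of the data fields, marks a bad record) can be illustrated with a minimal sketch. The `ParsedRow` case class is a hypothetical stand-in for a Spark row with a `_corrupt_record` column:

```scala
// Sketch: a parsed row carries the raw record in the corrupt column
// only when parsing (partially) failed.
case class ParsedRow(fields: Map[String, Any], corruptRecord: Option[String])

def isBadRecord(row: ParsedRow): Boolean = row.corruptRecord.isDefined

// A well-formed JSON object whose fields happen to all be null:
val allNullButGood = ParsedRow(Map("a" -> null, "b" -> null), corruptRecord = None)

// A partially parsed bad record: "a" converted, "b" did not:
val partialBad = ParsedRow(
  Map("a" -> 1.5, "b" -> null),
  corruptRecord = Some("""{"a": 1.5, "b": "oops"}"""))

// All-null fields alone do not mean the record was bad:
assert(!isBadRecord(allNullButGood))
assert(isBadRecord(partialBad))
```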
[GitHub] spark issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON r...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23253 jenkins, retest this, please
[GitHub] spark pull request #23253: [SPARK-26303][SQL] Return partial results for bad...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23253 [SPARK-26303][SQL] Return partial results for bad JSON records ## What changes were proposed in this pull request? In the PR, I propose to return partial results from JSON datasource and JSON functions in the PERMISSIVE mode if some of JSON fields are parsed and converted to desired types successfully. The changes are made only for `StructType`. Whole bad JSON records are placed into the corrupt column specified by the `columnNameOfCorruptRecord` option or SQL config. Partial results are not returned for malformed JSON input. ## How was this patch tested? Added new UT which checks converting JSON strings with one invalid and one valid field at the end of the string. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 json-bad-record Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23253.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23253 commit 2d2fed923e9567b6bc22cf4355ee3b57f8699cd4 Author: Maxim Gekk Date: 2018-12-06T21:26:42Z Test for bad records commit de4885899b2a5e866b6c4a365e91fc901cdc0f86 Author: Maxim Gekk Date: 2018-12-06T23:11:29Z Added a test commit 101df4e7fcee60c042c21951551bd3376ae325cb Author: Maxim Gekk Date: 2018-12-06T23:16:09Z Return partial results commit 542ae74a9240b2905f4ca1f9de62fd53f1562d64 Author: Maxim Gekk Date: 2018-12-07T09:02:37Z Updating the migration guide commit 439f57ac59a7c1f481317126934a55d20b38f579 Author: Maxim Gekk Date: 2018-12-07T09:04:58Z Revert "Updating the migration guide" This reverts commit 542ae74a9240b2905f4ca1f9de62fd53f1562d64. 
commit 77d767016607056809d5f8c4e9653809838b91dd Author: Maxim Gekk Date: 2018-12-07T09:10:31Z Updating the migration guide - separate notes commit fa431ee293d5ae3a7c3b8b441c4a0b6778ac84e1 Author: Maxim Gekk Date: 2018-12-07T09:12:16Z Updating the migration guide - minor commit e13673de2537fb42bdc2bb8df9849560ae399c89 Author: Maxim Gekk Date: 2018-12-07T10:00:07Z Updating the migration guide - minor 2
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23201#discussion_r239547742

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---

@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
           DecimalType(bigDecimal.precision, bigDecimal.scale)
         }
         decimalTry.getOrElse(StringType)
-      case VALUE_STRING => StringType
+      case VALUE_STRING =>
+        val stringValue = parser.getText

--- End diff --

`DateType` is not inferred at all, but there is other type-inference code that could be shared between JSON and CSV (and maybe elsewhere).
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23201#discussion_r239537694

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---

@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
           DecimalType(bigDecimal.precision, bigDecimal.scale)
         }
         decimalTry.getOrElse(StringType)
-      case VALUE_STRING => StringType
+      case VALUE_STRING =>
+        val stringValue = parser.getText

--- End diff --

Yes, we can do that. There is some common code that could be shared. Can we do it in a separate PR?
[GitHub] spark issue #23201: [SPARK-26246][SQL] Infer date and timestamp types from J...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23201 @cloud-fan May I ask you to look at this PR, please?
[GitHub] spark issue #23202: [SPARK-26248][SQL] Infer date type from CSV
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23202 @HyukjinKwon @srowen Is there anything which worries you in the PR?
[GitHub] spark pull request #23159: [SPARK-26191][SQL] Control truncation of Spark pl...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23159#discussion_r239269414

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---

@@ -1777,7 +1777,7 @@ class Analyzer(
       case p if p.expressions.exists(hasGenerator) =>
         throw new AnalysisException("Generators are not supported outside the SELECT clause, but " +
-          "got: " + p.simpleString)
+          "got: " + p.simpleString((SQLConf.get.maxToStringFields)))

--- End diff --

fixed here and in other places
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23201#discussion_r239269170

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---

@@ -121,7 +121,18 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
           DecimalType(bigDecimal.precision, bigDecimal.scale)
         }
         decimalTry.getOrElse(StringType)
-      case VALUE_STRING => StringType
+      case VALUE_STRING =>
+        val stringValue = parser.getText
+        if ((allCatch opt options.timestampFormat.parse(stringValue)).isDefined) {

--- End diff --

I made changes similar to https://github.com/apache/spark/pull/23202: strict `DateType` inference before `TimestampType`.
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23196 jenkins, retest this, please
[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23159 jenkins, retest this, please
[GitHub] spark pull request #23235: [SPARK-26151][SQL][FOLLOWUP] Return partial resul...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23235#discussion_r239055208 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -35,6 +35,8 @@ displayTitle: Spark SQL Upgrading Guide - Since Spark 3.0, CSV datasource uses java.time API for parsing and generating CSV content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. + - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode if specified schema is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully. --- End diff -- you are right. I will remove the part about `StructType`
[GitHub] spark issue #23235: [SPARK-26151][SQL][FOLLOWUP] Return partial results for ...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23235 @cloud-fan Please, have a look at the PR.
[GitHub] spark pull request #23235: [SPARK-26151][SQL][FOLLOWUP] Return partial resul...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23235 [SPARK-26151][SQL][FOLLOWUP] Return partial results for bad CSV records ## What changes were proposed in this pull request? Updated SQL migration guide according to changes in https://github.com/apache/spark/pull/23120 You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 failuresafe-partial-result-followup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23235.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23235 commit 8c115f7871d4db66b13ee21ea3a1231f7153791e Author: Maxim Gekk Date: 2018-12-05T12:13:26Z Updating the migration guide
[GitHub] spark issue #23120: [SPARK-26151][SQL] Return partial results for bad CSV re...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23120 The PR https://github.com/apache/spark/pull/23235 updates the SQL migration guide.
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r239010321

--- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala ---

@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
   override def beforeAll() {
     super.beforeAll()
     TestHive.setCacheTables(true)
-    // Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*)
-    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+    // Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*)

--- End diff --

Our current approach for converting dates is inconsistent in a few places, for example:
- `UTF8String` -> `num days` uses hardcoded `GMT` and ignores the SQL config: https://github.com/apache/spark/blob/f982ca07e80074bdc1e3b742c5e21cf368e4ede2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L493
- `String` -> `java.util.Date` ignores Spark's time zone settings and uses the system time zone: https://github.com/apache/spark/blob/f982ca07e80074bdc1e3b742c5e21cf368e4ede2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L186
- In many places, even when a function accepts a timeZone parameter, it is not passed (the default time zone is used: not the one from the config but from `TimeZone.getDefault()`). For example: https://github.com/apache/spark/blob/36edbac1c8337a4719f90e4abd58d38738b2e1fb/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala#L187
- Casting to the date type depends on the type of the argument: if it is `TimestampType`, the expression-wise time zone is used, otherwise `GMT`: https://github.com/apache/spark/blob/d03e0af80d7659f12821cc2442efaeaee94d3985/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L403-L410

I am really thinking of disabling the new parser/formatter outside of the CSV/JSON datasources because it is hard to guarantee consistent behavior in combination with other date/timestamp functions. @srowen @gatorsmile @HyukjinKwon WDYT?
[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23159

> Rather than change every single call to this method, if this should generally be the value of the argument, then why not make it the default value or something?

The new parameter aims to solve the problem of multiple callers, each of which needs a different maximum number of fields. So a feasible approach is to propagate `maxFields` from the callers to `truncatedString`. Changing a global SQL config does not solve the problem.
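The per-caller `maxFields` idea can be sketched in plain Scala. This is a simplified stand-in for Spark's `truncatedString` utility (its exact truncation message differs); the signature below is an assumption for illustration:

```scala
// Each caller passes its own limit instead of relying on one global config.
def truncatedString[T](seq: Seq[T], sep: String, maxFields: Int): String = {
  if (seq.length <= maxFields) {
    seq.mkString(sep)
  } else {
    // Keep maxFields - 1 items and summarize the rest.
    val shown = seq.take(maxFields - 1).map(_.toString)
    (shown :+ s"... ${seq.length - maxFields + 1} more fields").mkString(sep)
  }
}

assert(truncatedString(Seq(1, 2, 3), ", ", maxFields = 3) == "1, 2, 3")
assert(truncatedString(Seq(1, 2, 3, 4), ", ", maxFields = 3) == "1, 2, ... 2 more fields")
```

A plan printer that renders a whole query tree can then pass a large limit, while a log message can pass a small one, without either caller touching global state.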
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r238853314 --- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala --- @@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter { override def beforeAll() { super.beforeAll() TestHive.setCacheTables(true) -// Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*) -TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) +// Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*) --- End diff -- While porting to the new parser/formatter, I faced at least two problems: - The time zone from the SQL config is not taken into account during parsing at all. Basically, the functions used take the default time zone from the JVM settings. It could be fixed by `TimeZone.setDefault` or by using absolute values. - A round trip of parsing a date to `DateType` and back to a string can give a different string because `DateType` stores only days (as `Int`) since the epoch (in `UTC`). Such a representation loses the time zone offset, so exact matching is impossible due to the lack of information. (Round-trip conversion for `TimestampType`, for example, works perfectly.) This is the case for these changes. Previously it worked because the specified time zone was not used at all (it did not impact the number of days while converting a string to `DateType`). With the new parser/formatter it becomes significant, and I had to change the time zone to `GMT` to eliminate the problem of losing time zone offsets (the offset is zero for `GMT`).
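The asymmetry between the two round trips can be shown with plain java.time (a sketch of the storage models only, not Spark code; the sample values are arbitrary):

```scala
import java.time.{Instant, LocalDate, ZonedDateTime, ZoneId, ZoneOffset}

val zone = ZoneId.of("America/Los_Angeles")
val ts = ZonedDateTime.of(2015, 8, 26, 23, 30, 0, 0, zone)

// A timestamp keeps microseconds since the epoch: the exact instant survives,
// so rendering it back in the same zone reproduces the input.
val micros = ts.toInstant.getEpochSecond * 1000000L
val tsBack = Instant.ofEpochSecond(micros / 1000000L).atZone(zone)
assert(tsBack.toLocalDateTime == ts.toLocalDateTime)

// A date keeps only whole days since the epoch in UTC. The LA-local date is
// 2015-08-26, but the instant falls on UTC day 2015-08-27, and the offset
// needed to get back to the local date is gone.
val utcDays = ts.toInstant.atZone(ZoneOffset.UTC).toLocalDate.toEpochDay
val dateBack = LocalDate.ofEpochDay(utcDays)
assert(ts.toLocalDate.toString == "2015-08-26")
assert(dateBack.toString == "2015-08-27")
```

With `GMT` the local and UTC dates coincide, which is exactly why fixing the test zone to `GMT` makes the round trip lossless.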
[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23159 @HyukjinKwon @dongjoon-hyun @srowen @zsxwing Do you have any objections to this PR?
[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23202#discussion_r238696317 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala --- @@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends Serializable { compatibleType(typeSoFar, tryParseDecimal(field)).getOrElse(StringType) case DoubleType => tryParseDouble(field) case TimestampType => tryParseTimestamp(field) +case DateType => tryParseDate(field) --- End diff -- Done. Please, have a look at the changes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21363 @srowen There is another PR with related changes - inferring `DateType` from CSV: https://github.com/apache/spark/pull/23202 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23202#discussion_r238194142 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala --- @@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends Serializable { compatibleType(typeSoFar, tryParseDecimal(field)).getOrElse(StringType) case DoubleType => tryParseDouble(field) case TimestampType => tryParseTimestamp(field) +case DateType => tryParseDate(field) --- End diff -- Just in case, I did an exact match here as well, see https://github.com/apache/spark/pull/23202/files#diff-17719da188b2c15129f848f654a0e6feR174 . If the date parser didn't consume all of the input (`pos.getIndex != field.length`), it fails. If I move it up in the type-inferring pipeline, it should work.
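The exact-match check described here can be sketched with `java.text` parsing (a simplified illustration of the `pos.getIndex != field.length` guard, not the PR's actual code; the helper name is hypothetical):

```scala
import java.text.{ParsePosition, SimpleDateFormat}

// Accept a value only if the pattern consumed the whole input; otherwise a
// timestamp-looking string would be mistaken for a bare date.
def parseExact(pattern: String, field: String): Option[java.util.Date] = {
  val format = new SimpleDateFormat(pattern)
  format.setLenient(false)
  val pos = new ParsePosition(0)
  val result = format.parse(field, pos)
  // reject partial matches: some input was left unconsumed
  if (result != null && pos.getIndex == field.length) Some(result) else None
}

assert(parseExact("yyyy-MM-dd", "2015-08-26").isDefined)
// The date pattern matches a prefix but leaves " 18:00" unconsumed -> reject
assert(parseExact("yyyy-MM-dd", "2015-08-26 18:00").isEmpty)
```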
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r238116870 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0. + - Since Spark 3.0, JSON datasource uses java.time API for parsing and generating JSON content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. --- End diff -- @gatorsmile What would you recommend to improve the text? I can add the links above so a user can figure out what the difference is in their particular case. Our tests don't show any difference with our default timestamp/date patterns, but a user can use something more specific and face a behaviour change.
[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23202 [SPARK-26248][SQL] Infer date type from CSV ## What changes were proposed in this pull request? The `CSVInferSchema` class is extended to support inferring of `DateType` from CSV input. The attempt to infer `DateType` is performed after inferring `TimestampType`. ## How was this patch tested? Added new test for inferring date types from CSV . It was also tested by existing suites like `CSVInferSchemaSuite`, `CsvExpressionsSuite`, `CsvFunctionsSuite` and `CsvSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 csv-date-inferring Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23202.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23202 commit a8d27d64d45ca4ea207f0d403dba472b2f606968 Author: Maxim Gekk Date: 2018-12-02T21:57:02Z Test for inferring the date type commit fa915fd3fcab37be2c53bc23a77b55e3024b3994 Author: Maxim Gekk Date: 2018-12-02T21:57:21Z Inferring date type --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23201 [SPARK-26246][SQL] Infer date and timestamp types from JSON ## What changes were proposed in this pull request? The `JsonInferSchema` class is extended to support `DateType` and `TimestampType` inferring from string fields in JSON input. It tries to infer `TimestampType` as the tightest type first. If timestamp parsing fails, `DateType` is inferred using the date pattern. If both fail, it falls back to `DateTimeUtils.stringToTime`. ## How was this patch tested? Added a new test suite - `JsonInferSchemaSuite` - to check date and timestamp type inference from JSON. These changes were tested by `JsonSuite`, `JsonExpressionsSuite` and `JsonFunctionsSuite` as well. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 json-infer-time Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23201.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23201 commit 2a26e2c680b517e9e89a0f4bc4cc31884020188d Author: Maxim Gekk Date: 2018-12-02T20:06:05Z Added a test for timestamp inferring commit bd472072a39dbec2e1eec1396196c6c5e6a659dd Author: Maxim Gekk Date: 2018-12-02T20:43:48Z Infer date and timestamp types commit 9dbdf0a764c998875932e50faf460f36216ef58d Author: Maxim Gekk Date: 2018-12-02T20:44:08Z Test for date type
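The tightest-type-first order described in the PR can be sketched as follows (a simplified assumption-laden illustration: the patterns are stand-ins, the type names are plain strings, and the `DateTimeUtils.stringToTime` fallback is omitted):

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

// Try the tightest type first: timestamp, then date, then fall back to string.
val timestampFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")
val dateFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd")

def inferStringType(value: String): String = {
  if (Try(LocalDateTime.parse(value, timestampFormat)).isSuccess) "TimestampType"
  else if (Try(LocalDate.parse(value, dateFormat)).isSuccess) "DateType"
  else "StringType"
}

assert(inferStringType("2018-12-02T20:06:05") == "TimestampType")
assert(inferStringType("2018-12-02") == "DateType")
assert(inferStringType("hello") == "StringType")
```

Ordering matters: every timestamp-like string also has a date-like prefix, so the timestamp attempt must come first and must require a full match.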
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r238111358 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0. + - Since Spark 3.0, JSON datasource uses java.time API for parsing and generating JSON content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. --- End diff -- The new implementation and the old one have slightly different pattern formats. See https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html and https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html . And the two Java APIs can have different behaviours. Besides that, the new one can parse timestamps with microsecond precision as a consequence of using the Java 8 java.time API.
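The precision difference can be demonstrated directly (a sketch with an arbitrary sample value; note in particular how the legacy API misreads the same pattern rather than rejecting it):

```scala
import java.text.SimpleDateFormat
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// java.time: "S" is fraction-of-second, so six digits are real microseconds.
val modern = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")
val parsed = LocalDateTime.parse("2018-12-02 11:22:33.123456", modern)
assert(parsed.getNano == 123456000)

// SimpleDateFormat: "S" is a plain millisecond count, so the same six digits
// are read as 123456 ms and the lenient calendar silently shifts the time
// by more than two minutes.
val legacy = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSSSS")
val shifted = legacy.parse("2018-12-02 11:22:33.123456")
assert(new SimpleDateFormat("HH:mm:ss.SSS").format(shifted) == "11:24:36.456")
```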
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 @srowen I think this PR is ready. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23120: [SPARK-26151][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23120#discussion_r238097513 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala --- @@ -243,21 +243,27 @@ class UnivocityParser( () => getPartialResult(), new RuntimeException("Malformed CSV record")) } else { - try { -// When the length of the returned tokens is identical to the length of the parsed schema, -// we just need to convert the tokens that correspond to the required columns. -var i = 0 -while (i < requiredSchema.length) { + // When the length of the returned tokens is identical to the length of the parsed schema, + // we just need to convert the tokens that correspond to the required columns. + var badRecordException: Option[Throwable] = None + var i = 0 + while (i < requiredSchema.length) { +try { row(i) = valueConverters(i).apply(getToken(tokens, i)) - i += 1 +} catch { + case NonFatal(e) => +badRecordException = badRecordException.orElse(Some(e)) } +i += 1 + } + + if (badRecordException.isEmpty) { row - } catch { -case NonFatal(e) => - // For corrupted records with the number of tokens same as the schema, - // CSV reader doesn't support partial results. All fields other than the field - // configured by `columnNameOfCorruptRecord` are set to `null`. - throw BadRecordException(() => getCurrentInput, () => None, e) + } else { +// For corrupted records with the number of tokens same as the schema, +// CSV reader doesn't support partial results. All fields other than the field +// configured by `columnNameOfCorruptRecord` are set to `null`. --- End diff -- The comment is no longer accurate. I will remove it.
[GitHub] spark pull request #23150: [SPARK-26178][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23150#discussion_r238075711 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala --- @@ -86,62 +85,74 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper { // null. Seq(true, false).foreach { b => val options = new CSVOptions(Map("nullValue" -> "null"), false, "GMT") - val converter = -parser.makeConverter("_1", StringType, nullable = b, options = options) + val parser = new UnivocityParser(StructType(Seq.empty), options) + val converter = parser.makeConverter("_1", StringType, nullable = b) assert(converter.apply("") == UTF8String.fromString("")) } } test("Throws exception for empty string with non null type") { - val options = new CSVOptions(Map.empty[String, String], false, "GMT") +val options = new CSVOptions(Map.empty[String, String], false, "GMT") +val parser = new UnivocityParser(StructType(Seq.empty), options) val exception = intercept[RuntimeException]{ - parser.makeConverter("_1", IntegerType, nullable = false, options = options).apply("") + parser.makeConverter("_1", IntegerType, nullable = false).apply("") } assert(exception.getMessage.contains("null value found but field _1 is not nullable.")) } test("Types are cast correctly") { val options = new CSVOptions(Map.empty[String, String], false, "GMT") -assert(parser.makeConverter("_1", ByteType, options = options).apply("10") == 10) -assert(parser.makeConverter("_1", ShortType, options = options).apply("10") == 10) -assert(parser.makeConverter("_1", IntegerType, options = options).apply("10") == 10) -assert(parser.makeConverter("_1", LongType, options = options).apply("10") == 10) -assert(parser.makeConverter("_1", FloatType, options = options).apply("1.00") == 1.0) -assert(parser.makeConverter("_1", DoubleType, options = options).apply("1.00") == 1.0) -assert(parser.makeConverter("_1", BooleanType, options = options).apply("true") == true) - -val timestampsOptions = +var parser = new UnivocityParser(StructType(Seq.empty), options) +assert(parser.makeConverter("_1", ByteType).apply("10") == 10) +assert(parser.makeConverter("_1", ShortType).apply("10") == 10) +assert(parser.makeConverter("_1", IntegerType).apply("10") == 10) +assert(parser.makeConverter("_1", LongType).apply("10") == 10) +assert(parser.makeConverter("_1", FloatType).apply("1.00") == 1.0) +assert(parser.makeConverter("_1", DoubleType).apply("1.00") == 1.0) +assert(parser.makeConverter("_1", BooleanType).apply("true") == true) + +var timestampsOptions = new CSVOptions(Map("timestampFormat" -> "dd/MM/yyyy hh:mm"), false, "GMT") +parser = new UnivocityParser(StructType(Seq.empty), timestampsOptions) val customTimestamp = "31/01/2015 00:00" -val expectedTime = timestampsOptions.timestampFormat.parse(customTimestamp).getTime -val castedTimestamp = - parser.makeConverter("_1", TimestampType, nullable = true, options = timestampsOptions) +var format = FastDateFormat.getInstance( + timestampsOptions.timestampFormat, timestampsOptions.timeZone, timestampsOptions.locale) +val expectedTime = format.parse(customTimestamp).getTime +val castedTimestamp = parser.makeConverter("_1", TimestampType, nullable = true) .apply(customTimestamp) assert(castedTimestamp == expectedTime * 1000L) val customDate = "31/01/2015" val dateOptions = new CSVOptions(Map("dateFormat" -> "dd/MM/yyyy"), false, "GMT") -val expectedDate = dateOptions.dateFormat.parse(customDate).getTime -val castedDate = - parser.makeConverter("_1", DateType, nullable = true, options = dateOptions) -.apply(customTimestamp) -assert(castedDate == DateTimeUtils.millisToDays(expectedDate)) +parser = new UnivocityParser(StructType(Seq.empty), dateOptions) +format = FastDateFormat.getInstance( + dateOptions.dateFormat, dateOptions.timeZone, d
[GitHub] spark pull request #23150: [SPARK-26178][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23150#discussion_r238075664 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -622,10 +623,11 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te val options = Map( "header" -> "true", "inferSchema" -> "false", - "dateFormat" -> "dd/MM/yyyy hh:mm") + "dateFormat" -> "dd/MM/yyyy HH:mm") --- End diff -- According to the `DateTimeFormatter` pattern table: ``` h clock-hour-of-am-pm (1-12) number 12 H hour-of-day (0-23) number 0 ``` but the real data is not in the allowed range: ``` date 26/08/2015 18:00 27/10/2014 18:30 28/01/2016 20:00 ```
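The difference between the two pattern letters can be checked directly (a sketch assuming the year part of the pattern, which was lost in the archived diff, is `yyyy`; the default SMART resolver rejects out-of-range clock hours):

```scala
import java.time.format.DateTimeFormatter
import scala.util.Try

// "HH" is hour-of-day (0-23); "hh" is clock-hour-of-am-pm (1-12).
val hourOfDay = DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm")
val clockHour = DateTimeFormatter.ofPattern("dd/MM/yyyy hh:mm")

// 18:00 is a valid hour-of-day...
assert(Try(hourOfDay.parse("26/08/2015 18:00")).isSuccess)
// ...but 18 is out of range for clock-hour-of-am-pm, so parsing fails.
assert(Try(clockHour.parse("26/08/2015 18:00")).isFailure)
```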
[GitHub] spark pull request #23150: [SPARK-26178][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23150#discussion_r238075585 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1107,7 +,7 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te test("SPARK-18699 put malformed records in a `columnNameOfCorruptRecord` field") { Seq(false, true).foreach { multiLine => - val schema = new StructType().add("a", IntegerType).add("b", TimestampType) + val schema = new StructType().add("a", IntegerType).add("b", DateType) --- End diff -- I changed the type because the supposedly valid date `"1983-08-04"` cannot be parsed with the default timestamp pattern.
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 > they pass right? is there another test you were unable to add? For now, everything has passed. I ran all tests locally in different time zones (set via the JVM parameter `-Duser.timezone`). I updated the migration guide since I enabled the new parser by default.
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23196 [SPARK-26243][SQL] Use java.time API for parsing timestamps and dates from JSON ## What changes were proposed in this pull request? In the PR, I propose to switch to the **java.time API** for parsing timestamps and dates from JSON inputs with microsecond precision. The SQL config `spark.sql.legacy.timeParser.enabled` allows switching back to the previous behavior of using `java.text.SimpleDateFormat`/`FastDateFormat` for parsing/generating timestamps/dates. ## How was this patch tested? It was tested by `JsonExpressionsSuite`, `JsonFunctionsSuite` and `JsonSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 json-time-parser Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23196.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23196 commit fb10b91502b67b98f2904a06b017a6e56dd6e39f Author: Maxim Gekk Date: 2018-12-01T11:01:47Z Adding DateTimeFormatter commit a9b39eccfdc3bd9da8fe52eda24ed240588b282f Author: Maxim Gekk Date: 2018-12-01T11:15:31Z Support DateTimeFormatter by JacksonParser and JacksonGenerator commit ff589f505f7aaa8c94ce304aaf77792505998c16 Author: Maxim Gekk Date: 2018-12-01T12:31:43Z Make test independent from current time zone commit 4646dededae832185a35a85244baab6507d28f0d Author: Maxim Gekk Date: 2018-12-01T18:19:31Z Fix a test by new fallback
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r238070641 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala --- @@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter( private var univocityGenerator: Option[UnivocityGenerator] = None - override def write(row: InternalRow): Unit = { -val gen = univocityGenerator.getOrElse { - val charset = Charset.forName(params.charset) - val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset) - val newGen = new UnivocityGenerator(dataSchema, os, params) - univocityGenerator = Some(newGen) - newGen -} + if (params.headerFlag) { +val gen = getGen() +gen.writeHeaders() + } + private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse { +val charset = Charset.forName(params.charset) +val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset) +val newGen = new UnivocityGenerator(dataSchema, os, params) +univocityGenerator = Some(newGen) +newGen + } + + override def write(row: InternalRow): Unit = { +val gen = getGen() --- End diff -- Frankly speaking, I don't see any reason for this. For now we actually have two flags: `isGeneratorInitiated` and another one inside the lazy val. And two slightly different approaches: the `Option` type in the JSON and Text datasources, and a lazy val + flag in ORC and CSV.
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 > ... if it's ready from your side @MaxGekk @srowen I am just thinking about how to reduce the number of changes in tests. In some cases, test behavior depends on the current time zone of my laptop. Without the changes in this PR, such tests passed because the expected and produced values were results of the same functions. I will try to fix that in a separate PR.
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 @HyukjinKwon May I ask you to look at this PR?
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r238059569 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,21 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te assert(errMsg2.contains("'lineSep' can contain only 1 character")) } + test("SPARK-26208: write and read empty data to csv file with headers") { +withTempPath { path => + val df1 = spark.range(10).repartition(2).filter(_ < 0).map(_.toString).toDF + // we have 2 partitions but they are both empty and will be filtered out upon writing + // thanks to SPARK-23271 one new empty partition will be inserted + df1.write.format("csv").option("header", true).save(path.getAbsolutePath) + val df2 = spark.read.format("csv").option("header", true).option("inferSchema", false) +.load(path.getAbsolutePath) + assert(df1.rdd.getNumPartitions == 2) + assert(df2.rdd.getNumPartitions == 1) --- End diff -- nit: I wouldn't check the number of partitions here since it is implementation-specific and doesn't matter for the behavior checked in this test.
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r238059480 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala --- @@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter( private var univocityGenerator: Option[UnivocityGenerator] = None - override def write(row: InternalRow): Unit = { -val gen = univocityGenerator.getOrElse { - val charset = Charset.forName(params.charset) - val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset) - val newGen = new UnivocityGenerator(dataSchema, os, params) - univocityGenerator = Some(newGen) - newGen -} + if (params.headerFlag) { +val gen = getGen() +gen.writeHeaders() + } + private def getGen(): UnivocityParser = univocityGenerator.getOrElse { +val charset = Charset.forName(params.charset) +val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset) +val newGen = new UnivocityGenerator(dataSchema, os, params) +univocityGenerator = Some(newGen) +newGen + } + + override def write(row: InternalRow): Unit = { +val gen = getGen() --- End diff -- @HyukjinKwon Do you mean creating the generator in a lazy val? ``` lazy val univocityGenerator = { val charset = Charset.forName(params.charset) val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset) new UnivocityGenerator(dataSchema, os, params) } ``` The problem is in the close method: you will have to call `univocityGenerator.close()` there. If the lazy val wasn't instantiated before (an empty partition and the `header` option is `false`), the generator will be created and closed immediately. As a result, you will get an empty file for the empty partition. That's why I prefer the approach with `Option[UnivocityGenerator]` in https://github.com/apache/spark/pull/23052.
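The lifecycle argument above can be sketched in isolation (all names here are hypothetical; a plain `String` stands in for the real generator, and `openStream` stands in for opening the output stream, which is what creates the file):

```scala
// Option-based lifecycle: the generator (and hence the output file) is
// created lazily on first use, and close() touches it only if it was ever
// created, so an empty partition produces no file.
class LazyWriter(openStream: () => Unit) {
  private var generator: Option[String] = None

  private def getGen(): String = generator.getOrElse {
    openStream() // opening the output stream is what creates the file
    val g = "generator"
    generator = Some(g)
    g
  }

  def write(row: String): Unit = {
    val gen = getGen() // the first write triggers creation
    // gen would be used to convert and write the row in the real writer
  }

  def close(): Unit = if (generator.isDefined) {
    // flush and close the underlying stream here
  }
}

var filesCreated = 0
val emptyPartition = new LazyWriter(() => filesCreated += 1)
emptyPartition.close() // nothing was written, so no file is created
assert(filesCreated == 0)

val writer = new LazyWriter(() => filesCreated += 1)
writer.write("a"); writer.write("b") // the stream is opened exactly once
writer.close()
assert(filesCreated == 1)
```

With a bare lazy val, `close()` would have to dereference it, forcing creation even for the empty partition; the `Option` avoids that.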
[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23159 jenkins, retest this, please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23120: [SPARK-26151][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23120#discussion_r237873832 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala --- @@ -243,21 +243,27 @@ class UnivocityParser( () => getPartialResult(), new RuntimeException("Malformed CSV record")) } else { - try { -// When the length of the returned tokens is identical to the length of the parsed schema, -// we just need to convert the tokens that correspond to the required columns. -var i = 0 -while (i < requiredSchema.length) { + // When the length of the returned tokens is identical to the length of the parsed schema, + // we just need to convert the tokens that correspond to the required columns. + var badRecordException: Option[Throwable] = None + var i = 0 + while (i < requiredSchema.length) { --- End diff -- It depends on what kind of error we face. If the parser is still in a normal state and ready to continue, we can skip the current error. In the case of JSON, we parse input in a streaming fashion and convert values to the desired type on the fly. If JacksonParser is able to recognize the next token, why should we stop at the first error?
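The error-accumulation pattern from the diff can be sketched as a standalone function (a simplified stand-in for the `UnivocityParser` loop; the converter signatures are assumptions):

```scala
import scala.util.control.NonFatal

// Convert every field, remember only the first failure, and hand back the
// partially filled row together with that error.
def convertRow(
    tokens: Array[String],
    converters: Array[String => Any]): (Array[Any], Option[Throwable]) = {
  val row = new Array[Any](converters.length)
  var firstError: Option[Throwable] = None
  var i = 0
  while (i < converters.length) {
    try {
      row(i) = converters(i)(tokens(i))
    } catch {
      case NonFatal(e) => firstError = firstError.orElse(Some(e))
    }
    i += 1
  }
  (row, firstError)
}

val converters: Array[String => Any] =
  Array(s => s.toInt, s => s, s => s.toInt)
val (row, err) = convertRow(Array("1", "abc", "oops"), converters)

// The good fields survive, the bad one stays null, and the error is kept
// so the caller can still raise BadRecordException with a partial result.
assert(row(0) == 1)
assert(row(1) == "abc")
assert(row(2) == null)
assert(err.isDefined)
```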
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 jenkins, retest this, please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237679872 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te assert(errMsg2.contains("'lineSep' can contain only 1 character")) } + test("SPARK-26208: write and read empty data to csv file with header") { +withTempPath { path => + val df1 = Seq.empty[(String, String)].toDF("x", "y") --- End diff -- You can do something like this: ``` val df = spark.range(10).repartition(2).filter(_ < 0) ``` ``` scala> df.rdd.getNumPartitions res0: Int = 2 ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237633755 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te assert(errMsg2.contains("'lineSep' can contain only 1 character")) } + test("SPARK-26208: write and read empty data to csv file with header") { +withTempPath { path => + val df1 = Seq.empty[(String, String)].toDF("x", "y") + df1.printSchema + df1.write.format("csv").option("header", true).save(path.getAbsolutePath) + val df2 = spark.read.format("csv").option("header", true).load(path.getAbsolutePath) + df2.printSchema --- End diff -- Please, remove this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237635061 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te assert(errMsg2.contains("'lineSep' can contain only 1 character")) } + test("SPARK-26208: write and read empty data to csv file with header") { +withTempPath { path => + val df1 = Seq.empty[(String, String)].toDF("x", "y") --- End diff -- This will create a dataframe with one empty partition. I would check the case where there are two or more empty partitions, and that each saved file has a header.
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237634432 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te assert(errMsg2.contains("'lineSep' can contain only 1 character")) } + test("SPARK-26208: write and read empty data to csv file with header") { +withTempPath { path => + val df1 = Seq.empty[(String, String)].toDF("x", "y") + df1.printSchema + df1.write.format("csv").option("header", true).save(path.getAbsolutePath) + val df2 = spark.read.format("csv").option("header", true).load(path.getAbsolutePath) --- End diff -- I would set the `inferSchema` option to `false` explicitly because your test assumes that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237633638 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te assert(errMsg2.contains("'lineSep' can contain only 1 character")) } + test("SPARK-26208: write and read empty data to csv file with header") { +withTempPath { path => + val df1 = Seq.empty[(String, String)].toDF("x", "y") + df1.printSchema --- End diff -- `printSchema` should be removed; it is not necessary for the test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23132 @cloud-fan I did the same for CSV: https://github.com/apache/spark/pull/22979 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23132 jenkins, retest this, please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22979: [SPARK-25977][SQL] Parsing decimals from CSV using local...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/22979 jenkins, retest this, please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23173 It seems this is similar to @HyukjinKwon's PR: https://github.com/apache/spark/pull/13252 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23052#discussion_r237282738 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala --- @@ -169,13 +169,18 @@ private[csv] class CsvOutputWriter( context: TaskAttemptContext, params: CSVOptions) extends OutputWriter with Logging { - private val charset = Charset.forName(params.charset) + private var univocityGenerator: Option[UnivocityGenerator] = None - private val writer = CodecStreams.createOutputStreamWriter(context, new Path(path), charset) - - private val gen = new UnivocityGenerator(dataSchema, writer, params) + override def write(row: InternalRow): Unit = { +val gen = univocityGenerator.getOrElse { + val charset = Charset.forName(params.charset) + val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset) + new UnivocityGenerator(dataSchema, os, params) +} +univocityGenerator = Some(gen) --- End diff -- done --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23159 @gatorsmile @cloud-fan Could you take a look at the changes? They were extracted from another PR: https://github.com/apache/spark/pull/22429 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23132: [SPARK-26163][SQL] Parsing decimals from JSON usi...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23132#discussion_r237239829 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -9,6 +9,8 @@ displayTitle: Spark SQL Upgrading Guide ## Upgrading From Spark SQL 2.4 to 3.0 + - In Spark version 2.4 and earlier, accepted format of decimals parsed from JSON is an optional sign ('+' or '-'), followed by a sequence of zero or more decimal digits, optionally followed by a fraction, optionally followed by an exponent. Any commas were removed from the input before parsing. Since Spark 3.0, format varies and depends on locale which can be set via JSON option `locale`. The default locale is `en-US`. To switch back to previous behavior, set `spark.sql.legacy.decimalParsing.enabled` to `true`. --- End diff -- I removed the SQL config and the record in the migration guide. Also, I applied `DecimalFormat` only for non-`en-US` locales. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 Some tests didn't pass on the new changes until I set the time zone explicitly. The tests use the same functions for checking correctness as the code they are supposed to test. I think this needs more investigation; I guess the old (current) implementation doesn't take time zones into account in some cases. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 I had to correct the timestamp/date patterns in a few tests to follow ISO 8601 (see [Patterns for Formatting and Parsing](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html)), e.g. `hh` -> `HH` and `mm` -> `MM`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
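For context, a small illustration (not from the PR itself) of why the `hh`/`HH` pattern letters matter in java.time: `HH` is hour-of-day (0-23), while `hh` is clock-hour-of-am-pm (1-12), so a 24-hour value such as 23 cannot be parsed with `hh`.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class PatternDemo {
    public static void main(String[] args) {
        // HH = hour-of-day (0-23), mm = minute; this is the ISO 8601 style.
        DateTimeFormatter iso = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm");
        LocalDateTime t = LocalDateTime.parse("2018-11-26 23:15", iso);
        System.out.println(t.getHour()); // 23

        // hh = clock-hour-of-am-pm (1-12): the hour value 23 is rejected.
        DateTimeFormatter ampm = DateTimeFormatter.ofPattern("yyyy-MM-dd hh:mm");
        try {
            LocalDateTime.parse("2018-11-26 23:15", ampm);
        } catch (DateTimeParseException e) {
            System.out.println("hh rejects hour 23");
        }
    }
}
```

The same distinction holds for `MM` (month-of-year) versus `mm` (minute-of-hour), which is the other fix mentioned above.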
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 > Are there any other behavior changes with the new code, besides being able to parse microseconds? The main one is that the new parser doesn't have the fallback to `DateTimeUtils.stringToTime` if it cannot parse its input. Though I am not sure about this fallback ... if a user wants to parse input that doesn't fit the specified pattern (like `"-MM-dd'T'HH:mm:ss.SSSXXX"`), why should we guess and try another pattern instead of forcing them to set an appropriate one? I guess this fallback was added for backward compatibility with previous versions. If so, should we still keep compatibility with such versions? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
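The fallback being questioned can be sketched roughly as follows; this is a hypothetical illustration of the legacy behaviour (the helper name and the default format are my own, not Spark's actual code):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class FallbackDemo {
    // Hypothetical sketch: try the user-specified pattern first and, if it
    // fails, silently guess an ISO-like default instead of raising an error,
    // mimicking the legacy stringToTime-style fallback.
    static LocalDateTime parseWithFallback(String s, DateTimeFormatter primary) {
        try {
            return LocalDateTime.parse(s, primary);
        } catch (DateTimeParseException e) {
            return LocalDateTime.parse(s, DateTimeFormatter.ISO_LOCAL_DATE_TIME);
        }
    }

    public static void main(String[] args) {
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss");
        // The input doesn't match the user pattern, yet the fallback accepts it.
        LocalDateTime t = parseWithFallback("2018-11-26T10:30:00", f);
        System.out.println(t); // 2018-11-26T10:30
    }
}
```

A strict parser, by contrast, would simply let the `DateTimeParseException` propagate, forcing the user to fix the pattern.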
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 @srowen @HyukjinKwon @viirya @mgaido91 May I ask you to look at the PR. The changes are related to another PR which you have reviewed already. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23052 jenkins, retest this, please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23120: [SPARK-26151][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23120#discussion_r237093726 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala --- @@ -243,21 +243,27 @@ class UnivocityParser( () => getPartialResult(), new RuntimeException("Malformed CSV record")) } else { - try { -// When the length of the returned tokens is identical to the length of the parsed schema, -// we just need to convert the tokens that correspond to the required columns. -var i = 0 -while (i < requiredSchema.length) { + // When the length of the returned tokens is identical to the length of the parsed schema, + // we just need to convert the tokens that correspond to the required columns. + var badRecordException: Option[Throwable] = None + var i = 0 + while (i < requiredSchema.length) { --- End diff -- but we lose field values that could be converted successfully after the exception. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
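The idea of keeping the field values that convert successfully after a failure, while remembering only the first exception, can be sketched generically like this (an illustration of the technique, not the Spark code; the converter here is plain `Integer.parseInt`):

```java
import java.util.Arrays;

public class PartialResultDemo {
    public static void main(String[] args) {
        String[] tokens = {"1", "oops", "3"};
        Object[] row = new Object[tokens.length];
        RuntimeException firstError = null;

        // Keep converting the remaining fields even after one of them fails,
        // so the partial result contains every value that parsed successfully.
        for (int i = 0; i < tokens.length; i++) {
            try {
                row[i] = Integer.parseInt(tokens[i]);
            } catch (NumberFormatException e) {
                if (firstError == null) {
                    firstError = e; // remember only the first failure
                }
            }
        }
        System.out.println(Arrays.toString(row)); // [1, null, 3]
        System.out.println(firstError != null);   // true
    }
}
```

Breaking out of the loop on the first exception would instead leave `row[2]` as `null`, which is exactly the lost-values problem described above.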
[GitHub] spark pull request #23132: [SPARK-26163][SQL] Parsing decimals from JSON usi...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23132#discussion_r237091555 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -9,6 +9,8 @@ displayTitle: Spark SQL Upgrading Guide ## Upgrading From Spark SQL 2.4 to 3.0 + - In Spark version 2.4 and earlier, accepted format of decimals parsed from JSON is an optional sign ('+' or '-'), followed by a sequence of zero or more decimal digits, optionally followed by a fraction, optionally followed by an exponent. Any commas were removed from the input before parsing. Since Spark 3.0, format varies and depends on locale which can be set via JSON option `locale`. The default locale is `en-US`. To switch back to previous behavior, set `spark.sql.legacy.decimalParsing.enabled` to `true`. --- End diff -- > Is there any performance regression with DecimalFormat? I haven't benchmarked the changes. Looking at `JSONBenchmarks`, we don't even check decimals there. > Do we need the DecimalFormat when locale is en-US? No, we don't need it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
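As a rough sketch of what locale-aware decimal parsing with `java.text.DecimalFormat` looks like (an illustration only, not the PR's code), note how the same separators mean different things in different locales:

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class LocaleDecimalDemo {
    public static void main(String[] args) throws ParseException {
        // de-DE uses '.' for grouping and ',' as the decimal separator.
        DecimalFormat de = (DecimalFormat) NumberFormat.getInstance(Locale.GERMANY);
        de.setParseBigDecimal(true); // return BigDecimal instead of Double
        BigDecimal v = (BigDecimal) de.parse("1.000,5");
        System.out.println(v); // 1000.5

        // en-US swaps the roles: ',' groups, '.' separates the fraction.
        DecimalFormat us = (DecimalFormat) NumberFormat.getInstance(Locale.US);
        us.setParseBigDecimal(true);
        System.out.println(us.parse("1,000.5")); // 1000.5
    }
}
```

`setParseBigDecimal(true)` matters here: without it, `parse` returns a `Double` and precision beyond what a double can hold would be lost.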
[GitHub] spark pull request #23132: [SPARK-26163][SQL] Parsing decimals from JSON usi...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23132#discussion_r237091495 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -1610,6 +1610,13 @@ object SQLConf { """ "... N more fields" placeholder.""") .intConf .createWithDefault(25) + + val LEGACY_DECIMAL_PARSING_ENABLED = buildConf("spark.sql.legacy.decimalParsing.enabled") +.doc("If it is set to false, it enables parsing decimals in locale specific formats. " + + "To switch back to previous behaviour when parsing was performed by java.math.BigDecimal " + + "and all commas were removed from the input, set the flag to true.") +.booleanConf +.createWithDefault(false) --- End diff -- > Can we have true by default? Yes, sure. > I usually recommend to keep the existing behavior during one next release for safety. Wouldn't it be better to switch to the new implementation in a major release (3.0) rather than in a minor one? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23130#discussion_r237085148 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala --- @@ -142,4 +144,15 @@ class SaveLoadSuite extends DataSourceTest with SharedSQLContext with BeforeAndA assert(e.contains(s"Partition column `$unknown` not found in schema $schemaCatalog")) } } + + test("skip empty files in non bucketed read") { +withTempDir { dir => + val path = dir.getCanonicalPath + Files.write(Paths.get(path, "empty"), Array.empty[Byte]) + Files.write(Paths.get(path, "notEmpty"), "a".getBytes(StandardCharsets.UTF_8)) + val readback = spark.read.option("wholetext", true).text(path) + + assert(readback.rdd.getNumPartitions === 1) --- End diff -- I think so, `wholetext` makes files not splittable: https://github.com/apache/spark/blob/46110a589f4e91cd7605c5a2c34c3db6b2635830/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala#L57 This can guarantee ( in text datasources at least) one file -> one partition. > IIUC one partition can read multiple files. Do you mean this code? https://github.com/apache/spark/blob/8c6871828e3eb9fdb3bc665441a1aaf60b86b1e7/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L459-L464 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23120: [SPARK-26151][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23120#discussion_r237065584 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala --- @@ -33,26 +33,21 @@ class FailureSafeParser[IN]( private val corruptFieldIndex = schema.getFieldIndex(columnNameOfCorruptRecord) private val actualSchema = StructType(schema.filterNot(_.name == columnNameOfCorruptRecord)) private val resultRow = new GenericInternalRow(schema.length) - private val nullResult = new GenericInternalRow(schema.length) // This function takes 2 parameters: an optional partial result, and the bad record. If the given // schema doesn't contain a field for corrupted record, we just return the partial result or a // row with all fields null. If the given schema contains a field for corrupted record, we will // set the bad record to this field, and set other fields according to the partial result or null. private val toResultRow: (Option[InternalRow], () => UTF8String) => InternalRow = { -if (corruptFieldIndex.isDefined) { - (row, badRecord) => { -var i = 0 -while (i < actualSchema.length) { - val from = actualSchema(i) - resultRow(schema.fieldIndex(from.name)) = row.map(_.get(i, from.dataType)).orNull - i += 1 -} -resultRow(corruptFieldIndex.get) = badRecord() -resultRow +(row, badRecord) => { --- End diff -- For now, JSON does not support this. Additional changes are needed in `JacksonParser` to return partial results. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23052 Actually, it needs changes similar to those in https://github.com/apache/spark/pull/23130 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23130#discussion_r237050065 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala --- @@ -142,4 +144,15 @@ class SaveLoadSuite extends DataSourceTest with SharedSQLContext with BeforeAndA assert(e.contains(s"Partition column `$unknown` not found in schema $schemaCatalog")) } } + + test("skip empty files in non bucketed read") { +withTempDir { dir => + val path = dir.getCanonicalPath + Files.write(Paths.get(path, "empty"), Array.empty[Byte]) + Files.write(Paths.get(path, "notEmpty"), "a".getBytes(StandardCharsets.UTF_8)) + val readback = spark.read.option("wholetext", true).text(path) + + assert(readback.rdd.getNumPartitions === 1) --- End diff -- > does this test fail without your change? Yes, it does due to the `wholetext`. > Is JSON the only data source that may return a row for empty file? We depend on underlying parser here. I will check CSV and Text. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23052 > seems like a real failure I am looking at it. It seems the test is not deterministic. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23120: [SPARK-26151][SQL] Return partial results for bad CSV re...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23120 @cloud-fan May I ask you to take a look at the PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/22429 Here is a PR that introduces the `maxFields` parameter to all functions involved in the creation of truncated strings of Spark plans: https://github.com/apache/spark/pull/23159 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23159 ping @hvanhovell --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23159: [SPARK-26191][SQL] Control truncation of Spark pl...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23159 [SPARK-26191][SQL] Control truncation of Spark plans via maxFields parameter ## What changes were proposed in this pull request? In the PR, I propose to add a `maxFields` parameter to all functions involved in the creation of textual representations of Spark plans, such as `simpleString` and `verboseString`. The new parameter restricts the number of fields included in truncated strings. Any elements beyond the limit are dropped and replaced by a `"... N more fields"` placeholder. The threshold is bumped up to `Int.MaxValue` for `toFile()`. ## How was this patch tested? Added a test to `QueryExecutionSuite` which checks how `maxFields` impacts the number of truncated fields in `LocalRelation`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 to-file-max-fields Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23159.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23159 commit 756242132d665223738b177af0b838fc953c6a6b Author: Maxim Gekk Date: 2018-11-27T21:34:58Z Added a test commit e9e789311b998d961aa8fcf76463307a410969ea Author: Maxim Gekk Date: 2018-11-27T22:40:45Z Adding maxField parameter --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23132 @dongjoon-hyun @cloud-fan Please, take a look at the PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23127: [SPARK-26159] Codegen for LocalTableScanExec and RDDScan...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23127 > jenkins retest this please @juliuszsompolski It won't help, you need to fix python tests it seems ``` Failed example: df.explain() Differences (ndiff with -expected +actual): == Physical Plan == - Scan ExistingRDD[age#0,name#1] + *(1) Scan ExistingRDD[age#0,name#1] ? + ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23130#discussion_r236719477 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala --- @@ -142,4 +143,15 @@ class SaveLoadSuite extends DataSourceTest with SharedSQLContext with BeforeAndA assert(e.contains(s"Partition column `$unknown` not found in schema $schemaCatalog")) } } + + test("skip empty files in load") { +withTempDir { dir => + val path = dir.getCanonicalPath + Files.write(Paths.get(path, "empty"), Array.empty[Byte]) + Files.write(Paths.get(path, "notEmpty"), "a".getBytes) + val readback = spark.read.option("wholetext", true).text(path) + + assert(readback.rdd.getNumPartitions == 1) --- End diff -- It seems the expected value should be on the right. I changed the order and got the following:

```scala
assert(123 === readback.rdd.getNumPartitions)
```

```
123 did not equal 1
ScalaTestFailureLocation: org.apache.spark.sql.sources.SaveLoadSuite at (SaveLoadSuite.scala:155)
Expected :1
Actual :123
```

The current assert triggers the correct message:

```scala
assert(readback.rdd.getNumPartitions == 123)
```

```
1 did not equal 123
ScalaTestFailureLocation: org.apache.spark.sql.sources.SaveLoadSuite at (SaveLoadSuite.scala:155)
Expected :123
Actual :1
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23052#discussion_r236659185 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala --- @@ -169,13 +169,18 @@ private[csv] class CsvOutputWriter( context: TaskAttemptContext, params: CSVOptions) extends OutputWriter with Logging { - private val charset = Charset.forName(params.charset) + private var univocityGenerator: Option[UnivocityGenerator] = None --- End diff -- > ... but that it could create many generators and writers that aren't closed. Writers/generators are created inside tasks: https://github.com/apache/spark/blob/ab1650d2938db4901b8c28df945d6a0691a19d31/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L228-L256 where `dataWriter.commit()` and `dataWriter.abort()` close writers/generators. So, the number of unclosed generators is less than or equal to the size of the task thread pool on executors at any moment. > Unless we know writes will only happen in one thread ... According to the comments below, this is our assumption: https://github.com/apache/spark/blob/e8167768cfebfdb11acd8e0a06fe34ca43c14648/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L33-L37 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23120: [SPARK-26151][SQL] Return partial results for bad CSV re...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23120 @HyukjinKwon Please review the PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23150: [SPARK-26178][SQL] Use java.time API for parsing timesta...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23150 jenkins, retest this, please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22979: [SPARK-25977][SQL] Parsing decimals from CSV using local...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/22979 jenkins, retest this, please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23052: [SPARK-26081][SQL] Prevent empty files for empty ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23052#discussion_r236584201 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala --- @@ -169,13 +169,18 @@ private[csv] class CsvOutputWriter( context: TaskAttemptContext, params: CSVOptions) extends OutputWriter with Logging { - private val charset = Charset.forName(params.charset) + private var univocityGenerator: Option[UnivocityGenerator] = None --- End diff -- We have not observed any race conditions so far. Instances of `UnivocityGenerator` are created per task, as are `OutputStreamWriter`s. They share instances of the schema and `CSVOptions`, but we do not modify them while writing. Inside each `UnivocityGenerator`, we create an instance of `CsvWriter`, but I am almost absolutely sure they do not share anything internally. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21363 Here is the PR: https://github.com/apache/spark/pull/23150 which allows switching to the java.time API (in CSV so far). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23150: [SPARK-26178][SQL] Use java.time API for parsing ...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23150 [SPARK-26178][SQL] Use java.time API for parsing timestamps and dates from CSV ## What changes were proposed in this pull request? In the PR, I propose to use the **java.time API** for parsing timestamps and dates from CSV content with microseconds precision. The SQL config `spark.sql.legacy.timeParser.enabled` allows switching back to the previous behaviour of using `java.text.SimpleDateFormat`/`FastDateFormat` for parsing/generating timestamps/dates. ## How was this patch tested? It was tested by `UnivocityParserSuite`, `CsvExpressionsSuite`, `CsvFunctionsSuite` and `CsvSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 time-parser Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23150.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23150 commit 74a76c2f78ad139993f3bbe0f2ff8f1c81c3bd84 Author: Maxim Gekk Date: 2018-11-24T11:37:55Z New and legacy time parser commit 63cf6112085029c52e4aee6f9bb2e6b84ce18a96 Author: Maxim Gekk Date: 2018-11-24T12:00:10Z Add config spark.sql.legacy.timeParser.enabled commit 2a2ab83a5ecb251ce81e7f12a8c0d3067f88b2d5 Author: Maxim Gekk Date: 2018-11-24T13:24:07Z Fallback legacy parser commit 667bf9f65a90ac69b8cbbad77a17e21f9dd18733 Author: Maxim Gekk Date: 2018-11-24T15:54:19Z something commit 227a7bdc53bdd022e9c365b410810c58f56e8bea Author: Maxim Gekk Date: 2018-11-25T12:15:15Z Using instances commit 73ee56088bf4d2856c454a7bbd4171b61cfe4614 Author: Maxim Gekk Date: 2018-11-25T13:52:02Z Added generator commit f35f6e13270eb994ac97627da79497673b4fe686 Author: Maxim Gekk Date: 2018-11-25T18:03:17Z Refactoring of TimeFormatter commit 1c09b58e6fe3e0fd565c852dcb73dc012fa56819 Author: Maxim Gekk Date: 2018-11-25T18:06:22Z Renaming to 
DateTimeFormatter commit 7b213d5b2ae404c87f090da622a78d3d19fee6a9 Author: Maxim Gekk Date: 2018-11-25T18:32:54Z Added DateFormatter commit 242ba474dcf112b48bd286811daed86a66366c39 Author: Maxim Gekk Date: 2018-11-25T19:58:08Z Default values in parsing commit db48ee6918eef06e19c3bdf64e3c44f4541cc294 Author: Maxim Gekk Date: 2018-11-25T21:09:08Z Parse as date type because format for timestamp is not not matched to values commit e18841b38050ac411a507a2a2643584f2c8739ce Author: Maxim Gekk Date: 2018-11-25T21:53:11Z Fix tests commit 8db023834b680f336ff5a0e08253ba2cb3b6e3b7 Author: Maxim Gekk Date: 2018-11-25T23:09:20Z CSVSuite passed commit 0b9ed92a456d60db0934340f37e0bd428b2f6a42 Author: Maxim Gekk Date: 2018-11-26T22:00:10Z Fix imports commit 799ebb3432dec7fe1e1099d68a3f1c09e714aa8e Author: Maxim Gekk Date: 2018-11-26T22:03:19Z Revert test back commit 5a223919439e2d22814b92c0e1e572b3c318566f Author: Maxim Gekk Date: 2018-11-26T22:17:11Z Set timeZone commit f287b77d94de9e9f466c0ff2c2370f22a46b48f7 Author: Maxim Gekk Date: 2018-11-26T22:44:42Z Removing default for micros because it causes conflicts in parsing --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
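The microseconds-precision parsing this PR enables can be illustrated with plain java.time (a sketch, not the PR's code; `SimpleDateFormat` tops out at millisecond precision, while `DateTimeFormatter` handles a six-digit fraction):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoField;

public class MicrosDemo {
    public static void main(String[] args) {
        // SSSSSS parses a six-digit fraction-of-second, i.e. microseconds.
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSSSS");
        LocalDateTime t = LocalDateTime.parse("2018-11-24T11:37:55.123456", f);
        System.out.println(t.get(ChronoField.MICRO_OF_SECOND)); // 123456
    }
}
```

Internally the fraction resolves to `NANO_OF_SECOND`, so `MICRO_OF_SECOND` is simply that value divided by 1000.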
[GitHub] spark pull request #23132: [SPARK-26163][SQL] Parsing decimals from JSON usi...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/23132 [SPARK-26163][SQL] Parsing decimals from JSON using locale ## What changes were proposed in this pull request? In the PR, I propose using the locale option to parse (and infer) decimals from JSON input. After the changes, `JacksonParser` converts the input string to `BigDecimal` and then to Spark's Decimal by using `java.text.DecimalFormat`. The new behaviour can be switched off via the SQL config `spark.sql.legacy.decimalParsing.enabled`. ## How was this patch tested? Added 2 tests to `JsonExpressionsSuite` for the `en-US`, `ko-KR`, `ru-RU`, `de-DE` locales: - Inferring decimal type using locale from JSON field values - Converting JSON field values to specified decimal type using the locales. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MaxGekk/spark-1 json-decimal-parsing-locale Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23132.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23132 commit 506417ed4b643298560a66c043f7b31beb489da3 Author: Maxim Gekk Date: 2018-11-10T19:49:12Z Test for parsing decimals using locale commit ac25fb6ed1d3d6689ad8841476c025848c87f2a3 Author: Maxim Gekk Date: 2018-11-10T19:51:48Z Parsing decimals using locale commit b784003078270a46aaf8aceb2d86dd9f13f3500c Author: Maxim Gekk Date: 2018-11-10T19:54:00Z Updating the migration guide commit d0522093dfe1823f91100cbec3ef5d6c8a372f27 Author: Maxim Gekk Date: 2018-11-24T19:32:45Z Merge remote-tracking branch 'origin/master' into json-decimal-parsing-locale # Conflicts: # sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala commit 722e135cf3b94e22079fd9a0fe1d309a04e76a64 Author: Maxim Gekk Date: 2018-11-24T19:58:39Z Add SQL config spark.sql.legacy.decimalParsing.enabled commit 
dc6c0ac0a97b31441d99ff1dd71608ae5e2eca73 Author: Maxim Gekk Date: 2018-11-24T20:00:31Z Updating the migration guide commit f15b1817fb51d453487665122473855712214692 Author: Maxim Gekk Date: 2018-11-24T20:38:24Z Added a test for parsing commit ab781d54e4f7604d64c72c3c383c549abab0a9a9 Author: Maxim Gekk Date: 2018-11-24T20:52:03Z Fix test commit 163a8b9d7d017409ae4dfa40e492680bf0e4f935 Author: Maxim Gekk Date: 2018-11-24T20:52:26Z Create getDecimalParser commit 8fb65c0db85f4bd2f76d473c5e31e772ff0d4c1d Author: Maxim Gekk Date: 2018-11-24T22:11:42Z Add a test for inferring decimals commit 7e3a2906a96894cadc58771131d07d06ba265382 Author: Maxim Gekk Date: 2018-11-24T22:12:35Z Change JsonSuite to adopt it for JsonInferSchema class commit 83920b25f586dc242841ff0a73105ae9e43295ed Author: Maxim Gekk Date: 2018-11-24T22:13:01Z Inferring decimals from JSON --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23130: [SPARK-26161][SQL] Ignore empty files in load
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23130 @cloud-fan @HyukjinKwon Please, take a look at the PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/22938#discussion_r236049216 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1892,7 +1898,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { .text(path) val jsonDF = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json(path) - assert(jsonDF.count() === corruptRecordCount) + assert(jsonDF.count() === corruptRecordCount + 1) // null row for empty file --- End diff -- @cloud-fan @HyukjinKwon Here is a PR https://github.com/apache/spark/pull/23130 which does this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org