[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources
gengliangwang commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874872763 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala: ## @@ -52,6 +52,25 @@ sealed trait TimestampFormatter extends Serializable { @throws(classOf[DateTimeException]) def parse(s: String): Long + /** + * Parses a timestamp in a string and converts it to an optional number of microseconds. + * + * @param s - string with timestamp to parse + * @return An optional number of microseconds since epoch. The result is None on invalid input. + * @throws ParseException can be thrown by legacy parser + * @throws DateTimeParseException can be thrown by new parser + * @throws DateTimeException unable to obtain local date or time + */ + @throws(classOf[ParseException]) + @throws(classOf[DateTimeParseException]) + @throws(classOf[DateTimeException]) + def parseOptional(s: String): Option[Long] = +try { + Some(parse(s)) +} catch { Review Comment: FYI I have updated the PR description -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources
gengliangwang commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874864869 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala: ## @@ -52,6 +52,25 @@ sealed trait TimestampFormatter extends Serializable { @throws(classOf[DateTimeException]) def parse(s: String): Long + /** + * Parses a timestamp in a string and converts it to an optional number of microseconds. + * + * @param s - string with timestamp to parse + * @return An optional number of microseconds since epoch. The result is None on invalid input. + * @throws ParseException can be thrown by legacy parser + * @throws DateTimeParseException can be thrown by new parser + * @throws DateTimeException unable to obtain local date or time + */ + @throws(classOf[ParseException]) + @throws(classOf[DateTimeParseException]) + @throws(classOf[DateTimeException]) + def parseOptional(s: String): Option[Long] = +try { + Some(parse(s)) +} catch { Review Comment: For the default timestamp formatter, there is an override version which is certainly faster: https://github.com/apache/spark/pull/36562/files#diff-b42bcba727feeebf78f0e5540f2d4f6c6a38afd2225e4ebeae22a604e42eb094R251 If user set a customized timestamp format, or set the SQL conf to use legacy timestamp format, the performance is not changed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources
gengliangwang commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874492296 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala: ## @@ -456,4 +456,19 @@ class TimestampFormatterSuite extends DatetimeFormatterSuite { assert(errMsg.contains("""Invalid input syntax for type "TIMESTAMP": 'x123'""")) } } + + test("default formatter: support returning optional parse results") { Review Comment: ```suggestion test("SPARK-39193: support returning optional parse results in the default formatter") { ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources
gengliangwang commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874474944 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala: ## @@ -456,4 +456,19 @@ class TimestampFormatterSuite extends DatetimeFormatterSuite { assert(errMsg.contains("""Invalid input syntax for type "TIMESTAMP": 'x123'""")) } } + + test("default formatter: support returning optional parse results") { Review Comment: There are already many test cases for the timestamp inference of JSON/CSV: https://github.com/gengliangwang/spark/runs/6451697135 Here we just add one simple test case for the default timestamp formatted since it overrides the method `parseOptional` and `parseWithoutTimeZoneOptional` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources
gengliangwang commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874425426 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala: ## @@ -30,29 +30,16 @@ import org.apache.spark.sql.catalyst.analysis.TypeCoercion import org.apache.spark.sql.catalyst.expressions.ExprUtils import org.apache.spark.sql.catalyst.json.JacksonUtils.nextUntil import org.apache.spark.sql.catalyst.util._ -import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT import org.apache.spark.sql.errors.QueryExecutionErrors import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String import org.apache.spark.util.Utils private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable { private val decimalParser = ExprUtils.getDecimalParser(options.locale) - private val timestampFormatter = TimestampFormatter( Review Comment: Yes, I am changing the solution to fix test failures -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources
gengliangwang commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r874415392 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -178,7 +164,8 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { // We can only parse the value as TimestampNTZType if it does not have zone-offset or // time-zone component and can be parsed with the timestamp formatter. // Otherwise, it is likely to be a timestamp with timezone. -if ((allCatch opt timestampNTZFormatter.parseWithoutTimeZone(field, false)).isDefined) { +val fieldAsUTF8String = UTF8String.fromString(field) +if (DateTimeUtils.stringToTimestampWithoutTimeZone(fieldAsUTF8String).isDefined) { Review Comment: Yes, that's why tests failed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources
gengliangwang commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r873559626 ## sql/core/benchmarks/CSVBenchmark-results.txt: ## @@ -2,66 +2,66 @@ Benchmark to measure CSV read/write performance -OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure -Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz +OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 12.2.1 +Apple M1 Pro Parsing quoted values:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -One quoted string 41610 42902 1598 0.0 832194.2 1.0X +One quoted string 16964 16981 15 0.0 339281.1 1.0X Review Comment: Not directly related to this PR, but the M1 Macbook is so fast! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources
gengliangwang commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r873709851 ## sql/core/benchmarks/CSVBenchmark-results.txt: ## @@ -2,66 +2,66 @@ Benchmark to measure CSV read/write performance -OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure -Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz +OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 12.2.1 +Apple M1 Pro Parsing quoted values:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -One quoted string 41610 42902 1598 0.0 832194.2 1.0X +One quoted string 16964 16981 15 0.0 339281.1 1.0X Review Comment: OK I will try it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources
gengliangwang commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r873559626 ## sql/core/benchmarks/CSVBenchmark-results.txt: ## @@ -2,66 +2,66 @@ Benchmark to measure CSV read/write performance -OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure -Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz +OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 12.2.1 +Apple M1 Pro Parsing quoted values:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -One quoted string 41610 42902 1598 0.0 832194.2 1.0X +One quoted string 16964 16981 15 0.0 339281.1 1.0X Review Comment: Not directly related to this PR, but the M1 Macbook is so fast! cc @dbtsai @dongjoon-hyun @viirya @sunchao @huaxingao -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources
gengliangwang commented on code in PR #36562: URL: https://github.com/apache/spark/pull/36562#discussion_r873559626 ## sql/core/benchmarks/CSVBenchmark-results.txt: ## @@ -2,66 +2,66 @@ Benchmark to measure CSV read/write performance -OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure -Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz +OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 12.2.1 +Apple M1 Pro Parsing quoted values:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -One quoted string 41610 42902 1598 0.0 832194.2 1.0X +One quoted string 16964 16981 15 0.0 339281.1 1.0X Review Comment: Not directly related to this PR, but m1 chips are so fast! cc @dbtsai @dongjoon-hyun @viirya @sunchao @huaxingao -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org