[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox


gengliangwang commented on code in PR #36562:
URL: https://github.com/apache/spark/pull/36562#discussion_r874872763


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##
@@ -52,6 +52,25 @@ sealed trait TimestampFormatter extends Serializable {
   @throws(classOf[DateTimeException])
   def parse(s: String): Long
 
+  /**
+   * Parses a timestamp in a string and converts it to an optional number of 
microseconds.
+   *
+   * @param s - string with timestamp to parse
+   * @return An optional number of microseconds since epoch. The result is 
None on invalid input.
+   * @throws ParseException can be thrown by legacy parser
+   * @throws DateTimeParseException can be thrown by new parser
+   * @throws DateTimeException unable to obtain local date or time
+   */
+  @throws(classOf[ParseException])
+  @throws(classOf[DateTimeParseException])
+  @throws(classOf[DateTimeException])
+  def parseOptional(s: String): Option[Long] =
+try {
+  Some(parse(s))
+} catch {

Review Comment:
   FYI I have updated the PR description



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox


gengliangwang commented on code in PR #36562:
URL: https://github.com/apache/spark/pull/36562#discussion_r874864869


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala:
##
@@ -52,6 +52,25 @@ sealed trait TimestampFormatter extends Serializable {
   @throws(classOf[DateTimeException])
   def parse(s: String): Long
 
+  /**
+   * Parses a timestamp in a string and converts it to an optional number of 
microseconds.
+   *
+   * @param s - string with timestamp to parse
+   * @return An optional number of microseconds since epoch. The result is 
None on invalid input.
+   * @throws ParseException can be thrown by legacy parser
+   * @throws DateTimeParseException can be thrown by new parser
+   * @throws DateTimeException unable to obtain local date or time
+   */
+  @throws(classOf[ParseException])
+  @throws(classOf[DateTimeParseException])
+  @throws(classOf[DateTimeException])
+  def parseOptional(s: String): Option[Long] =
+try {
+  Some(parse(s))
+} catch {

Review Comment:
   For the default timestamp formatter, there is an override version which is 
certainly faster: 
https://github.com/apache/spark/pull/36562/files#diff-b42bcba727feeebf78f0e5540f2d4f6c6a38afd2225e4ebeae22a604e42eb094R251
   
   If user set a customized timestamp format, or set the SQL conf to use legacy 
timestamp format, the performance is not changed.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox


gengliangwang commented on code in PR #36562:
URL: https://github.com/apache/spark/pull/36562#discussion_r874492296


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala:
##
@@ -456,4 +456,19 @@ class TimestampFormatterSuite extends 
DatetimeFormatterSuite {
   assert(errMsg.contains("""Invalid input syntax for type "TIMESTAMP": 
'x123'"""))
 }
   }
+
+  test("default formatter: support returning optional parse results") {

Review Comment:
   ```suggestion
 test("SPARK-39193: support returning optional parse results in the default 
formatter") {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox


gengliangwang commented on code in PR #36562:
URL: https://github.com/apache/spark/pull/36562#discussion_r874474944


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala:
##
@@ -456,4 +456,19 @@ class TimestampFormatterSuite extends 
DatetimeFormatterSuite {
   assert(errMsg.contains("""Invalid input syntax for type "TIMESTAMP": 
'x123'"""))
 }
   }
+
+  test("default formatter: support returning optional parse results") {

Review Comment:
   There are already many test cases for the timestamp inference of JSON/CSV:
   https://github.com/gengliangwang/spark/runs/6451697135
Here we just add one simple test case for the default timestamp formatted 
since it overrides the method `parseOptional` and `parseWithoutTimeZoneOptional`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox


gengliangwang commented on code in PR #36562:
URL: https://github.com/apache/spark/pull/36562#discussion_r874425426


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala:
##
@@ -30,29 +30,16 @@ import org.apache.spark.sql.catalyst.analysis.TypeCoercion
 import org.apache.spark.sql.catalyst.expressions.ExprUtils
 import org.apache.spark.sql.catalyst.json.JacksonUtils.nextUntil
 import org.apache.spark.sql.catalyst.util._
-import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT
 import org.apache.spark.sql.errors.QueryExecutionErrors
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
 import org.apache.spark.util.Utils
 
 private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
 
   private val decimalParser = ExprUtils.getDecimalParser(options.locale)
 
-  private val timestampFormatter = TimestampFormatter(

Review Comment:
   Yes, I am changing the solution to fix test failures



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-17 Thread GitBox


gengliangwang commented on code in PR #36562:
URL: https://github.com/apache/spark/pull/36562#discussion_r874415392


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -178,7 +164,8 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
 // We can only parse the value as TimestampNTZType if it does not have 
zone-offset or
 // time-zone component and can be parsed with the timestamp formatter.
 // Otherwise, it is likely to be a timestamp with timezone.
-if ((allCatch opt timestampNTZFormatter.parseWithoutTimeZone(field, 
false)).isDefined) {
+val fieldAsUTF8String = UTF8String.fromString(field)
+if 
(DateTimeUtils.stringToTimestampWithoutTimeZone(fieldAsUTF8String).isDefined) {

Review Comment:
   Yes, that's why tests failed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-16 Thread GitBox


gengliangwang commented on code in PR #36562:
URL: https://github.com/apache/spark/pull/36562#discussion_r873559626


##
sql/core/benchmarks/CSVBenchmark-results.txt:
##
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 

 
-OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 12.2.1
+Apple M1 Pro
 Parsing quoted values:Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-One quoted string 41610  42902
1598  0.0  832194.2   1.0X
+One quoted string 16964  16981 
 15  0.0  339281.1   1.0X

Review Comment:
   Not directly related to this PR, but the M1 Macbook is so fast! 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-16 Thread GitBox


gengliangwang commented on code in PR #36562:
URL: https://github.com/apache/spark/pull/36562#discussion_r873709851


##
sql/core/benchmarks/CSVBenchmark-results.txt:
##
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 

 
-OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 12.2.1
+Apple M1 Pro
 Parsing quoted values:Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-One quoted string 41610  42902
1598  0.0  832194.2   1.0X
+One quoted string 16964  16981 
 15  0.0  339281.1   1.0X

Review Comment:
   OK I will try it 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-16 Thread GitBox


gengliangwang commented on code in PR #36562:
URL: https://github.com/apache/spark/pull/36562#discussion_r873559626


##
sql/core/benchmarks/CSVBenchmark-results.txt:
##
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 

 
-OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 12.2.1
+Apple M1 Pro
 Parsing quoted values:Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-One quoted string 41610  42902
1598  0.0  832194.2   1.0X
+One quoted string 16964  16981 
 15  0.0  339281.1   1.0X

Review Comment:
   Not directly related to this PR, but the M1 Macbook is so fast! cc @dbtsai 
@dongjoon-hyun @viirya @sunchao @huaxingao 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #36562: [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources

2022-05-16 Thread GitBox


gengliangwang commented on code in PR #36562:
URL: https://github.com/apache/spark/pull/36562#discussion_r873559626


##
sql/core/benchmarks/CSVBenchmark-results.txt:
##
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 

 
-OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.13.0-1021-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 12.2.1
+Apple M1 Pro
 Parsing quoted values:Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
 

-One quoted string 41610  42902
1598  0.0  832194.2   1.0X
+One quoted string 16964  16981 
 15  0.0  339281.1   1.0X

Review Comment:
   Not directly related to this PR, but m1 chips are so fast! cc @dbtsai 
@dongjoon-hyun @viirya @sunchao @huaxingao 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org