[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-21 Thread GitBox


sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976097051


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
 ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   Can you elaborate on why we need 
LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS? I understand how it is used. Also, 
I am not a supporter of hardcoding date/time formats here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-21 Thread GitBox


sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976091402


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
* is compatible with both input data types.
*/
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-TypeCoercion.findTightestCommonType(t1, 
t2).orElse(findCompatibleTypeForCSV(t1, t2))
+(t1, t2) match {

Review Comment:
   Never mind, I checked the code, the resulting type will be TimestampType or 
TimestampNTZ.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-21 Thread GitBox


sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976089392


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
* is compatible with both input data types.
*/
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-TypeCoercion.findTightestCommonType(t1, 
t2).orElse(findCompatibleTypeForCSV(t1, t2))
+(t1, t2) match {

Review Comment:
   What result does `findTightestCommonType` return for DateType and 
TimestampType?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976046176


##
docs/sql-data-sources-csv.md:
##
@@ -111,7 +111,7 @@ Data source options of CSV can be set via:
   
 prefersDate
 false
-During schema inference (inferSchema), attempts to infer 
string columns that contain dates or timestamps as Date if the 
values satisfy the dateFormat option and failed to be parsed by 
the respective formatter. With a user-provided schema, attempts to parse 
timestamp columns as dates using dateFormat if they fail to 
conform to timestampFormat, in this case the parsed values will be 
cast to timestamp type afterwards.
+During schema inference (inferSchema), attempts to infer 
string columns that contain dates as Date if the values satisfy 
the dateFormat option or default date format. For columns that 
contain mixing dates and timestamps, infer them as StringType.

Review Comment:
   nit: `... a mix/mixture of dates and timestamps ...`?



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
* is compatible with both input data types.
*/
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-TypeCoercion.findTightestCommonType(t1, 
t2).orElse(findCompatibleTypeForCSV(t1, t2))
+(t1, t2) match {
+  case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+   (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+// For a column containing mixing dates and timestamps

Review Comment:
   Same here, just a bit of rewording.



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
* is compatible with both input data types.
*/
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-TypeCoercion.findTightestCommonType(t1, 
t2).orElse(findCompatibleTypeForCSV(t1, t2))
+(t1, t2) match {
+  case (DateType, TimestampType) | (DateType, TimestampNTZType) |
+   (TimestampNTZType, DateType) | (TimestampType, DateType) =>
+// For a column containing mixing dates and timestamps
+// infer it as timestamp type if its dates can be inferred as 
timestamp type
+// otherwise infer it as StringType
+val dateFormat = 
options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern)
+t1 match {
+  case DateType if canParseDateAsTimestamp(dateFormat, t2) =>
+Some(t2)
+  case TimestampType | TimestampNTZType if 
canParseDateAsTimestamp(dateFormat, t1) =>
+Some(t1)
+  case _ => Some(StringType)
+}
+  case _ => TypeCoercion.findTightestCommonType(t1, 
t2).orElse(findCompatibleTypeForCSV(t1, t2))
+}
+  }
+
+  /**
+   * Return if strings of given date format can be parsed as timestamps

Review Comment:
   nit: Returns `true` if strings ...



##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##
@@ -2819,51 +2819,68 @@ abstract class CSVSuite
 }
   }
 
-  test("SPARK-39469: Infer schema for date type") {
-val options1 = Map(
-  "header" -> "true",
-  "inferSchema" -> "true",
-  "timestampFormat" -> "-MM-dd'T'HH:mm:ss",
-  "dateFormat" -> "-MM-dd",
-  "prefersDate" -> "true")
-val options2 = Map(
-  "header" -> "true",
-  "inferSchema" -> "true",
-  "prefersDate" -> "true")
-
-// Error should be thrown when attempting to prefersDate with Legacy parser
-if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) {
-  checkError(
-exception = intercept[SparkIllegalArgumentException] {
-  
spark.read.format("csv").options(options1).load(testFile(dateInferSchemaFile))
-},
-errorClass = "CANNOT_INFER_DATE")
-} else {
-  // 1. Specify date format and timestamp format
-  // 2. Date inference should work with default date format when 
dateFormat is not provided
-  Seq(options1, options2).foreach {options =>
+  test("SPARK-39469: Infer schema for columns with only dates " +
+"and columns with mixing date and timestamps correctly") {
+def checkCSVReadDatetime(
+  options: Map[String, String],
+  expectedSchema: StructType,
+  expectedData: Seq[Seq[Any]]): Unit = {
+
+  // Error should be thrown when attempting to prefersDate with Legacy 
parser

Review Comment:
   nit: `to use prefersDate ...`.



##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##
@@ -2819,51 +2819,68 @@ abstract class CSVSuite
 }
   }
 
-  test("SPARK-39469: Infer schema for date type") {
-val options1 = Map(
-  "header" -> "true",
-   

[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974998043


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -183,7 +180,9 @@ class CSVOptions(
   Some(parameters.getOrElse("timestampFormat",
 s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
 } else {
-  parameters.get("timestampFormat")
+  // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+  // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Let me think about it as well as there are different fallbacks with the 
regard to date/timestamp parsing. IMHO, we would want to have some kind of flag 
to revert to the original behaviour but this would be a lot of flags to 
configure. 
   
   It is possible that some users might be relying on a more lenient parsing of 
timestamps so if we switch to ISO8601 only, some of the jobs might need to be 
updated. Let me think a bit more on this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974985351


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##
@@ -224,7 +223,7 @@ class UnivocityParser(
   case NonFatal(e) =>
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
 // compatibility if enabled.
-if (!enableParsingFallbackForTimestampType) {
+if (!enableParsingFallbackForDateType) {

Review Comment:
   Opened https://github.com/apache/spark/pull/37942 to fix this. Apologies for 
the mistake.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974948772


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##
@@ -224,7 +223,7 @@ class UnivocityParser(
   case NonFatal(e) =>
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
 // compatibility if enabled.
-if (!enableParsingFallbackForTimestampType) {
+if (!enableParsingFallbackForDateType) {

Review Comment:
   After thinking about it more, let me submit a separate PR to fix it, this is 
a serious bug and I will fix it ASAP, we may need to port to other Spark 
branches.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-19 Thread GitBox


sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974908544


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##
@@ -224,7 +223,7 @@ class UnivocityParser(
   case NonFatal(e) =>
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
 // compatibility if enabled.
-if (!enableParsingFallbackForTimestampType) {
+if (!enableParsingFallbackForDateType) {

Review Comment:
   @xiaonanyang-db Does it mean that we don't have test coverage for it? I am 
wondering how I managed to merge a PR like this without a unit test 樂. I will 
open a follow-up PR to improve testing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-19 Thread GitBox


sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974908544


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##
@@ -224,7 +223,7 @@ class UnivocityParser(
   case NonFatal(e) =>
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
 // compatibility if enabled.
-if (!enableParsingFallbackForTimestampType) {
+if (!enableParsingFallbackForDateType) {

Review Comment:
   @xiaonanyang-db Does it mean that we don't have test coverage for it? I am 
wondering how I managed to merge like this. I will open a follow-up PR to 
improve testing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-19 Thread GitBox


sadikovi commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974905472


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##
@@ -224,7 +223,7 @@ class UnivocityParser(
   case NonFatal(e) =>
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
 // compatibility if enabled.
-if (!enableParsingFallbackForTimestampType) {
+if (!enableParsingFallbackForDateType) {

Review Comment:
   Thanks for fixing it, the values were swapped.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org