[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-21 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976108061


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
 ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   When timestamp format is not specified, the desired behavior is that a 
column with mix of dates and timestamps could be inferred as timestamp type if 
the lenient timestamp formatter can parse the date strings under the column as 
well. 
   
   To achieve that without bringing other performance concern, we want to 
simply check if the date format could be supported by the lenient timestamp 
formatter. Does that make sense?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-21 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976108061


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
 ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   When timestamp format is not specified, the desired behavior is that a 
column with mix of dates and timestamps will be inferred as timestamp type if 
the lenient timestamp formatter could parse the date strings under the column 
as well. To achieve that without bringing other performance concern, we want to 
simply check if the dateFormat could be supported by the lenient timestamp 
formatter. Does that make sense?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-21 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976090455


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
* is compatible with both input data types.
*/
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-TypeCoercion.findTightestCommonType(t1, 
t2).orElse(findCompatibleTypeForCSV(t1, t2))
+(t1, t2) match {

Review Comment:
   ```
   (d1, d2) match {
 case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) 
=>
   TimestampType
   
 case (_: TimestampType, _: TimestampNTZType) | (_: TimestampNTZType, 
_: TimestampType) =>
   TimestampType
   
 case (_: TimestampNTZType, _: DateType) | (_: DateType, _: 
TimestampNTZType) =>
   TimestampNTZType
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-21 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976070903


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
* is compatible with both input data types.
*/
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-TypeCoercion.findTightestCommonType(t1, 
t2).orElse(findCompatibleTypeForCSV(t1, t2))
+(t1, t2) match {

Review Comment:
   `findTightestCommonType` merge DateType and TimestampType in a way that is 
not applicable here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-21 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976083738


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
 ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   > More of a thought experiment, I don't think this actually happens in 
practice: cast allows years longer than 4 digits, is it something that needs to 
be supported here? For more context, 
https://issues.apache.org/jira/browse/SPARK-39731.
   
   That's some interesting formats, I am not sure if we need to take care of  
them here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-21 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976081324


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
 ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   ```
   dateFormat = "/MM/dd"
   timestampFormat = "/MM/dd HH:mm:ss" 
   ```
   I don't quite understand your question on this case. 
   But speaking in the context of this PR, because `timestampFormat` is 
specified, a column with a mix of dates and timestamps will be inferred as 
`StringType`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-21 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976081324


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
 ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   ```
   dateFormat = "/MM/dd"
   timestampFormat = "/MM/dd HH:mm:ss" 
   ```
   I don't quite understand your question on this case. 
   But speaking in the context of this PR, because `timestampFormat` is 
specified, we will use the strict parser, a column with a mix of dates and 
timestamps will be inferred as `StringType`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-21 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976081324


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
 ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   ```
   dateFormat = "/MM/dd"
   timestampFormat = "/MM/dd HH:mm:ss" 
   ```
   Because `timestampFormat` is specified, we will use the strict parser.
   In this case, a column with a mix of dates and timestamps will be inferred 
as `StringType`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-21 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976077440


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
 ExprUtils.getDecimalParser(options.locale)
   }
 
+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(

Review Comment:
   We need this set to determine inferring a column with mixture of dates and 
timestamps as `TimestampType` or `StringType` when no timestamp format is 
specified (the lenient timestamp formatter will be used)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-21 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976070903


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:
##
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends 
Serializable {
* is compatible with both input data types.
*/
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-TypeCoercion.findTightestCommonType(t1, 
t2).orElse(findCompatibleTypeForCSV(t1, t2))
+(t1, t2) match {

Review Comment:
   `findTightestCommonType` merge DateType and TimestampType in a way that is 
not applicable here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r976042644


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -183,7 +180,9 @@ class CSVOptions(
   Some(parameters.getOrElse("timestampFormat",
 s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
 } else {
-  parameters.get("timestampFormat")
+  // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+  // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Made some follow-up changes, please check the updated description for the 
behavior after changes and semantics.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -183,7 +180,9 @@ class CSVOptions(
   Some(parameters.getOrElse("timestampFormat",
 s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
 } else {
-  parameters.get("timestampFormat")
+  // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+  // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan @sadikovi.
   
   After some quick discussion within my team, we agreed on not changing these 
lines to avoid unnecessary regressions and any other behavior changes. Thus, 
the behavior after this PR become:
   - If user provides a `timestampFormat/timestampNTZFormat`, we will strictly 
parse fields as timestamp according to the format. Thus, columns with mixing 
dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` specified by user, for a column 
with mixing dates and timestamps
 - If date values are before timestamp values
   - If `prefersDate=true`, the column will be inferred as `StringType`
   - otherwise the column could be inferred as timestamp/string type based 
on whether the date format is supported by the lenient timestampFormatter
 - If timestamp values are before date values
   - the column could be inferred as timestamp/string type based on whether 
the date format is supported by the lenient timestampFormatter
   
   There is no behavior change when `prefersDate=false`.
   Does this make sense to you? @sadikovi @cloud-fan 
   
   cc @brkyvz @Yaohua628 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -183,7 +180,9 @@ class CSVOptions(
   Some(parameters.getOrElse("timestampFormat",
 s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
 } else {
-  parameters.get("timestampFormat")
+  // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+  // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan @sadikovi.
   
   After some quick discussion within my team, we agreed on not changing these 
lines to avoid unnecessary regressions and any other behavior changes. Thus, 
the behavior after this PR become:
   - If user provides a `timestampFormat/timestampNTZFormat`, we will strictly 
parse fields as timestamp according to the format. Thus, columns with mixing 
dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` specified by user, for a column 
with mixing dates and timestamps
 - If date values are before timestamp values
   - If `prefersDate=true`, the column will be inferred as `StringType`
   - otherwise
 - If the date format is supported by `DefaultTimestampFormatter `, the 
column will be inferred as `timestampFormat/timestampNTZFormat`
 - otherwise, the column will be inferred as `StringType`
 - If timestamp values are before date values
   - If the date format is supported by `DefaultTimestampFormatter`, the 
column will be inferred as `timestampFormat/timestampNTZFormat`
   - otherwise the column will be inferred as `StringType`
   
   There is no behavior change when `prefersDate=false`.
   Does this make sense to you? @sadikovi @cloud-fan 
   
   cc @brkyvz @Yaohua628 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -183,7 +180,9 @@ class CSVOptions(
   Some(parameters.getOrElse("timestampFormat",
 s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
 } else {
-  parameters.get("timestampFormat")
+  // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+  // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan @sadikovi.
   
   After some quick discussion within my team, we agreed on not changing these 
lines to avoid unnecessary regressions and any other behavior changes. Thus, 
the behavior after this PR become:
   - If user provides a `timestampFormat/timestampNTZFormat`, we will strictly 
parse fields as timestamp according to the format. Thus, columns with mixing 
dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` specified by user, for a column 
with mixing dates and timestamps
 - If date values are before timestamp values
   - If `prefersDate=true`, the column will be inferred as `StringType`
   - otherwise
 - If the date format is supported by `Iso8601TimestampFormatter `, the 
column will be inferred as `timestampFormat/timestampNTZFormat`
 - otherwise, the column will be inferred as `StringType`
 - If timestamp values are before date values
   - If the date format is supported by `Iso8601TimestampFormatter `, the 
column will be inferred as `timestampFormat/timestampNTZFormat`
   - otherwise the column will be inferred as `StringType`
   
   There is no behavior change when `prefersDate=false`.
   Does this make sense to you? @sadikovi @cloud-fan 
   
   cc @brkyvz @Yaohua628 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -183,7 +180,9 @@ class CSVOptions(
   Some(parameters.getOrElse("timestampFormat",
 s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
 } else {
-  parameters.get("timestampFormat")
+  // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+  // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan @sadikovi.
   
   After some quick discussion within my team, we agreed on not changing these 
lines to avoid unnecessary regressions and any other behavior changes. Thus, 
the behavior after this PR become:
   - If user provides a `timestampFormat/timestampNTZFormat`, we will strictly 
parse fields as timestamp according to the format. Thus, columns with mixing 
dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` specified by user, for a column 
with mixing dates and timestamps
 - If date values are before timestamp values
   - If `prefersDate=true`, the column will be inferred as `StringType`
   - otherwise
 - If the date format is supported by `Iso8601TimestampFormatter `, the 
column will be inferred as `timestampFormat/timestampNTZFormat`
 - otherwise, the column will be inferred as `StringType`
 - If timestamp values are before date values
   - If the date format is supported by `Iso8601TimestampFormatter `, the 
column will be inferred as `timestampFormat/timestampNTZFormat`
   - otherwise the column will be inferred as `StringType`
   
   Does this make sense to you? @sadikovi @cloud-fan 
   
   cc @brkyvz @Yaohua628 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -183,7 +180,9 @@ class CSVOptions(
   Some(parameters.getOrElse("timestampFormat",
 s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
 } else {
-  parameters.get("timestampFormat")
+  // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+  // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan @sadikovi.
   
   After some quick discussion within my team, we agreed on not changing these 
lines to avoid unnecessary regressions and any other behavior changes. Thus, 
the behavior after this PR become:
   - If user provides a `timestampFormat/timestampNTZFormat`, we will strictly 
parse fields as timestamp according to the format. Thus, columns with mixing 
dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` specified by user, for a column 
with mixing dates and timestamps
 - If date values are before timestamp values
   - If `prefersDate=true`, the column will be inferred as `StringType`
   - otherwise
 - If the date format is supported by `Iso8601TimestampFormatter `, the 
column will be inferred as `timestampFormat/timestampNTZFormat`
 - otherwise, the column will be inferred as `StringType`
   - otherwise the column will be inferred as `StringType`
 - If timestamp values are before date values
   - If the date format is supported by `Iso8601TimestampFormatter `, the 
column will be inferred as `timestampFormat/timestampNTZFormat`
   - otherwise the column will be inferred as `StringType`
   
   Does this make sense to you? @sadikovi @cloud-fan 
   
   cc @brkyvz @Yaohua628 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -183,7 +180,9 @@ class CSVOptions(
   Some(parameters.getOrElse("timestampFormat",
 s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
 } else {
-  parameters.get("timestampFormat")
+  // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+  // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan @sadikovi.
   
   After some quick discussion within my team, we agreed on not changing these 
lines to avoid unnecessary regressions and any other behavior changes. Thus, 
the behavior after this PR become:
   - If user provides a `timestampFormat/timestampNTZFormat`, we will strictly 
parse fields as timestamp according to the format. Thus, columns with mixing 
dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` specified by user, for a column 
with mixing dates and timestamps
 - If date values are before timestamp values, the column will be inferred 
as `StringType`
 - If timestamp values are before date values
   - If the date format is supported by `Iso8601TimestampFormatter `, the 
column will be inferred as `timestampFormat/timestampNTZFormat`
   - otherwise the column will be inferred as `StringType`
   
   Does this make sense to you? @sadikovi @cloud-fan 
   
   cc @brkyvz @Yaohua628 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -183,7 +180,9 @@ class CSVOptions(
   Some(parameters.getOrElse("timestampFormat",
 s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
 } else {
-  parameters.get("timestampFormat")
+  // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+  // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan @sadikovi.
   
   After some quick discussion within my team, we agreed on not changing these 
lines to avoid unnecessary regressions and any other behavior changes. Thus, 
the behavior after this PR become:
   - If user provides a `timestampFormat/timestampNTZFormat`, we will strictly 
parse fields as timestamp according to the format. Thus, columns with mixing 
dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` specified by user, for a column 
with mixing dates and timestamps
 - If date values are before timestamp values, the column will be inferred 
as `timestampFormat/timestampNTZFormat`
 - If timestamp values are before date values
   - If the date format is supported by `Iso8601TimestampFormatter `, the 
column will still be inferred as `timestampFormat/timestampNTZFormat`
   - otherwise the column will be inferred as `StringType`
   
   Does this make sense to you? @sadikovi @cloud-fan 
   
   cc @brkyvz @Yaohua628 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -183,7 +180,9 @@ class CSVOptions(
   Some(parameters.getOrElse("timestampFormat",
 s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
 } else {
-  parameters.get("timestampFormat")
+  // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+  // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan @sadikovi.
   
   After some quick discussion within my team, we agreed on not changing these 
lines to avoid unnecessary regressions and any other behavior changes. Thus, 
the behavior after this PR become:
   - If user provides a `timestampFormat/timestampNTZFormat`, we will strictly 
parse fields as timestamp according to the format. Thus, columns with mixing 
dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` specified by user, we keep 
current behavior that columns with mixing dates and timestamps could be 
inferred as `TimestampType/TimestampNTZType`, which is okay and even a good 
feature.
   
   Does this make sense to you? @sadikovi @cloud-fan 
   
   cc @brkyvz @Yaohua628 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -183,7 +180,9 @@ class CSVOptions(
   Some(parameters.getOrElse("timestampFormat",
 s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
 } else {
-  parameters.get("timestampFormat")
+  // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+  // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan & @sadikovi.
   
   After some quick discussion within my team, we agreed on not changing these 
lines to avoid unnecessary regressions and any other behavior changes. Thus, 
the behavior after this PR become:
   - If user provides a `timestampFormat/timestampNTZFormat`, we will strictly 
parse fields as timestamp according to the format. Thus, columns with mixing 
dates and timestamps will always be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` specified by user, we keep 
current behavior that columns with mixing dates and timestamps could be 
inferred as `TimestampType/TimestampNTZType`, which is okay and even a good 
feature.
   
   Does this make sense to you? @sadikovi @cloud-fan 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r975634588


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -183,7 +180,9 @@ class CSVOptions(
   Some(parameters.getOrElse("timestampFormat",
 s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
 } else {
-  parameters.get("timestampFormat")
+  // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+  // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   Totally agree with your concerns @cloud-fan & @sadikovi.
   
   After some quick discussion within my team, we agreed on not changing these 
lines to avoid unnecessary regressions and any other behavior changes. Thus, 
the behavior after this PR become:
   - If user provides a `timestampFormat/timestampNTZFormat`, we will strictly 
parse fields as timestamp according to the format. Thus, columns with mixing 
dates and timestamps will be inferred as `StringType`.
   - If no `timestampFormat/timestampNTZFormat` specified by user, columns with 
mixing dates and timestamps could be inferred as 
`TimestampType/TimestampNTZType`, which is okay and even a good feature.
   
   Does this make sense to you? @sadikovi @cloud-fan 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974986988


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -183,7 +180,9 @@ class CSVOptions(
   Some(parameters.getOrElse("timestampFormat",
 s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
 } else {
-  parameters.get("timestampFormat")
+  // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+  // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   - There are two types of timestampFormatter: Default and Iso8601.
 - The default formatter use same parsing logic as CAST, which can parse a 
date value as timestamp.
 - The Iso8601 has more restrict parse. 
 - If no timestampFormat is given, we will use the default one 
 
   - In the case of a column containing timestamp value first and dates 
following, if we use the default formatter, it would still parse the dates as 
timestamp and infer the column as timestamp type, which is inconsistent with 
what we are proposing now.
   
   It’s pretty similar to the legacy parser problem.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-20 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974986988


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##
@@ -183,7 +180,9 @@ class CSVOptions(
   Some(parameters.getOrElse("timestampFormat",
 s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
 } else {
-  parameters.get("timestampFormat")
+  // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+  // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
   - There are two types of timestampFormatter: Default and Iso8601.
 - The default formatter use same parsing logic as CAST, which can parse a 
date value as timestamp.
 - The Iso8601 has more restrict parse. 
 - If not timestampFormat is given, we will use the default one 
 
   - In the case of a column containing timestamp value first and dates 
following, if we use the default formatter, it would still parse the dates as 
timestamp type, which is not consistent with what we are proposing now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-19 Thread GitBox


xiaonanyang-db commented on code in PR #37933:
URL: https://github.com/apache/spark/pull/37933#discussion_r974909141


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##
@@ -224,7 +223,7 @@ class UnivocityParser(
   case NonFatal(e) =>
 // If fails to parse, then tries the way used in 2.0 and 1.x for 
backwards
 // compatibility if enabled.
-if (!enableParsingFallbackForTimestampType) {
+if (!enableParsingFallbackForDateType) {

Review Comment:
   That will be nice!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org