[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r976108061

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ##

```
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     ExprUtils.getDecimalParser(options.locale)
   }

+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(
```

Review Comment: When the timestamp format is not specified, the desired behavior is that a column with a mix of dates and timestamps can be inferred as timestamp type if the lenient timestamp formatter can parse the date strings in the column as well. To achieve that without introducing other performance concerns, we simply check whether the date format is supported by the lenient timestamp formatter. Does that make sense?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r976090455

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ##

```
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
```

Review Comment:

```
(d1, d2) match {
  case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
    TimestampType
  case (_: TimestampType, _: TimestampNTZType) | (_: TimestampNTZType, _: TimestampType) =>
    TimestampType
  case (_: TimestampNTZType, _: DateType) | (_: DateType, _: TimestampNTZType) =>
    TimestampNTZType
}
```
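The suggested `match` above is only a fragment. As a self-contained sketch of the merging rule it proposes (with placeholder case objects standing in for Spark's `DataType` singletons, and an explicit fallback added so the match is exhaustive; names here are illustrative, not Spark's actual code):

```scala
// Placeholder stand-ins for Spark's datetime DataType objects, so this
// sketch compiles on its own; the real logic lives in
// CSVInferSchema.compatibleType.
sealed trait DataType
case object DateType extends DataType
case object TimestampType extends DataType
case object TimestampNTZType extends DataType

// Merge two datetime types during schema inference: any mix involving
// TimestampType widens to TimestampType; TimestampNTZType with DateType
// widens to TimestampNTZType; anything else defers to the generic path.
def mergeDatetimeTypes(t1: DataType, t2: DataType): Option[DataType] =
  (t1, t2) match {
    case (TimestampType, DateType) | (DateType, TimestampType) =>
      Some(TimestampType)
    case (TimestampType, TimestampNTZType) | (TimestampNTZType, TimestampType) =>
      Some(TimestampType)
    case (TimestampNTZType, DateType) | (DateType, TimestampNTZType) =>
      Some(TimestampNTZType)
    case _ => None // defer to TypeCoercion.findTightestCommonType / CSV fallback
  }
```

The explicit `case _ => None` is the main difference from the fragment above: without it, the match would throw a `MatchError` for non-datetime type pairs instead of falling through to the generic coercion path.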
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r976070903

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ##

```
@@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
    * is compatible with both input data types.
    */
   private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = {
-    TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2))
+    (t1, t2) match {
```

Review Comment: `findTightestCommonType` merges `DateType` and `TimestampType` in a way that is not applicable here.
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r976083738

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ##

```
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     ExprUtils.getDecimalParser(options.locale)
   }

+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(
```

Review Comment:

> More of a thought experiment, I don't think this actually happens in practice: cast allows years longer than 4 digits, is it something that needs to be supported here? For more context, https://issues.apache.org/jira/browse/SPARK-39731.

Those are some interesting formats; I am not sure we need to handle them here.
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r976081324

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ##

```
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     ExprUtils.getDecimalParser(options.locale)
   }

+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(
```

Review Comment:

```
dateFormat = "/MM/dd"
timestampFormat = "/MM/dd HH:mm:ss"
```

Because `timestampFormat` is specified, we will use the strict parser. In this case, a column with a mix of dates and timestamps will be inferred as `StringType`.
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r976077440

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ##

```
@@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
     ExprUtils.getDecimalParser(options.locale)
   }

+  // Date formats that could be parsed in DefaultTimestampFormatter
+  // Reference: DateTimeUtils.parseTimestampString
+  private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set(
```

Review Comment: We need this set to determine whether to infer a column with a mixture of dates and timestamps as `TimestampType` or `StringType` when no timestamp format is specified (in which case the lenient timestamp formatter is used).
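A minimal sketch of the gating check described above, assuming hypothetical names and placeholder set contents (the real set in the PR is derived from `DateTimeUtils.parseTimestampString`; only the ISO `yyyy-MM-dd` entry is a safe assumption here):

```scala
// Placeholder contents: the real LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS
// lists every date pattern the lenient (default) timestamp formatter accepts.
val lenientFormatterDateFormats = Set("yyyy-MM-dd")

// A mixed date/timestamp column may be inferred as timestamp only when no
// user timestampFormat forces strict parsing AND the column's dateFormat is
// one the lenient formatter can also parse.
def inferMixedAsTimestamp(timestampFormat: Option[String], dateFormat: String): Boolean =
  timestampFormat.isEmpty && lenientFormatterDateFormats.contains(dateFormat)
```

Checking set membership of the format string, rather than attempting to parse each date value with the timestamp formatter, is what avoids the per-row performance cost mentioned in the thread.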
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r976042644

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ##

```
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type
```

Review Comment: I made some follow-up changes; please check the updated PR description for the resulting behavior and semantics.
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r975634588

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ##

```
@@ -183,7 +180,9 @@ class CSVOptions(
       Some(parameters.getOrElse("timestampFormat",
         s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
     } else {
-      parameters.get("timestampFormat")
+      // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+      // avoid parsing dates in timestamp columns as timestamp type
```

Review Comment: Totally agree with your concerns @cloud-fan @sadikovi. After some quick discussion within my team, we agreed on not changing these lines, to avoid unnecessary regressions and any other behavior changes. Thus, the behavior after this PR becomes:

- If the user provides a `timestampFormat/timestampNTZFormat`, we strictly parse fields as timestamps according to the format. Thus, columns mixing dates and timestamps will always be inferred as `StringType`.
- If no `timestampFormat/timestampNTZFormat` is specified by the user, then for a column mixing dates and timestamps:
  - If date values come before timestamp values:
    - If `prefersDate=true`, the column will be inferred as `StringType`
    - otherwise, the column could be inferred as timestamp/string type based on whether the date format is supported by the lenient timestamp formatter
  - If timestamp values come before date values:
    - the column could be inferred as timestamp/string type based on whether the date format is supported by the lenient timestamp formatter

There is no behavior change when `prefersDate=false`. Does this make sense to you? @sadikovi @cloud-fan cc @brkyvz @Yaohua628
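The decision table in the comment above can be encoded as a small function. This is an illustrative sketch only; the names and Boolean flags are stand-ins for Spark's actual options and inference state, not its API:

```scala
// Illustrative result type: whether the mixed column ends up as a timestamp
// type (TimestampType/TimestampNTZType) or falls back to StringType.
sealed trait Inferred
case object AsTimestamp extends Inferred
case object AsString extends Inferred

// Encodes the behavior described for a column mixing dates and timestamps:
//  1. A user-supplied timestampFormat forces strict parsing -> StringType.
//  2. Dates seen first with prefersDate=true -> StringType.
//  3. Otherwise, timestamp iff the date format is one the lenient
//     timestamp formatter also supports.
def inferMixedColumn(
    timestampFormatSpecified: Boolean,
    prefersDate: Boolean,
    datesSeenFirst: Boolean,
    dateFormatSupported: Boolean): Inferred =
  if (timestampFormatSpecified) AsString
  else if (datesSeenFirst && prefersDate) AsString
  else if (dateFormatSupported) AsTimestamp
  else AsString
```

Note the asymmetry: `prefersDate` only blocks the timestamp inference when the date values arrive first, because in that case the column has already been tentatively typed as `DateType`.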
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r975634588 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -183,7 +180,9 @@ class CSVOptions( Some(parameters.getOrElse("timestampFormat", s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX")) } else { - parameters.get("timestampFormat") + // Use Iso8601TimestampFormatter (with strict timestamp parsing) to + // avoid parsing dates in timestamp columns as timestamp type Review Comment: Totally agree with your concerns @cloud-fan @sadikovi. After some quick discussion within my team, we agreed on not changing these lines to avoid unnecessary regressions and any other behavior changes. Thus, the behavior after this PR become: - If user provides a `timestampFormat/timestampNTZFormat`, we will strictly parse fields as timestamp according to the format. Thus, columns with mixing dates and timestamps will always be inferred as `StringType`. - If no `timestampFormat/timestampNTZFormat` specified by user, for a column with mixing dates and timestamps - If date values are before timestamp values - If `prefersDate=true`, the column will be inferred as `StringType` - otherwise - If the date format is supported by `DefaultTimestampFormatter `, the column will be inferred as `timestampFormat/timestampNTZFormat` - otherwise, the column will be inferred as `StringType` - If timestamp values are before date values - If the date format is supported by `DefaultTimestampFormatter`, the column will be inferred as `timestampFormat/timestampNTZFormat` - otherwise the column will be inferred as `StringType` There is no behavior change when `prefersDate=false`. Does this make sense to you? @sadikovi @cloud-fan cc @brkyvz @Yaohua628 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r975634588 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -183,7 +180,9 @@ class CSVOptions( Some(parameters.getOrElse("timestampFormat", s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX")) } else { - parameters.get("timestampFormat") + // Use Iso8601TimestampFormatter (with strict timestamp parsing) to + // avoid parsing dates in timestamp columns as timestamp type Review Comment: Totally agree with your concerns @cloud-fan @sadikovi. After some quick discussion within my team, we agreed on not changing these lines to avoid unnecessary regressions and any other behavior changes. Thus, the behavior after this PR become: - If user provides a `timestampFormat/timestampNTZFormat`, we will strictly parse fields as timestamp according to the format. Thus, columns with mixing dates and timestamps will always be inferred as `StringType`. - If no `timestampFormat/timestampNTZFormat` specified by user, for a column with mixing dates and timestamps - If date values are before timestamp values - If `prefersDate=true`, the column will be inferred as `StringType` - otherwise - If the date format is supported by `Iso8601TimestampFormatter `, the column will be inferred as `timestampFormat/timestampNTZFormat` - otherwise, the column will be inferred as `StringType` - If timestamp values are before date values - If the date format is supported by `Iso8601TimestampFormatter `, the column will be inferred as `timestampFormat/timestampNTZFormat` - otherwise the column will be inferred as `StringType` There is no behavior change when `prefersDate=false`. Does this make sense to you? @sadikovi @cloud-fan cc @brkyvz @Yaohua628 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r975634588 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -183,7 +180,9 @@ class CSVOptions( Some(parameters.getOrElse("timestampFormat", s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX")) } else { - parameters.get("timestampFormat") + // Use Iso8601TimestampFormatter (with strict timestamp parsing) to + // avoid parsing dates in timestamp columns as timestamp type Review Comment: Totally agree with your concerns @cloud-fan @sadikovi. After some quick discussion within my team, we agreed on not changing these lines to avoid unnecessary regressions and any other behavior changes. Thus, the behavior after this PR become: - If user provides a `timestampFormat/timestampNTZFormat`, we will strictly parse fields as timestamp according to the format. Thus, columns with mixing dates and timestamps will always be inferred as `StringType`. - If no `timestampFormat/timestampNTZFormat` specified by user, for a column with mixing dates and timestamps - If date values are before timestamp values - If `prefersDate=true`, the column will be inferred as `StringType` - otherwise - If the date format is supported by `Iso8601TimestampFormatter `, the column will be inferred as `timestampFormat/timestampNTZFormat` - otherwise, the column will be inferred as `StringType` - If timestamp values are before date values - If the date format is supported by `Iso8601TimestampFormatter `, the column will be inferred as `timestampFormat/timestampNTZFormat` - otherwise the column will be inferred as `StringType` Does this make sense to you? @sadikovi @cloud-fan cc @brkyvz @Yaohua628 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r975634588 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -183,7 +180,9 @@ class CSVOptions( Some(parameters.getOrElse("timestampFormat", s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX")) } else { - parameters.get("timestampFormat") + // Use Iso8601TimestampFormatter (with strict timestamp parsing) to + // avoid parsing dates in timestamp columns as timestamp type Review Comment: Totally agree with your concerns @cloud-fan @sadikovi. After some quick discussion within my team, we agreed on not changing these lines to avoid unnecessary regressions and any other behavior changes. Thus, the behavior after this PR become: - If user provides a `timestampFormat/timestampNTZFormat`, we will strictly parse fields as timestamp according to the format. Thus, columns with mixing dates and timestamps will always be inferred as `StringType`. - If no `timestampFormat/timestampNTZFormat` specified by user, for a column with mixing dates and timestamps - If date values are before timestamp values - If `prefersDate=true`, the column will be inferred as `StringType` - otherwise - If the date format is supported by `Iso8601TimestampFormatter `, the column will be inferred as `timestampFormat/timestampNTZFormat` - otherwise, the column will be inferred as `StringType` - otherwise the column will be inferred as `StringType` - If timestamp values are before date values - If the date format is supported by `Iso8601TimestampFormatter `, the column will be inferred as `timestampFormat/timestampNTZFormat` - otherwise the column will be inferred as `StringType` Does this make sense to you? @sadikovi @cloud-fan cc @brkyvz @Yaohua628 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r975634588 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -183,7 +180,9 @@ class CSVOptions( Some(parameters.getOrElse("timestampFormat", s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX")) } else { - parameters.get("timestampFormat") + // Use Iso8601TimestampFormatter (with strict timestamp parsing) to + // avoid parsing dates in timestamp columns as timestamp type Review Comment: Totally agree with your concerns @cloud-fan @sadikovi. After some quick discussion within my team, we agreed on not changing these lines to avoid unnecessary regressions and any other behavior changes. Thus, the behavior after this PR become: - If user provides a `timestampFormat/timestampNTZFormat`, we will strictly parse fields as timestamp according to the format. Thus, columns with mixing dates and timestamps will always be inferred as `StringType`. - If no `timestampFormat/timestampNTZFormat` specified by user, for a column with mixing dates and timestamps - If date values are before timestamp values, the column will be inferred as `StringType` - If timestamp values are before date values - If the date format is supported by `Iso8601TimestampFormatter `, the column will be inferred as `timestampFormat/timestampNTZFormat` - otherwise the column will be inferred as `StringType` Does this make sense to you? @sadikovi @cloud-fan cc @brkyvz @Yaohua628 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r975634588 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -183,7 +180,9 @@ class CSVOptions( Some(parameters.getOrElse("timestampFormat", s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX")) } else { - parameters.get("timestampFormat") + // Use Iso8601TimestampFormatter (with strict timestamp parsing) to + // avoid parsing dates in timestamp columns as timestamp type Review Comment: Totally agree with your concerns @cloud-fan @sadikovi. After some quick discussion within my team, we agreed on not changing these lines to avoid unnecessary regressions and any other behavior changes. Thus, the behavior after this PR become: - If user provides a `timestampFormat/timestampNTZFormat`, we will strictly parse fields as timestamp according to the format. Thus, columns with mixing dates and timestamps will always be inferred as `StringType`. - If no `timestampFormat/timestampNTZFormat` specified by user, for a column with mixing dates and timestamps - If date values are before timestamp values, the column will be inferred as `timestampFormat/timestampNTZFormat` - If timestamp values are before date values - If the date format is supported by `Iso8601TimestampFormatter `, the column will still be inferred as `timestampFormat/timestampNTZFormat` - otherwise the column will be inferred as `StringType` Does this make sense to you? @sadikovi @cloud-fan cc @brkyvz @Yaohua628 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r974986988

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
## @@ -183,7 +180,9 @@ class CSVOptions(
Some(parameters.getOrElse("timestampFormat", s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"))
} else {
- parameters.get("timestampFormat")
+ // Use Iso8601TimestampFormatter (with strict timestamp parsing) to
+ // avoid parsing dates in timestamp columns as timestamp type

Review Comment:
- There are two types of timestamp formatter: Default and Iso8601.
  - The default formatter uses the same parsing logic as CAST, which can parse a date value as a timestamp.
  - The Iso8601 formatter parses more strictly.
- If no timestampFormat is given, we use the default one.
- For a column containing timestamp values first with dates following, the default formatter would still parse the dates as timestamps and infer the column as timestamp type, which is inconsistent with what we are proposing now. It's pretty similar to the legacy parser problem.
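The ordering concern in the last bullet can be sketched as follows. This is not Spark's `DefaultTimestampFormatter`/`Iso8601TimestampFormatter` code; the two predicates below only mimic the lenient-versus-strict distinction with illustrative format strings.

```python
from datetime import datetime

def default_parse_ok(value: str) -> bool:
    """CAST-like leniency: a bare date also parses, as midnight."""
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            pass
    return False

def iso8601_parse_ok(value: str) -> bool:
    """Stricter: a bare date does not satisfy a full timestamp pattern."""
    try:
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return True
    except ValueError:
        return False

# Timestamp first, dates following: with the lenient default formatter the
# trailing dates still parse as timestamps, so the whole column would be
# inferred as timestamp type; the strict check rejects the dates instead.
column = ["2022-09-19 10:00:00", "2022-09-20", "2022-09-21"]
lenient_all = all(default_parse_ok(v) for v in column)   # True
strict_all = all(iso8601_parse_ok(v) for v in column)    # False
```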
[GitHub] [spark] xiaonanyang-db commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r974909141

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
## @@ -224,7 +223,7 @@ class UnivocityParser(
case NonFatal(e) =>
// If fails to parse, then tries the way used in 2.0 and 1.x for backwards
// compatibility if enabled.
- if (!enableParsingFallbackForTimestampType) {
+ if (!enableParsingFallbackForDateType) {

Review Comment: That will be nice!
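The flag renamed in the diff above gates a legacy-parsing fallback. A hypothetical sketch of that control flow (the function and parameter names here are invented for illustration; this is not the actual `UnivocityParser` code):

```python
from datetime import date, datetime

def parse_date(value: str, fmt: str,
               enable_parsing_fallback_for_date_type: bool) -> date:
    """Try the configured date format; on failure, optionally fall back
    to a legacy (Spark 1.x/2.0-style) parsing path."""
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        # Only take the backwards-compatibility path when the fallback
        # flag for DateType is enabled; otherwise surface the error.
        if not enable_parsing_fallback_for_date_type:
            raise
        return date.fromisoformat(value)

# The configured format does not match, so the legacy fallback kicks in.
parse_date("2022-09-20", "%m/%d/%Y", True)   # -> date(2022, 9, 20)
```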