[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
sadikovi commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r976097051 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -59,6 +59,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { ExprUtils.getDecimalParser(options.locale) } + // Date formats that could be parsed in DefaultTimestampFormatter + // Reference: DateTimeUtils.parseTimestampString + private val LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS = Set( Review Comment: Can you elaborate on why we need LENIENT_TS_FORMATTER_SUPPORTED_DATE_FORMATS? I understand how it is used. Also, I am not a supporter of hardcoding date/time formats here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
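For context on the object being questioned: the PR gates the date-to-timestamp widening on a hardcoded whitelist of date patterns that the lenient default timestamp parser also happens to accept. A minimal sketch of that gate follows; the pattern set is illustrative, not the PR's actual set, and the names (`LENIENT_TS_SUPPORTED_DATE_FORMATS`, `can_parse_date_as_timestamp`) are hypothetical stand-ins for the Scala code.

```python
from typing import Optional

# Illustrative whitelist of date patterns the lenient default timestamp parser
# would also accept; the real set lives in CSVInferSchema and may differ.
LENIENT_TS_SUPPORTED_DATE_FORMATS = {"yyyy-MM-dd", "yyyy-M-d", "yyyy/MM/dd"}

def can_parse_date_as_timestamp(date_format: str,
                                user_timestamp_format: Optional[str]) -> bool:
    # If the user supplied an explicit timestampFormat, the lenient default
    # parser is not in play, so the whitelist check does not apply.
    if user_timestamp_format is not None:
        return False
    return date_format in LENIENT_TS_SUPPORTED_DATE_FORMATS
```

This also illustrates the reviewer's objection: any such hardcoded whitelist can silently drift out of sync with the parser it mirrors (here, DateTimeUtils.parseTimestampString).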
[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
sadikovi commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r976091402 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { Review Comment: Never mind, I checked the code, the resulting type will be TimestampType or TimestampNTZ.
[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
sadikovi commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r976089392 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { Review Comment: What result does `findTightestCommonType` return for DateType and TimestampType?
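The special-casing under discussion can be modeled with a toy merge function. This is plain Python, not Spark's Scala: the string type tags and the `date_parseable_as_ts` flag are simplifications of DataType and of the dateFormat check in the PR, and the non-matching fallback is collapsed to string.

```python
# Toy model of the compatibleType merge in the diff above: date/timestamp pairs
# are special-cased, and a date column widens to the timestamp side only when
# the configured dateFormat is also parseable by the timestamp parser;
# otherwise the merged type falls back to string.
DATE, TS, TS_NTZ, STRING = "date", "timestamp", "timestamp_ntz", "string"

def compatible_type(t1, t2, date_parseable_as_ts):
    if {t1, t2} in ({DATE, TS}, {DATE, TS_NTZ}):
        ts_side = t1 if t1 != DATE else t2
        return ts_side if date_parseable_as_ts else STRING
    if t1 == t2:
        return t1
    return STRING  # simplified; Spark tries numeric widening etc. first
```

For example, merging a date column with a timestamp column yields `"timestamp"` when the date format is parseable as a timestamp, and `"string"` otherwise, which matches the behaviour the PR title describes.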
[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
sadikovi commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r976046176 ## docs/sql-data-sources-csv.md: ## @@ -111,7 +111,7 @@ Data source options of CSV can be set via: prefersDate false -During schema inference (inferSchema), attempts to infer string columns that contain dates or timestamps as Date if the values satisfy the dateFormat option and failed to be parsed by the respective formatter. With a user-provided schema, attempts to parse timestamp columns as dates using dateFormat if they fail to conform to timestampFormat, in this case the parsed values will be cast to timestamp type afterwards. +During schema inference (inferSchema), attempts to infer string columns that contain dates as Date if the values satisfy the dateFormat option or default date format. For columns that contain mixing dates and timestamps, infer them as StringType. Review Comment: nit: `... a mix/mixture of dates and timestamps ...`? ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. */ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing mixing dates and timestamps Review Comment: Same here, just a bit of rewording. ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -233,7 +237,39 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { * is compatible with both input data types. 
*/ private def compatibleType(t1: DataType, t2: DataType): Option[DataType] = { -TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +(t1, t2) match { + case (DateType, TimestampType) | (DateType, TimestampNTZType) | + (TimestampNTZType, DateType) | (TimestampType, DateType) => +// For a column containing mixing dates and timestamps +// infer it as timestamp type if its dates can be inferred as timestamp type +// otherwise infer it as StringType +val dateFormat = options.dateFormatInRead.getOrElse(DateFormatter.defaultPattern) +t1 match { + case DateType if canParseDateAsTimestamp(dateFormat, t2) => +Some(t2) + case TimestampType | TimestampNTZType if canParseDateAsTimestamp(dateFormat, t1) => +Some(t1) + case _ => Some(StringType) +} + case _ => TypeCoercion.findTightestCommonType(t1, t2).orElse(findCompatibleTypeForCSV(t1, t2)) +} + } + + /** + * Return if strings of given date format can be parsed as timestamps Review Comment: nit: Returns `true` if strings ... ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala: ## @@ -2819,51 +2819,68 @@ abstract class CSVSuite } } - test("SPARK-39469: Infer schema for date type") { -val options1 = Map( - "header" -> "true", - "inferSchema" -> "true", - "timestampFormat" -> "-MM-dd'T'HH:mm:ss", - "dateFormat" -> "-MM-dd", - "prefersDate" -> "true") -val options2 = Map( - "header" -> "true", - "inferSchema" -> "true", - "prefersDate" -> "true") - -// Error should be thrown when attempting to prefersDate with Legacy parser -if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) { - checkError( -exception = intercept[SparkIllegalArgumentException] { - spark.read.format("csv").options(options1).load(testFile(dateInferSchemaFile)) -}, -errorClass = "CANNOT_INFER_DATE") -} else { - // 1. Specify date format and timestamp format - // 2. 
Date inference should work with default date format when dateFormat is not provided - Seq(options1, options2).foreach {options => + test("SPARK-39469: Infer schema for columns with only dates " + +"and columns with mixing date and timestamps correctly") { +def checkCSVReadDatetime( + options: Map[String, String], + expectedSchema: StructType, + expectedData: Seq[Seq[Any]]): Unit = { + + // Error should be thrown when attempting to prefersDate with Legacy parser Review Comment: nit: `to use prefersDate ...`. ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala: ## @@ -2819,51 +2819,68 @@ abstract class CSVSuite } } - test("SPARK-39469: Infer schema for date type") { -val options1 = Map( - "header" -> "true", -
[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
sadikovi commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r974998043 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -183,7 +180,9 @@ class CSVOptions( Some(parameters.getOrElse("timestampFormat", s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX")) } else { - parameters.get("timestampFormat") + // Use Iso8601TimestampFormatter (with strict timestamp parsing) to + // avoid parsing dates in timestamp columns as timestamp type Review Comment: Let me think about it, as there are different fallbacks with regard to date/timestamp parsing. IMHO, we would want some kind of flag to revert to the original behaviour, but that would be a lot of flags to configure. It is possible that some users rely on the more lenient parsing of timestamps, so if we switch to ISO8601 only, some jobs might need to be updated. Let me think a bit more on this.
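The strict-versus-lenient distinction being weighed here can be sketched with Python's datetime (an analogy, not Spark's formatters): a lenient parser happily accepts a bare date as a midnight timestamp, which is exactly how date values leak into timestamp columns during inference, while a strict pattern-based parse rejects them.

```python
from datetime import datetime

def lenient_parse(s):
    # fromisoformat accepts both "2022-01-01" and full ISO timestamps,
    # mirroring a lenient timestamp parser that swallows bare dates.
    try:
        return datetime.fromisoformat(s)
    except ValueError:
        return None

def strict_parse(s, pattern="%Y-%m-%dT%H:%M:%S"):
    # strptime with a full timestamp pattern rejects a bare date,
    # mirroring strict ISO8601-style parsing.
    try:
        return datetime.strptime(s, pattern)
    except ValueError:
        return None

print(lenient_parse("2022-01-01"))          # parsed as midnight: date leaks in
print(strict_parse("2022-01-01"))           # None: rejected
print(strict_parse("2022-01-01T10:11:12"))  # parsed
```

The compatibility concern in the comment follows directly: inputs like the first case above currently parse as timestamps, so jobs relying on that lenient behaviour would break under strict-only parsing.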
[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
sadikovi commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r974985351 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala: ## @@ -224,7 +223,7 @@ class UnivocityParser( case NonFatal(e) => // If fails to parse, then tries the way used in 2.0 and 1.x for backwards // compatibility if enabled. -if (!enableParsingFallbackForTimestampType) { +if (!enableParsingFallbackForDateType) { Review Comment: Opened https://github.com/apache/spark/pull/37942 to fix this. Apologies for the mistake.
[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
sadikovi commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r974948772 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala: ## @@ -224,7 +223,7 @@ class UnivocityParser( case NonFatal(e) => // If fails to parse, then tries the way used in 2.0 and 1.x for backwards // compatibility if enabled. -if (!enableParsingFallbackForTimestampType) { +if (!enableParsingFallbackForDateType) { Review Comment: After thinking about it more, let me submit a separate PR to fix it. This is a serious bug and I will fix it ASAP; we may need to port the fix to other Spark branches.
[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
sadikovi commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r974908544 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala: ## @@ -224,7 +223,7 @@ class UnivocityParser( case NonFatal(e) => // If fails to parse, then tries the way used in 2.0 and 1.x for backwards // compatibility if enabled. -if (!enableParsingFallbackForTimestampType) { +if (!enableParsingFallbackForDateType) { Review Comment: @xiaonanyang-db Does it mean that we don't have test coverage for it? I am wondering how I managed to merge a PR like this without a unit test 🤔. I will open a follow-up PR to improve testing.
[GitHub] [spark] sadikovi commented on a diff in pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
sadikovi commented on code in PR #37933: URL: https://github.com/apache/spark/pull/37933#discussion_r974905472 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala: ## @@ -224,7 +223,7 @@ class UnivocityParser( case NonFatal(e) => // If fails to parse, then tries the way used in 2.0 and 1.x for backwards // compatibility if enabled. -if (!enableParsingFallbackForTimestampType) { +if (!enableParsingFallbackForDateType) { Review Comment: Thanks for fixing it, the values were swapped.