[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r150692098

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---

```diff
@@ -468,14 +460,16 @@ object PartitioningUtils {
   }

   /**
-   * Given a collection of [[Literal]]s, resolves possible type conflicts by up-casting "lower"
-   * types.
+   * Given a collection of [[Literal]]s, resolves possible type conflicts by
+   * [[TypeCoercion.findWiderCommonType]]. See [[TypeCoercion.findWiderTypeForTwo]].
    */
   private def resolveTypeConflicts(literals: Seq[Literal], timeZone: TimeZone): Seq[Literal] = {
-    val desiredType = {
-      val topType = literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))
```

--- End diff ---

Let me send the proposal to the dev list tonight (KST).
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r150687775

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala (same `resolveTypeConflicts` hunk quoted above) ---

Partitioned columns are different from normal type coercion cases: the raw values are literally all strings, and we are just trying to find the most reasonable type for them. The previous behavior has been there since the very beginning and, I think, never went through a decent discussion. This is the first time we seriously design the type-merging logic for partition discovery, so I don't think it needs to be blocked by the type coercion stabilization work; the two can diverge.

@HyukjinKwon can you send the proposal to the dev list? I think we need more feedback, e.g. people may want stricter rules and more cases that fall back to string.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r150682875

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala (same `resolveTypeConflicts` hunk quoted above) ---

Both the values and the schemas could change, and external applications might break if the schema is different. The new behavior is consistent with what we do for the other type coercion cases. However, our implicit type casting and partition discovery rules are unstable, while other mature systems have clear, stable rules about this. Below is an example:

https://docs.oracle.com/cloud/latest/db112/SQLRF/sql_elements002.htm#g195937

If each release introduces new behaviors, Spark becomes hard to use for end users who rely on such expectations. Thus, my suggestion is to first stabilize our type coercion rules before addressing this. https://github.com/apache/spark/pull/18853 is the first PR attempt in this direction.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r150665699

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala (same `resolveTypeConflicts` hunk quoted above) ---

BTW this is only used in partition discovery, and I can't think of a problematic case for it. The only thing it can break is: a user gives Spark a data directory for partition discovery and makes an assumption about the inferred partition schema. But then we can argue: why did the user ask Spark to do partition discovery in the first place?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r150664593

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala (same `resolveTypeConflicts` hunk quoted above) ---

Do we really want to document the previous wrong behavior in the doc? An example is merging `TimestampType` and `DateType`: the result is non-deterministic and depends on which partition path gets parsed first. How do we document that?
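The non-determinism comes from how `maxBy` breaks ties. A minimal self-contained sketch (using plain strings as stand-ins for Spark's `DataType` objects, which is an assumption for illustration, not Spark's actual code):

```scala
object UpCastingOrderTieDemo extends App {
  // The old ordering; DateType and TimestampType are absent from it.
  val upCastingOrder =
    Seq("NullType", "IntegerType", "LongType", "FloatType", "DoubleType", "StringType")

  // indexOf returns -1 for any type not in the list, so all absent types tie,
  // and maxBy keeps the first maximal element it encounters.
  def oldResolve(types: Seq[String]): String =
    types.maxBy(upCastingOrder.indexOf(_))

  println(oldResolve(Seq("TimestampType", "DateType")))  // TimestampType
  println(oldResolve(Seq("DateType", "TimestampType")))  // DateType
}
```

So the "winning" type flips depending on the order in which the partition paths were parsed, which is exactly the behavior that is hard to document.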
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r150660671

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala (same `resolveTypeConflicts` hunk quoted above) ---

We also need to document the behavior change in the doc.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r150659762

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala (same `resolveTypeConflicts` hunk quoted above) ---

Since this changes the schema, we can't change it without a conf.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r150568042

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala (same `resolveTypeConflicts` hunk quoted above) ---

I think the new behavior is much better. The previous behavior seems just wrong, but it's very rare to see different data types in partition columns, which is why no users have opened tickets for it yet. cc @gatorsmile
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r150528361

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala (same `resolveTypeConflicts` hunk quoted above) ---

So, for the types:

```scala
Seq(
  NullType, IntegerType, LongType, DoubleType,
  *DecimalType(...), DateType, TimestampType, StringType)
```

I produced a chart as below with this code:

```scala
test("Print out chart") {
  val supportedTypes: Seq[DataType] = Seq(
    NullType, IntegerType, LongType, DoubleType,
    DecimalType(38, 0), DateType, TimestampType, StringType)

  val combinations = for {
    t1 <- supportedTypes
    t2 <- supportedTypes
  } yield Seq(t1, t2)

  // Old type conflict resolution:
  val upCastingOrder: Seq[DataType] =
    Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)

  def oldResolveTypeConflicts(dataTypes: Seq[DataType]): DataType = {
    val topType = dataTypes.maxBy(upCastingOrder.indexOf(_))
    if (topType == NullType) StringType else topType
  }

  // New type conflict resolution:
  def newResolveTypeConflicts(dataTypes: Seq[DataType]): DataType = {
    TypeCoercion.findWiderCommonType(dataTypes) match {
      case Some(NullType) => StringType
      case Some(dt: DataType) => dt
      case _ => StringType
    }
  }

  println("|Input types|Old output type|New output type|")
  println("|---|---|---|")
  combinations.foreach { pair =>
    val oldType = oldResolveTypeConflicts(pair)
    val newType = newResolveTypeConflicts(pair)
    if (oldType != newType) {
      println(s"|[`${pair(0)}`, `${pair(1)}`]|`$oldType`|`$newType`|")
    }
  }
}
```

So this PR changes the type resolution as below:

|Input types|Old output type|New output type|
|---|---|---|
|[`NullType`, `DecimalType(38,0)`]|`StringType`|`DecimalType(38,0)`|
|[`NullType`, `DateType`]|`StringType`|`DateType`|
|[`NullType`, `TimestampType`]|`StringType`|`TimestampType`|
|[`IntegerType`, `DecimalType(38,0)`]|`IntegerType`|`DecimalType(38,0)`|
|[`IntegerType`, `DateType`]|`IntegerType`|`StringType`|
|[`IntegerType`, `TimestampType`]|`IntegerType`|`StringType`|
|[`LongType`, `DecimalType(38,0)`]|`LongType`|`DecimalType(38,0)`|
|[`LongType`, `DateType`]|`LongType`|`StringType`|
|[`LongType`, `TimestampType`]|`LongType`|`StringType`|
|[`DoubleType`, `DateType`]|`DoubleType`|`StringType`|
|[`DoubleType`, `TimestampType`]|`DoubleType`|`StringType`|
|[`DecimalType(38,0)`, `NullType`]|`StringType`|`DecimalType(38,0)`|
|[`DecimalType(38,0)`, `IntegerType`]|`IntegerType`|`DecimalType(38,0)`|
|[`DecimalType(38,0)`, `LongType`]|`LongType`|`DecimalType(38,0)`|
|[`DecimalType(38,0)`, `DateType`]|`DecimalType(38,0)`|`StringType`|
|[`DecimalType(38,0)`, `TimestampType`]|`DecimalType(38,0)`|`StringType`|
|[`DateType`, `NullType`]|`StringType`|`DateType`|
|[`DateType`, `IntegerType`]|`IntegerType`|`StringType`|
|[`DateType`, `LongType`]|`LongType`|`StringType`|
|[`DateType`, `DoubleType`]|`DoubleType`|`StringType`|
|[`DateType`, `DecimalType(38,0)`]|`DateType`|`StringType`|
|[`DateType`, `TimestampType`]|`DateType`|`TimestampType`|
|[`TimestampType`, `NullType`]|`StringType`|`TimestampType`|
|[`TimestampType`, `IntegerType`]|`IntegerType`|`StringType`|
|[`TimestampType`, `LongType`]|`LongType`|`StringType`|
|[`TimestampType`, `DoubleType`]|`DoubleType`|`StringType`|
|[`TimestampType`, `DecimalType(38,0)`]|`TimestampType`|`StringType`|
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r150528207

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala (same `resolveTypeConflicts` hunk quoted above) ---

I think we will have these input types for `resolveTypeConflicts`:

```scala
Seq(
  NullType, IntegerType, LongType, DoubleType,
  *DecimalType(...), DateType, TimestampType, StringType)
```

*`DecimalType` only when it's bigger than `LongType`:
https://github.com/apache/spark/blob/04975a68b583a6175f93da52374108e5d4754d9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L384-L393

Because this particular `resolveTypeConflicts` seems to be called only through:

https://github.com/apache/spark/blob/04975a68b583a6175f93da52374108e5d4754d9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L142
https://github.com/apache/spark/blob/04975a68b583a6175f93da52374108e5d4754d9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L337
https://github.com/apache/spark/blob/04975a68b583a6175f93da52374108e5d4754d9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L474

In the first call, `pathsWithPartitionValues` is constructed from `partitionValues`, which is the output of `parsePartition`:
https://github.com/apache/spark/blob/04975a68b583a6175f93da52374108e5d4754d9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L108

which parses the input with `parsePartitionColumn`:
https://github.com/apache/spark/blob/04975a68b583a6175f93da52374108e5d4754d9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L209

which calls this `inferPartitionColumnValue`:
https://github.com/apache/spark/blob/04975a68b583a6175f93da52374108e5d4754d9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L384-L428
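The call chain above boils down to: every raw partition value is a string, each value is inferred to a type individually, and the per-value types are then merged. A compressed, self-contained sketch of that pipeline, with hypothetical helper names over a reduced set of types (this is not Spark's actual code, which also handles decimals, dates, timestamps, and time zones):

```scala
object PartitionInferenceSketch extends App {
  import scala.util.Try

  // Hypothetical stand-in for inferPartitionColumnValue: every raw partition
  // value arrives as a string; narrower types are tried first.
  def inferType(raw: String): String =
    if (Try(raw.toInt).isSuccess) "IntegerType"
    else if (Try(raw.toLong).isSuccess) "LongType"
    else if (Try(raw.toDouble).isSuccess) "DoubleType"
    else "StringType"

  // Hypothetical stand-in for TypeCoercion.findWiderCommonType over this
  // reduced chain of types: take the wider member of each pair.
  val widening = Seq("IntegerType", "LongType", "DoubleType", "StringType")
  def widen(a: String, b: String): String =
    widening(math.max(widening.indexOf(a), widening.indexOf(b)))

  // One partition column observed across three directory paths:
  val rawValues = Seq("1", "30000000000", "2.5")
  val merged = rawValues.map(inferType).reduce(widen)
  println(merged)  // DoubleType
}
```

The real `findWiderCommonType` works over a partial order rather than a total chain, which is why unrelated pairs such as `IntegerType`/`DateType` fall back to `StringType` in the chart above.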
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r150443594

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala (same `resolveTypeConflicts` hunk quoted above) ---

Let me try to describe what this PR explicitly changes soon.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r150426911

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala (same `resolveTypeConflicts` hunk quoted above) ---

I think it's a bug: the previous code didn't consider the case where input literal types fall outside the `upCastingOrder`, and it just picks the first type as the final type. However, I'm not sure what the expected behavior is. We need to figure out what the possible data types for partition columns are, and how to merge them.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r14268

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetPartitionDiscoverySuite.scala ---

```diff
@@ -1067,4 +1098,24 @@ class ParquetPartitionDiscoverySuite extends QueryTest with ParquetTest with Sha
       checkAnswer(spark.read.load(path.getAbsolutePath), df)
     }
   }
+
+  test("Resolve type conflicts - decimals, dates and timestamps in partition column") {
+    withTempPath { path =>
+      val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:01:00")).toDF("i", "ts")
+      df.write.format("parquet").partitionBy("ts").save(path.getAbsolutePath)
```

--- End diff ---

I think `2016-01-01 00:01:00` happened to be the value in place when `resolveTypeConflicts` was called. Let me try to make this test case not depend on that.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r141999696

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---

```diff
@@ -149,7 +149,7 @@ object TypeCoercion {
     })
   }

-  private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
+  def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
```

--- End diff ---

Actually, these were removed in [SPARK-16813](https://issues.apache.org/jira/browse/SPARK-16813) and https://github.com/apache/spark/pull/14418 :).
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r141999620

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetPartitionDiscoverySuite.scala ---

```diff
@@ -249,6 +249,11 @@ class ParquetPartitionDiscoverySuite extends QueryTest with ParquetTest with Sha
         true,
         rootPaths,
         timeZoneId)
+      assert(actualSpec.partitionColumns === spec.partitionColumns)
+      assert(actualSpec.partitions.length === spec.partitions.length)
+      actualSpec.partitions.zip(spec.partitions).foreach { case (actual, expected) =>
+        assert(actual === expected)
+      }
```

--- End diff ---

Yes, it is. It was difficult for me to find which one was different.
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r141947058

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetPartitionDiscoverySuite.scala (same asserts hunk quoted above) ---

Just for my own understanding -- these added asserts are just to improve failure messages, right? They would have all been covered by the `assert(actualSpec === spec)` below anyway, right?
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r141948681

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetPartitionDiscoverySuite.scala (same "Resolve type conflicts" test hunk quoted above) ---

This passes even before the change, right? As I mentioned on the other PR, I don't really understand why this works despite the issue you're fixing. Regardless, it seems like a good test to have.
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/19389#discussion_r141949009

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (same `findWiderCommonType` hunk quoted above) ---

Surprisingly, `TypeCoercion` is totally public, so this should probably be `private[sql]`.
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/19389

[SPARK-22165][SQL] Resolve type conflicts between decimals, dates and timestamps in partition column

## What changes were proposed in this pull request?

This PR proposes to re-use `TypeCoercion.findWiderCommonType` when resolving type conflicts in partition values. Currently, the resolution uses a numeric-precedence-like comparison, so it produces incorrect results for type conflicts between timestamps, dates and decimals; please see:

```scala
private val upCastingOrder: Seq[DataType] =
  Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)
...
literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))
```

The code below:

```scala
val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts")
df.write.format("parquet").partitionBy("ts").save("/tmp/foo")
spark.read.load("/tmp/foo").printSchema()

val df = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal")
df.write.format("parquet").partitionBy("decimal").save("/tmp/bar")
spark.read.load("/tmp/bar").printSchema()
```

produces the output below:

**Before**

```
root
 |-- i: integer (nullable = true)
 |-- ts: date (nullable = true)

root
 |-- i: integer (nullable = true)
 |-- decimal: integer (nullable = true)
```

**After**

```
root
 |-- i: integer (nullable = true)
 |-- ts: timestamp (nullable = true)

root
 |-- i: integer (nullable = true)
 |-- decimal: decimal(30,0) (nullable = true)
```

## How was this patch tested?

Unit tests added in `ParquetPartitionDiscoverySuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark partition-type-coercion

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19389.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19389

commit 1e10336128bf1e78a889ee4438e4519bb12bd84a
Author: hyukjinkwon
Date: 2017-09-29T05:18:05Z

    Resolve type conflicts between decimals, dates and timestamps in partition column