[GitHub] spark pull request #21074: [SPARK-21811][SQL] Fix the inconsistency behavior...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21074

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21074#discussion_r182651253

Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala

```
@@ -572,6 +575,16 @@ class TypeCoercionSuite extends AnalysisTest {
       Coalesce(Seq(nullLit, floatNullLit, doubleLit, stringLit)),
       Coalesce(Seq(Cast(nullLit, StringType), Cast(floatNullLit, StringType),
         Cast(doubleLit, StringType), Cast(stringLit, StringType))))
+
+    ruleTest(rule,
+      Coalesce(Seq(timestampLit, intLit, stringLit)),
+      Coalesce(Seq(Cast(timestampLit, StringType), Cast(intLit, StringType),
+        Cast(stringLit, StringType))))
+
+    ruleTest(rule,
+      Coalesce(Seq(tsArrayLit, intArrayLit, strArrayLit)),
+      Coalesce(Seq(Cast(tsArrayLit, ArrayType(StringType)),
+        Cast(intArrayLit, ArrayType(StringType)), Cast(strArrayLit, ArrayType(StringType)))))
```

--- End diff --

We usually don't add end-to-end tests for type coercion changes; the type coercion logic is quite isolated, so it's very unlikely that a change passes the unit tests but fails an end-to-end test.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21074#discussion_r182595455

Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala

```
@@ -572,6 +575,16 @@ class TypeCoercionSuite extends AnalysisTest {
       Coalesce(Seq(nullLit, floatNullLit, doubleLit, stringLit)),
       Coalesce(Seq(Cast(nullLit, StringType), Cast(floatNullLit, StringType),
         Cast(doubleLit, StringType), Cast(stringLit, StringType))))
+
+    ruleTest(rule,
+      Coalesce(Seq(timestampLit, intLit, stringLit)),
+      Coalesce(Seq(Cast(timestampLit, StringType), Cast(intLit, StringType),
+        Cast(stringLit, StringType))))
+
+    ruleTest(rule,
+      Coalesce(Seq(tsArrayLit, intArrayLit, strArrayLit)),
+      Coalesce(Seq(Cast(tsArrayLit, ArrayType(StringType)),
+        Cast(intArrayLit, ArrayType(StringType)), Cast(strArrayLit, ArrayType(StringType)))))
```

--- End diff --

Could you add an end-to-end test case that can trigger this?
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21074#discussion_r182484023

Diff: docs/sql-programming-guide.md

```
@@ -1810,6 +1810,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
   - Since Spark 2.4, writing a dataframe with an empty or nested empty schema using any file formats (parquet, orc, json, text, csv etc.) is not allowed. An exception is thrown when attempting to write dataframes with empty schema.
   - Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set `false` to `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous behavior. This option will be removed in Spark 3.0.
   - Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. To set `true` to `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the previous behavior. This option will be removed in Spark 3.0.
+  - Since Spark 2.4, finding the widest common type for the arguments of a variadic function(e.g. IN/COALESCE) should always success when each of the types of arguments is either StringType or can be promoted to StringType. Previously this may throw an exception for some specific arguments ordering.
```

--- End diff --

> - Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, regardless of the order of the input arguments. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21074#discussion_r182458121

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

```
@@ -175,11 +175,27 @@ object TypeCoercion {
     })
   }

+  /**
+   * Whether the data type contains StringType.
+   */
+  def hasStringType(dt: DataType): Boolean = dt match {
+    case StringType => true
+    case ArrayType(et, _) => hasStringType(et)
+    // Add StructType if we support string promotion for struct fields in the future.
+    case _ => false
+  }
+
   private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
-    types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
-      case Some(d) => findWiderTypeForTwo(d, c)
-      case None => None
-    })
+    // findWiderTypeForTwo doesn't satisfy the associative law, i.e. (a op b) op c may not equal
+    // to a op (b op c). This is only a problem for StringType. Excluding StringType,
```

--- End diff --

`This is only a problem for StringType or nested StringType in ArrayType. Excluding these types, ...`
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21074#discussion_r182301592

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

```
@@ -176,10 +176,18 @@ object TypeCoercion {
   }

   private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
-    types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
-      case Some(d) => findWiderTypeForTwo(d, c)
-      case None => None
-    })
+    // findWiderTypeForTwo doesn't satisfy the associative law, i.e. (a op b) op c may not equal
+    // to a op (b op c). This is only a problem for StringType. Excluding StringType,
+    // findWiderTypeForTwo satisfies the associative law. For instance, (TimestampType,
+    // IntegerType, StringType) should have StringType as the wider common type.
+    val (stringTypes, nonStringTypes) = types.partition { t =>
+      t == StringType || t == ArrayType(StringType)
```

--- End diff --

we need something like

```
def hasStringType(dt: DataType): Boolean = dt match {
  case StringType => true
  case ArrayType(et, _) => hasStringType(et)
  // Add StructType if we support string promotion for struct fields in the future.
  case _ => false
}
```
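The recursive check sketched above can be exercised with a minimal standalone model. This is a simplified stand-in, not Spark's actual classes: Spark's real `ArrayType` also carries a `containsNull` flag, which is dropped here for brevity.

```scala
// Minimal stand-in for Spark's DataType hierarchy (simplified: the real
// ArrayType also has a containsNull flag).
sealed trait DataType
case object StringType extends DataType
case object IntegerType extends DataType
case object TimestampType extends DataType
case class ArrayType(elementType: DataType) extends DataType

// True if the type is StringType or an array (possibly nested)
// whose element type ultimately contains StringType.
def hasStringType(dt: DataType): Boolean = dt match {
  case StringType => true
  case ArrayType(et) => hasStringType(et)
  // Add StructType here if struct fields ever get string promotion.
  case _ => false
}

println(hasStringType(ArrayType(ArrayType(StringType)))) // true
println(hasStringType(ArrayType(IntegerType)))           // false
```

The recursion is what makes nested arrays like `array<array<string>>` work, which a flat equality check such as `t == ArrayType(StringType)` would miss.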
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/21074#discussion_r182295312

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

```
@@ -176,10 +176,16 @@ object TypeCoercion {
   }

   private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
-    types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
-      case Some(d) => findWiderTypeForTwo(d, c)
-      case None => None
-    })
+    // findWiderTypeForTwo doesn't satisfy the associative law, i.e. (a op b) op c may not equal
+    // to a op (b op c). This is only a problem for StringType. Excluding StringType,
+    // findWiderTypeForTwo satisfies the associative law. For instance, (TimestampType,
+    // IntegerType, StringType) should have StringType as the wider common type.
+    val (stringTypes, nonStringTypes) = types.partition(_ == StringType)
```

--- End diff --

It's expected to. Let me also fix it for array types, thanks!
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21074#discussion_r182284460

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

```
@@ -176,10 +176,16 @@ object TypeCoercion {
   }

   private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
-    types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
-      case Some(d) => findWiderTypeForTwo(d, c)
-      case None => None
-    })
+    // findWiderTypeForTwo doesn't satisfy the associative law, i.e. (a op b) op c may not equal
+    // to a op (b op c). This is only a problem for StringType. Excluding StringType,
+    // findWiderTypeForTwo satisfies the associative law. For instance, (TimestampType,
+    // IntegerType, StringType) should have StringType as the wider common type.
+    val (stringTypes, nonStringTypes) = types.partition(_ == StringType)
```

--- End diff --

Out of curiosity, does this work with array types too (array of string vs. array of non-string types)?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21074#discussion_r182100748

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

```
@@ -176,16 +176,16 @@ object TypeCoercion {
   }

   private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
-    types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
-      case Some(d) => findWiderTypeForTwo(d, c)
-      // Currently we find the wider common type by comparing the two types from left to right,
-      // this can be a problem when you have two data types which don't have a common type but each
-      // can be promoted to StringType. For instance, (TimestampType, IntegerType, StringType)
-      // should have StringType as the wider common type.
-      case None if types.exists(_ == StringType) &&
-        types.forall(stringPromotion(_, StringType).nonEmpty) => Some(StringType)
-      case _ => None
-    })
+    // `findWiderTypeForTwo` doesn't satisfy the associative law, i.e. (a op b) op c may not equal
+    // to a op (b op c). This is only a problem when each of the types is StringType or can be
```

--- End diff --

`This is only a problem for StringType`
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21074#discussion_r181620108

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

```
@@ -178,7 +178,13 @@ object TypeCoercion {
   private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
     types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
       case Some(d) => findWiderTypeForTwo(d, c)
-      case None => None
+      // Currently we find the wider common type by comparing the two types from left to right,
```

--- End diff --

This is a behavior change. We need to make it configurable. Add a conf and update the migration guide.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21074#discussion_r181619518

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

```
@@ -178,7 +178,13 @@ object TypeCoercion {
   private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
     types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
       case Some(d) => findWiderTypeForTwo(d, c)
-      case None => None
+      // Currently we find the wider common type by comparing the two types from left to right,
```

--- End diff --

The real problem is that `findWiderTypeForTwo` doesn't satisfy the associative law, i.e. `(a op b) op c` may not equal `a op (b op c)`. I think `StringType` is the only exception here, so it's clearer to do

```
val (stringTypes, nonStringTypes) = types.partition(_ == StringType)
(stringTypes.distinct ++ nonStringTypes).foldLeft...
```
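The reordering idea above can be illustrated with a standalone sketch. Everything below is a simplified stand-in for Spark's actual code: the `DataType` objects and `findWiderTypeForTwo` keep only the widening rules needed to show the effect (equal types, NullType, and promotion to StringType).

```scala
sealed trait DataType
case object NullType extends DataType
case object IntegerType extends DataType
case object TimestampType extends DataType
case object StringType extends DataType

// Simplified widening: equal types widen to themselves, NullType widens to
// anything, any type promotes to StringType; TimestampType and IntegerType
// have no common type with each other.
def findWiderTypeForTwo(a: DataType, b: DataType): Option[DataType] = (a, b) match {
  case (t1, t2) if t1 == t2 => Some(t1)
  case (NullType, t) => Some(t)
  case (t, NullType) => Some(t)
  case (StringType, _) | (_, StringType) => Some(StringType)
  case _ => None
}

// Move the string types to the front, so every later type is widened against
// StringType first; the left-to-right fold then no longer depends on the
// order of the remaining arguments.
def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
  val (stringTypes, nonStringTypes) = types.partition(_ == StringType)
  (stringTypes.distinct ++ nonStringTypes)
    .foldLeft[Option[DataType]](Some(NullType)) {
      case (Some(d), c) => findWiderTypeForTwo(d, c)
      case (None, _) => None
    }
}

// Both orderings now agree:
println(findWiderCommonType(Seq(TimestampType, IntegerType, StringType))) // Some(StringType)
println(findWiderCommonType(Seq(TimestampType, StringType, IntegerType))) // Some(StringType)
```

Without the `partition`, the second sequence reaches `findWiderTypeForTwo(TimestampType, IntegerType)` first, the accumulator becomes `None`, and the fold can never recover even though `StringType` appears later.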
GitHub user jiangxb1987 opened a pull request: https://github.com/apache/spark/pull/21074

[SPARK-21811][SQL] Fix the inconsistent behavior when finding the widest common type

## What changes were proposed in this pull request?

Currently we find the wider common type by comparing the types from left to right. This can be a problem when two data types don't have a common type but each can be promoted to StringType. For instance, given a table with the schema:

    [c1: date, c2: string, c3: int]

The following succeeds:

```
SELECT coalesce(c1, c2, c3) FROM table
```

While the following produces an exception:

```
SELECT coalesce(c1, c3, c2) FROM table
```

This is only an issue when the seq of data types contains `StringType` and all the types can do string promotion.

close #19033

## How was this patch tested?

Added tests in `TypeCoercionSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jiangxb1987/spark typeCoercion

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21074.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21074

commit 803a6a443ba9f7d3dc34d68b0d15f53c1b6054fb
Author: Xingbo Jiang
Date: 2018-04-15T15:12:16Z

    fix type coercion when promoting to StringType
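The order sensitivity this PR fixes can be reproduced with a minimal standalone model of the pre-fix left-to-right fold. The `DataType` objects and `findWiderTypeForTwo` below are simplified stand-ins (Spark's real string promotion covers more types), kept just large enough to show the asymmetry between the two `coalesce` calls.

```scala
sealed trait DataType
case object NullType extends DataType
case object IntegerType extends DataType
case object DateType extends DataType
case object StringType extends DataType

// Simplified widening rules: equal types widen to themselves, NullType
// widens to anything, any type promotes to StringType; DateType and
// IntegerType have no common type with each other.
def findWiderTypeForTwo(a: DataType, b: DataType): Option[DataType] = (a, b) match {
  case (t1, t2) if t1 == t2 => Some(t1)
  case (NullType, t) => Some(t)
  case (t, NullType) => Some(t)
  case (StringType, _) | (_, StringType) => Some(StringType)
  case _ => None
}

// The pre-fix fold: once an intermediate step yields None, the result is
// None forever, so the outcome depends on the argument order.
def findWiderCommonType(types: Seq[DataType]): Option[DataType] =
  types.foldLeft[Option[DataType]](Some(NullType)) {
    case (Some(d), c) => findWiderTypeForTwo(d, c)
    case (None, _) => None
  }

// coalesce(c1, c2, c3): date meets string first, everything widens to string.
println(findWiderCommonType(Seq(DateType, StringType, IntegerType))) // Some(StringType)
// coalesce(c1, c3, c2): date meets int first, the fold gets stuck at None.
println(findWiderCommonType(Seq(DateType, IntegerType, StringType))) // None
```

This non-associativity is exactly why the fix partitions the string types to the front before folding: with `StringType` processed first, every later pairing goes through the string-promotion case and both orderings yield the same result.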