[GitHub] spark pull request #21074: [SPARK-21811][SQL] Fix the inconsistency behavior...

2018-04-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21074


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21074: [SPARK-21811][SQL] Fix the inconsistency behavior...

2018-04-19 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21074#discussion_r182651253
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala ---
@@ -572,6 +575,16 @@ class TypeCoercionSuite extends AnalysisTest {
       Coalesce(Seq(nullLit, floatNullLit, doubleLit, stringLit)),
       Coalesce(Seq(Cast(nullLit, StringType), Cast(floatNullLit, StringType),
         Cast(doubleLit, StringType), Cast(stringLit, StringType))))
+
+    ruleTest(rule,
+      Coalesce(Seq(timestampLit, intLit, stringLit)),
+      Coalesce(Seq(Cast(timestampLit, StringType), Cast(intLit, StringType),
+        Cast(stringLit, StringType))))
+
+    ruleTest(rule,
+      Coalesce(Seq(tsArrayLit, intArrayLit, strArrayLit)),
+      Coalesce(Seq(Cast(tsArrayLit, ArrayType(StringType)),
+        Cast(intArrayLit, ArrayType(StringType)), Cast(strArrayLit, ArrayType(StringType)))))
--- End diff --

We usually don't add end-to-end tests for type coercion changes; the type coercion logic is pretty isolated, so it's very unlikely that we could pass the unit test but fail the end-to-end test.


---




[GitHub] spark pull request #21074: [SPARK-21811][SQL] Fix the inconsistency behavior...

2018-04-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21074#discussion_r182595455
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala ---
@@ -572,6 +575,16 @@ class TypeCoercionSuite extends AnalysisTest {
       Coalesce(Seq(nullLit, floatNullLit, doubleLit, stringLit)),
       Coalesce(Seq(Cast(nullLit, StringType), Cast(floatNullLit, StringType),
         Cast(doubleLit, StringType), Cast(stringLit, StringType))))
+
+    ruleTest(rule,
+      Coalesce(Seq(timestampLit, intLit, stringLit)),
+      Coalesce(Seq(Cast(timestampLit, StringType), Cast(intLit, StringType),
+        Cast(stringLit, StringType))))
+
+    ruleTest(rule,
+      Coalesce(Seq(tsArrayLit, intArrayLit, strArrayLit)),
+      Coalesce(Seq(Cast(tsArrayLit, ArrayType(StringType)),
+        Cast(intArrayLit, ArrayType(StringType)), Cast(strArrayLit, ArrayType(StringType)))))
--- End diff --

Could you add an end-to-end test case that can trigger this?


---




[GitHub] spark pull request #21074: [SPARK-21811][SQL] Fix the inconsistency behavior...

2018-04-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21074#discussion_r182484023
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1810,6 +1810,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
   - Since Spark 2.4, writing a dataframe with an empty or nested empty schema using any file formats (parquet, orc, json, text, csv etc.) is not allowed. An exception is thrown when attempting to write dataframes with empty schema.
   - Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set `false` to `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous behavior. This option will be removed in Spark 3.0.
   - Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. To set `true` to `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the previous behavior. This option will be removed in Spark 3.0.
+ - Since Spark 2.4, finding the widest common type for the arguments of a variadic function(e.g. IN/COALESCE) should always success when each of the types of arguments is either StringType or can be promoted to StringType. Previously this may throw an exception for some specific arguments ordering.
--- End diff --

> - Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, regardless of the order of the input arguments. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.



---




[GitHub] spark pull request #21074: [SPARK-21811][SQL] Fix the inconsistency behavior...

2018-04-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21074#discussion_r182458121
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -175,11 +175,27 @@ object TypeCoercion {
     })
   }
 
+  /**
+   * Whether the data type contains StringType.
+   */
+  def hasStringType(dt: DataType): Boolean = dt match {
+    case StringType => true
+    case ArrayType(et, _) => hasStringType(et)
+    // Add StructType if we support string promotion for struct fields in the future.
+    case _ => false
+  }
+
   private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
-    types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
-      case Some(d) => findWiderTypeForTwo(d, c)
-      case None => None
-    })
+    // findWiderTypeForTwo doesn't satisfy the associative law, i.e. (a op b) op c may not equal
+    // to a op (b op c). This is only a problem for StringType. Excluding StringType,
--- End diff --

`This is only a problem for StringType or nested StringType in ArrayType. Excluding these types, ...`


---




[GitHub] spark pull request #21074: [SPARK-21811][SQL] Fix the inconsistency behavior...

2018-04-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21074#discussion_r182301592
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -176,10 +176,18 @@ object TypeCoercion {
   }
 
   private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
-    types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
-      case Some(d) => findWiderTypeForTwo(d, c)
-      case None => None
-    })
+    // findWiderTypeForTwo doesn't satisfy the associative law, i.e. (a op b) op c may not equal
+    // to a op (b op c). This is only a problem for StringType. Excluding StringType,
+    // findWiderTypeForTwo satisfies the associative law. For instance, (TimestampType,
+    // IntegerType, StringType) should have StringType as the wider common type.
+    val (stringTypes, nonStringTypes) = types.partition { t =>
+      t == StringType || t == ArrayType(StringType)
--- End diff --

we need something like
```
def hasStringType(dt: DataType): Boolean = dt match {
  case StringType => true
  case ArrayType(et, _) => hasStringType(et)
  case _ => false // Add StructType if we support string promotion for struct fields in the future.
}
```
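For reference, the suggested helper can be exercised on its own. Below is a minimal sketch (assuming `spark-catalyst` is on the classpath for `org.apache.spark.sql.types`) showing that the recursion also covers arbitrarily nested arrays:

```scala
import org.apache.spark.sql.types._

// Mirror of the suggested helper: a type "has" StringType if it is StringType
// itself, or is an (arbitrarily nested) array whose element type has StringType.
def hasStringType(dt: DataType): Boolean = dt match {
  case StringType => true
  case ArrayType(et, _) => hasStringType(et)
  case _ => false // Add StructType if we support string promotion for struct fields.
}

assert(hasStringType(StringType))
assert(hasStringType(ArrayType(StringType)))
assert(hasStringType(ArrayType(ArrayType(StringType)))) // nested arrays recurse
assert(!hasStringType(ArrayType(IntegerType)))
assert(!hasStringType(TimestampType))
```

This is why the plain equality check `t == StringType || t == ArrayType(StringType)` is not enough: it misses `ArrayType(ArrayType(StringType))` and arrays with a non-default `containsNull` flag.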


---




[GitHub] spark pull request #21074: [SPARK-21811][SQL] Fix the inconsistency behavior...

2018-04-17 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21074#discussion_r182295312
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -176,10 +176,16 @@ object TypeCoercion {
   }
 
   private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
-    types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
-      case Some(d) => findWiderTypeForTwo(d, c)
-      case None => None
-    })
+    // findWiderTypeForTwo doesn't satisfy the associative law, i.e. (a op b) op c may not equal
+    // to a op (b op c). This is only a problem for StringType. Excluding StringType,
+    // findWiderTypeForTwo satisfies the associative law. For instance, (TimestampType,
+    // IntegerType, StringType) should have StringType as the wider common type.
+    val (stringTypes, nonStringTypes) = types.partition(_ == StringType)
--- End diff --

It's expected to; let me also fix it for array types. Thanks!


---




[GitHub] spark pull request #21074: [SPARK-21811][SQL] Fix the inconsistency behavior...

2018-04-17 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21074#discussion_r182284460
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -176,10 +176,16 @@ object TypeCoercion {
   }
 
   private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
-    types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
-      case Some(d) => findWiderTypeForTwo(d, c)
-      case None => None
-    })
+    // findWiderTypeForTwo doesn't satisfy the associative law, i.e. (a op b) op c may not equal
+    // to a op (b op c). This is only a problem for StringType. Excluding StringType,
+    // findWiderTypeForTwo satisfies the associative law. For instance, (TimestampType,
+    // IntegerType, StringType) should have StringType as the wider common type.
+    val (stringTypes, nonStringTypes) = types.partition(_ == StringType)
--- End diff --

Out of curiosity, does this work with array types too (array of string vs array of non-string types)?


---




[GitHub] spark pull request #21074: [SPARK-21811][SQL] Fix the inconsistency behavior...

2018-04-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21074#discussion_r182100748
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -176,16 +176,16 @@ object TypeCoercion {
   }
 
   private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
-    types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
-      case Some(d) => findWiderTypeForTwo(d, c)
-      // Currently we find the wider common type by comparing the two types from left to right,
-      // this can be a problem when you have two data types which don't have a common type but each
-      // can be promoted to StringType. For instance, (TimestampType, IntegerType, StringType)
-      // should have StringType as the wider common type.
-      case None if types.exists(_ == StringType) &&
-        types.forall(stringPromotion(_, StringType).nonEmpty) => Some(StringType)
-      case _ => None
-    })
+    // `findWiderTypeForTwo` doesn't satisfy the associative law, i.e. (a op b) op c may not equal
+    // to a op (b op c). This is only a problem when each of the types is StringType or can be
--- End diff --

`This is only a problem for StringType`


---




[GitHub] spark pull request #21074: [SPARK-21811][SQL] Fix the inconsistency behavior...

2018-04-15 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21074#discussion_r181620108
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -178,7 +178,13 @@ object TypeCoercion {
   private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
     types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
       case Some(d) => findWiderTypeForTwo(d, c)
-      case None => None
+      // Currently we find the wider common type by comparing the two types from left to right,
--- End diff --

This is a behavior change. We need to make it configurable. Add a conf and update the migration guide.


---




[GitHub] spark pull request #21074: [SPARK-21811][SQL] Fix the inconsistency behavior...

2018-04-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21074#discussion_r181619518
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -178,7 +178,13 @@ object TypeCoercion {
   private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
     types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
       case Some(d) => findWiderTypeForTwo(d, c)
-      case None => None
+      // Currently we find the wider common type by comparing the two types from left to right,
--- End diff --

The real problem is that `findWiderTypeForTwo` doesn't satisfy the associative law, i.e. `(a op b) op c` may not equal `a op (b op c)`. I think `StringType` is the only exception here, so it's clearer to do
```
val (stringType, nonStringType) = types.partition(_ == StringType)
(stringType.distinct ++ nonStringType).foldLeft...
```
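To see why pulling the string types to the front restores order-independence, here is a toy model of the idea. The types and the pairwise widening rule below are simplified stand-ins, not Spark's actual `TypeCoercion` code:

```scala
// Simplified stand-ins for DataType and findWiderTypeForTwo: numeric widening
// plus "anything promotes to StringT"; TimestampT vs IntT has no common type.
sealed trait T
case object IntT extends T
case object DoubleT extends T
case object TimestampT extends T
case object StringT extends T

def widerForTwo(a: T, b: T): Option[T] = (a, b) match {
  case (x, y) if x == y             => Some(x)
  case (IntT, DoubleT) | (DoubleT, IntT) => Some(DoubleT)
  case (StringT, _) | (_, StringT)  => Some(StringT)
  case _                            => None
}

// The suggested fix: partition out string types and fold them first, so every
// later pairwise step already sees StringT and therefore succeeds.
def widerCommonType(types: Seq[T]): Option[T] = {
  val (stringTypes, nonStringTypes) = types.partition(_ == StringT)
  val ordered = stringTypes.distinct ++ nonStringTypes
  ordered.drop(1).foldLeft[Option[T]](ordered.headOption) {
    case (Some(acc), t) => widerForTwo(acc, t)
    case (None, _)      => None
  }
}

// Left-to-right folding alone gets stuck on the first pair:
assert(widerForTwo(TimestampT, IntT).isEmpty)
// With the strings moved to the front, the result no longer depends on order:
assert(widerCommonType(Seq(TimestampT, IntT, StringT)) == Some(StringT))
assert(widerCommonType(Seq(StringT, TimestampT, IntT)) == Some(StringT))
```

The design point is that `widerForTwo` is associative on the non-string types, so only the string elements need to be hoisted; the later `hasStringType` discussion in this thread extends the same partition predicate to arrays of strings.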


---




[GitHub] spark pull request #21074: [SPARK-21811][SQL] Fix the inconsistency behavior...

2018-04-15 Thread jiangxb1987
GitHub user jiangxb1987 opened a pull request:

https://github.com/apache/spark/pull/21074

[SPARK-21811][SQL] Fix the inconsistency behavior when finding the widest common type

## What changes were proposed in this pull request?

Currently we find the wider common type by comparing the two types from left to right; this can be a problem when you have two data types which don't have a common type but each can be promoted to StringType.

For instance, if you have a table with the schema:

[c1: date, c2: string, c3: int]

The following succeeds:

SELECT coalesce(c1, c2, c3) FROM table

While the following produces an exception:

SELECT coalesce(c1, c3, c2) FROM table

This is only an issue when the seq of dataTypes contains `StringType` and all the types can do string promotion.

close #19033

## How was this patch tested?

Add test in `TypeCoercionSuite`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jiangxb1987/spark typeCoercion

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21074.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21074


commit 803a6a443ba9f7d3dc34d68b0d15f53c1b6054fb
Author: Xingbo Jiang 
Date:   2018-04-15T15:12:16Z

fix type coercion when promoting to StringType




---
