[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r150692098
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -468,14 +460,16 @@ object PartitioningUtils {
  }
 
  /**
-   * Given a collection of [[Literal]]s, resolves possible type conflicts by up-casting "lower"
-   * types.
+   * Given a collection of [[Literal]]s, resolves possible type conflicts by
+   * [[TypeCoercion.findWiderCommonType]]. See [[TypeCoercion.findWiderTypeForTwo]].
   */
  private def resolveTypeConflicts(literals: Seq[Literal], timeZone: TimeZone): Seq[Literal] = {
-    val desiredType = {
-      val topType = literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))
--- End diff --

Let me send the proposal to the dev list tonight (KST).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-11-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r150687775
  

Partition columns are different from normal type coercion cases: they are 
literally all string type, and we are just trying to find the most reasonable 
type for them.
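
The idea described above, finding the widest common type across all inferred partition-value types and falling back to string when none exists, can be sketched in plain Scala. This is a minimal illustration with no Spark dependency: the type names are stand-in strings and the promotion table is illustrative, not Spark's actual `TypeCoercion` rules.

```scala
object WidestTypeSketch {
  // Hypothetical two-type promotion table; Spark's real rules live in
  // TypeCoercion.findWiderTypeForTwo.
  def widerOfTwo(a: String, b: String): Option[String] = (a, b) match {
    case _ if a == b     => Some(a)
    case ("NullType", t) => Some(t)
    case (t, "NullType") => Some(t)
    case ("IntegerType", "LongType") | ("LongType", "IntegerType")     => Some("LongType")
    case ("DateType", "TimestampType") | ("TimestampType", "DateType") => Some("TimestampType")
    case _               => None // no sensible wider type
  }

  // Fold over all inferred types; any unresolvable pair falls back to string.
  def resolve(types: Seq[String]): String =
    types.reduceOption((a, b) => widerOfTwo(a, b).getOrElse("StringType"))
      .getOrElse("StringType") match {
        case "NullType" => "StringType" // all-null columns stay strings
        case t          => t
      }
}
```

Under this sketch, `resolve(Seq("DateType", "TimestampType"))` yields `TimestampType`, while `resolve(Seq("IntegerType", "DateType"))` has no sensible wider type and falls back to `StringType`.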

The previous behavior has been there since the very beginning, and I don't 
think it went through a decent discussion. This is the first time we have 
seriously designed the type merging logic for partition discovery. I don't 
think it needs to be blocked by the type coercion stabilization work, as the 
two can diverge.

@HyukjinKwon can you send the proposal to the dev list? I think we need more 
feedback; e.g., people may want stricter rules and more cases that fall back 
to string.


---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-11-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r150682875
  

Both the values and the schemas could change. External applications might 
break if the schema is different.

The new behaviors are consistent with what we do for the other type coercion 
cases. However, our implicit type casting and partition discovery rules are 
unstable, whereas other mature systems have clear, stable rules. Below is an 
example:
https://docs.oracle.com/cloud/latest/db112/SQLRF/sql_elements002.htm#g195937

If each release introduces new behaviors, Spark becomes hard to use for end 
users who rely on stable rules. Thus, my suggestion is to first stabilize our 
type coercion rules before addressing this. 

https://github.com/apache/spark/pull/18853 is the first PR attempt in this 
direction. 


---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-11-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r150665699
  

BTW, this is only used in partition discovery, and I can't think of a 
problematic case for it. The only thing it can break is this: users give Spark 
a data directory for partition discovery and make an assumption about the 
inferred partition schema. But then one could ask why those users had Spark do 
partition discovery in the first place.


---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-11-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r150664593
  

Do we really want to document the previous wrong behavior? For example, when 
merging `TimestampType` and `DateType`, the result is non-deterministic and 
depends on which partition path gets parsed first. How do we document that?


---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-11-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r150660671
  

We also need to document the behavior change in the doc.


---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-11-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r150659762
  

Since this changes the inferred schema, we can't make the change without a conf. 


---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-11-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r150568042
  

I think the new behavior is much better; the previous behavior seems just 
wrong. But it's very rare to see different data types in partition columns, 
which is probably why no users have opened tickets for it yet.

cc @gatorsmile


---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r150528361
  

So, for the types:

```scala
Seq(
  NullType, IntegerType, LongType, DoubleType,
  *DecimalType(...), DateType, TimestampType, StringType)
```


I produced the chart below with this code:

```scala
test("Print out chart") {
  val supportedTypes: Seq[DataType] = Seq(
    NullType, IntegerType, LongType, DoubleType,
    DecimalType(38, 0), DateType, TimestampType, StringType)

  val combinations = for {
    t1 <- supportedTypes
    t2 <- supportedTypes
  } yield Seq(t1, t2)

  // Old type conflict resolution:
  val upCastingOrder: Seq[DataType] =
    Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)

  def oldResolveTypeConflicts(dataTypes: Seq[DataType]): DataType = {
    val topType = dataTypes.maxBy(upCastingOrder.indexOf(_))
    if (topType == NullType) StringType else topType
  }

  // New type conflict resolution:
  def newResolveTypeConflicts(dataTypes: Seq[DataType]): DataType = {
    TypeCoercion.findWiderCommonType(dataTypes) match {
      case Some(NullType) => StringType
      case Some(dt: DataType) => dt
      case _ => StringType
    }
  }

  println("|Input types|Old output type|New output type|")
  println("|---|---|---|")
  combinations.foreach { pair =>
    val oldType = oldResolveTypeConflicts(pair)
    val newType = newResolveTypeConflicts(pair)

    if (oldType != newType) {
      println(s"|[`${pair(0)}`, `${pair(1)}`]|`$oldType`|`$newType`|")
    }
  }
}
```

So it looks like this PR makes the following changes in type resolution:

|Input types|Old output type|New output type|
|---|---|---|
|[`NullType`, `DecimalType(38,0)`]|`StringType`|`DecimalType(38,0)`|
|[`NullType`, `DateType`]|`StringType`|`DateType`|
|[`NullType`, `TimestampType`]|`StringType`|`TimestampType`|
|[`IntegerType`, `DecimalType(38,0)`]|`IntegerType`|`DecimalType(38,0)`|
|[`IntegerType`, `DateType`]|`IntegerType`|`StringType`|
|[`IntegerType`, `TimestampType`]|`IntegerType`|`StringType`|
|[`LongType`, `DecimalType(38,0)`]|`LongType`|`DecimalType(38,0)`|
|[`LongType`, `DateType`]|`LongType`|`StringType`|
|[`LongType`, `TimestampType`]|`LongType`|`StringType`|
|[`DoubleType`, `DateType`]|`DoubleType`|`StringType`|
|[`DoubleType`, `TimestampType`]|`DoubleType`|`StringType`|
|[`DecimalType(38,0)`, `NullType`]|`StringType`|`DecimalType(38,0)`|
|[`DecimalType(38,0)`, `IntegerType`]|`IntegerType`|`DecimalType(38,0)`|
|[`DecimalType(38,0)`, `LongType`]|`LongType`|`DecimalType(38,0)`|
|[`DecimalType(38,0)`, `DateType`]|`DecimalType(38,0)`|`StringType`|
|[`DecimalType(38,0)`, `TimestampType`]|`DecimalType(38,0)`|`StringType`|
|[`DateType`, `NullType`]|`StringType`|`DateType`|
|[`DateType`, `IntegerType`]|`IntegerType`|`StringType`|
|[`DateType`, `LongType`]|`LongType`|`StringType`|
|[`DateType`, `DoubleType`]|`DoubleType`|`StringType`|
|[`DateType`, `DecimalType(38,0)`]|`DateType`|`StringType`|
|[`DateType`, `TimestampType`]|`DateType`|`TimestampType`|
|[`TimestampType`, `NullType`]|`StringType`|`TimestampType`|
|[`TimestampType`, `IntegerType`]|`IntegerType`|`StringType`|
|[`TimestampType`, `LongType`]|`LongType`|`StringType`|
|[`TimestampType`, `DoubleType`]|`DoubleType`|`StringType`|
|[`TimestampType`, `DecimalType(38,0)`]|`TimestampType`|`StringType`|


---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-11-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r150528207
  

I think these will be the input types for `resolveTypeConflicts`:

```scala
Seq(
  NullType, IntegerType, LongType, DoubleType,
  *DecimalType(...), DateType, TimestampType, StringType)
```

*`DecimalType` only when it's bigger than `LongType`:


https://github.com/apache/spark/blob/04975a68b583a6175f93da52374108e5d4754d9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L384-L393
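
The footnote above, that a decimal is inferred only when the value does not fit in a long, can be sketched as a try-in-order inference. This is plain Scala with hypothetical names, not Spark's actual `inferPartitionColumnValue` (linked above), and the date/timestamp attempts are omitted for brevity.

```scala
import scala.util.Try

object PartitionValueInference {
  // Try the narrowest type first; fall back step by step.
  def inferType(raw: String): String =
    Try { raw.toInt; "IntegerType" }
      .orElse(Try { raw.toLong; "LongType" })
      .orElse(Try {
        // Integral values too large for Long become decimals; fractional
        // values (scale > 0) are rejected here and fall through to the
        // double attempt below.
        val d = new java.math.BigDecimal(raw)
        require(d.scale <= 0)
        s"DecimalType(${d.precision}, 0)"
      })
      .orElse(Try { raw.toDouble; "DoubleType" })
      .getOrElse("StringType")
}
```

With this, `inferType("1")` is `IntegerType`, while a 30-digit value like `"1" * 30` overflows Long and becomes `DecimalType(30, 0)`, matching the decimal example in the PR description.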



Because this particular `resolveTypeConflicts` seems to be called only through:


https://github.com/apache/spark/blob/04975a68b583a6175f93da52374108e5d4754d9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L142


https://github.com/apache/spark/blob/04975a68b583a6175f93da52374108e5d4754d9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L337


https://github.com/apache/spark/blob/04975a68b583a6175f93da52374108e5d4754d9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L474

In the first call, I see that `pathsWithPartitionValues` is constructed from 
`partitionValues`, which is the output of `parsePartition`:


https://github.com/apache/spark/blob/04975a68b583a6175f93da52374108e5d4754d9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L108

which parses the input by `parsePartitionColumn`:


https://github.com/apache/spark/blob/04975a68b583a6175f93da52374108e5d4754d9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L209

which calls this `inferPartitionColumnValue`:



https://github.com/apache/spark/blob/04975a68b583a6175f93da52374108e5d4754d9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L384-L428




---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r150443594
  

Let me try to describe soon what this PR explicitly changes.


---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-11-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r150426911
  

I think it's a bug: the previous code didn't consider the case where the input 
literal types are outside of `upCastingOrder`, and it just picks the first 
type as the final type.

However, I'm not sure what the expected behavior should be. We need to figure 
out what the possible data types for partition columns are, and how to merge 
them.
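
The `upCastingOrder` problem described here can be reproduced in plain Scala (no Spark; type names are stand-in strings): `indexOf` returns -1 for any type outside the order, so such types all tie as "lowest" and `maxBy` keeps whichever came first.

```scala
object UpCastingOrderBug {
  // Stand-ins for Spark's DataType objects.
  val upCastingOrder: Seq[String] =
    Seq("NullType", "IntegerType", "LongType", "FloatType", "DoubleType", "StringType")

  // The old resolution: pick the type with the highest position in the order.
  // Types not in the order get index -1, so they tie; maxBy keeps the first.
  def oldResolve(types: Seq[String]): String =
    types.maxBy(upCastingOrder.indexOf(_))
}
```

So `oldResolve(Seq("DateType", "TimestampType"))` returns `DateType`, while `oldResolve(Seq("TimestampType", "DateType"))` returns `TimestampType`: the result depends on input order, i.e. on which partition path was parsed first.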


---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-09-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r14268
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetPartitionDiscoverySuite.scala ---
@@ -1067,4 +1098,24 @@ class ParquetPartitionDiscoverySuite extends QueryTest with ParquetTest with Sha
      checkAnswer(spark.read.load(path.getAbsolutePath), df)
    }
  }
+
+  test("Resolve type conflicts - decimals, dates and timestamps in partition column") {
+    withTempPath { path =>
+      val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:01:00")).toDF("i", "ts")
+      df.write.format("parquet").partitionBy("ts").save(path.getAbsolutePath)
--- End diff --

I think `2016-01-01 00:01:00` was placed there when calling 
`resolveTypeConflicts`. Let me try to make this test case not depend on this.


---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-09-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r141999696
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -149,7 +149,7 @@ object TypeCoercion {
    })
  }
 
-  private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
+  def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
--- End diff --

Actually, these were removed in 
[SPARK-16813](https://issues.apache.org/jira/browse/SPARK-16813) and 
https://github.com/apache/spark/pull/14418 :).


---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-09-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r141999620
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetPartitionDiscoverySuite.scala ---
@@ -249,6 +249,11 @@ class ParquetPartitionDiscoverySuite extends QueryTest with ParquetTest with Sha
      true,
      rootPaths,
      timeZoneId)
+      assert(actualSpec.partitionColumns === spec.partitionColumns)
+      assert(actualSpec.partitions.length === spec.partitions.length)
+      actualSpec.partitions.zip(spec.partitions).foreach { case (actual, expected) =>
+        assert(actual === expected)
+      }
--- End diff --

Yes, it is. It was difficult for me to find which one was different.


---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-09-29 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r141947058
  

Just for my own understanding -- these added asserts are just to improve 
failure messages, right? They would have all been covered by the 
`assert(actualSpec === spec)` below anyway, right?


---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-09-29 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r141948681
  

This passes even before the change, right?

As I mentioned on the other PR, I don't really understand why this works 
despite the issue you're fixing. Regardless, it seems like a good test to have.


---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-09-29 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/19389#discussion_r141949009
  

surprisingly, `TypeCoercion` is totally public, so this should probably be 
`private[sql]`.


---




[GitHub] spark pull request #19389: [SPARK-22165][SQL] Resolve type conflicts between...

2017-09-28 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/19389

[SPARK-22165][SQL] Resolve type conflicts between decimals, dates and 
timestamps in partition column

## What changes were proposed in this pull request?

This PR proposes to re-use `TypeCoercion.findWiderCommonType` when 
resolving type conflicts in partition values. Currently, this uses a numeric 
precedence-like comparison and therefore appears to introduce failures for 
type conflicts between timestamps, dates and decimals; please see:

```scala
private val upCastingOrder: Seq[DataType] =
  Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)
...
literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))
```

The code below:


```scala
val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts")
df.write.format("parquet").partitionBy("ts").save("/tmp/foo")
spark.read.load("/tmp/foo").printSchema()

val df = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal")
df.write.format("parquet").partitionBy("decimal").save("/tmp/bar")
spark.read.load("/tmp/bar").printSchema()
```

produces the following output:

**Before**

```
root
 |-- i: integer (nullable = true)
 |-- ts: date (nullable = true)

root
 |-- i: integer (nullable = true)
 |-- decimal: integer (nullable = true)
```

**After**

```
root
 |-- i: integer (nullable = true)
 |-- ts: timestamp (nullable = true)

root
 |-- i: integer (nullable = true)
 |-- decimal: decimal(30,0) (nullable = true)
```

## How was this patch tested?

Unit tests added in `ParquetPartitionDiscoverySuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark partition-type-coercion

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19389.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19389


commit 1e10336128bf1e78a889ee4438e4519bb12bd84a
Author: hyukjinkwon 
Date:   2017-09-29T05:18:05Z

Resolve type conflicts between decimals, dates and timestamps in partition 
column




---
