[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-12 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r201919906
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala ---
@@ -48,7 +48,8 @@ case class CreateArray(children: Seq[Expression]) extends Expression {
 
   override def dataType: ArrayType = {
--- End diff --

Because the data type of `CreateArray` itself is not the merged type but `ArrayType`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r201902635
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala ---
@@ -48,7 +48,8 @@ case class CreateArray(children: Seq[Expression]) extends Expression {
 
   override def dataType: ArrayType = {
--- End diff --

Why doesn't `CreateArray` extend `ComplexTypeMergingExpression`?


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r201902242
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -610,27 +636,27 @@ object TypeCoercion {
       // Coalesce should return the first non-null value, which could be any column
       // from the list. So we need to make sure the return type is deterministic and
       // compatible with every child column.
-      case c @ Coalesce(es) if !haveSameType(es) =>
+      case c @ Coalesce(es) if !c.areInputTypesForMergingEqual =>
         val types = es.map(_.dataType)
         findWiderCommonType(types) match {
-          case Some(finalDataType) => Coalesce(es.map(Cast(_, finalDataType)))
+          case Some(finalDataType) => Coalesce(es.map(castIfNotSameType(_, finalDataType)))
           case None => c
         }

       // When finding wider type for `Greatest` and `Least`, we should handle decimal types even if
       // we need to truncate, but we should not promote one side to string if the other side is
       // string.
-      case g @ Greatest(children) if !haveSameType(children) =>
+      case g @ Greatest(children) if !g.areInputTypesForMergingEqual =>
--- End diff --

These changes are not just about `Concat`; can you update the PR title?


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-11 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r201900374
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -533,15 +559,15 @@ object TypeCoercion {
       case a @ CreateArray(children) if !haveSameType(children) =>
         val types = children.map(_.dataType)
         findWiderCommonType(types) match {
-          case Some(finalDataType) => CreateArray(children.map(Cast(_, finalDataType)))
+          case Some(finalDataType) => CreateArray(children.map(castIfNotSameType(_, finalDataType)))
--- End diff --

Currently the optimizer doesn't remove the cast when the only difference from the `finalDataType` is the nullability of nested types.


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-05 Thread mn-mikke
Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200265965
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
--- End diff --

Please ignore the part of my previous comment regarding the ```flatten``` function. The output data type of ```concat```, etc. will be the same regardless of what resolves the ```null``` flags.


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200216296
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
+      case dt => dt
+    }.getOrElse(StringType)
+  }
--- End diff --

+1 for the @mn-mikke idea


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread mn-mikke
Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200134825
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
--- End diff --

@ueshin For ```Concat```, ```Coalesce```, etc., that seems to be the case, since a coercion rule is executed if there is any nullability difference at any level of nesting. But it's not the case for the ```CaseWhenCoercion``` rule, since the ```sameType``` method is used for comparison.

I'm wondering: if the goal is to avoid generating extra ```Cast``` expressions, shouldn't other coercion rules utilize the ```sameType``` method as well? Assuming the result of ```concat``` is subsequently used by ```flatten```, wouldn't it lead to the generation of extra null-safe checks, as mentioned [here](https://github.com/apache/spark/pull/21704#discussion_r200110924)?


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200127250
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
+      case dt => dt
+    }.getOrElse(StringType)
+  }
--- End diff --

SGTM


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200125302
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
--- End diff --

I guess the inner nullabilities are coerced during type coercion? If the inner nullabilities are different, type coercion adds casts, and those casts will remain.


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread mn-mikke
Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200116898
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
--- End diff --

yeah, + ```valueContainsNull``` for ```MapType``` and ```nullable``` for 
```StructField```


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread mn-mikke
Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200116003
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
+      case dt => dt
+    }.getOrElse(StringType)
+  }
--- End diff --

Oh, I see. In that case, it would be nice to introduce a method that resolves the output DataType and merges the ```nullable```/```containsNull``` flags of non-primitive types recursively for such expressions. In most cases we could encapsulate the function in a new trait. WDYT?


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200115019
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
--- End diff --

then shall we fix the `containsNull` for the inner array?


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread mn-mikke
Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200113984
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
--- End diff --

Yes, it should work (see a [test](https://github.com/ueshin/apache-spark/blob/30d5aed41085cfb82e3da43099adf972b953ece8/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala#L951) for it). Did we miss anything?


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200111312
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
--- End diff --

do we support array of array in concat?


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200110924
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
+      case dt => dt
+    }.getOrElse(StringType)
+  }
--- End diff --

> what about changing SimplifyCasts rule to start replacing Cast with a new dummy cast expression that will hold only a target data type and won't perform any casting?

This could work, but my point is that we should not add the cast just to change `containsNull`, as it may not match the underlying child expression and may generate unnecessary null-check code.


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200060536
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
+      case dt => dt
+    }.getOrElse(StringType)
+  }
--- End diff --

Yeah, `Coalesce` should be fixed, and `Least` and `Greatest` are also 
suspicious.


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread mn-mikke
Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200054252
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
+      case dt => dt
+    }.getOrElse(StringType)
+  }
--- End diff --

So far, we've also identified ```CaseWhen``` and ```If```, discussed [here](https://github.com/apache/spark/pull/21687). I've just noticed that ```Coalesce``` looks suspicious as well.

What is the key purpose of ```SimplifyCasts```: to remove an extra expression node, or to avoid casts between identical types? If the latter is the purpose, what about changing the ```SimplifyCasts``` rule to start replacing ```Cast``` with a new dummy cast expression that holds only a target data type and doesn't perform any casting?


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200053804
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
+      case dt => dt
+    }.getOrElse(StringType)
+  }
--- End diff --

Currently we've found `CaseWhen` and `If`, as discussed at #21687. At a glance, there seem to be no other such expressions among the collection functions.


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200042880
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
+      case dt => dt
+    }.getOrElse(StringType)
+  }
--- End diff --

I think `SimplifyCasts` is valid, but the concat type coercion rule is sub-optimal. We should not force all the children to have the same `containsNull`, which may add unnecessary null checks inside `Concat`.

BTW, are there more expressions like `Concat` for which we need to fix `containsNull`?


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200026633
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
+      case dt => dt
+    }.getOrElse(StringType)
+  }
--- End diff --

Hmm, that also makes sense. I'm not sure we can remove the simplification, 
though. cc @gatorsmile @cloud-fan 


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200025173
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
+      case dt => dt
+    }.getOrElse(StringType)
+  }
--- End diff --

Aha, I see. But I have a hunch that `SimplifyCasts` cannot simplify array casts in some cases, e.g., this concat case. Since we basically cannot change semantics in the optimization phase, I feel a little weird about this simplification.

Anyway, I'm ok with your approach because I can't find a better & simpler way to solve this in the analysis phase... Thanks!


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-04 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r200024783
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
+      case dt => dt
+    }.getOrElse(StringType)
+  }
--- End diff --

Added a test to show the wrong nullability.


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-03 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r19346
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
+      case dt => dt
+    }.getOrElse(StringType)
+  }
--- End diff --

Actually, `Concat` for array type has a type coercion rule that adds casts to make all children the same type, but we also have the `SimplifyCasts` optimization to remove unnecessary casts, which might remove casts from arrays that don't contain null to arrays that do ([optimizer/expressions.scala#L611](https://github.com/apache/spark/blob/d87a8c6c0d1a4db5c9444781160a65562f8ea738/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L611)).

E.g., `concat(array(1,2,3), array(4,null,6))` might produce a wrong data type during execution.
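As a rough standalone model of this interaction (simplified stand-in types and a hypothetical `simplifyCast` helper, not Spark's actual optimizer), the point is that removing a cast that only widens `containsNull` re-exposes the child's narrower type:

```scala
// Simplified stand-in types (hypothetical, not Spark's real classes).
sealed trait DataType
case object IntegerType extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

sealed trait Expression { def dataType: DataType }
case class ArrayLiteral(dataType: ArrayType) extends Expression
case class Cast(child: Expression, dataType: DataType) extends Expression

// A SimplifyCasts-style rule: drop the cast when it only widens containsNull
// (such a cast can never fail, so it looks removable).
def simplifyCast(e: Expression): Expression = e match {
  case Cast(child, ArrayType(et, true))
      if child.dataType == ArrayType(et, containsNull = false) => child
  case other => other
}

// Type coercion casts the null-free array up to containsNull = true...
val coerced = Cast(ArrayLiteral(ArrayType(IntegerType, containsNull = false)),
                   ArrayType(IntegerType, containsNull = true))

// ...but the optimizer removes the cast, so the expression's dataType reverts
// to containsNull = false. An expression that copies its first child's type
// (as the old Concat.dataType did) would then report the wrong nullability.
assert(simplifyCast(coerced).dataType == ArrayType(IntegerType, containsNull = false))
```

This is why the PR computes `containsNull` from all children instead of relying on the casts surviving optimization.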


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-03 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21704#discussion_r14021
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
     }
   }
 
-  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
+  override def dataType: DataType = {
+    val dataTypes = children.map(_.dataType)
+    dataTypes.headOption.map {
+      case ArrayType(et, _) =>
+        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
+      case dt => dt
+    }.getOrElse(StringType)
+  }
--- End diff --

Can't we handle this case in type coercion (analysis phase)?


---




[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...

2018-07-03 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/21704

[SPARK-24734][SQL] Fix containsNull of Concat for array type.

## What changes were proposed in this pull request?

Currently `Concat` for array type uses the data type of the first child as its own data type, but the children might include an array containing nulls.
We should take the nullabilities of all children into account.
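The fix can be sketched against a simplified model of the type hierarchy (stand-in case classes and the `oldDataType`/`newDataType` names are illustrative, not Spark's real code), mirroring the diff in this PR:

```scala
// Simplified stand-in types (hypothetical, not Spark's real classes).
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

// Old behavior: take the first child's type verbatim.
def oldDataType(childTypes: Seq[DataType]): DataType =
  childTypes.headOption.getOrElse(StringType)

// Fixed behavior (as in this PR's diff): merge containsNull across all
// array children, so a null anywhere is reflected in the result type.
def newDataType(childTypes: Seq[DataType]): DataType =
  childTypes.headOption.map {
    case ArrayType(et, _) =>
      ArrayType(et, childTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    case dt => dt
  }.getOrElse(StringType)

// First child has no nulls, second does: the old result type was wrong.
val ts = Seq(ArrayType(IntegerType, containsNull = false),
             ArrayType(IntegerType, containsNull = true))
assert(oldDataType(ts) == ArrayType(IntegerType, containsNull = false)) // bug
assert(newDataType(ts) == ArrayType(IntegerType, containsNull = true))  // fixed
```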

## How was this patch tested?

Modified and added some tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark 
issues/SPARK-24734/concat_containsnull

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21704.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21704


commit d87a8c6c0d1a4db5c9444781160a65562f8ea738
Author: Takuya UESHIN 
Date:   2018-07-03T11:21:06Z

Fix containsNull of Concat for array type.




---
