Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-03 Thread via GitHub


cloud-fan closed pull request #45383: [SPARK-47210][SQL] Addition of implicit 
casting without indeterminate support
URL: https://github.com/apache/spark/pull/45383


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-03 Thread via GitHub


cloud-fan commented on PR #45383:
URL: https://github.com/apache/spark/pull/45383#issuecomment-2035818786

   the yarn and docker test failures are unrelated, thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-03 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1549347237


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.{hasStringType}
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, 
CaseWhen, Cast, Coalesce, Collate, Concat, ConcatWs, CreateArray, Expression, 
Greatest, If, In, InSubquery, Least, Substring}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case ifExpr: If =>
+  ifExpr.withNewChildren(
+ifExpr.predicate +: collateToSingleType(Seq(ifExpr.trueValue, 
ifExpr.falseValue)))
+case caseWhenExpr: CaseWhen =>
+  val newValues = collateToSingleType(
+caseWhenExpr.branches.map(b => b._2) ++ caseWhenExpr.elseValue)
+  caseWhenExpr.withNewChildren(
+interleave(Seq.empty, caseWhenExpr.branches.map(b => b._1), newValues))

Review Comment:
   We actually do not blindly Cast in CollationTypeCasts. We only cast if we 
get different collations, as all other cases go to null branch in 
castStringType. Maybe it is better to keep this check internal to this rule, as 
we will add the priority flag and we will need to handle casting of priority in 
this rule as well later. Am adding the haveSameType for now to make it check 
the input types, but will change this code in the following PR to include 
priorities in the internal implementation of haveSameType check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-03 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1549240911


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##
@@ -112,12 +112,14 @@ case class Collate(child: Expression, collationName: 
String)
   since = "4.0.0",
   group = "string_funcs")
 // scalastyle:on line.contains.tab
-case class Collation(child: Expression) extends UnaryExpression with 
RuntimeReplaceable {
-  override def dataType: DataType = StringType
+case class Collation(child: Expression)
+  extends UnaryExpression with RuntimeReplaceable with ExpectsInputTypes {
+  override def dataType: DataType = SQLConf.get.defaultStringType

Review Comment:
   We shouldn't override it. The default implementation is better 
`replacement.dataType`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-03 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1549240904


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.{hasStringType}
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, 
CaseWhen, Cast, Coalesce, Collate, Concat, ConcatWs, CreateArray, Expression, 
Greatest, If, In, InSubquery, Least, Substring}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case ifExpr: If =>
+  ifExpr.withNewChildren(
+ifExpr.predicate +: collateToSingleType(Seq(ifExpr.trueValue, 
ifExpr.falseValue)))
+case caseWhenExpr: CaseWhen =>
+  val newValues = collateToSingleType(
+caseWhenExpr.branches.map(b => b._2) ++ caseWhenExpr.elseValue)
+  caseWhenExpr.withNewChildren(
+interleave(Seq.empty, caseWhenExpr.branches.map(b => b._1), newValues))
+case substrExpr: Substring =>
+  // This case is necessary for changing Substring input to implicit 
collation
+  substrExpr.withNewChildren(
+collateToSingleType(Seq(substrExpr.str)) :+ substrExpr.pos :+ 
substrExpr.len)

Review Comment:
   For now we do not need it, I can revert this change and leave it for default 
collation handling. Because if we add a flag for collation priority, we will 
need to change the priority of the input to implicit, if it was default before.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-03 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1549239292


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.{hasStringType}
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, 
CaseWhen, Cast, Coalesce, Collate, Concat, ConcatWs, CreateArray, Expression, 
Greatest, If, In, InSubquery, Least, Substring}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case ifExpr: If =>
+  ifExpr.withNewChildren(
+ifExpr.predicate +: collateToSingleType(Seq(ifExpr.trueValue, 
ifExpr.falseValue)))
+case caseWhenExpr: CaseWhen =>
+  val newValues = collateToSingleType(
+caseWhenExpr.branches.map(b => b._2) ++ caseWhenExpr.elseValue)
+  caseWhenExpr.withNewChildren(
+interleave(Seq.empty, caseWhenExpr.branches.map(b => b._1), newValues))
+case substrExpr: Substring =>
+  // This case is necessary for changing Substring input to implicit 
collation
+  substrExpr.withNewChildren(
+collateToSingleType(Seq(substrExpr.str)) :+ substrExpr.pos :+ 
substrExpr.len)

Review Comment:
   I don't get it. Why do we find the common collation for a single type?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-03 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1549237247


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.{hasStringType}
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, 
CaseWhen, Cast, Coalesce, Collate, Concat, ConcatWs, CreateArray, Expression, 
Greatest, If, In, InSubquery, Least, Substring}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case ifExpr: If =>
+  ifExpr.withNewChildren(
+ifExpr.predicate +: collateToSingleType(Seq(ifExpr.trueValue, 
ifExpr.falseValue)))
+case caseWhenExpr: CaseWhen =>
+  val newValues = collateToSingleType(
+caseWhenExpr.branches.map(b => b._2) ++ caseWhenExpr.elseValue)
+  caseWhenExpr.withNewChildren(
+interleave(Seq.empty, caseWhenExpr.branches.map(b => b._1), newValues))

Review Comment:
   Actually this exposes some gaps in this new rule
   1. We should add a trigger condition and only enter the branch if 
`!haveSameType(...)`
   2. We should not blindly add cast but use `castIfNotSameType`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-03 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1549233784


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.{hasStringType}
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, 
CaseWhen, Cast, Coalesce, Collate, Concat, ConcatWs, CreateArray, Expression, 
Greatest, If, In, InSubquery, Least, Substring}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case ifExpr: If =>
+  ifExpr.withNewChildren(
+ifExpr.predicate +: collateToSingleType(Seq(ifExpr.trueValue, 
ifExpr.falseValue)))
+case caseWhenExpr: CaseWhen =>
+  val newValues = collateToSingleType(
+caseWhenExpr.branches.map(b => b._2) ++ caseWhenExpr.elseValue)
+  caseWhenExpr.withNewChildren(
+interleave(Seq.empty, caseWhenExpr.branches.map(b => b._1), newValues))

Review Comment:
   The code looks a bit complicated now. Can we follow the existing rule?
   ```
 object CaseWhenCoercion extends TypeCoercionRule {
   override val transform: PartialFunction[Expression, Expression] = {
 case c: CaseWhen if c.childrenResolved && 
!haveSameType(c.inputTypesForMerging) =>
   val maybeCommonType = findWiderCommonType(c.inputTypesForMerging)
   maybeCommonType.map { commonType =>
 val newBranches = c.branches.map { case (condition, value) =>
   (condition, castIfNotSameType(value, commonType))
 }
 val newElseValue = c.elseValue.map(castIfNotSameType(_, 
commonType))
 CaseWhen(newBranches, newElseValue)
   }.getOrElse(c)
   }
 }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1548791139


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.{hasStringType}
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, 
CaseWhen, Cast, Coalesce, Collate, Concat, ConcatWs, CreateArray, Expression, 
Greatest, If, In, InSubquery, Least, Substring}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc@(_: In
+ | _: InSubquery
+ | _: CreateArray
+ | _: If
+ | _: ArrayJoin
+ | _: CaseWhen
+ | _: Concat
+ | _: Greatest
+ | _: Least
+ | _: Coalesce
+ | _: BinaryExpression
+ | _: ConcatWs
+ | _: Substring) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+  }
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, st: StringType): Option[Expression] =
+castStringType(expr.dataType, st).map { dt => Cast(expr, dt)}
+
+  private def castStringType(inType: AbstractDataType, castType: StringType): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId != castType.collationId => castType
+  case ArrayType(arrType, nullable) =>
+castStringType(arrType, castType).map(ArrayType(_, nullable)).orNull
+  case _ => null
+}
+Option(ret)
+  }
+
+  /**
+   * Collates input expressions to a single collation.
+   */
+  def collateToSingleType(exprs: Seq[Expression]): Seq[Expression] = {
+val st = getOutputCollation(exprs)
+
+exprs.map(e => castStringType(e, st).getOrElse(e))
+  }
+
+  /**
+   * Based on the data types of the input expressions this method determines
+   * a collation type which the output will have. This function accepts Seq of
+   * any expressions, but will only be affected by collated StringTypes or
+   * complex DataTypes with collated StringTypes (e.g. ArrayType)
+   */
+  def getOutputCollation(expr: Seq[Expression]): StringType = {
+val explicitTypes = expr.filter(_.isInstanceOf[Collate])
+  .map(_.dataType.asInstanceOf[StringType].collationId)
+  .distinct
+
+explicitTypes.size match {
+  // We have 1 explicit collation
+  case 1 => StringType(explicitTypes.head)
+  // Multiple explicit collations occurred
+  case size if size > 1 =>
+throw QueryCompilationErrors
+  .explicitCollationMismatchError(
+explicitTypes.map(t => StringType(t).typeName)
+  )
+  // Only implicit or default collations present
+  case 0 =>
+val implicitTypes = expr.map(_.dataType)
+  .filter(hasStringType)
+  .map(extractStringType)
+  .filter(dt => dt.collationId != 
SQLConf.get.defaultStringType.collationId)
+  .distinctBy(_.collationId)
+
+if (hasMultipleImplicits(implicitTypes)) {

Review Comment:
   this can be `implicitTypes.length > 1` now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL 

Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1548790383


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.{hasStringType}
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, 
CaseWhen, Cast, Coalesce, Collate, Concat, ConcatWs, CreateArray, Expression, 
Greatest, If, In, InSubquery, Least, Substring}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc@(_: In
+ | _: InSubquery
+ | _: CreateArray
+ | _: If

Review Comment:
   We should add more case matches as some expressions do not require all its 
inputs to be the same type
   ```
   case if: If =>
 if.withNewChildren(if.predicate +: collateToSingleType(if.trueValue, 
if.falseValue))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


srielau commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547984669


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, Expression, 
Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+  | _: Predicate
+  | _: SortOrder
+  | _: ExpectsInputTypes
+  | _: ComplexTypeMergingExpression) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+case pesc @ (_: CreateArray) =>

Review Comment:
   Interesting. Yes array() sports an implicit collation. 'c' COLLATE 
UNICODE_CI sports an explicit collation.
   The rules are clear that the explicit collation wins.
   So the result is `array` with an IMPLICIT collation.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


srielau commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547984669


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, Expression, 
Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+  | _: Predicate
+  | _: SortOrder
+  | _: ExpectsInputTypes
+  | _: ComplexTypeMergingExpression) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+case pesc @ (_: CreateArray) =>

Review Comment:
   Interesting. Yes array() sports an implicit collation. 'c' COLLATE 
UNICODE_CI sports an explicit collation.
   The rules are clear that the explicit collation wins.
   So the result is array with an IMPLICIT collation.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547880103


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, 
Cast, Collate, ComplexTypeMergingExpression, ConcatWs, CreateArray, Expression, 
In, InSubquery, Substring}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+abstract class CollationTypeCasts extends TypeCoercionRule {
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, st: StringType): Option[Expression] =
+castStringType(expr.dataType, st).map { dt => Cast(expr, dt)}
+
+  private def castStringType(inType: AbstractDataType, castType: StringType): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId != castType.collationId => castType
+  case ArrayType(arrType, nullable) =>
+castStringType(arrType, castType).map(ArrayType(_, nullable)).orNull
+  case _ => null
+}
+Option(ret)
+  }
+
+  /**
+   * Collates input expressions to a single collation.
+   */
+  def collateToSingleType(exprs: Seq[Expression]): Seq[Expression] = {
+val st = getOutputCollation(exprs)
+
+exprs.map(e => castStringType(e, st).getOrElse(e))
+  }
+
+  /**
+   * Based on the data types of the input expressions this method determines
+   * a collation type which the output will have. This function accepts Seq of
+   * any expressions, but will only be affected by collated StringTypes or
+   * complex DataTypes with collated StringTypes (e.g. ArrayType)
+   */
+  def getOutputCollation(expr: Seq[Expression]): StringType = {
+val explicitTypes = expr.filter(_.isInstanceOf[Collate])
+  .map(_.dataType.asInstanceOf[StringType].collationId)
+  .distinct
+
+explicitTypes.size match {
+  // We have 1 explicit collation
+  case 1 => StringType(explicitTypes.head)
+  // Multiple explicit collations occurred
+  case size if size > 1 =>
+throw QueryCompilationErrors
+  .explicitCollationMismatchError(
+explicitTypes.map(t => StringType(t).typeName)
+  )
+  // Only implicit or default collations present
+  case 0 =>
+val implicitTypes = expr.map(_.dataType)
+  .filter(hasStringType)
+  .map(extractStringType)
+  .filter(dt => dt.collationId != 
SQLConf.get.defaultStringType.collationId)

Review Comment:
   scala has `distinctBy`, we can distinct by collation id only.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547845857


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, 
Cast, Collate, ComplexTypeMergingExpression, ConcatWs, CreateArray, Expression, 
In, InSubquery, Substring}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+abstract class CollationTypeCasts extends TypeCoercionRule {
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, st: StringType): Option[Expression] =
+castStringType(expr.dataType, st).map { dt => Cast(expr, dt)}
+
+  private def castStringType(inType: AbstractDataType, castType: StringType): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId != castType.collationId => castType
+  case ArrayType(arrType, nullable) =>
+castStringType(arrType, castType).map(ArrayType(_, nullable)).orNull
+  case _ => null
+}
+Option(ret)
+  }
+
+  /**
+   * Collates input expressions to a single collation.
+   */
+  def collateToSingleType(exprs: Seq[Expression]): Seq[Expression] = {
+val st = getOutputCollation(exprs)
+
+exprs.map(e => castStringType(e, st).getOrElse(e))
+  }
+
+  /**
+   * Based on the data types of the input expressions this method determines
+   * a collation type which the output will have. This function accepts Seq of
+   * any expressions, but will only be affected by collated StringTypes or
+   * complex DataTypes with collated StringTypes (e.g. ArrayType)
+   */
+  def getOutputCollation(expr: Seq[Expression]): StringType = {
+val explicitTypes = expr.filter(_.isInstanceOf[Collate])
+  .map(_.dataType.asInstanceOf[StringType].collationId)
+  .distinct
+
+explicitTypes.size match {
+  // We have 1 explicit collation
+  case 1 => StringType(explicitTypes.head)
+  // Multiple explicit collations occurred
+  case size if size > 1 =>
+throw QueryCompilationErrors
+  .explicitCollationMismatchError(
+explicitTypes.map(t => StringType(t).typeName)
+  )
+  // Only implicit or default collations present
+  case 0 =>
+val implicitTypes = expr.map(_.dataType)
+  .filter(hasStringType)
+  .map(extractStringType)
+  .filter(dt => dt.collationId != 
SQLConf.get.defaultStringType.collationId)

Review Comment:
   I could change the else branch only here as well, to make it look nicer as 
well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547843264


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, 
Cast, Collate, ComplexTypeMergingExpression, ConcatWs, CreateArray, Expression, 
In, InSubquery, Substring}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+abstract class CollationTypeCasts extends TypeCoercionRule {
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, st: StringType): Option[Expression] =
+castStringType(expr.dataType, st).map { dt => Cast(expr, dt)}
+
+  private def castStringType(inType: AbstractDataType, castType: StringType): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId != castType.collationId => castType
+  case ArrayType(arrType, nullable) =>
+castStringType(arrType, castType).map(ArrayType(_, nullable)).orNull
+  case _ => null
+}
+Option(ret)
+  }
+
+  /**
+   * Collates input expressions to a single collation.
+   */
+  def collateToSingleType(exprs: Seq[Expression]): Seq[Expression] = {
+val st = getOutputCollation(exprs)
+
+exprs.map(e => castStringType(e, st).getOrElse(e))
+  }
+
+  /**
+   * Based on the data types of the input expressions this method determines
+   * a collation type which the output will have. This function accepts Seq of
+   * any expressions, but will only be affected by collated StringTypes or
+   * complex DataTypes with collated StringTypes (e.g. ArrayType)
+   */
+  def getOutputCollation(expr: Seq[Expression]): StringType = {
+val explicitTypes = expr.filter(_.isInstanceOf[Collate])
+  .map(_.dataType.asInstanceOf[StringType].collationId)
+  .distinct
+
+explicitTypes.size match {
+  // We have 1 explicit collation
+  case 1 => StringType(explicitTypes.head)
+  // Multiple explicit collations occurred
+  case size if size > 1 =>
+throw QueryCompilationErrors
+  .explicitCollationMismatchError(
+explicitTypes.map(t => StringType(t).typeName)
+  )
+  // Only implicit or default collations present
+  case 0 =>
+val implicitTypes = expr.map(_.dataType)
+  .filter(hasStringType)
+  .map(extractStringType)
+  .filter(dt => dt.collationId != 
SQLConf.get.defaultStringType.collationId)

Review Comment:
   There is one problem with this. We said collations are only different if 
their collationId's are different. If we do this, we would need to change the 
line in else clause to something that does not look so nice. I have changed 
this structure in https://github.com/apache/spark/pull/45819, which is a 
followup for this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547841866


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, 
Cast, Collate, ComplexTypeMergingExpression, ConcatWs, CreateArray, Expression, 
In, InSubquery, Substring}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+abstract class CollationTypeCasts extends TypeCoercionRule {
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, st: StringType): Option[Expression] =
+castStringType(expr.dataType, st).map { dt => Cast(expr, dt)}
+
+  private def castStringType(inType: AbstractDataType, castType: StringType): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId != castType.collationId => castType
+  case ArrayType(arrType, nullable) =>
+castStringType(arrType, castType).map(ArrayType(_, nullable)).orNull
+  case _ => null
+}
+Option(ret)
+  }
+
+  /**
+   * Collates input expressions to a single collation.
+   */
+  def collateToSingleType(exprs: Seq[Expression]): Seq[Expression] = {
+val st = getOutputCollation(exprs)
+
+exprs.map(e => castStringType(e, st).getOrElse(e))
+  }
+
+  /**
+   * Based on the data types of the input expressions this method determines
+   * a collation type which the output will have. This function accepts Seq of
+   * any expressions, but will only be affected by collated StringTypes or
+   * complex DataTypes with collated StringTypes (e.g. ArrayType)
+   */
+  def getOutputCollation(expr: Seq[Expression]): StringType = {
+val explicitTypes = expr.filter(_.isInstanceOf[Collate])
+  .map(_.dataType.asInstanceOf[StringType].collationId)
+  .distinct
+
+explicitTypes.size match {
+  // We have 1 explicit collation
+  case 1 => StringType(explicitTypes.head)
+  // Multiple explicit collations occurred
+  case size if size > 1 =>
+throw QueryCompilationErrors
+  .explicitCollationMismatchError(
+explicitTypes.map(t => StringType(t).typeName)
+  )
+  // Only implicit or default collations present
+  case 0 =>
+val implicitTypes = expr.map(_.dataType)
+  .filter(hasStringType)
+  .map(extractStringType)
+  .filter(dt => dt.collationId != 
SQLConf.get.defaultStringType.collationId)
+
+if (hasMultipleImplicits(implicitTypes)) {
+  throw QueryCompilationErrors.implicitCollationMismatchError()
+}
+else {
+  implicitTypes.find(dt => !(dt == SQLConf.get.defaultStringType))
+.getOrElse(SQLConf.get.defaultStringType)
+}
+}
+  }
+
+  /**
+   * This check is always preformed when we have no explicit collation. It 
returns true
+   * if there are more than one implicit collations. Collations are 
distinguished by their
+   * collationId.
+   * @param dataTypes
+   * @return
+   */
+  private def hasMultipleImplicits(dataTypes: Seq[StringType]): Boolean =
+dataTypes.map(_.collationId)
+  .filter(dt => !(dt == 
SQLConf.get.defaultStringType.collationId)).distinct.size > 1
+
+}
+
+/**
+ * This rule is used to collate all existing expressions related to StringType 
into a single
+ * 

Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547826569


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, 
Cast, Collate, ComplexTypeMergingExpression, ConcatWs, CreateArray, Expression, 
In, InSubquery, Substring}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+abstract class CollationTypeCasts extends TypeCoercionRule {
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, st: StringType): Option[Expression] =
+castStringType(expr.dataType, st).map { dt => Cast(expr, dt)}
+
+  private def castStringType(inType: AbstractDataType, castType: StringType): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId != castType.collationId => castType
+  case ArrayType(arrType, nullable) =>
+castStringType(arrType, castType).map(ArrayType(_, nullable)).orNull
+  case _ => null
+}
+Option(ret)
+  }
+
+  /**
+   * Collates input expressions to a single collation.
+   */
+  def collateToSingleType(exprs: Seq[Expression]): Seq[Expression] = {
+val st = getOutputCollation(exprs)
+
+exprs.map(e => castStringType(e, st).getOrElse(e))
+  }
+
+  /**
+   * Based on the data types of the input expressions this method determines
+   * a collation type which the output will have. This function accepts Seq of
+   * any expressions, but will only be affected by collated StringTypes or
+   * complex DataTypes with collated StringTypes (e.g. ArrayType)
+   */
+  def getOutputCollation(expr: Seq[Expression]): StringType = {
+val explicitTypes = expr.filter(_.isInstanceOf[Collate])
+  .map(_.dataType.asInstanceOf[StringType].collationId)
+  .distinct
+
+explicitTypes.size match {
+  // We have 1 explicit collation
+  case 1 => StringType(explicitTypes.head)
+  // Multiple explicit collations occurred
+  case size if size > 1 =>
+throw QueryCompilationErrors
+  .explicitCollationMismatchError(
+explicitTypes.map(t => StringType(t).typeName)
+  )
+  // Only implicit or default collations present
+  case 0 =>
+val implicitTypes = expr.map(_.dataType)
+  .filter(hasStringType)
+  .map(extractStringType)
+  .filter(dt => dt.collationId != 
SQLConf.get.defaultStringType.collationId)
+
+if (hasMultipleImplicits(implicitTypes)) {
+  throw QueryCompilationErrors.implicitCollationMismatchError()
+}
+else {
+  implicitTypes.find(dt => !(dt == SQLConf.get.defaultStringType))
+.getOrElse(SQLConf.get.defaultStringType)
+}
+}
+  }
+
+  /**
+   * This check is always preformed when we have no explicit collation. It 
returns true
+   * if there are more than one implicit collations. Collations are 
distinguished by their
+   * collationId.
+   * @param dataTypes
+   * @return
+   */
+  private def hasMultipleImplicits(dataTypes: Seq[StringType]): Boolean =
+dataTypes.map(_.collationId)
+  .filter(dt => !(dt == 
SQLConf.get.defaultStringType.collationId)).distinct.size > 1
+
+}
+
+/**
+ * This rule is used to collate all existing expressions related to StringType 
into a single
+ * 

Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547818939


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
##
@@ -702,16 +705,17 @@ abstract class TypeCoercionBase {
 }.getOrElse(b)  // If there is no applicable conversion, leave 
expression unchanged.
 
   case e: ImplicitCastInputTypes if e.inputTypes.nonEmpty =>
-val children: Seq[Expression] = e.children.zip(e.inputTypes).map { 
case (in, expected) =>
+val children: Seq[Expression] = e.children.zip(e.inputTypes).map {
   // If we cannot do the implicit cast, just use the original input.
-  implicitCast(in, expected).getOrElse(in)
+  case (in, expected) => implicitCast(in, expected).getOrElse(in)

Review Comment:
   please revert these cosmetic changes.



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
##
@@ -702,16 +705,17 @@ abstract class TypeCoercionBase {
 }.getOrElse(b)  // If there is no applicable conversion, leave 
expression unchanged.
 
   case e: ImplicitCastInputTypes if e.inputTypes.nonEmpty =>
-val children: Seq[Expression] = e.children.zip(e.inputTypes).map { 
case (in, expected) =>
+val children: Seq[Expression] = e.children.zip(e.inputTypes).map {
   // If we cannot do the implicit cast, just use the original input.
-  implicitCast(in, expected).getOrElse(in)
+  case (in, expected) => implicitCast(in, expected).getOrElse(in)
 }
 e.withNewChildren(children)
 
   case e: ExpectsInputTypes if e.inputTypes.nonEmpty =>
 // Convert NullType into some specific target type for 
ExpectsInputTypes that don't do
 // general implicit casting.
-val children: Seq[Expression] = e.children.zip(e.inputTypes).map { 
case (in, expected) =>
+val children: Seq[Expression] =
+  e.children.zip(e.inputTypes).map { case (in, expected) =>

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547818283


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, 
Cast, Collate, ComplexTypeMergingExpression, ConcatWs, CreateArray, Expression, 
In, InSubquery, Substring}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+abstract class CollationTypeCasts extends TypeCoercionRule {
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, st: StringType): Option[Expression] =
+castStringType(expr.dataType, st).map { dt => Cast(expr, dt)}
+
+  private def castStringType(inType: AbstractDataType, castType: StringType): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId != castType.collationId => castType
+  case ArrayType(arrType, nullable) =>
+castStringType(arrType, castType).map(ArrayType(_, nullable)).orNull
+  case _ => null
+}
+Option(ret)
+  }
+
+  /**
+   * Collates input expressions to a single collation.
+   */
+  def collateToSingleType(exprs: Seq[Expression]): Seq[Expression] = {
+val st = getOutputCollation(exprs)
+
+exprs.map(e => castStringType(e, st).getOrElse(e))
+  }
+
+  /**
+   * Based on the data types of the input expressions this method determines
+   * a collation type which the output will have. This function accepts Seq of
+   * any expressions, but will only be affected by collated StringTypes or
+   * complex DataTypes with collated StringTypes (e.g. ArrayType)
+   */
+  def getOutputCollation(expr: Seq[Expression]): StringType = {
+val explicitTypes = expr.filter(_.isInstanceOf[Collate])
+  .map(_.dataType.asInstanceOf[StringType].collationId)
+  .distinct
+
+explicitTypes.size match {
+  // We have 1 explicit collation
+  case 1 => StringType(explicitTypes.head)
+  // Multiple explicit collations occurred
+  case size if size > 1 =>
+throw QueryCompilationErrors
+  .explicitCollationMismatchError(
+explicitTypes.map(t => StringType(t).typeName)
+  )
+  // Only implicit or default collations present
+  case 0 =>
+val implicitTypes = expr.map(_.dataType)
+  .filter(hasStringType)
+  .map(extractStringType)
+  .filter(dt => dt.collationId != 
SQLConf.get.defaultStringType.collationId)

Review Comment:
   nit:
   ```
   val implicitCollations =
 ...
 .distinct
   if (implicitCollations > 1) {
 throw ...
   } else {
 implicitCollations.headOption.getOrElse(SQLConf.get.defaultStringType)
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547814185


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, 
Cast, Collate, ComplexTypeMergingExpression, ConcatWs, CreateArray, Expression, 
In, InSubquery, Substring}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+abstract class CollationTypeCasts extends TypeCoercionRule {
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, st: StringType): Option[Expression] =
+castStringType(expr.dataType, st).map { dt => Cast(expr, dt)}
+
+  private def castStringType(inType: AbstractDataType, castType: StringType): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId != castType.collationId => castType
+  case ArrayType(arrType, nullable) =>
+castStringType(arrType, castType).map(ArrayType(_, nullable)).orNull
+  case _ => null
+}
+Option(ret)
+  }
+
+  /**
+   * Collates input expressions to a single collation.
+   */
+  def collateToSingleType(exprs: Seq[Expression]): Seq[Expression] = {
+val st = getOutputCollation(exprs)
+
+exprs.map(e => castStringType(e, st).getOrElse(e))
+  }
+
+  /**
+   * Based on the data types of the input expressions this method determines
+   * a collation type which the output will have. This function accepts Seq of
+   * any expressions, but will only be affected by collated StringTypes or
+   * complex DataTypes with collated StringTypes (e.g. ArrayType)
+   */
+  def getOutputCollation(expr: Seq[Expression]): StringType = {
+val explicitTypes = expr.filter(_.isInstanceOf[Collate])
+  .map(_.dataType.asInstanceOf[StringType].collationId)
+  .distinct
+
+explicitTypes.size match {
+  // We have 1 explicit collation
+  case 1 => StringType(explicitTypes.head)
+  // Multiple explicit collations occurred
+  case size if size > 1 =>
+throw QueryCompilationErrors
+  .explicitCollationMismatchError(
+explicitTypes.map(t => StringType(t).typeName)
+  )
+  // Only implicit or default collations present
+  case 0 =>
+val implicitTypes = expr.map(_.dataType)
+  .filter(hasStringType)
+  .map(extractStringType)
+  .filter(dt => dt.collationId != 
SQLConf.get.defaultStringType.collationId)
+
+if (hasMultipleImplicits(implicitTypes)) {
+  throw QueryCompilationErrors.implicitCollationMismatchError()
+}
+else {
+  implicitTypes.find(dt => !(dt == SQLConf.get.defaultStringType))
+.getOrElse(SQLConf.get.defaultStringType)
+}
+}
+  }
+
+  /**
+   * This check is always preformed when we have no explicit collation. It 
returns true
+   * if there are more than one implicit collations. Collations are 
distinguished by their
+   * collationId.
+   * @param dataTypes
+   * @return
+   */
+  private def hasMultipleImplicits(dataTypes: Seq[StringType]): Boolean =
+dataTypes.map(_.collationId)
+  .filter(dt => !(dt == 
SQLConf.get.defaultStringType.collationId)).distinct.size > 1
+
+}
+
+/**
+ * This rule is used to collate all existing expressions related to StringType 
into a single
+ * 

Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547811709


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, 
Cast, Collate, ComplexTypeMergingExpression, ConcatWs, CreateArray, Expression, 
In, InSubquery, Substring}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+abstract class CollationTypeCasts extends TypeCoercionRule {
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, st: StringType): Option[Expression] =
+castStringType(expr.dataType, st).map { dt => Cast(expr, dt)}
+
+  private def castStringType(inType: AbstractDataType, castType: StringType): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId != castType.collationId => castType
+  case ArrayType(arrType, nullable) =>
+castStringType(arrType, castType).map(ArrayType(_, nullable)).orNull
+  case _ => null
+}
+Option(ret)
+  }
+
+  /**
+   * Collates input expressions to a single collation.
+   */
+  def collateToSingleType(exprs: Seq[Expression]): Seq[Expression] = {
+val st = getOutputCollation(exprs)
+
+exprs.map(e => castStringType(e, st).getOrElse(e))
+  }
+
+  /**
+   * Based on the data types of the input expressions this method determines
+   * a collation type which the output will have. This function accepts Seq of
+   * any expressions, but will only be affected by collated StringTypes or
+   * complex DataTypes with collated StringTypes (e.g. ArrayType)
+   */
+  def getOutputCollation(expr: Seq[Expression]): StringType = {
+val explicitTypes = expr.filter(_.isInstanceOf[Collate])
+  .map(_.dataType.asInstanceOf[StringType].collationId)
+  .distinct
+
+explicitTypes.size match {
+  // We have 1 explicit collation
+  case 1 => StringType(explicitTypes.head)
+  // Multiple explicit collations occurred
+  case size if size > 1 =>
+throw QueryCompilationErrors
+  .explicitCollationMismatchError(
+explicitTypes.map(t => StringType(t).typeName)
+  )
+  // Only implicit or default collations present
+  case 0 =>
+val implicitTypes = expr.map(_.dataType)
+  .filter(hasStringType)
+  .map(extractStringType)
+  .filter(dt => dt.collationId != 
SQLConf.get.defaultStringType.collationId)
+
+if (hasMultipleImplicits(implicitTypes)) {
+  throw QueryCompilationErrors.implicitCollationMismatchError()
+}
+else {
+  implicitTypes.find(dt => !(dt == SQLConf.get.defaultStringType))
+.getOrElse(SQLConf.get.defaultStringType)
+}
+}
+  }
+
+  /**
+   * This check is always preformed when we have no explicit collation. It 
returns true
+   * if there are more than one implicit collations. Collations are 
distinguished by their
+   * collationId.
+   * @param dataTypes
+   * @return
+   */
+  private def hasMultipleImplicits(dataTypes: Seq[StringType]): Boolean =
+dataTypes.map(_.collationId)
+  .filter(dt => !(dt == 
SQLConf.get.defaultStringType.collationId)).distinct.size > 1
+
+}
+
+/**
+ * This rule is used to collate all existing expressions related to StringType 
into a single
+ * 

Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547812585


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, 
Cast, Collate, ComplexTypeMergingExpression, ConcatWs, CreateArray, Expression, 
In, InSubquery, Substring}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+abstract class CollationTypeCasts extends TypeCoercionRule {
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, st: StringType): Option[Expression] =
+castStringType(expr.dataType, st).map { dt => Cast(expr, dt)}
+
+  private def castStringType(inType: AbstractDataType, castType: StringType): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId != castType.collationId => castType
+  case ArrayType(arrType, nullable) =>
+castStringType(arrType, castType).map(ArrayType(_, nullable)).orNull
+  case _ => null
+}
+Option(ret)
+  }
+
+  /**
+   * Collates input expressions to a single collation.
+   */
+  def collateToSingleType(exprs: Seq[Expression]): Seq[Expression] = {
+val st = getOutputCollation(exprs)
+
+exprs.map(e => castStringType(e, st).getOrElse(e))
+  }
+
+  /**
+   * Based on the data types of the input expressions this method determines
+   * a collation type which the output will have. This function accepts Seq of
+   * any expressions, but will only be affected by collated StringTypes or
+   * complex DataTypes with collated StringTypes (e.g. ArrayType)
+   */
+  def getOutputCollation(expr: Seq[Expression]): StringType = {
+val explicitTypes = expr.filter(_.isInstanceOf[Collate])
+  .map(_.dataType.asInstanceOf[StringType].collationId)
+  .distinct
+
+explicitTypes.size match {
+  // We have 1 explicit collation
+  case 1 => StringType(explicitTypes.head)
+  // Multiple explicit collations occurred
+  case size if size > 1 =>
+throw QueryCompilationErrors
+  .explicitCollationMismatchError(
+explicitTypes.map(t => StringType(t).typeName)
+  )
+  // Only implicit or default collations present
+  case 0 =>
+val implicitTypes = expr.map(_.dataType)
+  .filter(hasStringType)
+  .map(extractStringType)
+  .filter(dt => dt.collationId != 
SQLConf.get.defaultStringType.collationId)
+
+if (hasMultipleImplicits(implicitTypes)) {
+  throw QueryCompilationErrors.implicitCollationMismatchError()
+}
+else {
+  implicitTypes.find(dt => !(dt == SQLConf.get.defaultStringType))
+.getOrElse(SQLConf.get.defaultStringType)
+}
+}
+  }
+
+  /**
+   * This check is always preformed when we have no explicit collation. It 
returns true
+   * if there are more than one implicit collations. Collations are 
distinguished by their
+   * collationId.
+   * @param dataTypes
+   * @return
+   */
+  private def hasMultipleImplicits(dataTypes: Seq[StringType]): Boolean =
+dataTypes.map(_.collationId)
+  .filter(dt => !(dt == 
SQLConf.get.defaultStringType.collationId)).distinct.size > 1
+
+}
+
+/**
+ * This rule is used to collate all existing expressions related to StringType 
into a single
+ * 

Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547390616


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, Elt, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+   | _: ComplexTypeMergingExpression
+   | _: CreateArray
+   | _: Elt
+   | _: ExpectsInputTypes
+   | _: Predicate
+   | _: SortOrder) =>
+  val newChildren = collateToSingleType(sc.children)

Review Comment:
   > If I understood correctly, we do not want to replicate the code from 
ImplicitTypeCasts.
   
   I don't think this is possible without significant refactoring. Correctness 
is more important than duplicated code at this stage.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547309134


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, Elt, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+   | _: ComplexTypeMergingExpression
+   | _: CreateArray
+   | _: Elt
+   | _: ExpectsInputTypes
+   | _: Predicate
+   | _: SortOrder) =>
+  val newChildren = collateToSingleType(sc.children)

Review Comment:
   This is the problem. If I understood correctly, we do not want to replicate 
the code from ImplicitTypeCasts. This rule is concerned with transforming 
datatypes into their expected DataTypes, but CollationTypeCasts is concerned 
with transforming StringTypes into their expected collated StringType. There is 
a difference, as collated StringType is calculated based on expression 
parameters that have StringTypes not on what someone expects the input types to 
be. As you mentioned for `If`, this new collation type rule does not want to 
fail instead of the IfCoercion which fails if it cannot find wider for left and 
right, but this new rule wants to fail and mitigate errors of a customer 
providing differently collated strings.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547288333


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
##
@@ -702,26 +707,39 @@ abstract class TypeCoercionBase {
 }.getOrElse(b)  // If there is no applicable conversion, leave 
expression unchanged.
 
   case e: ImplicitCastInputTypes if e.inputTypes.nonEmpty =>
-val children: Seq[Expression] = e.children.zip(e.inputTypes).map { 
case (in, expected) =>
+val childrenBeforeCollations: Seq[Expression] = 
e.children.zip(e.inputTypes).map {
   // If we cannot do the implicit cast, just use the original input.
-  implicitCast(in, expected).getOrElse(in)
+  case (in, expected) => implicitCast(in, expected).getOrElse(in)
+}
+val st = getOutputCollation(e.children)
+val children: Seq[Expression] = childrenBeforeCollations.map {

Review Comment:
   Note: if we want to have correlation between function inputs, please match 
the expression explicitly to do so. `ImplicitCastInputTypes` does not indicate 
input correlation and please DO NOT make such assumptions here.
   
   Looking at `ConcatWs`, it does need the inputs to use the same string 
collation. Let's match it in the new rule and deal with it correctly.
   
   In principle, let's not generalize things without seeing the full picture. 
What we should do to `ConcatWs` does not necessarily apply to all 
`ImplicitCastInputTypes` implementations.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547273755


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
##
@@ -702,26 +707,39 @@ abstract class TypeCoercionBase {
 }.getOrElse(b)  // If there is no applicable conversion, leave 
expression unchanged.
 
   case e: ImplicitCastInputTypes if e.inputTypes.nonEmpty =>
-val children: Seq[Expression] = e.children.zip(e.inputTypes).map { 
case (in, expected) =>
+val childrenBeforeCollations: Seq[Expression] = 
e.children.zip(e.inputTypes).map {
   // If we cannot do the implicit cast, just use the original input.
-  implicitCast(in, expected).getOrElse(in)
+  case (in, expected) => implicitCast(in, expected).getOrElse(in)
+}
+val st = getOutputCollation(e.children)
+val children: Seq[Expression] = childrenBeforeCollations.map {

Review Comment:
   Once again, this is the problem, we need to call getOutputCollation before 
we call implicitCast, as implicitCast only concerns one child, and not all 
children as a group. For collations we need to be sure we decided on the right 
collation first and then enter a casting rule. But if we do this, we would need 
to pass additional stuff to implicitCast, which changes the logic of the core 
function. Another idea was to implement implicitCasting for collations solely 
in a new rule, but we would end up copying a lot of already existent code in 
the implicitCast for TypeCollections and StringTypes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547259400


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, Elt, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+   | _: ComplexTypeMergingExpression
+   | _: CreateArray
+   | _: Elt
+   | _: ExpectsInputTypes
+   | _: Predicate
+   | _: SortOrder) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+  }
+
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, st: StringType): Option[Expression] =
+castStringType(expr.dataType, st).map { dt => Cast(expr, dt)}
+
+  private def castStringType(inType: AbstractDataType, castType: StringType): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId != castType.collationId => castType
+  case ArrayType(arrType, nullable) =>

Review Comment:
   Then please match the ConcatWs expression explicitly to handle this case. 
What I disagree with is to do this blindly for all expressions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547242430


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, Elt, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+   | _: ComplexTypeMergingExpression
+   | _: CreateArray
+   | _: Elt
+   | _: ExpectsInputTypes
+   | _: Predicate
+   | _: SortOrder) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+  }
+
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, st: StringType): Option[Expression] =
+castStringType(expr.dataType, st).map { dt => Cast(expr, dt)}
+
+  private def castStringType(inType: AbstractDataType, castType: StringType): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId != castType.collationId => castType
+  case ArrayType(arrType, nullable) =>

Review Comment:
   A simple example is ConcatWs. It can have ArrayType(StringType, _) for input 
strings and StringType for separator as parameters. What collations do we want 
for this then? We need to cast the ArrayType into a proper collation if 
separator is explicit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547239943


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
##
@@ -702,26 +707,39 @@ abstract class TypeCoercionBase {
 }.getOrElse(b)  // If there is no applicable conversion, leave 
expression unchanged.
 
   case e: ImplicitCastInputTypes if e.inputTypes.nonEmpty =>
-val children: Seq[Expression] = e.children.zip(e.inputTypes).map { 
case (in, expected) =>
+val childrenBeforeCollations: Seq[Expression] = 
e.children.zip(e.inputTypes).map {
   // If we cannot do the implicit cast, just use the original input.
-  implicitCast(in, expected).getOrElse(in)
+  case (in, expected) => implicitCast(in, expected).getOrElse(in)
+}
+val st = getOutputCollation(e.children)
+val children: Seq[Expression] = childrenBeforeCollations.map {

Review Comment:
   It seems we are fixing the problem at the wrong place. If `TypeCollections` 
has issues, let's update the code in `def implicitCast` to fix it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547226617


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
##
@@ -702,26 +707,39 @@ abstract class TypeCoercionBase {
 }.getOrElse(b)  // If there is no applicable conversion, leave 
expression unchanged.
 
   case e: ImplicitCastInputTypes if e.inputTypes.nonEmpty =>
-val children: Seq[Expression] = e.children.zip(e.inputTypes).map { 
case (in, expected) =>
+val childrenBeforeCollations: Seq[Expression] = 
e.children.zip(e.inputTypes).map {
   // If we cannot do the implicit cast, just use the original input.
-  implicitCast(in, expected).getOrElse(in)
+  case (in, expected) => implicitCast(in, expected).getOrElse(in)
+}
+val st = getOutputCollation(e.children)
+val children: Seq[Expression] = childrenBeforeCollations.map {

Review Comment:
   Well the problem is that we have TypeCollections as expected types. If 
someone defines StringType as non-first DataType (and consequently 
non-defaultConcreteType), we need to go through previous elements first to see 
if it will come to StringType cast. If it does, we need to cast it to the same 
collation. For example, look at `ConcatWs` if it receives a BinaryType, it 
should try first to cast to ArrayType, get None and then try to cast it to 
StringType. But on the other hand, we need for consistency with 
findWiderCommonType to cast collations at the beginning of rules, as we would 
get a breakage in castings otherwise.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547226867


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
##
@@ -702,26 +707,39 @@ abstract class TypeCoercionBase {
 }.getOrElse(b)  // If there is no applicable conversion, leave 
expression unchanged.
 
   case e: ImplicitCastInputTypes if e.inputTypes.nonEmpty =>
-val children: Seq[Expression] = e.children.zip(e.inputTypes).map { 
case (in, expected) =>
+val childrenBeforeCollations: Seq[Expression] = 
e.children.zip(e.inputTypes).map {
   // If we cannot do the implicit cast, just use the original input.
-  implicitCast(in, expected).getOrElse(in)
+  case (in, expected) => implicitCast(in, expected).getOrElse(in)
+}
+val st = getOutputCollation(e.children)
+val children: Seq[Expression] = childrenBeforeCollations.map {
+  case in if hasStringType(in.dataType) =>
+castStringType(in, st).getOrElse(in)
+  case in => in
 }
 e.withNewChildren(children)
 
   case e: ExpectsInputTypes if e.inputTypes.nonEmpty =>
 // Convert NullType into some specific target type for 
ExpectsInputTypes that don't do
 // general implicit casting.
-val children: Seq[Expression] = e.children.zip(e.inputTypes).map { 
case (in, expected) =>
+val childrenBeforeCollations: Seq[Expression] =

Review Comment:
   This is not necessary, agreed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547210423


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, Elt, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+   | _: ComplexTypeMergingExpression
+   | _: CreateArray
+   | _: Elt
+   | _: ExpectsInputTypes
+   | _: Predicate
+   | _: SortOrder) =>
+  val newChildren = collateToSingleType(sc.children)

Review Comment:
   Let's not assume thar it's safe to only cast string-type inputs for any 
expression. Please follow the existing implicit cast rules and explicitly match 
expressions that need some of their inputs to be the same type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547201438


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, Elt, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+   | _: ComplexTypeMergingExpression
+   | _: CreateArray
+   | _: Elt
+   | _: ExpectsInputTypes
+   | _: Predicate
+   | _: SortOrder) =>
+  val newChildren = collateToSingleType(sc.children)

Review Comment:
   This rule is only touching children that have StringType's as their 
DataType. I can reorder execution to first filter out StringType arguments and 
then work on them, Agreed for `ExpectInputTypes`, I have checked now and it 
should not have any function with 2 StringType inputs, so we are safe not to 
cast it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547197835


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
##
@@ -702,26 +707,39 @@ abstract class TypeCoercionBase {
 }.getOrElse(b)  // If there is no applicable conversion, leave 
expression unchanged.
 
   case e: ImplicitCastInputTypes if e.inputTypes.nonEmpty =>
-val children: Seq[Expression] = e.children.zip(e.inputTypes).map { 
case (in, expected) =>
+val childrenBeforeCollations: Seq[Expression] = 
e.children.zip(e.inputTypes).map {
   // If we cannot do the implicit cast, just use the original input.
-  implicitCast(in, expected).getOrElse(in)
+  case (in, expected) => implicitCast(in, expected).getOrElse(in)
+}
+val st = getOutputCollation(e.children)
+val children: Seq[Expression] = childrenBeforeCollations.map {

Review Comment:
   what are we doing here? `ImplicitCastInputTypes` defines the expected data 
type of each input, but does not require all inputs to be the same data type.



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
##
@@ -702,26 +707,39 @@ abstract class TypeCoercionBase {
 }.getOrElse(b)  // If there is no applicable conversion, leave 
expression unchanged.
 
   case e: ImplicitCastInputTypes if e.inputTypes.nonEmpty =>
-val children: Seq[Expression] = e.children.zip(e.inputTypes).map { 
case (in, expected) =>
+val childrenBeforeCollations: Seq[Expression] = 
e.children.zip(e.inputTypes).map {
   // If we cannot do the implicit cast, just use the original input.
-  implicitCast(in, expected).getOrElse(in)
+  case (in, expected) => implicitCast(in, expected).getOrElse(in)
+}
+val st = getOutputCollation(e.children)
+val children: Seq[Expression] = childrenBeforeCollations.map {
+  case in if hasStringType(in.dataType) =>
+castStringType(in, st).getOrElse(in)
+  case in => in
 }
 e.withNewChildren(children)
 
   case e: ExpectsInputTypes if e.inputTypes.nonEmpty =>
 // Convert NullType into some specific target type for 
ExpectsInputTypes that don't do
 // general implicit casting.
-val children: Seq[Expression] = e.children.zip(e.inputTypes).map { 
case (in, expected) =>
+val childrenBeforeCollations: Seq[Expression] =

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547196013


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, Elt, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+   | _: ComplexTypeMergingExpression
+   | _: CreateArray
+   | _: Elt
+   | _: ExpectsInputTypes
+   | _: Predicate
+   | _: SortOrder) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+  }
+
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, st: StringType): Option[Expression] =
+castStringType(expr.dataType, st).map { dt => Cast(expr, dt)}
+
+  private def castStringType(inType: AbstractDataType, castType: StringType): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId != castType.collationId => castType
+  case ArrayType(arrType, nullable) =>

Review Comment:
   I disagree with special-case array type. The code looks broken. It assumes 
the children of the given expression can have both string type and array of 
string type, then tries to find a common collation between the string type 
child and the array element. This makes no sense without knowing the semantic 
of the given expression.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547193589


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, Elt, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+   | _: ComplexTypeMergingExpression
+   | _: CreateArray
+   | _: Elt
+   | _: ExpectsInputTypes
+   | _: Predicate
+   | _: SortOrder) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+  }
+
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, st: StringType): Option[Expression] =
+castStringType(expr.dataType, st).map { dt => Cast(expr, dt)}
+
+  private def castStringType(inType: AbstractDataType, castType: StringType): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId != castType.collationId => castType
+  case ArrayType(arrType, nullable) =>
+castStringType(arrType, castType).map(ArrayType(_, nullable)).orNull
+  case _ => null
+}
+Option(ret)
+  }
+
+  /**
+   * Collates input expressions to a single collation.
+   */
+  def collateToSingleType(exprs: Seq[Expression]): Seq[Expression] = {
+val st = getOutputCollation(exprs)
+
+exprs.map(e => castStringType(e, st).getOrElse(e))
+  }
+
+  /**
+   * Based on the data types of the input expressions this method determines
+   * a collation type which the output will have. This function accepts Seq of
+   * any expressions, but will only be affected by collated StringTypes or
+   * complex DataTypes with collated StringTypes (e.g. ArrayType)
+   */
+  def getOutputCollation(expr: Seq[Expression]): StringType = {
+val explicitTypes = expr.filter(_.isInstanceOf[Collate])
+  .map(_.dataType.asInstanceOf[StringType].collationId)
+  .distinct
+
+explicitTypes.size match {
+  // We have 1 explicit collation
+  case 1 => StringType(explicitTypes.head)
+  // Multiple explicit collations occurred
+  case size if size > 1 =>
+throw QueryCompilationErrors
+  .explicitCollationMismatchError(
+explicitTypes.map(t => StringType(t).typeName)
+  )
+  // Only implicit or default collations present
+  case 0 =>
+val implicitTypes = expr.map(_.dataType)
+  .filter(hasStringType)
+  .map(extractStringType)

Review Comment:
   How do we define implicit collation? The collation can come from multiple 
places:
   1. default collation
   2. table column collation
   3. was explicit collation before but does not get propagate
   
   Why do we special case default collation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries 

Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547191461


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, Elt, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+   | _: ComplexTypeMergingExpression
+   | _: CreateArray
+   | _: Elt
+   | _: ExpectsInputTypes
+   | _: Predicate
+   | _: SortOrder) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+  }
+
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, st: StringType): Option[Expression] =
+castStringType(expr.dataType, st).map { dt => Cast(expr, dt)}
+
+  private def castStringType(inType: AbstractDataType, castType: StringType): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId != castType.collationId => castType
+  case ArrayType(arrType, nullable) =>
+castStringType(arrType, castType).map(ArrayType(_, nullable)).orNull
+  case _ => null
+}
+Option(ret)
+  }
+
+  /**
+   * Collates input expressions to a single collation.
+   */
+  def collateToSingleType(exprs: Seq[Expression]): Seq[Expression] = {
+val st = getOutputCollation(exprs)
+
+exprs.map(e => castStringType(e, st).getOrElse(e))
+  }
+
+  /**
+   * Based on the data types of the input expressions this method determines
+   * a collation type which the output will have. This function accepts Seq of
+   * any expressions, but will only be affected by collated StringTypes or
+   * complex DataTypes with collated StringTypes (e.g. ArrayType)
+   */
+  def getOutputCollation(expr: Seq[Expression]): StringType = {
+val explicitTypes = expr.filter(_.isInstanceOf[Collate])
+  .map(_.dataType.asInstanceOf[StringType].collationId)
+  .distinct
+
+explicitTypes.size match {
+  // We have 1 explicit collation
+  case 1 => StringType(explicitTypes.head)
+  // Multiple explicit collations occurred
+  case size if size > 1 =>
+throw QueryCompilationErrors
+  .explicitCollationMismatchError(
+explicitTypes.map(t => StringType(t).typeName)
+  )
+  // Only implicit or default collations present
+  case 0 =>
+val implicitTypes = expr.map(_.dataType)
+  .filter(hasStringType)
+  .map(extractStringType)

Review Comment:
   the code looks confusing. If the default collation is not treated as 
implicit collation, we should filter it out here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-02 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1547189187


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, Elt, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+   | _: ComplexTypeMergingExpression
+   | _: CreateArray
+   | _: Elt
+   | _: ExpectsInputTypes
+   | _: Predicate
+   | _: SortOrder) =>
+  val newChildren = collateToSingleType(sc.children)

Review Comment:
   This is the tricky part of adding a new rule. We must make sure we follow 
the behavior of existing implicit cast rules and only add implicit cast for 
certain children of an expression. For example, the true and false branches of 
`If` expression need implicit cast, but not the if condition. 
`ExpectsInputTypes` does not indicate that the expression requires all its 
children to be the same type, and should not be handled here.
   
   Please carefully check all implicit cast rules and revisit this new rule. 
@mihailom-db 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546551263


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, Expression, 
Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+  | _: Predicate
+  | _: SortOrder
+  | _: ExpectsInputTypes
+  | _: ComplexTypeMergingExpression) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+case pesc @ (_: CreateArray) =>

Review Comment:
   The example query is `SELECT array('a', 'b' collate UNICODE) || ('c' collate 
UNICODE_CI)`. This is kind of artificial as users can write one `array(...)` 
function, but think about a rewrite like `SELECT arr || ('c' collate 
UNICODE_CI) FROM (SELECT array('a', 'b' collate UNICODE) as arr)`, I think 
returning an array with string of collation UNICODE_CI is fine?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


srielau commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546517397


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, Expression, 
Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+  | _: Predicate
+  | _: SortOrder
+  | _: ExpectsInputTypes
+  | _: ComplexTypeMergingExpression) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+case pesc @ (_: CreateArray) =>

Review Comment:
   We should play out how to the behavior.
   If I want combine arrays or maps, how do I do that. Is this what drives us 
towards CAST support. And does that imply that teh result of a CAST with a 
COLLATE clause (on an element/key/value) is a explicit collation)?  
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546445756


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, Expression, 
Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+  | _: Predicate
+  | _: SortOrder
+  | _: ExpectsInputTypes
+  | _: ComplexTypeMergingExpression) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+case pesc @ (_: CreateArray) =>

Review Comment:
   This is a good news. Let's not special-case CreateArray, otherwise people 
may question about CreateMap and CreateStruct. They are just SQL functions, not 
a special building block.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


srielau commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546420774


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, Expression, 
Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+  | _: Predicate
+  | _: SortOrder
+  | _: ExpectsInputTypes
+  | _: ComplexTypeMergingExpression) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+case pesc @ (_: CreateArray) =>

Review Comment:
   Explicit collation NEVER propagates. It only applies to the immediate 
consumer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546413940


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, Expression, 
Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+  | _: Predicate
+  | _: SortOrder
+  | _: ExpectsInputTypes
+  | _: ComplexTypeMergingExpression) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+case pesc @ (_: CreateArray) =>

Review Comment:
   How I understood it, arrays are bulding blocks for types as well. The 
difference is that arrays have to be built from some different types which 
makes them complex. But `substring`, `concat`, etc. are functions that return 
something based on a transformation behaviour. In other words, CreateArray is 
not even considering the the values, it only builds a new 'basic' block. That 
is why I am also concerned with what we want to do with Elt, do we want to do 
casting or just propagation of inputs, as it does not change any string, but 
only takes a specific value from the given index. But again, this differs from 
array access, as arrays need to have a single type for all input expressions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546398212


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, Expression, 
Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+  | _: Predicate
+  | _: SortOrder
+  | _: ExpectsInputTypes
+  | _: ComplexTypeMergingExpression) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+case pesc @ (_: CreateArray) =>

Review Comment:
   I'm not convinced. The propagation of "explicit collation" should be 
well-defined. Why don't funcitons like `substring`, `concat` propagate 
"explicit collation"?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546397910


##
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##
@@ -27,7 +27,9 @@ import org.apache.spark.sql.catalyst.util.CollationFactory
  * @param collationId The id of collation for this StringType.
  */
 @Stable
-class StringType private(val collationId: Int) extends AtomicType with 
Serializable {
+class StringType private(val collationId: Int, var isExplicit: Boolean = false)

Review Comment:
   Yeah, good point, needed it to be var when it was not in the constructor. 
Changing it to val.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546396569


##
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##
@@ -27,7 +27,9 @@ import org.apache.spark.sql.catalyst.util.CollationFactory
  * @param collationId The id of collation for this StringType.
  */
 @Stable
-class StringType private(val collationId: Int) extends AtomicType with 
Serializable {
+class StringType private(val collationId: Int, var isExplicit: Boolean = false)

Review Comment:
   why it's a `var`? does it need to be mutable?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546396114


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, Expression, 
Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+  | _: Predicate
+  | _: SortOrder
+  | _: ExpectsInputTypes
+  | _: ComplexTypeMergingExpression) =>

Review Comment:
   Added Elt to the check, but will probably revisit that function when we 
enter a PR for it's support.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546388826


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, Expression, 
Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+  | _: Predicate
+  | _: SortOrder
+  | _: ExpectsInputTypes
+  | _: ComplexTypeMergingExpression) =>

Review Comment:
   There are only 2 expressions that are potentially missing. One is Elt, but I 
figured this one can be done when we actually do the support for this function. 
And the other one is Stack, but for this one I would say we do not want to do 
casting, as we actually pass columns there. Additionally, as a followup, I have 
a ticket to check whether we want to do Map casting of stringTypes and in which 
way, but this feels like a less of a priority to indeterminate collation. I 
could add support for Elt, it is a single line of code, but I thought it would 
be better to work on it in a single PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546370140


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, Expression, 
Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+  | _: Predicate
+  | _: SortOrder
+  | _: ExpectsInputTypes
+  | _: ComplexTypeMergingExpression) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+case pesc @ (_: CreateArray) =>

Review Comment:
   This might be a good question for @srielau. How do we want this query to 
behave: `SELECT array('a', 'b' collate UNICODE) || ('c' collate UNICODE_CI)` 
(Concat_Ws in our implementation)? This query in Postgre fails with explicit 
type mismatch. Do we want to propagate explicit collations from elements to 
arrays. I would say yes, as ArrayType is constructed with a parameter called 
`elementType`. This suggests, if elements are explicit, then ArrayType should 
be as well. Also, on other operations with arrays, as we defined that function 
outputs are implicit, I would say array creation is then the only place where 
we should propagate this explicit meaning.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546285554


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, Expression, 
Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+  | _: Predicate
+  | _: SortOrder
+  | _: ExpectsInputTypes
+  | _: ComplexTypeMergingExpression) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+case pesc @ (_: CreateArray) =>

Review Comment:
   why only `CreateArray` can preserve explicit collation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546285091


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, Expression, 
Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+  | _: Predicate
+  | _: SortOrder
+  | _: ExpectsInputTypes
+  | _: ComplexTypeMergingExpression) =>
+  val newChildren = collateToSingleType(sc.children)
+  sc.withNewChildren(newChildren)
+case pesc @ (_: CreateArray) =>
+  val newChildren = collateToSingleType(pesc.children, true)
+  pesc.withNewChildren(newChildren)
+  }
+
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, st: StringType): Option[Expression] =
+castStringType(expr.dataType, st).map { dt => Cast(expr, dt)}
+
+  private def castStringType(inType: AbstractDataType, st: StringType): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case ost: StringType if ost.collationId == st.collationId
+&& ost.isExplicit == st.isExplicit => null
+  case _: StringType => st
+  case ArrayType(arrType, nullable) =>
+castStringType(arrType, st).map(ArrayType(_, nullable)).orNull
+  case _ => null
+}
+Option(ret)
+  }
+
+  /**
+   * Collates input expressions to a single collation.
+   */
+  def collateToSingleType(exprs: Seq[Expression],
+  preserveExplicit: Boolean = false): Seq[Expression] 
= {

Review Comment:
   nit 4 spaces indentation
   ```
   def func
   para1: xxx,
   para2: xxx,
   ...) ...
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546284265


##
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##
@@ -28,6 +28,7 @@ import org.apache.spark.sql.catalyst.util.CollationFactory
  */
 @Stable
 class StringType private(val collationId: Int) extends AtomicType with 
Serializable {
+  var isExplicit: Boolean = false

Review Comment:
   why isn't it part of construct parameters?



##
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala:
##
@@ -28,6 +28,7 @@ import org.apache.spark.sql.catalyst.util.CollationFactory
  */
 @Stable
 class StringType private(val collationId: Int) extends AtomicType with 
Serializable {
+  var isExplicit: Boolean = false

Review Comment:
   why isn't it part of constructor parameters?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546283459


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.analysis.TypeCoercion.hasStringType
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, Expression, 
Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression
+  | _: Predicate
+  | _: SortOrder
+  | _: ExpectsInputTypes
+  | _: ComplexTypeMergingExpression) =>

Review Comment:
   Can you check all the implicit cast rules and see if we miss to handle any 
expression?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546114682


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression

Review Comment:
   I went through all the core functions and it seems like StringType implicit 
castings are already blocked, or go to default branch which is None. Is there 
something else we want to check?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546065154


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression

Review Comment:
   This is a good point that we can't reuse the existing `fold` style to do 
implicit cast for string collations. If we must rewrite the code, a new rule 
sounds better.
   
   To avoid the notorious analyzer rule execution order issues, let's make sure 
this new rule is orthogonal to existing implicit cast rules: the existing 
"fold" style implicit cast should not apply to string type with different 
collations.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546046606


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression

Review Comment:
   Also, I do not believe incorporating string collations into core functions 
is the best way to go. If you take a look at findWiderCommonType, it is 
implemented differently for AnsiTypeCoercion and TypeCoercion. Firstly both of 
those take leftFold and go through expressions in order, but in our case that 
is not what we want. We want to first know what collations are explicit and 
then look for anything else. This is important because we want to have clear 
design on which error has to come first, and explicit mismatch should always 
come before implicit one. Another thing is that AnsiTypeCoercion seems to 
behave differently for StringTypes, we do not reorder any type in the sequence, 
which would result in implementing the same thing we did here with addition of 
another rule, but just in a core, already well defined function.
   One more thing I would add is that not all rules actually use core 
functions, e.g. ConcatCoercion. Also doing collations casting in one rule is 
way more efficient, as otherwise we would have to reorder StringTypes in every 
other rule, to make sure explicit collation mismatches are thrown first, which 
would result in multiple reorderings as opposed to constant of 2, one at the 
beginning for all StringTypes and the other at the end for all expressions that 
were cast to StringType from other types in some other implicit casting rules.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-04-01 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1546046606


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression

Review Comment:
   Also, I do not believe incorporating string collations into core functions 
is the best way to go. If you take a look at findWiderCommonForTwo, it is 
implemented differently for AnsiTypeCoercion and TypeCoercion. Firstly both of 
those take leftFold and go through expressions in order, but in our case that 
is not what we want. We want to first know what collations are explicit and 
then look for anything else. This is important because we want to have clear 
design on which error has to come first, and explicit mismatch should always 
come before implicit one. Another thing is that AnsiTypeCoercion seems to 
behave differently for StringTypes, we do not reorder any type in the sequence, 
which would result in implementing the same thing we did here with addition of 
another rule, but just in a core, already well defined function.
   One more thing I would add is that not all rules actually use core 
functions, e.g. ConcatCoercion. Also doing collations casting in one rule is 
way more efficient, as otherwise we would have to reorder StringTypes in every 
other rule, to make sure explicit collation mismatches are thrown first, which 
would result in multiple reorderings as opposed to constant of 2, one at the 
beginning for all StringTypes and the other at the end for all expressions that 
were cast to StringType from other types in some other implicit casting rules.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-03-29 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1544741676


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression

Review Comment:
   @cloud-fan Is this what you meant? Now this type of query will be broken 
`SELECT array('b' collate unicode_ci) || 'a' collate unicode`, but I can add 
support for array explicit as a followup.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-03-29 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1544588809


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+case sc @ (_: BinaryExpression

Review Comment:
   This looks like a clean solution for string collation implicit cast, but 
actually it is not. The current design of the Spark implicit cast system is: we 
define some core functions like `findTightestCommonType`, 
`findWiderTypeForTwo`, etc., and there are a bunch of rules to find the target 
expressions to apply implicit cast using these core functions.
   
   The string collation implicit cast should be part of the core functions, 
instead of a new rule. Otherwise this new rule is a bit inconsistent with the 
current system and we need to replicate the matching of expressions that need 
implicit cast.
   
   Since the core functions only take `DataType`, not `Expression`, I suggest 
we add a new boolean flag to `StringType` to indicate it's explicit or not.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-03-28 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1544100458


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##
@@ -112,12 +112,14 @@ case class Collate(child: Expression, collationName: 
String)
   since = "4.0.0",
   group = "string_funcs")
 // scalastyle:on line.contains.tab
-case class Collation(child: Expression) extends UnaryExpression with 
RuntimeReplaceable {
-  override def dataType: DataType = StringType
+case class Collation(child: Expression)
+  extends UnaryExpression with RuntimeReplaceable with ExpectsInputTypes {
+  override def dataType: DataType = child.dataType

Review Comment:
   I was thinking of some cases when indeterminate comes. What if our costumer 
wants to concat the collation used to their strings and filter on something 
like that. If we return a StringType, we would face a problem of getting 
indeterminate collation failing to keep the pass through for our functions. 
This is more of a problem if users set default collation to be something else, 
so we might say that changing return type to default collation would be the 
better choice?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-03-28 Thread via GitHub


mihailom-db commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1544095362


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+// Case when we do fail if resulting collation is indeterminate
+case checkCastWithoutIndeterminate @ (_: BinaryExpression
+  | _: Predicate
+  | _: SortOrder
+  | _: ExpectsInputTypes
+  | _: ComplexTypeMergingExpression
+  | _: CreateArray)
+  if shouldCast(checkCastWithoutIndeterminate.children) =>
+  val newChildren = 
collateToSingleType(checkCastWithoutIndeterminate.children)
+  checkCastWithoutIndeterminate.withNewChildren(newChildren)
+  }
+
+  def shouldCast(types: Seq[Expression]): Boolean = {
+types.filter(e => hasStringType(e.dataType))
+  .map(e => extractStringType(e.dataType).collationId).distinct.size > 1
+  }
+
+  /**
+   * Checks whether given data type contains StringType.
+   */
+  @tailrec
+  def hasStringType(dt: DataType): Boolean = dt match {
+case _: StringType => true
+case ArrayType(et, _) => hasStringType(et)
+case _ => false
+  }
+
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, collationId: Int): Option[Expression] =
+castStringType(expr.dataType, collationId).map { dt =>
+  if (dt == expr.dataType) expr else Cast(expr, dt)
+}
+
+  private def castStringType(inType: AbstractDataType, collationId: Int): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId == collationId => st
+  case _: StringType => StringType(collationId)
+  case ArrayType(arrType, nullable) =>
+castStringType(arrType, collationId).map(ArrayType(_, nullable)).orNull
+  case _ => null
+}
+Option(ret)
+  }
+
+  /**
+   * Collates input expressions to a single collation.
+   */
+  def collateToSingleType(exprs: Seq[Expression]): Seq[Expression] = {
+val collationId = getOutputCollation(exprs)
+
+exprs.map(e => castStringType(e, collationId).getOrElse(e))
+  }
+
+  /**
+   * Based on the data types of the input expressions this method determines
+   * a collation type which the output will have. This function accepts Seq of
+   * any expressions, but will only be affected by collated StringTypes or
+   * complex DataTypes with collated StringTypes (e.g. ArrayType)
+   */
+  def getOutputCollation(exprs: Seq[Expression]): Int = {
+val explicitTypes = exprs.filter(hasExplicitCollation)
+  .map(e => extractStringType(e.dataType).collationId).distinct
+
+explicitTypes.size match {
+  // We have 1 explicit collation
+  case 1 => explicitTypes.head
+  // Multiple explicit collations occurred
+  case size if size > 1 =>
+throw QueryCompilationErrors
+  .explicitCollationMismatchError(
+ 

Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-03-28 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1544061247


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+// Case when we do fail if resulting collation is indeterminate
+case checkCastWithoutIndeterminate @ (_: BinaryExpression
+  | _: Predicate
+  | _: SortOrder
+  | _: ExpectsInputTypes
+  | _: ComplexTypeMergingExpression
+  | _: CreateArray)
+  if shouldCast(checkCastWithoutIndeterminate.children) =>
+  val newChildren = 
collateToSingleType(checkCastWithoutIndeterminate.children)
+  checkCastWithoutIndeterminate.withNewChildren(newChildren)
+  }
+
+  def shouldCast(types: Seq[Expression]): Boolean = {
+types.filter(e => hasStringType(e.dataType))
+  .map(e => extractStringType(e.dataType).collationId).distinct.size > 1
+  }
+
+  /**
+   * Checks whether given data type contains StringType.
+   */
+  @tailrec
+  def hasStringType(dt: DataType): Boolean = dt match {
+case _: StringType => true
+case ArrayType(et, _) => hasStringType(et)
+case _ => false
+  }
+
+  /**
+   * Extracts StringTypes from filtered hasStringType
+   */
+  @tailrec
+  private def extractStringType(dt: DataType): StringType = dt match {
+case st: StringType => st
+case ArrayType(et, _) => extractStringType(et)
+  }
+
+  /**
+   * Casts given expression to collated StringType with id equal to 
collationId only
+   * if expression has StringType in the first place.
+   * @param expr
+   * @param collationId
+   * @return
+   */
+  def castStringType(expr: Expression, collationId: Int): Option[Expression] =
+castStringType(expr.dataType, collationId).map { dt =>
+  if (dt == expr.dataType) expr else Cast(expr, dt)
+}
+
+  private def castStringType(inType: AbstractDataType, collationId: Int): 
Option[DataType] = {
+@Nullable val ret: DataType = inType match {
+  case st: StringType if st.collationId == collationId => st
+  case _: StringType => StringType(collationId)
+  case ArrayType(arrType, nullable) =>
+castStringType(arrType, collationId).map(ArrayType(_, nullable)).orNull
+  case _ => null
+}
+Option(ret)
+  }
+
+  /**
+   * Collates input expressions to a single collation.
+   */
+  def collateToSingleType(exprs: Seq[Expression]): Seq[Expression] = {
+val collationId = getOutputCollation(exprs)
+
+exprs.map(e => castStringType(e, collationId).getOrElse(e))
+  }
+
+  /**
+   * Based on the data types of the input expressions this method determines
+   * a collation type which the output will have. This function accepts Seq of
+   * any expressions, but will only be affected by collated StringTypes or
+   * complex DataTypes with collated StringTypes (e.g. ArrayType)
+   */
+  def getOutputCollation(exprs: Seq[Expression]): Int = {
+val explicitTypes = exprs.filter(hasExplicitCollation)
+  .map(e => extractStringType(e.dataType).collationId).distinct
+
+explicitTypes.size match {
+  // We have 1 explicit collation
+  case 1 => explicitTypes.head
+  // Multiple explicit collations occurred
+  case size if size > 1 =>
+throw QueryCompilationErrors
+  .explicitCollationMismatchError(
+   

Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-03-28 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1544060704


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala:
##
@@ -112,12 +112,14 @@ case class Collate(child: Expression, collationName: 
String)
   since = "4.0.0",
   group = "string_funcs")
 // scalastyle:on line.contains.tab
-case class Collation(child: Expression) extends UnaryExpression with 
RuntimeReplaceable {
-  override def dataType: DataType = StringType
+case class Collation(child: Expression)
+  extends UnaryExpression with RuntimeReplaceable with ExpectsInputTypes {
+  override def dataType: DataType = child.dataType

Review Comment:
   shouldn't the collation name be a normal string type?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-03-28 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1544055842


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala:
##
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import javax.annotation.Nullable
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Cast, 
Collate, ComplexTypeMergingExpression, CreateArray, ExpectsInputTypes, 
Expression, Predicate, SortOrder}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, 
StringType}
+
+object CollationTypeCasts extends TypeCoercionRule {
+  override val transform: PartialFunction[Expression, Expression] = {
+case e if !e.childrenResolved => e
+// Case when we do fail if resulting collation is indeterminate

Review Comment:
   I thought we excluded indeterminate collation from this PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47210][SQL] Addition of implicit casting without indeterminate support [spark]

2024-03-28 Thread via GitHub


cloud-fan commented on code in PR #45383:
URL: https://github.com/apache/spark/pull/45383#discussion_r1544004871


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -467,6 +467,24 @@
 ],
 "sqlState" : "42704"
   },
+  "COLLATION_MISMATCH" : {
+"message" : [
+  "Could not determine which collation to use for string comparison."

Review Comment:
   shall we say "string functions and operators"?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org