This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new db15b82  [SPARK-37822][SQL] StringSplit should return an array of non-null elements
db15b82 is described below

commit db15b82be96cfd0f392b149b43b06148b639d9d7
Author: Shardul Mahadik <smaha...@linkedin.com>
AuthorDate: Thu Jan 6 15:51:33 2022 +0800

    [SPARK-37822][SQL] StringSplit should return an array of non-null elements

    ### What changes were proposed in this pull request?
    Currently, `split` [returns the data type](https://github.com/apache/spark/blob/08dd010860cc176a33073928f4c0780d0ee98a08/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L532) `ArrayType(StringType)`, which means the resulting array can contain nullable elements. However, there is no case in which the array can actually contain nulls. If either the input string or the delimiter is `NULL`, the output is a `NULL` array. For an empty string, or when there are no characters between delimiters, the output array contains empty strings but never `NULL`s. This PR therefore changes the return type of `split` to mark its elements as non-null.

    ### Why are the changes needed?
    Provides a more accurate return type for `split`.

    ### Does this PR introduce _any_ user-facing change?
    Yes, the schema of queries using `split` will change to show an array of non-null elements.

    ### How was this patch tested?
    Trivial change. Manually tested with the Spark shell:
    ```
    scala> spark.sql("SELECT split('a,b,c', ',')").printSchema
    root
     |-- split(a,b,c, ,, -1): array (nullable = false)
     |    |-- element: string (containsNull = false)
    ```
    I can't think of a better test case than just testing `StringSplit().dataType == ArrayType(StringType, containsNull = false)`, at which point it would just duplicate the actual definition of `StringSplit`.

    Closes #35111 from shardulm94/spark-37822.
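For reference, the null-element behavior described above can be sketched with plain JVM string splitting. This is an illustrative example only, not the code path Spark's `StringSplit` actually executes (which operates on `UTF8String`); it merely shows the same semantics with a negative limit: adjacent or trailing delimiters produce empty strings, never null elements.

```scala
// Illustrative sketch only: java.lang.String.split with limit -1 mirrors the
// semantics relied on by this PR, though Spark itself splits UTF8String values.
object SplitElementsNeverNull {
  def main(args: Array[String]): Unit = {
    // Adjacent delimiters and a trailing delimiter yield "" entries, not nulls.
    val parts = "a,,c,".split(",", -1)
    println(parts.mkString("[", ", ", "]")) // [a, , c, ]
    assert(parts.sameElements(Array("a", "", "c", "")))
    assert(parts.forall(_ != null)) // no null elements ever appear
  }
}
```

Since no element can be null, declaring `containsNull = false` simply records a guarantee the expression already upholds.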
    Authored-by: Shardul Mahadik <smaha...@linkedin.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
---
 .../org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
index e14e9ab..889c53b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
@@ -529,7 +529,7 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
 case class StringSplit(str: Expression, regex: Expression, limit: Expression)
   extends TernaryExpression with ImplicitCastInputTypes with NullIntolerant {

-  override def dataType: DataType = ArrayType(StringType)
+  override def dataType: DataType = ArrayType(StringType, containsNull = false)
   override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
   override def first: Expression = str
   override def second: Expression = regex

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org