[GitHub] spark pull request #21282: [SPARK-23934][SQL] Adding map_from_entries function

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21282
[GitHub] spark pull request #21282: [SPARK-23934][SQL] Adding map_from_entries function

Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21282#discussion_r193958595

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -308,6 +309,234 @@ case class MapEntries(child: Expression) extends UnaryExpression with ExpectsInp
   override def prettyName: String = "map_entries"
 }
+
+/**
+ * Returns a map created from the given array of entries.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(arrayOfEntries) - Returns a map created from the given array of entries.",
+  examples = """
+    Examples:
+      > SELECT _FUNC_(array(struct(1, 'a'), struct(2, 'b')));
+       {1:"a",2:"b"}
+  """,
+  since = "2.4.0")
+case class MapFromEntries(child: Expression) extends UnaryExpression {
+
+  @transient
+  private lazy val dataTypeDetails: Option[(MapType, Boolean, Boolean)] = child.dataType match {
+    case ArrayType(
+      StructType(Array(
+        StructField(_, keyType, keyNullable, _),
+        StructField(_, valueType, valueNullable, _))),
+      containsNull) => Some((MapType(keyType, valueType, valueNullable), keyNullable, containsNull))
+    case _ => None
+  }
+
+  private def nullEntries: Boolean = dataTypeDetails.get._3
+
+  override def dataType: MapType = dataTypeDetails.get._1
+
+  override def checkInputDataTypes(): TypeCheckResult = dataTypeDetails match {
+    case Some(_) => TypeCheckResult.TypeCheckSuccess
+    case None => TypeCheckResult.TypeCheckFailure(s"'${child.sql}' is of " +
+      s"${child.dataType.simpleString} type. $prettyName accepts only arrays of pair structs.")
+  }
+
+  override protected def nullSafeEval(input: Any): Any = {
+    val arrayData = input.asInstanceOf[ArrayData]
+    val length = arrayData.numElements()
+    val numEntries = if (nullEntries) (0 until length).count(!arrayData.isNullAt(_)) else length
+    val keyArray = new Array[AnyRef](numEntries)
+    val valueArray = new Array[AnyRef](numEntries)
+    var i = 0
+    var j = 0
+    while (i < length) {
+      if (!arrayData.isNullAt(i)) {
--- End diff --

Yeah, that sounds reasonable. Thanks.
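The semantics the thread converges on here — a null entry nullifies the whole result, as with `flatten` — can be sketched with plain Scala collections. This is illustrative only: the `mapFromEntries` name and the `Option`-based signature are assumptions for the sketch, not the PR's actual internals, which operate on Catalyst's `ArrayData`:

```
// Illustrative only: plain Scala collections stand in for Catalyst's
// ArrayData/MapData; None models a SQL NULL result.
def mapFromEntries[K, V](entries: Seq[(K, V)]): Option[Map[K, V]] =
  if (entries.contains(null)) {
    None // a null entry makes the whole map NULL, mirroring flatten
  } else {
    // Note: Scala's toMap keeps the last value for a duplicated key, whereas
    // the expression under review follows CreateMap and leaves duplicates alone.
    Some(entries.toMap)
  }

// mapFromEntries(Seq((1, "a"), (2, "b")))       // Some(Map(1 -> "a", 2 -> "b"))
// mapFromEntries(Seq((1, "a"), null, (2, "b"))) // None, i.e. SQL NULL
```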
[GitHub] spark pull request #21282: [SPARK-23934][SQL] Adding map_from_entries function

Github user mn-mikke commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21282#discussion_r192875181

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -308,6 +309,234 @@ case class MapEntries(child: Expression) extends UnaryExpression with ExpectsInp
+  override protected def nullSafeEval(input: Any): Any = {
+    val arrayData = input.asInstanceOf[ArrayData]
+    val length = arrayData.numElements()
+    val numEntries = if (nullEntries) (0 until length).count(!arrayData.isNullAt(_)) else length
+    val keyArray = new Array[AnyRef](numEntries)
+    val valueArray = new Array[AnyRef](numEntries)
+    var i = 0
+    var j = 0
+    while (i < length) {
+      if (!arrayData.isNullAt(i)) {
--- End diff --

Hi @ueshin, wouldn't it be better to return `null` in this case, and follow the null handling of other functions like `flatten`?

```
flatten(array(array(1,2), null, array(3,4))) => null
```

WDYT?
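Under that proposal, `map_from_entries` would behave analogously. A hypothetical example in the same notation as the `flatten` snippet above — this is the proposed semantics, not yet the merged behavior:

```
map_from_entries(array(struct(1, 'a'), null, struct(2, 'b'))) => null
```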
[GitHub] spark pull request #21282: [SPARK-23934][SQL] Adding map_from_entries function

Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21282#discussion_r192861844

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -308,6 +309,234 @@ case class MapEntries(child: Expression) extends UnaryExpression with ExpectsInp
+  override protected def nullSafeEval(input: Any): Any = {
+    val arrayData = input.asInstanceOf[ArrayData]
+    val length = arrayData.numElements()
+    val numEntries = if (nullEntries) (0 until length).count(!arrayData.isNullAt(_)) else length
+    val keyArray = new Array[AnyRef](numEntries)
+    val valueArray = new Array[AnyRef](numEntries)
+    var i = 0
+    var j = 0
+    while (i < length) {
+      if (!arrayData.isNullAt(i)) {
--- End diff --

Should we throw an exception if `arrayData.isNullAt(i)`?
[GitHub] spark pull request #21282: [SPARK-23934][SQL] Adding map_from_entries function

Github user mn-mikke commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21282#discussion_r192573167

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -118,6 +120,229 @@ case class MapValues(child: Expression)
   override def prettyName: String = "map_values"
 }
+
+/**
+ * Returns a map created from the given array of entries.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(arrayOfEntries) - Returns a map created from the given array of entries.",
+  examples = """
+    Examples:
+      > SELECT _FUNC_(array(struct(1, 'a'), struct(2, 'b')));
+       {1:"a",2:"b"}
+  """,
+  since = "2.4.0")
+case class MapFromEntries(child: Expression) extends UnaryExpression
+{
+  private lazy val resolvedDataType: Option[MapType] = child.dataType match {
+    case ArrayType(
+      StructType(Array(
+        StructField(_, keyType, false, _),
+        StructField(_, valueType, valueNullable, _))),
+      false) => Some(MapType(keyType, valueType, valueNullable))
+    case _ => None
+  }
+
+  override def dataType: MapType = resolvedDataType.get
+
+  override def checkInputDataTypes(): TypeCheckResult = resolvedDataType match {
+    case Some(_) => TypeCheckResult.TypeCheckSuccess
+    case None => TypeCheckResult.TypeCheckFailure(s"'${child.sql}' is of " +
+      s"${child.dataType.simpleString} type. $prettyName accepts only null-free arrays " +
+      "of pair structs. Values of the first struct field can't contain nulls and produce " +
+      "duplicates.")
+  }
+
+  override protected def nullSafeEval(input: Any): Any = {
+    val arrayData = input.asInstanceOf[ArrayData]
+    val length = arrayData.numElements()
+    val keyArray = new Array[AnyRef](length)
+    val keySet = new OpenHashSet[AnyRef]()
+    val valueArray = new Array[AnyRef](length)
+    var i = 0
+    while (i < length) {
+      val entry = arrayData.getStruct(i, 2)
+      val key = entry.get(0, dataType.keyType)
+      if (key == null) {
+        throw new RuntimeException("The first field from a struct (key) can't be null.")
+      }
+      if (keySet.contains(key)) {
--- End diff --

Ok, no problem. I've removed the duplicate-key checks.
[GitHub] spark pull request #21282: [SPARK-23934][SQL] Adding map_from_entries function

Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21282#discussion_r192531374

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -118,6 +120,229 @@ case class MapValues(child: Expression)
+      val entry = arrayData.getStruct(i, 2)
+      val key = entry.get(0, dataType.keyType)
+      if (key == null) {
+        throw new RuntimeException("The first field from a struct (key) can't be null.")
+      }
+      if (keySet.contains(key)) {
--- End diff --

I'm sorry for the super delay. Let's just ignore duplicated keys, like `CreateMap` does, for now. We will need to discuss map-related topics such as duplicate keys, equality, and ordering separately.
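For context, `CreateMap` at this point does not deduplicate, so the same key can appear twice in the result. An illustrative example in the style of the ExpressionDescription examples above (the exact rendering of the output is an assumption):

```
> SELECT map(1, 'a', 1, 'b');
 {1:"a",1:"b"}
```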
[GitHub] spark pull request #21282: [SPARK-23934][SQL] Adding map_from_entries function

Github user adrian-wang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21282#discussion_r187291692

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -118,6 +120,229 @@ case class MapValues(child: Expression)
+case class MapFromEntries(child: Expression) extends UnaryExpression
+{
--- End diff --

Nit: style
[GitHub] spark pull request #21282: [SPARK-23934][SQL] Adding map_from_entries function

Github user mn-mikke commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21282#discussion_r187282249

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -118,6 +120,229 @@ case class MapValues(child: Expression)
+      val entry = arrayData.getStruct(i, 2)
+      val key = entry.get(0, dataType.keyType)
+      if (key == null) {
+        throw new RuntimeException("The first field from a struct (key) can't be null.")
+      }
+      if (keySet.contains(key)) {
--- End diff --

Yeah, we've already touched on this topic in [your PR for SPARK-23933](https://github.com/apache/spark/pull/21258). I think that if hashing is ever added to maps, these duplicate-key checks will have to be introduced anyway, so adding them now would avoid a breaking change later. But I understand your point of view. Presto also doesn't support duplicates:

```
presto:default> SELECT map_from_entries(ARRAY[(1, 'x'), (1, 'y')]);
Query 20180510_090536_5_468a9 failed: Duplicate keys (1) are not allowed
```

WDYT @ueshin @gatorsmile
[GitHub] spark pull request #21282: [SPARK-23934][SQL] Adding map_from_entries function

Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21282#discussion_r187258000

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -118,6 +120,229 @@ case class MapValues(child: Expression)
+case class MapFromEntries(child: Expression) extends UnaryExpression
+{
+  private lazy val resolvedDataType: Option[MapType] = child.dataType match {
--- End diff --

`@transient`?
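The nit here is about serialization: Catalyst expression trees are serialized and shipped to executors, and marking a cheap-to-recompute cached `lazy val` as `@transient` keeps it out of the serialized form — it is simply recomputed after deserialization. A minimal sketch of the pattern, with a hypothetical class and field name:

```
import org.apache.spark.sql.types._

case class Example(childType: DataType) {
  // Not serialized with the instance; lazily recomputed on each executor
  // after deserialization.
  @transient private lazy val derived: Option[MapType] = childType match {
    case ArrayType(StructType(Array(k, v)), _) =>
      Some(MapType(k.dataType, v.dataType, v.nullable))
    case _ => None
  }
}
```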
[GitHub] spark pull request #21282: [SPARK-23934][SQL] Adding map_from_entries function

Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21282#discussion_r187263146

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -118,6 +120,229 @@ case class MapValues(child: Expression)
+  private lazy val resolvedDataType: Option[MapType] = child.dataType match {
+    case ArrayType(
+      StructType(Array(
+        StructField(_, keyType, false, _),
--- End diff --

We don't need the key field to be `nullable = false`, because we check the nullability when creating an array?
[GitHub] spark pull request #21282: [SPARK-23934][SQL] Adding map_from_entries function

Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21282#discussion_r187264919

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -118,6 +120,229 @@ case class MapValues(child: Expression)
+    case ArrayType(
+      StructType(Array(
+        StructField(_, keyType, false, _),
+        StructField(_, valueType, valueNullable, _))),
+      false) => Some(MapType(keyType, valueType, valueNullable))
--- End diff --

Should we really reject an array with `containsNull = true` here? The array might not actually contain `null`s.
[GitHub] spark pull request #21282: [SPARK-23934][SQL] Adding map_from_entries function

Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21282#discussion_r187234431

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -118,6 +120,229 @@ case class MapValues(child: Expression)
+      val entry = arrayData.getStruct(i, 2)
+      val key = entry.get(0, dataType.keyType)
+      if (key == null) {
+        throw new RuntimeException("The first field from a struct (key) can't be null.")
+      }
+      if (keySet.contains(key)) {
--- End diff --

Is this check necessary for now? Other operations (e.g. `CreateMap`) allow us to create a map with duplicated keys. Would it be better to be consistent across Spark?
[GitHub] spark pull request #21282: [SPARK-23934][SQL] Adding map_from_entries function

GitHub user mn-mikke opened a pull request:

    https://github.com/apache/spark/pull/21282

    [SPARK-23934][SQL] Adding map_from_entries function

## What changes were proposed in this pull request?

The PR adds the `map_from_entries` function that returns a map created from the given array of entries.

## How was this patch tested?

New tests added into:
- `CollectionExpressionsSuite`
- `DataFrameFunctionsSuite`

## Note

The proposed behavior doesn't allow duplicated keys, in contrast with the `map` function.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/AbsaOSS/spark feature/array-api-map_from_entries-to-master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21282.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21282

----

commit 8c6039c7b7f31f0343c4b0098a4e12dfff125128
Author: Marek Novotny
Date:   2018-05-07T12:23:18Z

    [SPARK-23934][SQL] Adding map_from_entries function
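For an end-to-end feel of the function being added, here is a usage sketch. It assumes the Spark 2.4 `functions.map_from_entries` Scala wrapper this change targets, and the `show()` output shape shown in comments is approximate:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, lit, map_from_entries, struct}

object MapFromEntriesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("map_from_entries-example")
      .getOrCreate()

    // SQL form, matching the ExpressionDescription example above.
    spark.sql("SELECT map_from_entries(array(struct(1, 'a'), struct(2, 'b'))) AS m").show(false)
    // +----------------+
    // |m               |
    // +----------------+
    // |[1 -> a, 2 -> b]|
    // +----------------+

    // DataFrame form: build the array of pair structs from literal columns.
    spark.range(1)
      .select(map_from_entries(array(struct(lit(1), lit("a")), struct(lit(2), lit("b")))).as("m"))
      .show(false)

    spark.stop()
  }
}
```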