[jira] [Commented] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464245#comment-16464245 ]

Bruce Robbins commented on SPARK-23936:
---------------------------------------

[~ueshin] I have a question about map_concat's behavior as it pertains to this part of the function description:

"If a key is found in multiple given maps, that key’s value in the resulting map comes from the last one of those maps."

Spark maps can have duplicate keys, e.g.:

{noformat}
scala> val df = sql("select map('a', 1, 'a', 2, 'b', 3, 'c', 10) as map1, map('a', 7, 'b', 8, 'b', 9) as map2")
scala> df.show(truncate=false)
+---------------------------------+------------------------+
|map1                             |map2                    |
+---------------------------------+------------------------+
|[a -> 1, a -> 2, b -> 3, c -> 10]|[a -> 7, b -> 8, b -> 9]|
+---------------------------------+------------------------+
{noformat}

I'm not sure the duplicate-handling part of the description makes sense for maps that allow duplicate keys.

I can think of three ways to handle the duplicate-key requirement:

Scheme #1: Ignore it. map_concat would be a pure concatenation. Using the above example maps:

{noformat}
scala> df.selectExpr("map_concat(map1, map2)").show(truncate=false)
+---------------------------------------------------------+
|map_concat(map1, map2)                                   |
+---------------------------------------------------------+
|[a -> 1, a -> 2, b -> 3, c -> 10, a -> 7, b -> 8, b -> 9]|
+---------------------------------------------------------+
{noformat}

Duplicate keys are preserved from the original maps, and, in this example, additional duplicates are introduced.

Scheme #2: Preserve duplicates within input maps, but still pick a winner across maps. That is, treat the maps like so:

{noformat}
map1:
a -> [1, 2]
b -> [3]
c -> [10]
map2:
a -> [7]
b -> [8, 9]
{noformat}

Then use the rule that the key's value comes from the last map in which the key appears:

{noformat}
resulting map
a -> [7]    // from map2
b -> [8, 9] // from map2
c -> [10]   // from map1
{noformat}

In Spark, it would look like this:

{noformat}
scala> df.selectExpr("map_concat(map1, map2)").show(truncate=false)
+---------------------------------+
|map_concat(map1, map2)           |
+---------------------------------+
|[a -> 7, b -> 8, b -> 9, c -> 10]|
+---------------------------------+
{noformat}

Scheme #3: Don't allow any duplicates in the resulting map. That is, treat the input maps collectively as a stream of tuples, and keep only the last value for _any_ key:

{noformat}
a -> 1, a -> 2, b -> 3, c -> 10, a -> 7, b -> 8, b -> 9
          ^                        ^       ^       ^
          |                        |       |       |
     overwrites               overwrites   |  overwrites
       a -> 1                   a -> 2     |    b -> 8
                                      overwrites
                                        b -> 3

scala> df.selectExpr("map_concat(map1, map2)").show(truncate=false)
+-------------------------+
|map_concat(map1, map2)   |
+-------------------------+
|[a -> 7, b -> 9, c -> 10]|
+-------------------------+
{noformat}

Note: This is what I've actually implemented in my PR. It made sense to me due to the requirement that we pick a winner across maps. But I wasn't aware then that the source maps could have duplicates.

As a wrinkle to this, spark-sql, for some reason, eliminates duplicates in maps on display:

{noformat}
spark-sql> select map1, map2 from mapsWithDupKeys;
{"a":2,"b":3,"c":10}    {"a":7,"b":9}
Time taken: 0.147 seconds, Fetched 1 row(s)
spark-sql> select map_keys(map1) from mapsWithDupKeys;
["a","a","b","c"]
Time taken: 0.093 seconds, Fetched 1 row(s)
{noformat}

> High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-23936
>                 URL: https://issues.apache.org/jira/browse/SPARK-23936
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Xiao Li
>            Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns the union of all the given maps. If a key is found in multiple given
> maps, that key’s value in the resulting map comes from the last one of those
> maps.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
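
The three duplicate-handling schemes described in the comment above can be sketched in plain Scala, outside of Spark. This is only an illustration under stated assumptions: a map with duplicate keys is modeled as an ordered sequence of key/value pairs, and the names `Entries`, `concatAll`, `lastMapWins`, and `lastValueWins` are hypothetical. It is not the code from the pull request, and the entry order it produces is a choice made here rather than something the thread specifies.

```scala
object MapConcatSchemes {
  // Assumption: a map that may contain duplicate keys is modeled as an
  // ordered sequence of (key, value) entries.
  type Entries[K, V] = Seq[(K, V)]

  // Scheme #1: pure concatenation. Every entry of every input is kept.
  def concatAll[K, V](maps: Entries[K, V]*): Entries[K, V] =
    maps.flatten

  // Scheme #2: duplicates inside one input map are preserved, but for each
  // key only the entries from the last map containing that key survive.
  def lastMapWins[K, V](maps: Entries[K, V]*): Entries[K, V] = {
    // For each key, the index of the last input map in which it appears
    // (toMap keeps the last pair seen for a repeated key).
    val lastIndex: Map[K, Int] =
      maps.zipWithIndex.flatMap { case (m, i) => m.map { case (k, _) => k -> i } }.toMap
    // Emit keys in order of first appearance across all inputs.
    val keysInOrder = maps.flatten.map(_._1).distinct
    keysInOrder.flatMap(k => maps(lastIndex(k)).filter(_._1 == k))
  }

  // Scheme #3: no duplicates at all. The inputs are treated as one stream
  // of entries and the last value seen for each key wins.
  def lastValueWins[K, V](maps: Entries[K, V]*): Entries[K, V] = {
    val all = maps.flatten
    val last = all.toMap // later entries overwrite earlier ones
    all.map(_._1).distinct.map(k => k -> last(k))
  }

  def main(args: Array[String]): Unit = {
    val map1 = Seq("a" -> 1, "a" -> 2, "b" -> 3, "c" -> 10)
    val map2 = Seq("a" -> 7, "b" -> 8, "b" -> 9)
    println(concatAll(map1, map2))     // all seven entries, in input order
    println(lastMapWins(map1, map2))   // a -> 7, b -> 8, b -> 9, c -> 10
    println(lastValueWins(map1, map2)) // a -> 7, b -> 9, c -> 10
  }
}
```

With the example maps from the comment, the three functions yield the same entries as the three `map_concat` outputs shown there.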
[jira] [Commented] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438576#comment-16438576 ]

Apache Spark commented on SPARK-23936:
--------------------------------------

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/21073
[jira] [Commented] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437985#comment-16437985 ]

Bruce Robbins commented on SPARK-23936:
---------------------------------------

I will have a WIP pull request tonight or tomorrow sometime.
[jira] [Commented] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436133#comment-16436133 ]

Bruce Robbins commented on SPARK-23936:
---------------------------------------

I would like to take this one, assuming no one has taken it. I will also watch for responses to [~mn-mikke]'s question.
[jira] [Commented] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434185#comment-16434185 ]

Marek Novotny commented on SPARK-23936:
---------------------------------------

Shouldn't we overload the _concat_ function for maps instead of introducing _map_concat_?