[jira] [Commented] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>

2018-05-04 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464245#comment-16464245
 ] 

Bruce Robbins commented on SPARK-23936:
---

[~ueshin]

I have a question about map_concat's behavior as it pertains to this part of 
the function description: "If a key is found in multiple given maps, that key’s 
value in the resulting map comes from the last one of those maps."

Spark maps can have duplicate keys, e.g.:
{noformat}
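scala> val df = sql("select map('a', 1, 'a', 2, 'b', 3, 'c', 10) as map1, map('a', 7, 'b', 8, 'b', 9) as map2")
scala> df.show(truncate=false)
+---------------------------------+------------------------+
|map1                             |map2                    |
+---------------------------------+------------------------+
|[a -> 1, a -> 2, b -> 3, c -> 10]|[a -> 7, b -> 8, b -> 9]|
+---------------------------------+------------------------+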
{noformat}
I'm not sure the duplicate handling part of the description makes sense for 
maps that allow duplicate keys.

I can think of three ways to handle the duplicate-key requirement:

Scheme #1: Ignore it. map_concat would be a pure concatenation. Using the 
above example maps:
{noformat}
scala> df.selectExpr("map_concat(map1, map2)").show(truncate=false)
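+---------------------------------------------------------+
|map_concat(map1, map2)                                   |
+---------------------------------------------------------+
|[a -> 1, a -> 2, b -> 3, c -> 10, a -> 7, b -> 8, b -> 9]|
+---------------------------------------------------------+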
{noformat}
Duplicate keys are preserved from the original maps, and, in this example, 
additional duplicates are introduced.
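
As a rough illustration only (plain Scala, not Spark's internal map representation), Scheme #1 can be modeled by treating each map as an ordered sequence of key/value pairs and appending the sequences. The names `Scheme1Sketch`, `Entries`, and `concatAll` are hypothetical:

```scala
// Hypothetical model: a map that may hold duplicate keys is an ordered
// sequence of (key, value) pairs. This is an assumption made for
// illustration, not how Spark stores map data.
object Scheme1Sketch {
  type Entries = Seq[(String, Int)]

  // Pure concatenation: every entry from every input map is kept, in order.
  def concatAll(maps: Entries*): Entries = maps.flatten

  val map1: Entries = Seq("a" -> 1, "a" -> 2, "b" -> 3, "c" -> 10)
  val map2: Entries = Seq("a" -> 7, "b" -> 8, "b" -> 9)

  // All seven entries survive, including the duplicate keys.
  val result: Entries = concatAll(map1, map2)
}
```

Under this model, no duplicate-key rule is applied at all, which is why the result can contain more duplicates than either input.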

Scheme #2: Preserve duplicates within input maps, but still pick a winner 
across maps. That is, treat the maps like so:
{noformat}
map1:
a -> [1, 2]
b -> [3]
c -> [10]

map2:
a -> [7]
b -> [8, 9]
{noformat}
Then use the rule that the key's value comes from the last map in which the key 
appears:
{noformat}
resulting map
a -> [7]    // from map2
b -> [8, 9] // from map2
c -> [10]   // from map1
{noformat}
In Spark, it would look like this:
{noformat}
scala> df.selectExpr("map_concat(map1, map2)").show(truncate=false)
+---------------------------------+
|map_concat(map1, map2)           |
+---------------------------------+
|[a -> 7, b -> 8, b -> 9, c -> 10]|
+---------------------------------+
{noformat}
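
A minimal sketch of Scheme #2 in plain Scala, again modeling each map as an ordered sequence of pairs (an assumption for illustration, not Spark's representation). The names `Scheme2Sketch` and `concatLastMapWins` are hypothetical, and emitting keys in order of first appearance is a choice made here to match the output shown above:

```scala
// Hypothetical model: each input map is an ordered Seq of (key, value)
// pairs that may contain duplicate keys.
object Scheme2Sketch {
  type Entries = Seq[(String, Int)]

  // For each key, keep every entry from the LAST input map that contains
  // the key, including duplicates inside that map.
  def concatLastMapWins(maps: Entries*): Entries = {
    // Keys in order of first appearance across all inputs.
    val keysInOrder = maps.flatten.map(_._1).distinct
    keysInOrder.flatMap { key =>
      // Walk the inputs from last to first and stop at the first hit.
      val winner: Entries = maps.reverseIterator
        .find(m => m.exists(entry => entry._1 == key))
        .getOrElse(Seq.empty)
      winner.filter(entry => entry._1 == key)
    }
  }

  val map1: Entries = Seq("a" -> 1, "a" -> 2, "b" -> 3, "c" -> 10)
  val map2: Entries = Seq("a" -> 7, "b" -> 8, "b" -> 9)

  // a and b come from map2 (b keeps both values); c comes from map1.
  val result: Entries = concatLastMapWins(map1, map2)
}
```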
Scheme #3: Don't allow any duplicates in the resulting map. That is, treat the 
input maps collectively as a stream of tuples, and keep only the last value for 
_any_ key:
{noformat}
a -> 1, a -> 2, b -> 3, c -> 10, a -> 7, b -> 8, b -> 9
        ^                        ^       ^       ^
        |                        |       |       |
   overwrites               overwrites   |  overwrites
     a -> 1                   a -> 2     |    b -> 8
                                    overwrites
                                      b -> 3

scala> df.selectExpr("map_concat(map1, map2)").show(truncate=false)
+-------------------------+
|map_concat(map1, map2)   |
+-------------------------+
|[a -> 7, b -> 9, c -> 10]|
+-------------------------+
{noformat}
Note: This is what I've actually implemented in my PR. It made sense to me due 
to the requirement that we pick a winner across maps. But I wasn't aware then 
that the source maps could have duplicates.
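
For comparison with the other two schemes, here is a minimal plain-Scala sketch of the Scheme #3 rule. It is not the code from the PR; it only models each map as an ordered sequence of pairs (an assumption for illustration). The names `Scheme3Sketch` and `concatLastValueWins` are hypothetical:

```scala
// Hypothetical model: each input map is an ordered Seq of (key, value)
// pairs that may contain duplicate keys.
object Scheme3Sketch {
  type Entries = Seq[(String, Int)]

  // Treat all inputs as one stream of pairs and keep only the last value
  // seen for each key, so the result has no duplicate keys.
  def concatLastValueWins(maps: Entries*): Entries = {
    val all = maps.flatten
    // toMap keeps the last value for a repeated key.
    val lastValue: Map[String, Int] = all.toMap
    // Emit keys in order of first appearance.
    all.map(_._1).distinct.map(key => key -> lastValue(key))
  }

  val map1: Entries = Seq("a" -> 1, "a" -> 2, "b" -> 3, "c" -> 10)
  val map2: Entries = Seq("a" -> 7, "b" -> 8, "b" -> 9)

  val result: Entries = concatLastValueWins(map1, map2)
}
```

The difference from Scheme #2 shows up only for key `b`: Scheme #2 keeps both `b -> 8` and `b -> 9`, while this rule keeps only `b -> 9`.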

One wrinkle: spark-sql, for some reason, eliminates duplicate keys when it 
displays maps, even though map_keys still returns them:
{noformat}
spark-sql> select map1, map2 from mapsWithDupKeys;
{"a":2,"b":3,"c":10}    {"a":7,"b":9}
Time taken: 0.147 seconds, Fetched 1 row(s)
spark-sql> select map_keys(map1) from mapsWithDupKeys;
["a","a","b","c"]
Time taken: 0.093 seconds, Fetched 1 row(s)
{noformat}

> High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>
> ---
>
>                 Key: SPARK-23936
>                 URL: https://issues.apache.org/jira/browse/SPARK-23936
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Xiao Li
>            Priority: Major
>
> Ref:  https://prestodb.io/docs/current/functions/map.html
> Returns the union of all the given maps. If a key is found in multiple given 
> maps, that key’s value in the resulting map comes from the last one of those 
> maps.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>

2018-04-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438576#comment-16438576
 ] 

Apache Spark commented on SPARK-23936:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/21073




[jira] [Commented] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>

2018-04-13 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437985#comment-16437985
 ] 

Bruce Robbins commented on SPARK-23936:
---

I will have a WIP pull request tonight or tomorrow sometime.




[jira] [Commented] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>

2018-04-12 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436133#comment-16436133
 ] 

Bruce Robbins commented on SPARK-23936:
---

I would like to take this one, assuming no one has taken it. I will also watch 
for responses to [~mn-mikke]'s question.




[jira] [Commented] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>

2018-04-11 Thread Marek Novotny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434185#comment-16434185
 ] 

Marek Novotny commented on SPARK-23936:
---

Shouldn't we overload the _concat_ function for maps instead of introducing 
_map_concat_?
