[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-11-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696394#comment-16696394
 ] 

Apache Spark commented on SPARK-25829:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/23124

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-11-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696393#comment-16696393
 ] 

Apache Spark commented on SPARK-25829:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/23124

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-25 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663553#comment-16663553
 ] 

Marco Gaido commented on SPARK-25829:
-

I think the main issue is that since this is not a SQL standard thing, every DB 
works in its way. Eg. Postgres just says that when duplicate keys are entered, 
there is no guarantee on the result 
(https://www.postgresql.org/docs/9.0/static/hstore.html); not a great policy, I 
agree. Maybe we can check Hive, since Spark takes much of its behavior from it. 
Anyway, I think we just need to define a coherent behavior across the codebase.

One consideration is that enforcing a policy like Presto (eg. fail in such a 
situation) has 2 main drawbacks:
 - We usually don't fail with bad data (most of the times we return NULL 
instead of throwing exceptions in other situations);
 - Checking if a key is already present, with the current {{ArrayData}} 
representation, is very inefficient and we can do workarounds for this, but we 
would need to replicate workarounds in any function which can produce keys, so 
it is going to be problematic to maintain.


> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-25 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663539#comment-16663539
 ] 

Liang-Chi Hsieh commented on SPARK-25829:
-

Although I think the inconsistent handling exists for a while, if we decide 
"later entry wins" and revert some functions from 2.4, will this be a blocker 
for 2.4?

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-25 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663484#comment-16663484
 ] 

Liang-Chi Hsieh commented on SPARK-25829:
-

Besides Java/Scala, is there any related definition in SQL standard? Or maybe 
we can be consistent with common behavior of other SQL systems.

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-25 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663394#comment-16663394
 ] 

Kazuaki Ishizaki commented on SPARK-25829:
--

I am curious about behavior in other systems such as Presto.
Here are test cases for 
[array|https://github.com/prestodb/presto/blob/master/presto-main/src/test/java/com/facebook/presto/type/TestArrayOperators.java]
 and 
[map|https://github.com/prestodb/presto/blob/master/presto-main/src/test/java/com/facebook/presto/type/TestMapOperators.java].

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-25 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663368#comment-16663368
 ] 

Kazuaki Ishizaki commented on SPARK-25829:
--

cc [~ueshin]

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663230#comment-16663230
 ] 

Dongjoon Hyun commented on SPARK-25829:
---

Thank you for further investigation! Both tasks look not easy. For me, +1 for 
`later entry wins` semantics because it's Java/Scala language style and many 
users know those languages. Also, Spark works in that way, especially during 
the writing operation.

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-24 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663121#comment-16663121
 ] 

Wenchen Fan commented on SPARK-25829:
-

If we decide to follow "later entry wins", the following functions need to be 
reverted from 2.4
MapFilter, MapZipWith, TransformKeys, TransformValues

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-24 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663118#comment-16663118
 ] 

Wenchen Fan commented on SPARK-25829:
-

More investigation on "later entry wins".

If we still allow duplicated keys in map physically, following functions need 
to be updated:
Explode, PosExplode, GetMapValue, MapKeys, MapValues, MapEntries, 
TransformKeys, TransformValues, MapZipWith

If we want to forbid duplicated keys in map, following functions need to be 
updated:
CreateMap, MapFromArrays, MapFromEntries, MapConcat, MapFilter, and also 
reading map from data sources.

So "later entry wins" semantic is more ideal but needs more works.

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-24 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663103#comment-16663103
 ] 

Wenchen Fan commented on SPARK-25829:
-

After more thoughts, both the map lookup behavior and `Dataset.collect` 
behavior are visible to end-users. It's hard to say which one is the official 
semantic as there is no doc, and we have to do behavior change for one of them.

If we want to stick with the "earlier entry wins" semantic, then we need to fix 
the 3 sub-tasks listed here.

If we want to stick with the "later entry wins" semantic, then we need to fix 
the map lookup(GetMapValue) and other related functions like `map_filter`. And 
for 2.4 we should revert these function if they are newly added, like 
`map_filter`.

Any ideas? cc [~rxin] [~LI,Xiao] [~dongjoon] [~viirya] [~mgaido]

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org