[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently
[ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696394#comment-16696394 ]

Apache Spark commented on SPARK-25829:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/23124

> Duplicated map keys are not handled consistently
> ------------------------------------------------
>
>                 Key: SPARK-25829
>                 URL: https://issues.apache.org/jira/browse/SPARK-25829
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys.
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +------------------+
> |map(1, 2, 1, 3)[1]|
> +------------------+
> |                 2|
> +------------------+
> {code}
> However, this handling is not applied consistently.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently
[ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663553#comment-16663553 ]

Marco Gaido commented on SPARK-25829:
-------------------------------------

I think the main issue is that, since this is not part of the SQL standard, every DB behaves in its own way. E.g. Postgres just says that when duplicate keys are entered, there is no guarantee on the result (https://www.postgresql.org/docs/9.0/static/hstore.html); not a great policy, I agree. Maybe we can check Hive, since Spark takes much of its behavior from it.

Anyway, I think we just need to define a coherent behavior across the codebase. One consideration is that enforcing a policy like Presto's (i.e. failing in such a situation) has 2 main drawbacks:
 - We usually don't fail on bad data (most of the time we return NULL instead of throwing exceptions in other situations);
 - Checking whether a key is already present is very inefficient with the current {{ArrayData}} representation. We can do workarounds for this, but we would need to replicate them in every function which can produce keys, so it is going to be problematic to maintain.
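
To illustrate the cost concern in the second drawback: a duplicate check that rescans the key array for every key is quadratic in the number of entries, while a single pass with a hash set is linear. The sketch below is plain Scala over an `Array[Int]`, not Spark's {{ArrayData}}; the object and method names are hypothetical and only meant to show the shape of such a workaround.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Spark: a plain-Scala model of checking
// a map's key array for duplicates.
object DuplicateKeyCheck {
  // Returns the first key that appears more than once, if any.
  // One pass with a hash set instead of rescanning the key array per key.
  def firstDuplicate(keys: Array[Int]): Option[Int] = {
    val seen = mutable.HashSet.empty[Int]
    // HashSet.add returns false when the element was already present.
    keys.find(k => !seen.add(k))
  }
}
```

A "fail on duplicates" policy would throw when `firstDuplicate` returns a key; the maintenance problem raised above is that every map-producing function would need to call something like this.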
[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently
[ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663539#comment-16663539 ]

Liang-Chi Hsieh commented on SPARK-25829:
-----------------------------------------

Although I think the inconsistent handling has existed for a while, if we decide on "later entry wins" and revert some functions from 2.4, will this be a blocker for 2.4?
[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently
[ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663484#comment-16663484 ]

Liang-Chi Hsieh commented on SPARK-25829:
-----------------------------------------

Besides Java/Scala, is there any related definition in the SQL standard? Or maybe we can be consistent with the common behavior of other SQL systems.
[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently
[ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663394#comment-16663394 ]

Kazuaki Ishizaki commented on SPARK-25829:
------------------------------------------

I am curious about the behavior in other systems such as Presto. Here are test cases for [array|https://github.com/prestodb/presto/blob/master/presto-main/src/test/java/com/facebook/presto/type/TestArrayOperators.java] and [map|https://github.com/prestodb/presto/blob/master/presto-main/src/test/java/com/facebook/presto/type/TestMapOperators.java].
[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently
[ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663368#comment-16663368 ]

Kazuaki Ishizaki commented on SPARK-25829:
------------------------------------------

cc [~ueshin]
[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently
[ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663230#comment-16663230 ]

Dongjoon Hyun commented on SPARK-25829:
---------------------------------------

Thank you for the further investigation! Neither task looks easy.
For me, +1 for `later entry wins` semantics because it's the Java/Scala language style and many users know those languages. Also, Spark works that way, especially during write operations.
[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently
[ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663121#comment-16663121 ]

Wenchen Fan commented on SPARK-25829:
-------------------------------------

If we decide to follow "later entry wins", the following functions need to be reverted from 2.4:
MapFilter, MapZipWith, TransformKeys, TransformValues
[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently
[ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663118#comment-16663118 ]

Wenchen Fan commented on SPARK-25829:
-------------------------------------

More investigation on "later entry wins".

If we still allow duplicated keys in maps physically, the following functions need to be updated:
Explode, PosExplode, GetMapValue, MapKeys, MapValues, MapEntries, TransformKeys, TransformValues, MapZipWith

If we want to forbid duplicated keys in maps, the following functions need to be updated:
CreateMap, MapFromArrays, MapFromEntries, MapConcat, MapFilter, and also reading maps from data sources.

So the "later entry wins" semantic is more ideal but needs more work.
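
As a rough illustration of the second option (forbidding duplicated keys physically), a map-producing function could deduplicate while it builds the key and value arrays, letting the later value overwrite the earlier one. This is a minimal plain-Scala sketch under that assumption; `DedupMapBuilder` and `build` are hypothetical names, and the code is not taken from Spark or from any pull request.

```scala
import scala.collection.mutable

// Hypothetical helper, not Spark code: builds parallel key/value arrays
// with no duplicated keys, applying "later entry wins".
object DedupMapBuilder {
  def build(entries: Seq[(Int, Int)]): (Array[Int], Array[Int]) = {
    // Maps each key to its position in the output arrays.
    val indexOfKey = mutable.HashMap.empty[Int, Int]
    val keys = mutable.ArrayBuffer.empty[Int]
    val values = mutable.ArrayBuffer.empty[Int]
    for ((k, v) <- entries) {
      indexOfKey.get(k) match {
        case Some(i) =>
          values(i) = v // key seen before: the later value overwrites
        case None =>
          indexOfKey(k) = keys.length
          keys += k
          values += v
      }
    }
    (keys.toArray, values.toArray)
  }
}
```

With the input `Seq(1 -> 2, 1 -> 3)` this yields keys `[1]` and values `[3]`, so any later lookup, explode, or collect sees a single entry per key and no reader-side function needs to know about duplicates.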
[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently
[ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663103#comment-16663103 ]

Wenchen Fan commented on SPARK-25829:
-------------------------------------

After more thought, both the map lookup behavior and the `Dataset.collect` behavior are visible to end users. It's hard to say which one is the official semantic as there is no doc, and we have to make a behavior change for one of them.

If we want to stick with the "earlier entry wins" semantic, then we need to fix the 3 sub-tasks listed here.

If we want to stick with the "later entry wins" semantic, then we need to fix the map lookup (GetMapValue) and other related functions like `map_filter`. And for 2.4 we should revert these functions if they are newly added, like `map_filter`.

Any ideas? cc [~rxin] [~LI,Xiao] [~dongjoon] [~viirya] [~mgaido]
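
The two user-visible behaviors can be modelled in plain Scala. The sketch below assumes a map stored as parallel key and value arrays that may contain duplicated keys; a front-to-back scan gives "earlier entry wins", while converting the same arrays into a Scala `Map` gives "later entry wins". It is an illustration only, with hypothetical names, and not the code Spark runs for GetMapValue or `Dataset.collect`.

```scala
// Plain-Scala model, not Spark's implementation; names are hypothetical.
object DuplicateKeyLookup {
  // "Earlier entry wins": scan from the front and return the first match.
  def lookupFirst(keys: Array[Int], values: Array[Int], key: Int): Option[Int] = {
    val i = keys.indexOf(key)
    if (i >= 0) Some(values(i)) else None
  }

  // "Later entry wins": building a Scala Map overwrites earlier entries.
  def toScalaMap(keys: Array[Int], values: Array[Int]): Map[Int, Int] =
    keys.zip(values).toMap

  def main(args: Array[String]): Unit = {
    // Same physical data as map(1, 2, 1, 3) in the issue description.
    val keys = Array(1, 1)
    val values = Array(2, 3)
    println(lookupFirst(keys, values, 1))    // prints Some(2)
    println(toScalaMap(keys, values).get(1)) // prints Some(3)
  }
}
```

The same stored entries therefore answer `2` or `3` for key `1` depending on which path reads them, which is the inconsistency this issue is about.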