[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736645#comment-16736645 ] Dongjoon Hyun commented on SPARK-25823:
---

+1. Thank you, [~hyukjin.kwon].

> map_filter can generate incorrect data
> --------------------------------------
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Dongjoon Hyun
> Priority: Critical
> Labels: correctness
>
> This is not a regression, because it occurs only in new higher-order functions like `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows duplicate keys. If we want to allow this difference in the new higher-order functions, we had better add a warning about it to these functions, at least after the RC4 vote passes. Otherwise, this will surprise Presto-based users.
>
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3}   {1:2}
> {code}
>
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>    a   | _col1
> -------+-------
>  {1=3} | {}
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
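The discrepancy in the description can be modeled without Spark at all. The following is a hedged, plain-Python sketch (not Spark internals): it represents a map that tolerates duplicate keys as a list of (key, value) entries, appends entries on `map_concat` the way Spark 2.4 is described as doing, and deduplicates only at display time, where the last entry wins.

```python
def map_concat(*maps):
    # Spark-2.4-like (as described in this issue): just append entries,
    # duplicate keys and all. Presto instead deduplicates here.
    out = []
    for m in maps:
        out.extend(m)
    return out

def map_filter(entries, pred):
    # Filter runs over every stored entry, including duplicates.
    return [(k, v) for k, v in entries if pred(k, v)]

def display(entries):
    # On display/collect, the last duplicate wins.
    d = {}
    for k, v in entries:
        d[k] = v
    return d

m = map_concat([(1, 2)], [(1, 3)])
print(display(m))                                   # {1: 3}
print(display(map_filter(m, lambda k, v: v == 2)))  # {1: 2}
```

Because the hidden entry (1, 2) survives inside the stored map, `map_filter` can surface a value that was never visible in the displayed map — exactly the `{1:3}` / `{1:2}` pair shown above. A Presto-style variant that deduplicated inside `map_concat` would return `{}` from the filter instead.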
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736582#comment-16736582 ] Hyukjin Kwon commented on SPARK-25823:
--

Looks like the [~Thincrs] bot is still active. I'm going to ask directly via email. If the bot remains active, I'm going to open an INFRA JIRA to ban it.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736446#comment-16736446 ] Thincrs commented on SPARK-25823:
-

A user of thincrs has selected this issue. Deadline: Mon, Jan 14, 2019 10:32 PM
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662899#comment-16662899 ] Dongjoon Hyun commented on SPARK-25823:
---

Right. That one is cosmetic. I sent my opinion to the dev mailing list, too. Please feel free to adjust the priority if needed for releasing 2.4.0.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662891#comment-16662891 ] Sean Owen commented on SPARK-25823:
---

I was referring to your comment at https://issues.apache.org/jira/browse/SPARK-25824?focusedCommentId=16662725=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16662725 – this one is not cosmetic.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662760#comment-16662760 ] Dongjoon Hyun commented on SPARK-25823:
---

For me, this is a data-correctness issue in the `map_filter` operation. We had better fix it, so I'm investigating. The Apache Spark PMC may choose another path for the 2.4.0 release; I respect that as well.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662743#comment-16662743 ] Dongjoon Hyun commented on SPARK-25823:
---

Sorry, but who says this issue is cosmetic? I think you are confusing the two issues from clicking through the email links. :)
bq. Got it, you're saying that's cosmetic.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662737#comment-16662737 ] Sean Owen commented on SPARK-25823:
---

Got it, you're saying that's cosmetic. But I understood from here that there's a real underlying issue in map(). That's what this is about, then? And is it a blocker? It again seems like something that isn't a hard blocker, but whether to fix it for 2.4 also depends on how long a fix would take and how reliable it would be.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662735#comment-16662735 ] Dongjoon Hyun commented on SPARK-25823:
---

[~srowen] and [~cloud_fan], to be clear:
- This issue doesn't aim to fix `CreateMap` or the existing `map_keys`/`map_values` (please see the description).
- This issue only aims to fix the wrongly materialized cases like CTAS.

Is it correct that `SELECT map_filter(m)` returns values never seen in `SELECT m`? Do you think so?
{code}
CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c
{code}
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662726#comment-16662726 ] Dongjoon Hyun commented on SPARK-25823:
---

[~srowen], I commented on SPARK-25824. Please see there.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662717#comment-16662717 ] Sean Owen commented on SPARK-25823:
---

Hm, another tough one. It's not so much about what these new functions do as about what map() already does in 2.3.0. Yeah, the current behavior isn't even internally consistent. It's a bug, and that's what SPARK-25824 tracks now (right? a bug, not an improvement?). Is this a duplicate, or is it here to say we should note the map() behavior as a known issue? I'd again say this isn't a blocker, even though it's a regression from a significantly older release. I'm still OK drawing that distinction, as it seems to be the constructive place to draw a bright line.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662673#comment-16662673 ] Dongjoon Hyun commented on SPARK-25823:
---

Ah, got it. I was thinking of a different one. For mine, I created SPARK-25824 as a minor improvement for Spark 3.0.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662666#comment-16662666 ] Wenchen Fan commented on SPARK-25823:
-

Anyway we should make map lookup and toString/collect consistent. It's weird if `map(1,2,1,3)[1]` returns 2 but `string(map(1,2,1,3))` returns 1->3.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662660#comment-16662660 ] Dongjoon Hyun commented on SPARK-25823:
---

BTW, [~cloud_fan], I'm looking into this issue. I'll create another improvement issue to fix the `show` function, per your comment:
bq. BTW one improvement we can do is to remove duplicated map keys when converting map values to string, to make it invisible to end-users.
It's a regression introduced by SPARK-23023 in Spark 2.3.0. cc [~maropu].
{code}
scala> sql("SELECT map(1,2,1,3)").show
+---------------+
|map(1, 2, 1, 3)|
+---------------+
|    Map(1 -> 3)|
+---------------+

scala> spark.version
res2: String = 2.2.2
{code}
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662646#comment-16662646 ] Wenchen Fan commented on SPARK-25823:
-

[~dongjoon] good catch! I think we should update collect to match the behavior of map lookup.

Going back to this ticket: the current behavior is different from Presto but is consistent with how the map type behaves in Spark. If others think this is serious, I'd suggest we remove the map-related higher-order functions from 2.4. However, we can't remove `CreateMap`, so the behavior of the map type in Spark would still be as it was. Personally I don't want to remove the map-related higher-order functions, as they follow the map type semantics in Spark and are implemented correctly. The only benefit I can think of is not spreading the unexpected behavior of the map type in Spark. In the master branch we can work on making Spark's map type consistent with Presto.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662618#comment-16662618 ] Dongjoon Hyun commented on SPARK-25823:
---

Right. This has been a long-standing issue since `CreateMap` was added. And when we collect, the last entry wins.
{code}
scala> sql("SELECT map(1,2,1,3)").collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
{code}
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662616#comment-16662616 ] Wenchen Fan commented on SPARK-25823:
-

BTW one improvement we can do is to remove duplicated map keys when converting map values to string, to make it invisible to end-users.
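The suggested improvement can be sketched outside Spark. This is a hedged, plain-Python illustration (not Spark's implementation): deduplicate keys only at render time, keeping the first occurrence so the printed value agrees with first-match lookup.

```python
def render_map(entries):
    # Render a list of (key, value) entries as a map string, dropping
    # duplicate keys. Keeping the FIRST occurrence makes the rendered
    # value agree with a lookup that returns the first matching entry.
    seen = {}
    for k, v in entries:
        if k not in seen:
            seen[k] = v
    return "[" + ", ".join(f"{k} -> {v}" for k, v in seen.items()) + "]"

print(render_map([(1, 2), (1, 3)]))  # [1 -> 2]
```

The design point is only that rendering and lookup should pick the same winner; whether first-wins or last-wins is the right policy is exactly what this thread debates.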
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662613#comment-16662613 ] Wenchen Fan commented on SPARK-25823:
-

I did a few experiments:
{code}
scala> sql("SELECT map(1,2,1,3)").show
+----------------+
| map(1, 2, 1, 3)|
+----------------+
|[1 -> 2, 1 -> 3]|
+----------------+

scala> sql("SELECT map(1,2,1,3)[1]").show
+------------------+
|map(1, 2, 1, 3)[1]|
+------------------+
|                 2|
+------------------+
{code}
So this is a long-standing problem: Spark allows duplicated map keys, and during lookup the first matched entry wins. I'm not sure we should change this behavior for now. Maybe we should find a place to document it.
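The inconsistency in these experiments reduces to two dedup policies applied to the same data. A hedged, plain-Python simulation (a model of the described behavior, not Spark code): lookup scans the entries and returns the first match, while collect/display builds a dict, so the last duplicate wins.

```python
# map(1,2,1,3): a map stored as entries with a duplicated key.
entries = [(1, 2), (1, 3)]

def lookup(entries, key):
    # Scan in order; the FIRST matched entry wins.
    for k, v in entries:
        if k == key:
            return v
    return None

print(lookup(entries, 1))  # 2
print(dict(entries))       # {1: 3} -- the LAST duplicate wins
```

The two code paths disagree on which duplicate is authoritative, which is why `map(1,2,1,3)[1]` shows 2 while the displayed map shows `1 -> 3`.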
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662583#comment-16662583 ] Dongjoon Hyun commented on SPARK-25823:
---

cc [~cloud_fan], [~srowen], [~smilegator], [~hyukjin.kwon] and [~kabhwan].