[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736645#comment-16736645 ] Dongjoon Hyun commented on SPARK-25823:
---

+1. Thank you, [~hyukjin.kwon].

> map_filter can generate incorrect data
> --------------------------------------
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Dongjoon Hyun
> Priority: Critical
> Labels: correctness
>
> This is not a regression, because it occurs only in new higher-order functions like `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows duplicate keys. If we want to allow this difference in the new higher-order functions, we had better add a warning about it to these functions, at least after the RC4 vote passes. Otherwise, this will surprise Presto-based users.
>
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3}   {1:2}
> {code}
>
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>    a   | _col1
> -------+-------
>  {1=3} | {}
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
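The discrepancy in the description can be modeled without Spark at all. The following is a hedged, plain-Python sketch (not Spark internals): it represents a map that tolerates duplicate keys as a list of (key, value) entries, appends entries on `map_concat` the way Spark 2.4 is described as doing, and deduplicates only at display time, where the last entry wins.

```python
def map_concat(*maps):
    # Spark-2.4-like (as described in this issue): just append entries,
    # duplicate keys and all. Presto instead deduplicates here.
    out = []
    for m in maps:
        out.extend(m)
    return out

def map_filter(entries, pred):
    # Filter runs over every stored entry, including duplicates.
    return [(k, v) for k, v in entries if pred(k, v)]

def display(entries):
    # On display/collect, the last duplicate wins.
    d = {}
    for k, v in entries:
        d[k] = v
    return d

m = map_concat([(1, 2)], [(1, 3)])
print(display(m))                                   # {1: 3}
print(display(map_filter(m, lambda k, v: v == 2)))  # {1: 2}
```

Because the hidden entry (1, 2) survives inside the stored map, `map_filter` can surface a value that was never visible in the displayed map — exactly the `{1:3}` / `{1:2}` pair shown above. A Presto-style variant that deduplicated inside `map_concat` would return `{}` from the filter instead.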
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736582#comment-16736582 ] Hyukjin Kwon commented on SPARK-25823:
--

Looks like the [~Thincrs] bot is still active. I'm going to ask directly via email. If the bot remains active, I'm going to open an INFRA JIRA to ban it.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736446#comment-16736446 ] Thincrs commented on SPARK-25823:
-

A user of thincrs has selected this issue. Deadline: Mon, Jan 14, 2019 10:32 PM
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662899#comment-16662899 ] Dongjoon Hyun commented on SPARK-25823:
---

Right. That one is cosmetic. I sent my opinion to the dev mailing list, too. Please feel free to adjust the priority if needed for releasing 2.4.0.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662891#comment-16662891 ] Sean Owen commented on SPARK-25823:
---

I was referring to your comment at https://issues.apache.org/jira/browse/SPARK-25824?focusedCommentId=16662725=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16662725 – this one is not cosmetic.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662760#comment-16662760 ] Dongjoon Hyun commented on SPARK-25823:
---

For me, this is a data-correctness issue in the `map_filter` operation. We had better fix it, so I'm investigating. The Apache Spark PMC may choose another path for the 2.4.0 release; I respect that as well.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662743#comment-16662743 ] Dongjoon Hyun commented on SPARK-25823:
---

Sorry, but who says this issue is cosmetic? I think you are confusing the two issues from clicking through the email links. :)
bq. Got it, you're saying that's cosmetic.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662737#comment-16662737 ] Sean Owen commented on SPARK-25823:
---

Got it, you're saying that's cosmetic. But I understood from here that there's a real underlying issue in map(). That's what this is about, then? And is it a blocker? It again seems like something that isn't a hard blocker, but whether to fix it for 2.4 also depends on how long a fix would take and how reliable it would be.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662735#comment-16662735 ] Dongjoon Hyun commented on SPARK-25823:
---

[~srowen] and [~cloud_fan], to be clear:
- This issue doesn't aim to fix `CreateMap` or the existing `map_keys`/`map_values` (please see the description).
- This issue only aims to fix the wrongly materialized cases like CTAS.

Is it correct that `SELECT map_filter(m)` returns values never seen in `SELECT m`? Do you think so?
{code}
CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c
{code}
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662726#comment-16662726 ] Dongjoon Hyun commented on SPARK-25823:
---

[~srowen], I commented on SPARK-25824. Please see there.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662717#comment-16662717 ] Sean Owen commented on SPARK-25823:
---

Hm, another tough one. It's not so much about what these new functions do as about what map() already does in 2.3.0. Yeah, the current behavior isn't even internally consistent. It's a bug, and that's what SPARK-25824 tracks now (right? a bug, not an improvement?). Is this a duplicate, or is it here to say we should note the map() behavior as a known issue? I'd again say this isn't a blocker, even though it's a regression from a significantly older release. I'm still OK drawing that distinction, as it seems to be the constructive place to draw a bright line.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662673#comment-16662673 ] Dongjoon Hyun commented on SPARK-25823:
---

Ah, got it. I was thinking of a different one. For mine, I created SPARK-25824 as a minor improvement for Spark 3.0.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662666#comment-16662666 ] Wenchen Fan commented on SPARK-25823:
-

Anyway we should make map lookup and toString/collect consistent. It's weird if `map(1,2,1,3)[1]` returns 2 but `string(map(1,2,1,3))` returns 1->3.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662660#comment-16662660 ] Dongjoon Hyun commented on SPARK-25823:
---

BTW, [~cloud_fan], I'm looking into this issue. I'll create another improvement issue to fix the `show` function, per your comment:
bq. BTW one improvement we can do is to remove duplicated map keys when converting map values to string, to make it invisible to end-users.
It's a regression introduced by SPARK-23023 in Spark 2.3.0. cc [~maropu].
{code}
scala> sql("SELECT map(1,2,1,3)").show
+---------------+
|map(1, 2, 1, 3)|
+---------------+
|    Map(1 -> 3)|
+---------------+

scala> spark.version
res2: String = 2.2.2
{code}
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662646#comment-16662646 ] Wenchen Fan commented on SPARK-25823:
-

[~dongjoon] good catch! I think we should update collect to match the behavior of map lookup.

Going back to this ticket: the current behavior is different from Presto but is consistent with how the map type behaves in Spark. If others think this is serious, I'd suggest we remove the map-related higher-order functions from 2.4. However, we can't remove `CreateMap`, so the behavior of the map type in Spark would still be as it was. Personally I don't want to remove the map-related higher-order functions, as they follow the map type semantics in Spark and are implemented correctly. The only benefit I can think of is not spreading the unexpected behavior of the map type in Spark. In the master branch we can work on making Spark's map type consistent with Presto.
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662618#comment-16662618 ] Dongjoon Hyun commented on SPARK-25823:
---

Right. This has been a long-standing issue since `CreateMap` was added. And when we collect, the last entry wins.
{code}
scala> sql("SELECT map(1,2,1,3)").collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
{code}
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662616#comment-16662616 ] Wenchen Fan commented on SPARK-25823:
-

BTW one improvement we can do is to remove duplicated map keys when converting map values to string, to make it invisible to end-users.
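The suggested improvement can be sketched outside Spark. This is a hedged, plain-Python illustration (not Spark's implementation): deduplicate keys only at render time, keeping the first occurrence so the printed value agrees with first-match lookup.

```python
def render_map(entries):
    # Render a list of (key, value) entries as a map string, dropping
    # duplicate keys. Keeping the FIRST occurrence makes the rendered
    # value agree with a lookup that returns the first matching entry.
    seen = {}
    for k, v in entries:
        if k not in seen:
            seen[k] = v
    return "[" + ", ".join(f"{k} -> {v}" for k, v in seen.items()) + "]"

print(render_map([(1, 2), (1, 3)]))  # [1 -> 2]
```

The design point is only that rendering and lookup should pick the same winner; whether first-wins or last-wins is the right policy is exactly what this thread debates.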
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662613#comment-16662613 ] Wenchen Fan commented on SPARK-25823:
-

I did a few experiments:
{code}
scala> sql("SELECT map(1,2,1,3)").show
+----------------+
| map(1, 2, 1, 3)|
+----------------+
|[1 -> 2, 1 -> 3]|
+----------------+

scala> sql("SELECT map(1,2,1,3)[1]").show
+------------------+
|map(1, 2, 1, 3)[1]|
+------------------+
|                 2|
+------------------+
{code}
So this is a long-standing problem: Spark allows duplicated map keys, and during lookup the first matched entry wins. I'm not sure we should change this behavior for now. Maybe we should find a place to document it.
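The inconsistency in these experiments reduces to two dedup policies applied to the same data. A hedged, plain-Python simulation (a model of the described behavior, not Spark code): lookup scans the entries and returns the first match, while collect/display builds a dict, so the last duplicate wins.

```python
# map(1,2,1,3): a map stored as entries with a duplicated key.
entries = [(1, 2), (1, 3)]

def lookup(entries, key):
    # Scan in order; the FIRST matched entry wins.
    for k, v in entries:
        if k == key:
            return v
    return None

print(lookup(entries, 1))  # 2
print(dict(entries))       # {1: 3} -- the LAST duplicate wins
```

The two code paths disagree on which duplicate is authoritative, which is why `map(1,2,1,3)[1]` shows 2 while the displayed map shows `1 -> 3`.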
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662583#comment-16662583 ] Dongjoon Hyun commented on SPARK-25823:
---

cc [~cloud_fan], [~srowen], [~smilegator], [~hyukjin.kwon] and [~kabhwan].