[jira] [Updated] (SPARK-25832) remove newly added map related functions from FunctionRegistry

2018-10-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25832:
--
Priority: Blocker  (was: Major)

> remove newly added map related functions from FunctionRegistry
> --
>
> Key: SPARK-25832
> URL: https://issues.apache.org/jira/browse/SPARK-25832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>







[jira] [Assigned] (SPARK-25832) remove newly added map related functions from FunctionRegistry

2018-10-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25832:


Assignee: Apache Spark  (was: Wenchen Fan)

> remove newly added map related functions from FunctionRegistry
> --
>
> Key: SPARK-25832
> URL: https://issues.apache.org/jira/browse/SPARK-25832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-25832) remove newly added map related functions from FunctionRegistry

2018-10-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25832:


Assignee: Wenchen Fan  (was: Apache Spark)

> remove newly added map related functions from FunctionRegistry
> --
>
> Key: SPARK-25832
> URL: https://issues.apache.org/jira/browse/SPARK-25832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Commented] (SPARK-25832) remove newly added map related functions from FunctionRegistry

2018-10-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663233#comment-16663233
 ] 

Apache Spark commented on SPARK-25832:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22821

> remove newly added map related functions from FunctionRegistry
> --
>
> Key: SPARK-25832
> URL: https://issues.apache.org/jira/browse/SPARK-25832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Created] (SPARK-25832) remove newly added map related functions from FunctionRegistry

2018-10-24 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25832:
---

 Summary: remove newly added map related functions from 
FunctionRegistry
 Key: SPARK-25832
 URL: https://issues.apache.org/jira/browse/SPARK-25832
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663230#comment-16663230
 ] 

Dongjoon Hyun commented on SPARK-25829:
---

Thank you for the further investigation! Neither task looks easy. For me, +1 for 
`later entry wins` semantics because it matches the Java/Scala language style and 
many users already know those languages. Also, Spark already works that way, 
especially during write operations.
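
For reference, the inconsistency being discussed, with outputs as reported in the 
SPARK-25829 description and the SPARK-25830 sub-task (Spark 2.4.0-rc behavior):

{code}
// Map lookup: "earlier entry wins" (from the SPARK-25829 description)
scala> sql("SELECT map(1,2,1,3)[1]").show
// returns 2

// Dataset.collect: "later entry wins" (from the SPARK-25830 description)
scala> sql("SELECT map(1,2,1,3)").collect
// returns Array([Map(1 -> 3)])
{code}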

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.






[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-24 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663121#comment-16663121
 ] 

Wenchen Fan commented on SPARK-25829:
-

If we decide to follow "later entry wins", the following functions need to be 
reverted from 2.4: MapFilter, MapZipWith, TransformKeys, TransformValues.

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.






[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-24 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663118#comment-16663118
 ] 

Wenchen Fan commented on SPARK-25829:
-

More investigation on "later entry wins".

If we still allow duplicated keys in maps physically, the following functions 
need to be updated:
Explode, PosExplode, GetMapValue, MapKeys, MapValues, MapEntries, 
TransformKeys, TransformValues, MapZipWith

If we want to forbid duplicated keys in maps, the following functions need to 
be updated:
CreateMap, MapFromArrays, MapFromEntries, MapConcat, MapFilter, and also 
reading maps from data sources.

So the "later entry wins" semantic is more ideal but needs more work.
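
For illustration, a minimal sketch (not Spark's actual code) of the kind of 
"later entry wins" deduplication that map-creating functions such as CreateMap 
would need if duplicated keys were forbidden physically:

{code}
// Illustrative sketch only: normalize key/value arrays so the later entry wins.
def dedupLaterWins[K, V](keys: Seq[K], values: Seq[V]): (Seq[K], Seq[V]) = {
  val merged = scala.collection.mutable.LinkedHashMap.empty[K, V]
  keys.zip(values).foreach { case (k, v) => merged(k) = v } // later value overwrites
  (merged.keys.toSeq, merged.values.toSeq)
}

// dedupLaterWins(Seq(1, 1), Seq(2, 3)) == (Seq(1), Seq(3))
{code}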

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.






[jira] [Comment Edited] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-24 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663103#comment-16663103
 ] 

Wenchen Fan edited comment on SPARK-25829 at 10/25/18 2:20 AM:
---

After more thought, both the map lookup behavior and the `Dataset.collect` 
behavior are visible to end-users. It's hard to say which one is the official 
semantic as there is no doc, and we have to make a behavior change for one of them.

If we want to stick with the "earlier entry wins" semantic, then we need to fix 
the 3 sub-tasks listed here.

If we want to stick with the "later entry wins" semantic, then we need to fix 
the map lookup (GetMapValue) and other related functions like `map_filter`, or 
deduplicate map keys at all the places that may create a map. And for 2.4 we 
should revert these functions if they are newly added, like `map_filter`.

Any ideas? cc [~rxin] [~LI,Xiao] [~dongjoon] [~viirya] [~mgaido]


was (Author: cloud_fan):
After more thoughts, both the map lookup behavior and `Dataset.collect` 
behavior are visible to end-users. It's hard to say which one is the official 
semantic as there is no doc, and we have to do behavior change for one of them.

If we want to stick with the "earlier entry wins" semantic, then we need to fix 
the 3 sub-tasks listed here.

If we want to stick with the "later entry wins" semantic, then we need to fix 
the map lookup(GetMapValue) and other related functions like `map_filter`. And 
for 2.4 we should revert these function if they are newly added, like 
`map_filter`.

Any ideas? cc [~rxin] [~LI,Xiao] [~dongjoon] [~viirya] [~mgaido]

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.






[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-24 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663103#comment-16663103
 ] 

Wenchen Fan commented on SPARK-25829:
-

After more thought, both the map lookup behavior and the `Dataset.collect` 
behavior are visible to end-users. It's hard to say which one is the official 
semantic as there is no doc, and we have to make a behavior change for one of them.

If we want to stick with the "earlier entry wins" semantic, then we need to fix 
the 3 sub-tasks listed here.

If we want to stick with the "later entry wins" semantic, then we need to fix 
the map lookup (GetMapValue) and other related functions like `map_filter`. And 
for 2.4 we should revert these functions if they are newly added, like 
`map_filter`.

Any ideas? cc [~rxin] [~LI,Xiao] [~dongjoon] [~viirya] [~mgaido]
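
For illustration, a minimal sketch (a hypothetical helper, not the real 
GetMapValue) of what a "later entry wins" lookup over the raw map arrays would 
mean:

{code}
// Hypothetical helper: under "later entry wins", a lookup takes the last
// matching key instead of the first.
def lookupLaterWins[K, V](keys: Seq[K], values: Seq[V], key: K): Option[V] = {
  val i = keys.lastIndexOf(key) // last occurrence wins
  if (i >= 0) Some(values(i)) else None
}

// lookupLaterWins(Seq(1, 1), Seq(2, 3), 1) == Some(3)
{code}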

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.






[jira] [Resolved] (SPARK-24917) Sending a partition over netty results in 2x memory usage

2018-10-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24917.
---
Resolution: Won't Fix

Evidently obsoleted by the update to Netty 4.1.30

> Sending a partition over netty results in 2x memory usage
> -
>
> Key: SPARK-24917
> URL: https://issues.apache.org/jira/browse/SPARK-24917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: Vincent
>Priority: Major
>
> Hello,
> while investigating some OOM errors in Spark 2.2 [(here's my call 
> stack)|https://image.ibb.co/hHa2R8/sparkOOM.png], I found the following 
> behavior, which I think is weird:
>  * a request happens to send a partition over the network
>  * this partition is 1.9 GB and is persisted in memory
>  * this partition is apparently stored in a ByteBufferBlockData, which is made 
> of a ChunkedByteBuffer, i.e. a list of (lots of) 4 MB ByteBuffers
>  * the call to toNetty() is supposed to only wrap all the arrays and not 
> allocate any memory
>  * yet the call stack shows that netty is allocating memory and is trying to 
> consolidate all the chunks into one big 1.9 GB array
>  * this means that at this point the memory footprint is 2x the size of the 
> actual partition (which is huge when the partition is 1.9 GB)
> Is this transient allocation expected?
> After digging, it turns out that the actual copy is due to [this 
> method|https://github.com/netty/netty/blob/4.0/buffer/src/main/java/io/netty/buffer/Unpooled.java#L260]
>  in netty. If my initial buffer is made of more than DEFAULT_MAX_COMPONENTS 
> (16) components, it triggers a re-allocation of the whole buffer. This netty 
> issue was fixed in this recent change: 
> [https://github.com/netty/netty/commit/9b95b8ee628983e3e4434da93fffb893edff4aa2]
> As a result, is it possible to benefit from this change somehow in Spark 2.2 
> and above? I don't know how the netty dependencies are handled for Spark.
> NB: it seems this ticket: [https://jira.apache.org/jira/browse/SPARK-24307] 
> kind of changed the approach for Spark 2.4 by bypassing the netty buffer 
> altogether. However, as written in that ticket, this approach *still* needs 
> the *entire* block serialized in memory, so this would be a downgrade from 
> fixing the netty issue when your buffer is < 2 GB.
> Thanks!
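
For a rough sense of scale (numbers taken from the description above; the chunk 
count is approximate):

{code}
// Rough arithmetic for the case above: a ~1.9 GB partition held as 4 MB chunks.
val partitionBytes = 1.9 * 1024 * 1024 * 1024 // ~2.0e9 bytes
val chunkBytes = 4 * 1024 * 1024
val numChunks = math.ceil(partitionBytes / chunkBytes).toInt // ~487 chunks

// 487 components is far above DEFAULT_MAX_COMPONENTS (16), so netty 4.0
// consolidates them into a single ~1.9 GB array, i.e. roughly 2x the
// partition's footprint while the copy is in flight.
{code}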






[jira] [Updated] (SPARK-25830) should apply "earlier entry wins" in Dataset.collect

2018-10-24 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25830:

Description: 
{code}
scala> sql("select map(1,2,1,3)").collect
res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
{code}

We mistakenly apply "later entry wins"

> should apply "earlier entry wins" in Dataset.collect
> 
>
> Key: SPARK-25830
> URL: https://issues.apache.org/jira/browse/SPARK-25830
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> {code}
> scala> sql("select map(1,2,1,3)").collect
> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> {code}
> We mistakenly apply "later entry wins"






[jira] [Updated] (SPARK-25831) should apply "earlier entry wins" in hive map value converter

2018-10-24 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25831:

Description: 
{code}
scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
res11: org.apache.spark.sql.DataFrame = []

scala> sql("select * from t").show
++
| map|
++
|[1 -> 3]|
++
{code}

We mistakenly apply "later entry wins"

> should apply "earlier entry wins" in hive map value converter
> -
>
> Key: SPARK-25831
> URL: https://issues.apache.org/jira/browse/SPARK-25831
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> {code}
> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
> res11: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from t").show
> ++
> | map|
> ++
> |[1 -> 3]|
> ++
> {code}
> We mistakenly apply "later entry wins"






[jira] [Created] (SPARK-25831) should apply "earlier entry wins" in hive map value converter

2018-10-24 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25831:
---

 Summary: should apply "earlier entry wins" in hive map value 
converter
 Key: SPARK-25831
 URL: https://issues.apache.org/jira/browse/SPARK-25831
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan









[jira] [Created] (SPARK-25830) should apply "earlier entry wins" in Dataset.collect

2018-10-24 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25830:
---

 Summary: should apply "earlier entry wins" in Dataset.collect
 Key: SPARK-25830
 URL: https://issues.apache.org/jira/browse/SPARK-25830
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan









[jira] [Updated] (SPARK-25824) Remove duplicated map entries in `showString`

2018-10-24 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25824:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-25829

> Remove duplicated map entries in `showString`
> -
>
> Key: SPARK-25824
> URL: https://issues.apache.org/jira/browse/SPARK-25824
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `showString` doesn't eliminate the duplication, so its output looks different 
> from the result of `collect` and from selecting from saved rows.
> *Spark 2.2.2*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
> scala> sql("SELECT map(1,2,1,3)").collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> scala> sql("SELECT map(1,2,1,3)").show
> +---+
> |map(1, 2, 1, 3)|
> +---+
> |Map(1 -> 3)|
> +---+
> {code}
> *Spark 2.3.0 ~ 2.4.0-rc4*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
> scala> sql("SELECT map(1,2,1,3)").collect
> res1: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> scala> sql("CREATE TABLE m AS SELECT map(1,2,1,3) a")
> scala> sql("SELECT * FROM m").show
> ++
> |   a|
> ++
> |[1 -> 3]|
> ++
> scala> sql("SELECT map(1,2,1,3)").show
> ++
> | map(1, 2, 1, 3)|
> ++
> |[1 -> 2, 1 -> 3]|
> ++
> {code}






[jira] [Updated] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-24 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25829:

Description: 
In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
e.g.
{code}
scala> sql("SELECT map(1,2,1,3)[1]").show
+--+
|map(1, 2, 1, 3)[1]|
+--+
| 2|
+--+
{code}

However, this handling is not applied consistently.

  was:
In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
e.g.
{code}
scala> sql("SELECT map(1,2,1,3)[1]").show
+--+
|map(1, 2, 1, 3)[1]|
+--+
| 2|
+--+
{code}

However, this handling is not applied consistenly.


> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistently.






[jira] [Updated] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-24 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25829:

Description: 
In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
e.g.
{code}
scala> sql("SELECT map(1,2,1,3)[1]").show
+--+
|map(1, 2, 1, 3)[1]|
+--+
| 2|
+--+
{code}

However, this handling is not applied consistenly.

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +--+
> |map(1, 2, 1, 3)[1]|
> +--+
> | 2|
> +--+
> {code}
> However, this handling is not applied consistenly.






[jira] [Created] (SPARK-25829) Duplicated map keys are not handled consistently

2018-10-24 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25829:
---

 Summary: Duplicated map keys are not handled consistently
 Key: SPARK-25829
 URL: https://issues.apache.org/jira/browse/SPARK-25829
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan









[jira] [Commented] (SPARK-25824) Remove duplicated map entries in `showString`

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663060#comment-16663060
 ] 

Dongjoon Hyun commented on SPARK-25824:
---

According to [~cloud_fan]'s analysis, this one is converted to a bug.
- https://lists.apache.org/thread.html/11afc74162b922fbef81db1e96c082f2e6f217d79dc1d82ec2702aef@%3Cdev.spark.apache.org%3E

> Remove duplicated map entries in `showString`
> -
>
> Key: SPARK-25824
> URL: https://issues.apache.org/jira/browse/SPARK-25824
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `showString` doesn't eliminate the duplication, so its output looks different 
> from the result of `collect` and from selecting from saved rows.
> *Spark 2.2.2*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
> scala> sql("SELECT map(1,2,1,3)").collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> scala> sql("SELECT map(1,2,1,3)").show
> +---+
> |map(1, 2, 1, 3)|
> +---+
> |Map(1 -> 3)|
> +---+
> {code}
> *Spark 2.3.0 ~ 2.4.0-rc4*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
> scala> sql("SELECT map(1,2,1,3)").collect
> res1: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> scala> sql("CREATE TABLE m AS SELECT map(1,2,1,3) a")
> scala> sql("SELECT * FROM m").show
> ++
> |   a|
> ++
> |[1 -> 3]|
> ++
> scala> sql("SELECT map(1,2,1,3)").show
> ++
> | map(1, 2, 1, 3)|
> ++
> |[1 -> 2, 1 -> 3]|
> ++
> {code}






[jira] [Updated] (SPARK-25824) Remove duplicated map entries in `showString`

2018-10-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25824:
--
Issue Type: Bug  (was: Improvement)

> Remove duplicated map entries in `showString`
> -
>
> Key: SPARK-25824
> URL: https://issues.apache.org/jira/browse/SPARK-25824
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `showString` doesn't eliminate the duplication, so its output looks different 
> from the result of `collect` and from selecting from saved rows.
> *Spark 2.2.2*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
> scala> sql("SELECT map(1,2,1,3)").collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> scala> sql("SELECT map(1,2,1,3)").show
> +---+
> |map(1, 2, 1, 3)|
> +---+
> |Map(1 -> 3)|
> +---+
> {code}
> *Spark 2.3.0 ~ 2.4.0-rc4*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
> scala> sql("SELECT map(1,2,1,3)").collect
> res1: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> scala> sql("CREATE TABLE m AS SELECT map(1,2,1,3) a")
> scala> sql("SELECT * FROM m").show
> ++
> |   a|
> ++
> |[1 -> 3]|
> ++
> scala> sql("SELECT map(1,2,1,3)").show
> ++
> | map(1, 2, 1, 3)|
> ++
> |[1 -> 2, 1 -> 3]|
> ++
> {code}






[jira] [Commented] (SPARK-25828) Bumping Version of kubernetes.client to latest version

2018-10-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663045#comment-16663045
 ] 

Apache Spark commented on SPARK-25828:
--

User 'ifilonenko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22820

> Bumping Version of kubernetes.client to latest version
> --
>
> Key: SPARK-25828
> URL: https://issues.apache.org/jira/browse/SPARK-25828
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Minor
>
> Upgrade the Kubernetes client version to at least 
> [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] 
> as we are falling behind on fabric8 updates. This will be an update to both 
> kubernetes/core and kubernetes/integration-tests






[jira] [Commented] (SPARK-25828) Bumping Version of kubernetes.client to latest version

2018-10-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663043#comment-16663043
 ] 

Apache Spark commented on SPARK-25828:
--

User 'ifilonenko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22820

> Bumping Version of kubernetes.client to latest version
> --
>
> Key: SPARK-25828
> URL: https://issues.apache.org/jira/browse/SPARK-25828
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Minor
>
> Upgrade the Kubernetes client version to at least 
> [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] 
> as we are falling behind on fabric8 updates. This will be an update to both 
> kubernetes/core and kubernetes/integration-tests






[jira] [Assigned] (SPARK-25828) Bumping Version of kubernetes.client to latest version

2018-10-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25828:


Assignee: Apache Spark

> Bumping Version of kubernetes.client to latest version
> --
>
> Key: SPARK-25828
> URL: https://issues.apache.org/jira/browse/SPARK-25828
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Assignee: Apache Spark
>Priority: Minor
>
> Upgrade the Kubernetes client version to at least 
> [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] 
> as we are falling behind on fabric8 updates. This will be an update to both 
> kubernetes/core and kubernetes/integration-tests






[jira] [Assigned] (SPARK-25828) Bumping Version of kubernetes.client to latest version

2018-10-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25828:


Assignee: (was: Apache Spark)

> Bumping Version of kubernetes.client to latest version
> --
>
> Key: SPARK-25828
> URL: https://issues.apache.org/jira/browse/SPARK-25828
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Minor
>
> Upgrade the Kubernetes client version to at least 
> [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] 
> as we are falling behind on fabric8 updates. This will be an update to both 
> kubernetes/core and kubernetes/integration-tests






[jira] [Commented] (SPARK-25828) Bumping Version of kubernetes.client to latest version

2018-10-24 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663033#comment-16663033
 ] 

Stavros Kontopoulos commented on SPARK-25828:
-

Cool [~eje] [~ifilonenko] are you working on it?

> Bumping Version of kubernetes.client to latest version
> --
>
> Key: SPARK-25828
> URL: https://issues.apache.org/jira/browse/SPARK-25828
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Minor
>
> Upgrade the Kubernetes client version to at least 
> [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] 
> as we are falling behind on fabric8 updates. This will be an update to both 
> kubernetes/core and kubernetes/integration-tests






[jira] [Resolved] (SPARK-23229) Dataset.hint should use planWithBarrier logical plan

2018-10-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23229.
---
Resolution: Won't Fix

> Dataset.hint should use planWithBarrier logical plan
> 
>
> Key: SPARK-23229
> URL: https://issues.apache.org/jira/browse/SPARK-23229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> Every time {{Dataset.hint}} is used it triggers execution of logical 
> commands, their unions and hint resolution (among other things that analyzer 
> does).
> {{hint}} should use {{planWithBarrier}} instead.






[jira] [Commented] (SPARK-25828) Bumping Version of kubernetes.client to latest version

2018-10-24 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662978#comment-16662978
 ] 

Erik Erlandson commented on SPARK-25828:


cc [~skonto]

> Bumping Version of kubernetes.client to latest version
> --
>
> Key: SPARK-25828
> URL: https://issues.apache.org/jira/browse/SPARK-25828
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Minor
>
> Upgrade the Kubernetes client version to at least 
> [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] 
> as we are falling behind on fabric8 updates. This will be an update to both 
> kubernetes/core and kubernetes/integration-tests






[jira] [Updated] (SPARK-25828) Bumping Version of kubernetes.client to latest version

2018-10-24 Thread Ilan Filonenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilan Filonenko updated SPARK-25828:
---
Description: 
Upgrade the Kubernetes client version to at least 
[4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] 
as we are falling behind on fabric8 updates. This will be an update to both 
kubernetes/core and kubernetes/integration-tests


  was:
Upgrade the Kubernetes client version to at least 
[4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] 
as we are falling behind on fabric8 updates. This will be an update to both in 
kubernetes/core and kubernetes/integration-tests



> Bumping Version of kubernetes.client to latest version
> --
>
> Key: SPARK-25828
> URL: https://issues.apache.org/jira/browse/SPARK-25828
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Minor
>
> Upgrade the Kubernetes client version to at least 
> [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] 
> as we are falling behind on fabric8 updates. This will be an update to both 
> kubernetes/core and kubernetes/integration-tests






[jira] [Created] (SPARK-25828) Bumping Version of kubernetes.client to latest version

2018-10-24 Thread Ilan Filonenko (JIRA)
Ilan Filonenko created SPARK-25828:
--

 Summary: Bumping Version of kubernetes.client to latest version
 Key: SPARK-25828
 URL: https://issues.apache.org/jira/browse/SPARK-25828
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Ilan Filonenko


Upgrade the Kubernetes client version to at least 
[4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] 
as we are falling behind on fabric8 updates. This will be an update to both in 
kubernetes/core and kubernetes/integration-tests
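
For illustration only, the referenced fabric8 artifact expressed as an sbt 
dependency (coordinates taken from the link above; the actual Spark build wires 
this through its Maven poms, so this is just a sketch):

{code}
// Illustrative only: the fabric8 client artifact referenced above.
libraryDependencies += "io.fabric8" % "kubernetes-client" % "4.0.0"
{code}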







[jira] [Resolved] (SPARK-24899) Add example of monotonically_increasing_id standard function to scaladoc

2018-10-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24899.
---
Resolution: Won't Fix

> Add example of monotonically_increasing_id standard function to scaladoc
> 
>
> Key: SPARK-24899
> URL: https://issues.apache.org/jira/browse/SPARK-24899
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.3.1
>Reporter: Jacek Laskowski
>Priority: Minor
>
> I think an example of {{monotonically_increasing_id}} standard function in 
> scaladoc would help people understand why the function is monotonically 
> increasing and unique but not consecutive.






[jira] [Assigned] (SPARK-25490) Refactor KryoBenchmark

2018-10-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25490:
-

Assignee: Gengliang Wang

> Refactor KryoBenchmark
> --
>
> Key: SPARK-25490
> URL: https://issues.apache.org/jira/browse/SPARK-25490
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Refactor KryoBenchmark to use main method and print the output as a separate 
> file.






[jira] [Resolved] (SPARK-25490) Refactor KryoBenchmark

2018-10-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25490.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22663
[https://github.com/apache/spark/pull/22663]

> Refactor KryoBenchmark
> --
>
> Key: SPARK-25490
> URL: https://issues.apache.org/jira/browse/SPARK-25490
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Refactor KryoBenchmark to use main method and print the output as a separate 
> file.






[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662899#comment-16662899
 ] 

Dongjoon Hyun commented on SPARK-25823:
---

Right. That one is cosmetic. I send my opinion on dev mailing list, too. Please 
feel free to adjust the priority if needed for releasing 2.4.0.

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because it occurs in new higher-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in the new higher-order 
> functions, we had better add some warning about this difference to these 
> functions, at least after the RC4 vote passes. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}






[jira] [Comment Edited] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662899#comment-16662899
 ] 

Dongjoon Hyun edited comment on SPARK-25823 at 10/24/18 9:42 PM:
-

Right. That one is cosmetic. Not this one. I sent my opinion on dev mailing 
list, too. Please feel free to adjust the priority if needed for releasing 
2.4.0.


was (Author: dongjoon):
Right. That one is cosmetic. I send my opinion on dev mailing list, too. Please 
feel free to adjust the priority if needed for releasing 2.4.0.

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because it occurs in new higher-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in the new higher-order 
> functions, we had better add some warning about this difference to these 
> functions, at least after the RC4 vote passes. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}






[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662891#comment-16662891
 ] 

Sean Owen commented on SPARK-25823:
---

I was referring to your comment at 
https://issues.apache.org/jira/browse/SPARK-25824?focusedCommentId=16662725=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16662725
 – this one is not cosmetic.

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because it occurs in new higher-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in the new higher-order 
> functions, we had better add some warning about this difference to these 
> functions, at least after the RC4 vote passes. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}






[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion

2018-10-24 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662830#comment-16662830
 ] 

kevin yu commented on SPARK-25807:
--

Thanks Sean, OK, I will leave it as it is.

> Mitigate 1-based substr() confusion
> ---
>
> Key: SPARK-25807
> URL: https://issues.apache.org/jira/browse/SPARK-25807
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, PySpark
>Affects Versions: 1.3.0, 2.3.2, 2.4.0, 2.5.0, 3.0.0
>Reporter: Oron Navon
>Priority: Minor
>
> The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's 
> {{SUBSTRING}}, and contradicting both Python's {{substr}} and Java's 
> {{substr}}, which are zero-based.  Both PySpark users and Java API users 
> often naturally expect a 0-based {{substr()}}. Adding to the confusion, 
> {{substr()}} currently allows a {{startPos}} value of 0, which returns the 
> same result as {{startPos==1}}.
> Since changing {{substr()}} to 0-based is probably NOT a reasonable option 
> here, I suggest making one or more of the following changes:
>  # Adding a method {{substr0}}, which would be zero-based
>  # Renaming {{substr}} to {{substr1}}
>  # Making the existing {{substr()}} throw an exception on {{startPos==0}}, 
> which should catch and alert most users who expect zero-based behavior.
> This is my first discussion on this project, apologies for any faux pas.
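
For reference, a minimal spark-shell illustration of the startPos behavior 
described above (the "0 behaves like 1" result is as reported in this issue, not 
independently verified here):

{code}
scala> sql("SELECT substring('Spark', 1, 2), substring('Spark', 0, 2)").show
// Both columns show "Sp": startPos 0 is accepted and behaves like 1,
// which is the confusing behavior reported above.
{code}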






[jira] [Comment Edited] (SPARK-23015) spark-submit fails when submitting several jobs in parallel

2018-10-24 Thread Kevin Grealish (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662750#comment-16662750
 ] 

Kevin Grealish edited comment on SPARK-23015 at 10/24/18 8:51 PM:
--

One workaround is to create a temp directory under TEMP and set that to be the 
TEMP directory for the process being launched. This way each process you 
launch gets its own temp space. For example, when launching from C#:

{code}
// Workaround for Spark bug https://issues.apache.org/jira/browse/SPARK-23015
// Spark Submit's launching library prints the command to execute the launcher
// (org.apache.spark.launcher.main) to a temporary text file
// ("%TEMP%\spark-class-launcher-output-%RANDOM%.txt"), reads the result back
// into a variable, and then executes that command. %RANDOM% does not have
// sufficient range to avoid collisions when launching many Spark processes.
// As a result the Spark processes end up running one another's commands
// (silently) or give an error like:
// "The process cannot access the file because it is being used by another process."
// "The system cannot find the file C:\VsoAgent\_work\_temp\spark-class-launcher-output-654.txt."
// As a workaround, we give each run its own TEMP directory, which we create using a GUID.
string newTemp = null;
if (AppRuntimeEnvironment.IsRunningOnWindows())
{
    var ourTemp = Environment.GetEnvironmentVariable("TEMP");
    var newDirName = "dprep" +
        Convert.ToBase64String(Guid.NewGuid().ToByteArray()).Substring(0, 22).Replace('/', '-');
    newTemp = Path.Combine(ourTemp, newDirName);
    Directory.CreateDirectory(newTemp);
    start.Environment["TEMP"] = newTemp;
}
{code}


was (Author: kevingre):
One workaround is to create a temp directory in temp and set that to be the 
TEMP directory for that process being launched. This way each process you 
launch gets its on temp speace. For example, when launching from C#:

{{
// Workaround for Spark bug 
https://issues.apache.org/jira/browse/SPARK-23015
// Spark Submit's launching library prints the command to execute 
the launcher (org.apache.spark.launcher.main) 
// to a temporary text file 
("%TEMP%\spark-class-launcher-output-%RANDOM%.txt"), reads the result back into 
a variable,
// and then executes that command. %RANDOM% does not have sufficent 
range to avoid collisions when launching many Spark processes.
// As a result the Spark processes end up running one anothers' 
commands (silently) or gives an error like:
// "The process cannot access the file because it is being used by 
another process."
// "The system cannot find the file 
C:\VsoAgent\_work\_temp\spark-class-launcher-output-654.txt."
// As a workaround, we give each run its own TEMP directory, we 
create using a GUID.
string newTemp = null;
if (AppRuntimeEnvironment.IsRunningOnWindows())
{
var ourTemp = Environment.GetEnvironmentVariable("TEMP");
var newDirName = "dprep" + 
Convert.ToBase64String(Guid.NewGuid().ToByteArray()).Substring(0, 
22).Replace('/', '-');
newTemp = Path.Combine(ourTemp, newDirName);
Directory.CreateDirectory(newTemp);
start.Environment["TEMP"] = newTemp;
}

}}

> spark-submit fails when submitting several jobs in parallel
> ---
>
> Key: SPARK-23015
> URL: https://issues.apache.org/jira/browse/SPARK-23015
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
> Environment: Windows 10 (1709/16299.125)
> Spark 2.3.0
> Java 8, Update 151
>Reporter: Hugh Zabriskie
>Priority: Major
>
> Spark Submit's launching library prints the command to execute the launcher 
> (org.apache.spark.launcher.main) to a temporary text file, reads the result 
> back into a variable, and then executes that command.
> {code}
> set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt
> "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main %* > %LAUNCHER_OUTPUT%
> {code}
> [bin/spark-class2.cmd, L67|https://github.com/apache/spark/blob/master/bin/spark-class2.cmd#L66]
> That temporary text file is given a pseudo-random name by the %RANDOM% env 
> variable generator, which generates a number between 0 and 32767.
> This appears to be the cause of an error occurring when several spark-submit 
> jobs are launched simultaneously. The following error is returned from stderr:
> 
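
As a rough illustration of why collisions are plausible (the 0-32767 range comes 
from the description above; the probabilities below are a standard 
birthday-problem estimate, not measurements):

{code}
// Birthday-problem estimate: chance that at least two of n concurrent
// spark-submit launches pick the same %RANDOM% value out of 32768.
def collisionProbability(n: Int, space: Int = 32768): Double =
  1.0 - (0 until n).map(i => (space - i).toDouble / space).product

// collisionProbability(10)  ~= 0.0014
// collisionProbability(100) ~= 0.14
{code}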

[jira] [Assigned] (SPARK-25827) Replicating a block > 2gb with encryption fails

2018-10-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25827:


Assignee: Apache Spark

> Replicating a block > 2gb with encryption fails
> ---
>
> Key: SPARK-25827
> URL: https://issues.apache.org/jira/browse/SPARK-25827
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Major
>
> When replicating large blocks with encryption, we try to allocate an array of 
> size {{Int.MaxValue}} which is just a bit too big for the JVM.  This is 
> basically the same as SPARK-25704, just another case.
> In DiskStore:
> {code}
> val chunkSize = math.min(remaining, Int.MaxValue)
> {code}
> {noformat}
> 18/10/22 17:04:06 WARN storage.BlockManager: Failed to replicate rdd_1_1 to 
> ..., failure #0
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
>   at 
> org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:133)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1421)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1230)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> ...
> Caused by: java.lang.RuntimeException: java.io.IOException: Destination 
> failed while reading stream
> ...
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>   at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57)
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
>   at 
> org.apache.spark.storage.BlockManager$$anon$1$$anonfun$7.apply(BlockManager.scala:446)
>   at 
> org.apache.spark.storage.BlockManager$$anon$1$$anonfun$7.apply(BlockManager.scala:446)
>   at 
> org.apache.spark.storage.EncryptedBlockData.toChunkedByteBuffer(DiskStore.scala:221)
>   at 
> org.apache.spark.storage.BlockManager$$anon$1.onComplete(BlockManager.scala:449)
> ...
> {noformat}






[jira] [Assigned] (SPARK-25827) Replicating a block > 2gb with encryption fails

2018-10-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25827:


Assignee: (was: Apache Spark)

> Replicating a block > 2gb with encryption fails
> ---
>
> Key: SPARK-25827
> URL: https://issues.apache.org/jira/browse/SPARK-25827
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> When replicating large blocks with encryption, we try to allocate an array of 
> size {{Int.MaxValue}} which is just a bit too big for the JVM.  This is 
> basically the same as SPARK-25704, just another case.
> In DiskStore:
> {code}
> val chunkSize = math.min(remaining, Int.MaxValue)
> {code}
> {noformat}
> 18/10/22 17:04:06 WARN storage.BlockManager: Failed to replicate rdd_1_1 to 
> ..., failure #0
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
>   at 
> org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:133)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1421)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1230)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> ...
> Caused by: java.lang.RuntimeException: java.io.IOException: Destination 
> failed while reading stream
> ...
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>   at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57)
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
>   at 
> org.apache.spark.storage.BlockManager$$anon$1$$anonfun$7.apply(BlockManager.scala:446)
>   at 
> org.apache.spark.storage.BlockManager$$anon$1$$anonfun$7.apply(BlockManager.scala:446)
>   at 
> org.apache.spark.storage.EncryptedBlockData.toChunkedByteBuffer(DiskStore.scala:221)
>   at 
> org.apache.spark.storage.BlockManager$$anon$1.onComplete(BlockManager.scala:449)
> ...
> {noformat}






[jira] [Commented] (SPARK-25827) Replicating a block > 2gb with encryption fails

2018-10-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662825#comment-16662825
 ] 

Apache Spark commented on SPARK-25827:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/22818

> Replicating a block > 2gb with encryption fails
> ---
>
> Key: SPARK-25827
> URL: https://issues.apache.org/jira/browse/SPARK-25827
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> When replicating large blocks with encryption, we try to allocate an array of 
> size {{Int.MaxValue}} which is just a bit too big for the JVM.  This is 
> basically the same as SPARK-25704, just another case.
> In DiskStore:
> {code}
> val chunkSize = math.min(remaining, Int.MaxValue)
> {code}
> {noformat}
> 18/10/22 17:04:06 WARN storage.BlockManager: Failed to replicate rdd_1_1 to 
> ..., failure #0
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
>   at 
> org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:133)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1421)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1230)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> ...
> Caused by: java.lang.RuntimeException: java.io.IOException: Destination 
> failed while reading stream
> ...
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>   at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
>   at 
> org.apache.spark.storage.BlockManager$$anon$1$$anonfun$7.apply(BlockManager.scala:446)
>   at 
> org.apache.spark.storage.BlockManager$$anon$1$$anonfun$7.apply(BlockManager.scala:446)
>   at 
> org.apache.spark.storage.EncryptedBlockData.toChunkedByteBuffer(DiskStore.scala:221)
>   at 
> org.apache.spark.storage.BlockManager$$anon$1.onComplete(BlockManager.scala:449)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25807) Mitigate 1-based substr() confusion

2018-10-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25807:
--
Target Version/s:   (was: 2.4.0, 2.4.1, 2.5.0, 3.0.0)

> Mitigate 1-based substr() confusion
> ---
>
> Key: SPARK-25807
> URL: https://issues.apache.org/jira/browse/SPARK-25807
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, PySpark
>Affects Versions: 1.3.0, 2.3.2, 2.4.0, 2.5.0, 3.0.0
>Reporter: Oron Navon
>Priority: Minor
>
> The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's 
> {{SUBSTRING}}, and contradicting both Python slicing and Java's 
> {{substring}}, which are zero-based. Both PySpark users and Java API users 
> often naturally expect a 0-based {{substr()}}. Adding to the confusion, 
> {{substr()}} currently allows a {{startPos}} value of 0, which returns the 
> same result as {{startPos==1}}.
> Since changing {{substr()}} to 0-based is probably NOT a reasonable option 
> here, I suggest making one or more of the following changes:
>  # Adding a method {{substr0}}, which would be zero-based
>  # Renaming {{substr}} to {{substr1}}
>  # Making the existing {{substr()}} throw an exception on {{startPos==0}}, 
> which should catch and alert most users who expect zero-based behavior.
> This is my first discussion on this project, apologies for any faux pas.
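A small spark-shell style demonstration of the behavior described above (demonstration 
only, not a proposed change):

{code:scala}
// Demonstration only: Column.substr is 1-based, and startPos == 0 currently
// behaves like startPos == 1, unlike the 0-based String.substring.
import org.apache.spark.sql.SparkSession

object SubstrDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("substr-demo").getOrCreate()
    import spark.implicits._

    val df = Seq("abcdef").toDF("s")
    println(df.select($"s".substr(1, 3)).first().getString(0)) // abc  (1-based start)
    println(df.select($"s".substr(0, 3)).first().getString(0)) // abc  (0 is treated like 1 today)
    println("abcdef".substring(0, 3))                          // abc  (0-based Java/Scala String API)

    spark.stop()
  }
}
{code}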



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion

2018-10-24 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662817#comment-16662817
 ] 

Sean Owen commented on SPARK-25807:
---

They are meant to match Hive, SQL. They should not match Java, Python. No, this 
should not be changed.

> Mitigate 1-based substr() confusion
> ---
>
> Key: SPARK-25807
> URL: https://issues.apache.org/jira/browse/SPARK-25807
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, PySpark
>Affects Versions: 1.3.0, 2.3.2, 2.4.0, 2.5.0, 3.0.0
>Reporter: Oron Navon
>Priority: Minor
>
> The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's 
> {{SUBSTRING}}, and contradicting both Python slicing and Java's 
> {{substring}}, which are zero-based. Both PySpark users and Java API users 
> often naturally expect a 0-based {{substr()}}. Adding to the confusion, 
> {{substr()}} currently allows a {{startPos}} value of 0, which returns the 
> same result as {{startPos==1}}.
> Since changing {{substr()}} to 0-based is probably NOT a reasonable option 
> here, I suggest making one or more of the following changes:
>  # Adding a method {{substr0}}, which would be zero-based
>  # Renaming {{substr}} to {{substr1}}
>  # Making the existing {{substr()}} throw an exception on {{startPos==0}}, 
> which should catch and alert most users who expect zero-based behavior.
> This is my first discussion on this project, apologies for any faux pas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25827) Replicating a block > 2gb with encryption fails

2018-10-24 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-25827:


 Summary: Replicating a block > 2gb with encryption fails
 Key: SPARK-25827
 URL: https://issues.apache.org/jira/browse/SPARK-25827
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Imran Rashid


When replicating large blocks with encryption, we try to allocate an array of 
size {{Int.MaxValue}} which is just a bit too big for the JVM.  This is 
basically the same as SPARK-25704, just another case.

In DiskStore:
{code}
val chunkSize = math.min(remaining, Int.MaxValue)
{code}

{noformat}
18/10/22 17:04:06 WARN storage.BlockManager: Failed to replicate rdd_1_1 to 
..., failure #0
org.apache.spark.SparkException: Exception thrown in awaitResult: 
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at 
org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:133)
at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1421)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1230)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
...
Caused by: java.lang.RuntimeException: java.io.IOException: Destination failed 
while reading stream
...
Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at 
org.apache.spark.storage.BlockManager$$anon$1$$anonfun$7.apply(BlockManager.scala:446)
at 
org.apache.spark.storage.BlockManager$$anon$1$$anonfun$7.apply(BlockManager.scala:446)
at 
org.apache.spark.storage.EncryptedBlockData.toChunkedByteBuffer(DiskStore.scala:221)
at 
org.apache.spark.storage.BlockManager$$anon$1.onComplete(BlockManager.scala:449)
...
{noformat}
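The fix direction is presumably the same as for SPARK-25704: never request a single array 
of exactly {{Int.MaxValue}} bytes. A minimal sketch of that idea (assumed names, not the 
actual patch):

{code:scala}
// Hedged sketch only, not Spark's actual patch: compute chunk sizes that stay
// safely below the JVM's maximum array size instead of exactly Int.MaxValue.
object ChunkingSketch {
  // Arrays of exactly Int.MaxValue elements fail on most JVMs; leave some headroom.
  private val MaxChunkSize: Long = Int.MaxValue - 512

  // Split `totalSize` bytes into chunk sizes that are always allocatable.
  def chunkSizes(totalSize: Long): List[Int] =
    Iterator.iterate(totalSize)(_ - MaxChunkSize)
      .takeWhile(_ > 0)
      .map(remaining => math.min(remaining, MaxChunkSize).toInt)
      .toList

  def main(args: Array[String]): Unit = {
    // A 2.5 GB encrypted block is read back as two chunks instead of one giant array.
    println(chunkSizes(2500L * 1024 * 1024)) // List(2147483135, 473956865)
  }
}
{code}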



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662760#comment-16662760
 ] 

Dongjoon Hyun commented on SPARK-25823:
---

For me, this is a `data correctness` issue in the `map_filter` operation. We had 
better fix this, so I'm investigating it.
The Apache Spark PMC may choose another path for the 2.4.0 release; I respect that as well.

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> duplicate keys. If we want to allow this difference in the new high-order 
> functions, we had better add a warning about it to these functions, at least 
> after the RC4 vote passes. Otherwise, this will surprise Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25815) Kerberos Support in Kubernetes resource manager (Client Mode)

2018-10-24 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662761#comment-16662761
 ] 

Marcelo Vanzin commented on SPARK-25815:


I actually have this working (and also I'm adding proper principal/keytab 
support in line with YARN/Mesos), but it's built on top of SPARK-23781, which is 
not merged yet.

> Kerberos Support in Kubernetes resource manager (Client Mode)
> -
>
> Key: SPARK-25815
> URL: https://issues.apache.org/jira/browse/SPARK-25815
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> Include Kerberos support for Spark on K8S jobs running in client-mode



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23015) spark-submit fails when submitting several jobs in parallel

2018-10-24 Thread Kevin Grealish (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662750#comment-16662750
 ] 

Kevin Grealish edited comment on SPARK-23015 at 10/24/18 7:47 PM:
--

One workaround is to create a temp directory inside TEMP and set that to be the 
TEMP directory for the process being launched. This way each process you 
launch gets its own temp space. For example, when launching from C#:

{code:c#}
// Workaround for Spark bug https://issues.apache.org/jira/browse/SPARK-23015
// Spark Submit's launching library prints the command to execute the launcher
// (org.apache.spark.launcher.Main) to a temporary text file
// ("%TEMP%\spark-class-launcher-output-%RANDOM%.txt"), reads the result back into a variable,
// and then executes that command. %RANDOM% does not have sufficient range to avoid
// collisions when launching many Spark processes. As a result the Spark processes end up
// running one another's commands (silently) or give an error like:
// "The process cannot access the file because it is being used by another process."
// "The system cannot find the file C:\VsoAgent\_work\_temp\spark-class-launcher-output-654.txt."
// As a workaround, we give each run its own TEMP directory, which we create using a GUID.
string newTemp = null;
if (AppRuntimeEnvironment.IsRunningOnWindows())
{
    var ourTemp = Environment.GetEnvironmentVariable("TEMP");
    var newDirName = "dprep" +
        Convert.ToBase64String(Guid.NewGuid().ToByteArray()).Substring(0, 22).Replace('/', '-');
    newTemp = Path.Combine(ourTemp, newDirName);
    Directory.CreateDirectory(newTemp);
    start.Environment["TEMP"] = newTemp;
}
{code}


was (Author: kevingre):
One workaround is to create a temp directory inside TEMP and set that to be the 
TEMP directory for the process being launched. This way each process you 
launch gets its own temp space. For example, when launching from C#:

```

    // Workaround for Spark bug 
https://issues.apache.org/jira/browse/SPARK-23015
    // Spark Submit's launching library prints the command to execute 
the launcher (org.apache.spark.launcher.main) 
    // to a temporary text file 
("%TEMP%\spark-class-launcher-output-%RANDOM%.txt"), reads the result back into 
a variable,
    // and then executes that command. %RANDOM% does not have sufficent 
range to avoid collisions when launching many Spark processes.
    // As a result the Spark processes end up running one anothers' 
commands (silently) or gives an error like:
    // "The process cannot access the file because it is being used by 
another process."
    // "The system cannot find the file 
C:\VsoAgent\_work\_temp\spark-class-launcher-output-654.txt."
    // As a workaround, we give each run its own TEMP directory, we 
create using a GUID.
    string newTemp = null;
    if (AppRuntimeEnvironment.IsRunningOnWindows())
    {
    var ourTemp = Environment.GetEnvironmentVariable("TEMP");
    var newDirName = "dprep" + 
Convert.ToBase64String(Guid.NewGuid().ToByteArray()).Substring(0, 
22).Replace('/', '-');
    newTemp = Path.Combine(ourTemp, newDirName);
    Directory.CreateDirectory(newTemp);
    start.Environment["TEMP"] = newTemp;
    }


 ```

> spark-submit fails when submitting several jobs in parallel
> ---
>
> Key: SPARK-23015
> URL: https://issues.apache.org/jira/browse/SPARK-23015
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
> Environment: Windows 10 (1709/16299.125)
> Spark 2.3.0
> Java 8, Update 151
>Reporter: Hugh Zabriskie
>Priority: Major
>
> Spark Submit's launching library prints the command to execute the launcher 
> (org.apache.spark.launcher.main) to a temporary text file, reads the result 
> back into a variable, and then executes that command.
> {code}
> set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt
> "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main 
> %* > %LAUNCHER_OUTPUT%
> {code}
> [bin/spark-class2.cmd, 
> L67|https://github.com/apache/spark/blob/master/bin/spark-class2.cmd#L66]
> That temporary text file is given a pseudo-random name by the %RANDOM% env 
> variable generator, which generates a number between 0 and 32767.
> This appears to be the cause of an error occurring when several spark-submit 
> jobs are launched simultaneously. The following error is returned from stderr:

[jira] [Updated] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods

2018-10-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25737:
--
Description: 
In ancient times in 2013, JavaSparkContext got a superclass 
JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: 
[http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772]

I believe this was really resolved by the {{@varargs}} annotation in Scala 2.9. 

I believe we can now remove this workaround. {{union(JavaRDD, List)}} 
should just be {{union(JavaRDD*)}} and likewise for JavaPairRDD, JavaDoubleRDD.

 

  was:
In ancient times in 2013, JavaSparkContext got a superclass 
JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: 
[http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772]

I believe this was really resolved by the {{@varargs}} annotation in Scala 2.9. 

I believe we can now remove this workaround. Along the way, I think we can also 
avoid the duplicated definitions of {{union()}}. Where we should be able to 
just have one varargs method, we have up to 3 forms:
 - {{union(RDD, Seq/List)}}

 - {{union(RDD*)}}

 - {{union(RDD, RDD*)}}

While this pattern is sometimes used to avoid type collision due to erasure, I 
don't think it applies here.

After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods 
(for the 3 Java RDD types), not 11 methods.

The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, 
rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or 
{{sc.union(Seq(rdd1, rdd2): _*)}}


> Remove JavaSparkContextVarargsWorkaround and standardize union() methods
> 
>
> Key: SPARK-25737
> URL: https://issues.apache.org/jira/browse/SPARK-25737
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> In ancient times in 2013, JavaSparkContext got a superclass 
> JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: 
> [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772]
> I believe this was really resolved by the {{@varargs}} annotation in Scala 
> 2.9. 
> I believe we can now remove this workaround. {{union(JavaRDD, 
> List)}} should just be {{union(JavaRDD*)}} and likewise for 
> JavaPairRDD, JavaDoubleRDD.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23015) spark-submit fails when submitting several jobs in parallel

2018-10-24 Thread Kevin Grealish (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662750#comment-16662750
 ] 

Kevin Grealish edited comment on SPARK-23015 at 10/24/18 7:45 PM:
--

One workaround is to create a temp directory inside TEMP and set that to be the 
TEMP directory for the process being launched. This way each process you 
launch gets its own temp space. For example, when launching from C#:

```

    // Workaround for Spark bug 
https://issues.apache.org/jira/browse/SPARK-23015
    // Spark Submit's launching library prints the command to execute 
the launcher (org.apache.spark.launcher.main) 
    // to a temporary text file 
("%TEMP%\spark-class-launcher-output-%RANDOM%.txt"), reads the result back into 
a variable,
    // and then executes that command. %RANDOM% does not have sufficent 
range to avoid collisions when launching many Spark processes.
    // As a result the Spark processes end up running one anothers' 
commands (silently) or gives an error like:
    // "The process cannot access the file because it is being used by 
another process."
    // "The system cannot find the file 
C:\VsoAgent\_work\_temp\spark-class-launcher-output-654.txt."
    // As a workaround, we give each run its own TEMP directory, we 
create using a GUID.
    string newTemp = null;
    if (AppRuntimeEnvironment.IsRunningOnWindows())
    {
    var ourTemp = Environment.GetEnvironmentVariable("TEMP");
    var newDirName = "dprep" + 
Convert.ToBase64String(Guid.NewGuid().ToByteArray()).Substring(0, 
22).Replace('/', '-');
    newTemp = Path.Combine(ourTemp, newDirName);
    Directory.CreateDirectory(newTemp);
    start.Environment["TEMP"] = newTemp;
    }


 ```


was (Author: kevingre):
One workaround is to create a temp directory inside TEMP and set that to be the 
TEMP directory for the process being launched. This way each process you 
launch gets its own temp space. For example, when launching from C#:

{{            // Workaround for Spark bug 
https://issues.apache.org/jira/browse/SPARK-23015
}}

{{    // Spark Submit's launching library prints the command to execute 
the launcher (org.apache.spark.launcher.main) 
    // to a temporary text file 
("%TEMP%\spark-class-launcher-output-%RANDOM%.txt"), reads the result back into 
a variable,
    // and then executes that command. %RANDOM% does not have 
sufficient range to avoid collisions when launching many Spark processes.
    // As a result the Spark processes end up running one anothers' 
commands (silently) or gives an error like:
    // "The process cannot access the file because it is being used by 
another process."
    // "The system cannot find the file 
C:\VsoAgent\_work\_temp\spark-class-launcher-output-654.txt."
    // As a workaround, we give each run its own TEMP directory, we 
create using a GUID.
    string newTemp = null;
    if (AppRuntimeEnvironment.IsRunningOnWindows())
    {
    var ourTemp = Environment.GetEnvironmentVariable("TEMP");
    var newDirName = "t" + 
Convert.ToBase64String(Guid.NewGuid().ToByteArray()).Substring(0, 
22).Replace('/', '-');
    newTemp = Path.Combine(ourTemp, newDirName);
    Directory.CreateDirectory(newTemp);
    start.Environment["TEMP"] = newTemp;
    }
}}

 

> spark-submit fails when submitting several jobs in parallel
> ---
>
> Key: SPARK-23015
> URL: https://issues.apache.org/jira/browse/SPARK-23015
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
> Environment: Windows 10 (1709/16299.125)
> Spark 2.3.0
> Java 8, Update 151
>Reporter: Hugh Zabriskie
>Priority: Major
>
> Spark Submit's launching library prints the command to execute the launcher 
> (org.apache.spark.launcher.main) to a temporary text file, reads the result 
> back into a variable, and then executes that command.
> {code}
> set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt
> "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main 
> %* > %LAUNCHER_OUTPUT%
> {code}
> [bin/spark-class2.cmd, 
> L67|https://github.com/apache/spark/blob/master/bin/spark-class2.cmd#L66]
> That temporary text file is given a pseudo-random name by the %RANDOM% env 
> variable generator, which generates a number between 0 and 32767.
> This appears to be the cause of an error occurring when several spark-submit 
> jobs are launched simultaneously. The following error is returned from 

[jira] [Updated] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods

2018-10-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25737:
--
Labels: release-notes  (was: )

> Remove JavaSparkContextVarargsWorkaround and standardize union() methods
> 
>
> Key: SPARK-25737
> URL: https://issues.apache.org/jira/browse/SPARK-25737
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> In ancient times in 2013, JavaSparkContext got a superclass 
> JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: 
> [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772]
> I believe this was really resolved by the {{@varargs}} annotation in Scala 
> 2.9. 
> I believe we can now remove this workaround. Along the way, I think we can 
> also avoid the duplicated definitions of {{union()}}. Where we should be able 
> to just have one varargs method, we have up to 3 forms:
>  - {{union(RDD, Seq/List)}}
>  - {{union(RDD*)}}
>  - {{union(RDD, RDD*)}}
> While this pattern is sometimes used to avoid type collision due to erasure, 
> I don't think it applies here.
> After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods 
> (for the 3 Java RDD types), not 11 methods.
> The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, 
> rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or 
> {{sc.union(Seq(rdd1, rdd2): _*)}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23015) spark-submit fails when submitting several jobs in parallel

2018-10-24 Thread Kevin Grealish (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662750#comment-16662750
 ] 

Kevin Grealish commented on SPARK-23015:


One workaround is to create a temp directory inside TEMP and set that to be the 
TEMP directory for the process being launched. This way each process you 
launch gets its own temp space. For example, when launching from C#:

{{            // Workaround for Spark bug 
https://issues.apache.org/jira/browse/SPARK-23015
}}

{{    // Spark Submit's launching library prints the command to execute 
the launcher (org.apache.spark.launcher.main) 
    // to a temporary text file 
("%TEMP%\spark-class-launcher-output-%RANDOM%.txt"), reads the result back into 
a variable,
    // and then executes that command. %RANDOM% does not have 
sufficient range to avoid collisions when launching many Spark processes.
    // As a result the Spark processes end up running one anothers' 
commands (silently) or gives an error like:
    // "The process cannot access the file because it is being used by 
another process."
    // "The system cannot find the file 
C:\VsoAgent\_work\_temp\spark-class-launcher-output-654.txt."
    // As a workaround, we give each run its own TEMP directory, we 
create using a GUID.
    string newTemp = null;
    if (AppRuntimeEnvironment.IsRunningOnWindows())
    {
    var ourTemp = Environment.GetEnvironmentVariable("TEMP");
    var newDirName = "t" + 
Convert.ToBase64String(Guid.NewGuid().ToByteArray()).Substring(0, 
22).Replace('/', '-');
    newTemp = Path.Combine(ourTemp, newDirName);
    Directory.CreateDirectory(newTemp);
    start.Environment["TEMP"] = newTemp;
    }
}}

 

> spark-submit fails when submitting several jobs in parallel
> ---
>
> Key: SPARK-23015
> URL: https://issues.apache.org/jira/browse/SPARK-23015
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
> Environment: Windows 10 (1709/16299.125)
> Spark 2.3.0
> Java 8, Update 151
>Reporter: Hugh Zabriskie
>Priority: Major
>
> Spark Submit's launching library prints the command to execute the launcher 
> (org.apache.spark.launcher.main) to a temporary text file, reads the result 
> back into a variable, and then executes that command.
> {code}
> set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt
> "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main 
> %* > %LAUNCHER_OUTPUT%
> {code}
> [bin/spark-class2.cmd, 
> L67|https://github.com/apache/spark/blob/master/bin/spark-class2.cmd#L66]
> That temporary text file is given a pseudo-random name by the %RANDOM% env 
> variable generator, which generates a number between 0 and 32767.
> This appears to be the cause of an error occurring when several spark-submit 
> jobs are launched simultaneously. The following error is returned from stderr:
> {quote}The process cannot access the file because it is being used by another 
> process. The system cannot find the file
> USER/AppData/Local/Temp/spark-class-launcher-output-RANDOM.txt.
> The process cannot access the file because it is being used by another 
> process.{quote}
> My hypothesis is that %RANDOM% is returning the same value for multiple jobs, 
> causing the launcher library to attempt to write to the same file from 
> multiple processes. Another mechanism is needed for reliably generating the 
> names of the temporary files so that the concurrency issue is resolved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods

2018-10-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25737.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22729
[https://github.com/apache/spark/pull/22729]

> Remove JavaSparkContextVarargsWorkaround and standardize union() methods
> 
>
> Key: SPARK-25737
> URL: https://issues.apache.org/jira/browse/SPARK-25737
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> In ancient times in 2013, JavaSparkContext got a superclass 
> JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: 
> [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772]
> I believe this was really resolved by the {{@varargs}} annotation in Scala 
> 2.9. 
> I believe we can now remove this workaround. Along the way, I think we can 
> also avoid the duplicated definitions of {{union()}}. Where we should be able 
> to just have one varargs method, we have up to 3 forms:
>  - {{union(RDD, Seq/List)}}
>  - {{union(RDD*)}}
>  - {{union(RDD, RDD*)}}
> While this pattern is sometimes used to avoid type collision due to erasure, 
> I don't think it applies here.
> After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods 
> (for the 3 Java RDD types), not 11 methods.
> The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, 
> rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or 
> {{sc.union(Seq(rdd1, rdd2): _*)}}
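As a rough illustration of the calling-convention change, a sketch with a stand-in type 
instead of {{RDD}} (assumed names, not the actual Spark code):

{code:scala}
// Sketch only: a single @varargs union() replacing the Seq/List overloads.
// MiniContext and its Seq[String] "RDDs" are stand-ins, not Spark classes.
import scala.annotation.varargs

class MiniContext {
  @varargs
  def union(rdds: Seq[String]*): Seq[String] = rdds.flatten.toList
}

object UnionSketch {
  def main(args: Array[String]): Unit = {
    val sc   = new MiniContext
    val rdd1 = Seq("a")
    val rdd2 = Seq("b")

    println(sc.union(rdd1, rdd2))          // List(a, b) -- direct varargs call
    println(sc.union(Seq(rdd1, rdd2): _*)) // List(a, b) -- Spark 3 style for an existing Seq
  }
}
{code}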



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662743#comment-16662743
 ] 

Dongjoon Hyun commented on SPARK-25823:
---

Sorry, but who says this issue is cosmetic? I think you are mixing up the two 
issues from clicking through the email links. :)
bq. Got it, you're saying that's cosmetic. 

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this different on these 
> functions after RC4 voting pass at least. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662737#comment-16662737
 ] 

Sean Owen commented on SPARK-25823:
---

Got it, you're saying that's cosmetic. But I understood from here that there's 
a real underlying issue in map(). That's what this is about, then? And is it a 
blocker? It again seems like something that isn't a hard blocker, but whether 
to fix it for 2.4 also depends on how long a fix would take and how reliable it would be.

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this different on these 
> functions after RC4 voting pass at least. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662735#comment-16662735
 ] 

Dongjoon Hyun commented on SPARK-25823:
---

[~srowen] and [~cloud_fan]. To be clear, 
- This issue doesn't aim to fix CreateMap or existing `map_keys/map_values`. 
(Please see the description)
- This issue only aims to fix the wrongly materialized cases like CTAS.

Is it correct that `SELECT map_filter(m)` returns values that never appear in 
`SELECT m`? Do you think so?
{code}
CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c
{code}

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this different on these 
> functions after RC4 voting pass at least. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662726#comment-16662726
 ] 

Dongjoon Hyun edited comment on SPARK-25823 at 10/24/18 7:31 PM:
-

[~srowen]. I commented on SPARK-25824. Please see there.


was (Author: dongjoon):
[~srowen]. I commend on SPARK-25824. Please see there.

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this different on these 
> functions after RC4 voting pass at least. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662726#comment-16662726
 ] 

Dongjoon Hyun commented on SPARK-25823:
---

[~srowen]. I commend on SPARK-25824. Please see there.

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this different on these 
> functions after RC4 voting pass at least. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25824) Remove duplicated map entries in `showString`

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662725#comment-16662725
 ] 

Dongjoon Hyun commented on SPARK-25824:
---

[~srowen]. Please see the description. The string notation is just a collection 
of the stored data.
`[1 -> 2, 1 -> 3]`

If you materialize that string as a `map` again, the result will eventually be 
`1 -> 3`. In that sense, I didn't categorize this as a bug.
{code}
scala> Map(1->2,1->3)
res5: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)
{code}
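A small sketch of the last-wins de-duplication that {{showString}} could apply so that 
its output matches what {{collect}} shows (sketch only, not the actual implementation):

{code:scala}
// Sketch only: de-duplicate (key, value) pairs keeping the last value per key,
// while preserving the order in which keys first appear.
object ShowStringDedupSketch {
  def dedupLastWins[K, V](entries: Seq[(K, V)]): Seq[(K, V)] = {
    val lastValue = entries.toMap // later duplicates win
    entries.map(_._1).distinct.map(k => k -> lastValue(k))
  }

  def main(args: Array[String]): Unit = {
    println(dedupLastWins(Seq(1 -> 2, 1 -> 3))) // List((1,3))
  }
}
{code}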

> Remove duplicated map entries in `showString`
> -
>
> Key: SPARK-25824
> URL: https://issues.apache.org/jira/browse/SPARK-25824
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `showString` doesn't eliminate the duplication, so its output looks different 
> from the result of `collect` and from selecting the saved rows.
> *Spark 2.2.2*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
> scala> sql("SELECT map(1,2,1,3)").collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> scala> sql("SELECT map(1,2,1,3)").show
> +---+
> |map(1, 2, 1, 3)|
> +---+
> |Map(1 -> 3)|
> +---+
> {code}
> *Spark 2.3.0 ~ 2.4.0-rc4*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
> scala> sql("SELECT map(1,2,1,3)").collect
> res1: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> scala> sql("CREATE TABLE m AS SELECT map(1,2,1,3) a")
> scala> sql("SELECT * FROM m").show
> ++
> |   a|
> ++
> |[1 -> 3]|
> ++
> scala> sql("SELECT map(1,2,1,3)").show
> ++
> | map(1, 2, 1, 3)|
> ++
> |[1 -> 2, 1 -> 3]|
> ++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662717#comment-16662717
 ] 

Sean Owen commented on SPARK-25823:
---

Hm, another tough one. It's not so much about what these new functions do but 
what map() already does in 2.3.0. Yeah, the current behavior isn't even 
internally consistent. It's a bug, and that's what SPARK-25824 tracks now 
(right? bug, not improvement?)

Is this a duplicate? or here to say we should note the map() behavior as a 
known issue?

I'd again say this isn't a blocker, even though it's a regression from a 
significantly older release. I'm still OK drawing that distinction as it seems 
to be the constructive place to draw a bright line.

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this different on these 
> functions after RC4 voting pass at least. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25826) Kerberos Support in Kubernetes resource manager

2018-10-24 Thread Ilan Filonenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilan Filonenko updated SPARK-25826:
---
Description: This is the umbrella issue for all Kerberos related tasks with 
relation to Spark on Kubernetes  (was: This is the umbrella issue for all 
Kerberos related tasks)
Summary: Kerberos Support in Kubernetes resource manager  (was: 
Kerberos Support in Kubernetes)

> Kerberos Support in Kubernetes resource manager
> ---
>
> Key: SPARK-25826
> URL: https://issues.apache.org/jira/browse/SPARK-25826
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> This is the umbrella issue for all Kerberos related tasks with relation to 
> Spark on Kubernetes



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25826) Kerberos Support in Kubernetes

2018-10-24 Thread Ilan Filonenko (JIRA)
Ilan Filonenko created SPARK-25826:
--

 Summary: Kerberos Support in Kubernetes
 Key: SPARK-25826
 URL: https://issues.apache.org/jira/browse/SPARK-25826
 Project: Spark
  Issue Type: Umbrella
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Ilan Filonenko


This is the umbrella issue for all Kerberos related tasks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25825) Kerberos Support for Long Running Jobs in Kubernetes

2018-10-24 Thread Ilan Filonenko (JIRA)
Ilan Filonenko created SPARK-25825:
--

 Summary: Kerberos Support for Long Running Jobs in Kubernetes 
 Key: SPARK-25825
 URL: https://issues.apache.org/jira/browse/SPARK-25825
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Ilan Filonenko


When provided with a --keytab and --principal combination, there is an 
expectation that Kubernetes would leverage the Driver to spin up a renewal 
thread to handle token renewal. However, when a --keytab and 
--principal are not provided and instead a secretName and secretItemKey are 
provided, there should be a config option indicating that an 
external renewal service exists. The driver should, therefore, be responsible 
for discovering changes to the secret and sending the updated token data to the 
executors with the UpdateDelegationTokens message, thereby enabling token 
renewal given just a secret, in addition to the traditional use case via 
--keytab and --principal.
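A very rough sketch of that driver-side flow; every identifier below ({{readTokenSecret}}, 
{{sendUpdateDelegationTokens}}, the poller itself) is a hypothetical placeholder rather 
than a real Spark or Kubernetes-client API:

{code:scala}
// Hypothetical sketch of the driver-side flow: poll the token secret for changes
// and push the new token bytes to executors. All identifiers are placeholders.
import java.util.concurrent.{Executors, TimeUnit}

object TokenSecretPollerSketch {
  // Placeholder: a real driver would read the updated Kubernetes secret item here.
  def readTokenSecret(secretName: String, secretItemKey: String): Array[Byte] =
    Array.emptyByteArray

  // Placeholder: a real driver would send an UpdateDelegationTokens-style message here.
  def sendUpdateDelegationTokens(tokens: Array[Byte]): Unit =
    println(s"would broadcast ${tokens.length} token bytes to the executors")

  def main(args: Array[String]): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    var lastSeen: Seq[Byte] = Nil

    scheduler.scheduleWithFixedDelay(new Runnable {
      def run(): Unit = {
        val current = readTokenSecret("spark-dt-secret", "hadoop-token")
        if (current.toSeq != lastSeen) { // the external renewal service rotated the secret
          lastSeen = current.toSeq
          sendUpdateDelegationTokens(current)
        }
      }
    }, 0, 60, TimeUnit.SECONDS)
  }
}
{code}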



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25816) Functions does not resolve Columns correctly

2018-10-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25816:


Assignee: Apache Spark

> Functions does not resolve Columns correctly
> 
>
> Key: SPARK-25816
> URL: https://issues.apache.org/jira/browse/SPARK-25816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Brian Zhang
>Assignee: Apache Spark
>Priority: Critical
> Attachments: source.snappy.parquet
>
>
> When there is a duplicate column name between the current DataFrame and the original 
> DataFrame the current df is selected from, Spark 2.3.0 and 2.3.1 do 
> not resolve the column correctly when it is used in an expression, hence 
> causing a casting issue. The same code works in Spark 2.2.1.
> Please see the code below to reproduce the issue:
> import org.apache.spark._
> import org.apache.spark.rdd._
> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.sql._
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.catalyst.expressions._
> import org.apache.spark.sql.Column
> val v0 = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet")
> val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _):_*)
> val v5 = v00.select($"13".as("0"),$"14".as("1"),$"15".as("2"))
> val v5_2 = $"2"
> v5.where(lit(500).<(v5_2(new Column(new MapKeys(v5_2.expr))(lit(0)
> //v00's 3rdcolumn is binary and 16th is map
> Error:
> org.apache.spark.sql.AnalysisException: cannot resolve 'map_keys(`2`)' due to 
> data type mismatch: argument 1 requires map type, however, '`2`' is of binary 
> type.;
>  
>  'Project [0#1591, 1#1592, 2#1593] +- 'Filter (500 < 
> {color:#FF}2#1593{color}[map_keys({color:#FF}2#1561{color})[0]]) +- 
> Project [13#1572 AS 0#1591, 14#1573 AS 1#1592, 15#1574 AS 2#1593, 2#1561] +- 
> Project [c_bytes#1527 AS 0#1559, c_union#1528 AS 1#1560, c_fixed#1529 AS 
> 2#1561, c_boolean#1530 AS 3#1562, c_float#1531 AS 4#1563, c_double#1532 AS 
> 5#1564, c_int#1533 AS 6#1565, c_long#1534L AS 7#1566L, c_string#1535 AS 
> 8#1567, c_decimal_18_2#1536 AS 9#1568, c_decimal_28_2#1537 AS 10#1569, 
> c_decimal_38_2#1538 AS 11#1570, c_date#1539 AS 12#1571, simple_struct#1540 AS 
> 13#1572, simple_array#1541 AS 14#1573, simple_map#1542 AS 15#1574] +- 
> Relation[c_bytes#1527,c_union#1528,c_fixed#1529,c_boolean#1530,c_float#1531,c_double#1532,c_int#1533,c_long#1534L,c_string#1535,c_decimal_18_2#1536,c_decimal_28_2#1537,c_decimal_38_2#1538,c_date#1539,simple_struct#1540,simple_array#1541,simple_map#1542]
>  parquet
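One possible workaround sketch for the reproduction above (hedged and untested against 
this exact data): bind the column to {{v5}} via {{v5("2")}} instead of the unqualified 
{{$"2"}}, and use the public {{map_keys}} function instead of the internal {{MapKeys}} 
expression, so the analyzer cannot pick up the shadowed column from {{v00}}:

{code:scala}
// Hedged workaround sketch, meant for the same spark-shell session as the
// reproduction above; v5("2") is resolved against v5, unlike the unqualified $"2".
import org.apache.spark.sql.functions.{lit, map_keys}

val v0  = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet")
val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _): _*)
val v5  = v00.select($"13".as("0"), $"14".as("1"), $"15".as("2"))

val v5_2 = v5("2") // a resolved reference to v5's map column
v5.where(lit(500) < v5_2(map_keys(v5_2)(lit(0)))).show()
{code}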



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25816) Functions does not resolve Columns correctly

2018-10-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662688#comment-16662688
 ] 

Apache Spark commented on SPARK-25816:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/22817

> Functions does not resolve Columns correctly
> 
>
> Key: SPARK-25816
> URL: https://issues.apache.org/jira/browse/SPARK-25816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Brian Zhang
>Priority: Critical
> Attachments: source.snappy.parquet
>
>
> When there is a duplicate column name between the current DataFrame and the original 
> DataFrame the current df is selected from, Spark 2.3.0 and 2.3.1 do 
> not resolve the column correctly when it is used in an expression, hence 
> causing a casting issue. The same code works in Spark 2.2.1.
> Please see the code below to reproduce the issue:
> import org.apache.spark._
> import org.apache.spark.rdd._
> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.sql._
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.catalyst.expressions._
> import org.apache.spark.sql.Column
> val v0 = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet")
> val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _):_*)
> val v5 = v00.select($"13".as("0"),$"14".as("1"),$"15".as("2"))
> val v5_2 = $"2"
> v5.where(lit(500).<(v5_2(new Column(new MapKeys(v5_2.expr))(lit(0)
> //v00's 3rdcolumn is binary and 16th is map
> Error:
> org.apache.spark.sql.AnalysisException: cannot resolve 'map_keys(`2`)' due to 
> data type mismatch: argument 1 requires map type, however, '`2`' is of binary 
> type.;
>  
>  'Project [0#1591, 1#1592, 2#1593] +- 'Filter (500 < 
> {color:#FF}2#1593{color}[map_keys({color:#FF}2#1561{color})[0]]) +- 
> Project [13#1572 AS 0#1591, 14#1573 AS 1#1592, 15#1574 AS 2#1593, 2#1561] +- 
> Project [c_bytes#1527 AS 0#1559, c_union#1528 AS 1#1560, c_fixed#1529 AS 
> 2#1561, c_boolean#1530 AS 3#1562, c_float#1531 AS 4#1563, c_double#1532 AS 
> 5#1564, c_int#1533 AS 6#1565, c_long#1534L AS 7#1566L, c_string#1535 AS 
> 8#1567, c_decimal_18_2#1536 AS 9#1568, c_decimal_28_2#1537 AS 10#1569, 
> c_decimal_38_2#1538 AS 11#1570, c_date#1539 AS 12#1571, simple_struct#1540 AS 
> 13#1572, simple_array#1541 AS 14#1573, simple_map#1542 AS 15#1574] +- 
> Relation[c_bytes#1527,c_union#1528,c_fixed#1529,c_boolean#1530,c_float#1531,c_double#1532,c_int#1533,c_long#1534L,c_string#1535,c_decimal_18_2#1536,c_decimal_28_2#1537,c_decimal_38_2#1538,c_date#1539,simple_struct#1540,simple_array#1541,simple_map#1542]
>  parquet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25816) Functions does not resolve Columns correctly

2018-10-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25816:


Assignee: (was: Apache Spark)

> Functions does not resolve Columns correctly
> 
>
> Key: SPARK-25816
> URL: https://issues.apache.org/jira/browse/SPARK-25816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Brian Zhang
>Priority: Critical
> Attachments: source.snappy.parquet
>
>
> When there is a duplicate column name between the current DataFrame and the original 
> DataFrame the current df is selected from, Spark 2.3.0 and 2.3.1 do 
> not resolve the column correctly when it is used in an expression, hence 
> causing a casting issue. The same code works in Spark 2.2.1.
> Please see the code below to reproduce the issue:
> import org.apache.spark._
> import org.apache.spark.rdd._
> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.sql._
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.catalyst.expressions._
> import org.apache.spark.sql.Column
> val v0 = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet")
> val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _):_*)
> val v5 = v00.select($"13".as("0"),$"14".as("1"),$"15".as("2"))
> val v5_2 = $"2"
> v5.where(lit(500).<(v5_2(new Column(new MapKeys(v5_2.expr))(lit(0)
> //v00's 3rdcolumn is binary and 16th is map
> Error:
> org.apache.spark.sql.AnalysisException: cannot resolve 'map_keys(`2`)' due to 
> data type mismatch: argument 1 requires map type, however, '`2`' is of binary 
> type.;
>  
>  'Project [0#1591, 1#1592, 2#1593] +- 'Filter (500 < 
> {color:#FF}2#1593{color}[map_keys({color:#FF}2#1561{color})[0]]) +- 
> Project [13#1572 AS 0#1591, 14#1573 AS 1#1592, 15#1574 AS 2#1593, 2#1561] +- 
> Project [c_bytes#1527 AS 0#1559, c_union#1528 AS 1#1560, c_fixed#1529 AS 
> 2#1561, c_boolean#1530 AS 3#1562, c_float#1531 AS 4#1563, c_double#1532 AS 
> 5#1564, c_int#1533 AS 6#1565, c_long#1534L AS 7#1566L, c_string#1535 AS 
> 8#1567, c_decimal_18_2#1536 AS 9#1568, c_decimal_28_2#1537 AS 10#1569, 
> c_decimal_38_2#1538 AS 11#1570, c_date#1539 AS 12#1571, simple_struct#1540 AS 
> 13#1572, simple_array#1541 AS 14#1573, simple_map#1542 AS 15#1574] +- 
> Relation[c_bytes#1527,c_union#1528,c_fixed#1529,c_boolean#1530,c_float#1531,c_double#1532,c_int#1533,c_long#1534L,c_string#1535,c_decimal_18_2#1536,c_decimal_28_2#1537,c_decimal_38_2#1538,c_date#1539,simple_struct#1540,simple_array#1541,simple_map#1542]
>  parquet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662673#comment-16662673
 ] 

Dongjoon Hyun commented on SPARK-25823:
---

Ah, got it. I thought a different one. For mine, I created SPARK-25824 as a 
minor improvement in Spark 3.0.

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this different on these 
> functions after RC4 voting pass at least. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25824) Remove duplicated map entries in `showString`

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662670#comment-16662670
 ] 

Dongjoon Hyun commented on SPARK-25824:
---

cc [~maropu]

> Remove duplicated map entries in `showString`
> -
>
> Key: SPARK-25824
> URL: https://issues.apache.org/jira/browse/SPARK-25824
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `showString` doesn't eliminate the duplication. So, it looks different from 
> the result of `collect` and from selecting the saved rows.
> *Spark 2.2.2*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
> scala> sql("SELECT map(1,2,1,3)").collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> scala> sql("SELECT map(1,2,1,3)").show
> +---------------+
> |map(1, 2, 1, 3)|
> +---------------+
> |    Map(1 -> 3)|
> +---------------+
> {code}
> *Spark 2.3.0 ~ 2.4.0-rc4*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
> scala> sql("SELECT map(1,2,1,3)").collect
> res1: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> scala> sql("CREATE TABLE m AS SELECT map(1,2,1,3) a")
> scala> sql("SELECT * FROM m").show
> +--------+
> |       a|
> +--------+
> |[1 -> 3]|
> +--------+
> scala> sql("SELECT map(1,2,1,3)").show
> +----------------+
> | map(1, 2, 1, 3)|
> +----------------+
> |[1 -> 2, 1 -> 3]|
> +----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14989) Upgrade to Jackson 2.7.3

2018-10-24 Thread nirav patel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662668#comment-16662668
 ] 

nirav patel commented on SPARK-14989:
-

Any updates on this? Spark 2.2 still uses Jackson 1.9.13! It doesn't play well 
with other artifacts (Play 2.6, for one).

> Upgrade to Jackson 2.7.3
> 
>
> Key: SPARK-14989
> URL: https://issues.apache.org/jira/browse/SPARK-14989
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Josh Rosen
>Priority: Major
>
> For Spark 2.0, we should upgrade to a newer version of Jackson (2.7.3).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25824) Remove duplicated map entries in `showString`

2018-10-24 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-25824:
-

 Summary: Remove duplicated map entries in `showString`
 Key: SPARK-25824
 URL: https://issues.apache.org/jira/browse/SPARK-25824
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


`showString` doesn't eliminate the duplication. So, it looks different from the 
result of `collect` and from selecting the saved rows.

*Spark 2.2.2*
{code}
spark-sql> select map(1,2,1,3);
{1:3}

scala> sql("SELECT map(1,2,1,3)").collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])

scala> sql("SELECT map(1,2,1,3)").show
+---------------+
|map(1, 2, 1, 3)|
+---------------+
|    Map(1 -> 3)|
+---------------+
{code}

*Spark 2.3.0 ~ 2.4.0-rc4*
{code}
spark-sql> select map(1,2,1,3);
{1:3}

scala> sql("SELECT map(1,2,1,3)").collect
res1: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])

scala> sql("CREATE TABLE m AS SELECT map(1,2,1,3) a")
scala> sql("SELECT * FROM m").show
+--------+
|       a|
+--------+
|[1 -> 3]|
+--------+

scala> sql("SELECT map(1,2,1,3)").show
+----------------+
| map(1, 2, 1, 3)|
+----------------+
|[1 -> 2, 1 -> 3]|
+----------------+
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662666#comment-16662666
 ] 

Wenchen Fan commented on SPARK-25823:
-

Anyway we should make map lookup and toString/collect consistent. It's weird if 
`map(1,2,1,3)[1]` returns 2 but `string(map(1,2,1,3))` returns 1->3.
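
For reference, a minimal spark-shell sketch putting the three code paths side by side (a 2.4.0-rc build is assumed; the individual outputs are the ones already reported in this thread):
{code}
scala> sql("SELECT map(1,2,1,3)[1]").show      // lookup: the first matching entry wins
+------------------+
|map(1, 2, 1, 3)[1]|
+------------------+
|                 2|
+------------------+

scala> sql("SELECT map(1,2,1,3)").collect      // collect: the last entry wins
res1: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])

scala> sql("SELECT map(1,2,1,3)").show         // string form: both entries are shown
+----------------+
| map(1, 2, 1, 3)|
+----------------+
|[1 -> 2, 1 -> 3]|
+----------------+
{code}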

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this difference to these 
> functions after the RC4 vote passes, at least. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662660#comment-16662660
 ] 

Dongjoon Hyun edited comment on SPARK-25823 at 10/24/18 6:38 PM:
-

Ur, I think `collect` is the correct one, as you can see in the CTAS example. We 
save the last-win entries.
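
For example, a minimal sketch (reusing the CTAS reproduction from the SPARK-25824 description) showing that the saved table keeps only the last-win entry:
{code}
scala> sql("CREATE TABLE m AS SELECT map(1,2,1,3) a")   // duplicate key 1
scala> sql("SELECT * FROM m").show
+--------+
|       a|
+--------+
|[1 -> 3]|
+--------+
{code}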

 

BTW, [~cloud_fan]. I'm looking around this issue. I'll create another 
improvement issue to fix `show` function for the following your comment.
{quote}BTW one improvement we can do is to remove duplicated map keys when 
converting map values to string, to make it invisible to end-users.
{quote}
It's a regression introduced by SPARK-23023 at Spark 2.3.0. cc [~maropu].
{code:java}
scala> sql("SELECT map(1,2,1,3)").show
+---------------+
|map(1, 2, 1, 3)|
+---------------+
|    Map(1 -> 3)|
+---------------+

scala> spark.version
res2: String = 2.2.2
{code}


was (Author: dongjoon):
BTW, [~cloud_fan]. I'm looking around this issue. I'll create another 
improvement issue to fix `show` function for the following your comment.
bq. BTW one improvement we can do is to remove duplicated map keys when 
converting map values to string, to make it invisible to end-users.

It's a regression introduced by SPARK-23023 at Spark 2.3.0. cc [~maropu].
{code}
scala> sql("SELECT map(1,2,1,3)").show
+---------------+
|map(1, 2, 1, 3)|
+---------------+
|    Map(1 -> 3)|
+---------------+

scala> spark.version
res2: String = 2.2.2
{code}

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this difference to these 
> functions after the RC4 vote passes, at least. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662660#comment-16662660
 ] 

Dongjoon Hyun commented on SPARK-25823:
---

BTW, [~cloud_fan]. I'm looking around this issue. I'll create another 
improvement issue to fix `show` function for the following your comment.
bq. BTW one improvement we can do is to remove duplicated map keys when 
converting map values to string, to make it invisible to end-users.

It's a regression introduced by SPARK-23023 at Spark 2.3.0. cc [~maropu].
{code}
scala> sql("SELECT map(1,2,1,3)").show
+---------------+
|map(1, 2, 1, 3)|
+---------------+
|    Map(1 -> 3)|
+---------------+

scala> spark.version
res2: String = 2.2.2
{code}

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this difference to these 
> functions after the RC4 vote passes, at least. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662646#comment-16662646
 ] 

Wenchen Fan commented on SPARK-25823:
-

[~dongjoon] good catch! I think we should update collect to match the behavior 
of map lookup.

Going back to this ticket, the current behavior is different from Presto but is 
consistent with how the map type behaves in Spark. If others think this is serious, 
I'd suggest we remove the map-related high-order functions from 2.4. However, we 
can't remove `CreateMap`, so the behavior of the map type in Spark would still be 
as it was.
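
As a small SQL sketch of that point (spark-sql against a 2.4.0-rc build is assumed; the `map` output is the one quoted elsewhere in this thread, and the lookup result follows the first-match rule discussed in this thread):
{code:java}
-- Even without the new high-order functions, CreateMap already carries the duplicate key.
spark-sql> SELECT map(1,2,1,3);
{1:3}
spark-sql> SELECT map(1,2,1,3)[1];
2
{code}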

Personally, I don't want to remove the map-related high-order functions, as they 
follow the map type semantics in Spark and are implemented correctly. The only 
benefit I can think of is not spreading the unexpected behavior of the map type in 
Spark.

In the master branch we can work on making Spark map type consistent with 
Presto.

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this difference to these 
> functions after the RC4 vote passes, at least. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25812) Flaky test: PagedTableSuite.pageNavigation

2018-10-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25812:
--
Component/s: Tests

> Flaky test: PagedTableSuite.pageNavigation
> --
>
> Key: SPARK-25812
> URL: https://issues.apache.org/jira/browse/SPARK-25812
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5074/testReport/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5073/testReport/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5072/testReport/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5070/testReport/
> - 
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/]
> {code:java}
> [info] PagedTableSuite:
> [info] - pageNavigation *** FAILED *** (2 milliseconds)
> [info]   
> [info] 
> [info]class="form-inline pull-right" style="margin-bottom: 0px;">
> [info] 
> [info] 
> [info] 1 Pages. Jump to
> [info]  value="1" class="span1"/>
> [info]   
> [info] . Show 
> [info]  value="10" class="span1"/>
> [info] items in a page.
> [info]   
> [info] Go
> [info]   
> [info] 
> [info] 
> [info]   Page: 
> [info]   
> [info] 
> [info] 
> [info] 1
> [info] 
> [info] 
> [info]   
> [info] 
> [info]did not equal List() (PagedTableSuite.scala:76)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
> [info]   at 
> org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:76)
> [info]   at 
> org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:52)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky

2018-10-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23415:
--
Component/s: Tests

> BufferHolderSparkSubmitSuite is flaky
> -
>
> Key: SPARK-23415
> URL: https://issues.apache.org/jira/browse/SPARK-23415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> The test suite fails due to 60-second timeout sometimes.
> {code:java}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> {code}
>  - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/4759/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/412/]
>  (June 15th)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25542) Flaky test: OpenHashMapSuite

2018-10-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25542:
--
Component/s: Tests

> Flaky test: OpenHashMapSuite
> 
>
> Key: SPARK-25542
> URL: https://issues.apache.org/jira/browse/SPARK-25542
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> - 
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96585/testReport/org.apache.spark.util.collection/OpenHashMapSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/]
>  (Sep 25, 2018 5:52:56 PM)
> {code:java}
> org.apache.spark.util.collection.OpenHashMapSuite.(It is not a test it is a 
> sbt.testing.SuiteSelector)
> Failing for the past 1 build (Since #96585 )
> Took 0 ms.
> Error Message
> sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space
> Stacktrace
> sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: 
> Java heap space
>   at scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:117)
>   at scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:115)
>   at 
> org.apache.spark.util.collection.OpenHashMap$$anonfun$1.apply$mcVI$sp(OpenHashMap.scala:159)
>   at 
> org.apache.spark.util.collection.OpenHashSet.rehash(OpenHashSet.scala:234)
>   at 
> org.apache.spark.util.collection.OpenHashSet.rehashIfNeeded(OpenHashSet.scala:171)
>   at 
> org.apache.spark.util.collection.OpenHashMap$mcI$sp.update$mcI$sp(OpenHashMap.scala:86)
>   at 
> org.apache.spark.util.collection.OpenHashMapSuite$$anonfun$17$$anonfun$apply$4.apply$mcVI$sp(OpenHashMapSuite.scala:192)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>   at 
> org.apache.spark.util.collection.OpenHashMapSuite$$anonfun$17.apply(OpenHashMapSuite.scala:191)
>   at 
> org.apache.spark.util.collection.OpenHashMapSuite$$anonfun$17.apply(OpenHashMapSuite.scala:188)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23622) Flaky Test: HiveClientSuites

2018-10-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23622:
--
Component/s: Tests

> Flaky Test: HiveClientSuites
> 
>
> Key: SPARK-23622
> URL: https://issues.apache.org/jira/browse/SPARK-23622
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88052/testReport/org.apache.spark.sql.hive.client/HiveClientSuites/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark QA Test 
> (Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325
> {code}
> Error Message
> java.lang.reflect.InvocationTargetException: null
> Stacktrace
> sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270)
>   at 
> org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:58)
>   at 
> org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:41)
>   at 
> org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$init(HiveClientSuite.scala:48)
>   at 
> org.apache.spark.sql.hive.client.HiveClientSuite.beforeAll(HiveClientSuite.scala:71)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210)
>   at 
> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257)
>   at 
> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1255)
>   at 
> org.apache.spark.sql.hive.client.HiveClientSuites.runNestedSuites(HiveClientSuites.scala:24)
>   at org.scalatest.Suite$class.run(Suite.scala:1144)
>   at 
> org.apache.spark.sql.hive.client.HiveClientSuites.run(HiveClientSuites.scala:24)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: 
> java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:444)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:117)
>   ... 29 more
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Unable to 
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1453)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:63)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:73)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2664)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2683)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:425)
>   ... 31 more
> Caused by: sbt.ForkMain$ForkError: 
> java.lang.reflect.InvocationTargetException: null
>   at 

[jira] [Updated] (SPARK-25792) Flaky test: BarrierStageOnSubmittedSuite."submit a barrier ResultStage that requires more slots than current total under local-cluster mode"

2018-10-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25792:
--
Component/s: Tests

> Flaky test: BarrierStageOnSubmittedSuite."submit a barrier ResultStage that 
> requires more slots than current total under local-cluster mode"
> 
>
> Key: SPARK-25792
> URL: https://issues.apache.org/jira/browse/SPARK-25792
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> [info] - submit a barrier ResultStage that requires more slots than current 
> total under local-cluster mode *** FAILED *** (5 seconds, 303 milliseconds)
> [info]   Expected exception org.apache.spark.SparkException to be thrown, but 
> java.util.concurrent.TimeoutException was thrown 
> (BarrierStageOnSubmittedSuite.scala:54)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:812)
> [info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
> [info]   at 
> org.apache.spark.BarrierStageOnSubmittedSuite.org$apache$spark$BarrierStageOnSubmittedSuite$$testSubmitJob(BarrierStageOnSubmittedSuite.scala:54)
> [info]   at 
> org.apache.spark.BarrierStageOnSubmittedSuite$$anonfun$18.apply$mcV$sp(BarrierStageOnSubmittedSuite.scala:240)
> [info]   at 
> org.apache.spark.BarrierStageOnSubmittedSuite$$anonfun$18.apply(BarrierStageOnSubmittedSuite.scala:227)
> [info]   at 
> org.apache.spark.BarrierStageOnSubmittedSuite$$anonfun$18.apply(BarrierStageOnSubmittedSuite.scala:227)
> {code}
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.7/133/testReport
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.7/132/testReport
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.7/125/testReport
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.7/123/testReport



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24318) Flaky test: SortShuffleSuite

2018-10-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24318:
--
Component/s: Tests

> Flaky test: SortShuffleSuite
> 
>
> Key: SPARK-24318
> URL: https://issues.apache.org/jira/browse/SPARK-24318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/346/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/336/
> {code}
> Error Message
> java.io.IOException: Failed to delete: 
> /home/jenkins/workspace/spark-branch-2.3-test-sbt-hadoop-2.7/target/tmp/spark-14031101-7989-4fe2-81eb-a394311ab905
> Stacktrace
> sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: 
> /home/jenkins/workspace/spark-branch-2.3-test-sbt-hadoop-2.7/target/tmp/spark-14031101-7989-4fe2-81eb-a394311ab905
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1073)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24285) Flaky test: ContinuousSuite.query without test harness

2018-10-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24285:
--
Component/s: Tests

> Flaky test: ContinuousSuite.query without test harness
> --
>
> Key: SPARK-24285
> URL: https://issues.apache.org/jira/browse/SPARK-24285
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> *2.5.0-SNAPSHOT*
>  - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96640]
> {code:java}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> scala.this.Predef.Set.apply[Int](0, 1, 2, 3).map[org.apache.spark.sql.Row, 
> scala.collection.immutable.Set[org.apache.spark.sql.Row]](((x$3: Int) => 
> org.apache.spark.sql.Row.apply(x$3)))(immutable.this.Set.canBuildFrom[org.apache.spark.sql.Row]).subsetOf(scala.this.Predef.refArrayOps[org.apache.spark.sql.Row](results).toSet[org.apache.spark.sql.Row])
>  was false{code}
> *2.3.x*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24153) Flaky Test: DirectKafkaStreamSuite

2018-10-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24153:
--
Component/s: Tests

> Flaky Test: DirectKafkaStreamSuite
> --
>
> Key: SPARK-24153
> URL: https://issues.apache.org/jira/browse/SPARK-24153
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> Test Result (5 failures / +5)
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.receiving from 
> largest starting offset
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.creating 
> stream by offset
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset recovery
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset 
> recovery from kafka
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.Direct Kafka 
> stream report input information
> {code}
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/348/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24239) Flaky test: KafkaContinuousSourceSuite.subscribing topic by name from earliest offsets

2018-10-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24239:
--
Component/s: Tests

> Flaky test: KafkaContinuousSourceSuite.subscribing topic by name from 
> earliest offsets
> --
>
> Key: SPARK-24239
> URL: https://issues.apache.org/jira/browse/SPARK-24239
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/360/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/353/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24211) Flaky test: StreamingOuterJoinSuite

2018-10-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24211:
--
Component/s: Tests

> Flaky test: StreamingOuterJoinSuite
> ---
>
> Key: SPARK-24211
> URL: https://issues.apache.org/jira/browse/SPARK-24211
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> *windowed left outer join*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/]
> *windowed right outer join*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/334/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/371/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/345/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/366/]
> *left outer join with non-key condition violated*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/337/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/366/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/386/]
> *left outer early state exclusion on left*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/375]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/385/]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24173) Flaky Test: VersionsSuite

2018-10-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24173:
--
Component/s: Tests

> Flaky Test: VersionsSuite
> -
>
> Key: SPARK-24173
> URL: https://issues.apache.org/jira/browse/SPARK-24173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> *BRANCH-2.2*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/519/
> *BRANCH-2.3*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/369/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/383
> *MASTER*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/4843/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24140) Flaky test: KMeansClusterSuite.task size should be small in both training and prediction

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662637#comment-16662637
 ] 

Dongjoon Hyun commented on SPARK-24140:
---

Sorry, I don't have new links. You may take a look at the Jenkins log.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/

> Flaky test: KMeansClusterSuite.task size should be small in both training and 
> prediction
> 
>
> Key: SPARK-24140
> URL: https://issues.apache.org/jira/browse/SPARK-24140
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Tests
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/4765/
> {code}
> Error Message
> Job 0 cancelled because SparkContext was shut down
> Stacktrace
>   org.apache.spark.SparkException: Job 0 cancelled because SparkContext 
> was shut down
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
>   at 
> org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1841)
>   at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
>   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1754)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1927)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1303)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1926)
>   at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:574)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1907)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2030)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2051)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2070)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2095)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1162)
>   at org.apache.spark.rdd.RDD$$anonfun$takeSample$1.apply(RDD.scala:571)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>   at org.apache.spark.rdd.RDD.takeSample(RDD.scala:560)
>   at org.apache.spark.mllib.clustering.KMeans.initRandom(KMeans.scala:360)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24140) Flaky test: KMeansClusterSuite.task size should be small in both training and prediction

2018-10-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24140:
--
Component/s: Tests

> Flaky test: KMeansClusterSuite.task size should be small in both training and 
> prediction
> 
>
> Key: SPARK-24140
> URL: https://issues.apache.org/jira/browse/SPARK-24140
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Tests
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/4765/
> {code}
> Error Message
> Job 0 cancelled because SparkContext was shut down
> Stacktrace
>   org.apache.spark.SparkException: Job 0 cancelled because SparkContext 
> was shut down
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
>   at 
> org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1841)
>   at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
>   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1754)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1927)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1303)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1926)
>   at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:574)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1907)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2030)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2051)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2070)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2095)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1162)
>   at org.apache.spark.rdd.RDD$$anonfun$takeSample$1.apply(RDD.scala:571)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>   at org.apache.spark.rdd.RDD.takeSample(RDD.scala:560)
>   at org.apache.spark.mllib.clustering.KMeans.initRandom(KMeans.scala:360)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662618#comment-16662618
 ] 

Dongjoon Hyun edited comment on SPARK-25823 at 10/24/18 5:59 PM:
-

Right. This is related to a long-standing issue from when `CreateMap` was added. And, 
when we do collect, the last entry wins.
{code}
scala> sql("SELECT map(1,2,1,3)").collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
{code}


was (Author: dongjoon):
Right. This is a long-standing issue from when `CreateMap` was added. And, when we do 
collect, the last entry wins.
{code}
scala> sql("SELECT map(1,2,1,3)").collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
{code}

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this difference to these 
> functions after the RC4 vote passes, at least. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662618#comment-16662618
 ] 

Dongjoon Hyun commented on SPARK-25823:
---

Right. This is a long-standing issue from when `CreateMap` was added. And, when we do 
collect, the last entry wins.
{code}
scala> sql("SELECT map(1,2,1,3)").collect
res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
{code}

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this difference to these 
> functions after the RC4 vote passes, at least. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25823:
--
Labels: correctness  (was: )

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this difference to these 
> functions after the RC4 vote passes, at least. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662616#comment-16662616
 ] 

Wenchen Fan commented on SPARK-25823:
-

BTW one improvement we can do is to remove duplicated map keys when converting 
map values to string, to make it invisible to end-users.
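
A rough illustration of the idea in plain Scala (not the actual Spark code path; it assumes the last entry should win, to stay consistent with what `collect` and CTAS return):
{code}
// Keep only the last value seen for each key before rendering the map as a string.
def dedupForDisplay[K, V](entries: Seq[(K, V)]): Seq[(K, V)] = {
  val lastValue = entries.toMap                    // later entries overwrite earlier ones
  entries.map(_._1).distinct.map(k => k -> lastValue(k))
}

dedupForDisplay(Seq(1 -> 2, 1 -> 3))               // Seq((1,3)) => would display as [1 -> 3]
{code}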

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this difference to these 
> functions after the RC4 vote passes, at least. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662613#comment-16662613
 ] 

Wenchen Fan commented on SPARK-25823:
-

I did a few experiments
{code}
scala> sql("SELECT map(1,2,1,3)").show
+----------------+
| map(1, 2, 1, 3)|
+----------------+
|[1 -> 2, 1 -> 3]|
+----------------+

scala> sql("SELECT map(1,2,1,3)[1]").show
+------------------+
|map(1, 2, 1, 3)[1]|
+------------------+
|                 2|
+------------------+
{code}

So this is a long-standing problem that Spark allows duplicated map keys, and 
during lookup the first matched entry wins. I'm not sure we should change this 
behavior for now. Maybe we should find a place to document it.

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
> the duplication. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this difference to these 
> functions after the RC4 vote passes, at least. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> ---+---
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25823:
--
Description: 
This is not a regression because this occurs in new high-order functions like 
`map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
the duplication. If we want to allow this difference in new high-order functions, 
we had better add some warning about this difference to these functions after 
the RC4 vote passes, at least. Otherwise, this will surprise Presto-based users.

*Spark 2.4*
{code:java}
spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
(SELECT map_concat(map(1,2), map(1,3)) m);
spark-sql> SELECT * FROM t;
{1:3}   {1:2}
{code}
*Presto 0.212*
{code:java}
presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
   a   | _col1
-------+-------
 {1=3} | {}
{code}

  was:
This is not a regression because this occurs in new high-order functions like 
`map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
duplicate keys. If we want to allow this difference in new high-order functions, 
we had better add some warning about this difference on these functions, at 
least after the RC4 voting. Otherwise, this will surprise Presto-based users.

*Spark 2.4*
{code:java}
spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
(SELECT map_concat(map(1,2), map(1,3)) m);
spark-sql> SELECT * FROM t;
{1:3}   {1:2}
{code}
*Presto 0.212*
{code:java}
presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
   a   | _col1
-------+-------
 {1=3} | {}
{code}


> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` 
> allows duplicate keys. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this difference on these 
> functions, at least after the RC4 vote passes. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> -------+-------
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662583#comment-16662583
 ] 

Dongjoon Hyun commented on SPARK-25823:
---

cc [~cloud_fan], [~srowen], [~smilegator], [~hyukjin.kwon] and [~kabhwan].

> map_filter can generate incorrect data
> --
>
> Key: SPARK-25823
> URL: https://issues.apache.org/jira/browse/SPARK-25823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> This is not a regression because this occurs in new high-order functions like 
> `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` 
> allows duplicate keys. If we want to allow this difference in new high-order 
> functions, we had better add some warning about this difference on these 
> functions, at least after the RC4 voting. Otherwise, this will surprise 
> Presto-based users.
> *Spark 2.4*
> {code:java}
> spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
> (SELECT map_concat(map(1,2), map(1,3)) m);
> spark-sql> SELECT * FROM t;
> {1:3} {1:2}
> {code}
> *Presto 0.212*
> {code:java}
> presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
> map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
>a   | _col1
> -------+-------
>  {1=3} | {}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25822) Fix a race condition when releasing a Python worker

2018-10-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662582#comment-16662582
 ] 

Apache Spark commented on SPARK-25822:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22816

> Fix a race condition when releasing a Python worker
> ---
>
> Key: SPARK-25822
> URL: https://issues.apache.org/jira/browse/SPARK-25822
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> There is a race condition when releasing a Python worker. If 
> "ReaderIterator.handleEndOfDataSection" is not running in the task thread and 
> a task is terminated early (such as by "take(N)"), the task completion 
> listener may close the worker while "handleEndOfDataSection" can still put the 
> worker back into the worker pool for reuse.
> https://github.com/zsxwing/spark/commit/0e07b483d2e7c68f3b5c3c118d0bf58c501041b7
>  is a patch to reproduce this issue.
> I also found a user who reported this on the mailing list: 
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCAAUq=h+yluepd23nwvq13ms5hostkhx3ao4f4zqv6sgo5zm...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25822) Fix a race condition when releasing a Python worker

2018-10-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25822:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix a race condition when releasing a Python worker
> ---
>
> Key: SPARK-25822
> URL: https://issues.apache.org/jira/browse/SPARK-25822
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> There is a race condition when releasing a Python worker. If 
> "ReaderIterator.handleEndOfDataSection" is not running in the task thread and 
> a task is terminated early (such as by "take(N)"), the task completion 
> listener may close the worker while "handleEndOfDataSection" can still put the 
> worker back into the worker pool for reuse.
> https://github.com/zsxwing/spark/commit/0e07b483d2e7c68f3b5c3c118d0bf58c501041b7
>  is a patch to reproduce this issue.
> I also found a user who reported this on the mailing list: 
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCAAUq=h+yluepd23nwvq13ms5hostkhx3ao4f4zqv6sgo5zm...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25822) Fix a race condition when releasing a Python worker

2018-10-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25822:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix a race condition when releasing a Python worker
> ---
>
> Key: SPARK-25822
> URL: https://issues.apache.org/jira/browse/SPARK-25822
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> There is a race condition when releasing a Python worker. If 
> "ReaderIterator.handleEndOfDataSection" is not running in the task thread and 
> a task is terminated early (such as by "take(N)"), the task completion 
> listener may close the worker while "handleEndOfDataSection" can still put the 
> worker back into the worker pool for reuse.
> https://github.com/zsxwing/spark/commit/0e07b483d2e7c68f3b5c3c118d0bf58c501041b7
>  is a patch to reproduce this issue.
> I also found a user who reported this on the mailing list: 
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCAAUq=h+yluepd23nwvq13ms5hostkhx3ao4f4zqv6sgo5zm...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25822) Fix a race condition when releasing a Python worker

2018-10-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662580#comment-16662580
 ] 

Apache Spark commented on SPARK-25822:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22816

> Fix a race condition when releasing a Python worker
> ---
>
> Key: SPARK-25822
> URL: https://issues.apache.org/jira/browse/SPARK-25822
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> There is a race condition when releasing a Python worker. If 
> "ReaderIterator.handleEndOfDataSection" is not running in the task thread and 
> a task is terminated early (such as by "take(N)"), the task completion 
> listener may close the worker while "handleEndOfDataSection" can still put the 
> worker back into the worker pool for reuse.
> https://github.com/zsxwing/spark/commit/0e07b483d2e7c68f3b5c3c118d0bf58c501041b7
>  is a patch to reproduce this issue.
> I also found a user who reported this on the mailing list: 
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCAAUq=h+yluepd23nwvq13ms5hostkhx3ao4f4zqv6sgo5zm...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25823) map_filter can generate incorrect data

2018-10-24 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-25823:
-

 Summary: map_filter can generate incorrect data
 Key: SPARK-25823
 URL: https://issues.apache.org/jira/browse/SPARK-25823
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Dongjoon Hyun


This is not a regression because this occurs in new high-order functions like 
`map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows 
duplicate keys. If we want to allow this difference in new high-order functions, 
we had better add some warning about this difference on these functions, at 
least after the RC4 voting. Otherwise, this will surprise Presto-based users.

*Spark 2.4*
{code:java}
spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM 
(SELECT map_concat(map(1,2), map(1,3)) m);
spark-sql> SELECT * FROM t;
{1:3}   {1:2}
{code}
*Presto 0.212*
{code:java}
presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT 
map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
   a   | _col1
-------+-------
 {1=3} | {}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25822) Fix a race condition when releasing a Python worker

2018-10-24 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25822:


 Summary: Fix a race condition when releasing a Python worker
 Key: SPARK-25822
 URL: https://issues.apache.org/jira/browse/SPARK-25822
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.2
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


There is a race condition when releasing a Python worker. If 
"ReaderIterator.handleEndOfDataSection" is not running in the task thread and a 
task is terminated early (such as by "take(N)"), the task completion listener 
may close the worker while "handleEndOfDataSection" can still put the worker 
back into the worker pool for reuse.

https://github.com/zsxwing/spark/commit/0e07b483d2e7c68f3b5c3c118d0bf58c501041b7
 is a patch to reproduce this issue.

I also found a user who reported this on the mailing list: 
http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCAAUq=h+yluepd23nwvq13ms5hostkhx3ao4f4zqv6sgo5zm...@mail.gmail.com%3E
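
For illustration only, here is a minimal, self-contained sketch of the 
interleaving described above (the class and object names are hypothetical; this 
is not Spark's actual code): one thread hands the worker back to the pool while 
another thread closes it, so a later borrower can receive an already-closed 
worker.
{code}
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical stand-in for a Python worker; not Spark's implementation.
class Worker {
  @volatile var closed = false
  def close(): Unit = closed = true
}

object ReleaseRaceSketch {
  val pool = new ConcurrentLinkedQueue[Worker]()

  def main(args: Array[String]): Unit = {
    val worker = new Worker
    // Models the "handleEndOfDataSection" path returning the worker for reuse.
    val releaser = new Thread(new Runnable { def run(): Unit = { pool.add(worker) } })
    // Models the task completion listener closing the worker on early termination.
    val closer = new Thread(new Runnable { def run(): Unit = worker.close() })
    releaser.start(); closer.start()
    releaser.join(); closer.join()
    // Depending on the interleaving, the pool may now hand out a closed worker.
    Option(pool.poll()).foreach(w => println(s"reused worker, closed = ${w.closed}"))
  }
}
{code}
A typical remedy is to make the release path and the close path agree on 
ownership (for example, checking a shared "released/closed" flag under the same 
lock before re-pooling); whether that matches the linked pull request is not 
shown here.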



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25798) Internally document type conversion between Pandas data and SQL types in Pandas UDFs

2018-10-24 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-25798.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22795
[https://github.com/apache/spark/pull/22795]

> Internally document type conversion between Pandas data and SQL types in 
> Pandas UDFs
> 
>
> Key: SPARK-25798
> URL: https://issues.apache.org/jira/browse/SPARK-25798
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, UDF's type coercion is not cleanly defined. See also 
> https://github.com/apache/spark/pull/20163 and 
> https://github.com/apache/spark/pull/22610
> This JIRA aims to document the type conversion logic internally. For 
> instance:
> [Garbled and truncated in this archive: the description embeds a wide ASCII 
> matrix with one row per SQL type (boolean, tinyint, smallint, int, bigint, 
> string, date, timestamp, float, ...) and one column per Pandas input 
> value/dtype (bool, int8, int16, int32, int64, uint8, uint16, uint32, uint64, 
> object, datetime64[ns], datetime64[ns, US/Eastern], float64, object(array), 
> category, timedelta64[ns]); each cell shows the converted value, with "X" 
> marking an unsupported combination.]