[jira] [Updated] (SPARK-25832) remove newly added map related functions from FunctionRegistry
     [ https://issues.apache.org/jira/browse/SPARK-25832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-25832:
----------------------------------
    Priority: Blocker  (was: Major)

> remove newly added map related functions from FunctionRegistry
> --------------------------------------------------------------
>
>                 Key: SPARK-25832
>                 URL: https://issues.apache.org/jira/browse/SPARK-25832
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Blocker
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25832) remove newly added map related functions from FunctionRegistry
     [ https://issues.apache.org/jira/browse/SPARK-25832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25832:
------------------------------------

    Assignee: Apache Spark  (was: Wenchen Fan)

> remove newly added map related functions from FunctionRegistry
> --------------------------------------------------------------
>
>                 Key: SPARK-25832
>                 URL: https://issues.apache.org/jira/browse/SPARK-25832
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Apache Spark
>            Priority: Major
>
[jira] [Assigned] (SPARK-25832) remove newly added map related functions from FunctionRegistry
     [ https://issues.apache.org/jira/browse/SPARK-25832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25832:
------------------------------------

    Assignee: Wenchen Fan  (was: Apache Spark)

> remove newly added map related functions from FunctionRegistry
> --------------------------------------------------------------
>
>                 Key: SPARK-25832
>                 URL: https://issues.apache.org/jira/browse/SPARK-25832
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>
[jira] [Commented] (SPARK-25832) remove newly added map related functions from FunctionRegistry
    [ https://issues.apache.org/jira/browse/SPARK-25832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663233#comment-16663233 ]

Apache Spark commented on SPARK-25832:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22821

> remove newly added map related functions from FunctionRegistry
> --------------------------------------------------------------
>
>                 Key: SPARK-25832
>                 URL: https://issues.apache.org/jira/browse/SPARK-25832
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>
[jira] [Created] (SPARK-25832) remove newly added map related functions from FunctionRegistry
Wenchen Fan created SPARK-25832:
-----------------------------------

             Summary: remove newly added map related functions from FunctionRegistry
                 Key: SPARK-25832
                 URL: https://issues.apache.org/jira/browse/SPARK-25832
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Wenchen Fan
            Assignee: Wenchen Fan
[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently
    [ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663230#comment-16663230 ]

Dongjoon Hyun commented on SPARK-25829:
---------------------------------------

Thank you for further investigation! Neither task looks easy. For me, +1 for `later entry wins` semantics, because it's the Java/Scala language style and many users know those languages. Also, Spark already works that way, especially during write operations.

> Duplicated map keys are not handled consistently
> ------------------------------------------------
>
>                 Key: SPARK-25829
>                 URL: https://issues.apache.org/jira/browse/SPARK-25829
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +------------------+
> |map(1, 2, 1, 3)[1]|
> +------------------+
> |                 2|
> +------------------+
> {code}
> However, this handling is not applied consistently.
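[Editor's note] The "later entry wins" behavior referenced above is what `java.util.Map` (and Scala's mutable maps) already do on repeated insertion of the same key; a minimal illustration, not Spark code:

```java
import java.util.HashMap;
import java.util.Map;

public class LaterEntryWins {
    public static void main(String[] args) {
        // In java.util.Map, put() with an existing key overwrites the
        // previous value, i.e. the later entry wins.
        Map<Integer, Integer> m = new HashMap<>();
        m.put(1, 2);
        m.put(1, 3);
        System.out.println(m.get(1)); // prints 3
        System.out.println(m.size()); // prints 1 -- duplicates never coexist
    }
}
```

By contrast, `sql("SELECT map(1,2,1,3)[1]")` in Spark 2.4 returns 2, which is the inconsistency this ticket discusses.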
[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently
    [ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663121#comment-16663121 ]

Wenchen Fan commented on SPARK-25829:
-------------------------------------

If we decide to follow "later entry wins", the following functions need to be reverted from 2.4: MapFilter, MapZipWith, TransformKeys, TransformValues.

> Duplicated map keys are not handled consistently
> ------------------------------------------------
>
>                 Key: SPARK-25829
>                 URL: https://issues.apache.org/jira/browse/SPARK-25829
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +------------------+
> |map(1, 2, 1, 3)[1]|
> +------------------+
> |                 2|
> +------------------+
> {code}
> However, this handling is not applied consistently.
[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently
    [ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663118#comment-16663118 ]

Wenchen Fan commented on SPARK-25829:
-------------------------------------

More investigation on "later entry wins".

If we still allow duplicated keys in the map physically, the following functions need to be updated: Explode, PosExplode, GetMapValue, MapKeys, MapValues, MapEntries, TransformKeys, TransformValues, MapZipWith.

If we want to forbid duplicated keys in maps, the following functions need to be updated: CreateMap, MapFromArrays, MapFromEntries, MapConcat, MapFilter, and also reading maps from data sources.

So the "later entry wins" semantics are more ideal but need more work.

> Duplicated map keys are not handled consistently
> ------------------------------------------------
>
>                 Key: SPARK-25829
>                 URL: https://issues.apache.org/jira/browse/SPARK-25829
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +------------------+
> |map(1, 2, 1, 3)[1]|
> +------------------+
> |                 2|
> +------------------+
> {code}
> However, this handling is not applied consistently.
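[Editor's note] The second option above (forbidding duplicate keys at creation time) can be sketched outside Spark. This is a hypothetical, stdlib-only illustration of what a CreateMap-style builder would do under "later entry wins" — not Spark's actual implementation; the `createMap` helper name is invented:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DedupMapKeys {
    // Build a map from a flat key/value argument list, deduplicating keys
    // so the later entry wins. LinkedHashMap keeps first-seen key order
    // while put() on an existing key replaces the value.
    static Map<Integer, Integer> createMap(int... kv) {
        Map<Integer, Integer> m = new LinkedHashMap<>();
        for (int i = 0; i < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]); // later put() overwrites the earlier value
        }
        return m;
    }

    public static void main(String[] args) {
        // map(1,2,1,3) deduplicates to a single entry under "later entry wins"
        System.out.println(createMap(1, 2, 1, 3)); // prints {1=3}
    }
}
```

Functions that read a map (GetMapValue, Explode, MapKeys, ...) would then never see duplicates, which is why this option concentrates the changes in the map-creating functions and data-source readers listed above.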
[jira] [Comment Edited] (SPARK-25829) Duplicated map keys are not handled consistently
    [ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663103#comment-16663103 ]

Wenchen Fan edited comment on SPARK-25829 at 10/25/18 2:20 AM:
---------------------------------------------------------------

After more thought, both the map lookup behavior and the `Dataset.collect` behavior are visible to end users. It's hard to say which one is the official semantic, as there is no doc, and we have to change the behavior of one of them.

If we want to stick with the "earlier entry wins" semantics, then we need to fix the 3 sub-tasks listed here.

If we want to stick with the "later entry wins" semantics, then we need to fix the map lookup (GetMapValue) and other related functions like `map_filter`, or deduplicate map keys at all the places that may create a map. And for 2.4 we should revert these functions if they are newly added, like `map_filter`.

Any ideas? cc [~rxin] [~LI,Xiao] [~dongjoon] [~viirya] [~mgaido]

was (Author: cloud_fan):
After more thought, both the map lookup behavior and the `Dataset.collect` behavior are visible to end users. It's hard to say which one is the official semantic, as there is no doc, and we have to change the behavior of one of them.

If we want to stick with the "earlier entry wins" semantics, then we need to fix the 3 sub-tasks listed here.

If we want to stick with the "later entry wins" semantics, then we need to fix the map lookup (GetMapValue) and other related functions like `map_filter`. And for 2.4 we should revert these functions if they are newly added, like `map_filter`.

Any ideas? cc [~rxin] [~LI,Xiao] [~dongjoon] [~viirya] [~mgaido]

> Duplicated map keys are not handled consistently
> ------------------------------------------------
>
>                 Key: SPARK-25829
>                 URL: https://issues.apache.org/jira/browse/SPARK-25829
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +------------------+
> |map(1, 2, 1, 3)[1]|
> +------------------+
> |                 2|
> +------------------+
> {code}
> However, this handling is not applied consistently.
[jira] [Commented] (SPARK-25829) Duplicated map keys are not handled consistently
    [ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663103#comment-16663103 ]

Wenchen Fan commented on SPARK-25829:
-------------------------------------

After more thought, both the map lookup behavior and the `Dataset.collect` behavior are visible to end users. It's hard to say which one is the official semantic, as there is no doc, and we have to change the behavior of one of them.

If we want to stick with the "earlier entry wins" semantics, then we need to fix the 3 sub-tasks listed here.

If we want to stick with the "later entry wins" semantics, then we need to fix the map lookup (GetMapValue) and other related functions like `map_filter`. And for 2.4 we should revert these functions if they are newly added, like `map_filter`.

Any ideas? cc [~rxin] [~LI,Xiao] [~dongjoon] [~viirya] [~mgaido]

> Duplicated map keys are not handled consistently
> ------------------------------------------------
>
>                 Key: SPARK-25829
>                 URL: https://issues.apache.org/jira/browse/SPARK-25829
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +------------------+
> |map(1, 2, 1, 3)[1]|
> +------------------+
> |                 2|
> +------------------+
> {code}
> However, this handling is not applied consistently.
[jira] [Resolved] (SPARK-24917) Sending a partition over netty results in 2x memory usage
     [ https://issues.apache.org/jira/browse/SPARK-24917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-24917.
-------------------------------
    Resolution: Won't Fix

Evidently obsoleted by the update to Netty 4.1.30.

> Sending a partition over netty results in 2x memory usage
> ---------------------------------------------------------
>
>                 Key: SPARK-24917
>                 URL: https://issues.apache.org/jira/browse/SPARK-24917
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.2
>            Reporter: Vincent
>            Priority: Major
>
> Hello,
> While investigating some OOM errors in Spark 2.2 [(here's my call stack)|https://image.ibb.co/hHa2R8/sparkOOM.png], I found the following behavior, which I think is weird:
> * a request happens to send a partition over the network
> * this partition is 1.9 GB and is persisted in memory
> * this partition is apparently stored in a ByteBufferBlockData, which is made of a ChunkedByteBuffer, which is a list of (lots of) ByteBuffers of 4 MB each
> * the call to toNetty() is supposed to only wrap all the arrays and not allocate any memory
> * yet the call stack shows that netty is allocating memory and trying to consolidate all the chunks into one big 1.9 GB array
> * this means that at this point the memory footprint is 2x the size of the actual partition (which is huge when the partition is 1.9 GB)
> Is this transient allocation expected?
> After digging, it turns out that the actual copy is due to [this method|https://github.com/netty/netty/blob/4.0/buffer/src/main/java/io/netty/buffer/Unpooled.java#L260] in netty. If my initial buffer is made of more than DEFAULT_MAX_COMPONENTS (16) components, it will trigger a re-allocation of the whole buffer. This netty issue was fixed in this recent change: [https://github.com/netty/netty/commit/9b95b8ee628983e3e4434da93fffb893edff4aa2]
> As a result, is it possible to benefit from this change somehow in Spark 2.2 and above? I don't know how the netty dependencies are handled for Spark.
> NB: it seems this ticket: [https://jira.apache.org/jira/browse/SPARK-24307] kind of changed the approach for Spark 2.4 by bypassing the netty buffer altogether. However, as written in the ticket, that approach *still* needs to have the *entire* block serialized in memory, so this would be a downgrade from fixing the netty issue when your buffer is < 2 GB.
> Thanks!
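[Editor's note] The 2x transient footprint described in the report can be sketched without the Netty dependency: consolidating many wrapped chunks into one contiguous buffer (what the linked `Unpooled` code path did once the component count exceeded 16) requires a second full-size allocation while the original chunks are still referenced. A stdlib-only Java sketch of that cost, with invented names:

```java
import java.util.ArrayList;
import java.util.List;

public class ConsolidationCost {
    // Copy all chunks into one contiguous array. While this runs, both the
    // chunk list and the output array are live, so peak memory is roughly
    // 2x the data size -- the behavior observed in the ticket.
    static byte[] consolidate(List<byte[]> chunks) {
        int total = 0;
        for (byte[] c : chunks) total += c.length;
        byte[] out = new byte[total]; // second full-size allocation
        int pos = 0;
        for (byte[] c : chunks) {
            System.arraycopy(c, 0, out, pos, c.length);
            pos += c.length;
        }
        return out;
    }

    public static void main(String[] args) {
        List<byte[]> chunks = new ArrayList<>();
        for (int i = 0; i < 32; i++) chunks.add(new byte[4 * 1024]); // 32 x 4 KB chunks
        byte[] merged = consolidate(chunks);
        System.out.println(merged.length); // prints 131072, held in addition to the chunks
    }
}
```

The Netty fix referenced above avoids this by keeping the buffer composite instead of flattening it once the component limit is reached.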
[jira] [Updated] (SPARK-25830) should apply "earlier entry wins" in Dataset.collect
     [ https://issues.apache.org/jira/browse/SPARK-25830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-25830:
--------------------------------
    Description: 
{code}
scala> sql("select map(1,2,1,3)").collect
res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
{code}
We mistakenly apply "later entry wins"

> should apply "earlier entry wins" in Dataset.collect
> ----------------------------------------------------
>
>                 Key: SPARK-25830
>                 URL: https://issues.apache.org/jira/browse/SPARK-25830
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> {code}
> scala> sql("select map(1,2,1,3)").collect
> res14: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
> {code}
> We mistakenly apply "later entry wins"
[jira] [Updated] (SPARK-25831) should apply "earlier entry wins" in hive map value converter
     [ https://issues.apache.org/jira/browse/SPARK-25831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-25831:
--------------------------------
    Description: 
{code}
scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
res11: org.apache.spark.sql.DataFrame = []

scala> sql("select * from t").show
+--------+
|     map|
+--------+
|[1 -> 3]|
+--------+
{code}
We mistakenly apply "later entry wins"

> should apply "earlier entry wins" in hive map value converter
> -------------------------------------------------------------
>
>                 Key: SPARK-25831
>                 URL: https://issues.apache.org/jira/browse/SPARK-25831
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> {code}
> scala> sql("CREATE TABLE t AS SELECT map(1,2,1,3) as map")
> res11: org.apache.spark.sql.DataFrame = []
>
> scala> sql("select * from t").show
> +--------+
> |     map|
> +--------+
> |[1 -> 3]|
> +--------+
> {code}
> We mistakenly apply "later entry wins"
[jira] [Created] (SPARK-25831) should apply "earlier entry wins" in hive map value converter
Wenchen Fan created SPARK-25831:
-----------------------------------

             Summary: should apply "earlier entry wins" in hive map value converter
                 Key: SPARK-25831
                 URL: https://issues.apache.org/jira/browse/SPARK-25831
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Wenchen Fan
[jira] [Created] (SPARK-25830) should apply "earlier entry wins" in Dataset.collect
Wenchen Fan created SPARK-25830:
-----------------------------------

             Summary: should apply "earlier entry wins" in Dataset.collect
                 Key: SPARK-25830
                 URL: https://issues.apache.org/jira/browse/SPARK-25830
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Wenchen Fan
[jira] [Updated] (SPARK-25824) Remove duplicated map entries in `showString`
     [ https://issues.apache.org/jira/browse/SPARK-25824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-25824:
--------------------------------
    Issue Type: Sub-task  (was: Bug)
        Parent: SPARK-25829

> Remove duplicated map entries in `showString`
> ---------------------------------------------
>
>                 Key: SPARK-25824
>                 URL: https://issues.apache.org/jira/browse/SPARK-25824
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>
> `showString` doesn't eliminate the duplication. So, it looks different from the result of `collect` and select from saved rows.
> *Spark 2.2.2*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
>
> scala> sql("SELECT map(1,2,1,3)").collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>
> scala> sql("SELECT map(1,2,1,3)").show
> +---------------+
> |map(1, 2, 1, 3)|
> +---------------+
> |    Map(1 -> 3)|
> +---------------+
> {code}
> *Spark 2.3.0 ~ 2.4.0-rc4*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
>
> scala> sql("SELECT map(1,2,1,3)").collect
> res1: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>
> scala> sql("CREATE TABLE m AS SELECT map(1,2,1,3) a")
> scala> sql("SELECT * FROM m").show
> +--------+
> |       a|
> +--------+
> |[1 -> 3]|
> +--------+
>
> scala> sql("SELECT map(1,2,1,3)").show
> +----------------+
> | map(1, 2, 1, 3)|
> +----------------+
> |[1 -> 2, 1 -> 3]|
> +----------------+
> {code}
[jira] [Updated] (SPARK-25829) Duplicated map keys are not handled consistently
     [ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-25829:
--------------------------------
    Description: 
In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. e.g.
{code}
scala> sql("SELECT map(1,2,1,3)[1]").show
+------------------+
|map(1, 2, 1, 3)[1]|
+------------------+
|                 2|
+------------------+
{code}
However, this handling is not applied consistently.

  was:
In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. e.g.
{code}
scala> sql("SELECT map(1,2,1,3)[1]").show
+------------------+
|map(1, 2, 1, 3)[1]|
+------------------+
|                 2|
+------------------+
{code}
However, this handling is not applied consistenly.

> Duplicated map keys are not handled consistently
> ------------------------------------------------
>
>                 Key: SPARK-25829
>                 URL: https://issues.apache.org/jira/browse/SPARK-25829
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +------------------+
> |map(1, 2, 1, 3)[1]|
> +------------------+
> |                 2|
> +------------------+
> {code}
> However, this handling is not applied consistently.
[jira] [Updated] (SPARK-25829) Duplicated map keys are not handled consistently
     [ https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-25829:
--------------------------------
    Description: 
In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. e.g.
{code}
scala> sql("SELECT map(1,2,1,3)[1]").show
+------------------+
|map(1, 2, 1, 3)[1]|
+------------------+
|                 2|
+------------------+
{code}
However, this handling is not applied consistenly.

> Duplicated map keys are not handled consistently
> ------------------------------------------------
>
>                 Key: SPARK-25829
>                 URL: https://issues.apache.org/jira/browse/SPARK-25829
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +------------------+
> |map(1, 2, 1, 3)[1]|
> +------------------+
> |                 2|
> +------------------+
> {code}
> However, this handling is not applied consistenly.
[jira] [Created] (SPARK-25829) Duplicated map keys are not handled consistently
Wenchen Fan created SPARK-25829:
-----------------------------------

             Summary: Duplicated map keys are not handled consistently
                 Key: SPARK-25829
                 URL: https://issues.apache.org/jira/browse/SPARK-25829
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Wenchen Fan
[jira] [Commented] (SPARK-25824) Remove duplicated map entries in `showString`
    [ https://issues.apache.org/jira/browse/SPARK-25824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663060#comment-16663060 ]

Dongjoon Hyun commented on SPARK-25824:
---------------------------------------

According to [~cloud_fan]'s analysis, this one is converted to a bug.
- https://lists.apache.org/thread.html/11afc74162b922fbef81db1e96c082f2e6f217d79dc1d82ec2702aef@%3Cdev.spark.apache.org%3E

> Remove duplicated map entries in `showString`
> ---------------------------------------------
>
>                 Key: SPARK-25824
>                 URL: https://issues.apache.org/jira/browse/SPARK-25824
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>
> `showString` doesn't eliminate the duplication. So, it looks different from the result of `collect` and select from saved rows.
> *Spark 2.2.2*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
>
> scala> sql("SELECT map(1,2,1,3)").collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>
> scala> sql("SELECT map(1,2,1,3)").show
> +---------------+
> |map(1, 2, 1, 3)|
> +---------------+
> |    Map(1 -> 3)|
> +---------------+
> {code}
> *Spark 2.3.0 ~ 2.4.0-rc4*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
>
> scala> sql("SELECT map(1,2,1,3)").collect
> res1: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>
> scala> sql("CREATE TABLE m AS SELECT map(1,2,1,3) a")
> scala> sql("SELECT * FROM m").show
> +--------+
> |       a|
> +--------+
> |[1 -> 3]|
> +--------+
>
> scala> sql("SELECT map(1,2,1,3)").show
> +----------------+
> | map(1, 2, 1, 3)|
> +----------------+
> |[1 -> 2, 1 -> 3]|
> +----------------+
> {code}
[jira] [Updated] (SPARK-25824) Remove duplicated map entries in `showString`
     [ https://issues.apache.org/jira/browse/SPARK-25824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-25824:
----------------------------------
    Issue Type: Bug  (was: Improvement)

> Remove duplicated map entries in `showString`
> ---------------------------------------------
>
>                 Key: SPARK-25824
>                 URL: https://issues.apache.org/jira/browse/SPARK-25824
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>
> `showString` doesn't eliminate the duplication. So, it looks different from the result of `collect` and select from saved rows.
> *Spark 2.2.2*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
>
> scala> sql("SELECT map(1,2,1,3)").collect
> res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>
> scala> sql("SELECT map(1,2,1,3)").show
> +---------------+
> |map(1, 2, 1, 3)|
> +---------------+
> |    Map(1 -> 3)|
> +---------------+
> {code}
> *Spark 2.3.0 ~ 2.4.0-rc4*
> {code}
> spark-sql> select map(1,2,1,3);
> {1:3}
>
> scala> sql("SELECT map(1,2,1,3)").collect
> res1: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)])
>
> scala> sql("CREATE TABLE m AS SELECT map(1,2,1,3) a")
> scala> sql("SELECT * FROM m").show
> +--------+
> |       a|
> +--------+
> |[1 -> 3]|
> +--------+
>
> scala> sql("SELECT map(1,2,1,3)").show
> +----------------+
> | map(1, 2, 1, 3)|
> +----------------+
> |[1 -> 2, 1 -> 3]|
> +----------------+
> {code}
[jira] [Commented] (SPARK-25828) Bumping Version of kubernetes.client to latest version
    [ https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663045#comment-16663045 ]

Apache Spark commented on SPARK-25828:
--------------------------------------

User 'ifilonenko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22820

> Bumping Version of kubernetes.client to latest version
> ------------------------------------------------------
>
>                 Key: SPARK-25828
>                 URL: https://issues.apache.org/jira/browse/SPARK-25828
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.0.0
>            Reporter: Ilan Filonenko
>            Priority: Minor
>
> Upgrade the Kubernetes client version to at least [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] as we are falling behind on fabric8 updates. This will be an update to both kubernetes/core and kubernetes/integration-tests
[jira] [Commented] (SPARK-25828) Bumping Version of kubernetes.client to latest version
    [ https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663043#comment-16663043 ]

Apache Spark commented on SPARK-25828:
--------------------------------------

User 'ifilonenko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22820

> Bumping Version of kubernetes.client to latest version
> ------------------------------------------------------
>
>                 Key: SPARK-25828
>                 URL: https://issues.apache.org/jira/browse/SPARK-25828
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.0.0
>            Reporter: Ilan Filonenko
>            Priority: Minor
>
> Upgrade the Kubernetes client version to at least [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] as we are falling behind on fabric8 updates. This will be an update to both kubernetes/core and kubernetes/integration-tests
[jira] [Assigned] (SPARK-25828) Bumping Version of kubernetes.client to latest version
     [ https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25828:
------------------------------------

    Assignee: Apache Spark

> Bumping Version of kubernetes.client to latest version
> ------------------------------------------------------
>
>                 Key: SPARK-25828
>                 URL: https://issues.apache.org/jira/browse/SPARK-25828
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.0.0
>            Reporter: Ilan Filonenko
>            Assignee: Apache Spark
>            Priority: Minor
>
> Upgrade the Kubernetes client version to at least [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] as we are falling behind on fabric8 updates. This will be an update to both kubernetes/core and kubernetes/integration-tests
[jira] [Assigned] (SPARK-25828) Bumping Version of kubernetes.client to latest version
     [ https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25828:
------------------------------------

    Assignee: (was: Apache Spark)

> Bumping Version of kubernetes.client to latest version
> ------------------------------------------------------
>
>                 Key: SPARK-25828
>                 URL: https://issues.apache.org/jira/browse/SPARK-25828
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.0.0
>            Reporter: Ilan Filonenko
>            Priority: Minor
>
> Upgrade the Kubernetes client version to at least [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] as we are falling behind on fabric8 updates. This will be an update to both kubernetes/core and kubernetes/integration-tests
[jira] [Commented] (SPARK-25828) Bumping Version of kubernetes.client to latest version
    [ https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663033#comment-16663033 ]

Stavros Kontopoulos commented on SPARK-25828:
---------------------------------------------

Cool [~eje]. [~ifilonenko], are you working on it?

> Bumping Version of kubernetes.client to latest version
> ------------------------------------------------------
>
>                 Key: SPARK-25828
>                 URL: https://issues.apache.org/jira/browse/SPARK-25828
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.0.0
>            Reporter: Ilan Filonenko
>            Priority: Minor
>
> Upgrade the Kubernetes client version to at least [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] as we are falling behind on fabric8 updates. This will be an update to both kubernetes/core and kubernetes/integration-tests
[jira] [Resolved] (SPARK-23229) Dataset.hint should use planWithBarrier logical plan
     [ https://issues.apache.org/jira/browse/SPARK-23229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-23229.
-------------------------------
    Resolution: Won't Fix

> Dataset.hint should use planWithBarrier logical plan
> ----------------------------------------------------
>
>                 Key: SPARK-23229
>                 URL: https://issues.apache.org/jira/browse/SPARK-23229
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Jacek Laskowski
>            Priority: Trivial
>
> Every time {{Dataset.hint}} is used, it triggers execution of logical commands, their unions, and hint resolution (among other things that the analyzer does).
> {{hint}} should use {{planWithBarrier}} instead.
[jira] [Commented] (SPARK-25828) Bumping Version of kubernetes.client to latest version
    [ https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662978#comment-16662978 ]

Erik Erlandson commented on SPARK-25828:
----------------------------------------

cc [~skonto]

> Bumping Version of kubernetes.client to latest version
> ------------------------------------------------------
>
>                 Key: SPARK-25828
>                 URL: https://issues.apache.org/jira/browse/SPARK-25828
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.0.0
>            Reporter: Ilan Filonenko
>            Priority: Minor
>
> Upgrade the Kubernetes client version to at least [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] as we are falling behind on fabric8 updates. This will be an update to both kubernetes/core and kubernetes/integration-tests
[jira] [Updated] (SPARK-25828) Bumping Version of kubernetes.client to latest version
[ https://issues.apache.org/jira/browse/SPARK-25828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilan Filonenko updated SPARK-25828: --- Description: Upgrade the Kubernetes client version to at least [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] as we are falling behind on fabric8 updates. This will be an update to both kubernetes/core and kubernetes/integration-tests was: Upgrade the Kubernetes client version to at least [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] as we are falling behind on fabric8 updates. This will be an update to both in kubernetes/core and kubernetes/integration-tests > Bumping Version of kubernetes.client to latest version > -- > > Key: SPARK-25828 > URL: https://issues.apache.org/jira/browse/SPARK-25828 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Ilan Filonenko >Priority: Minor > > Upgrade the Kubernetes client version to at least > [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] > as we are falling behind on fabric8 updates. This will be an update to both > kubernetes/core and kubernetes/integration-tests -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25828) Bumping Version of kubernetes.client to latest version
Ilan Filonenko created SPARK-25828: -- Summary: Bumping Version of kubernetes.client to latest version Key: SPARK-25828 URL: https://issues.apache.org/jira/browse/SPARK-25828 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.0.0 Reporter: Ilan Filonenko Upgrade the Kubernetes client version to at least [4.0.0|https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.0.0] as we are falling behind on fabric8 updates. This will be an update to both in kubernetes/core and kubernetes/integration-tests -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
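[Editor's sketch] The upgrade described in SPARK-25828 amounts to bumping the fabric8 client dependency in the kubernetes/core and kubernetes/integration-tests modules. A minimal POM fragment of what that bump might look like is below; the module path and the exact version are assumptions (the ticket only names 4.0.0 as a floor, and Spark's build may manage this version through a shared property rather than an inline version element):

```xml
<!-- resource-managers/kubernetes/core/pom.xml (hypothetical fragment) -->
<dependency>
  <groupId>io.fabric8</groupId>
  <artifactId>kubernetes-client</artifactId>
  <!-- 4.0.0 is the minimum called out in the ticket; a later 4.x may be chosen -->
  <version>4.0.0</version>
</dependency>
```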
[jira] [Resolved] (SPARK-24899) Add example of monotonically_increasing_id standard function to scaladoc
[ https://issues.apache.org/jira/browse/SPARK-24899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24899. --- Resolution: Won't Fix > Add example of monotonically_increasing_id standard function to scaladoc > > > Key: SPARK-24899 > URL: https://issues.apache.org/jira/browse/SPARK-24899 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.3.1 >Reporter: Jacek Laskowski >Priority: Minor > > I think an example of {{monotonically_increasing_id}} standard function in > scaladoc would help people understand why the function is monotonically > increasing and unique but not consecutive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
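[Editor's sketch] The scaladoc example SPARK-24899 asks for comes down to the ID's bit layout: the generated ID packs the partition ID into the upper 31 bits and the record's position within its partition into the lower 33 bits. That layout can be illustrated outside Spark; the sketch below is a plain-Python model of the arithmetic, not Spark's implementation (which computes this per task at runtime):

```python
def monotonic_id(partition_id: int, row_in_partition: int) -> int:
    # Upper 31 bits: partition ID; lower 33 bits: record position within
    # the partition -- the layout described in the function's scaladoc.
    return (partition_id << 33) | row_in_partition

# Three partitions with two rows each: IDs strictly increase and are
# unique, but they jump by 2**33 at every partition boundary, so they
# are monotonically increasing without being consecutive.
ids = [monotonic_id(p, r) for p in range(3) for r in range(2)]
print(ids)  # → [0, 1, 8589934592, 8589934593, 17179869184, 17179869185]
```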
[jira] [Assigned] (SPARK-25490) Refactor KryoBenchmark
[ https://issues.apache.org/jira/browse/SPARK-25490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25490: - Assignee: Gengliang Wang > Refactor KryoBenchmark > -- > > Key: SPARK-25490 > URL: https://issues.apache.org/jira/browse/SPARK-25490 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > Refactor KryoBenchmark to use main method and print the output as a separate > file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25490) Refactor KryoBenchmark
[ https://issues.apache.org/jira/browse/SPARK-25490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25490. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22663 [https://github.com/apache/spark/pull/22663] > Refactor KryoBenchmark > -- > > Key: SPARK-25490 > URL: https://issues.apache.org/jira/browse/SPARK-25490 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > Refactor KryoBenchmark to use main method and print the output as a separate > file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
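[Editor's sketch] The refactor pattern in SPARK-25490 (drive the benchmark from a main method and write results to a separate file instead of stdout) can be sketched generically. The harness below is a hedged illustration in Python, not the Scala KryoBenchmark itself; the benchmark name, output filename, and workload are all made up for the example:

```python
import time

def timed(fn):
    """Return the wall-clock seconds one call to fn takes."""
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

def bench(name, fn, iters=3):
    """Run fn a few times and format the best time as a result line."""
    best = min(timed(fn) for _ in range(iters))
    return f"{name}: {best * 1e3:.3f} ms"

def main(out_path="benchmark-results.txt"):
    # The refactor's key point: collect result lines and write them to a
    # separate file rather than interleaving them with test output.
    lines = [bench("sum-1e5", lambda: sum(range(100_000)))]
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return out_path

if __name__ == "__main__":
    main()
```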
[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662899#comment-16662899 ] Dongjoon Hyun commented on SPARK-25823: --- Right. That one is cosmetic. I send my opinion on dev mailing list, too. Please feel free to adjust the priority if needed for releasing 2.4.0. > map_filter can generate incorrect data > -- > > Key: SPARK-25823 > URL: https://issues.apache.org/jira/browse/SPARK-25823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Labels: correctness > > This is not a regression because this occurs in new high-order functions like > `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows > the duplication. If we want to allow this difference in new high-order > functions, we had better add some warning about this different on these > functions after RC4 voting pass at least. Otherwise, this will surprise > Presto-based users. > *Spark 2.4* > {code:java} > spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM > (SELECT map_concat(map(1,2), map(1,3)) m); > spark-sql> SELECT * FROM t; > {1:3} {1:2} > {code} > *Presto 0.212* > {code:java} > presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT > map_concat(map(array[1],array[2]), map(array[1],array[3])) a); >a | _col1 > ---+--- > {1=3} | {} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
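[Editor's sketch] The surprising output in the ticket follows from `CreateMap` keeping duplicate keys as raw entries: displaying the map resolves duplicates one way, while an entry-wise function like `map_filter` walks every stored entry. Plain Python lists and dicts give an analogy (this is only an illustration of the semantics; Spark's internal representation differs):

```python
# Spark's CreateMap stores entries as-is, so duplicate keys can coexist.
# map_concat(map(1,2), map(1,3)) produces two entries for key 1:
entries = [(1, 2), (1, 3)]

# A lookup-style view resolves duplicates (here, last entry wins),
# which matches the displayed value {1:3} in the ticket:
as_displayed = dict(entries)

# But an entry-wise function like map_filter(m, (k,v) -> v=2) sees
# every raw entry, so it can keep an entry the displayed map never shows:
filtered = [(k, v) for k, v in entries if v == 2]

print(as_displayed)  # → {1: 3}
print(filtered)      # → [(1, 2)]  -- the {1:2} result from the ticket
```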
[jira] [Comment Edited] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662899#comment-16662899 ] Dongjoon Hyun edited comment on SPARK-25823 at 10/24/18 9:42 PM: - Right. That one is cosmetic. Not this one. I sent my opinion on dev mailing list, too. Please feel free to adjust the priority if needed for releasing 2.4.0. was (Author: dongjoon): Right. That one is cosmetic. I send my opinion on dev mailing list, too. Please feel free to adjust the priority if needed for releasing 2.4.0. > map_filter can generate incorrect data > -- > > Key: SPARK-25823 > URL: https://issues.apache.org/jira/browse/SPARK-25823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Labels: correctness > > This is not a regression because this occurs in new high-order functions like > `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows > the duplication. If we want to allow this difference in new high-order > functions, we had better add some warning about this different on these > functions after RC4 voting pass at least. Otherwise, this will surprise > Presto-based users. > *Spark 2.4* > {code:java} > spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM > (SELECT map_concat(map(1,2), map(1,3)) m); > spark-sql> SELECT * FROM t; > {1:3} {1:2} > {code} > *Presto 0.212* > {code:java} > presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT > map_concat(map(array[1],array[2]), map(array[1],array[3])) a); >a | _col1 > ---+--- > {1=3} | {} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662891#comment-16662891 ] Sean Owen commented on SPARK-25823: --- I was referring to your comment at https://issues.apache.org/jira/browse/SPARK-25824?focusedCommentId=16662725=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16662725 – this one is not cosmetic. > map_filter can generate incorrect data > -- > > Key: SPARK-25823 > URL: https://issues.apache.org/jira/browse/SPARK-25823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Labels: correctness > > This is not a regression because this occurs in new high-order functions like > `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows > the duplication. If we want to allow this difference in new high-order > functions, we had better add some warning about this different on these > functions after RC4 voting pass at least. Otherwise, this will surprise > Presto-based users. > *Spark 2.4* > {code:java} > spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM > (SELECT map_concat(map(1,2), map(1,3)) m); > spark-sql> SELECT * FROM t; > {1:3} {1:2} > {code} > *Presto 0.212* > {code:java} > presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT > map_concat(map(array[1],array[2]), map(array[1],array[3])) a); >a | _col1 > ---+--- > {1=3} | {} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion
[ https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662830#comment-16662830 ] kevin yu commented on SPARK-25807: -- Thanks Sean, ok, I will leave as it is. > Mitigate 1-based substr() confusion > --- > > Key: SPARK-25807 > URL: https://issues.apache.org/jira/browse/SPARK-25807 > Project: Spark > Issue Type: Improvement > Components: Java API, PySpark >Affects Versions: 1.3.0, 2.3.2, 2.4.0, 2.5.0, 3.0.0 >Reporter: Oron Navon >Priority: Minor > > The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's > {{SUBSTRING}}, and contradicting both Python's {{substr}} and Java's > {{substr}}, which are zero-based. Both PySpark users and Java API users > often naturally expect a 0-based {{substr()}}. Adding to the confusion, > {{substr()}} currently allows a {{startPos}} value of 0, which returns the > same result as {{startPos==1}}. > Since changing {{substr()}} to 0-based is probably NOT a reasonable option > here, I suggest making one or more of the following changes: > # Adding a method {{substr0}}, which would be zero-based > # Renaming {{substr}} to {{substr1}} > # Making the existing {{substr()}} throw an exception on {{startPos==0}}, > which should catch and alert most users who expect zero-based behavior. > This is my first discussion on this project, apologies for any faux pas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23015) spark-submit fails when submitting several jobs in parallel
[ https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662750#comment-16662750 ] Kevin Grealish edited comment on SPARK-23015 at 10/24/18 8:51 PM: -- One workaround is to create a temp directory in temp and set that to be the TEMP directory for that process being launched. This way each process you launch gets its own temp space. For example, when launching from C#: {{ // Workaround for Spark bug https://issues.apache.org/jira/browse/SPARK-23015 // Spark Submit's launching library prints the command to execute the launcher (org.apache.spark.launcher.main) // to a temporary text file ("%TEMP%\spark-class-launcher-output-%RANDOM%.txt"), reads the result back into a variable, // and then executes that command. %RANDOM% does not have sufficient range to avoid collisions when launching many Spark processes. // As a result the Spark processes end up running one another's commands (silently) or give errors like: // "The process cannot access the file because it is being used by another process." // "The system cannot find the file C:\VsoAgent\_work\_temp\spark-class-launcher-output-654.txt." // As a workaround, we give each run its own TEMP directory, which we create using a GUID. string newTemp = null; if (AppRuntimeEnvironment.IsRunningOnWindows()) { var ourTemp = Environment.GetEnvironmentVariable("TEMP"); var newDirName = "dprep" + Convert.ToBase64String(Guid.NewGuid().ToByteArray()).Substring(0, 22).Replace('/', '-'); newTemp = Path.Combine(ourTemp, newDirName); Directory.CreateDirectory(newTemp); start.Environment["TEMP"] = newTemp; } }} was (Author: kevingre): One workaround is to create a temp directory in temp and set that to be the TEMP directory for that process being launched. This way each process you launch gets its on temp speace. 
For example, when launching from C#: {{ // Workaround for Spark bug https://issues.apache.org/jira/browse/SPARK-23015 // Spark Submit's launching library prints the command to execute the launcher (org.apache.spark.launcher.main) // to a temporary text file ("%TEMP%\spark-class-launcher-output-%RANDOM%.txt"), reads the result back into a variable, // and then executes that command. %RANDOM% does not have sufficent range to avoid collisions when launching many Spark processes. // As a result the Spark processes end up running one anothers' commands (silently) or gives an error like: // "The process cannot access the file because it is being used by another process." // "The system cannot find the file C:\VsoAgent\_work\_temp\spark-class-launcher-output-654.txt." // As a workaround, we give each run its own TEMP directory, we create using a GUID. string newTemp = null; if (AppRuntimeEnvironment.IsRunningOnWindows()) { var ourTemp = Environment.GetEnvironmentVariable("TEMP"); var newDirName = "dprep" + Convert.ToBase64String(Guid.NewGuid().ToByteArray()).Substring(0, 22).Replace('/', '-'); newTemp = Path.Combine(ourTemp, newDirName); Directory.CreateDirectory(newTemp); start.Environment["TEMP"] = newTemp; } }} > spark-submit fails when submitting several jobs in parallel > --- > > Key: SPARK-23015 > URL: https://issues.apache.org/jira/browse/SPARK-23015 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1 > Environment: Windows 10 (1709/16299.125) > Spark 2.3.0 > Java 8, Update 151 >Reporter: Hugh Zabriskie >Priority: Major > > Spark Submit's launching library prints the command to execute the launcher > (org.apache.spark.launcher.main) to a temporary text file, reads the result > back into a variable, and then executes that command. 
> {code} > set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt > "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main > %* > %LAUNCHER_OUTPUT% > {code} > [bin/spark-class2.cmd, > L67|https://github.com/apache/spark/blob/master/bin/spark-class2.cmd#L66] > That temporary text file is given a pseudo-random name by the %RANDOM% env > variable generator, which generates a number between 0 and 32767. > This appears to be the cause of an error occurring when several spark-submit > jobs are launched simultaneously. The following error is returned from stderr: >
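[Editor's sketch] The report's claim that `%RANDOM%` lacks the range to avoid collisions can be quantified with the birthday bound: cmd.exe's `%RANDOM%` draws from only 32768 values (0..32767), so the chance that concurrent spark-submit launches pick the same launcher-output filename grows fast. A back-of-envelope sketch (not Spark code):

```python
def collision_probability(n, space=32768):
    """Probability that n independent draws from `space` values collide."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (space - i) / space
    return 1.0 - p_distinct

# With %RANDOM%'s 32768 values, collisions pass 50% at roughly n = 214
# concurrent launches, and are near-certain by a few hundred.
for n in (10, 50, 214, 500):
    print(n, round(collision_probability(n), 3))
```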
[jira] [Assigned] (SPARK-25827) Replicating a block > 2gb with encryption fails
[ https://issues.apache.org/jira/browse/SPARK-25827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25827: Assignee: Apache Spark > Replicating a block > 2gb with encryption fails > --- > > Key: SPARK-25827 > URL: https://issues.apache.org/jira/browse/SPARK-25827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Major > > When replicating large blocks with encryption, we try to allocate an array of > size {{Int.MaxValue}} which is just a bit too big for the JVM. This is > basically the same as SPARK-25704, just another case. > In DiskStore: > {code} > val chunkSize = math.min(remaining, Int.MaxValue) > {code} > {noformat} > 18/10/22 17:04:06 WARN storage.BlockManager: Failed to replicate rdd_1_1 to > ..., failure #0 > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:133) > at > org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1421) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1230) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > ... > Caused by: java.lang.RuntimeException: java.io.IOException: Destination > failed while reading stream > ... 
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit > at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57) > at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) > at > org.apache.spark.storage.BlockManager$$anon$1$$anonfun$7.apply(BlockManager.scala:446) > at > org.apache.spark.storage.BlockManager$$anon$1$$anonfun$7.apply(BlockManager.scala:446) > at > org.apache.spark.storage.EncryptedBlockData.toChunkedByteBuffer(DiskStore.scala:221) > at > org.apache.spark.storage.BlockManager$$anon$1.onComplete(BlockManager.scala:449) > ... > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
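[Editor's sketch] The failing line `math.min(remaining, Int.MaxValue)` produces a chunk exactly `Int.MaxValue` long, but JVM arrays max out slightly below that, hence the "Requested array size exceeds VM limit" error. The natural fix (as in the related SPARK-25704) is to cap each chunk with some headroom below the limit. The arithmetic can be sketched as follows; the 512-byte headroom is an illustrative assumption, and the real fix lives in Scala inside DiskStore:

```python
INT_MAX = 2 ** 31 - 1          # Scala's Int.MaxValue
SAFE_CHUNK = INT_MAX - 512     # headroom below the JVM array-size limit (assumed margin)

def chunk_sizes(total):
    """Split `total` bytes into chunks that each fit in a JVM array."""
    sizes = []
    remaining = total
    while remaining > 0:
        size = min(remaining, SAFE_CHUNK)   # cap below Int.MaxValue, not at it
        sizes.append(size)
        remaining -= size
    return sizes

# A 5 GB encrypted block splits into 3 allocatable chunks instead of
# one impossible Int.MaxValue-sized buffer.
print(len(chunk_sizes(5 * 2 ** 30)))  # → 3
```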
[jira] [Assigned] (SPARK-25827) Replicating a block > 2gb with encryption fails
[ https://issues.apache.org/jira/browse/SPARK-25827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25827: Assignee: (was: Apache Spark) > Replicating a block > 2gb with encryption fails > --- > > Key: SPARK-25827 > URL: https://issues.apache.org/jira/browse/SPARK-25827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > > When replicating large blocks with encryption, we try to allocate an array of > size {{Int.MaxValue}} which is just a bit too big for the JVM. This is > basically the same as SPARK-25704, just another case. > In DiskStore: > {code} > val chunkSize = math.min(remaining, Int.MaxValue) > {code} > {noformat} > 18/10/22 17:04:06 WARN storage.BlockManager: Failed to replicate rdd_1_1 to > ..., failure #0 > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:133) > at > org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1421) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1230) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > ... > Caused by: java.lang.RuntimeException: java.io.IOException: Destination > failed while reading stream > ... 
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit > at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57) > at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) > at > org.apache.spark.storage.BlockManager$$anon$1$$anonfun$7.apply(BlockManager.scala:446) > at > org.apache.spark.storage.BlockManager$$anon$1$$anonfun$7.apply(BlockManager.scala:446) > at > org.apache.spark.storage.EncryptedBlockData.toChunkedByteBuffer(DiskStore.scala:221) > at > org.apache.spark.storage.BlockManager$$anon$1.onComplete(BlockManager.scala:449) > ... > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25827) Replicating a block > 2gb with encryption fails
[ https://issues.apache.org/jira/browse/SPARK-25827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662825#comment-16662825 ] Apache Spark commented on SPARK-25827: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/22818 > Replicating a block > 2gb with encryption fails > --- > > Key: SPARK-25827 > URL: https://issues.apache.org/jira/browse/SPARK-25827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > > When replicating large blocks with encryption, we try to allocate an array of > size {{Int.MaxValue}} which is just a bit too big for the JVM. This is > basically the same as SPARK-25704, just another case. > In DiskStore: > {code} > val chunkSize = math.min(remaining, Int.MaxValue) > {code} > {noformat} > 18/10/22 17:04:06 WARN storage.BlockManager: Failed to replicate rdd_1_1 to > ..., failure #0 > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:133) > at > org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1421) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1230) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > ... > Caused by: java.lang.RuntimeException: java.io.IOException: Destination > failed while reading stream > ... 
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit > at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57) > at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) > at > org.apache.spark.storage.BlockManager$$anon$1$$anonfun$7.apply(BlockManager.scala:446) > at > org.apache.spark.storage.BlockManager$$anon$1$$anonfun$7.apply(BlockManager.scala:446) > at > org.apache.spark.storage.EncryptedBlockData.toChunkedByteBuffer(DiskStore.scala:221) > at > org.apache.spark.storage.BlockManager$$anon$1.onComplete(BlockManager.scala:449) > ... > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25807) Mitigate 1-based substr() confusion
[ https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25807: -- Target Version/s: (was: 2.4.0, 2.4.1, 2.5.0, 3.0.0) > Mitigate 1-based substr() confusion > --- > > Key: SPARK-25807 > URL: https://issues.apache.org/jira/browse/SPARK-25807 > Project: Spark > Issue Type: Improvement > Components: Java API, PySpark >Affects Versions: 1.3.0, 2.3.2, 2.4.0, 2.5.0, 3.0.0 >Reporter: Oron Navon >Priority: Minor > > The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's > {{SUBSTRING}}, and contradicting both Python's {{substr}} and Java's > {{substr}}, which are zero-based. Both PySpark users and Java API users > often naturally expect a 0-based {{substr()}}. Adding to the confusion, > {{substr()}} currently allows a {{startPos}} value of 0, which returns the > same result as {{startPos==1}}. > Since changing {{substr()}} to 0-based is probably NOT a reasonable option > here, I suggest making one or more of the following changes: > # Adding a method {{substr0}}, which would be zero-based > # Renaming {{substr}} to {{substr1}} > # Making the existing {{substr()}} throw an exception on {{startPos==0}}, > which should catch and alert most users who expect zero-based behavior. > This is my first discussion on this project, apologies for any faux pas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion
[ https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662817#comment-16662817 ] Sean Owen commented on SPARK-25807: --- They are meant to match Hive, SQL. They should not match Java, Python. No, this should not be changed. > Mitigate 1-based substr() confusion > --- > > Key: SPARK-25807 > URL: https://issues.apache.org/jira/browse/SPARK-25807 > Project: Spark > Issue Type: Improvement > Components: Java API, PySpark >Affects Versions: 1.3.0, 2.3.2, 2.4.0, 2.5.0, 3.0.0 >Reporter: Oron Navon >Priority: Minor > > The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's > {{SUBSTRING}}, and contradicting both Python's {{substr}} and Java's > {{substr}}, which are zero-based. Both PySpark users and Java API users > often naturally expect a 0-based {{substr()}}. Adding to the confusion, > {{substr()}} currently allows a {{startPos}} value of 0, which returns the > same result as {{startPos==1}}. > Since changing {{substr()}} to 0-based is probably NOT a reasonable option > here, I suggest making one or more of the following changes: > # Adding a method {{substr0}}, which would be zero-based > # Renaming {{substr}} to {{substr1}} > # Making the existing {{substr()}} throw an exception on {{startPos==0}}, > which should catch and alert most users who expect zero-based behavior. > This is my first discussion on this project, apologies for any faux pas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
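[Editor's sketch] The confusion the ticket describes is easy to state concretely: SQL-style `substr` counts from 1, Python slicing counts from 0, and (per the report) a `startPos` of 0 silently behaves like 1. A small model of those semantics in plain Python (`substr1` is a hypothetical helper, not Spark's `Column.substr`, and it ignores Spark's negative-start behavior):

```python
def substr1(s: str, start_pos: int, length: int) -> str:
    """Emulate the 1-based substr described in the ticket.
    As the report notes, start_pos == 0 behaves the same as 1."""
    zero_based = max(start_pos - 1, 0)
    return s[zero_based:zero_based + length]

print(substr1("Spark", 1, 3))  # → "Spa"  -- 1-based, like SQL SUBSTRING
print(substr1("Spark", 0, 3))  # → "Spa"  -- 0 silently behaves like 1
print("Spark"[0:3])            # → "Spa"  -- what a Python user writes instead
print("Spark"[1:4])            # → "par"  -- what substr1(s, 1, 3) does NOT mean
```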
[jira] [Created] (SPARK-25827) Replicating a block > 2gb with encryption fails
Imran Rashid created SPARK-25827: Summary: Replicating a block > 2gb with encryption fails Key: SPARK-25827 URL: https://issues.apache.org/jira/browse/SPARK-25827 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: Imran Rashid When replicating large blocks with encryption, we try to allocate an array of size {{Int.MaxValue}} which is just a bit too big for the JVM. This is basically the same as SPARK-25704, just another case. In DiskStore: {code} val chunkSize = math.min(remaining, Int.MaxValue) {code} {noformat} 18/10/22 17:04:06 WARN storage.BlockManager: Failed to replicate rdd_1_1 to ..., failure #0 org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) at org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:133) at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1421) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1230) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) ... Caused by: java.lang.RuntimeException: java.io.IOException: Destination failed while reading stream ... Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) at org.apache.spark.storage.BlockManager$$anon$1$$anonfun$7.apply(BlockManager.scala:446) at org.apache.spark.storage.BlockManager$$anon$1$$anonfun$7.apply(BlockManager.scala:446) at org.apache.spark.storage.EncryptedBlockData.toChunkedByteBuffer(DiskStore.scala:221) at org.apache.spark.storage.BlockManager$$anon$1.onComplete(BlockManager.scala:449) ... 
{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662760#comment-16662760 ] Dongjoon Hyun commented on SPARK-25823: --- For me, this is a `data correctness` issue at `map_filter` operation. We had better fix this. So, I'm investigating this. Apache Spark PMC may choose another path for 2.4.0 release. I also respect that. > map_filter can generate incorrect data > -- > > Key: SPARK-25823 > URL: https://issues.apache.org/jira/browse/SPARK-25823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Labels: correctness > > This is not a regression because this occurs in new high-order functions like > `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows > the duplication. If we want to allow this difference in new high-order > functions, we had better add some warning about this different on these > functions after RC4 voting pass at least. Otherwise, this will surprise > Presto-based users. > *Spark 2.4* > {code:java} > spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM > (SELECT map_concat(map(1,2), map(1,3)) m); > spark-sql> SELECT * FROM t; > {1:3} {1:2} > {code} > *Presto 0.212* > {code:java} > presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT > map_concat(map(array[1],array[2]), map(array[1],array[3])) a); >a | _col1 > ---+--- > {1=3} | {} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25815) Kerberos Support in Kubernetes resource manager (Client Mode)
[ https://issues.apache.org/jira/browse/SPARK-25815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662761#comment-16662761 ] Marcelo Vanzin commented on SPARK-25815: I actually have this working (and also I'm adding proper principal/keytab support inline with YARN/Mesos), but it's built on top of SPARK-23781 which is not merged yet. > Kerberos Support in Kubernetes resource manager (Client Mode) > - > > Key: SPARK-25815 > URL: https://issues.apache.org/jira/browse/SPARK-25815 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Ilan Filonenko >Priority: Major > > Include Kerberos support for Spark on K8S jobs running in client-mode -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23015) spark-submit fails when submitting several jobs in parallel
[ https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662750#comment-16662750 ] Kevin Grealish edited comment on SPARK-23015 at 10/24/18 7:47 PM: -- One workaround is to create a temp directory in temp and set that to be the TEMP directory for that process being launched. This way each process you launch gets its own temp space. For example, when launching from C#: {{ // Workaround for Spark bug https://issues.apache.org/jira/browse/SPARK-23015 // Spark Submit's launching library prints the command to execute the launcher (org.apache.spark.launcher.main) // to a temporary text file ("%TEMP%\spark-class-launcher-output-%RANDOM%.txt"), reads the result back into a variable, // and then executes that command. %RANDOM% does not have sufficient range to avoid collisions when launching many Spark processes. // As a result the Spark processes end up running one another's commands (silently) or give an error like: // "The process cannot access the file because it is being used by another process." // "The system cannot find the file C:\VsoAgent\_work\_temp\spark-class-launcher-output-654.txt." // As a workaround, we give each run its own TEMP directory, which we create using a GUID. string newTemp = null; if (AppRuntimeEnvironment.IsRunningOnWindows()) { var ourTemp = Environment.GetEnvironmentVariable("TEMP"); var newDirName = "dprep" + Convert.ToBase64String(Guid.NewGuid().ToByteArray()).Substring(0, 22).Replace('/', '-'); newTemp = Path.Combine(ourTemp, newDirName); Directory.CreateDirectory(newTemp); start.Environment["TEMP"] = newTemp; } }} was (Author: kevingre): One workaround is to create a temp directory in temp and set that to be the TEMP directory for that process being launched. This way each process you launch gets its own temp space. 
For example, when launching from C#: ``` // Workaround for Spark bug https://issues.apache.org/jira/browse/SPARK-23015 // Spark Submit's launching library prints the command to execute the launcher (org.apache.spark.launcher.main) // to a temporary text file ("%TEMP%\spark-class-launcher-output-%RANDOM%.txt"), reads the result back into a variable, // and then executes that command. %RANDOM% does not have sufficient range to avoid collisions when launching many Spark processes. // As a result the Spark processes end up running one another's commands (silently) or give an error like: // "The process cannot access the file because it is being used by another process." // "The system cannot find the file C:\VsoAgent\_work\_temp\spark-class-launcher-output-654.txt." // As a workaround, we give each run its own TEMP directory, which we create using a GUID. string newTemp = null; if (AppRuntimeEnvironment.IsRunningOnWindows()) { var ourTemp = Environment.GetEnvironmentVariable("TEMP"); var newDirName = "dprep" + Convert.ToBase64String(Guid.NewGuid().ToByteArray()).Substring(0, 22).Replace('/', '-'); newTemp = Path.Combine(ourTemp, newDirName); Directory.CreateDirectory(newTemp); start.Environment["TEMP"] = newTemp; } ``` > spark-submit fails when submitting several jobs in parallel > --- > > Key: SPARK-23015 > URL: https://issues.apache.org/jira/browse/SPARK-23015 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1 > Environment: Windows 10 (1709/16299.125) > Spark 2.3.0 > Java 8, Update 151 >Reporter: Hugh Zabriskie >Priority: Major > > Spark Submit's launching library prints the command to execute the launcher > (org.apache.spark.launcher.main) to a temporary text file, reads the result > back into a variable, and then executes that command. 
> {code} > set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt > "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main > %* > %LAUNCHER_OUTPUT% > {code} > [bin/spark-class2.cmd, > L67|https://github.com/apache/spark/blob/master/bin/spark-class2.cmd#L66] > That temporary text file is given a pseudo-random name by the %RANDOM% env > variable generator, which generates a number between 0 and 32767. > This appears to be the cause of an error occurring when several spark-submit > jobs are launched simultaneously. The following error is returned from stderr:
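[Editor's note] The per-process TEMP workaround quoted in the C# comment above can be sketched on the JVM as well. This is a hedged illustration, not code from Spark: the `spark-submit --version` command line is only a placeholder, and the directory-name scheme is the editor's, not the original author's.

```scala
import java.nio.file.{Files, Paths}
import java.util.UUID

// Give each launched process its own TEMP directory so concurrent runs'
// %TEMP%\spark-class-launcher-output-%RANDOM%.txt files cannot collide.
val base = Paths.get(System.getProperty("java.io.tmpdir"))
val perRunTemp = Files.createDirectories(base.resolve("spark-launch-" + UUID.randomUUID()))

val pb = new ProcessBuilder("spark-submit", "--version") // placeholder child command
pb.environment().put("TEMP", perRunTemp.toString)        // the Windows launcher script reads %TEMP%
```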
[jira] [Updated] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods
[ https://issues.apache.org/jira/browse/SPARK-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25737: -- Description: In ancient times in 2013, JavaSparkContext got a superclass JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772] I believe this was really resolved by the {{@varags}} annotation in Scala 2.9. I believe we can now remove this workaround. {{union(JavaRDD, List)}} should just be {{union(JavaRDD*)}} and likewise for JavaPairRDD, JavaDoubleRDD. was: In ancient times in 2013, JavaSparkContext got a superclass JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772] I believe this was really resolved by the {{@varags}} annotation in Scala 2.9. I believe we can now remove this workaround. Along the way, I think we can also avoid the duplicated definitions of {{union()}}. Where we should be able to just have one varargs method, we have up to 3 forms: - {{union(RDD, Seq/List)}} - {{union(RDD*)}} - {{union(RDD, RDD*)}} While this pattern is sometimes used to avoid type collision due to erasure, I don't think it applies here. After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods (for the 3 Java RDD types), not 11 methods. 
The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or {{sc.union(Seq(rdd1, rdd2): _*)}} > Remove JavaSparkContextVarargsWorkaround and standardize union() methods > > > Key: SPARK-25737 > URL: https://issues.apache.org/jira/browse/SPARK-25737 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > In ancient times in 2013, JavaSparkContext got a superclass > JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: > [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772] > I believe this was really resolved by the {{@varags}} annotation in Scala > 2.9. > I believe we can now remove this workaround. {{union(JavaRDD, > List)}} should just be {{union(JavaRDD*)}} and likewise for > JavaPairRDD, JavaDoubleRDD. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
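[Editor's note] The call-site change described above can be illustrated with plain Scala varargs, no Spark required; `union` here is a stand-in for `sc.union`, not Spark's actual signature.

```scala
// A single varargs method replaces the separate Seq/List overloads.
def union[T](rdds: Seq[T]*): Seq[T] = rdds.flatten.toList // stand-in for sc.union

val rdd1 = Seq(1, 2)
val rdd2 = Seq(3)

val direct   = union(rdd1, rdd2)          // new style: sc.union(rdd1, rdd2)
val splatted = union(Seq(rdd1, rdd2): _*) // old Seq callers add the `: _*` splat
```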
[jira] [Comment Edited] (SPARK-23015) spark-submit fails when submitting several jobs in parallel
[ https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662750#comment-16662750 ] Kevin Grealish edited comment on SPARK-23015 at 10/24/18 7:45 PM: -- One workaround is to create a temp directory in temp and set that to be the TEMP directory for that process being launched. This way each process you launch gets its own temp space. For example, when launching from C#: ``` // Workaround for Spark bug https://issues.apache.org/jira/browse/SPARK-23015 // Spark Submit's launching library prints the command to execute the launcher (org.apache.spark.launcher.main) // to a temporary text file ("%TEMP%\spark-class-launcher-output-%RANDOM%.txt"), reads the result back into a variable, // and then executes that command. %RANDOM% does not have sufficient range to avoid collisions when launching many Spark processes. // As a result the Spark processes end up running one another's commands (silently) or give an error like: // "The process cannot access the file because it is being used by another process." // "The system cannot find the file C:\VsoAgent\_work\_temp\spark-class-launcher-output-654.txt." // As a workaround, we give each run its own TEMP directory, which we create using a GUID. string newTemp = null; if (AppRuntimeEnvironment.IsRunningOnWindows()) { var ourTemp = Environment.GetEnvironmentVariable("TEMP"); var newDirName = "dprep" + Convert.ToBase64String(Guid.NewGuid().ToByteArray()).Substring(0, 22).Replace('/', '-'); newTemp = Path.Combine(ourTemp, newDirName); Directory.CreateDirectory(newTemp); start.Environment["TEMP"] = newTemp; } ``` was (Author: kevingre): One workaround is to create a temp directory in temp and set that to be the TEMP directory for that process being launched. This way each process you launch gets its own temp space. 
For example, when launching from C#: {{ // Workaround for Spark bug https://issues.apache.org/jira/browse/SPARK-23015 }} {{ // Spark Submit's launching library prints the command to execute the launcher (org.apache.spark.launcher.main) // to a temporary text file ("%TEMP%\spark-class-launcher-output-%RANDOM%.txt"), reads the result back into a variable, // and then executes that command. %RANDOM% does not have sufficient range to avoid collisions when launching many Spark processes. // As a result the Spark processes end up running one anothers' commands (silently) or gives an error like: // "The process cannot access the file because it is being used by another process." // "The system cannot find the file C:\VsoAgent\_work\_temp\spark-class-launcher-output-654.txt." // As a workaround, we give each run its own TEMP directory, we create using a GUID. string newTemp = null; if (AppRuntimeEnvironment.IsRunningOnWindows()) { var ourTemp = Environment.GetEnvironmentVariable("TEMP"); var newDirName = "t" + Convert.ToBase64String(Guid.NewGuid().ToByteArray()).Substring(0, 22).Replace('/', '-'); newTemp = Path.Combine(ourTemp, newDirName); Directory.CreateDirectory(newTemp); start.Environment["TEMP"] = newTemp; } }} > spark-submit fails when submitting several jobs in parallel > --- > > Key: SPARK-23015 > URL: https://issues.apache.org/jira/browse/SPARK-23015 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1 > Environment: Windows 10 (1709/16299.125) > Spark 2.3.0 > Java 8, Update 151 >Reporter: Hugh Zabriskie >Priority: Major > > Spark Submit's launching library prints the command to execute the launcher > (org.apache.spark.launcher.main) to a temporary text file, reads the result > back into a variable, and then executes that command. 
> {code} > set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt > "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main > %* > %LAUNCHER_OUTPUT% > {code} > [bin/spark-class2.cmd, > L67|https://github.com/apache/spark/blob/master/bin/spark-class2.cmd#L66] > That temporary text file is given a pseudo-random name by the %RANDOM% env > variable generator, which generates a number between 0 and 32767. > This appears to be the cause of an error occurring when several spark-submit > jobs are launched simultaneously. The following error is returned from
[jira] [Updated] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods
[ https://issues.apache.org/jira/browse/SPARK-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25737: -- Labels: release-notes (was: ) > Remove JavaSparkContextVarargsWorkaround and standardize union() methods > > > Key: SPARK-25737 > URL: https://issues.apache.org/jira/browse/SPARK-25737 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > In ancient times in 2013, JavaSparkContext got a superclass > JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: > [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772] > I believe this was really resolved by the {{@varags}} annotation in Scala > 2.9. > I believe we can now remove this workaround. Along the way, I think we can > also avoid the duplicated definitions of {{union()}}. Where we should be able > to just have one varargs method, we have up to 3 forms: > - {{union(RDD, Seq/List)}} > - {{union(RDD*)}} > - {{union(RDD, RDD*)}} > While this pattern is sometimes used to avoid type collision due to erasure, > I don't think it applies here. > After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods > (for the 3 Java RDD types), not 11 methods. > The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, > rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or > {{sc.union(Seq(rdd1, rdd2): _*)}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23015) spark-submit fails when submitting several jobs in parallel
[ https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662750#comment-16662750 ] Kevin Grealish commented on SPARK-23015: One workaround is to create a temp directory in temp and set that to be the TEMP directory for that process being launched. This way each process you launch gets its own temp space. For example, when launching from C#: {{ // Workaround for Spark bug https://issues.apache.org/jira/browse/SPARK-23015 }} {{ // Spark Submit's launching library prints the command to execute the launcher (org.apache.spark.launcher.main) // to a temporary text file ("%TEMP%\spark-class-launcher-output-%RANDOM%.txt"), reads the result back into a variable, // and then executes that command. %RANDOM% does not have sufficient range to avoid collisions when launching many Spark processes. // As a result the Spark processes end up running one another's commands (silently) or give an error like: // "The process cannot access the file because it is being used by another process." // "The system cannot find the file C:\VsoAgent\_work\_temp\spark-class-launcher-output-654.txt." // As a workaround, we give each run its own TEMP directory, which we create using a GUID. 
string newTemp = null; if (AppRuntimeEnvironment.IsRunningOnWindows()) { var ourTemp = Environment.GetEnvironmentVariable("TEMP"); var newDirName = "t" + Convert.ToBase64String(Guid.NewGuid().ToByteArray()).Substring(0, 22).Replace('/', '-'); newTemp = Path.Combine(ourTemp, newDirName); Directory.CreateDirectory(newTemp); start.Environment["TEMP"] = newTemp; } }} > spark-submit fails when submitting several jobs in parallel > --- > > Key: SPARK-23015 > URL: https://issues.apache.org/jira/browse/SPARK-23015 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1 > Environment: Windows 10 (1709/16299.125) > Spark 2.3.0 > Java 8, Update 151 >Reporter: Hugh Zabriskie >Priority: Major > > Spark Submit's launching library prints the command to execute the launcher > (org.apache.spark.launcher.main) to a temporary text file, reads the result > back into a variable, and then executes that command. > {code} > set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt > "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main > %* > %LAUNCHER_OUTPUT% > {code} > [bin/spark-class2.cmd, > L67|https://github.com/apache/spark/blob/master/bin/spark-class2.cmd#L66] > That temporary text file is given a pseudo-random name by the %RANDOM% env > variable generator, which generates a number between 0 and 32767. > This appears to be the cause of an error occurring when several spark-submit > jobs are launched simultaneously. The following error is returned from stderr: > {quote}The process cannot access the file because it is being used by another > process. The system cannot find the file > USER/AppData/Local/Temp/spark-class-launcher-output-RANDOM.txt. 
> The process cannot access the file because it is being used by another > process.{quote} > My hypothesis is that %RANDOM% is returning the same value for multiple jobs, > causing the launcher library to attempt to write to the same file from > multiple processes. Another mechanism is needed for reliably generating the > names of the temporary files so that the concurrency issue is resolved. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
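[Editor's note] One candidate for the "another mechanism" the reporter asks for is the JDK's own temp-file API, which draws the file-name suffix from a random `long` rather than a 15-bit `%RANDOM%` value, so concurrent launches get distinct launcher-output files with overwhelming probability. This is an illustrative sketch only, not what `spark-class2.cmd` actually does.

```scala
import java.nio.file.{Files, Path}

// Each call yields a name like spark-class-launcher-output-<random long>.txt
// in the system temp dir; the name space is ~2^63 values, not 0..32767.
val out1: Path = Files.createTempFile("spark-class-launcher-output-", ".txt")
val out2: Path = Files.createTempFile("spark-class-launcher-output-", ".txt")
```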
[jira] [Resolved] (SPARK-25737) Remove JavaSparkContextVarargsWorkaround and standardize union() methods
[ https://issues.apache.org/jira/browse/SPARK-25737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25737. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22729 [https://github.com/apache/spark/pull/22729] > Remove JavaSparkContextVarargsWorkaround and standardize union() methods > > > Key: SPARK-25737 > URL: https://issues.apache.org/jira/browse/SPARK-25737 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > In ancient times in 2013, JavaSparkContext got a superclass > JavaSparkContextVarargsWorkaround to deal with some Scala 2.7 issue: > [http://www.scala-archive.org/Workaround-for-implementing-java-varargs-in-2-7-2-final-td1944767.html#a1944772] > I believe this was really resolved by the {{@varags}} annotation in Scala > 2.9. > I believe we can now remove this workaround. Along the way, I think we can > also avoid the duplicated definitions of {{union()}}. Where we should be able > to just have one varargs method, we have up to 3 forms: > - {{union(RDD, Seq/List)}} > - {{union(RDD*)}} > - {{union(RDD, RDD*)}} > While this pattern is sometimes used to avoid type collision due to erasure, > I don't think it applies here. > After cleaning it, we'll have 1 SparkContext and 3 JavaSparkContext methods > (for the 3 Java RDD types), not 11 methods. > The only difference for callers in Spark 3 would be that {{sc.union(Seq(rdd1, > rdd2))}} now has to be {{sc.union(rdd1, rdd2)}} (simpler) or > {{sc.union(Seq(rdd1, rdd2): _*)}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662743#comment-16662743 ] Dongjoon Hyun commented on SPARK-25823: --- Sorry, but who says this issue is cosmetic? I think you are confused between both issues due to clicking email links. :) bq. Got it, you're saying that's cosmetic. > map_filter can generate incorrect data > -- > > Key: SPARK-25823 > URL: https://issues.apache.org/jira/browse/SPARK-25823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Labels: correctness > > This is not a regression because this occurs in new high-order functions like > `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows > the duplication. If we want to allow this difference in new high-order > functions, we had better add some warning about this different on these > functions after RC4 voting pass at least. Otherwise, this will surprise > Presto-based users. > *Spark 2.4* > {code:java} > spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM > (SELECT map_concat(map(1,2), map(1,3)) m); > spark-sql> SELECT * FROM t; > {1:3} {1:2} > {code} > *Presto 0.212* > {code:java} > presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT > map_concat(map(array[1],array[2]), map(array[1],array[3])) a); >a | _col1 > ---+--- > {1=3} | {} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662737#comment-16662737 ] Sean Owen commented on SPARK-25823: --- Got it, you're saying that's cosmetic. But I understood from here that there's a real underlying issue in map(). That's what this is about, then? and is it a blocker? It again seems like something that isn't a hard blocker, but whether to fix it for 2.4 depends also on how long and how reliable a fix would be. > map_filter can generate incorrect data > -- > > Key: SPARK-25823 > URL: https://issues.apache.org/jira/browse/SPARK-25823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Labels: correctness > > This is not a regression because this occurs in new high-order functions like > `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows > the duplication. If we want to allow this difference in new high-order > functions, we had better add some warning about this different on these > functions after RC4 voting pass at least. Otherwise, this will surprise > Presto-based users. > *Spark 2.4* > {code:java} > spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM > (SELECT map_concat(map(1,2), map(1,3)) m); > spark-sql> SELECT * FROM t; > {1:3} {1:2} > {code} > *Presto 0.212* > {code:java} > presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT > map_concat(map(array[1],array[2]), map(array[1],array[3])) a); >a | _col1 > ---+--- > {1=3} | {} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662735#comment-16662735 ] Dongjoon Hyun commented on SPARK-25823: --- [~srowen] and [~cloud_fan]. To be clear, - This issue doesn't aim to fix CreateMap or existing `map_keys/map_values`. (Please see the description) - This issue only aims to fix the wrongly materialized cases like CTAS. Is it correct that `SELECT map_filter(m)` returns unseen values at `SELECT m`? Do you think so? {code} CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c {code} > map_filter can generate incorrect data > -- > > Key: SPARK-25823 > URL: https://issues.apache.org/jira/browse/SPARK-25823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Labels: correctness > > This is not a regression because this occurs in new high-order functions like > `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows > the duplication. If we want to allow this difference in new high-order > functions, we had better add some warning about this different on these > functions after RC4 voting pass at least. Otherwise, this will surprise > Presto-based users. > *Spark 2.4* > {code:java} > spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM > (SELECT map_concat(map(1,2), map(1,3)) m); > spark-sql> SELECT * FROM t; > {1:3} {1:2} > {code} > *Presto 0.212* > {code:java} > presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT > map_concat(map(array[1],array[2]), map(array[1],array[3])) a); >a | _col1 > ---+--- > {1=3} | {} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662726#comment-16662726 ] Dongjoon Hyun edited comment on SPARK-25823 at 10/24/18 7:31 PM: - [~srowen]. I commented on SPARK-25824. Please see there. was (Author: dongjoon): [~srowen]. I commend on SPARK-25824. Please see there. > map_filter can generate incorrect data > -- > > Key: SPARK-25823 > URL: https://issues.apache.org/jira/browse/SPARK-25823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Labels: correctness > > This is not a regression because this occurs in new high-order functions like > `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows > the duplication. If we want to allow this difference in new high-order > functions, we had better add some warning about this different on these > functions after RC4 voting pass at least. Otherwise, this will surprise > Presto-based users. > *Spark 2.4* > {code:java} > spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM > (SELECT map_concat(map(1,2), map(1,3)) m); > spark-sql> SELECT * FROM t; > {1:3} {1:2} > {code} > *Presto 0.212* > {code:java} > presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT > map_concat(map(array[1],array[2]), map(array[1],array[3])) a); >a | _col1 > ---+--- > {1=3} | {} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662726#comment-16662726 ] Dongjoon Hyun commented on SPARK-25823: --- [~srowen]. I commend on SPARK-25824. Please see there. > map_filter can generate incorrect data > -- > > Key: SPARK-25823 > URL: https://issues.apache.org/jira/browse/SPARK-25823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Labels: correctness > > This is not a regression because this occurs in new high-order functions like > `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows > the duplication. If we want to allow this difference in new high-order > functions, we had better add some warning about this different on these > functions after RC4 voting pass at least. Otherwise, this will surprise > Presto-based users. > *Spark 2.4* > {code:java} > spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM > (SELECT map_concat(map(1,2), map(1,3)) m); > spark-sql> SELECT * FROM t; > {1:3} {1:2} > {code} > *Presto 0.212* > {code:java} > presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT > map_concat(map(array[1],array[2]), map(array[1],array[3])) a); >a | _col1 > ---+--- > {1=3} | {} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25824) Remove duplicated map entries in `showString`
[ https://issues.apache.org/jira/browse/SPARK-25824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662725#comment-16662725 ] Dongjoon Hyun commented on SPARK-25824: --- [~srowen]. Please see the description. The string notation is just a collection of the stored data. `[1 -> 2, 1 -> 3]` If you materialize that string to `map` again, the result will be `1->3` eventually. In that sense, I didn't categorize this as a bug. {code} scala> Map(1->2,1->3) res5: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3) {code} > Remove duplicated map entries in `showString` > - > > Key: SPARK-25824 > URL: https://issues.apache.org/jira/browse/SPARK-25824 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > `showString` doesn't eliminate the duplication. So, it looks different from > the result of `collect` and select from saved rows. > *Spark 2.2.2* > {code} > spark-sql> select map(1,2,1,3); > {1:3} > scala> sql("SELECT map(1,2,1,3)").collect > res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)]) > scala> sql("SELECT map(1,2,1,3)").show > +---+ > |map(1, 2, 1, 3)| > +---+ > |Map(1 -> 3)| > +---+ > {code} > *Spark 2.3.0 ~ 2.4.0-rc4* > {code} > spark-sql> select map(1,2,1,3); > {1:3} > scala> sql("SELECT map(1,2,1,3)").collect > res1: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)]) > scala> sql("CREATE TABLE m AS SELECT map(1,2,1,3) a") > scala> sql("SELECT * FROM m").show > ++ > | a| > ++ > |[1 -> 3]| > ++ > scala> sql("SELECT map(1,2,1,3)").show > ++ > | map(1, 2, 1, 3)| > ++ > |[1 -> 2, 1 -> 3]| > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662717#comment-16662717 ] Sean Owen commented on SPARK-25823: --- Hm, another tough one. It's not so much about what these new functions do but what map() already does in 2.3.0. Yeah, the current behavior isn't even internally consistent. It's a bug, and that's what SPARK-25824 tracks now (right? bug, not improvement?) Is this a duplicate? or here to say we should note the map() behavior as a known issue? I'd again say this isn't a blocker, even though it's a regression from a significantly older release. I'm still OK drawing that distinction as it seems to be the constructive place to draw a bright line. > map_filter can generate incorrect data > -- > > Key: SPARK-25823 > URL: https://issues.apache.org/jira/browse/SPARK-25823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Labels: correctness > > This is not a regression because this occurs in new high-order functions like > `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows > the duplication. If we want to allow this difference in new high-order > functions, we had better add some warning about this different on these > functions after RC4 voting pass at least. Otherwise, this will surprise > Presto-based users. > *Spark 2.4* > {code:java} > spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM > (SELECT map_concat(map(1,2), map(1,3)) m); > spark-sql> SELECT * FROM t; > {1:3} {1:2} > {code} > *Presto 0.212* > {code:java} > presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT > map_concat(map(array[1],array[2]), map(array[1],array[3])) a); >a | _col1 > ---+--- > {1=3} | {} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25826) Kerberos Support in Kubernetes resource manager
[ https://issues.apache.org/jira/browse/SPARK-25826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilan Filonenko updated SPARK-25826: --- Description: This is the umbrella issue for all Kerberos related tasks with relation to Spark on Kubernetes (was: This is the umbrella issue for all Kerberos related tasks) Summary: Kerberos Support in Kubernetes resource manager (was: Kerberos Support in Kubernetes) > Kerberos Support in Kubernetes resource manager > --- > > Key: SPARK-25826 > URL: https://issues.apache.org/jira/browse/SPARK-25826 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Ilan Filonenko >Priority: Major > > This is the umbrella issue for all Kerberos related tasks with relation to > Spark on Kubernetes -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25826) Kerberos Support in Kubernetes
Ilan Filonenko created SPARK-25826: -- Summary: Kerberos Support in Kubernetes Key: SPARK-25826 URL: https://issues.apache.org/jira/browse/SPARK-25826 Project: Spark Issue Type: Umbrella Components: Kubernetes Affects Versions: 3.0.0 Reporter: Ilan Filonenko This is the umbrella issue for all Kerberos related tasks
[jira] [Created] (SPARK-25825) Kerberos Support for Long Running Jobs in Kubernetes
Ilan Filonenko created SPARK-25825: -- Summary: Kerberos Support for Long Running Jobs in Kubernetes Key: SPARK-25825 URL: https://issues.apache.org/jira/browse/SPARK-25825 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 3.0.0 Reporter: Ilan Filonenko When provided with a --keytab and --principal combination, there is an expectation that Kubernetes would leverage the Driver to spin up a renewal thread to handle token renewal. However, when --keytab and --principal are not provided and instead a secretName and secretItemKey are provided, there should be an option to specify via config that an external renewal service exists. The driver should therefore be responsible for discovering changes to the secret and sending the updated token data to the executors with the UpdateDelegationTokens message. This enables token renewal given just a secret, in addition to the traditional use case via --keytab and --principal.
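The renewal flow described above — the driver noticing a changed secret and pushing the new token to executors — can be sketched as a polling step. This is a hedged illustration only; `watch_secret` and `send_update` are hypothetical names, not Spark's actual Kubernetes token-renewal API:

```python
# Hedged sketch of a driver-side secret watcher: each polling step reads the
# mounted secret, and only when its contents change does it push the new token
# (e.g. an UpdateDelegationTokens-style message) to the executors.
import hashlib

def watch_secret(read_bytes, send_update, last_digest=None):
    """One polling step: returns the secret's digest, calling send_update
    only when the secret's contents changed since the last step."""
    data = read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    if digest != last_digest:
        send_update(data)  # hypothetical stand-in for the token update message
    return digest

# Simulate three polls where the externally-renewed secret changes once.
updates = []
digest = None
for token in [b"token-v1", b"token-v1", b"token-v2"]:
    digest = watch_secret(lambda t=token: t, updates.append, digest)

print(updates)  # only the two distinct token versions trigger an update
```

The point of the design is that the driver does not renew anything itself; it merely detects that the external renewal service rotated the secret and forwards the new bytes.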
[jira] [Assigned] (SPARK-25816) Functions does not resolve Columns correctly
[ https://issues.apache.org/jira/browse/SPARK-25816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25816: Assignee: Apache Spark > Functions does not resolve Columns correctly > > > Key: SPARK-25816 > URL: https://issues.apache.org/jira/browse/SPARK-25816 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Brian Zhang >Assignee: Apache Spark >Priority: Critical > Attachments: source.snappy.parquet > > > When there is a duplicate column name between the current DataFrame and the original > DataFrame it was selected from, Spark 2.3.0 and 2.3.1 do not resolve the column > correctly when it is used in an expression, causing a casting issue. The same code > works in Spark 2.2.1. > Please see the code below to reproduce the issue: > import org.apache.spark._ > import org.apache.spark.rdd._ > import org.apache.spark.storage.StorageLevel._ > import org.apache.spark.sql._ > import org.apache.spark.sql.DataFrame > import org.apache.spark.sql.types._ > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.catalyst.expressions._ > import org.apache.spark.sql.Column > val v0 = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet") > val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _):_*) > val v5 = v00.select($"13".as("0"),$"14".as("1"),$"15".as("2")) > val v5_2 = $"2" > v5.where(lit(500).<(v5_2(new Column(new MapKeys(v5_2.expr))(lit(0) > // v00's 3rd column is binary and its 16th is a map > Error: > org.apache.spark.sql.AnalysisException: cannot resolve 'map_keys(`2`)' due to > data type mismatch: argument 1 requires map type, however, '`2`' is of binary > type.; > > 'Project [0#1591, 1#1592, 2#1593] +- 'Filter (500 < > 2#1593[map_keys(2#1561)[0]]) +- > Project [13#1572 AS 0#1591, 14#1573 AS 1#1592, 15#1574 AS 2#1593, 2#1561] +- > Project [c_bytes#1527 AS 0#1559, c_union#1528 AS 1#1560, c_fixed#1529 AS > 2#1561, c_boolean#1530 AS 3#1562, c_float#1531 AS 4#1563, c_double#1532 AS > 5#1564, c_int#1533 AS 6#1565, c_long#1534L AS 7#1566L, c_string#1535 AS > 8#1567, c_decimal_18_2#1536 AS 9#1568, c_decimal_28_2#1537 AS 10#1569, > c_decimal_38_2#1538 AS 11#1570, c_date#1539 AS 12#1571, simple_struct#1540 AS > 13#1572, simple_array#1541 AS 14#1573, simple_map#1542 AS 15#1574] +- > Relation[c_bytes#1527,c_union#1528,c_fixed#1529,c_boolean#1530,c_float#1531,c_double#1532,c_int#1533,c_long#1534L,c_string#1535,c_decimal_18_2#1536,c_decimal_28_2#1537,c_decimal_38_2#1538,c_date#1539,simple_struct#1540,simple_array#1541,simple_map#1542] > parquet
[jira] [Commented] (SPARK-25816) Functions does not resolve Columns correctly
[ https://issues.apache.org/jira/browse/SPARK-25816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662688#comment-16662688 ] Apache Spark commented on SPARK-25816: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/22817
[jira] [Assigned] (SPARK-25816) Functions does not resolve Columns correctly
[ https://issues.apache.org/jira/browse/SPARK-25816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25816: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662673#comment-16662673 ] Dongjoon Hyun commented on SPARK-25823: --- Ah, got it. I was thinking of a different one. For mine, I created SPARK-25824 as a minor improvement in Spark 3.0.
[jira] [Commented] (SPARK-25824) Remove duplicated map entries in `showString`
[ https://issues.apache.org/jira/browse/SPARK-25824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662670#comment-16662670 ] Dongjoon Hyun commented on SPARK-25824: --- cc [~maropu]
[jira] [Commented] (SPARK-14989) Upgrade to Jackson 2.7.3
[ https://issues.apache.org/jira/browse/SPARK-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662668#comment-16662668 ] nirav patel commented on SPARK-14989: - Any updates on this? Spark 2.2 still uses Jackson 1.9.13! It doesn't play well with other artifacts (Play 2.6, for one). > Upgrade to Jackson 2.7.3 > > > Key: SPARK-14989 > URL: https://issues.apache.org/jira/browse/SPARK-14989 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Josh Rosen >Priority: Major > > For Spark 2.0, we should upgrade to a newer version of Jackson (2.7.3).
[jira] [Created] (SPARK-25824) Remove duplicated map entries in `showString`
Dongjoon Hyun created SPARK-25824: - Summary: Remove duplicated map entries in `showString` Key: SPARK-25824 URL: https://issues.apache.org/jira/browse/SPARK-25824 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Dongjoon Hyun `showString` doesn't eliminate duplicate map keys, so its output differs from the result of `collect` and from selecting saved rows. *Spark 2.2.2* {code} spark-sql> select map(1,2,1,3); {1:3} scala> sql("SELECT map(1,2,1,3)").collect res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)]) scala> sql("SELECT map(1,2,1,3)").show +---+ |map(1, 2, 1, 3)| +---+ |Map(1 -> 3)| +---+ {code} *Spark 2.3.0 ~ 2.4.0-rc4* {code} spark-sql> select map(1,2,1,3); {1:3} scala> sql("SELECT map(1,2,1,3)").collect res1: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)]) scala> sql("CREATE TABLE m AS SELECT map(1,2,1,3) a") scala> sql("SELECT * FROM m").show ++ | a| ++ |[1 -> 3]| ++ scala> sql("SELECT map(1,2,1,3)").show ++ | map(1, 2, 1, 3)| ++ |[1 -> 2, 1 -> 3]| ++ {code}
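The proposed improvement can be modeled in a few lines: deduplicate the stored entry pairs (last value per key wins, matching `collect` and the saved-row output) before rendering, instead of printing every stored pair. A sketch only, not Spark's actual `showString` implementation:

```python
# Hedged model of the fix: render a map from its stored (key, value) pairs
# after resolving duplicates, so the printed form agrees with collect.

def render_map(entries):
    deduped = {}
    for k, v in entries:   # later duplicates overwrite earlier ones
        deduped[k] = v
    return "[" + ", ".join(f"{k} -> {v}" for k, v in deduped.items()) + "]"

# map(1,2,1,3): the stored pairs are (1,2) and (1,3)
print(render_map([(1, 2), (1, 3)]))  # [1 -> 3], not [1 -> 2, 1 -> 3]
```

Rendering the raw pair list is exactly what produces the `[1 -> 2, 1 -> 3]` output quoted above; deduplicating first restores the 2.2.x-style `[1 -> 3]`.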
[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662666#comment-16662666 ] Wenchen Fan commented on SPARK-25823: - Anyway we should make map lookup and toString/collect consistent. It's weird if `map(1,2,1,3)[1]` returns 2 but `string(map(1,2,1,3))` returns 1->3.
[jira] [Comment Edited] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662660#comment-16662660 ] Dongjoon Hyun edited comment on SPARK-25823 at 10/24/18 6:38 PM: - Ur, I think `collect` is the correct one, as you can see in the CTAS example. We save the last-win entries. BTW, [~cloud_fan], I'm looking into this issue. I'll create another improvement issue to fix the `show` function per your comment below. {quote}BTW one improvement we can do is to remove duplicated map keys when converting map values to string, to make it invisible to end-users.{quote} It's a regression introduced by SPARK-23023 in Spark 2.3.0. cc [~maropu]. {code:java} scala> sql("SELECT map(1,2,1,3)").show +---+ |map(1, 2, 1, 3)| +---+ |Map(1 -> 3)| +---+ scala> spark.version res2: String = 2.2.2 {code}
[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662660#comment-16662660 ] Dongjoon Hyun commented on SPARK-25823: --- BTW, [~cloud_fan]. I'm looking around this issue. I'll create another improvement issue to fix `show` function for the following your comment. bq. BTW one improvement we can do is to remove duplicated map keys when converting map values to string, to make it invisible to end-users. It's a regression introduced by SPARK-23023 at Spark 2.3.0. cc [~maropu]. {code} scala> sql("SELECT map(1,2,1,3)").show +---+ |map(1, 2, 1, 3)| +---+ |Map(1 -> 3)| +---+ scala> spark.version res2: String = 2.2.2 {code}
[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662646#comment-16662646 ] Wenchen Fan commented on SPARK-25823: - [~dongjoon] good catch! I think we should update collect to match the behavior of map lookup. Going back to this ticket: the current behavior is different from Presto, but it is consistent with how the map type behaves in Spark. If others think this is serious, I'd suggest we remove the map-related higher-order functions from 2.4. However, we can't remove `CreateMap`, so the behavior of the map type in Spark would still be as it was. Personally I don't want to remove the map-related higher-order functions, as they follow Spark's map type semantics and are implemented correctly. The only benefit I can think of is to not spread the unexpected behavior of the map type in Spark. In the master branch we can work on making Spark's map type consistent with Presto.
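The lookup/collect inconsistency discussed in this thread can be modeled outside Spark: lookup scans the stored pairs and returns the first match, while collect materializes a real map where the last entry wins. A Python illustration only, assuming that first-match lookup behavior (not Spark code):

```python
# Model of the inconsistency: the same stored pairs answer a key lookup
# and a collect differently, because one scans and one deduplicates.

entries = [(1, 2), (1, 3)]  # map(1,2,1,3)

def lookup(entries, key):
    # first matching entry wins, like a linear scan over the stored pairs
    for k, v in entries:
        if k == key:
            return v
    return None

def collect(entries):
    # last entry wins, like building a real map from the pairs
    return dict(entries)

print(lookup(entries, 1))   # 2
print(collect(entries))     # {1: 3} -- same logical map, different answer
```

Making both paths resolve duplicates the same way (either first-wins or last-wins) removes the inconsistency regardless of which convention is chosen.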
[jira] [Updated] (SPARK-25812) Flaky test: PagedTableSuite.pageNavigation
[ https://issues.apache.org/jira/browse/SPARK-25812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25812: -- Component/s: Tests > Flaky test: PagedTableSuite.pageNavigation > -- > > Key: SPARK-25812 > URL: https://issues.apache.org/jira/browse/SPARK-25812 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5074/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5073/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5072/testReport/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5070/testReport/ > - > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97878/testReport/org.apache.spark.ui/PagedTableSuite/pageNavigation/] > {code:java} > [info] PagedTableSuite: > [info] - pageNavigation *** FAILED *** (2 milliseconds) > [info] > [info] > [info]class="form-inline pull-right" style="margin-bottom: 0px;"> > [info] > [info] > [info] 1 Pages. Jump to > [info] value="1" class="span1"/> > [info] > [info] . Show > [info] value="10" class="span1"/> > [info] items in a page. 
> [info] > [info] Go > [info] > [info] > [info] > [info] Page: > [info] > [info] > [info] > [info] 1 > [info] > [info] > [info] > [info] > [info]did not equal List() (PagedTableSuite.scala:76) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:76) > [info] at > org.apache.spark.ui.PagedTableSuite$$anonfun$4.apply(PagedTableSuite.scala:52) > {code}
[jira] [Updated] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-23415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23415: -- Component/s: Tests > BufferHolderSparkSubmitSuite is flaky > - > > Key: SPARK-23415 > URL: https://issues.apache.org/jira/browse/SPARK-23415 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.4.0 > > > The test suite sometimes fails due to a 60-second timeout. > {code:java} > Error Message > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > failAfter did not complete within 60 seconds. > Stacktrace > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > failAfter did not complete within 60 seconds. > {code} > - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/] > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/] > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/4759/] > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/412/] > (June 15th)
[jira] [Updated] (SPARK-25542) Flaky test: OpenHashMapSuite
[ https://issues.apache.org/jira/browse/SPARK-25542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25542: -- Component/s: Tests > Flaky test: OpenHashMapSuite > > > Key: SPARK-25542 > URL: https://issues.apache.org/jira/browse/SPARK-25542 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > - > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96585/testReport/org.apache.spark.util.collection/OpenHashMapSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/] > (Sep 25, 2018 5:52:56 PM) > {code:java} > org.apache.spark.util.collection.OpenHashMapSuite.(It is not a test it is a > sbt.testing.SuiteSelector) > Failing for the past 1 build (Since #96585 ) > Took 0 ms. > Error Message > sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space > Stacktrace > sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: > Java heap space > at scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:117) > at scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:115) > at > org.apache.spark.util.collection.OpenHashMap$$anonfun$1.apply$mcVI$sp(OpenHashMap.scala:159) > at > org.apache.spark.util.collection.OpenHashSet.rehash(OpenHashSet.scala:234) > at > org.apache.spark.util.collection.OpenHashSet.rehashIfNeeded(OpenHashSet.scala:171) > at > org.apache.spark.util.collection.OpenHashMap$mcI$sp.update$mcI$sp(OpenHashMap.scala:86) > at > org.apache.spark.util.collection.OpenHashMapSuite$$anonfun$17$$anonfun$apply$4.apply$mcVI$sp(OpenHashMapSuite.scala:192) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) > at > org.apache.spark.util.collection.OpenHashMapSuite$$anonfun$17.apply(OpenHashMapSuite.scala:191) > at > org.apache.spark.util.collection.OpenHashMapSuite$$anonfun$17.apply(OpenHashMapSuite.scala:188) > at 
org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > {code}
[jira] [Updated] (SPARK-23622) Flaky Test: HiveClientSuites
[ https://issues.apache.org/jira/browse/SPARK-23622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23622: -- Component/s: Tests > Flaky Test: HiveClientSuites > > > Key: SPARK-23622 > URL: https://issues.apache.org/jira/browse/SPARK-23622 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88052/testReport/org.apache.spark.sql.hive.client/HiveClientSuites/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > - https://amplab.cs.berkeley.edu/jenkins/view/Spark QA Test > (Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325 > {code} > Error Message > java.lang.reflect.InvocationTargetException: null > Stacktrace > sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270) > at > org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:58) > at > org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:41) > at > org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$init(HiveClientSuite.scala:48) > at > org.apache.spark.sql.hive.client.HiveClientSuite.beforeAll(HiveClientSuite.scala:71) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at 
org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1255) > at > org.apache.spark.sql.hive.client.HiveClientSuites.runNestedSuites(HiveClientSuites.scala:24) > at org.scalatest.Suite$class.run(Suite.scala:1144) > at > org.apache.spark.sql.hive.client.HiveClientSuites.run(HiveClientSuites.scala:24) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: > java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:444) > at > org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183) > at > org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:117) > ... 
29 more > Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1453) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:63) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:73) > at > org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2664) > at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2683) > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:425) > ... 31 more > Caused by: sbt.ForkMain$ForkError: > java.lang.reflect.InvocationTargetException: null > at
[jira] [Updated] (SPARK-25792) Flaky test: BarrierStageOnSubmittedSuite."submit a barrier ResultStage that requires more slots than current total under local-cluster mode"
[ https://issues.apache.org/jira/browse/SPARK-25792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25792: -- Component/s: Tests > Flaky test: BarrierStageOnSubmittedSuite."submit a barrier ResultStage that > requires more slots than current total under local-cluster mode" > > > Key: SPARK-25792 > URL: https://issues.apache.org/jira/browse/SPARK-25792 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > {code} > [info] - submit a barrier ResultStage that requires more slots than current > total under local-cluster mode *** FAILED *** (5 seconds, 303 milliseconds) > [info] Expected exception org.apache.spark.SparkException to be thrown, but > java.util.concurrent.TimeoutException was thrown > (BarrierStageOnSubmittedSuite.scala:54) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at org.scalatest.Assertions$class.intercept(Assertions.scala:812) > [info] at org.scalatest.FunSuite.intercept(FunSuite.scala:1560) > [info] at > org.apache.spark.BarrierStageOnSubmittedSuite.org$apache$spark$BarrierStageOnSubmittedSuite$$testSubmitJob(BarrierStageOnSubmittedSuite.scala:54) > [info] at > org.apache.spark.BarrierStageOnSubmittedSuite$$anonfun$18.apply$mcV$sp(BarrierStageOnSubmittedSuite.scala:240) > [info] at > org.apache.spark.BarrierStageOnSubmittedSuite$$anonfun$18.apply(BarrierStageOnSubmittedSuite.scala:227) > [info] at > org.apache.spark.BarrierStageOnSubmittedSuite$$anonfun$18.apply(BarrierStageOnSubmittedSuite.scala:227) > {code} > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.7/133/testReport > - > 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.7/132/testReport > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.7/125/testReport > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.7/123/testReport
[jira] [Updated] (SPARK-24318) Flaky test: SortShuffleSuite
[ https://issues.apache.org/jira/browse/SPARK-24318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24318: -- Component/s: Tests > Flaky test: SortShuffleSuite > > > Key: SPARK-24318 > URL: https://issues.apache.org/jira/browse/SPARK-24318 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/346/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/336/ > {code} > Error Message > java.io.IOException: Failed to delete: > /home/jenkins/workspace/spark-branch-2.3-test-sbt-hadoop-2.7/target/tmp/spark-14031101-7989-4fe2-81eb-a394311ab905 > Stacktrace > sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: > /home/jenkins/workspace/spark-branch-2.3-test-sbt-hadoop-2.7/target/tmp/spark-14031101-7989-4fe2-81eb-a394311ab905 > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1073) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24285) Flaky test: ContinuousSuite.query without test harness
[ https://issues.apache.org/jira/browse/SPARK-24285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24285: -- Component/s: Tests > Flaky test: ContinuousSuite.query without test harness > -- > > Key: SPARK-24285 > URL: https://issues.apache.org/jira/browse/SPARK-24285 > Project: Spark > Issue Type: Bug > Components: Structured Streaming, Tests >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > *2.5.0-SNAPSHOT* > - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96640] > {code:java} > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > scala.this.Predef.Set.apply[Int](0, 1, 2, 3).map[org.apache.spark.sql.Row, > scala.collection.immutable.Set[org.apache.spark.sql.Row]](((x$3: Int) => > org.apache.spark.sql.Row.apply(x$3)))(immutable.this.Set.canBuildFrom[org.apache.spark.sql.Row]).subsetOf(scala.this.Predef.refArrayOps[org.apache.spark.sql.Row](results).toSet[org.apache.spark.sql.Row]) > was false{code} > *2.3.x* > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/] > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24153) Flaky Test: DirectKafkaStreamSuite
[ https://issues.apache.org/jira/browse/SPARK-24153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24153: -- Component/s: Tests > Flaky Test: DirectKafkaStreamSuite > -- > > Key: SPARK-24153 > URL: https://issues.apache.org/jira/browse/SPARK-24153 > Project: Spark > Issue Type: Bug > Components: Structured Streaming, Tests >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > {code} > Test Result (5 failures / +5) > org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.receiving from > largest starting offset > org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.creating > stream by offset > org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset recovery > org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset > recovery from kafka > org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.Direct Kafka > stream report input information > {code} > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/348/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24239) Flaky test: KafkaContinuousSourceSuite.subscribing topic by name from earliest offsets
[ https://issues.apache.org/jira/browse/SPARK-24239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24239: -- Component/s: Tests > Flaky test: KafkaContinuousSourceSuite.subscribing topic by name from > earliest offsets > -- > > Key: SPARK-24239 > URL: https://issues.apache.org/jira/browse/SPARK-24239 > Project: Spark > Issue Type: Bug > Components: Structured Streaming, Tests >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/360/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/353/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24211) Flaky test: StreamingOuterJoinSuite
[ https://issues.apache.org/jira/browse/SPARK-24211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24211: -- Component/s: Tests > Flaky test: StreamingOuterJoinSuite > --- > > Key: SPARK-24211 > URL: https://issues.apache.org/jira/browse/SPARK-24211 > Project: Spark > Issue Type: Bug > Components: Structured Streaming, Tests >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > *windowed left outer join* > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/] > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/] > *windowed right outer join* > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/334/] > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/] > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/371/] > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/345/] > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/366/] > *left outer join with non-key condition violated* > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/337/] > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/366/] > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/386/] > *left outer early state exclusion on left* > - > 
[https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/375] > - > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/385/]
[jira] [Updated] (SPARK-24173) Flaky Test: VersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-24173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24173: -- Component/s: Tests > Flaky Test: VersionsSuite > - > > Key: SPARK-24173 > URL: https://issues.apache.org/jira/browse/SPARK-24173 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > *BRANCH-2.2* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/519/ > *BRANCH-2.3* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/369/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/383 > *MASTER* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/4843/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24140) Flaky test: KMeansClusterSuite.task size should be small in both training and prediction
[ https://issues.apache.org/jira/browse/SPARK-24140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662637#comment-16662637 ] Dongjoon Hyun commented on SPARK-24140: --- Sorry, I don't have new links. You may take a look at the Jenkins log. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ > Flaky test: KMeansClusterSuite.task size should be small in both training and > prediction > > > Key: SPARK-24140 > URL: https://issues.apache.org/jira/browse/SPARK-24140 > Project: Spark > Issue Type: Bug > Components: ML, Tests >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/4765/ > {code} > Error Message > Job 0 cancelled because SparkContext was shut down > Stacktrace > org.apache.spark.SparkException: Job 0 cancelled because SparkContext > was shut down > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) > at > org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1841) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1754) > at > org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1927) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1303) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1926) > at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:574) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1907) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2030) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2051) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2070) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2095) > at org.apache.spark.rdd.RDD.count(RDD.scala:1162) > at org.apache.spark.rdd.RDD$$anonfun$takeSample$1.apply(RDD.scala:571) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > at org.apache.spark.rdd.RDD.takeSample(RDD.scala:560) > at org.apache.spark.mllib.clustering.KMeans.initRandom(KMeans.scala:360) > {code} -- This message was sent by Atlassian 
JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24140) Flaky test: KMeansClusterSuite.task size should be small in both training and prediction
[ https://issues.apache.org/jira/browse/SPARK-24140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24140: -- Component/s: Tests > Flaky test: KMeansClusterSuite.task size should be small in both training and > prediction > > > Key: SPARK-24140 > URL: https://issues.apache.org/jira/browse/SPARK-24140 > Project: Spark > Issue Type: Bug > Components: ML, Tests >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/4765/ > {code} > Error Message > Job 0 cancelled because SparkContext was shut down > Stacktrace > org.apache.spark.SparkException: Job 0 cancelled because SparkContext > was shut down > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) > at > org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1841) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1754) > at > org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1927) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1303) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1926) > at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:574) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1907) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2030) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2051) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2070) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2095) > at org.apache.spark.rdd.RDD.count(RDD.scala:1162) > at org.apache.spark.rdd.RDD$$anonfun$takeSample$1.apply(RDD.scala:571) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > at org.apache.spark.rdd.RDD.takeSample(RDD.scala:560) > at org.apache.spark.mllib.clustering.KMeans.initRandom(KMeans.scala:360) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662618#comment-16662618 ] Dongjoon Hyun edited comment on SPARK-25823 at 10/24/18 5:59 PM: - Right. This is related to the a long lasting issue when CreateMap added. And, when we do collect, the last entry wins. {code} scala> sql("SELECT map(1,2,1,3)").collect res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)]) {code} was (Author: dongjoon): Right. This is a long lasting issue when CreateMap added. And, when we do collect, the last entry wins. {code} scala> sql("SELECT map(1,2,1,3)").collect res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)]) {code} > map_filter can generate incorrect data > -- > > Key: SPARK-25823 > URL: https://issues.apache.org/jira/browse/SPARK-25823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Labels: correctness > > This is not a regression because this occurs in new high-order functions like > `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows > the duplication. If we want to allow this difference in new high-order > functions, we had better add some warning about this different on these > functions after RC4 voting pass at least. Otherwise, this will surprise > Presto-based users. > *Spark 2.4* > {code:java} > spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM > (SELECT map_concat(map(1,2), map(1,3)) m); > spark-sql> SELECT * FROM t; > {1:3} {1:2} > {code} > *Presto 0.212* > {code:java} > presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT > map_concat(map(array[1],array[2]), map(array[1],array[3])) a); >a | _col1 > ---+--- > {1=3} | {} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662618#comment-16662618 ] Dongjoon Hyun commented on SPARK-25823: --- Right. This is a long lasting issue when CreateMap added. And, when we do collect, the last entry wins. {code} scala> sql("SELECT map(1,2,1,3)").collect res0: Array[org.apache.spark.sql.Row] = Array([Map(1 -> 3)]) {code} > map_filter can generate incorrect data > -- > > Key: SPARK-25823 > URL: https://issues.apache.org/jira/browse/SPARK-25823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Labels: correctness > > This is not a regression because this occurs in new high-order functions like > `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows > the duplication. If we want to allow this difference in new high-order > functions, we had better add some warning about this different on these > functions after RC4 voting pass at least. Otherwise, this will surprise > Presto-based users. > *Spark 2.4* > {code:java} > spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM > (SELECT map_concat(map(1,2), map(1,3)) m); > spark-sql> SELECT * FROM t; > {1:3} {1:2} > {code} > *Presto 0.212* > {code:java} > presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT > map_concat(map(array[1],array[2]), map(array[1],array[3])) a); >a | _col1 > ---+--- > {1=3} | {} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25823: -- Labels: correctness (was: ) > map_filter can generate incorrect data > -- > > Key: SPARK-25823 > URL: https://issues.apache.org/jira/browse/SPARK-25823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Labels: correctness > > This is not a regression because this occurs in new high-order functions like > `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows > the duplication. If we want to allow this difference in new high-order > functions, we had better add some warning about this different on these > functions after RC4 voting pass at least. Otherwise, this will surprise > Presto-based users. > *Spark 2.4* > {code:java} > spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM > (SELECT map_concat(map(1,2), map(1,3)) m); > spark-sql> SELECT * FROM t; > {1:3} {1:2} > {code} > *Presto 0.212* > {code:java} > presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT > map_concat(map(array[1],array[2]), map(array[1],array[3])) a); >a | _col1 > ---+--- > {1=3} | {} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662616#comment-16662616 ] Wenchen Fan commented on SPARK-25823: - BTW one improvement we can do is to remove duplicated map keys when converting map values to string, to make it invisible to end-users. > map_filter can generate incorrect data > -- > > Key: SPARK-25823 > URL: https://issues.apache.org/jira/browse/SPARK-25823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > This is not a regression because this occurs in new high-order functions like > `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows > the duplication. If we want to allow this difference in new high-order > functions, we had better add some warning about this different on these > functions after RC4 voting pass at least. Otherwise, this will surprise > Presto-based users. > *Spark 2.4* > {code:java} > spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM > (SELECT map_concat(map(1,2), map(1,3)) m); > spark-sql> SELECT * FROM t; > {1:3} {1:2} > {code} > *Presto 0.212* > {code:java} > presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT > map_concat(map(array[1],array[2]), map(array[1],array[3])) a); >a | _col1 > ---+--- > {1=3} | {} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662613#comment-16662613 ] Wenchen Fan commented on SPARK-25823: - I did a few experiments {code} scala> sql("SELECT map(1,2,1,3)").show ++ | map(1, 2, 1, 3)| ++ |[1 -> 2, 1 -> 3]| ++ scala> sql("SELECT map(1,2,1,3)[1]").show +--+ |map(1, 2, 1, 3)[1]| +--+ | 2| +--+ {code} So this is a long-standing problem that Spark allows duplicated map keys, and during lookup the first matched entry wins. I'm not sure we should change this behavior for now. Maybe we should find a place to document it. > map_filter can generate incorrect data > -- > > Key: SPARK-25823 > URL: https://issues.apache.org/jira/browse/SPARK-25823 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > This is not a regression because this occurs in new high-order functions like > `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows > the duplication. If we want to allow this difference in new high-order > functions, we had better add some warning about this different on these > functions after RC4 voting pass at least. Otherwise, this will surprise > Presto-based users. > *Spark 2.4* > {code:java} > spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM > (SELECT map_concat(map(1,2), map(1,3)) m); > spark-sql> SELECT * FROM t; > {1:3} {1:2} > {code} > *Presto 0.212* > {code:java} > presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT > map_concat(map(array[1],array[2]), map(array[1],array[3])) a); >a | _col1 > ---+--- > {1=3} | {} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
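The two experiments above are consistent once Spark's map is viewed as an ordered entry array that may contain duplicate keys: element lookup scans the entries and returns the first match (`map(1,2,1,3)[1]` yields `2`), while collecting into a Scala/Java map lets the last entry win (`Array([Map(1 -> 3)])`). A small Python simulation of those two code paths, under that assumption and not using Spark itself:

```python
def create_map(*kv):
    """Simulate CreateMap: entries kept in order, duplicate keys allowed."""
    return list(zip(kv[::2], kv[1::2]))

def lookup(entries, key):
    """Simulate element access: the first matching entry wins."""
    for k, v in entries:
        if k == key:
            return v
    return None

def collect(entries):
    """Simulate collecting to a dict: later entries overwrite earlier ones."""
    return dict(entries)

m = create_map(1, 2, 1, 3)   # SELECT map(1,2,1,3)
lookup(m, 1)   # 2  -- matches map(1, 2, 1, 3)[1] in the experiment above
collect(m)     # {1: 3} -- matches Array([Map(1 -> 3)]) from .collect
```

The inconsistency between the two access paths (first-wins on lookup, last-wins on collect/display) is what makes the duplicate-key behavior surprising to end users.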
[jira] [Updated] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-25823:
----------------------------------
    Description:
This is not a regression, because it occurs only in new higher-order functions like `map_filter` and `map_concat`. The root cause is that Spark's `CreateMap` allows duplicate keys. If we want to allow this difference in the new higher-order functions, we should at least add a warning about it to these functions after the RC4 vote passes. Otherwise, this will surprise Presto-based users.

*Spark 2.4*
{code:java}
spark-sql> CREATE TABLE t AS SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT map_concat(map(1,2), map(1,3)) m);
spark-sql> SELECT * FROM t;
{1:3}	{1:2}
{code}

*Presto 0.212*
{code:java}
presto> SELECT a, map_filter(a, (k,v) -> v = 2) FROM (SELECT map_concat(map(array[1],array[2]), map(array[1],array[3])) a);
   a   | _col1
-------+-------
 {1=3} | {}
{code}

  was:
This is not a regression because this occurs in new high-order functions like `map_filter` and `map_concat`. The root cause is Spark's `CreateMap` allows the duplication. If we want to allow this difference in new high-order functions, we had better add some warning about this difference on these functions after RC4 voting at least. Otherwise, this will surprise Presto-based users.
[jira] [Commented] (SPARK-25823) map_filter can generate incorrect data
[ https://issues.apache.org/jira/browse/SPARK-25823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662583#comment-16662583 ]

Dongjoon Hyun commented on SPARK-25823:
---------------------------------------

cc [~cloud_fan], [~srowen], [~smilegator], [~hyukjin.kwon] and [~kabhwan].
[jira] [Commented] (SPARK-25822) Fix a race condition when releasing a Python worker
[ https://issues.apache.org/jira/browse/SPARK-25822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662582#comment-16662582 ]

Apache Spark commented on SPARK-25822:
--------------------------------------

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22816

> Fix a race condition when releasing a Python worker
> ---------------------------------------------------
>
>                 Key: SPARK-25822
>                 URL: https://issues.apache.org/jira/browse/SPARK-25822
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.2
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>           Priority: Major
>
> There is a race condition when releasing a Python worker. If "ReaderIterator.handleEndOfDataSection" is not running in the task thread when a task is terminated early (such as by "take(N)"), the task completion listener may close the worker, but "handleEndOfDataSection" can still put the worker into the worker pool for reuse.
> https://github.com/zsxwing/spark/commit/0e07b483d2e7c68f3b5c3c118d0bf58c501041b7 is a patch that reproduces this issue.
> A user also reported this on the mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCAAUq=h+yluepd23nwvq13ms5hostkhx3ao4f4zqv6sgo5zm...@mail.gmail.com%3E
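The race described above can be sketched with a small Python model (hypothetical names; this is not Spark's actual code): one path closes the worker on early task completion while the other returns it to the reuse pool, and a lock plus state flags makes the two operations mutually exclusive so a closed worker is never pooled and a pooled worker is never closed.

```python
import threading

class WorkerMonitor:
    """Model of the Python-worker release race: the task-completion
    listener may close the worker while handleEndOfDataSection tries to
    put it back into the reuse pool. A lock plus 'released'/'closed'
    flags makes whichever call loses the race a no-op."""

    def __init__(self):
        self._lock = threading.Lock()
        self._released = False   # worker was returned to the pool
        self._closed = False     # worker was shut down
        self.pool = []

    def release_worker(self, worker):
        # called from handleEndOfDataSection when the task drains its data
        with self._lock:
            if self._closed:
                return False  # too late: the listener already closed it
            self._released = True
            self.pool.append(worker)
            return True

    def close_worker(self, worker):
        # called from the task-completion listener on early termination,
        # e.g. take(N)
        with self._lock:
            if self._released:
                return False  # worker is pooled for reuse; don't close it
            self._closed = True
            return True

monitor = WorkerMonitor()
# whichever call wins the race, the other safely becomes a no-op
print(monitor.release_worker("w1"), monitor.close_worker("w1"))  # True False
```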
[jira] [Assigned] (SPARK-25822) Fix a race condition when releasing a Python worker
[ https://issues.apache.org/jira/browse/SPARK-25822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25822:
------------------------------------
    Assignee: Shixiong Zhu  (was: Apache Spark)
[jira] [Assigned] (SPARK-25822) Fix a race condition when releasing a Python worker
[ https://issues.apache.org/jira/browse/SPARK-25822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25822:
------------------------------------
    Assignee: Apache Spark  (was: Shixiong Zhu)
[jira] [Commented] (SPARK-25822) Fix a race condition when releasing a Python worker
[ https://issues.apache.org/jira/browse/SPARK-25822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662580#comment-16662580 ]

Apache Spark commented on SPARK-25822:
--------------------------------------

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22816
[jira] [Created] (SPARK-25823) map_filter can generate incorrect data
Dongjoon Hyun created SPARK-25823:
----------------------------------

             Summary: map_filter can generate incorrect data
                 Key: SPARK-25823
                 URL: https://issues.apache.org/jira/browse/SPARK-25823
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Dongjoon Hyun
[jira] [Created] (SPARK-25822) Fix a race condition when releasing a Python worker
Shixiong Zhu created SPARK-25822:
---------------------------------

             Summary: Fix a race condition when releasing a Python worker
                 Key: SPARK-25822
                 URL: https://issues.apache.org/jira/browse/SPARK-25822
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.3.2
            Reporter: Shixiong Zhu
            Assignee: Shixiong Zhu
[jira] [Resolved] (SPARK-25798) Internally document type conversion between Pandas data and SQL types in Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-25798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler resolved SPARK-25798.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 22795
[https://github.com/apache/spark/pull/22795]

> Internally document type conversion between Pandas data and SQL types in Pandas UDFs
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-25798
>                 URL: https://issues.apache.org/jira/browse/SPARK-25798
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>           Priority: Minor
>             Fix For: 3.0.0
>
> Currently, the UDF's type coercion is not cleanly defined. See also https://github.com/apache/spark/pull/20163 and https://github.com/apache/spark/pull/22610
> This JIRA targets describing the type conversion logic internally. For instance:
> {code}
> (type-coercion matrix: rows are SQL types — boolean, tinyint, smallint, int, bigint, string, date, timestamp, float, ... — and columns are Pandas input values and dtypes, from True(bool) and 1(int8) through 1 days 00:00:00(timedelta64[ns]); each cell shows the converted value, with X marking unsupported conversions)
> {code}