[jira] [Commented] (SPARK-26312) Converting converters in RDDConversions into arrays to improve their access performance
[ https://issues.apache.org/jira/browse/SPARK-26312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713576#comment-16713576 ]

Apache Spark commented on SPARK-26312:
--------------------------------------

User 'eatoncys' has created a pull request for this issue:
https://github.com/apache/spark/pull/23262

> Converting converters in RDDConversions into arrays to improve their access
> performance
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-26312
>                 URL: https://issues.apache.org/jira/browse/SPARK-26312
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: eaton
>            Priority: Major
>
> `RDDConversions` would get disproportionately slower as the number of columns
> in the query increased. This PR converts the `converters` in `RDDConversions`
> into arrays to improve their access performance; the previous type of
> `converters` was `scala.collection.immutable.::`, a subtype of `List`.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26312) Converting converters in RDDConversions into arrays to improve their access performance
[ https://issues.apache.org/jira/browse/SPARK-26312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26312:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Assigned] (SPARK-26312) Converting converters in RDDConversions into arrays to improve their access performance
[ https://issues.apache.org/jira/browse/SPARK-26312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26312:
------------------------------------

    Assignee: Apache Spark
[jira] [Created] (SPARK-26312) Converting converters in RDDConversions into arrays to improve their access performance
eaton created SPARK-26312:
--------------------------

             Summary: Converting converters in RDDConversions into arrays to improve their access performance
                 Key: SPARK-26312
                 URL: https://issues.apache.org/jira/browse/SPARK-26312
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: eaton
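The motivation of SPARK-26312 can be sketched outside Spark: positional access on a Scala `List` is O(i) per lookup, while an `Array` is O(1), which matters when a per-row loop indexes into the converters once per column. A minimal sketch in plain Scala (these names are illustrative stand-ins, not Spark's actual `RDDConversions` code):

```scala
object ConverterAccess {
  // Stand-ins for Spark's per-column converter functions (hypothetical, identity for illustration).
  val convertersList: List[Any => Any] = List.fill(3000)((x: Any) => x)

  // The fix in spirit: materialize the converters once as an Array so that
  // converters(i) inside the per-row loop is O(1) instead of O(i) on a List.
  val convertersArray: Array[Any => Any] = convertersList.toArray

  // Per-row conversion, walking the columns by index as the row-conversion loop does.
  def convertRow(converters: Array[Any => Any], row: Array[Any]): Array[Any] = {
    val out = new Array[Any](row.length)
    var i = 0
    while (i < row.length) {
      out(i) = converters(i)(row(i))
      i += 1
    }
    out
  }
}
```

With a `List`, each `converters(i)` walks the list from the head, making the whole row O(n²) in the column count; the one-time `.toArray` makes it linear.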
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713565#comment-16713565 ]

Apache Spark commented on SPARK-26311:
--------------------------------------

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/23260

> [YARN] New feature: custom log URL for stdout/stderr
> ----------------------------------------------------
>
>                 Key: SPARK-26311
>                 URL: https://issues.apache.org/jira/browse/SPARK-26311
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 2.4.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to
> the NodeManager webapp. This normally works for both running and finished
> apps, but there are other approaches to maintaining application logs, such as
> an external log service, which avoids the application log URL becoming a dead
> link when the NodeManager is not accessible (node decommissioned, elastic
> nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode,
> which end users can set to point application logs to an external log service.
[jira] [Commented] (SPARK-23674) Add Spark ML Listener for Tracking ML Pipeline Status
[ https://issues.apache.org/jira/browse/SPARK-23674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713568#comment-16713568 ]

Apache Spark commented on SPARK-23674:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23261

> Add Spark ML Listener for Tracking ML Pipeline Status
> -----------------------------------------------------
>
>                 Key: SPARK-23674
>                 URL: https://issues.apache.org/jira/browse/SPARK-23674
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Mingjie Tang
>            Priority: Major
>
> Currently, Spark provides status monitoring for different components of
> Spark, such as the Spark history server, the streaming listener, the SQL
> listener, etc.
> The use cases are: (1) a front-end UI to track the training convergence rate
> during iteration, so data scientists can understand how a job converges while
> training models such as K-means, logistic regression, and other linear
> models; (2) tracking data lineage for the input and output of training data.
> In this proposal, we hope to provide a Spark ML pipeline listener to track
> the status of the Spark ML pipeline, including:
> # ML pipeline created and saved
> # ML pipeline model created, saved, and loaded
> # ML model training status monitoring
[jira] [Assigned] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26311:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Assigned] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26311:
------------------------------------

    Assignee: Apache Spark
[jira] [Created] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
Jungtaek Lim created SPARK-26311:
---------------------------------

             Summary: [YARN] New feature: custom log URL for stdout/stderr
                 Key: SPARK-26311
                 URL: https://issues.apache.org/jira/browse/SPARK-26311
             Project: Spark
          Issue Type: Improvement
          Components: YARN
    Affects Versions: 2.4.0
            Reporter: Jungtaek Lim
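A hedged sketch of what such a custom log URL could look like: a URL template with placeholders that Spark would substitute per container before rendering the UI link. The placeholder names and rendering scheme here are hypothetical, not the final design from the pull request:

```scala
object CustomLogUrl {
  // Hypothetical: the template would come from a Spark conf key (name not final),
  // and Spark would fill in per-container values such as the application and
  // container IDs before showing the link in the UI.
  def render(template: String, values: Map[String, String]): String =
    values.foldLeft(template) { case (url, (key, value)) =>
      url.replace("{{" + key + "}}", value)
    }
}
```

For example, `render("http://logs.example.com/{{APP_ID}}/{{CONTAINER_ID}}/stdout", Map("APP_ID" -> "application_1", "CONTAINER_ID" -> "container_1"))` yields a link that survives NodeManager decommissioning, since it points at the external log service rather than the NodeManager webapp.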
[jira] [Updated] (SPARK-26224) Results in stackOverFlowError when trying to add 3000 new columns using withColumn function of dataframe.
[ https://issues.apache.org/jira/browse/SPARK-26224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-26224:
-------------------------------------

    Component/s:     (was: Spark Core)
                 SQL

> Results in stackOverFlowError when trying to add 3000 new columns using
> withColumn function of dataframe.
> ------------------------------------------------------------------------
>
>                 Key: SPARK-26224
>                 URL: https://issues.apache.org/jira/browse/SPARK-26224
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>        Environment: On MacBook, used the IntelliJ editor. Ran the sample
> code below as a unit test.
>            Reporter: Dorjee Tsering
>            Priority: Minor
>
> Reproduction step: run this sample code on your laptop. I am trying to add
> 3000 new columns to a base dataframe with 1 column.
>
> {code:scala}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.lit
> import org.apache.spark.sql.types.{DataTypes, StructField}
> import spark.implicits._
>
> val newColumnsToBeAdded: Seq[StructField] =
>   for (i <- 1 to 3000) yield StructField("field_" + i, DataTypes.LongType)
>
> val baseDataFrame: DataFrame = Seq(1).toDF("employee_id")
>
> val result = newColumnsToBeAdded.foldLeft(baseDataFrame)((df, newColumn) =>
>   df.withColumn(newColumn.name, lit(0)))
>
> result.show(false)
> {code}
> Ends up with the following stacktrace:
> {noformat}
> java.lang.StackOverflowError
>   at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
>   at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
>   at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.immutable.List.map(List.scala:296)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
> {noformat}
[jira] [Commented] (SPARK-26215) define reserved keywords after SQL standard
[ https://issues.apache.org/jira/browse/SPARK-26215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713540#comment-16713540 ]

Apache Spark commented on SPARK-26215:
--------------------------------------

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/23259

> define reserved keywords after SQL standard
> -------------------------------------------
>
>                 Key: SPARK-26215
>                 URL: https://issues.apache.org/jira/browse/SPARK-26215
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> There are two kinds of SQL keywords: reserved and non-reserved. Reserved
> keywords can't be used as identifiers.
> In Spark SQL, we are too tolerant about non-reserved keywords. A lot of
> keywords are non-reserved, and sometimes this causes ambiguity (IIRC we hit
> a problem when improving the INTERVAL syntax).
> I think it would be better to just follow other databases or the SQL standard
> to define reserved keywords, so that we don't need to think very hard about
> how to avoid ambiguity.
> For reference: https://www.postgresql.org/docs/8.1/sql-keywords-appendix.html
[jira] [Assigned] (SPARK-26215) define reserved keywords after SQL standard
[ https://issues.apache.org/jira/browse/SPARK-26215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26215:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-26215) define reserved keywords after SQL standard
[ https://issues.apache.org/jira/browse/SPARK-26215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26215:
------------------------------------

    Assignee:     (was: Apache Spark)
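The reserved/non-reserved split described in SPARK-26215 can be illustrated with a minimal check (the keyword sets below are illustrative only; the real lists live in Spark's SQL grammar):

```scala
object Keywords {
  // Illustrative subset of reserved keywords; not Spark's actual grammar.
  // INTERVAL is included because the issue mentions ambiguity around it.
  val reserved: Set[String] = Set("SELECT", "FROM", "WHERE", "INTERVAL")

  // A reserved keyword may never be used as an identifier;
  // anything else, including non-reserved keywords, may.
  def validIdentifier(name: String): Boolean = !reserved.contains(name.toUpperCase)
}
```

Under this scheme a parser rejects `SELECT interval FROM t` outright instead of having to disambiguate it from the INTERVAL literal syntax, which is the ambiguity the issue wants to avoid.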
[jira] [Commented] (SPARK-23375) Optimizer should remove unneeded Sort
[ https://issues.apache.org/jira/browse/SPARK-23375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713516#comment-16713516 ]

Apache Spark commented on SPARK-23375:
--------------------------------------

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23258

> Optimizer should remove unneeded Sort
> -------------------------------------
>
>                 Key: SPARK-23375
>                 URL: https://issues.apache.org/jira/browse/SPARK-23375
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Marco Gaido
>            Assignee: Marco Gaido
>            Priority: Minor
>             Fix For: 2.4.0
>
> As pointed out in SPARK-23368, as of now there is no rule to remove the Sort
> operator on an already sorted plan, i.e. if we have a query like:
> {code}
> SELECT b
> FROM (
>   SELECT a, b
>   FROM table1
>   ORDER BY a
> ) t
> ORDER BY a
> {code}
> The sort is actually executed twice, even though it is not needed.
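The optimization asked for in SPARK-23375 can be sketched as a rewrite rule on a toy plan tree. This is a deliberately simplified model, not Spark's actual Catalyst rule: it only handles directly nested sorts, whereas the real rule must also look through ordering-preserving operators such as projections (as in the subquery example above):

```scala
// A toy plan tree for illustrating Sort elimination.
sealed trait Node
case class Scan(table: String) extends Node
case class Sort(keys: Seq[String], child: Node) extends Node

object RemoveRedundantSort {
  // Drop a Sort whose input is already sorted on exactly the same keys.
  def apply(plan: Node): Node = plan match {
    case Sort(keys, child) =>
      apply(child) match {
        // Child already produces this ordering: the outer Sort is a no-op.
        case inner @ Sort(innerKeys, _) if innerKeys == keys => inner
        case simplified                                      => Sort(keys, simplified)
      }
    case leaf => leaf
  }
}
```

Applied to `Sort(Seq("a"), Sort(Seq("a"), Scan("table1")))`, the rule collapses the pair to a single sort, which is what the query in the issue description needs to avoid sorting twice.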
[jira] [Commented] (SPARK-26224) Results in stackOverFlowError when trying to add 3000 new columns using withColumn function of dataframe.
[ https://issues.apache.org/jira/browse/SPARK-26224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713494#comment-16713494 ]

Liang-Chi Hsieh commented on SPARK-26224:
-----------------------------------------

I think this is not specific to {{withColumn}}; {{withColumn}} simply adds a
projection on top of the original dataframe. The problem is that you create a
very deep query plan, so the analyzer or optimizer overflows the stack when
traversing down the plan. Even if it could traverse such a deep plan, doing so
would not be efficient. I'd recommend not creating such a deep query plan.
This should not be a bug.
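The shape of the problem can be shown with a toy model of a Catalyst-style plan tree (this is not Spark's actual `TreeNode` API): each `withColumn` call adds one projection node on top of the previous plan, so 3000 calls nest 3000 nodes deep, and any recursive traversal like `transformDown` must descend all 3000 levels.

```scala
// Toy plan tree: depth grows by one per wrapping Project.
sealed trait Plan { def depth: Int }
case object Leaf extends Plan { val depth = 1 }
final case class Project(columns: Seq[String], child: Plan) extends Plan {
  def depth: Int = 1 + child.depth
}

object PlanDepth {
  val fields: Seq[String] = (1 to 3000).map(i => s"field_$i")

  // One Project per column, as foldLeft over withColumn produces.
  val deepPlan: Plan = fields.foldLeft(Leaf: Plan)((plan, f) => Project(Seq(f), plan))

  // All columns in a single projection, as one select(...) call produces.
  val flatPlan: Plan = Project(fields, Leaf)
}
```

The deep plan has depth 3001 while the flat one has depth 2. In Spark itself the shallow equivalent would be adding all the columns in one `select` call rather than 3000 `withColumn` calls; a hedged, untested sketch of that idea is `df.select(col("*") +: fields.map(f => lit(0L).as(f)): _*)`.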
[jira] [Resolved] (SPARK-23734) InvalidSchemaException While Saving ALSModel
[ https://issues.apache.org/jira/browse/SPARK-23734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanley Poon resolved SPARK-23734.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 2.3.1

> InvalidSchemaException While Saving ALSModel
> --------------------------------------------
>
>                 Key: SPARK-23734
>                 URL: https://issues.apache.org/jira/browse/SPARK-23734
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>        Environment: macOS 10.13.2
> Scala 2.11.8
> Spark 2.3.0 v2.3.0-rc5 (Feb 22 2018)
>            Reporter: Stanley Poon
>            Priority: Major
>              Labels: ALS, parquet, persistence
>             Fix For: 2.3.1
>
> After fitting an ALSModel, we get the following error while saving the model:
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can
> not be empty. Parquet does not support empty group without leaves. Empty
> group: spark_schema
> Exactly the same code ran OK on 2.2.1. The same issue also occurs on other
> ALSModels we have.
> h2. *To reproduce*
> Get ALSExample:
> [https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala]
> and add the following line to save the model right before "spark.stop":
> {quote}model.write.overwrite().save("SparkExampleALSModel")
> {quote}
> h2. Stack Trace
> {noformat}
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:112)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140)
>   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
>   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
>   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
>   at org.apache.spark.ml.recommendation.ALSModel$ALSModelWriter.saveImpl(ALS.scala:510)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:103)
>   at com.vitalmove.model.ALSExample$.main(ALSExample.scala:83)
>   at com.vitalmove.model.ALSExample.main(ALSExample.scala)
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can
> not be empty. Parquet does not support empty group without leaves. Empty
> group: spark_schema
>   at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
>   at org.apache.parquet.schema.GroupType.<init>(GroupType.java:48)
>   at org.apache.parquet.schema.MessageType.<init>(MessageType.java:50)
>   at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.<init>(ParquetSchemaConverter.scala:567)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.<clinit>(ParquetSchemaConverter.scala)
> {noformat}
[jira] [Commented] (SPARK-23734) InvalidSchemaException While Saving ALSModel
[ https://issues.apache.org/jira/browse/SPARK-23734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713403#comment-16713403 ] Stanley Poon commented on SPARK-23734: -- Just confirmed the problem is fixed in Spark 2.3.1. The test environment uses Scala 2.11.11, and there are no other dependencies. I will close the case. > InvalidSchemaException While Saving ALSModel > > > Key: SPARK-23734 > URL: https://issues.apache.org/jira/browse/SPARK-23734 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 > Environment: macOS 10.13.2 > Scala 2.11.8 > Spark 2.3.0 v2.3.0-rc5 (Feb 22 2018) >Reporter: Stanley Poon >Priority: Major > Labels: ALS, parquet, persistence > > After fitting an ALSModel, I get the following error while saving the model: > Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can > not be empty. Parquet does not support empty group without leaves. Empty > group: spark_schema > Exactly the same code ran ok on 2.2.1. > The same issue also occurs on other ALSModels we have. > h2. *To reproduce* > Get ALSExample: > [https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala] > and add the following line to save the model right before "spark.stop". > {quote} model.write.overwrite().save("SparkExampleALSModel") > {quote} > h2. 
Stack Trace > Exception in thread "main" java.lang.ExceptionInInitializerError > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:112) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225) > at > org.apache.spark.ml.recommendation.ALSModel$ALSModelWriter.saveImpl(ALS.scala:510) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:103) > at com.vitalmove.model.ALSExample$.main(ALSExample.scala:83) > at com.vitalmove.model.ALSExample.main(ALSExample.scala) > Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can > not be empty. Parquet does not support empty group without leaves. Empty > group: spark_schema > at org.apache.parquet.schema.GroupType.(GroupType.java:92) > at org.apache.parquet.schema.GroupType.(GroupType.java:48) > at org.apache.parquet.schema.MessageType.(MessageType.java:50) > at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.(ParquetSchemaConverter.scala:567) > at >
[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713388#comment-16713388 ] Dongjoon Hyun commented on SPARK-26282: --- Great, thanks again for this and email notifications. > Update JVM to 8u191 on jenkins workers > -- > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... > long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19526) Spark should raise an exception when it tries to read a Hive view but it doesn't have read access on the corresponding table(s)
[ https://issues.apache.org/jira/browse/SPARK-19526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-19526. Resolution: Cannot Reproduce > Spark should raise an exception when it tries to read a Hive view but it > doesn't have read access on the corresponding table(s) > --- > > Key: SPARK-19526 > URL: https://issues.apache.org/jira/browse/SPARK-19526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.4, 2.0.3, 2.2.0, 2.3.0 >Reporter: Reza Safi >Priority: Major > > Spark sees a Hive view as a set of HDFS "files". So to read anything from a > Hive view, Spark needs access to all of the files that belong to the > table(s) that the view queries. In other words, a Spark user cannot be > granted fine-grained permissions at the level of Hive columns or records. > Consider a Spark job that contains a SQL query that tries to > read a Hive view. Currently the Spark job will finish successfully even if the > user that runs the Spark job doesn't have proper read access permissions to > the tables that the Hive view has been built on top of. It will just > return an empty result set. This can be confusing for users, since the > job finishes without any exception or error. > Spark should raise an exception like AccessDenied when it tries to run a > Hive view query and its user doesn't have proper permissions to the tables > that the Hive view is created on top of. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19526) Spark should raise an exception when it tries to read a Hive view but it doesn't have read access on the corresponding table(s)
[ https://issues.apache.org/jira/browse/SPARK-19526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713368#comment-16713368 ] Reza Safi commented on SPARK-19526: --- It seems that this can be resolved since we can't reproduce the issue. Spark will give an error message if the user doesn't have proper access to the underlying tables of a view; it won't just return null results. Thank you [~attilapiros] and [~vanzin] for verifying this. > Spark should raise an exception when it tries to read a Hive view but it > doesn't have read access on the corresponding table(s) > --- > > Key: SPARK-19526 > URL: https://issues.apache.org/jira/browse/SPARK-19526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.4, 2.0.3, 2.2.0, 2.3.0 >Reporter: Reza Safi >Priority: Major > > Spark sees a Hive view as a set of HDFS "files". So to read anything from a > Hive view, Spark needs access to all of the files that belong to the > table(s) that the view queries. In other words, a Spark user cannot be > granted fine-grained permissions at the level of Hive columns or records. > Consider a Spark job that contains a SQL query that tries to > read a Hive view. Currently the Spark job will finish successfully even if the > user that runs the Spark job doesn't have proper read access permissions to > the tables that the Hive view has been built on top of. It will just > return an empty result set. This can be confusing for users, since the > job finishes without any exception or error. > Spark should raise an exception like AccessDenied when it tries to run a > Hive view query and its user doesn't have proper permissions to the tables > that the Hive view is created on top of. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713353#comment-16713353 ] shane knapp commented on SPARK-26282: - test build passed! [https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-maven-hadoop-2.7-java-8.191/1/] deploying this now. > Update JVM to 8u191 on jenkins workers > -- > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... > long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26282) Update JVM to 8u191 on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp resolved SPARK-26282. - Resolution: Fixed > Update JVM to 8u191 on jenkins workers > -- > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... > long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713359#comment-16713359 ] shane knapp commented on SPARK-26282: - done. about to email dev@ for a heads-up. > Update JVM to 8u191 on jenkins workers > -- > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... > long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API
[ https://issues.apache.org/jira/browse/SPARK-24333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-24333. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 21465 [https://github.com/apache/spark/pull/21465] > Add fit with validation set to spark.ml GBT: Python API > --- > > Key: SPARK-24333 > URL: https://issues.apache.org/jira/browse/SPARK-24333 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > Python version of API added by [SPARK-7132] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26304) Add default value to spark.kafka.sasl.kerberos.service.name parameter
[ https://issues.apache.org/jira/browse/SPARK-26304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-26304: -- Assignee: Gabor Somogyi > Add default value to spark.kafka.sasl.kerberos.service.name parameter > - > > Key: SPARK-26304 > URL: https://issues.apache.org/jira/browse/SPARK-26304 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.0.0 > > > The reasoning behind: > * Kafka's configuration guide suggest the same value: > https://kafka.apache.org/documentation/#security_sasl_kerberos_brokerconfig > * It would be easier for spark users by providing less configuration > * Other streaming engines are doing the same -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26304) Add default value to spark.kafka.sasl.kerberos.service.name parameter
[ https://issues.apache.org/jira/browse/SPARK-26304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26304. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23254 [https://github.com/apache/spark/pull/23254] > Add default value to spark.kafka.sasl.kerberos.service.name parameter > - > > Key: SPARK-26304 > URL: https://issues.apache.org/jira/browse/SPARK-26304 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.0.0 > > > The reasoning behind: > * Kafka's configuration guide suggest the same value: > https://kafka.apache.org/documentation/#security_sasl_kerberos_brokerconfig > * It would be easier for spark users by providing less configuration > * Other streaming engines are doing the same -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API
[ https://issues.apache.org/jira/browse/SPARK-24333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-24333: Assignee: Huaxin Gao > Add fit with validation set to spark.ml GBT: Python API > --- > > Key: SPARK-24333 > URL: https://issues.apache.org/jira/browse/SPARK-24333 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Assignee: Huaxin Gao >Priority: Major > > Python version of API added by [SPARK-7132] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26310) Verification of JSON options
[ https://issues.apache.org/jira/browse/SPARK-26310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26310: Assignee: (was: Apache Spark) > Verification of JSON options > > > Key: SPARK-26310 > URL: https://issues.apache.org/jira/browse/SPARK-26310 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > For JSON options used only in write, the following exception should be raised > if those options are used in read. The same exception should be raised in the > opposite case when read option is used in write: > {code} > java.lang.IllegalArgumentException: The JSON option "dropFieldIfAllNull" is > not applicable in write. > {code} > The verification can be disabled via the SQL config: > {code} > spark.sql.verifyDataSourceOptions > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26310) Verification of JSON options
[ https://issues.apache.org/jira/browse/SPARK-26310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713285#comment-16713285 ] Apache Spark commented on SPARK-26310: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23257 > Verification of JSON options > > > Key: SPARK-26310 > URL: https://issues.apache.org/jira/browse/SPARK-26310 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > For JSON options used only in write, the following exception should be raised > if those options are used in read. The same exception should be raised in the > opposite case when read option is used in write: > {code} > java.lang.IllegalArgumentException: The JSON option "dropFieldIfAllNull" is > not applicable in write. > {code} > The verification can be disabled via the SQL config: > {code} > spark.sql.verifyDataSourceOptions > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26310) Verification of JSON options
[ https://issues.apache.org/jira/browse/SPARK-26310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26310: Assignee: Apache Spark > Verification of JSON options > > > Key: SPARK-26310 > URL: https://issues.apache.org/jira/browse/SPARK-26310 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > For JSON options used only in write, the following exception should be raised > if those options are used in read. The same exception should be raised in the > opposite case when read option is used in write: > {code} > java.lang.IllegalArgumentException: The JSON option "dropFieldIfAllNull" is > not applicable in write. > {code} > The verification can be disabled via the SQL config: > {code} > spark.sql.verifyDataSourceOptions > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26310) Verification of JSON options
[ https://issues.apache.org/jira/browse/SPARK-26310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713283#comment-16713283 ] Apache Spark commented on SPARK-26310: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23257 > Verification of JSON options > > > Key: SPARK-26310 > URL: https://issues.apache.org/jira/browse/SPARK-26310 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > For JSON options used only in write, the following exception should be raised > if those options are used in read. The same exception should be raised in the > opposite case when read option is used in write: > {code} > java.lang.IllegalArgumentException: The JSON option "dropFieldIfAllNull" is > not applicable in write. > {code} > The verification can be disabled via the SQL config: > {code} > spark.sql.verifyDataSourceOptions > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25696) The storage memory displayed on spark Application UI is incorrect.
[ https://issues.apache.org/jira/browse/SPARK-25696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25696: - Docs Text: In Spark 3.0, the web UI and log statements now consistently report units in KiB, MiB, etc, (i.e. multiples of 1024) rather than KB and MB (i.e. multiples of 1000). For example, 1024000 bytes is now displayed as 1000 KiB rather than 1024 KB. Assignee: hantiantian Target Version/s: 3.0.0 Labels: release-notes (was: ) Component/s: Web UI Issue Type: Improvement (was: Bug) (I'm marking this much more of an improvement than fix, as I believe the displays were correct, but just in inconsistent units. There were a few log statements that were incorrect, but nothing functional, it appears.) > The storage memory displayed on spark Application UI is incorrect. > -- > > Key: SPARK-25696 > URL: https://issues.apache.org/jira/browse/SPARK-25696 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: hantiantian >Assignee: hantiantian >Priority: Major > Labels: release-notes > > In the reported heartbeat information, the unit of the memory data is bytes, > which is converted by the formatBytes() function in the utils.js file before > being displayed in the interface. The cardinality of the unit conversion in > the formatBytes function is 1000, which should be 1024. > function formatBytes(bytes, type) > { if (type !== 'display') return bytes; if (bytes == 0) return '0.0 B'; > var k = 1000; var dm = 1; var sizes = ['B', 'KB', 'MB', 'GB', 'TB', > 'PB', 'EB', 'ZB', 'YB']; var i = Math.floor(Math.log(bytes) / > Math.log(k)); return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + > sizes[i]; } > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
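For reference, a corrected version of the formatBytes helper quoted in the report above — a sketch only, not necessarily the exact patch that was merged — changes the conversion base from 1000 to 1024 and, per the release note, switches the unit labels to binary names (KiB, MiB, …):

```javascript
// Sketch of the fixed utils.js helper: base-1024 conversion with binary units.
function formatBytes(bytes, type) {
  if (type !== 'display') return bytes;   // only format values meant for display
  if (bytes == 0) return '0.0 B';
  var k = 1024;                           // was 1000; binary base matches the units below
  var dm = 1;                             // one decimal place
  var sizes = ['B', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB'];
  var i = Math.floor(Math.log(bytes) / Math.log(k));
  return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + sizes[i];
}
```

With this base, 1024000 bytes renders as "1000 KiB" rather than "1024 KB", consistent with the docs text above.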
[jira] [Created] (SPARK-26310) Verification of JSON options
Maxim Gekk created SPARK-26310: -- Summary: Verification of JSON options Key: SPARK-26310 URL: https://issues.apache.org/jira/browse/SPARK-26310 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk For JSON options used only in write, the following exception should be raised if those options are used in read. The same exception should be raised in the opposite case, when a read option is used in write: {code} java.lang.IllegalArgumentException: The JSON option "dropFieldIfAllNull" is not applicable in write. {code} The verification can be disabled via the SQL config: {code} spark.sql.verifyDataSourceOptions {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
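The check proposed above can be sketched generically. The function and option sets below are hypothetical illustrations of the idea, not Spark's actual implementation (which would live in the Scala JSON datasource code paths):

```javascript
// Hypothetical sketch: reject options that are not applicable in the current mode.
// The membership of these sets is illustrative, not Spark's authoritative list.
const READ_ONLY_JSON_OPTIONS = new Set(['dropFieldIfAllNull']);
const WRITE_ONLY_JSON_OPTIONS = new Set(['ignoreNullFields']);

function verifyJsonOptions(options, mode) {
  for (const name of Object.keys(options)) {
    const badInWrite = mode === 'write' && READ_ONLY_JSON_OPTIONS.has(name);
    const badInRead = mode === 'read' && WRITE_ONLY_JSON_OPTIONS.has(name);
    if (badInWrite || badInRead) {
      // Mirrors the exception message quoted in the issue description.
      throw new Error(
        'The JSON option "' + name + '" is not applicable in ' + mode + '.');
    }
  }
}
```

A read-only option such as dropFieldIfAllNull passes verification in read mode but raises the exception in write mode, matching the example in the ticket.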
[jira] [Resolved] (SPARK-26196) Total tasks message in the stage is incorrect, when there are failed or killed tasks
[ https://issues.apache.org/jira/browse/SPARK-26196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26196. --- Resolution: Fixed Assignee: shahid Fix Version/s: 3.0.0 Resolved by https://github.com/apache/spark/pull/23160 > Total tasks message in the stage is incorrect, when there are failed or > killed tasks > > > Key: SPARK-26196 > URL: https://issues.apache.org/jira/browse/SPARK-26196 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: shahid >Assignee: shahid >Priority: Major > Fix For: 3.0.0 > > > Total tasks message in the stage page is incorrect when there are failed or > killed tasks. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26281) Duration column of task table should be executor run time instead of real duration
[ https://issues.apache.org/jira/browse/SPARK-26281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26281. --- Resolution: Fixed Assignee: shahid Fix Version/s: 3.0.0 Resolved by https://github.com/apache/spark/pull/23160 > Duration column of task table should be executor run time instead of real > duration > -- > > Key: SPARK-26281 > URL: https://issues.apache.org/jira/browse/SPARK-26281 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: shahid >Priority: Major > Fix For: 3.0.0 > > > In PR https://github.com/apache/spark/pull/23081/ , the duration column is > changed to executor run time. The behavior is consistent with the summary > metrics table and previous Spark version. > However, after PR https://github.com/apache/spark/pull/21688, the issue can > be reproduced again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26309) Verification of Data source options
Maxim Gekk created SPARK-26309: -- Summary: Verification of Data source options Key: SPARK-26309 URL: https://issues.apache.org/jira/browse/SPARK-26309 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Currently, the applicability of datasource options passed to DataFrameReader and DataFrameWriter is not fully checked. For example, if an option is used only in write, it will be silently ignored in read. Such behavior of built-in datasources usually confuses users. This ticket aims to implement additional verification of datasource options and to detect option misuse. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26281) Duration column of task table should be executor run time instead of real duration
[ https://issues.apache.org/jira/browse/SPARK-26281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26281: -- Priority: Minor (was: Major) Issue Type: Bug (was: Improvement) > Duration column of task table should be executor run time instead of real > duration > -- > > Key: SPARK-26281 > URL: https://issues.apache.org/jira/browse/SPARK-26281 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: shahid >Priority: Minor > Fix For: 3.0.0 > > > In PR https://github.com/apache/spark/pull/23081/ , the duration column is > changed to executor run time. The behavior is consistent with the summary > metrics table and previous Spark version. > However, after PR https://github.com/apache/spark/pull/21688, the issue can > be reproduced again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25299: -- Assignee: (was: Marcelo Vanzin) > Use remote storage for persisting shuffle data > -- > > Key: SPARK-25299 > URL: https://issues.apache.org/jira/browse/SPARK-25299 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Major > > In Spark, the shuffle primitive requires Spark executors to persist data to > the local disk of the worker nodes. If executors crash, the external shuffle > service can continue to serve the shuffle data that was written beyond the > lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the > external shuffle service is deployed on every worker node. The shuffle > service shares local disk with the executors that run on its node. > There are some shortcomings with the way shuffle is fundamentally implemented > right now. Particularly: > * If any external shuffle service process or node becomes unavailable, all > applications that had an executor that ran on that node must recompute the > shuffle blocks that were lost. > * Similarly to the above, the external shuffle service must be kept running > at all times, which may waste resources when no applications are using that > shuffle service node. > * Mounting local storage can prevent users from taking advantage of > desirable isolation benefits from using containerized environments, like > Kubernetes. We had an external shuffle service implementation in an early > prototype of the Kubernetes backend, but it was rejected due to its strict > requirement to be able to mount hostPath volumes or other persistent volume > setups. 
> In the following [architecture discussion > document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] > (note: _not_ an SPIP), we brainstorm various high level architectures for > improving the external shuffle service in a way that addresses the above > problems. The purpose of this umbrella JIRA is to promote additional > discussion on how we can approach these problems, both at the architecture > level and the implementation level. We anticipate filing sub-issues that > break down the tasks that must be completed to achieve this goal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25299: -- Assignee: Marcelo Vanzin > Use remote storage for persisting shuffle data > -- > > Key: SPARK-25299 > URL: https://issues.apache.org/jira/browse/SPARK-25299 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Assignee: Marcelo Vanzin >Priority: Major > > In Spark, the shuffle primitive requires Spark executors to persist data to > the local disk of the worker nodes. If executors crash, the external shuffle > service can continue to serve the shuffle data that was written beyond the > lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the > external shuffle service is deployed on every worker node. The shuffle > service shares local disk with the executors that run on its node. > There are some shortcomings with the way shuffle is fundamentally implemented > right now. Particularly: > * If any external shuffle service process or node becomes unavailable, all > applications that had an executor that ran on that node must recompute the > shuffle blocks that were lost. > * Similarly to the above, the external shuffle service must be kept running > at all times, which may waste resources when no applications are using that > shuffle service node. > * Mounting local storage can prevent users from taking advantage of > desirable isolation benefits from using containerized environments, like > Kubernetes. We had an external shuffle service implementation in an early > prototype of the Kubernetes backend, but it was rejected due to its strict > requirement to be able to mount hostPath volumes or other persistent volume > setups. 
> In the following [architecture discussion > document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] > (note: _not_ an SPIP), we brainstorm various high level architectures for > improving the external shuffle service in a way that addresses the above > problems. The purpose of this umbrella JIRA is to promote additional > discussion on how we can approach these problems, both at the architecture > level and the implementation level. We anticipate filing sub-issues that > break down the tasks that must be completed to achieve this goal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26294) Delete Unnecessary If statement
[ https://issues.apache.org/jira/browse/SPARK-26294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26294. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23247 [https://github.com/apache/spark/pull/23247] > Delete Unnecessary If statement > --- > > Key: SPARK-26294 > URL: https://issues.apache.org/jira/browse/SPARK-26294 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wangjiaochun >Assignee: wangjiaochun >Priority: Trivial > Fix For: 3.0.0 > > > Delete the unnecessary if statement: its branch can never execute, because > the code is only reached when records is greater than zero, so the check > for records less than or equal to zero is dead code. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26294) Delete Unnecessary If statement
[ https://issues.apache.org/jira/browse/SPARK-26294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26294: - Assignee: wangjiaochun > Delete Unnecessary If statement > --- > > Key: SPARK-26294 > URL: https://issues.apache.org/jira/browse/SPARK-26294 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wangjiaochun >Assignee: wangjiaochun >Priority: Trivial > > Delete the unnecessary if statement: its branch can never execute, because > the code is only reached when records is greater than zero, so the check > for records less than or equal to zero is dead code. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
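The removed pattern can be illustrated with a hypothetical sketch (assumed names, not the actual Spark code): an inner guard that can never fire because the enclosing condition already excludes it.

```scala
// Hypothetical illustration of the dead branch (not the actual Spark code).
def describeWrite(records: Int): String = {
  if (records > 0) {
    // Unreachable: records > 0 is already guaranteed on this path, so the
    // records <= 0 check below is dead code and can be deleted outright.
    if (records <= 0) "impossible"
    else s"wrote $records records"
  } else {
    "nothing to write"
  }
}
```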
[jira] [Commented] (SPARK-24207) PrefixSpan: R API
[ https://issues.apache.org/jira/browse/SPARK-24207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713242#comment-16713242 ] Apache Spark commented on SPARK-24207: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/23256 > PrefixSpan: R API > - > > Key: SPARK-24207 > URL: https://issues.apache.org/jira/browse/SPARK-24207 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Felix Cheung >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26306) Flaky test: org.apache.spark.util.collection.SorterSuite
[ https://issues.apache.org/jira/browse/SPARK-26306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713231#comment-16713231 ] Gabor Somogyi commented on SPARK-26306: --- Tested it on my local machine in a loop and the failure never appeared. > Flaky test: org.apache.spark.util.collection.SorterSuite > > > Key: SPARK-26306 > URL: https://issues.apache.org/jira/browse/SPARK-26306 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > In PR builder the following issue appeared: > {code:java} > [info] org.apache.spark.util.collection.SorterSuite *** ABORTED *** (3 > seconds, 225 milliseconds) > [info] java.lang.OutOfMemoryError: Java heap space > [info] at > org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56) > [info] at > org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43) > [info] at > org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70) > [info] at > org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown > Source) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > [info] at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > [info] at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown > Source) > [info] at 
org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > [info] at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > [info] at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1560) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) > [info] at org.scalatest.FunSuiteLike$$Lambda$128/398936629.apply(Unknown > Source) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396) > [info] at org.scalatest.SuperEngine$$Lambda$129/1905082148.apply(Unknown > Source) > [info] at scala.collection.immutable.List.foreach(List.scala:388) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > [info] at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229) > [info] at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228) > [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > [info] at org.scalatest.Suite.run(Suite.scala:1147) > [info] at org.scalatest.Suite.run$(Suite.scala:1129) > [error] Uncaught exception when running > org.apache.spark.util.collection.SorterSuite: java.lang.OutOfMemoryError: > Java heap space > sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56) > at > org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43) > at > org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70) > at > org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown > Source) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at 
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown > Source) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at
[jira] [Created] (SPARK-26308) Large BigDecimal value is converted to null when passed into a UDF
Jay Pranavamurthi created SPARK-26308: - Summary: Large BigDecimal value is converted to null when passed into a UDF Key: SPARK-26308 URL: https://issues.apache.org/jira/browse/SPARK-26308 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Jay Pranavamurthi
We are loading a Hive table into a Spark DataFrame. The Hive table has a decimal(30, 0) column with values greater than Long.MAX_VALUE. The DataFrame loads correctly. We then use a UDF to convert the decimal type to a String value. For decimal values < Long.MAX_VALUE this works fine, but when the decimal value > Long.MAX_VALUE, the input to the UDF is a *null*.
Hive table schema and data:
{code:java}
create table decimal_test (col1 decimal(30, 0), col2 decimal(10, 0), col3 int, col4 string);
insert into decimal_test values(20110002456556, 123456789, 10, 'test1');
{code}
Execution in spark-shell: _(Note that the first column in the final output is null; it should have been "20110002456556")_
{code:java}
scala> val df1 = spark.sqlContext.sql("select * from decimal_test")
df1: org.apache.spark.sql.DataFrame = [col1: decimal(30,0), col2: decimal(10,0) ... 2 more fields]

scala> df1.show
+------------+---------+----+-----+
|        col1|     col2|col3| col4|
+------------+---------+----+-----+
|201100024...|123456789|  10|test1|
+------------+---------+----+-----+

scala> val decimalToString = (value: java.math.BigDecimal) => if (value == null) null else { value.toBigInteger().toString }
decimalToString: java.math.BigDecimal => String = <function1>

scala> val udf1 = org.apache.spark.sql.functions.udf(decimalToString)
udf1: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(DecimalType(38,18))))

scala> val df2 = df1.withColumn("col1", udf1(df1.col("col1")))
df2: org.apache.spark.sql.DataFrame = [col1: string, col2: decimal(10,0) ... 2 more fields]

scala> df2.show
+----+---------+----+-----+
|col1|     col2|col3| col4|
+----+---------+----+-----+
|null|123456789|  10|test1|
+----+---------+----+-----+
{code}
Oddly, this works if we change the "decimalToString" udf to take an "Any" instead of a "java.math.BigDecimal":
{code:java}
scala> val decimalToString = (value: Any) => if (value == null) null else { if (value.isInstanceOf[java.math.BigDecimal]) value.asInstanceOf[java.math.BigDecimal].toBigInteger().toString else null }
decimalToString: Any => String = <function1>

scala> val udf1 = org.apache.spark.sql.functions.udf(decimalToString)
udf1: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,None)

scala> val df2 = df1.withColumn("col1", udf1(df1.col("col1")))
df2: org.apache.spark.sql.DataFrame = [col1: string, col2: decimal(10,0) ... 2 more fields]

scala> df2.show
+------------+---------+----+-----+
|        col1|     col2|col3| col4|
+------------+---------+----+-----+
|201100024...|123456789|  10|test1|
+------------+---------+----+-----+
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
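As a plain-JVM sanity check (no Spark involved), a value above Long.MAX_VALUE is perfectly representable in java.math.BigDecimal, and toBigInteger().toString handles it without loss; this suggests the null arises somewhere in Spark's conversion of the column into the UDF's input type (the UDF above is registered with a DecimalType(38,18) input), not in BigDecimal itself:

```scala
import java.math.BigDecimal

// A value strictly greater than Long.MAX_VALUE (9223372036854775807).
val big = new BigDecimal("92233720368547758080") // roughly ten times Long.MAX_VALUE
assert(big.compareTo(BigDecimal.valueOf(Long.MaxValue)) > 0)

// Plain BigDecimal round-trips it without loss; no nulls involved.
assert(big.toBigInteger().toString == "92233720368547758080")
```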
[jira] [Assigned] (SPARK-24243) Expose exceptions from InProcessAppHandle
[ https://issues.apache.org/jira/browse/SPARK-24243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24243: -- Assignee: Sahil Takiar > Expose exceptions from InProcessAppHandle > - > > Key: SPARK-24243 > URL: https://issues.apache.org/jira/browse/SPARK-24243 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Sahil Takiar >Assignee: Sahil Takiar >Priority: Major > Fix For: 3.0.0 > > > {{InProcessAppHandle}} runs {{SparkSubmit}} in a dedicated thread, any > exceptions thrown are logged and then the state is set to {{FAILED}}. It > would be nice to expose the {{Throwable}} object to the application rather > than logging it and dropping it. Applications may want to manipulate the > underlying {{Throwable}} / control its logging at a finer granularity. For > example, the app might want to call > {{Throwables.getRootCause(throwable).getMessage()}} and expose the message to > the app users. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24243) Expose exceptions from InProcessAppHandle
[ https://issues.apache.org/jira/browse/SPARK-24243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-24243. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23221 [https://github.com/apache/spark/pull/23221] > Expose exceptions from InProcessAppHandle > - > > Key: SPARK-24243 > URL: https://issues.apache.org/jira/browse/SPARK-24243 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Sahil Takiar >Assignee: Sahil Takiar >Priority: Major > Fix For: 3.0.0 > > > {{InProcessAppHandle}} runs {{SparkSubmit}} in a dedicated thread, any > exceptions thrown are logged and then the state is set to {{FAILED}}. It > would be nice to expose the {{Throwable}} object to the application rather > than logging it and dropping it. Applications may want to manipulate the > underlying {{Throwable}} / control its logging at a finer granularity. For > example, the app might want to call > {{Throwables.getRootCause(throwable).getMessage()}} and expose the message to > the app users. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713159#comment-16713159 ] Dongjoon Hyun commented on SPARK-26282: --- Thank you for sharing, [~shaneknapp]! > Update JVM to 8u191 on jenkins workers > -- > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... > long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713142#comment-16713142 ] shane knapp commented on SPARK-26282: - btw, all of the compile and lint jobs have been running on java 8 191 for the past couple of days, and are happy and green: [https://amplab.cs.berkeley.edu/jenkins/label/ubuntu/] > Update JVM to 8u191 on jenkins workers > -- > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... > long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26307) Fix CTAS when INSERT a partitioned table using Hive serde
[ https://issues.apache.org/jira/browse/SPARK-26307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713140#comment-16713140 ] Apache Spark commented on SPARK-26307: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/23255 > Fix CTAS when INSERT a partitioned table using Hive serde > - > > Key: SPARK-26307 > URL: https://issues.apache.org/jira/browse/SPARK-26307 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > > {code:java} > withTable("hive_test") { > withSQLConf( > "hive.exec.dynamic.partition.mode" -> "nonstrict") { > val df = Seq(("a", 100)).toDF("part", "id") > df.write.format("hive").partitionBy("part") > .mode("overwrite").saveAsTable("hive_test") > df.write.format("hive").partitionBy("part") > .mode("append").saveAsTable("hive_test") > } > }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26307) Fix CTAS when INSERT a partitioned table using Hive serde
[ https://issues.apache.org/jira/browse/SPARK-26307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26307: Assignee: Apache Spark (was: Xiao Li) > Fix CTAS when INSERT a partitioned table using Hive serde > - > > Key: SPARK-26307 > URL: https://issues.apache.org/jira/browse/SPARK-26307 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > {code:java} > withTable("hive_test") { > withSQLConf( > "hive.exec.dynamic.partition.mode" -> "nonstrict") { > val df = Seq(("a", 100)).toDF("part", "id") > df.write.format("hive").partitionBy("part") > .mode("overwrite").saveAsTable("hive_test") > df.write.format("hive").partitionBy("part") > .mode("append").saveAsTable("hive_test") > } > }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26307) Fix CTAS when INSERT a partitioned table using Hive serde
[ https://issues.apache.org/jira/browse/SPARK-26307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26307: Assignee: Xiao Li (was: Apache Spark) > Fix CTAS when INSERT a partitioned table using Hive serde > - > > Key: SPARK-26307 > URL: https://issues.apache.org/jira/browse/SPARK-26307 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > > {code:java} > withTable("hive_test") { > withSQLConf( > "hive.exec.dynamic.partition.mode" -> "nonstrict") { > val df = Seq(("a", 100)).toDF("part", "id") > df.write.format("hive").partitionBy("part") > .mode("overwrite").saveAsTable("hive_test") > df.write.format("hive").partitionBy("part") > .mode("append").saveAsTable("hive_test") > } > }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26267) Kafka source may reprocess data
[ https://issues.apache.org/jira/browse/SPARK-26267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26267: - Priority: Blocker (was: Major) > Kafka source may reprocess data > --- > > Key: SPARK-26267 > URL: https://issues.apache.org/jira/browse/SPARK-26267 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Priority: Blocker > Labels: correctness > > Due to KAFKA-7703, when the Kafka source tries to get the latest offset, it > may get an earliest offset, and then it will reprocess messages that have > been processed when it gets the correct latest offset in the next batch. > This usually happens when restarting a streaming query. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26267) Kafka source may reprocess data
[ https://issues.apache.org/jira/browse/SPARK-26267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26267: - Labels: correctness (was: ) > Kafka source may reprocess data > --- > > Key: SPARK-26267 > URL: https://issues.apache.org/jira/browse/SPARK-26267 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Priority: Blocker > Labels: correctness > > Due to KAFKA-7703, when the Kafka source tries to get the latest offset, it > may get an earliest offset, and then it will reprocess messages that have > been processed when it gets the correct latest offset in the next batch. > This usually happens when restarting a streaming query. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713134#comment-16713134 ] shane knapp commented on SPARK-26282: - test build now running: https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-maven-hadoop-2.7-java-8.191/1 > Update JVM to 8u191 on jenkins workers > -- > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... > long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26307) Fix CTAS when INSERT a partitioned table using Hive serde
Xiao Li created SPARK-26307: --- Summary: Fix CTAS when INSERT a partitioned table using Hive serde Key: SPARK-26307 URL: https://issues.apache.org/jira/browse/SPARK-26307 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 2.3.2 Reporter: Xiao Li Assignee: Xiao Li {code:java} withTable("hive_test") { withSQLConf( "hive.exec.dynamic.partition.mode" -> "nonstrict") { val df = Seq(("a", 100)).toDF("part", "id") df.write.format("hive").partitionBy("part") .mode("overwrite").saveAsTable("hive_test") df.write.format("hive").partitionBy("part") .mode("append").saveAsTable("hive_test") } }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713118#comment-16713118 ] shane knapp commented on SPARK-26282: - i'm waiting on someone from databricks to merge. i pinged the PR and it should hopefully happen today. after the new years, i am planning on moving these configs to the spark repo. > Update JVM to 8u191 on jenkins workers > -- > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... > long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713093#comment-16713093 ] Dongjoon Hyun commented on SPARK-26282: --- Hi, [~shaneknapp]. Is there any update on your PR to update the jenkins job? > Update JVM to 8u191 on jenkins workers > -- > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... > long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26283) When zstd compression is enabled, an in-progress application in the history server app UI shows a finished job as running
[ https://issues.apache.org/jira/browse/SPARK-26283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid updated SPARK-26283: --- Priority: Major (was: Minor) > When zstd compression is enabled, an in-progress application in the history > server app UI shows a finished job as running > - > > Key: SPARK-26283 > URL: https://issues.apache.org/jira/browse/SPARK-26283 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.4.0, 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > When zstd compression is enabled, an in-progress application in the history > server app UI shows a finished job as running
[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure
[ https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713030#comment-16713030 ] Mihaly Toth commented on SPARK-25331: - I have closed my PR. I guess it should be documented that we expect the user to read only files that have their name written to manifest files. > Structured Streaming File Sink duplicates records in case of driver failure > --- > > Key: SPARK-25331 > URL: https://issues.apache.org/jira/browse/SPARK-25331 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mihaly Toth >Priority: Major > > Let's assume {{FileStreamSink.addBatch}} is called, an appropriate job has > been started by {{FileFormatWriter.write}}, and the resulting task sets > complete, but in the meantime the driver dies. In such a case, repeating > {{FileStreamSink.addBatch}} will write the data twice. > In other words, if the driver fails after the executors start processing the > job, the processed batch will be written twice. > Steps needed: > # call {{FileStreamSink.addBatch}} > # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}} > # call {{FileStreamSink.addBatch}} with the same data > # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} > successfully > # Verify the file output - according to the {{Sink.addBatch}} documentation > the rdd should be written only once > I have created a wip PR with a unit test: > https://github.com/apache/spark/pull/22331
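The comment above says consumers should trust only files whose names appear in the sink's manifest. As a rough sketch of that contract (plain Python, not Spark code; the real metadata lives under a `_spark_metadata` directory whose exact on-disk layout is version-dependent, so the "version line followed by one JSON object per line with a `path` field" format assumed below is an approximation):

```python
import json
import os

def committed_files(metadata_dir):
    """Collect the set of output paths named in manifest files.

    Assumes each manifest is a text file whose first line is a version
    marker (e.g. "v1") followed by one JSON object per line carrying a
    "path" field -- an approximation of the sink metadata layout.
    """
    paths = set()
    for name in sorted(os.listdir(metadata_dir)):
        with open(os.path.join(metadata_dir, name)) as f:
            f.readline()                 # skip the version marker line
            for line in f:
                line = line.strip()
                if line:
                    paths.add(json.loads(line)["path"])
    return paths

def safe_to_read(output_files, metadata_dir):
    """Drop files not in the manifest, e.g. orphans from a retried batch."""
    manifest = committed_files(metadata_dir)
    return [p for p in output_files if p in manifest]
```

Under this reading, a duplicate file left behind by a re-run `addBatch` is harmless as long as readers filter the directory listing through `safe_to_read` rather than globbing everything.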
[jira] [Comment Edited] (SPARK-26305) Breakthrough the memory limitation of broadcast join
[ https://issues.apache.org/jira/browse/SPARK-26305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712991#comment-16712991 ] Dongjoon Hyun edited comment on SPARK-26305 at 12/7/18 3:43 PM: +1 for the issue. I'll take a look when the design doc is given. was (Author: dongjoon): +1 for the idea. > Breakthrough the memory limitation of broadcast join > > > Key: SPARK-26305 > URL: https://issues.apache.org/jira/browse/SPARK-26305 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Lantao Jin >Priority: Major > > If the join between a big table and a small one faces data skewing issue, we > usually use a broadcast hint in SQL to resolve it. However, current broadcast > join has many limitations. The primary restriction is memory. The small table > which is broadcasted must be fulfilled to memory in driver/executors side. > Although it will spill to disk when the memory is insufficient, it still > causes OOM if the small table actually is not absolutely small, it's > relatively small. In our company, we have many real big data SQL analysis > jobs which handle dozens of hundreds terabytes join and shuffle. For example, > the size of large table is 100TB, and the small one is 1 times less, > still 10GB. In this case, broadcast join couldn't be finished since the small > one is still larger than expected. If the join is data skewing, the sortmerge > join always failed. > Hive has a skew join hint which could trigger two-stage task to handle the > skew key and normal key separately. I guess Databricks Runtime has the > similar implementation. However, the skew join hint needs SQL users know the > data in table like their children. They must know which key is skewing in a > join. It's very hard to know since the data is changing day by day and the > join key isn't fixed in different queries. The users have to set a huge > partition number to try their luck. 
> So, do we have a simple, rude and efficient way to resolve it? Back to the > limitation, if the broadcasted table no needs to fill to memory, in other > words, driver/executor stores the broadcasted table to disk only. The problem > mentioned above could be resolved. > A new hint like BROADCAST_DISK or an additional parameter in original > BROADCAST hint will be introduced to cover this case. The original broadcast > behavior won’t be changed. > I will offer a design doc if you have same feeling about it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26305) Breakthrough the memory limitation of broadcast join
[ https://issues.apache.org/jira/browse/SPARK-26305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712991#comment-16712991 ] Dongjoon Hyun commented on SPARK-26305: --- +1 for the idea.
[jira] [Updated] (SPARK-26305) Breakthrough the memory limitation of broadcast join
[ https://issues.apache.org/jira/browse/SPARK-26305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-26305: --- Description: If the join between a big table and a small one faces data skewing issue, we usually use a broadcast hint in SQL to resolve it. However, current broadcast join has many limitations. The primary restriction is memory. The small table which is broadcasted must be fulfilled to memory in driver/executors side. Although it will spill to disk when the memory is insufficient, it still causes OOM if the small table actually is not absolutely small, it's relatively small. In our company, we have many real big data SQL analysis jobs which handle dozens of hundreds terabytes join and shuffle. For example, the size of large table is 100TB, and the small one is 1 times less, still 10GB. In this case, broadcast join couldn't be finished since the small one is still larger than expected. If the join is data skewing, the sortmerge join always failed. Hive has a skew join hint which could trigger two-stage task to handle the skew key and normal key separately. I guess Databricks Runtime has the similar implementation. However, the skew join hint needs SQL users know the data in table like their children. They must know which key is skewing in a join. It's very hard to know since the data is changing day by day and the join key isn't fixed in different queries. The users have to set a huge partition number to try their luck. So, do we have a simple, rude and efficient way to resolve it? Back to the limitation, if the broadcasted table no needs to fill to memory, in other words, driver/executor stores the broadcasted table to disk only. The problem mentioned above could be resolved. A new hint like BROADCAST_DISK or an additional parameter in original BROADCAST hint will be introduced to cover this case. The original broadcast behavior won’t be changed. I will offer a design doc if you have same feeling about it. 
was: If the join between a big table and a small one faces data skewing issue, we usually use a broadcast hint in SQL to resolve it. However, current broadcast join has many limitations. The primary restriction is memory. The small table which is broadcasted must be fulfilled to memory in driver/executors side. Although it will spill to disk when the memory is insufficient, it still causes OOM if the small table actually is not absolutely small, it's relatively small. In our company, we have many real big data SQL analysis jobs which handle dozens of hundreds terabytes join and shuffle. For example, the size of large table is 100TB, and the small one is 1 times less, still 10GB. In this case, broadcast join couldn't be finished since the small one is still larger than expected. If the join is data skewing, the sortmerge join always failed. Hive has a skew join hint which could trigger two-stage task to handle the skew key and normal key separately. I guess Databricks Runtime has the similar implementation. However, the skew join hint needs SQL users know the data in table like their children. They must know which key is skewing in a join. It's very hard to know since the data is changing day by day and the join key isn't fixed in different queries. The users have to set a huge partition number to try their luck. So, do we have a simple, rude and efficient way to resolve it? Back to the limitation, if the broadcasted table no needs to fill to memory, in other words, driver/executor stores the broadcasted table to disk only. The problem mentioned above could be resolved. I will offer a design doc if you have same feeling about it. 
> Breakthrough the memory limitation of broadcast join > > > Key: SPARK-26305 > URL: https://issues.apache.org/jira/browse/SPARK-26305 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Lantao Jin >Priority: Major
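The description above turns on why broadcast joins sidestep skew: the small side is materialized as an in-memory hash table (the part that must fit in memory and that this proposal would let spill to disk), and the big side is merely streamed past it, so a hot key never forces a shuffle onto one worker. A toy illustration (plain Python, not Spark's implementation):

```python
from collections import defaultdict

def broadcast_hash_join(big, small):
    """Toy build/probe join over (key, value) pairs.

    The hash table built from the small side is what must fit in memory
    when broadcasting; the big side is only streamed.  Skew in the big
    side's keys is harmless here, because no shuffle concentrates all
    rows for a hot key on a single worker.
    """
    table = defaultdict(list)
    for k, v in small:                 # build phase: the broadcast cost
        table[k].append(v)
    return [(k, bv, sv) for k, bv in big for sv in table.get(k, [])]

# A heavily skewed big side: almost every row shares one hot key.
big = [("hot", i) for i in range(5)] + [("rare", 99)]
small = [("hot", "H"), ("rare", "R")]
```

A sort-merge join of the same inputs would shuffle all five "hot" rows to one partition, which is the failure mode the issue describes.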
[jira] [Commented] (SPARK-26306) Flaky test: org.apache.spark.util.collection.SorterSuite
[ https://issues.apache.org/jira/browse/SPARK-26306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712975#comment-16712975 ] Gabor Somogyi commented on SPARK-26306: --- No idea, I've seen it only in PR builder and thought file it to help others. > Flaky test: org.apache.spark.util.collection.SorterSuite > > > Key: SPARK-26306 > URL: https://issues.apache.org/jira/browse/SPARK-26306 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > In PR builder the following issue appeared: > {code:java} > [info] org.apache.spark.util.collection.SorterSuite *** ABORTED *** (3 > seconds, 225 milliseconds) > [info] java.lang.OutOfMemoryError: Java heap space > [info] at > org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56) > [info] at > org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43) > [info] at > org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70) > [info] at > org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown > Source) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > [info] at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > [info] at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown > Source) > [info] at 
org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > [info] at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > [info] at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1560) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) > [info] at org.scalatest.FunSuiteLike$$Lambda$128/398936629.apply(Unknown > Source) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396) > [info] at org.scalatest.SuperEngine$$Lambda$129/1905082148.apply(Unknown > Source) > [info] at scala.collection.immutable.List.foreach(List.scala:388) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > [info] at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229) > [info] at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228) > [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > [info] at org.scalatest.Suite.run(Suite.scala:1147) > [info] at org.scalatest.Suite.run$(Suite.scala:1129) > [error] Uncaught exception when running > org.apache.spark.util.collection.SorterSuite: java.lang.OutOfMemoryError: > Java heap space > sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56) > at > org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43) > at > org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70) > at > org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown > Source) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at 
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown > Source) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at
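The abort above happens while `TestTimSort.createArray` builds the input that reproduces the classic TimSort bug. A back-of-the-envelope check (the exact array length SorterSuite uses is an assumption here; 2^26 is the order of magnitude of the known bug-triggering input) shows why a modest fork heap dies:

```python
# One Java int[] of 2**26 elements at 4 bytes per slot -- before any
# copies the sort itself makes -- already needs a quarter gigabyte.
length = 2 ** 26
bytes_needed = length * 4
mib = bytes_needed / (1 << 20)
# At ~256 MiB for a single array, a small sbt fork heap is easily
# exhausted, so raising the test JVM's -Xmx (or isolating the suite)
# is the usual remedy for this kind of flakiness.
```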
[jira] [Commented] (SPARK-26306) Flaky test: org.apache.spark.util.collection.SorterSuite
[ https://issues.apache.org/jira/browse/SPARK-26306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712970#comment-16712970 ] Liang-Chi Hsieh commented on SPARK-26306: - Besides the build above, is there any other build where this test fails?
[jira] [Updated] (SPARK-26306) Flaky test: org.apache.spark.util.collection.SorterSuite
[ https://issues.apache.org/jira/browse/SPARK-26306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-26306: -- Component/s: (was: Spark Core) Tests
[jira] [Created] (SPARK-26306) Flaky test: org.apache.spark.util.collection.SorterSuite
Gabor Somogyi created SPARK-26306: - Summary: Flaky test: org.apache.spark.util.collection.SorterSuite Key: SPARK-26306 URL: https://issues.apache.org/jira/browse/SPARK-26306 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Gabor Somogyi In PR builder the following issue appeared: {code:java} [info] org.apache.spark.util.collection.SorterSuite *** ABORTED *** (3 seconds, 225 milliseconds) [info] java.lang.OutOfMemoryError: Java heap space [info] at org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56) [info] at org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43) [info] at org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70){code}
[jira] [Commented] (SPARK-26265) deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator
[ https://issues.apache.org/jira/browse/SPARK-26265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712905#comment-16712905 ] qian han commented on SPARK-26265: -- Okay > deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator > -- > > Key: SPARK-26265 > URL: https://issues.apache.org/jira/browse/SPARK-26265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: qian han >Priority: Major > > The application is running on a cluster with 72000 cores and 182000G mem. > Environment: > |spark.dynamicAllocation.minExecutors|5| > |spark.dynamicAllocation.initialExecutors|30| > |spark.dynamicAllocation.maxExecutors|400| > |spark.executor.cores|4| > |spark.executor.memory|20g| > > > Stage description: > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:364) > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:422) > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:357) > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:193) > > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > java.lang.reflect.Method.invoke(Method.java:498) > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894) > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198) > org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228) > org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) > org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > > jstack information as follows: > Found one Java-level 
deadlock: = > "Thread-ScriptTransformation-Feed": waiting to lock monitor > 0x00e0cb18 (object 0x0002f1641538, a > org.apache.spark.memory.TaskMemoryManager), which is held by "Executor task > launch worker for task 18899" "Executor task launch worker for task 18899": > waiting to lock monitor 0x00e09788 (object 0x000302faa3b0, a > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator), which is held by > "Thread-ScriptTransformation-Feed" Java stack information for the threads > listed above: === > "Thread-ScriptTransformation-Feed": at > org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:332) > - waiting to lock <0x0002f1641538> (a > org.apache.spark.memory.TaskMemoryManager) at > org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:130) at > org.apache.spark.unsafe.map.BytesToBytesMap.access$300(BytesToBytesMap.java:66) > at > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.advanceToNextPage(BytesToBytesMap.java:274) > - locked <0x000302faa3b0> (a > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator) at > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.next(BytesToBytesMap.java:313) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap$1.next(UnsafeFixedWidthAggregationMap.java:173) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at > scala.collection.Iterator$class.foreach(Iterator.scala:893) at > scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at > org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply$mcV$sp(ScriptTransformationExec.scala:281) > at > 
org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformationExec.scala:270) > at > org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformationExec.scala:270) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1995) at > org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.run(ScriptTransformationExec.scala:270) > "Executor task launch worker for task 18899": at > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.spill(BytesToBytesMap.java:345) > - waiting to lock <0x000302faa3b0> (a >
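The jstack output above shows a classic lock-order inversion: the ScriptTransformation feed thread holds the `BytesToBytesMap$MapIterator` monitor while waiting for the `TaskMemoryManager` monitor, and the task thread holds them in the opposite order. As an illustration only (this is not Spark's code), a minimal Python sketch of the usual remedy, acquiring both monitors in a single global order so neither thread can wait while holding the other's lock:

```python
import threading

# Hypothetical stand-ins for the two monitors in the jstack report:
# 'memory_manager' plays TaskMemoryManager, 'map_iterator' plays
# BytesToBytesMap$MapIterator.
memory_manager = threading.Lock()
map_iterator = threading.Lock()

def free_page():
    # Both code paths take the locks in the same global order
    # (memory_manager first), which rules out the inversion.
    with memory_manager:
        with map_iterator:
            return "freed"

def spill():
    with memory_manager:
        with map_iterator:
            return "spilled"

t1 = threading.Thread(target=free_page)
t2 = threading.Thread(target=spill)
t1.start(); t2.start()
t1.join(timeout=5); t2.join(timeout=5)
# Neither thread is still blocked, i.e. no deadlock occurred.
assert not t1.is_alive() and not t2.is_alive()
```

The sketch only demonstrates the lock-ordering discipline; the actual fix in Spark would have to be made inside `freePage`/`spill` without holding the iterator monitor across the memory-manager call.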
[jira] [Commented] (SPARK-25401) Reorder the required ordering to match the table's output ordering for bucket join
[ https://issues.apache.org/jira/browse/SPARK-25401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712873#comment-16712873 ] Wang, Gang commented on SPARK-25401: Yeah. I think so. And please make sure the outputOrdering of SortMergeJoin is aligned with the reordered keys. > Reorder the required ordering to match the table's output ordering for bucket > join > -- > > Key: SPARK-25401 > URL: https://issues.apache.org/jira/browse/SPARK-25401 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wang, Gang >Priority: Major > > Currently, we check if SortExec is needed between an operator and its child > operator in method orderingSatisfies, and method orderingSatisfies requires > that the orders in the SortOrders are all the same. > However, take the following case into consideration. > * Table a is bucketed by (a1, a2), sorted by (a2, a1), and buckets number is > 200. > * Table b is bucketed by (b1, b2), sorted by (b2, b1), and buckets number is > 200. > * Table a join table b on (a1=b1, a2=b2) > In this case, if the join is sort merge join, the query planner won't add > exchange on both sides, but sort will be added on both sides. Actually, > sort is also unnecessary, since in the same bucket, like bucket 1 of table a, > and bucket 1 of table b, (a1=b1, a2=b2) is equivalent to (a2=b2, a1=b1). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
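For reference, the scenario in the ticket can be set up with Spark SQL DDL along these lines (table names, column types, and the parquet source are illustrative, not taken from the ticket):

```sql
-- Table a sorted (a2, a1), table b sorted (b2, b1), both with 200 buckets.
CREATE TABLE a (a1 INT, a2 INT)
  USING parquet
  CLUSTERED BY (a1, a2) SORTED BY (a2, a1) INTO 200 BUCKETS;

CREATE TABLE b (b1 INT, b2 INT)
  USING parquet
  CLUSTERED BY (b1, b2) SORTED BY (b2, b1) INTO 200 BUCKETS;

-- The planner skips the exchange but currently still adds a sort on both
-- sides, even though reordering the join keys to (a2=b2, a1=b1) would
-- match the existing on-disk sort order of each bucket.
SELECT * FROM a JOIN b ON a.a1 = b.b1 AND a.a2 = b.b2;
```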
[jira] [Commented] (SPARK-26305) Breakthrough the memory limitation of broadcast join
[ https://issues.apache.org/jira/browse/SPARK-26305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712855#comment-16712855 ] Lantao Jin commented on SPARK-26305: CC [~jiangxb1987] [~cloud_fan] [~dongjoon] [~hyukjin.kwon], thoughts? > Breakthrough the memory limitation of broadcast join > > > Key: SPARK-26305 > URL: https://issues.apache.org/jira/browse/SPARK-26305 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Lantao Jin >Priority: Major > > If the join between a big table and a small one faces a data skew issue, we > usually use a broadcast hint in SQL to resolve it. However, the current broadcast > join has many limitations. The primary restriction is memory. The small table > which is broadcasted must fit entirely in memory on the driver/executor side. > Although it will spill to disk when memory is insufficient, it still > causes OOM if the small table is not absolutely small, only > relatively small. In our company, we have many real big data SQL analysis > jobs which join and shuffle hundreds of terabytes. For example, > the size of the large table is 100TB, and the small one is 10000 times less, > still 10GB. In this case, the broadcast join couldn't be finished since the small > one is still larger than expected. If the join is data-skewed, the sort-merge > join always fails. > Hive has a skew join hint which can trigger a two-stage task to handle the > skew keys and normal keys separately. I guess Databricks Runtime has a > similar implementation. However, the skew join hint requires SQL users to know the > data in the table like their children. They must know which key is skewed in a > join. That is very hard to know since the data changes day by day and the > join key isn't fixed across different queries. The users have to set a huge > partition number to try their luck. > So, do we have a simple, crude and efficient way to resolve it? Back to the > limitation, if the broadcasted table need not fit in memory, in other > words, if the driver/executor stores the broadcasted table on disk only, the problem > mentioned above could be resolved. > I will offer a design doc if you have the same feeling about it.
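For context, the broadcast hint the ticket refers to looks like this in Spark SQL (table and column names are illustrative); the proposal is to let the broadcasted side live on disk so the hint keeps working when the small table exceeds available memory:

```sql
-- Force a broadcast of the smaller side; today this requires the
-- broadcasted table to fit in driver/executor memory.
SELECT /*+ BROADCAST(s) */ b.key, b.value, s.dim
FROM big_table b
JOIN small_table s ON b.key = s.key;
```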
[jira] [Created] (SPARK-26305) Breakthrough the memory limitation of broadcast join
Lantao Jin created SPARK-26305: -- Summary: Breakthrough the memory limitation of broadcast join Key: SPARK-26305 URL: https://issues.apache.org/jira/browse/SPARK-26305 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Lantao Jin If the join between a big table and a small one faces a data skew issue, we usually use a broadcast hint in SQL to resolve it. However, the current broadcast join has many limitations. The primary restriction is memory. The small table which is broadcasted must fit entirely in memory on the driver/executor side. Although it will spill to disk when memory is insufficient, it still causes OOM if the small table is not absolutely small, only relatively small. In our company, we have many real big data SQL analysis jobs which join and shuffle hundreds of terabytes. For example, the size of the large table is 100TB, and the small one is 10000 times less, still 10GB. In this case, the broadcast join couldn't be finished since the small one is still larger than expected. If the join is data-skewed, the sort-merge join always fails. Hive has a skew join hint which can trigger a two-stage task to handle the skew keys and normal keys separately. I guess Databricks Runtime has a similar implementation. However, the skew join hint requires SQL users to know the data in the table like their children. They must know which key is skewed in a join. That is very hard to know since the data changes day by day and the join key isn't fixed across different queries. The users have to set a huge partition number to try their luck. So, do we have a simple, crude and efficient way to resolve it? Back to the limitation, if the broadcasted table need not fit in memory, in other words, if the driver/executor stores the broadcasted table on disk only, the problem mentioned above could be resolved. I will offer a design doc if you have the same feeling about it.
[jira] [Commented] (SPARK-26254) Move delegation token providers into a separate project
[ https://issues.apache.org/jira/browse/SPARK-26254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712843#comment-16712843 ] Steve Loughran commented on SPARK-26254: maybe ask the Kafka people for opinions [~jkreps] can probably nominate someone bq. token-providers provided dependency to kafka-sql project => It's kinda' weird but at the moment looks the least problematic probably makes sense then > Move delegation token providers into a separate project > --- > > Key: SPARK-26254 > URL: https://issues.apache.org/jira/browse/SPARK-26254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > There was a discussion in > [PR#22598|https://github.com/apache/spark/pull/22598] that there are several > provided dependencies inside core project which shouldn't be there (for ex. > hive and kafka). This jira is to solve this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26266) Update to Scala 2.12.8
[ https://issues.apache.org/jira/browse/SPARK-26266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26266: -- Docs Text: Use Spark with the latest maintenance release of Java, for security and bug fixes, and to ensure compatibility with Scala. Labels: release-notes (was: ) > Update to Scala 2.12.8 > -- > > Key: SPARK-26266 > URL: https://issues.apache.org/jira/browse/SPARK-26266 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Yuming Wang >Priority: Minor > Labels: release-notes > > [~yumwang] notes that Scala 2.12.8 is out and fixes two minor issues: > Don't reject views with result types which are TypeVars (#7295) > Don't emit static forwarders (which simplify the use of methods in top-level > objects from Java) for bridge methods (#7469) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25401) Reorder the required ordering to match the table's output ordering for bucket join
[ https://issues.apache.org/jira/browse/SPARK-25401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712350#comment-16712350 ] David Vrba edited comment on SPARK-25401 at 12/7/18 10:59 AM: -- I was looking at it and I believe that in the class EnsureRequirements we could reorder the join predicates for SortMergeJoin once more - just before we check if child outputOrdering satisfies the requiredOrdering - and we can align the predicate keys with the child outputOrdering. In such a case it is not going to add the unnecessary SortExec and also it is not going to add an unnecessary Exchange either, because Exchange is handled before. What do you guys think? Is it a good approach? (Please be patient with me, this is my first Jira on Spark) was (Author: vrbad): I was looking at it and i believe that it the class EnsureRequirements we could reorder the join predicates for SortMergeJoin once more - just before we check if child outputOrdering satisfies the requiredOrdering - and we can align the predicate keys with the child outputOrdering. In such case it is not going to add the unnecessary SortExec and also it is not going to add unnecessary Exchange either, because Exchange is handled before. What do you guys think? Is it a good approach? (Please be patient with me, this is my first Jira on Spark) > Reorder the required ordering to match the table's output ordering for bucket > join > -- > > Key: SPARK-25401 > URL: https://issues.apache.org/jira/browse/SPARK-25401 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wang, Gang >Priority: Major > > Currently, we check if SortExec is needed between an operator and its child > operator in method orderingSatisfies, and method orderingSatisfies requires > that the orders in the SortOrders are all the same. > However, take the following case into consideration. > * Table a is bucketed by (a1, a2), sorted by (a2, a1), and buckets number is > 200. > * Table b is bucketed by (b1, b2), sorted by (b2, b1), and buckets number is > 200. > * Table a join table b on (a1=b1, a2=b2) > In this case, if the join is sort merge join, the query planner won't add > exchange on both sides, but sort will be added on both sides. Actually, > sort is also unnecessary, since in the same bucket, like bucket 1 of table a, > and bucket 1 of table b, (a1=b1, a2=b2) is equivalent to (a2=b2, a1=b1).
[jira] [Assigned] (SPARK-26304) Add default value to spark.kafka.sasl.kerberos.service.name parameter
[ https://issues.apache.org/jira/browse/SPARK-26304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26304: Assignee: (was: Apache Spark) > Add default value to spark.kafka.sasl.kerberos.service.name parameter > - > > Key: SPARK-26304 > URL: https://issues.apache.org/jira/browse/SPARK-26304 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > The reasoning behind: > * Kafka's configuration guide suggests the same value: > https://kafka.apache.org/documentation/#security_sasl_kerberos_brokerconfig > * It would be easier for Spark users by requiring less configuration > * Other streaming engines do the same
[jira] [Commented] (SPARK-26304) Add default value to spark.kafka.sasl.kerberos.service.name parameter
[ https://issues.apache.org/jira/browse/SPARK-26304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712639#comment-16712639 ] Apache Spark commented on SPARK-26304: -- User 'gaborgsomogyi' has created a pull request for this issue: https://github.com/apache/spark/pull/23254 > Add default value to spark.kafka.sasl.kerberos.service.name parameter > - > > Key: SPARK-26304 > URL: https://issues.apache.org/jira/browse/SPARK-26304 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > The reasoning behind: > * Kafka's configuration guide suggests the same value: > https://kafka.apache.org/documentation/#security_sasl_kerberos_brokerconfig > * It would be easier for Spark users by requiring less configuration > * Other streaming engines do the same
[jira] [Assigned] (SPARK-26304) Add default value to spark.kafka.sasl.kerberos.service.name parameter
[ https://issues.apache.org/jira/browse/SPARK-26304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26304: Assignee: Apache Spark > Add default value to spark.kafka.sasl.kerberos.service.name parameter > - > > Key: SPARK-26304 > URL: https://issues.apache.org/jira/browse/SPARK-26304 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Apache Spark >Priority: Major > > The reasoning behind: > * Kafka's configuration guide suggests the same value: > https://kafka.apache.org/documentation/#security_sasl_kerberos_brokerconfig > * It would be easier for Spark users by requiring less configuration > * Other streaming engines do the same
[jira] [Created] (SPARK-26304) Add default value to spark.kafka.sasl.kerberos.service.name parameter
Gabor Somogyi created SPARK-26304: - Summary: Add default value to spark.kafka.sasl.kerberos.service.name parameter Key: SPARK-26304 URL: https://issues.apache.org/jira/browse/SPARK-26304 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Gabor Somogyi The reasoning behind: * Kafka's configuration guide suggests the same value: https://kafka.apache.org/documentation/#security_sasl_kerberos_brokerconfig * It would be easier for Spark users by requiring less configuration * Other streaming engines do the same
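Until a default ships, users have to set the value explicitly. With the proposed change, the line below would become unnecessary (the value `kafka` matches the service name Kafka's own SASL/Kerberos documentation suggests; check the referenced guide for your broker's actual principal name):

```properties
spark.kafka.sasl.kerberos.service.name=kafka
```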
[jira] [Updated] (SPARK-26290) [K8s] Driver Pods no mounted volumes on submissions from older spark versions
[ https://issues.apache.org/jira/browse/SPARK-26290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Buchleitner updated SPARK-26290: --- Environment: Kubernetes: 1.10.6 Container: Spark 2.4.0 Spark containers are built from the archive served by [www.apache.org/dist/spark/|http://www.apache.org/dist/spark/] Submission done by older Spark versions integrated e.g. in Livy was: Kuberentes 1.10.6 Spark 2.4.0 Spark containers are built from the archive served by [www.apache.org/dist/spark/|http://www.apache.org/dist/spark/] Description: I want to use the volume feature to mount an existing PVC as a read-only volume into the driver and also the executor. The executor gets the PVC mounted, but the driver is missing the mount {code:java} /opt/spark/bin/spark-submit \ --deploy-mode cluster \ --class org.apache.spark.examples.SparkPi \ --conf spark.app.name=spark-pi \ --conf spark.executor.instances=4 \ --conf spark.kubernetes.namespace=spark-demo \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ --conf spark.kubernetes.container.image.pullPolicy=Always \ --conf spark.kubernetes.container.image=kube-spark:2.4.0 \ --conf spark.master=k8s://https:// \ --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.mount.path=/srv \ --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.mount.readOnly=true \ --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.options.claimName=nfs-pvc \ --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/srv \ --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=true \ --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=nfs-pvc \ /srv/spark-examples_2.11-2.4.0.jar {code} When I use the jar included in the container {code:java} local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar {code} the call works and I can inspect the pod descriptions to review the behavior *Driver 
description* {code:java} Name: spark-pi-1544018157391-driver [...] Containers: spark-kubernetes-driver: Container ID: docker://3a31d867c140183247cb296e13a8b35d03835f7657dd7e625c59083024e51e28 Image: kube-spark:2.4.0 Image ID: [...] Port: Host Port: State: Terminated Reason: Completed Exit Code:0 Started: Wed, 05 Dec 2018 14:55:59 +0100 Finished: Wed, 05 Dec 2018 14:56:08 +0100 Ready: False Restart Count: 0 Limits: memory: 1408Mi Requests: cpu: 1 memory: 1Gi Environment: SPARK_DRIVER_MEMORY:1g SPARK_DRIVER_CLASS: org.apache.spark.examples.SparkPi SPARK_DRIVER_ARGS: SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP) SPARK_MOUNTED_CLASSPATH: /opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar SPARK_JAVA_OPT_1: -Dspark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/srv SPARK_JAVA_OPT_3: -Dspark.app.name=spark-pi SPARK_JAVA_OPT_4: -Dspark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.mount.path=/srv SPARK_JAVA_OPT_5: -Dspark.submit.deployMode=cluster SPARK_JAVA_OPT_6: -Dspark.driver.blockManager.port=7079 SPARK_JAVA_OPT_7: -Dspark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.mount.readOnly=true SPARK_JAVA_OPT_8: -Dspark.kubernetes.authenticate.driver.serviceAccountName=spark SPARK_JAVA_OPT_9: -Dspark.driver.host=spark-pi-1544018157391-driver-svc.spark-demo.svc.cluster.local SPARK_JAVA_OPT_10: -Dspark.kubernetes.driver.pod.name=spark-pi-1544018157391-driver SPARK_JAVA_OPT_11: -Dspark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.options.claimName=nfs-pvc SPARK_JAVA_OPT_12: -Dspark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=true SPARK_JAVA_OPT_13: -Dspark.driver.port=7078 SPARK_JAVA_OPT_14: -Dspark.jars=/opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar SPARK_JAVA_OPT_15: -Dspark.kubernetes.executor.podNamePrefix=spark-pi-1544018157391 SPARK_JAVA_OPT_16: -Dspark.local.dir=/tmp/spark-local SPARK_JAVA_OPT_17: -Dspark.master=k8s://https:// SPARK_JAVA_OPT_18: 
-Dspark.app.id=spark-89420bd5fa8948c3aa9d14a4eb6ecfca SPARK_JAVA_OPT_19: -Dspark.kubernetes.namespace=spark-demo SPARK_JAVA_OPT_21: -Dspark.executor.instances=4 SPARK_JAVA_OPT_22: -Dspark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=nfs-pvc SPARK_JAVA_OPT_23: -Dspark.kubernetes.container.image=kube-spark:2.4.0 SPARK_JAVA_OPT_24:
[jira] [Assigned] (SPARK-26303) Return partial results for bad JSON records
[ https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26303: Assignee: (was: Apache Spark) > Return partial results for bad JSON records > --- > > Key: SPARK-26303 > URL: https://issues.apache.org/jira/browse/SPARK-26303 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Currently, the JSON datasource and JSON functions return a row with all nulls for a > malformed JSON string in the PERMISSIVE mode when the specified schema has the > struct type. All nulls are returned even if some of the fields were parsed and > converted to the desired types successfully. The ticket aims to solve the problem > by returning the already parsed fields. The corrupt-record column specified via the JSON > option `columnNameOfCorruptRecord` or SQL config should contain the whole > original JSON string. > For example, if the input has one JSON string: > {code:json} > {"a":0.1,"b":{},"c":"def"} > {code} > and the specified schema is: > {code:sql} > a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING > {code} > the expected output of `from_json` in the PERMISSIVE mode is: > {code} > +---+----+---+--------------------------+ > |a |b |c |_corrupt_record | > +---+----+---+--------------------------+ > |0.1|null|def|{"a":0.1,"b":{},"c":"def"}| > +---+----+---+--------------------------+ > {code}
[jira] [Created] (SPARK-26303) Return partial results for bad JSON records
Maxim Gekk created SPARK-26303: -- Summary: Return partial results for bad JSON records Key: SPARK-26303 URL: https://issues.apache.org/jira/browse/SPARK-26303 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Currently, the JSON datasource and JSON functions return a row with all nulls for a malformed JSON string in the PERMISSIVE mode when the specified schema has the struct type. All nulls are returned even if some of the fields were parsed and converted to the desired types successfully. The ticket aims to solve the problem by returning the already parsed fields. The corrupt-record column specified via the JSON option `columnNameOfCorruptRecord` or SQL config should contain the whole original JSON string. For example, if the input has one JSON string: {code:json} {"a":0.1,"b":{},"c":"def"} {code} and the specified schema is: {code:sql} a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING {code} the expected output of `from_json` in the PERMISSIVE mode is: {code} +---+----+---+--------------------------+ |a |b |c |_corrupt_record | +---+----+---+--------------------------+ |0.1|null|def|{"a":0.1,"b":{},"c":"def"}| +---+----+---+--------------------------+ {code}
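The partial-result semantics the ticket asks for can be sketched in plain Python (an illustration of the desired behavior, not Spark's implementation; the function name and schema encoding are invented for this example): parse the record, keep each field that converts cleanly, null out the ones that do not, and fill the corrupt-record column only when something failed.

```python
import json

def parse_permissive(record, schema, corrupt_col="_corrupt_record"):
    """Parse one JSON record against a {name: sql_type} schema,
    keeping successfully converted fields (PERMISSIVE-style)."""
    casts = {"DOUBLE": float, "STRING": str, "ARRAY": list}
    row = {name: None for name in schema}
    row[corrupt_col] = None
    try:
        data = json.loads(record)
    except ValueError:
        row[corrupt_col] = record  # nothing parsed at all
        return row
    for name, sql_type in schema.items():
        value = data.get(name)
        if isinstance(value, casts[sql_type]):
            row[name] = value      # partial result: keep good fields
        else:
            row[corrupt_col] = record  # remember the raw string
    return row

schema = {"a": "DOUBLE", "b": "ARRAY", "c": "STRING"}
row = parse_permissive('{"a":0.1,"b":{},"c":"def"}', schema)
# 'a' and 'c' survive; 'b' is null because {} is not an array, and the
# whole original string lands in _corrupt_record, matching the
# expected output table in the ticket.
assert row["a"] == 0.1 and row["c"] == "def" and row["b"] is None
assert row["_corrupt_record"] == '{"a":0.1,"b":{},"c":"def"}'
```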
[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records
[ https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712603#comment-16712603 ] Apache Spark commented on SPARK-26303: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23253 > Return partial results for bad JSON records > --- > > Key: SPARK-26303 > URL: https://issues.apache.org/jira/browse/SPARK-26303 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Currently, the JSON datasource and JSON functions return a row with all nulls for a > malformed JSON string in the PERMISSIVE mode when the specified schema has the > struct type. All nulls are returned even if some of the fields were parsed and > converted to the desired types successfully. The ticket aims to solve the problem > by returning the already parsed fields. The corrupt-record column specified via the JSON > option `columnNameOfCorruptRecord` or SQL config should contain the whole > original JSON string. > For example, if the input has one JSON string: > {code:json} > {"a":0.1,"b":{},"c":"def"} > {code} > and the specified schema is: > {code:sql} > a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING > {code} > the expected output of `from_json` in the PERMISSIVE mode is: > {code} > +---+----+---+--------------------------+ > |a |b |c |_corrupt_record | > +---+----+---+--------------------------+ > |0.1|null|def|{"a":0.1,"b":{},"c":"def"}| > +---+----+---+--------------------------+ > {code}
[jira] [Assigned] (SPARK-26303) Return partial results for bad JSON records
[ https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26303: Assignee: Apache Spark > Return partial results for bad JSON records > --- > > Key: SPARK-26303 > URL: https://issues.apache.org/jira/browse/SPARK-26303 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Currently, the JSON datasource and JSON functions return a row with all nulls for a > malformed JSON string in the PERMISSIVE mode when the specified schema has the > struct type. All nulls are returned even if some of the fields were parsed and > converted to the desired types successfully. The ticket aims to solve the problem > by returning the already parsed fields. The corrupt-record column specified via the JSON > option `columnNameOfCorruptRecord` or SQL config should contain the whole > original JSON string. > For example, if the input has one JSON string: > {code:json} > {"a":0.1,"b":{},"c":"def"} > {code} > and the specified schema is: > {code:sql} > a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING > {code} > the expected output of `from_json` in the PERMISSIVE mode is: > {code} > +---+----+---+--------------------------+ > |a |b |c |_corrupt_record | > +---+----+---+--------------------------+ > |0.1|null|def|{"a":0.1,"b":{},"c":"def"}| > +---+----+---+--------------------------+ > {code}
[jira] [Created] (SPARK-26302) retainedBatches configuration can cause memory leak
Behroz Sikander created SPARK-26302: --- Summary: retainedBatches configuration can cause memory leak Key: SPARK-26302 URL: https://issues.apache.org/jira/browse/SPARK-26302 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 2.4.0 Reporter: Behroz Sikander Attachments: heap_dump_detail.png The documentation for the configuration "spark.streaming.ui.retainedBatches" says "How many batches the Spark Streaming UI and status APIs remember before garbage collecting" The default for this configuration is 1000. From our experience, the documentation is incomplete and we found it out the hard way. The size of a single BatchUIData is around 750KB. Increasing this value to something like 5000 increases the total size to ~4GB. If your driver heap is not big enough, the job starts to slow down, has frequent GCs and has long scheduling delays. Once the heap is full, the job cannot be recovered. A note of caution should be added to the documentation to let users know the impact of this seemingly harmless configuration property.
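A back-of-the-envelope check of the ticket's numbers (the 750 KB per-batch figure is taken from the report; the helper name is invented for this sketch): raising spark.streaming.ui.retainedBatches from 1000 to 5000 grows the retained UI state from well under 1 GB to several GB of driver heap, roughly consistent with the ~4 GB the reporter observed.

```python
# Each retained BatchUIData is roughly 750 KB (per the report).
batch_size_kb = 750

def retained_gb(batches):
    """Approximate driver-heap usage of the retained batch UI data."""
    return batches * batch_size_kb / (1024 * 1024)

assert round(retained_gb(1000), 2) == 0.72   # default: ~0.7 GB
assert 3.5 < retained_gb(5000) < 4.0         # 5000 batches: ~3.6 GB
```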
[jira] [Commented] (SPARK-26302) retainedBatches configuration can cause memory leak
[ https://issues.apache.org/jira/browse/SPARK-26302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712559#comment-16712559 ] Behroz Sikander commented on SPARK-26302: - I am willing to do a PR for the documentation once someone gives a go-ahead. > retainedBatches configuration can cause memory leak > --- > > Key: SPARK-26302 > URL: https://issues.apache.org/jira/browse/SPARK-26302 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.4.0 >Reporter: Behroz Sikander >Priority: Minor > Attachments: heap_dump_detail.png > > > The documentation for the configuration "spark.streaming.ui.retainedBatches" says > "How many batches the Spark Streaming UI and status APIs remember before > garbage collecting" > The default for this configuration is 1000. > From our experience, the documentation is incomplete and we found it out the hard > way. > The size of a single BatchUIData is around 750KB. Increasing this value to > something like 5000 increases the total size to ~4GB. > If your driver heap is not big enough, the job starts to slow down, has > frequent GCs and has long scheduling delays. Once the heap is full, the job > cannot be recovered. > A note of caution should be added to the documentation to let users know the > impact of this seemingly harmless configuration property.
[jira] [Updated] (SPARK-26302) retainedBatches configuration can cause memory leak
[ https://issues.apache.org/jira/browse/SPARK-26302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Behroz Sikander updated SPARK-26302: Attachment: heap_dump_detail.png > retainedBatches configuration can cause memory leak > --- > > Key: SPARK-26302 > URL: https://issues.apache.org/jira/browse/SPARK-26302 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.4.0 >Reporter: Behroz Sikander >Priority: Minor > Attachments: heap_dump_detail.png > > > The documentation for the configuration "spark.streaming.ui.retainedBatches" says > "How many batches the Spark Streaming UI and status APIs remember before > garbage collecting". > The default for this configuration is 1000. > From our experience, the documentation is incomplete and we found it out the hard > way. > The size of a single BatchUIData is around 750KB. Increasing this value to > something like 5000 increases the total size to ~4GB. > If your driver heap is not big enough, the job starts to slow down, with > frequent GCs and long scheduling delays. Once the heap is full, the job > cannot be recovered. > A note of caution should be added to the documentation to let users know the > impact of this seemingly harmless configuration property.
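The arithmetic behind the warning in this ticket can be sketched in a few lines. Note the ~750 KB per-batch figure comes from the reporter's heap dump; the actual size of a BatchUIData will vary with the job, so treat this as a rough estimate only:

```python
# Rough driver-heap estimate for spark.streaming.ui.retainedBatches.
# BATCH_UI_DATA_KB is the observed size of one BatchUIData from the
# reporter's heap dump; real jobs may retain more or less per batch.

BATCH_UI_DATA_KB = 750

def retained_batches_heap_gb(retained_batches: int) -> float:
    """Approximate heap retained by the Streaming UI, in GiB."""
    return retained_batches * BATCH_UI_DATA_KB / (1024 * 1024)

# Default of 1000 batches already costs roughly 0.7 GiB of driver heap.
print(f"1000 batches -> {retained_batches_heap_gb(1000):.2f} GiB")
# The ticket's example of 5000 batches lands near the ~4 GB reported.
print(f"5000 batches -> {retained_batches_heap_gb(5000):.2f} GiB")
```

This is the kind of back-of-the-envelope check the proposed documentation note would let users do before raising the setting on a small driver heap.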
[jira] [Updated] (SPARK-26295) [K8S] serviceAccountName is not set in client mode
[ https://issues.apache.org/jira/browse/SPARK-26295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Tanase updated SPARK-26295: -- Description: When deploying Spark apps in client mode (in my case from inside the driver pod), one can't specify the service account in accordance with the docs ([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac]). The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is most likely applied in cluster mode only, which would be consistent with {{spark.kubernetes.authenticate.driver}} being the cluster-mode prefix. We should either inject the service account specified by this property into the client-mode pods, or add an equivalent config: {{spark.kubernetes.authenticate.serviceAccountName}}. This is the exception: {noformat} Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "..." is forbidden: User "system:serviceaccount:mynamespace:default" cannot get pods in the namespace "mynamespace"{noformat} The expectation was to see the user {{mynamespace:spark}} based on my submit command. My current workaround is to create a clusterrolebinding with edit rights for the mynamespace:default account. was: When deploying Spark apps in client mode (in my case from inside the driver pod), one can't specify the service account in accordance with the docs ([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac]). The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is most likely applied in cluster mode only, which would be consistent with spark.kubernetes.authenticate.driver being the cluster-mode prefix. We should either inject the service account specified by this property into the client-mode pods, or add an equivalent config: spark.kubernetes.authenticate.serviceAccountName This is the exception: {noformat} Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "..." is forbidden: User "system:serviceaccount:mynamespace:default" cannot get pods in the namespace "mynamespace"{noformat} The expectation was to see the user {{mynamespace:spark}} based on my submit command. My current workaround is to create a clusterrolebinding with edit rights for the mynamespace:default account. > [K8S] serviceAccountName is not set in client mode > -- > > Key: SPARK-26295 > URL: https://issues.apache.org/jira/browse/SPARK-26295 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Adrian Tanase >Priority: Major > > When deploying Spark apps in client mode (in my case from inside the driver > pod), one can't specify the service account in accordance with the docs > ([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac]). > The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is > most likely applied in cluster mode only, which would be consistent with > {{spark.kubernetes.authenticate.driver}} being the cluster-mode prefix. > We should either inject the service account specified by this property into the > client-mode pods, or add an equivalent config: > {{spark.kubernetes.authenticate.serviceAccountName}}. > This is the exception: > {noformat} > Message: Forbidden!Configured service account doesn't have access. Service > account may have been revoked. pods "..." is forbidden: User > "system:serviceaccount:mynamespace:default" cannot get pods in the namespace > "mynamespace"{noformat} > The expectation was to see the user {{mynamespace:spark}} based on my submit > command. > My current workaround is to create a clusterrolebinding with edit rights for > the mynamespace:default account.
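The interim workaround the reporter describes, granting edit rights to the namespace's default service account, can be expressed as a standard RBAC manifest. This is a sketch of that workaround, not a recommended long-term setup: the binding name is arbitrary, and {{mynamespace}} is the example namespace from the report.

```yaml
# ClusterRoleBinding granting the built-in "edit" ClusterRole to the
# default service account in "mynamespace", so a client-mode driver
# running as that account can manage executor pods.
# "spark-default-edit" is an arbitrary name chosen for this sketch.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-default-edit
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- kind: ServiceAccount
  name: default
  namespace: mynamespace
```

A dedicated service account with a namespaced Role limited to pod operations would be tighter than binding "edit" cluster-wide, which is why the ticket asks for a proper client-mode serviceAccountName setting instead.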
[jira] [Commented] (SPARK-26295) [K8S] serviceAccountName is not set in client mode
[ https://issues.apache.org/jira/browse/SPARK-26295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712493#comment-16712493 ] Adrian Tanase commented on SPARK-26295: --- [~vanzin] I'm not sure how it applies. I'd be happy to give that a shot, except that, as I also commented on that PR, I can't see how kubectl and the kube context are relevant in client mode, where spark-submit is called from Docker, inside the driver pod. If there is code that propagates the kube context along this path, I'm not aware of it and would love to see some documentation: {noformat} laptop with kubectl and context > k apply -f spark-driver-client-mode.yaml -> deployment starts 1 instance of driver pod in arbitrary namespace -> spark submit from start.sh inside the docker container -> ... {noformat} Also, I'd rather not make any assumptions about "implicit" configuration that may vary from computer to computer. Ideally the yaml templates are self-sufficient (including config maps, env vars, etc.) and, apart from your cluster credentials, you shouldn't need anything else on your machine. Thanks for looking at the issue. > [K8S] serviceAccountName is not set in client mode > -- > > Key: SPARK-26295 > URL: https://issues.apache.org/jira/browse/SPARK-26295 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Adrian Tanase >Priority: Major > > When deploying Spark apps in client mode (in my case from inside the driver > pod), one can't specify the service account in accordance with the docs > ([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac]). > The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is > most likely applied in cluster mode only, which would be consistent with > spark.kubernetes.authenticate.driver being the cluster-mode prefix. > We should either inject the service account specified by this property into the > client-mode pods, or add an equivalent config: > spark.kubernetes.authenticate.serviceAccountName > This is the exception: > {noformat} > Message: Forbidden!Configured service account doesn't have access. Service > account may have been revoked. pods "..." is forbidden: User > "system:serviceaccount:mynamespace:default" cannot get pods in the namespace > "mynamespace"{noformat} > The expectation was to see the user {{mynamespace:spark}} based on my submit > command. > My current workaround is to create a clusterrolebinding with edit rights for > the mynamespace:default account.