[jira] [Updated] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-23265: --- Description: SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If both single- and multi-column params are set (specifically {{inputCol}} / {{inputCols}}) an error is thrown. However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that for this transformer, it is acceptable to set the single-column param for {{numBuckets}} when transforming multiple columns, since that is then applied to all columns. was: SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If both single- and multi-column params are set (specifically {{inputCol}} / {{inputCols}}) an error is thrown. However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that for this transformer, it is acceptable to set the single-column param for {{numBuckets}}, since that is then applied to all columns. > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Major > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and multi-column params are set (specifically {{inputCol}} / > {{inputCols}}) an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that > for this transformer, it is acceptable to set the single-column param for > {{numBuckets}} when transforming multiple columns, since that is then applied > to all columns.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
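[Editorial note] The validation rules described in the issue above can be sketched as a small, self-contained model. This is a hypothetical helper in plain Python, not Spark's actual API: exactly one of {{inputCol}} / {{inputCols}} must be set (the SPARK-22799-style check), and the single-column {{numBuckets}} param may be broadcast to all columns in multi-column mode. The function name `resolve_num_buckets` and the default of 2 buckets are assumptions for illustration.

```python
def resolve_num_buckets(input_col=None, input_cols=None,
                        num_buckets=2, num_buckets_array=None):
    """Toy model of the single- vs. multi-column param rules for
    QuantileDiscretizer (hypothetical helper, not Spark's API)."""
    # Exactly one of inputCol / inputCols must be set, else raise.
    if (input_col is None) == (input_cols is None):
        raise ValueError("exactly one of inputCol or inputCols must be set")
    if input_col is not None:
        # Single-column mode: the array-valued param must not be set.
        if num_buckets_array is not None:
            raise ValueError("numBucketsArray is only valid with inputCols")
        return {input_col: num_buckets}
    # Multi-column mode: an explicit per-column array wins; otherwise the
    # single-column numBuckets is applied to every input column, as the
    # issue description notes is acceptable for this transformer.
    if num_buckets_array is not None:
        if len(num_buckets_array) != len(input_cols):
            raise ValueError("numBucketsArray length must match inputCols")
        return dict(zip(input_cols, num_buckets_array))
    return {c: num_buckets for c in input_cols}
```

For example, `resolve_num_buckets(input_cols=["a", "b"], num_buckets=3)` assigns 3 buckets to both columns, while setting both `input_col` and `input_cols` raises an error.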
[jira] [Commented] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344604#comment-16344604 ] Nick Pentreath commented on SPARK-23265: cc [~huaxing] > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Major > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and multi-column params are set (specifically {{inputCol}} / > {{inputCols}}) an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-23265: --- Issue Type: Improvement (was: Documentation) > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Major > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and multi-column params are set (specifically {{inputCol}} / > {{inputCols}}) an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
Nick Pentreath created SPARK-23265: -- Summary: Update multi-column error handling logic in QuantileDiscretizer Key: SPARK-23265 URL: https://issues.apache.org/jira/browse/SPARK-23265 Project: Spark Issue Type: Documentation Components: ML Affects Versions: 2.3.0 Reporter: Nick Pentreath SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If both single- and multi-column params are set (specifically {{inputCol}} / {{inputCols}}) an error is thrown. However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. The logic for {{QuantileDiscretizer}} should be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-23265: --- Description: SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If both single- and multi-column params are set (specifically {{inputCol}} / {{inputCols}}) an error is thrown. However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that for this transformer, it is acceptable to set the single-column param for {{numBuckets}}, since that is then applied to all columns. was: SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If both single- and multi-column params are set (specifically {{inputCol}} / {{inputCols}}) an error is thrown. However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. The logic for {{QuantileDiscretizer}} should be updated to match. > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Major > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and multi-column params are set (specifically {{inputCol}} / > {{inputCols}}) an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that > for this transformer, it is acceptable to set the single-column param for > {{numBuckets}}, since that is then applied to all columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23138) Add user guide example for multiclass logistic regression summary
[ https://issues.apache.org/jira/browse/SPARK-23138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-23138. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20332 [https://github.com/apache/spark/pull/20332] > Add user guide example for multiclass logistic regression summary > - > > Key: SPARK-23138 > URL: https://issues.apache.org/jira/browse/SPARK-23138 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.3.0 >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.3.0 > > > We haven't updated the user guide to reflect the multiclass logistic > regression summary added in SPARK-17139. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23138) Add user guide example for multiclass logistic regression summary
[ https://issues.apache.org/jira/browse/SPARK-23138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-23138: -- Assignee: Seth Hendrickson > Add user guide example for multiclass logistic regression summary > - > > Key: SPARK-23138 > URL: https://issues.apache.org/jira/browse/SPARK-23138 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.3.0 >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.3.0 > > > We haven't updated the user guide to reflect the multiclass logistic > regression summary added in SPARK-17139. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20928) SPIP: Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344565#comment-16344565 ] liweisheng commented on SPARK-20928: What about introducing a non-blocking shuffle, where the shuffle reader and shuffle writer work at the same time, and the reader fetches data from the writer as soon as it is produced? > SPIP: Continuous Processing Mode for Structured Streaming > - > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Assignee: Jose Torres >Priority: Major > Labels: SPIP > Attachments: Continuous Processing in Structured Streaming Design > Sketch.pdf > > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. > For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. 
Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. */ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user-requested change in parallelism). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
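[Editorial note] The key idea in the Epoch sketch above is that the end offset is only known incrementally, so a reader can consume records as they arrive instead of waiting for a fixed batch boundary. The toy below models that in plain Python (a single-threaded illustration, not Spark's API; a real implementation would poll or block across threads):

```python
from collections import deque

class ToySource:
    """Toy model of the continuous-processing idea: the reader drains
    records as they are produced, so per-record latency is not bounded
    by a fixed batch's start/end offsets."""
    def __init__(self):
        self.buffer = deque()
        self.closed = False

    def write(self, record):
        # Writer side: append a record; no end offset is declared up front.
        self.buffer.append(record)

    def read_continuously(self):
        # Reader side: yield records as soon as they are available; the
        # "end offset" advances incrementally, as in the Epoch sketch.
        while self.buffer or not self.closed:
            if self.buffer:
                yield self.buffer.popleft()

src = ToySource()
for r in (1, 2, 3):
    src.write(r)
src.closed = True
out = list(src.read_continuously())  # records drained in arrival order
```

In the current getBatch model, by contrast, no task could launch until all three records and the final offset were known.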
[jira] [Assigned] (SPARK-23264) Support interval values without INTERVAL clauses
[ https://issues.apache.org/jira/browse/SPARK-23264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23264: Assignee: (was: Apache Spark) > Support interval values without INTERVAL clauses > > > Key: SPARK-23264 > URL: https://issues.apache.org/jira/browse/SPARK-23264 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > The master currently cannot parse the SQL query below: > {code:java} > SELECT cast('2017-08-04' as date) + 1 days; > {code} > Since other DBMS-like systems support this syntax (e.g., Hive and MySQL), it > would help to support it in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23264) Support interval values without INTERVAL clauses
[ https://issues.apache.org/jira/browse/SPARK-23264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23264: Assignee: Apache Spark > Support interval values without INTERVAL clauses > > > Key: SPARK-23264 > URL: https://issues.apache.org/jira/browse/SPARK-23264 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark >Priority: Minor > > The master currently cannot parse the SQL query below: > {code:java} > SELECT cast('2017-08-04' as date) + 1 days; > {code} > Since other DBMS-like systems support this syntax (e.g., Hive and MySQL), it > would help to support it in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23264) Support interval values without INTERVAL clauses
[ https://issues.apache.org/jira/browse/SPARK-23264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344561#comment-16344561 ] Apache Spark commented on SPARK-23264: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/20433 > Support interval values without INTERVAL clauses > > > Key: SPARK-23264 > URL: https://issues.apache.org/jira/browse/SPARK-23264 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > The master currently cannot parse the SQL query below: > {code:java} > SELECT cast('2017-08-04' as date) + 1 days; > {code} > Since other DBMS-like systems support this syntax (e.g., Hive and MySQL), it > would help to support it in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23157. - Resolution: Invalid > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Priority: Minor > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at 
org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23264) Support interval values without INTERVAL clauses
Takeshi Yamamuro created SPARK-23264: Summary: Support interval values without INTERVAL clauses Key: SPARK-23264 URL: https://issues.apache.org/jira/browse/SPARK-23264 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.1 Reporter: Takeshi Yamamuro The master currently cannot parse the SQL query below: {code:java} SELECT cast('2017-08-04' as date) + 1 days; {code} Since other DBMS-like systems support this syntax (e.g., Hive and MySQL), it would help to support it in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
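[Editorial note] The feature requested above amounts to treating a trailing "N unit" token as an interval literal. The sketch below illustrates the idea in plain Python (not Spark's parser; the helper name `parse_interval` and the supported unit set are assumptions), turning the "1 days" part of the example query into a timedelta that can be added to the casted date:

```python
import re
from datetime import date, timedelta

# Map a few interval unit spellings to timedelta keyword arguments.
_UNITS = {"day": "days", "days": "days",
          "hour": "hours", "hours": "hours",
          "minute": "minutes", "minutes": "minutes"}

def parse_interval(text):
    """Parse an 'N unit' interval literal such as '1 days' (toy sketch)."""
    m = re.fullmatch(r"\s*(\d+)\s+(\w+)\s*", text)
    if not m or m.group(2).lower() not in _UNITS:
        raise ValueError(f"cannot parse interval literal: {text!r}")
    return timedelta(**{_UNITS[m.group(2).lower()]: int(m.group(1))})

# Models: SELECT cast('2017-08-04' as date) + 1 days;
result = date(2017, 8, 4) + parse_interval("1 days")  # -> 2017-08-05
```

In Spark's grammar the same effect would be achieved by letting the parser accept the unit suffix without a leading INTERVAL keyword, as Hive and MySQL do.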
[jira] [Commented] (SPARK-23174) Fix pep8 to latest official version
[ https://issues.apache.org/jira/browse/SPARK-23174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344553#comment-16344553 ] Apache Spark commented on SPARK-23174: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/20432 > Fix pep8 to latest official version > --- > > Key: SPARK-23174 > URL: https://issues.apache.org/jira/browse/SPARK-23174 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.1 >Reporter: Rekha Joshi >Assignee: Rekha Joshi >Priority: Trivial > Fix For: 2.4.0 > > > As per discussion with [~hyukjin.kwon], this Jira is to fix the Python code style > to the latest official version. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23222) Flaky test: DataFrameRangeSuite
[ https://issues.apache.org/jira/browse/SPARK-23222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23222: Assignee: Apache Spark > Flaky test: DataFrameRangeSuite > --- > > Key: SPARK-23222 > URL: https://issues.apache.org/jira/browse/SPARK-23222 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Major > > I've seen this test fail a few times in unrelated PRs. e.g.: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86605/testReport/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86656/testReport/ > {noformat} > Error Message > org.scalatest.exceptions.TestFailedException: Expected exception > org.apache.spark.SparkException to be thrown, but no exception was thrown > Stacktrace > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > Expected exception org.apache.spark.SparkException to be thrown, but no > exception was thrown > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at org.scalatest.Assertions$class.intercept(Assertions.scala:822) > at org.scalatest.FunSuite.intercept(FunSuite.scala:1560) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4$$anonfun$apply$2.apply$mcV$sp(DataFrameRangeSuite.scala:168) > at > org.apache.spark.sql.catalyst.plans.PlanTestBase$class.withSQLConf(PlanTest.scala:176) > at > org.apache.spark.sql.DataFrameRangeSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(DataFrameRangeSuite.scala:33) > at > org.apache.spark.sql.test.SQLTestUtilsBase$class.withSQLConf(SQLTestUtils.scala:167) > at > org.apache.spark.sql.DataFrameRangeSuite.withSQLConf(DataFrameRangeSuite.scala:33) > at > 
org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4.apply(DataFrameRangeSuite.scala:166) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4.apply(DataFrameRangeSuite.scala:165) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply$mcV$sp(DataFrameRangeSuite.scala:165) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply(DataFrameRangeSuite.scala:154) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply(DataFrameRangeSuite.scala:154) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23222) Flaky test: DataFrameRangeSuite
[ https://issues.apache.org/jira/browse/SPARK-23222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23222: Assignee: (was: Apache Spark) > Flaky test: DataFrameRangeSuite > --- > > Key: SPARK-23222 > URL: https://issues.apache.org/jira/browse/SPARK-23222 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > I've seen this test fail a few times in unrelated PRs. e.g.: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86605/testReport/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86656/testReport/ > {noformat} > Error Message > org.scalatest.exceptions.TestFailedException: Expected exception > org.apache.spark.SparkException to be thrown, but no exception was thrown > Stacktrace > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > Expected exception org.apache.spark.SparkException to be thrown, but no > exception was thrown > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at org.scalatest.Assertions$class.intercept(Assertions.scala:822) > at org.scalatest.FunSuite.intercept(FunSuite.scala:1560) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4$$anonfun$apply$2.apply$mcV$sp(DataFrameRangeSuite.scala:168) > at > org.apache.spark.sql.catalyst.plans.PlanTestBase$class.withSQLConf(PlanTest.scala:176) > at > org.apache.spark.sql.DataFrameRangeSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(DataFrameRangeSuite.scala:33) > at > org.apache.spark.sql.test.SQLTestUtilsBase$class.withSQLConf(SQLTestUtils.scala:167) > at > org.apache.spark.sql.DataFrameRangeSuite.withSQLConf(DataFrameRangeSuite.scala:33) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4.apply(DataFrameRangeSuite.scala:166) 
> at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4.apply(DataFrameRangeSuite.scala:165) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply$mcV$sp(DataFrameRangeSuite.scala:165) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply(DataFrameRangeSuite.scala:154) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply(DataFrameRangeSuite.scala:154) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23222) Flaky test: DataFrameRangeSuite
[ https://issues.apache.org/jira/browse/SPARK-23222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344548#comment-16344548 ] Apache Spark commented on SPARK-23222: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/20431 > Flaky test: DataFrameRangeSuite > --- > > Key: SPARK-23222 > URL: https://issues.apache.org/jira/browse/SPARK-23222 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > I've seen this test fail a few times in unrelated PRs. e.g.: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86605/testReport/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86656/testReport/ > {noformat} > Error Message > org.scalatest.exceptions.TestFailedException: Expected exception > org.apache.spark.SparkException to be thrown, but no exception was thrown > Stacktrace > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > Expected exception org.apache.spark.SparkException to be thrown, but no > exception was thrown > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at org.scalatest.Assertions$class.intercept(Assertions.scala:822) > at org.scalatest.FunSuite.intercept(FunSuite.scala:1560) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4$$anonfun$apply$2.apply$mcV$sp(DataFrameRangeSuite.scala:168) > at > org.apache.spark.sql.catalyst.plans.PlanTestBase$class.withSQLConf(PlanTest.scala:176) > at > org.apache.spark.sql.DataFrameRangeSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(DataFrameRangeSuite.scala:33) > at > org.apache.spark.sql.test.SQLTestUtilsBase$class.withSQLConf(SQLTestUtils.scala:167) > at > 
org.apache.spark.sql.DataFrameRangeSuite.withSQLConf(DataFrameRangeSuite.scala:33) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4.apply(DataFrameRangeSuite.scala:166) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4.apply(DataFrameRangeSuite.scala:165) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply$mcV$sp(DataFrameRangeSuite.scala:165) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply(DataFrameRangeSuite.scala:154) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply(DataFrameRangeSuite.scala:154) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gaurav Garg updated SPARK-18016: Comment: was deleted (was: [~kiszk], this program also gives the Constant pool error in my environment. Have you set any extra Spark property? I have tested the code on a single node with 64g RAM and 4 CPU cores, as well as in cluster mode with 6 nodes, each with the same configuration. It seems the problem is not the hardware but an issue somewhere else. Observations when I run the same on a single node: - RAM consumption was only 23g when it threw the constant pool exception. - I have tried the same logic using RDDs instead of DataFrames; it works fine. ) > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.0 > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0xFFFF > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > 
org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.ja
[jira] [Comment Edited] (SPARK-23252) When NodeManager and CoarseGrainedExecutorBackend processes are killed, the job will be blocked
[ https://issues.apache.org/jira/browse/SPARK-23252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344484#comment-16344484 ] Bang Xiao edited comment on SPARK-23252 at 1/30/18 4:30 AM: After the executor and the NodeManager are killed, the failed tasks are never relaunched because the executor's loss reason is not yet known.
{code:java}
CoarseGrainedSchedulerBackend.scala:

protected def disableExecutor(executorId: String): Boolean = {
  val shouldDisable = CoarseGrainedSchedulerBackend.this.synchronized {
    if (executorIsAlive(executorId)) {
      executorsPendingLossReason += executorId
      true
    } else {
      // Returns true for explicitly killed executors, we also need to get pending loss reasons;
      // For others return false.
      executorsPendingToRemove.contains(executorId)
    }
  }
  if (shouldDisable) {
    logInfo(s"Disabling executor $executorId.")
    scheduler.executorLost(executorId, LossReasonPending)
  }
  shouldDisable
}
{code}
TaskSchedulerImpl then handles executorLost and removeExecutor:
{code:java}
TaskSchedulerImpl.scala:

private def removeExecutor(executorId: String, reason: ExecutorLossReason) {
  // The tasks on the lost executor may not send any more status updates (because the executor
  // has been lost), so they should be cleaned up here.
  executorIdToRunningTaskIds.remove(executorId).foreach { taskIds =>
    logDebug("Cleaning up TaskScheduler state for tasks " +
      s"${taskIds.mkString("[", ",", "]")} on failed executor $executorId")
    // We do not notify the TaskSetManager of the task failures because that will
    // happen below in the rootPool.executorLost() call.
    taskIds.foreach(cleanupTaskState)
  }

  val host = executorIdToHost(executorId)
  val execs = hostToExecutors.getOrElse(host, new HashSet)
  execs -= executorId
  if (execs.isEmpty) {
    hostToExecutors -= host
    for (rack <- getRackForHost(host); hosts <- hostsByRack.get(rack)) {
      hosts -= host
      if (hosts.isEmpty) {
        hostsByRack -= rack
      }
    }
  }

  if (reason != LossReasonPending) {
    executorIdToHost -= executorId
    rootPool.executorLost(executorId, host, reason)
  }
  blacklistTrackerOpt.foreach(_.handleRemovedExecutor(executorId))
}
{code}
But if the reason is LossReasonPending, this does not trigger relaunching of the lost tasks. This is consistent with what I've observed in the log. > When NodeManager and CoarseGrainedExecutorBackend processes are killed, the > job will be blocked > --- > > Key: SPARK-23252 > URL: https://issues.apac
[jira] [Commented] (SPARK-23252) When NodeManager and CoarseGrainedExecutorBackend processes are killed, the job will be blocked
[ https://issues.apache.org/jira/browse/SPARK-23252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344484#comment-16344484 ] Bang Xiao commented on SPARK-23252: --- After the executor and the NodeManager are killed, the failed tasks are never relaunched because the executor's loss reason is not yet known.
{code:java}
CoarseGrainedSchedulerBackend.scala:

protected def disableExecutor(executorId: String): Boolean = {
  val shouldDisable = CoarseGrainedSchedulerBackend.this.synchronized {
    if (executorIsAlive(executorId)) {
      executorsPendingLossReason += executorId
      true
    } else {
      // Returns true for explicitly killed executors, we also need to get pending loss reasons;
      // For others return false.
      executorsPendingToRemove.contains(executorId)
    }
  }
  if (shouldDisable) {
    logInfo(s"Disabling executor $executorId.")
    scheduler.executorLost(executorId, LossReasonPending)
  }
  shouldDisable
}
{code}
TaskSchedulerImpl then handles executorLost and removeExecutor:
{code:java}
TaskSchedulerImpl.scala:

private def removeExecutor(executorId: String, reason: ExecutorLossReason) {
  // The tasks on the lost executor may not send any more status updates (because the executor
  // has been lost), so they should be cleaned up here.
  executorIdToRunningTaskIds.remove(executorId).foreach { taskIds =>
    logDebug("Cleaning up TaskScheduler state for tasks " +
      s"${taskIds.mkString("[", ",", "]")} on failed executor $executorId")
    // We do not notify the TaskSetManager of the task failures because that will
    // happen below in the rootPool.executorLost() call.
    taskIds.foreach(cleanupTaskState)
  }

  val host = executorIdToHost(executorId)
  val execs = hostToExecutors.getOrElse(host, new HashSet)
  execs -= executorId
  if (execs.isEmpty) {
    hostToExecutors -= host
    for (rack <- getRackForHost(host); hosts <- hostsByRack.get(rack)) {
      hosts -= host
      if (hosts.isEmpty) {
        hostsByRack -= rack
      }
    }
  }

  if (reason != LossReasonPending) {
    executorIdToHost -= executorId
    rootPool.executorLost(executorId, host, reason)
  }
  blacklistTrackerOpt.foreach(_.handleRemovedExecutor(executorId))
}
{code}
But if the reason is LossReasonPending, this does not trigger relaunching of the lost tasks. This is consistent with what I've observed in the log. > When NodeManager and CoarseGrainedExecutorBackend processes are killed, the > job will be blocked > --- > > Key: SPARK-23252 > URL: https://issues.apache.org/jira/browse/SPARK-23252 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Bang Xiao >Priority: Major > > This happens when 'spark.dynamicAllocation.enabled' is set to 'true'. We > use Yarn as our resource manager. > 1, spark-submit the "JavaWordCount" application in yarn-client mode > 2, kill the NodeManager and CoarseGrainedExecutorBackend processes on one node > when the job is in stage 0 > If we just kill all CoarseGrainedExecutorBackend processes on that node, TaskSetManager > will pend the failed tasks for resubmission. But if the NodeManager and > CoarseGrainedExecutorBackend processes are killed simultaneously, the whole job > will be blocked. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
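The blocking behavior described above can be sketched outside Spark as a minimal, purely illustrative Python simulation (the names `remove_executor` and `LOSS_REASON_PENDING` mirror the Scala code but are not Spark's API): when the loss reason is still pending, the `rootPool.executorLost()` step that would resubmit the executor's tasks is skipped, so those tasks are simply dropped.

```python
# Illustrative simulation of TaskSchedulerImpl.removeExecutor: an executor
# disabled with a pending loss reason skips the step that resubmits its tasks.

LOSS_REASON_PENDING = "LossReasonPending"

def remove_executor(state, executor_id, reason):
    """Drop the executor's bookkeeping; resubmit its tasks only if the
    loss reason is known (i.e. not pending)."""
    lost_tasks = state["running_tasks"].pop(executor_id, set())
    if reason != LOSS_REASON_PENDING:
        # analogous to rootPool.executorLost(): tasks get resubmitted
        state["resubmitted"] |= lost_tasks
    return lost_tasks

state = {"running_tasks": {"exec-1": {1, 2}, "exec-2": {3}}, "resubmitted": set()}

# Executor killed together with its NodeManager: the reason is never resolved,
# so tasks 1 and 2 are lost without being resubmitted.
remove_executor(state, "exec-1", LOSS_REASON_PENDING)
# Executor whose loss reason is known: its task is resubmitted.
remove_executor(state, "exec-2", "ExecutorProcessLost")

print(state["resubmitted"])  # prints {3}: exec-1's tasks were never resubmitted
```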
[jira] [Assigned] (SPARK-23263) create table stored as parquet should update table size if automatic update table size is enabled
[ https://issues.apache.org/jira/browse/SPARK-23263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23263: Assignee: (was: Apache Spark) > create table stored as parquet should update table size if automatic update > table size is enabled > - > > Key: SPARK-23263 > URL: https://issues.apache.org/jira/browse/SPARK-23263 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {noformat} > bin/spark-sql --conf spark.sql.statistics.size.autoUpdate.enabled=true > {noformat} > {code:sql} > spark-sql> create table test_create_parquet stored as parquet as select 1; > spark-sql> desc extended test_create_parquet; > {code} > The table statistics will not exist. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23263) create table stored as parquet should update table size if automatic update table size is enabled
[ https://issues.apache.org/jira/browse/SPARK-23263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344460#comment-16344460 ] Apache Spark commented on SPARK-23263: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/20430 > create table stored as parquet should update table size if automatic update > table size is enabled > - > > Key: SPARK-23263 > URL: https://issues.apache.org/jira/browse/SPARK-23263 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {noformat} > bin/spark-sql --conf spark.sql.statistics.size.autoUpdate.enabled=true > {noformat} > {code:sql} > spark-sql> create table test_create_parquet stored as parquet as select 1; > spark-sql> desc extended test_create_parquet; > {code} > The table statistics will not exist. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23263) create table stored as parquet should update table size if automatic update table size is enabled
[ https://issues.apache.org/jira/browse/SPARK-23263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23263: Assignee: Apache Spark > create table stored as parquet should update table size if automatic update > table size is enabled > - > > Key: SPARK-23263 > URL: https://issues.apache.org/jira/browse/SPARK-23263 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > How to reproduce: > {noformat} > bin/spark-sql --conf spark.sql.statistics.size.autoUpdate.enabled=true > {noformat} > {code:sql} > spark-sql> create table test_create_parquet stored as parquet as select 1; > spark-sql> desc extended test_create_parquet; > {code} > The table statistics will not exist. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23263) create table stored as parquet should update table size if automatic update table size is enabled
Yuming Wang created SPARK-23263: --- Summary: create table stored as parquet should update table size if automatic update table size is enabled Key: SPARK-23263 URL: https://issues.apache.org/jira/browse/SPARK-23263 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Yuming Wang How to reproduce:
{noformat}
bin/spark-sql --conf spark.sql.statistics.size.autoUpdate.enabled=true
{noformat}
{code:sql}
spark-sql> create table test_create_parquet stored as parquet as select 1;
spark-sql> desc extended test_create_parquet;
{code}
The table statistics will not exist. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
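The expected fix can be sketched, purely illustratively, in plain Python: after the CTAS command finishes writing the table, it should consult the autoUpdate flag and record the computed size in the catalog. The `catalog` dict and the `create_table_as_select` helper below are hypothetical stand-ins, not Spark's internal API (the real fix would run something analogous to `CommandUtils.updateTableStats` after the write).

```python
# Sketch: "create table ... stored as parquet as select ..." should update the
# table's size statistic when spark.sql.statistics.size.autoUpdate.enabled is true.

def create_table_as_select(catalog, conf, name, files):
    """Register a table and, if auto-update is enabled, record its total size."""
    catalog[name] = {"files": files, "stats": None}
    if conf.get("spark.sql.statistics.size.autoUpdate.enabled", "false") == "true":
        # analogous to updating table stats right after the write completes
        catalog[name]["stats"] = {"sizeInBytes": sum(files.values())}

catalog = {}
conf = {"spark.sql.statistics.size.autoUpdate.enabled": "true"}
create_table_as_select(catalog, conf, "test_create_parquet", {"part-0.parquet": 512})

print(catalog["test_create_parquet"]["stats"])  # prints {'sizeInBytes': 512}
```

With the flag off (the default in the sketch), `stats` stays `None`, which matches the missing statistics reported in the issue.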
[jira] [Updated] (SPARK-23246) (Py)Spark OOM because of iteratively accumulated metadata that cannot be cleared
[ https://issues.apache.org/jira/browse/SPARK-23246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MBA Learns to Code updated SPARK-23246: --- Description: I am having consistent OOM crashes when trying to use PySpark for iterative algorithms in which I create new DataFrames per iteration (e.g. by sampling from a "mother" DataFrame), do something with such DataFrames, and never need those DataFrames again in future iterations. The below script simulates such OOM failures. Even when one explicitly tries to .unpersist() the temporary DataFrames (by using the --unpersist flag below) and/or deletes and garbage-collects the Python objects (by using the --py-gc flag below), the Java objects seem to stay on and accumulate until they exceed the JVM/driver memory. The more complex the temporary DataFrames in each iteration (illustrated by the --n-partitions flag below), the faster OOM occurs. The typical error messages include: - "java.lang.OutOfMemoryError : GC overhead limit exceeded" - "Java heap space" - "ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=6053742323219781161, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]}} to /; closing connection" Please suggest how I may overcome this so that we can have long-running iterative programs using Spark that use resources only up to a bounded, controllable limit.
{code:java}
from __future__ import print_function

import argparse
import gc

import pandas
import pyspark

arg_parser = argparse.ArgumentParser()
arg_parser.add_argument('--unpersist', action='store_true')
arg_parser.add_argument('--py-gc', action='store_true')
arg_parser.add_argument('--n-partitions', type=int, default=1000)
args = arg_parser.parse_args()

# create SparkSession (*** set spark.driver.memory to 512m in spark-defaults.conf ***)
spark = pyspark.sql.SparkSession.builder \
    .config('spark.executor.instances', 2) \
    .config('spark.executor.cores', 2) \
    .config('spark.executor.memory', '512m') \
    .config('spark.ui.enabled', False) \
    .config('spark.ui.retainedJobs', 10) \
    .config('spark.ui.retainedStages', 10) \
    .config('spark.ui.retainedTasks', 10) \
    .enableHiveSupport() \
    .getOrCreate()

# create Parquet file for subsequent repeated loading
df = spark.createDataFrame(
    pandas.DataFrame(
        dict(
            row=range(args.n_partitions),
            x=args.n_partitions * [0]
        )
    )
)

parquet_path = '/tmp/TestOOM-{}Partitions.parquet'.format(args.n_partitions)

df.write.parquet(
    path=parquet_path,
    partitionBy='row',
    mode='overwrite'
)

i = 0

# The below loop simulates an iterative algorithm that creates new DataFrames
# in each iteration (e.g. sampling from a "mother" DataFrame), does something,
# and never needs those DataFrames again in future iterations.
# We are having a problem cleaning up the built-up metadata,
# hence the program will crash after a while because of OOM.
while True:
    _df = spark.read.parquet(parquet_path)

    if args.unpersist:
        _df.unpersist()

    if args.py_gc:
        del _df
        gc.collect()

    i += 1
    print('COMPLETED READ ITERATION #{}\n'.format(i))
{code}
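The failure mode (driver-side state accumulating per iteration with no way to clear it from Python) can be illustrated without Spark at all. The sketch below only mirrors the shape of the reported problem; the `Driver` class and its `metadata` list are stand-ins for JVM-side bookkeeping, not actual Spark internals.

```python
# Shape of the reported problem: each iteration registers metadata with a
# long-lived driver-side registry; deleting the Python handle (del / gc)
# does not shrink the registry, so memory grows without bound.

class Driver:
    def __init__(self):
        self.metadata = []  # grows each iteration, never cleared

    def read(self, path):
        self.metadata.append({"path": path})  # stand-in for JVM-side state
        return object()                       # the Python-side handle

driver = Driver()
for i in range(100):
    df = driver.read("/tmp/TestOOM.parquet")
    del df  # frees only the Python handle, not the driver-side entry

print(len(driver.metadata))  # prints 100: driver-side state survived every `del`
```

This is why the reporter's `--unpersist` and `--py-gc` flags make no difference in the script above: they only act on the Python side of the boundary.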
[jira] [Resolved] (SPARK-23088) History server not showing incomplete/running applications
[ https://issues.apache.org/jira/browse/SPARK-23088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-23088. - Resolution: Fixed Assignee: paul mackles Fix Version/s: 2.4.0 Issue resolved by pull request 20335 https://github.com/apache/spark/pull/20335 > History server not showing incomplete/running applications > -- > > Key: SPARK-23088 > URL: https://issues.apache.org/jira/browse/SPARK-23088 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.1.2, 2.2.1 >Reporter: paul mackles >Assignee: paul mackles >Priority: Minor > Fix For: 2.4.0 > > > History server not showing incomplete/running applications when > _spark.history.ui.maxApplications_ property is set to a value that is smaller > than the total number of applications. > I believe this is because any applications where completed=false wind up at > the end of the list of apps returned by the /applications endpoint and when > _spark.history.ui.maxApplications_ is set, that list gets truncated and the > running apps are never returned. > The fix I have in mind is to modify the history template to start passing the > _status_ parameter when calling the /applications endpoint (status=completed > is the default). > I am running Spark in a Mesos environment but I don't think that is relevant > to this issue -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
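The fix described above (have the history page pass the status parameter) amounts to querying the applications endpoint for both states instead of relying on the completed-only default. A hedged sketch of the request-URL construction (the `applications_url` helper is illustrative; the `/api/v1/applications` endpoint and its repeated `status` parameter are part of Spark's documented REST API):

```python
# The /applications endpoint returns completed apps by default; listing
# running apps as well requires explicit status parameters in the query.

from urllib.parse import urlencode

def applications_url(base, statuses=("completed", "running")):
    """Build the history-server REST URL listing apps in the given states."""
    query = urlencode([("status", s) for s in statuses])
    return "{}/api/v1/applications?{}".format(base.rstrip("/"), query)

print(applications_url("http://history-server:18080"))
# http://history-server:18080/api/v1/applications?status=completed&status=running
```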
[jira] [Commented] (SPARK-23237) Add UI / endpoint for threaddumps for executors with active tasks
[ https://issues.apache.org/jira/browse/SPARK-23237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344434#comment-16344434 ] Imran Rashid commented on SPARK-23237: -- Can you expand a bit about what you are worried about? Confusing UI? Too tempting for users to refresh it constantly, not realizing the impact it has? If it's just the UI, I'd be fine with only making it an endpoint in the api. And on the expense – well, the existing thread dump page already would have that problem. Perhaps it's just so inconvenient that nobody has been tempted to abuse it :P. But I think it's better to warn in the docs and then the user is allowed to shoot themselves in the foot. > Add UI / endpoint for threaddumps for executors with active tasks > - > > Key: SPARK-23237 > URL: https://issues.apache.org/jira/browse/SPARK-23237 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Major > > Frequently, when there are a handful of straggler tasks, users want to know > what is going on in those executors running the stragglers. Currently, that > is a bit of a pain to do: you have to go to the page for your active stage, > find the task, figure out which executor it's on, then go to the executors > page, and get the thread dump. Or maybe you just go to the executors page, > find the executor with an active task, and then click on that, but that > doesn't work if you've got multiple stages running. > Users could figure this out by extracting the info from the stage rest endpoint, > but it's such a common thing to do that we should make it easy. > I realize that figuring out a good way to do this is a little tricky. We > don't want to make it easy to end up pulling thread dumps from 1000 executors > back to the driver. So we've got to come up with a reasonable heuristic for > choosing which executors to poll. And we've also got to find a suitable > place to put this.
> My suggestion is that the stage page always has a link to the thread dumps > for the *one* executor with the longest running task. And there would be a > corresponding endpoint in the rest api with the same info, maybe at > {{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/slowestTaskThreadDump}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
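The suggested heuristic (link to the one executor with the longest-running task) can be sketched as a small selection function. Everything below is illustrative: `slowest_task_executor` and the task-dict shape are assumptions for the sketch, not a proposed Spark signature.

```python
# Sketch of the suggested heuristic: among a stage's active tasks, find the
# one that has been running longest (earliest launch time) and report its
# executor, so the UI / REST layer can link to just that executor's thread dump.

def slowest_task_executor(active_tasks):
    """active_tasks: list of dicts with 'executor_id' and 'launch_time'.
    Returns the executor running the longest-active task, or None."""
    if not active_tasks:
        return None
    oldest = min(active_tasks, key=lambda t: t["launch_time"])
    return oldest["executor_id"]

tasks = [
    {"executor_id": "2", "launch_time": 100},  # launched earliest: the straggler
    {"executor_id": "5", "launch_time": 180},
]
print(slowest_task_executor(tasks))  # prints 2
```

This keeps the thread-dump fan-out bounded to a single executor per request, which addresses the "don't pull dumps from 1000 executors" concern in the description.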
[jira] [Comment Edited] (SPARK-23236) Make it easier to find the rest API, especially in local mode
[ https://issues.apache.org/jira/browse/SPARK-23236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344433#comment-16344433 ] Imran Rashid edited comment on SPARK-23236 at 1/30/18 3:09 AM: --- {quote} 1. /api and /api/v1 to give the same results (a redirect) as /api/v1/applications Based on that I'm on the fence about #1, it may be ok, but I'm not sure if it's best. {quote} actually, I want them to be a very simple html page, with just a single link. I don't think they should be a redirect, as I don't think they should appear to actually be part of the api. The html page could even have a little disclaimer "Are you looking for the rest api? There is no endpoint here in this version, maybe you want ..." {quote} 2. You want /api/v1/applications/{app-id} to include a list of rest api end points As for #2 I'm not sure how you're envisioning it, but it seems like a bad idea in my head. {quote} So to be clear, really I just want a little help with my super lazy, forgetful side, which can't remember the exact endpoint off the top of my head. There are many variations which would satisfy this. The entire UI could *just* provide a link to {{/api/v1/applications/[app-id]}} ... but that's a little weird since there actually isn't anything there. That's why I was suggesting putting something there, even if it's just another simple html page, "Nothing here, maybe you want ...". We could go even simpler -- have every page in the UI have a link to some random app-specific endpoint in the UI, eg. {{/api/v1/applications/[app-id]/jobs}} . It would be a little weird if you're on the stage page in the UI, and you follow a link for the REST api, and it takes you to the jobs data ... but it would at least make it easier to get the base URL right. I also don't really care if {{/api/v1/applications/[app-id]}} actually lists *every* endpoint below that. I think it's totally fine if I need to remember the choices below that -- that is actually the important part of the choice I need to make. I want something to fill in the rest of the automatic stuff for me. The only reason I suggested putting the list of endpoints is, I want to put *something* there so the UI can link to it. {quote} 3. You want UI pages to include links to the rest api calls that they get their info from And for #3, I would be ok with this, but again I'm not sure how it's useful. {quote} yeah this is the least important, but perhaps you can see where I'm going with this after #2, it just feels like a natural extension. Certainly not worth a huge amount of effort. > Make it easier to find th
[jira] [Commented] (SPARK-23236) Make it easier to find the rest API, especially in local mode
[ https://issues.apache.org/jira/browse/SPARK-23236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344433#comment-16344433 ] Imran Rashid commented on SPARK-23236: -- bq. 1. /api and /api/v1 to give the same results (a redirect) as /api/v1/applications bq. Based on that I'm on the fence about #1, it may be ok, but I'm not sure if it's best. actually, I want them to be a very simple html page, with just a single link. I don't think they should be a redirect, as I don't think they should appear to actually be part of the api. The html page could even have a little disclaimer "Are you looking for the rest api? There is no endpoint here in this version, maybe you want ..." bq. 2. You want /api/v1/applications/{app-id} to include a list of rest api end points bq. As for #2 I'm not sure how you're envisioning it, but it seems like a bad idea in my head. So to be clear, really I just want a little help with my super lazy, forgetful side, which can't remember the exact endpoint off the top of my head. There are many variations which would satisfy this. The entire UI could *just* provide a link to /api/v1/applications/{app-id} ... but that's a little weird since there actually isn't anything there. That's why I was suggesting putting something there, even if it's just another simple html page, "Nothing here, maybe you want ...". We could go even simpler -- have every page in the UI have a link to some random app-specific endpoint in the UI, eg. /api/v1/applications/{app-id}/jobs . It would be a little weird if you're on the stage page in the UI, and you follow a link for the REST api, and it takes you to the jobs data ... but it would at least make it easier to get the base URL right. I also don't really care if /api/v1/applications/{app-id} actually lists every endpoint below that. I think it's totally fine if I need to remember the choices below that -- that is actually the important part of the choice I need to make. I want something to fill in the rest of the automatic stuff for me. bq. 3. You want UI pages to include links to the rest api calls that they get their info from bq. And for #3, I would be ok with this, but again I'm not sure how it's useful. yeah this is the least important, but perhaps you can see where I'm going with this after #2, it just feels like a natural extension. Certainly not worth a huge amount of effort. > Make it easier to find the rest API, especially in local mode > - > > Key: SPARK-23236 > URL: https://issues.apache.org/jira/browse/SPARK-23236 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Trivial > Labels: newbie > > This is really minor, but it always takes me a little bit to figure out how > to get from the UI to the rest api. It's especially a pain in local-mode, > where you need the app-id, though in general I don't know the app-id, so I have > to either look in logs or go to another endpoint first in the ui just to find > the app-id. While it wouldn't really help anybody accessing the endpoints > programmatically, we could make it easier for someone doing exploration via > their browser. > Some things which could be improved: > * /api/v1 just provides a link to "/api/v1/applications" > * /api provides a link to "/api/v1/applications" > * /api/v1/applications/[app-id] gives a list of links for the other endpoints > * on the UI, there is a link to at least /api/v1/applications/[app-id] -- > better still if each UI page links to the corresponding endpoint, eg. the all > jobs page would link to /api/v1/applications/[app-id]/jobs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
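The "each UI page links to its REST endpoint" idea from the discussion above can be sketched as a tiny page-to-endpoint mapping. The `UI_TO_REST` table and `rest_link` helper are illustrative assumptions; the `/api/v1/applications/[app-id]/...` paths themselves are Spark's documented REST routes.

```python
# Sketch of the suggested convenience: given the app id a UI page already
# knows, build the matching REST endpoint so each page can link to it,
# falling back to /api/v1/applications for pages with no direct counterpart.

UI_TO_REST = {
    "jobs": "/api/v1/applications/{app}/jobs",
    "stages": "/api/v1/applications/{app}/stages",
    "executors": "/api/v1/applications/{app}/executors",
}

def rest_link(page, app_id):
    template = UI_TO_REST.get(page)
    return template.format(app=app_id) if template else "/api/v1/applications"

print(rest_link("jobs", "local-1517300000000"))
# prints /api/v1/applications/local-1517300000000/jobs
```

This addresses the "I can never remember the base URL" complaint directly: the UI fills in the app-id and prefix, and the user only has to remember the final path segment.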
[jira] [Commented] (SPARK-23262) mix-in interface should extend the interface it aimed to mix in
[ https://issues.apache.org/jira/browse/SPARK-23262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344396#comment-16344396 ] Apache Spark commented on SPARK-23262: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20427 > mix-in interface should extend the interface it aimed to mix in > --- > > Key: SPARK-23262 > URL: https://issues.apache.org/jira/browse/SPARK-23262 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23262) mix-in interface should extend the interface it aimed to mix in
[ https://issues.apache.org/jira/browse/SPARK-23262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23262: Assignee: Wenchen Fan (was: Apache Spark) > mix-in interface should extend the interface it aimed to mix in > --- > > Key: SPARK-23262 > URL: https://issues.apache.org/jira/browse/SPARK-23262 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23262) mix-in interface should extend the interface it aimed to mix in
[ https://issues.apache.org/jira/browse/SPARK-23262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23262: Assignee: Apache Spark (was: Wenchen Fan) > mix-in interface should extend the interface it aimed to mix in > --- > > Key: SPARK-23262 > URL: https://issues.apache.org/jira/browse/SPARK-23262 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23262) mix-in interface should extend the interface it aimed to mix in
Wenchen Fan created SPARK-23262: --- Summary: mix-in interface should extend the interface it aimed to mix in Key: SPARK-23262 URL: https://issues.apache.org/jira/browse/SPARK-23262 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
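The issue title can be illustrated with a minimal, hypothetical Scala sketch (these are not the actual Spark SQL / DataSourceV2 interfaces): a capability mix-in trait should extend the interface it is meant to be mixed into, so that any implementor of the mix-in is automatically usable wherever the base interface is expected.

```scala
// Hypothetical sketch (not the real Spark classes): a base reader interface
// and a capability mix-in for readers that can push down filters.
trait DataReader {
  def read(): Seq[Int]
}

// Because the mix-in extends DataReader, every class that mixes it in is
// also a DataReader -- callers can treat it uniformly as the base interface.
trait SupportsFilterPushdown extends DataReader {
  def pushMinFilter(min: Int): Unit
}

class RangeReader extends SupportsFilterPushdown {
  private var lo = 0
  def pushMinFilter(min: Int): Unit = { lo = min }
  def read(): Seq[Int] = (lo until 5).toSeq
}

// The mix-in type can be assigned to the base interface without a cast.
val reader: DataReader = new RangeReader
```

If the mix-in did not extend `DataReader`, the last line would not compile, and framework code would need extra casts to use a pushdown-capable reader as a plain reader.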
[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344339#comment-16344339 ] Alex Bozarth commented on SPARK-18085: -- [~vanzin] since this is complete and going into 2.3 I was hoping you could help write a short overview of the SPIP, like what's changed and why. Given how much this evolved during the project I'm not sure if the original pitch is the best description anymore and I would like to have a nice summary to describe the project's impact. Thanks > SPIP: Better History Server scalability for many / large applications > - > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Major > Labels: SPIP > Fix For: 2.3.0 > > Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-21664) Use the column name as the file name.
[ https://issues.apache.org/jira/browse/SPARK-21664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jifei_yang closed SPARK-21664. -- We can use the partition to save the column names, such as:
{code:java}
case class UserInfo(name: String, favorite_number: Int, favorite_color: String) extends Serializable

def mainSaveAsParquet(args: Array[String]): Unit = {
  val fileName = new Random().nextInt(43952858)
  val outPath = s"G:/project/idea15/xlwl/bigdata002/bigdata/sparkmvn/outpath/user/spark/parquet/temp/$fileName"
  val sparkConf = new SparkConf().setAppName("Spark Avro Test").setMaster("local[4]")
  MyKryoRegistrator.register(sparkConf)
  val sc = new SparkContext(sparkConf)
  val sqlContext = new SQLContext(sc)
  val array = new Array[UserInfo](3001)
  for (i <- 0 to 3000) {
    i % 2 match {
      case 0 => array(i) = UserInfo("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36", 256 + (i / 102), "blue")
      case 1 => array(i) = UserInfo("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063", 256 + i, "blue")
    }
  }
  import sqlContext.implicits._
  val records: DataFrame = sc.parallelize(array).toDF()
  records.repartition(1).write.partitionBy("name", "favorite_number").format("parquet").mode(SaveMode.ErrorIfExists).save(outPath)
  sc.stop()
}
{code}
This encodes the name and favorite_number columns as partition directories in the output path. > Use the column name as the file name. > -- > > Key: SPARK-21664 > URL: https://issues.apache.org/jira/browse/SPARK-21664 > Project: Spark > Issue Type: Question > Components: Input/Output >Affects Versions: 2.2.0 >Reporter: jifei_yang >Priority: Major > > When we save the dataframe, we want to use the column name as the file name. > This is achievable with PairRDDFunctions. Can DataFrame be implemented? Thank you. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23246) (Py)Spark OOM because of iteratively accumulated metadata that cannot be cleared
[ https://issues.apache.org/jira/browse/SPARK-23246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23246. --- Resolution: Not A Problem Yes, did you have a look? It's dominated by things like {{class org.apache.spark.ui.jobs.UIData$TaskMetricsUIData}}. Turn down the values of the "retained*" options you see at https://spark.apache.org/docs/latest/configuration.html#spark-ui > (Py)Spark OOM because of iteratively accumulated metadata that cannot be > cleared > > > Key: SPARK-23246 > URL: https://issues.apache.org/jira/browse/SPARK-23246 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: MBA Learns to Code >Priority: Critical > Attachments: SparkProgramHeapDump.bin.tar.xz > > > I am having consistent OOM crashes when trying to use PySpark for iterative > algorithms in which I create new DataFrames per iteration (e.g. by sampling > from a "mother" DataFrame), do something with such DataFrames, and never need > such DataFrames ever in future iterations. > The below script simulates such OOM failures. Even when one tries explicitly > .unpersist() the temporary DataFrames (by using the --unpersist flag below) > and/or deleting and garbage-collecting the Python objects (by using the > --py-gc flag below), the Java objects seem to stay on and accumulate until > they exceed the JVM/driver memory. > The more complex the temporary DataFrames in each iteration (illustrated by > the --n-partitions flag below), the faster OOM occurs. 
> The typical error messages include:
> - "java.lang.OutOfMemoryError : GC overhead limit exceeded"
> - "Java heap space"
> - "ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=6053742323219781161, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]}} to /; closing connection"
> Please suggest how I may overcome this so that we can have long-running iterative programs using Spark that use resources only up to a bounded, controllable limit.
>
> {code:java}
> from __future__ import print_function
> import argparse
> import gc
> import pandas
> import pyspark
>
> arg_parser = argparse.ArgumentParser()
> arg_parser.add_argument('--unpersist', action='store_true')
> arg_parser.add_argument('--py-gc', action='store_true')
> arg_parser.add_argument('--n-partitions', type=int, default=1000)
> args = arg_parser.parse_args()
>
> # create SparkSession (*** set spark.driver.memory to 512m in spark-defaults.conf ***)
> spark = pyspark.sql.SparkSession.builder \
>     .config('spark.executor.instances', 2) \
>     .config('spark.executor.cores', 2) \
>     .config('spark.executor.memory', '512m') \
>     .config('spark.ui.enabled', False) \
>     .config('spark.ui.retainedJobs', 10) \
>     .enableHiveSupport() \
>     .getOrCreate()
>
> # create Parquet file for subsequent repeated loading
> df = spark.createDataFrame(
>     pandas.DataFrame(
>         dict(
>             row=range(args.n_partitions),
>             x=args.n_partitions * [0]
>         )
>     )
> )
> parquet_path = '/tmp/TestOOM-{}Partitions.parquet'.format(args.n_partitions)
> df.write.parquet(
>     path=parquet_path,
>     partitionBy='row',
>     mode='overwrite'
> )
>
> i = 0
> # the loop below simulates an iterative algorithm that creates new DataFrames in each iteration (e.g. sampling from a "mother" DataFrame), does something, and never needs those DataFrames again in future iterations
> # we are having a problem cleaning up the built-up metadata
> # hence the program will crash after a while because of OOM
> while True:
>     _df = spark.read.parquet(parquet_path)
>     if args.unpersist:
>         _df.unpersist()
>     if args.py_gc:
>         del _df
>         gc.collect()
>     i += 1; print('COMPLETED READ ITERATION #{}\n'.format(i))
> {code}
>
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
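The "retained*" advice from the resolution above can be applied in spark-defaults.conf. A sketch with illustrative values only (lower values bound the driver's retained UI metadata at the cost of less history in the Web UI):

```
spark.ui.retainedJobs               100
spark.ui.retainedStages             100
spark.ui.retainedTasks              1000
spark.sql.ui.retainedExecutions     100
spark.streaming.ui.retainedBatches  100
```

The same properties can also be passed via SparkSession's builder `.config(...)` calls, as the reporter already does for `spark.ui.retainedJobs` in the reproduction script.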
[jira] [Commented] (SPARK-23235) Add executor Threaddump to api
[ https://issues.apache.org/jira/browse/SPARK-23235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344168#comment-16344168 ] Alex Bozarth commented on SPARK-23235: -- Your discussion clarified my concern for me. I think I wanted to see how he was going to do it first, but based on your implementation comments, this looks like a good add. > Add executor Threaddump to api > -- > > Key: SPARK-23235 > URL: https://issues.apache.org/jira/browse/SPARK-23235 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Minor > Labels: newbie > > It looks like the thread dump {{/executors/threadDump/?executorId=[id]}} > is only available in the UI, not in the rest api at all. This is especially > a pain because that page in the UI has extra formatting which makes it a pain > to send the output to somebody else (most likely you click "expand all" and > then copy paste that, which is OK, but is formatted weirdly). We might also > just want a "format=raw" option even on the UI. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23235) Add executor Threaddump to api
[ https://issues.apache.org/jira/browse/SPARK-23235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344140#comment-16344140 ] Imran Rashid commented on SPARK-23235: -- [~ajbozarth] can you explain your concern? [~attilapiros] definitely a new endpoint. Taking a thread dump is somewhat expensive, so we don't want it to be part of the regular info. > Add executor Threaddump to api > -- > > Key: SPARK-23235 > URL: https://issues.apache.org/jira/browse/SPARK-23235 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Minor > Labels: newbie > > It looks like the thread dump {{/executors/threadDump/?executorId=[id]}} > is only available in the UI, not in the rest api at all. This is especially > a pain because that page in the UI has extra formatting which makes it a pain > to send the output to somebody else (most likely you click "expand all" and > then copy paste that, which is OK, but is formatted weirdly). We might also > just want a "format=raw" option even on the UI. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
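For context on the request above, a transcript-style sketch of the difference today (assuming a driver UI reachable on the default port 4040; not runnable without a live Spark application):

```
# Existing REST API endpoints return JSON under /api/v1:
curl http://localhost:4040/api/v1/applications

# The thread dump is currently only an HTML page in the UI:
curl http://localhost:4040/executors/threadDump/?executorId=0
```

The proposal is a new JSON endpoint for the second case, kept separate from the regular executor info since taking a thread dump is relatively expensive.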
[jira] [Assigned] (SPARK-23209) HiveDelegationTokenProvider throws an exception if Hive jars are not the classpath
[ https://issues.apache.org/jira/browse/SPARK-23209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-23209: Assignee: Marcelo Vanzin > HiveDelegationTokenProvider throws an exception if Hive jars are not the > classpath > -- > > Key: SPARK-23209 > URL: https://issues.apache.org/jira/browse/SPARK-23209 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 > Environment: OSX, Java(TM) SE Runtime Environment (build > 1.8.0_92-b14), Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode) >Reporter: Sahil Takiar >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 2.3.0 > > > While doing some Hive-on-Spark testing against the Spark 2.3.0 release > candidates we came across a bug (see HIVE-18436). > Stack-trace: > {code} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/hadoop/hive/conf/HiveConf > at > org.apache.spark.deploy.security.HadoopDelegationTokenManager.getDelegationTokenProviders(HadoopDelegationTokenManager.scala:68) > at > org.apache.spark.deploy.security.HadoopDelegationTokenManager.(HadoopDelegationTokenManager.scala:54) > at > org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager.(YARNHadoopDelegationTokenManager.scala:44) > at org.apache.spark.deploy.yarn.Client.(Client.scala:123) > at > org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1502) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.ClassNotFoundException: > org.apache.hadoop.hive.conf.HiveConf > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at 
java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 10 more > {code} > Looks like the bug was introduced by SPARK-20434. SPARK-20434 changed > {{HiveDelegationTokenProvider}} so that it constructs > {{o.a.h.hive.conf.HiveConf}} inside {{HiveCredentialProvider#hiveConf}} > rather than trying to manually load the class via the class loader. Looks > like with the new code the JVM tries to load {{HiveConf}} as soon as > {{HiveDelegationTokenProvider}} is referenced. Since there is no try-catch > around the construction of {{HiveDelegationTokenProvider}} a > {{ClassNotFoundException}} is thrown, which causes spark-submit to crash. > Spark's {{docs/running-on-yarn.md}} says "a Hive token will be obtained if > Hive is on the classpath". This behavior would seem to contradict that. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23209) HiveDelegationTokenProvider throws an exception if Hive jars are not the classpath
[ https://issues.apache.org/jira/browse/SPARK-23209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-23209. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20399 [https://github.com/apache/spark/pull/20399] > HiveDelegationTokenProvider throws an exception if Hive jars are not the > classpath > -- > > Key: SPARK-23209 > URL: https://issues.apache.org/jira/browse/SPARK-23209 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 > Environment: OSX, Java(TM) SE Runtime Environment (build > 1.8.0_92-b14), Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode) >Reporter: Sahil Takiar >Priority: Blocker > Fix For: 2.3.0 > > > While doing some Hive-on-Spark testing against the Spark 2.3.0 release > candidates we came across a bug (see HIVE-18436). > Stack-trace: > {code} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/hadoop/hive/conf/HiveConf > at > org.apache.spark.deploy.security.HadoopDelegationTokenManager.getDelegationTokenProviders(HadoopDelegationTokenManager.scala:68) > at > org.apache.spark.deploy.security.HadoopDelegationTokenManager.(HadoopDelegationTokenManager.scala:54) > at > org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager.(YARNHadoopDelegationTokenManager.scala:44) > at org.apache.spark.deploy.yarn.Client.(Client.scala:123) > at > org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1502) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.ClassNotFoundException: > org.apache.hadoop.hive.conf.HiveConf > at 
java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 10 more > {code} > Looks like the bug was introduced by SPARK-20434. SPARK-20434 changed > {{HiveDelegationTokenProvider}} so that it constructs > {{o.a.h.hive.conf.HiveConf}} inside {{HiveCredentialProvider#hiveConf}} > rather than trying to manually load the class via the class loader. Looks > like with the new code the JVM tries to load {{HiveConf}} as soon as > {{HiveDelegationTokenProvider}} is referenced. Since there is no try-catch > around the construction of {{HiveDelegationTokenProvider}} a > {{ClassNotFoundException}} is thrown, which causes spark-submit to crash. > Spark's {{docs/running-on-yarn.md}} says "a Hive token will be obtained if > Hive is on the classpath". This behavior would seem to contradict that. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
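The failure mode above (the JVM resolving {{HiveConf}} as soon as {{HiveDelegationTokenProvider}} is referenced) is commonly avoided by probing for the optional class reflectively before touching any code that links against it. A minimal, hypothetical sketch (not the actual Spark fix; the provider names are illustrative):

```scala
import scala.util.Try

// Probe for an optional dependency without linking against it at compile
// time; a missing jar yields `false` instead of a NoClassDefFoundError
// thrown at class-resolution time.
def classAvailable(name: String): Boolean =
  Try(Class.forName(name)).isSuccess

// Only register the Hive provider when the Hive jars are actually present
// on the classpath.
val providers: Seq[String] =
  if (classAvailable("org.apache.hadoop.hive.conf.HiveConf"))
    Seq("hadoopfs", "hbase", "hive")
  else
    Seq("hadoopfs", "hbase")
```

The key point is that `Class.forName` turns a hard linkage error into a catchable `ClassNotFoundException`, which matches the docs' promise that "a Hive token will be obtained if Hive is on the classpath" without crashing when it is not.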
[jira] [Assigned] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23157: Assignee: Apache Spark > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Assignee: Apache Spark >Priority: Minor > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at 
org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23157: Assignee: (was: Apache Spark) > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Priority: Minor > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > 
at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344078#comment-16344078 ] Apache Spark commented on SPARK-23157: -- User 'henryr' has created a pull request for this issue: https://github.com/apache/spark/pull/20429 > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Priority: Minor > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
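A workaround consistent with the analysis error above (an unverified sketch, mirroring the REPL session in the report): materialize the mapped Dataset once and reference the column from that same Dataset, so the resolved attribute ids line up with the plan that `withColumn` is applied to:

```
scala> val mapped = ds.map(a => a)
scala> mapped.withColumn("n", mapped.col("id"))
```

The original failure comes from mixing attributes of two different plans: `ds.map(a => a).col("id")` resolves `id` against the mapped plan (`id#55`) while `ds.withColumn` projects over the original plan (`id#4`).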
[jira] [Assigned] (SPARK-23261) Rename Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-23261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23261: Assignee: Apache Spark (was: Xiao Li) > Rename Pandas UDFs > -- > > Key: SPARK-23261 > URL: https://issues.apache.org/jira/browse/SPARK-23261 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Rename the public APIs of pandas udfs from > - PANDAS SCALAR UDF -> SCALAR PANDAS UDF > - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF > - PANDAS GROUP AGG UDF -> PANDAS UDAF -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23261) Rename Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-23261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344076#comment-16344076 ] Apache Spark commented on SPARK-23261: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/20428 > Rename Pandas UDFs > -- > > Key: SPARK-23261 > URL: https://issues.apache.org/jira/browse/SPARK-23261 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > > Rename the public APIs of pandas udfs from > - PANDAS SCALAR UDF -> SCALAR PANDAS UDF > - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF > - PANDAS GROUP AGG UDF -> PANDAS UDAF -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23261) Rename Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-23261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23261: Assignee: Xiao Li (was: Apache Spark) > Rename Pandas UDFs > -- > > Key: SPARK-23261 > URL: https://issues.apache.org/jira/browse/SPARK-23261 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > > Rename the public APIs of pandas udfs from > - PANDAS SCALAR UDF -> SCALAR PANDAS UDF > - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF > - PANDAS GROUP AGG UDF -> PANDAS UDAF -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
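To make the rename list above concrete, a hedged PySpark sketch of the post-rename usage (requires PySpark 2.3 with Arrow support; not runnable standalone):

```python
# Sketch of the renamed pandas UDF types (formerly "PANDAS SCALAR UDF" and
# "PANDAS GROUP MAP UDF" respectively):
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def plus_one(v):
    # v is a pandas Series; the UDF is applied column-wise in batches
    return v + 1
```

The JIRA only renames the public-facing type names; the execution model of the UDFs is unchanged.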
[jira] [Comment Edited] (SPARK-23246) (Py)Spark OOM because of iteratively accumulated metadata that cannot be cleared
[ https://issues.apache.org/jira/browse/SPARK-23246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344068#comment-16344068 ] MBA Learns to Code edited comment on SPARK-23246 at 1/29/18 9:45 PM: - [~srowen] the Java driver heap dump is attached above for your review. I ran the job with Spark UI disabled (spark.ui.enabled = 'false'). was (Author: mbalearnstocode): [~srowen] the Java driver heap dump is attached above for your review. > (Py)Spark OOM because of iteratively accumulated metadata that cannot be > cleared > > > Key: SPARK-23246 > URL: https://issues.apache.org/jira/browse/SPARK-23246 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: MBA Learns to Code >Priority: Critical > Attachments: SparkProgramHeapDump.bin.tar.xz > > > I am having consistent OOM crashes when trying to use PySpark for iterative > algorithms in which I create new DataFrames per iteration (e.g. by sampling > from a "mother" DataFrame), do something with such DataFrames, and never need > such DataFrames ever in future iterations. > The below script simulates such OOM failures. Even when one tries explicitly > .unpersist() the temporary DataFrames (by using the --unpersist flag below) > and/or deleting and garbage-collecting the Python objects (by using the > --py-gc flag below), the Java objects seem to stay on and accumulate until > they exceed the JVM/driver memory. > The more complex the temporary DataFrames in each iteration (illustrated by > the --n-partitions flag below), the faster OOM occurs. 
> The typical error messages include: > - "java.lang.OutOfMemoryError : GC overhead limit exceeded" > - "Java heap space" > - "ERROR TransportRequestHandler: Error sending result > RpcResponse{requestId=6053742323219781 > 161, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 > cap=64]}} to /; closing connection" > Please suggest how I may overcome this so that we can have long-running > iterative programs using Spark that uses resources only up to a bounded, > controllable limit. > > {code:java} > from __future__ import print_function > import argparse > import gc > import pandas > import pyspark > arg_parser = argparse.ArgumentParser() > arg_parser.add_argument('--unpersist', action='store_true') > arg_parser.add_argument('--py-gc', action='store_true') > arg_parser.add_argument('--n-partitions', type=int, default=1000) > args = arg_parser.parse_args() > # create SparkSession (*** set spark.driver.memory to 512m in > spark-defaults.conf ***) > spark = pyspark.sql.SparkSession.builder \ > .config('spark.executor.instances', 2) \ > .config('spark.executor.cores', 2) \ > .config('spark.executor.memory', '512m') \ > .config('spark.ui.enabled', False) \ > .config('spark.ui.retainedJobs', 10) \ > .enableHiveSupport() \ > .getOrCreate() > # create Parquet file for subsequent repeated loading > df = spark.createDataFrame( > pandas.DataFrame( > dict( > row=range(args.n_partitions), > x=args.n_partitions * [0] > ) > ) > ) > parquet_path = '/tmp/TestOOM-{}Partitions.parquet'.format(args.n_partitions) > df.write.parquet( > path=parquet_path, > partitionBy='row', > mode='overwrite' > ) > i = 0 > # the below loop simulates an iterative algorithm that creates new DataFrames > in each iteration (e.g. 
sampling from a "mother" DataFrame), do something, > and never need those DataFrames again in future iterations > # we are having a problem cleaning up the built-up metadata > # hence the program will crash after while because of OOM > while True: > _df = spark.read.parquet(parquet_path) > if args.unpersist: > _df.unpersist() > if args.py_gc: > del _df > gc.collect() > i += 1; print('COMPLETED READ ITERATION #{}\n'.format(i)) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23246) (Py)Spark OOM because of iteratively accumulated metadata that cannot be cleared
[ https://issues.apache.org/jira/browse/SPARK-23246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344068#comment-16344068 ] MBA Learns to Code commented on SPARK-23246: [~srowen] the Java driver heap dump is attached above for your review. > (Py)Spark OOM because of iteratively accumulated metadata that cannot be > cleared > > > Key: SPARK-23246 > URL: https://issues.apache.org/jira/browse/SPARK-23246 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: MBA Learns to Code >Priority: Critical > Attachments: SparkProgramHeapDump.bin.tar.xz > > > I am having consistent OOM crashes when trying to use PySpark for iterative > algorithms in which I create new DataFrames per iteration (e.g. by sampling > from a "mother" DataFrame), do something with such DataFrames, and never need > such DataFrames ever in future iterations. > The below script simulates such OOM failures. Even when one tries explicitly > .unpersist() the temporary DataFrames (by using the --unpersist flag below) > and/or deleting and garbage-collecting the Python objects (by using the > --py-gc flag below), the Java objects seem to stay on and accumulate until > they exceed the JVM/driver memory. > The more complex the temporary DataFrames in each iteration (illustrated by > the --n-partitions flag below), the faster OOM occurs. > The typical error messages include: > - "java.lang.OutOfMemoryError : GC overhead limit exceeded" > - "Java heap space" > - "ERROR TransportRequestHandler: Error sending result > RpcResponse{requestId=6053742323219781 > 161, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 > cap=64]}} to /; closing connection" > Please suggest how I may overcome this so that we can have long-running > iterative programs using Spark that uses resources only up to a bounded, > controllable limit. 
> > {code:java} > from __future__ import print_function > import argparse > import gc > import pandas > import pyspark > arg_parser = argparse.ArgumentParser() > arg_parser.add_argument('--unpersist', action='store_true') > arg_parser.add_argument('--py-gc', action='store_true') > arg_parser.add_argument('--n-partitions', type=int, default=1000) > args = arg_parser.parse_args() > # create SparkSession (*** set spark.driver.memory to 512m in > spark-defaults.conf ***) > spark = pyspark.sql.SparkSession.builder \ > .config('spark.executor.instances', 2) \ > .config('spark.executor.cores', 2) \ > .config('spark.executor.memory', '512m') \ > .config('spark.ui.enabled', False) \ > .config('spark.ui.retainedJobs', 10) \ > .enableHiveSupport() \ > .getOrCreate() > # create Parquet file for subsequent repeated loading > df = spark.createDataFrame( > pandas.DataFrame( > dict( > row=range(args.n_partitions), > x=args.n_partitions * [0] > ) > ) > ) > parquet_path = '/tmp/TestOOM-{}Partitions.parquet'.format(args.n_partitions) > df.write.parquet( > path=parquet_path, > partitionBy='row', > mode='overwrite' > ) > i = 0 > # the below loop simulates an iterative algorithm that creates new DataFrames > in each iteration (e.g. sampling from a "mother" DataFrame), do something, > and never need those DataFrames again in future iterations > # we are having a problem cleaning up the built-up metadata > # hence the program will crash after while because of OOM > while True: > _df = spark.read.parquet(parquet_path) > if args.unpersist: > _df.unpersist() > if args.py_gc: > del _df > gc.collect() > i += 1; print('COMPLETED READ ITERATION #{}\n'.format(i)) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
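One mitigation pattern worth noting here, which is not proposed in the ticket itself: if driver-side metadata accumulates per iteration and cannot be cleared, accumulation can be bounded by recycling the session every N iterations. The sketch below is generic and hypothetical — `run_with_recycling`, `FakeSession`, and `make_session` are illustrative names only; in a real program the factory would rebuild a `pyspark.sql.SparkSession`.

```python
def run_with_recycling(make_session, work, iterations, recycle_every):
    """Run work(session, i) each iteration, replacing the session every
    recycle_every iterations so accumulated per-session state stays bounded."""
    session = make_session()
    restarts = 0
    try:
        for i in range(iterations):
            if i > 0 and i % recycle_every == 0:
                session.stop()          # release driver-side state
                session = make_session()
                restarts += 1
            work(session, i)
    finally:
        session.stop()
    return restarts


class FakeSession:
    """Stand-in for a SparkSession so the pattern can run without Spark."""
    def __init__(self):
        self.metadata = []   # grows each iteration, like the leaked metadata
        self.stopped = False

    def stop(self):
        self.stopped = True


sessions = []

def make_session():
    s = FakeSession()
    sessions.append(s)
    return s

def work(session, i):
    session.metadata.append(i)  # simulate per-iteration accumulation

restarts = run_with_recycling(make_session, work, iterations=10, recycle_every=4)
print(restarts)                                 # 2 (sessions replaced at i=4 and i=8)
print(max(len(s.metadata) for s in sessions))   # 4: accumulation is bounded per session
```

The trade-off is that any cached data and temporary views die with each recycled session, so this only fits algorithms whose per-iteration state is truly disposable.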
[jira] [Updated] (SPARK-23246) (Py)Spark OOM because of iteratively accumulated metadata that cannot be cleared
[ https://issues.apache.org/jira/browse/SPARK-23246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MBA Learns to Code updated SPARK-23246: --- Attachment: SparkProgramHeapDump.bin.tar.xz > (Py)Spark OOM because of iteratively accumulated metadata that cannot be > cleared > > > Key: SPARK-23246 > URL: https://issues.apache.org/jira/browse/SPARK-23246 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: MBA Learns to Code >Priority: Critical > Attachments: SparkProgramHeapDump.bin.tar.xz > > > I am having consistent OOM crashes when trying to use PySpark for iterative > algorithms in which I create new DataFrames per iteration (e.g. by sampling > from a "mother" DataFrame), do something with such DataFrames, and never need > such DataFrames again in future iterations. > The script below simulates such OOM failures. Even when one explicitly tries > to .unpersist() the temporary DataFrames (by using the --unpersist flag below) > and/or deleting and garbage-collecting the Python objects (by using the > --py-gc flag below), the Java objects seem to stay on and accumulate until > they exceed the JVM/driver memory. > The more complex the temporary DataFrames in each iteration (illustrated by > the --n-partitions flag below), the faster OOM occurs. > The typical error messages include: > - "java.lang.OutOfMemoryError : GC overhead limit exceeded" > - "Java heap space" > - "ERROR TransportRequestHandler: Error sending result > RpcResponse{requestId=6053742323219781 > 161, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 > cap=64]}} to /; closing connection" > Please suggest how I may overcome this so that we can have long-running > iterative programs using Spark that use resources only up to a bounded, > controllable limit. 
> > {code:java} > from __future__ import print_function > import argparse > import gc > import pandas > import pyspark > arg_parser = argparse.ArgumentParser() > arg_parser.add_argument('--unpersist', action='store_true') > arg_parser.add_argument('--py-gc', action='store_true') > arg_parser.add_argument('--n-partitions', type=int, default=1000) > args = arg_parser.parse_args() > # create SparkSession (*** set spark.driver.memory to 512m in > spark-defaults.conf ***) > spark = pyspark.sql.SparkSession.builder \ > .config('spark.executor.instances', 2) \ > .config('spark.executor.cores', 2) \ > .config('spark.executor.memory', '512m') \ > .config('spark.ui.enabled', False) \ > .config('spark.ui.retainedJobs', 10) \ > .enableHiveSupport() \ > .getOrCreate() > # create Parquet file for subsequent repeated loading > df = spark.createDataFrame( > pandas.DataFrame( > dict( > row=range(args.n_partitions), > x=args.n_partitions * [0] > ) > ) > ) > parquet_path = '/tmp/TestOOM-{}Partitions.parquet'.format(args.n_partitions) > df.write.parquet( > path=parquet_path, > partitionBy='row', > mode='overwrite' > ) > i = 0 > # the below loop simulates an iterative algorithm that creates new DataFrames > in each iteration (e.g. sampling from a "mother" DataFrame), do something, > and never need those DataFrames again in future iterations > # we are having a problem cleaning up the built-up metadata > # hence the program will crash after while because of OOM > while True: > _df = spark.read.parquet(parquet_path) > if args.unpersist: > _df.unpersist() > if args.py_gc: > del _df > gc.collect() > i += 1; print('COMPLETED READ ITERATION #{}\n'.format(i)) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23261) Rename Pandas UDFs
Xiao Li created SPARK-23261: --- Summary: Rename Pandas UDFs Key: SPARK-23261 URL: https://issues.apache.org/jira/browse/SPARK-23261 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li Rename the public APIs of pandas udfs from - PANDAS SCALAR UDF -> SCALAR PANDAS UDF - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF - PANDAS GROUP AGG UDF -> PANDAS UDAF -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23157) withColumn fails for a column that is a result of a mapped Dataset
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343962#comment-16343962 ] Henry Robinson commented on SPARK-23157: [~kretes] - I can see an argument for the behaviour you're describing, but that's apparently not the way the API is intended to work. Like Sean says, there are way too many ways to shoot yourself in the foot if you can stitch together arbitrary Datasets like this when they are column-wise incompatible, and allowing the relatively small subset of cases where it would work would lead to a more confusing API, IMO. The documentation for {{withColumn()}} could be updated to make this clearer; if I get a moment today I'll submit a PR. > withColumn fails for a column that is a result of a mapped Dataset > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Priority: Minor > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23260) remove V2 from the class name of data source reader/writer
[ https://issues.apache.org/jira/browse/SPARK-23260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343878#comment-16343878 ] Apache Spark commented on SPARK-23260: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20427 > remove V2 from the class name of data source reader/writer > -- > > Key: SPARK-23260 > URL: https://issues.apache.org/jira/browse/SPARK-23260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23260) remove V2 from the class name of data source reader/writer
[ https://issues.apache.org/jira/browse/SPARK-23260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23260: Assignee: Apache Spark (was: Wenchen Fan) > remove V2 from the class name of data source reader/writer > -- > > Key: SPARK-23260 > URL: https://issues.apache.org/jira/browse/SPARK-23260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23260) remove V2 from the class name of data source reader/writer
[ https://issues.apache.org/jira/browse/SPARK-23260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23260: Assignee: Wenchen Fan (was: Apache Spark) > remove V2 from the class name of data source reader/writer > -- > > Key: SPARK-23260 > URL: https://issues.apache.org/jira/browse/SPARK-23260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23260) remove V2 from the class name of data source reader/writer
Wenchen Fan created SPARK-23260: --- Summary: remove V2 from the class name of data source reader/writer Key: SPARK-23260 URL: https://issues.apache.org/jira/browse/SPARK-23260 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23207) Shuffle+Repartition on a DataFrame could lead to incorrect answers
[ https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343840#comment-16343840 ] Apache Spark commented on SPARK-23207: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/20426 > Shuffle+Repartition on a DataFrame could lead to incorrect answers > --- > > Key: SPARK-23207 > URL: https://issues.apache.org/jira/browse/SPARK-23207 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Blocker > Fix For: 2.3.0, 2.4.0 > > > Currently shuffle repartition uses RoundRobinPartitioning; the generated > result is nondeterministic since the ordering of input rows is not > determined. > The bug can be triggered when there is a repartition call following a shuffle > (which would lead to non-deterministic row ordering), as the pattern shows > below: > upstream stage -> repartition stage -> result stage > (-> indicates a shuffle) > When one of the executor processes goes down, some tasks on the repartition > stage will be retried and generate inconsistent ordering, and some tasks of > the result stage will be retried, generating different data. > The following code returns 931532, instead of 1000000: > {code} > import scala.sys.process._ > import org.apache.spark.TaskContext > val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x => > x > }.repartition(200).map { x => > if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) { > throw new Exception("pkill -f java".!!) > } > x > } > res.distinct().count() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
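The round-robin nondeterminism at the heart of SPARK-23207 can be illustrated without Spark. This is a plain-Python sketch, not Spark's implementation: round-robin assignment depends only on each row's position in the input sequence, so a retried task that produces the same rows in a different order places them in different partitions.

```python
def round_robin_partition(rows, num_partitions):
    """Assign each row to a partition by its position in the input sequence."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

first_attempt = round_robin_partition([1, 2, 3, 4, 5, 6], 2)
retry_attempt = round_robin_partition([2, 1, 3, 4, 6, 5], 2)  # same rows, reordered

print(first_attempt)  # [[1, 3, 5], [2, 4, 6]]
print(retry_attempt)  # [[2, 3, 6], [1, 4, 5]]
```

Both attempts see the same multiset of rows but produce different partition contents; if only some downstream tasks are retried, they read data that disagrees with the first attempt, which is how rows get lost or duplicated in the count above.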
[jira] [Assigned] (SPARK-23259) Clean up legacy code around hive external catalog
[ https://issues.apache.org/jira/browse/SPARK-23259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23259: Assignee: Apache Spark > Clean up legacy code around hive external catalog > - > > Key: SPARK-23259 > URL: https://issues.apache.org/jira/browse/SPARK-23259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Feng Liu >Assignee: Apache Spark >Priority: Major > > Some legacy code around the hive metastore catalog needs to be removed for > further code improvement: > # in HiveExternalCatalog: The `withClient` wrapper is not necessary for the > private method `getRawTable`. > # in HiveClientImpl: The statement `runSqlHive()` is not necessary for the > `addJar` method, after the jar is added to the single class loader. > # in HiveClientImpl: There is some redundant code in both the `tableExists` > and `getTableOption` methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23259) Clean up legacy code around hive external catalog
[ https://issues.apache.org/jira/browse/SPARK-23259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23259: Assignee: (was: Apache Spark) > Clean up legacy code around hive external catalog > - > > Key: SPARK-23259 > URL: https://issues.apache.org/jira/browse/SPARK-23259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Feng Liu >Priority: Major > > Some legacy code around the hive metastore catalog needs to be removed for > further code improvement: > # in HiveExternalCatalog: The `withClient` wrapper is not necessary for the > private method `getRawTable`. > # in HiveClientImpl: The statement `runSqlHive()` is not necessary for the > `addJar` method, after the jar is added to the single class loader. > # in HiveClientImpl: There is some redundant code in both the `tableExists` > and `getTableOption` methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23259) Clean up legacy code around hive external catalog
[ https://issues.apache.org/jira/browse/SPARK-23259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343814#comment-16343814 ] Apache Spark commented on SPARK-23259: -- User 'liufengdb' has created a pull request for this issue: https://github.com/apache/spark/pull/20425 > Clean up legacy code around hive external catalog > - > > Key: SPARK-23259 > URL: https://issues.apache.org/jira/browse/SPARK-23259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Feng Liu >Priority: Major > > Some legacy code around the hive metastore catalog needs to be removed for > further code improvement: > # in HiveExternalCatalog: The `withClient` wrapper is not necessary for the > private method `getRawTable`. > # in HiveClientImpl: The statement `runSqlHive()` is not necessary for the > `addJar` method, after the jar is added to the single class loader. > # in HiveClientImpl: There is some redundant code in both the `tableExists` > and `getTableOption` methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23259) Clean up legacy code around hive external catalog
Feng Liu created SPARK-23259: Summary: Clean up legacy code around hive external catalog Key: SPARK-23259 URL: https://issues.apache.org/jira/browse/SPARK-23259 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Feng Liu Some legacy code around the hive metastore catalog needs to be removed for further code improvement: # in HiveExternalCatalog: The `withClient` wrapper is not necessary for the private method `getRawTable`. # in HiveClientImpl: The statement `runSqlHive()` is not necessary for the `addJar` method, after the jar is added to the single class loader. # in HiveClientImpl: There is some redundant code in both the `tableExists` and `getTableOption` methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23240) PythonWorkerFactory issues unhelpful message when pyspark.daemon produces bogus stdout
[ https://issues.apache.org/jira/browse/SPARK-23240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23240: Assignee: (was: Apache Spark) > PythonWorkerFactory issues unhelpful message when pyspark.daemon produces > bogus stdout > -- > > Key: SPARK-23240 > URL: https://issues.apache.org/jira/browse/SPARK-23240 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >Reporter: Bruce Robbins >Priority: Minor > > Environmental issues or site-local customizations (e.g., sitecustomize.py > present in the python install directory) can interfere with daemon.py’s > output to stdout. PythonWorkerFactory produces unhelpful messages when this > happens, causing some head scratching before the actual issue is determined. > Case #1: Extraneous data in pyspark.daemon’s stdout. In this case, > PythonWorkerFactory uses the output as the daemon’s port number and ends up > throwing an exception when creating the socket: > {noformat} > java.lang.IllegalArgumentException: port out of range:1819239265 > at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143) > at java.net.InetSocketAddress.(InetSocketAddress.java:188) > at java.net.Socket.(Socket.java:244) > at > org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:78) > {noformat} > Case #2: No data in pyspark.daemon’s stdout. In this case, > PythonWorkerFactory throws an EOFException while reading from the > Process input stream. > The second case is somewhat less mysterious than the first, because > PythonWorkerFactory also displays the stderr from the python process. > When there is unexpected or missing output in pyspark.daemon’s stdout, > PythonWorkerFactory should say so. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23240) PythonWorkerFactory issues unhelpful message when pyspark.daemon produces bogus stdout
[ https://issues.apache.org/jira/browse/SPARK-23240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23240: Assignee: Apache Spark > PythonWorkerFactory issues unhelpful message when pyspark.daemon produces > bogus stdout > -- > > Key: SPARK-23240 > URL: https://issues.apache.org/jira/browse/SPARK-23240 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Minor > > Environmental issues or site-local customizations (e.g., sitecustomize.py > present in the python install directory) can interfere with daemon.py’s > output to stdout. PythonWorkerFactory produces unhelpful messages when this > happens, causing some head scratching before the actual issue is determined. > Case #1: Extraneous data in pyspark.daemon’s stdout. In this case, > PythonWorkerFactory uses the output as the daemon’s port number and ends up > throwing an exception when creating the socket: > {noformat} > java.lang.IllegalArgumentException: port out of range:1819239265 > at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143) > at java.net.InetSocketAddress.(InetSocketAddress.java:188) > at java.net.Socket.(Socket.java:244) > at > org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:78) > {noformat} > Case #2: No data in pyspark.daemon’s stdout. In this case, > PythonWorkerFactory throws an EOFException while reading from the > Process input stream. > The second case is somewhat less mysterious than the first, because > PythonWorkerFactory also displays the stderr from the python process. > When there is unexpected or missing output in pyspark.daemon’s stdout, > PythonWorkerFactory should say so. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23240) PythonWorkerFactory issues unhelpful message when pyspark.daemon produces bogus stdout
[ https://issues.apache.org/jira/browse/SPARK-23240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343792#comment-16343792 ] Apache Spark commented on SPARK-23240: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/20424 > PythonWorkerFactory issues unhelpful message when pyspark.daemon produces > bogus stdout > -- > > Key: SPARK-23240 > URL: https://issues.apache.org/jira/browse/SPARK-23240 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >Reporter: Bruce Robbins >Priority: Minor > > Environmental issues or site-local customizations (e.g., sitecustomize.py > present in the python install directory) can interfere with daemon.py’s > output to stdout. PythonWorkerFactory produces unhelpful messages when this > happens, causing some head scratching before the actual issue is determined. > Case #1: Extraneous data in pyspark.daemon’s stdout. In this case, > PythonWorkerFactory uses the output as the daemon’s port number and ends up > throwing an exception when creating the socket: > {noformat} > java.lang.IllegalArgumentException: port out of range:1819239265 > at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143) > at java.net.InetSocketAddress.(InetSocketAddress.java:188) > at java.net.Socket.(Socket.java:244) > at > org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:78) > {noformat} > Case #2: No data in pyspark.daemon’s stdout. In this case, > PythonWorkerFactory throws an EOFException while reading from the > Process input stream. > The second case is somewhat less mysterious than the first, because > PythonWorkerFactory also displays the stderr from the python process. > When there is unexpected or missing output in pyspark.daemon’s stdout, > PythonWorkerFactory should say so. 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
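The friendlier diagnostic the ticket asks for can be sketched in a few lines. This is a hypothetical helper, not the actual PythonWorkerFactory code (which is Scala): instead of blindly treating whatever appears first on the daemon's stdout as a port number, it validates the value and includes the raw output in the error message.

```python
def read_daemon_port(stdout_line):
    """Parse a daemon port announcement, failing with a diagnostic that
    shows the unexpected stdout instead of a bare 'port out of range'."""
    text = stdout_line.strip() if stdout_line else ""
    if not text:
        raise RuntimeError("pyspark.daemon produced no output on stdout; "
                           "expected a port number")
    try:
        port = int(text)
    except ValueError:
        raise RuntimeError("unexpected output from pyspark.daemon stdout: "
                           f"{text!r} (expected a port number)") from None
    if not (0 < port <= 65535):
        raise RuntimeError(f"pyspark.daemon announced invalid port {port}; "
                           "stdout may be polluted (e.g. by sitecustomize.py)")
    return port

print(read_daemon_port("45123\n"))  # 45123
```

Fed the value from the stack trace above ("1819239265"), this raises an error naming the bogus value rather than an IllegalArgumentException deep in socket construction.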
[jira] [Commented] (SPARK-22221) Add User Documentation for Working with Arrow in Spark
[ https://issues.apache.org/jira/browse/SPARK-22221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343785#comment-16343785 ] Apache Spark commented on SPARK-22221: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/20423 > Add User Documentation for Working with Arrow in Spark > -- > > Key: SPARK-22221 > URL: https://issues.apache.org/jira/browse/SPARK-22221 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.3.0 > > > There needs to be user facing documentation that will show how to enable/use > Arrow with Spark, what the user should expect, and describe any differences > with similar existing functionality. > A comment from Xiao Li on https://github.com/apache/spark/pull/18664 > Given the users/applications contain the Timestamp in their Dataset and their > processing algorithms also need to have the codes based on the corresponding > time-zone related assumptions. > * For the new users/applications, they first enabled Arrow and later hit an > Arrow bug? Can they simply turn off spark.sql.execution.arrow.enable? If not, > what should they do? > * For the existing users/applications, they want to utilize Arrow for better > performance. Can they just turn on spark.sql.execution.arrow.enable? What > should they do? > Note Hopefully, the guides/solutions are user-friendly. That means, it must > be very simple to understand for most users. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22221) Add User Documentation for Working with Arrow in Spark
[ https://issues.apache.org/jira/browse/SPARK-22221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22221. - Resolution: Fixed Assignee: Bryan Cutler Fix Version/s: 2.3.0 > Add User Documentation for Working with Arrow in Spark > -- > > Key: SPARK-22221 > URL: https://issues.apache.org/jira/browse/SPARK-22221 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.3.0 > > > There needs to be user facing documentation that will show how to enable/use > Arrow with Spark, what the user should expect, and describe any differences > with similar existing functionality. > A comment from Xiao Li on https://github.com/apache/spark/pull/18664 > Given the users/applications contain the Timestamp in their Dataset and their > processing algorithms also need to have the codes based on the corresponding > time-zone related assumptions. > * For the new users/applications, they first enabled Arrow and later hit an > Arrow bug? Can they simply turn off spark.sql.execution.arrow.enable? If not, > what should they do? > * For the existing users/applications, they want to utilize Arrow for better > performance. Can they just turn on spark.sql.execution.arrow.enable? What > should they do? > Note Hopefully, the guides/solutions are user-friendly. That means, it must > be very simple to understand for most users. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23258) Should not split Arrow record batches based on row count
Bryan Cutler created SPARK-23258: Summary: Should not split Arrow record batches based on row count Key: SPARK-23258 URL: https://issues.apache.org/jira/browse/SPARK-23258 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Bryan Cutler Currently when executing scalar {{pandas_udf}} or using {{toPandas()}} the Arrow record batches are split up once the record count reaches a max value, which is configured with "spark.sql.execution.arrow.maxRecordsPerBatch". This is not ideal because the number of columns is not taken into account and if there are many columns, then OOMs can occur. An alternative approach could be to look at the size of the Arrow buffers being used and cap it at a certain size. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
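The size-capped alternative SPARK-23258 suggests can be sketched with assumed semantics (this is not Spark's implementation; `batch_by_size` and `row_size_fn` are hypothetical names): group rows into batches whose estimated byte size stays under a cap, so wide rows yield smaller batches instead of a fixed maxRecordsPerBatch regardless of width.

```python
def batch_by_size(rows, row_size_fn, max_batch_bytes):
    """Group rows into batches whose estimated total size stays at or under
    max_batch_bytes; each batch always holds at least one row."""
    batches, current, current_bytes = [], [], 0
    for row in rows:
        size = row_size_fn(row)
        if current and current_bytes + size > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(row)
        current_bytes += size
    if current:
        batches.append(current)
    return batches

# Narrow rows (8 bytes each) pack many per batch; wide rows (40 bytes each)
# pack few, rather than both being cut at the same record count.
narrow = batch_by_size(range(10), lambda r: 8, max_batch_bytes=32)
wide = batch_by_size(range(10), lambda r: 40, max_batch_bytes=32)
print([len(b) for b in narrow])  # [4, 4, 2]
print([len(b) for b in wide])    # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

In practice, the estimate would come from the sizes of the Arrow buffers being filled, as the ticket proposes, rather than a per-row callback.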
[jira] [Commented] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343693#comment-16343693 ] Marcelo Vanzin commented on SPARK-23020: :-/ It's getting harder and harder to reproduce these races locally... this one may take a while. > Re-enable Flaky Test: > org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Sameer Agarwal >Assignee: Marcelo Vanzin >Priority: Blocker > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23109) ML 2.3 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16332698#comment-16332698 ] Bryan Cutler edited comment on SPARK-23109 at 1/29/18 5:26 PM: --- I did the following: generated HTML doc and checked for consistency with Scala, did not see any API breaking changes, checked for missing items (see list below), checked default param values match. No blocking or major issues found. Items requiring follow up, I will create (related) JIRAs to fix: classification: GBTClassifier - missing featureSubsetStrategy, should be moved to TreeEnsembleParams GBTClassificationModel - missing numClasses, should inherit from JavaClassificationModel for both of the above SPARK-23161 clustering: GaussianMixtureModel - missing gaussians, need to serialize Array[MultivariateGaussian]? LDAModel - missing topicsMatrix - can send Matrix through Py4J? evaluation: ClusteringEvaluator - DOC describe silhouette like scaladoc feature: Bucketizer - multiple input/output cols, splitsArray - SPARK-22797 ChiSqSelector - DOC selectorType desc missing new types QuantileDiscretizer - multiple input/output cols - SPARK-22796 fpm: DOC associationRules should say return "DataFrame" image: missing columnSchema, get*, scala missing toNDArray - SPARK-23256 regression: LinearRegressionSummary - missing r2adj - SPARK-23162 stat: missing Summarizer class - SPARK-21741 tuning: missing subModels, hasSubModels - SPARK-22005 for the above DOC issues SPARK-23163 was (Author: bryanc): I did the following: generated HTML doc and checked for consistency with Scala, did not see any API breaking changes, checked for missing items (see list below), checked default param values match. No blocking or major issues found. 
Items requiring follow-up, I will create (related) JIRAs to fix: classification: GBTClassifier - missing featureSubsetStrategy, should be moved to TreeEnsembleParams GBTClassificationModel - missing numClasses, should inherit from JavaClassificationModel for both of the above https://issues.apache.org/jira/browse/SPARK-23161 clustering: GaussianMixtureModel - missing gaussians, need to serialize Array[MultivariateGaussian]? LDAModel - missing topicsMatrix - can send Matrix through Py4J? evaluation: ClusteringEvaluator - DOC describe silhouette like scaladoc feature: Bucketizer - multiple input/output cols, splitsArray - https://issues.apache.org/jira/browse/SPARK-22797 ChiSqSelector - DOC selectorType desc missing new types QuantileDiscretizer - multiple input/output cols - https://issues.apache.org/jira/browse/SPARK-22796 fpm: DOC associationRules should say return "DataFrame" image: missing columnSchema, get*, scala missing toNDArray - SPARK-23256 regression: LinearRegressionSummary - missing r2adj - https://issues.apache.org/jira/browse/SPARK-23162 stat: missing Summarizer class - https://issues.apache.org/jira/browse/SPARK-21741 tuning: missing subModels, hasSubModels - https://issues.apache.org/jira/browse/SPARK-22005 for the above DOC issues https://issues.apache.org/jira/browse/SPARK-23163 > ML 2.3 QA: API: Python API coverage > --- > > Key: SPARK-23109 > URL: https://issues.apache.org/jira/browse/SPARK-23109 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Bryan Cutler >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? 
> * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.*
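The QuantileDiscretizer and Bucketizer parity items above concern computing quantile-based split points per input column and assigning each value to a bucket. As a rough single-column illustration of that operation (illustrative only, not Spark's implementation, which computes approximate quantiles distributed across the cluster; the function names here are hypothetical):

```python
import bisect

def quantile_splits(values, num_buckets):
    """Interior split points at the 1/k, 2/k, ... quantiles (illustrative only)."""
    s = sorted(values)
    return [s[(len(s) * i) // num_buckets] for i in range(1, num_buckets)]

def bucketize(value, splits):
    """Bucket index for a value, given ascending split points."""
    return bisect.bisect_right(splits, value)

vals = [1, 3, 5, 7, 9, 11, 13, 15]
splits = quantile_splits(vals, 4)   # [5, 9, 13]
print(bucketize(9, splits))         # 2
```

The multi-column variant applies the same per-column computation to each input column, which is also why a single {{numBuckets}} value can sensibly be shared across all columns.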
[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343665#comment-16343665 ] Bryan Cutler commented on SPARK-23109: -- Thanks [~mlnick], yes this is done. > ML 2.3 QA: API: Python API coverage > --- > > Key: SPARK-23109 > URL: https://issues.apache.org/jira/browse/SPARK-23109 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Bryan Cutler >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23109) ML 2.3 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-23109. -- Resolution: Done > ML 2.3 QA: API: Python API coverage > --- > > Key: SPARK-23109 > URL: https://issues.apache.org/jira/browse/SPARK-23109 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Bryan Cutler >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17006) WithColumn Performance Degrades with Number of Invocations
[ https://issues.apache.org/jira/browse/SPARK-17006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17006. --- Resolution: Fixed Assignee: Herman van Hovell Fix Version/s: 2.3.0 > WithColumn Performance Degrades with Number of Invocations > -- > > Key: SPARK-17006 > URL: https://issues.apache.org/jira/browse/SPARK-17006 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hamel Ajay Kothari >Assignee: Herman van Hovell >Priority: Major > Fix For: 2.3.0 > > > Consider the following test case. We create a dataframe with 100 withColumn > statements, then 100 more, then 100 more, then 100 more. Each time we do this > it gets slower pretty drastically. If we sub in the optimized plan, we end up > with drastically better performance. > Consider the following code: > {code} > val raw = sc.parallelize(Range(1, 100)).toDF > val s1 = System.nanoTime() > var mapped = Range(1, 100).foldLeft(raw) { (df, i) => > df.withColumn(s"value${i}", df("value") + i) > } > val s2 = System.nanoTime() > val mapped2 = Range(1, 100).foldLeft(mapped) { (df, i) => > df.withColumn(s"value${i}_2", df("value") + i) > } > val s3 = System.nanoTime() > val mapped3 = Range(1, 100).foldLeft(mapped2) { (df, i) => > df.withColumn(s"value${i}_3", df("value") + i) > } > val s4 = System.nanoTime() > val mapped4 = Range(1, 100).foldLeft(mapped3) { (df, i) => > df.withColumn(s"value${i}_4", df("value") + i) > } > val s5 = System.nanoTime() > val plan = mapped3.queryExecution.optimizedPlan > val optimizedMapped3 = new org.apache.spark.sql.DataFrame(spark, plan, > org.apache.spark.sql.catalyst.encoders.RowEncoder(mapped3.schema)) > val s6 = System.nanoTime() > val mapped5 = Range(1, 100).foldLeft(optimizedMapped3) { (df, i) => > df.withColumn(s"value${i}_4", df("value") + i) > } > val s7 = System.nanoTime() > val mapped6 = Range(1, 100).foldLeft(mapped3) { (df, i) => > df.withColumn(s"value${i}_4", df("value") + i) > 
} > val s8 = System.nanoTime() > val plan = mapped3.queryExecution.analyzed > val analyzedMapped4 = new org.apache.spark.sql.DataFrame(spark, plan, > org.apache.spark.sql.catalyst.encoders.RowEncoder(mapped3.schema)) > val mapped7 = Range(1, 100).foldLeft(analyzedMapped4) { (df, i) => > df.withColumn(s"value${i}_4", df("value") + i) > } > val s9 = System.nanoTime() > val secondsToNanos = 1000*1000*1000.0 > val stage1 = (s2-s1)/secondsToNanos > val stage2 = (s3-s2)/secondsToNanos > val stage3 = (s4-s3)/secondsToNanos > val stage4 = (s5-s4)/secondsToNanos > val stage5 = (s6-s5)/secondsToNanos > val stage6 = (s7-s6)/secondsToNanos > val stage7 = (s8-s7)/secondsToNanos > val stage8 = (s9-s8)/secondsToNanos > println(s"First 100: ${stage1}") > println(s"Second 100: ${stage2}") > println(s"Third 100: ${stage3}") > println(s"Fourth 100: ${stage4}") > println(s"Fourth 100 Optimization time: ${stage5}") > println(s"Fourth 100 Optimized ${stage6}") > println(s"Fourth Unoptimized (to make sure no caching/etc takes place, > reusing analyzed etc: ${stage7}") > println(s"Fourth selects: ${stage8}") > {code} > This results in the following performance: > {code} > First 100: 4.873489454 > Second 100: 14.982028303 seconds > Third 100: 38.775467952 seconds > Fourth 100: 73.429119675 seconds > Fourth 100 Optimization time: 1.777374175 seconds > Fourth 100 Optimized 22.514489934 seconds > Fourth Unoptimized (to make sure no caching/etc takes place, reusing analyzed > etc: 69.616112734 seconds > Fourth 100 using analyzed plan: 67.641982709 seconds > {code} > Now, I suspect that we can't just sub in the optimized plan for the logical > plan because we lose a bunch of information which may be useful for > optimization later. But, I suspect there's something we can do in the case of > Projects at least that might be useful. 
[jira] [Resolved] (SPARK-23223) Stacking dataset transforms performs poorly
[ https://issues.apache.org/jira/browse/SPARK-23223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-23223. --- Resolution: Fixed Fix Version/s: 2.3.0 > Stacking dataset transforms performs poorly > --- > > Key: SPARK-23223 > URL: https://issues.apache.org/jira/browse/SPARK-23223 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Major > Fix For: 2.3.0 > > > It is a common pattern to apply multiple transforms to a {{Dataset}} (using > {{Dataset.withColumn}}, for example). This is currently quite expensive because > we run {{CheckAnalysis}} on the full plan and create an encoder for each > intermediate {{Dataset}}. > {{CheckAnalysis}} only needs to be run for the newly added plan components, > and not for the full plan. The addition of the {{AnalysisBarrier}} created > this issue.
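The cost described here can be illustrated with a toy model (hypothetical numbers, not Spark's actual analyzer): if each stacked transform re-runs analysis over the whole plan built so far, adding n columns costs O(n^2) node visits, whereas checking only newly added components, or one pass over the final plan, is linear:

```python
def incremental_analysis_cost(n_columns, base_plan_size=1):
    """Node visits when each transform re-analyzes the whole plan so far."""
    size, cost = base_plan_size, 0
    for _ in range(n_columns):
        size += 1      # plan grows by one Project node
        cost += size   # analysis revisits every node in the plan
    return cost

def batch_analysis_cost(n_columns, base_plan_size=1):
    """Node visits when only the newly built plan is analyzed once."""
    return base_plan_size + n_columns

print(incremental_analysis_cost(100))  # 5150: quadratic growth
print(batch_analysis_cost(100))        # 101: linear
```

This matches the benchmark pattern reported in SPARK-17006, where each additional batch of 100 {{withColumn}} calls takes noticeably longer than the previous one.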
[jira] [Resolved] (SPARK-23059) Correct some improper with view related method usage
[ https://issues.apache.org/jira/browse/SPARK-23059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23059. - Resolution: Fixed Fix Version/s: 2.4.0 > Correct some improper with view related method usage > > > Key: SPARK-23059 > URL: https://issues.apache.org/jira/browse/SPARK-23059 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.1 >Reporter: xubo245 >Priority: Minor > Fix For: 2.4.0 > > > And correct some improper usage like: > {code:java} > test("list global temp views") { > try { > sql("CREATE GLOBAL TEMP VIEW v1 AS SELECT 3, 4") > sql("CREATE TEMP VIEW v2 AS SELECT 1, 2") > checkAnswer(sql(s"SHOW TABLES IN $globalTempDB"), > Row(globalTempDB, "v1", true) :: > Row("", "v2", true) :: Nil) > > assert(spark.catalog.listTables(globalTempDB).collect().toSeq.map(_.name) == > Seq("v1", "v2")) > } finally { > spark.catalog.dropTempView("v1") > spark.catalog.dropGlobalTempView("v2") > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23059) Correct some improper with view related method usage
[ https://issues.apache.org/jira/browse/SPARK-23059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-23059: --- Assignee: xubo245 > Correct some improper with view related method usage > > > Key: SPARK-23059 > URL: https://issues.apache.org/jira/browse/SPARK-23059 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.1 >Reporter: xubo245 >Assignee: xubo245 >Priority: Minor > Fix For: 2.4.0 > > > And correct some improper usage like: > {code:java} > test("list global temp views") { > try { > sql("CREATE GLOBAL TEMP VIEW v1 AS SELECT 3, 4") > sql("CREATE TEMP VIEW v2 AS SELECT 1, 2") > checkAnswer(sql(s"SHOW TABLES IN $globalTempDB"), > Row(globalTempDB, "v1", true) :: > Row("", "v2", true) :: Nil) > > assert(spark.catalog.listTables(globalTempDB).collect().toSeq.map(_.name) == > Seq("v1", "v2")) > } finally { > spark.catalog.dropTempView("v1") > spark.catalog.dropGlobalTempView("v2") > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23199) improved Removes repetition from group expressions in Aggregate
[ https://issues.apache.org/jira/browse/SPARK-23199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23199. - Resolution: Fixed Assignee: caoxuewen Fix Version/s: 2.3.0 > improved Removes repetition from group expressions in Aggregate > --- > > Key: SPARK-23199 > URL: https://issues.apache.org/jira/browse/SPARK-23199 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: caoxuewen >Assignee: caoxuewen >Priority: Major > Fix For: 2.3.0 > > > Currently, all Aggregate operations go through > RemoveRepetitionFromGroupExpressions, but when there are no group expressions, or > no duplicate group expressions among them, there is no need to copy the logical > plan.
[jira] [Resolved] (SPARK-23219) Rename ReadTask to DataReaderFactory
[ https://issues.apache.org/jira/browse/SPARK-23219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23219. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20397 [https://github.com/apache/spark/pull/20397] > Rename ReadTask to DataReaderFactory > > > Key: SPARK-23219 > URL: https://issues.apache.org/jira/browse/SPARK-23219 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23219) Rename ReadTask to DataReaderFactory
[ https://issues.apache.org/jira/browse/SPARK-23219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23219: --- Assignee: Gengliang Wang > Rename ReadTask to DataReaderFactory > > > Key: SPARK-23219 > URL: https://issues.apache.org/jira/browse/SPARK-23219 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20129) JavaSparkContext should use SparkContext.getOrCreate
[ https://issues.apache.org/jira/browse/SPARK-20129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20129. --- Resolution: Won't Fix Assignee: (was: Xiangrui Meng) Per PR discussion, I believe this should simply be Won't Fix. > JavaSparkContext should use SparkContext.getOrCreate > > > Key: SPARK-20129 > URL: https://issues.apache.org/jira/browse/SPARK-20129 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Minor > > It should re-use an existing SparkContext if there is a live one. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
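The getOrCreate pattern requested above — reuse a live context if one exists instead of constructing a second one — can be sketched generically as follows (illustrative Python, not the actual SparkContext code; the class and method names are hypothetical):

```python
class Context:
    """Minimal singleton-style context with a getOrCreate entry point."""
    _active = None

    def __init__(self, name):
        self.name = name

    @classmethod
    def get_or_create(cls, name="default"):
        # Reuse the live instance if one exists; otherwise create and register it.
        if cls._active is None:
            cls._active = cls(name)
        return cls._active

a = Context.get_or_create("app")
b = Context.get_or_create("other")
print(a is b)    # True: the existing context is reused
print(b.name)    # app: the second name is ignored
```

A subtlety noted in such designs is that configuration passed to later calls is silently ignored once a context is live, which is part of why the PR discussion concluded with Won't Fix.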
[jira] [Commented] (SPARK-23252) When NodeManager and CoarseGrainedExecutorBackend processes are killed, the job will be blocked
[ https://issues.apache.org/jira/browse/SPARK-23252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343358#comment-16343358 ] Sean Owen commented on SPARK-23252: --- That much looks normal if the executor is removed and the tasks relaunched. What happens next? > When NodeManager and CoarseGrainedExecutorBackend processes are killed, the > job will be blocked > --- > > Key: SPARK-23252 > URL: https://issues.apache.org/jira/browse/SPARK-23252 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Bang Xiao >Priority: Major > > This happens when 'spark.dynamicAllocation.enabled' is set to 'true'. We > use Yarn as our resource manager. > 1. spark-submit the "JavaWordCount" application in yarn-client mode > 2. Kill the NodeManager and CoarseGrainedExecutorBackend processes on one node > while the job is in stage 0 > If we kill only the CoarseGrainedExecutorBackend processes on that node, the TaskSetManager > marks the failed tasks as pending and resubmits them. But if the NodeManager and > CoarseGrainedExecutorBackend processes are killed simultaneously, the whole job > is blocked.
[jira] [Created] (SPARK-23257) Implement Kerberos Support in Kubernetes resource manager
Rob Keevil created SPARK-23257: -- Summary: Implement Kerberos Support in Kubernetes resource manager Key: SPARK-23257 URL: https://issues.apache.org/jira/browse/SPARK-23257 Project: Spark Issue Type: Wish Components: Kubernetes Affects Versions: 2.3.0 Reporter: Rob Keevil On the forked k8s branch of Spark at [https://github.com/apache-spark-on-k8s/spark/pull/540] , Kerberos support has been added to the Kubernetes resource manager. The Kubernetes code between these two repositories appears to have diverged, so this commit cannot be merged in simply. Are there any plans to re-implement this work on the main Spark repository? [ifilonenko|https://github.com/ifilonenko] I could not find any discussion about this specific topic online. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23252) When NodeManager and CoarseGrainedExecutorBackend processes are killed, the job will be blocked
[ https://issues.apache.org/jira/browse/SPARK-23252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343317#comment-16343317 ] Bang Xiao commented on SPARK-23252: --- [~srowen] it seems the job waits for the results of those tasks that have failed, but it will never get those results since the failed tasks have not been resubmitted. When the NodeManager and CoarseGrainedExecutorBackend processes are killed simultaneously, the following log appears on the driver end: {code:java} 18/01/29 14:32:35 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 24. 18/01/29 14:32:35 INFO DAGScheduler: Executor lost: 24 (epoch 1) 18/01/29 14:32:35 ERROR TransportClient: Failed to send RPC 8557067791977911361 to /10.142.103.168:14733: java.nio.channels.ClosedChannelException java.nio.channels.ClosedChannelException at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) 18/01/29 14:32:35 ERROR TransportClient: Failed to send RPC 7460664886971675621 to /10.142.103.168:14751: java.nio.channels.ClosedChannelException java.nio.channels.ClosedChannelException at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) 18/01/29 14:32:35 ERROR TransportClient: Failed to send RPC 5802441956021450458 to /10.142.103.168:14750: java.nio.channels.ClosedChannelException java.nio.channels.ClosedChannelException at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) 18/01/29 14:32:35 ERROR TransportClient: Failed to send RPC 9203205043102726551 to /10.142.103.168:14739: java.nio.channels.ClosedChannelException java.nio.channels.ClosedChannelException at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) 18/01/29 14:32:35 ERROR TransportClient: Failed to send RPC 5217847872442409416 to /10.142.103.168:14744: java.nio.channels.ClosedChannelException java.nio.channels.ClosedChannelException at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) 
18/01/29 14:32:35 INFO BlockManagerMasterEndpoint: Trying to remove executor 24 from BlockManagerMaster. 18/01/29 14:32:35 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(24, rsync.slave06.jupiter.zw.ted, 46509, None){code} > When NodeManager and CoarseGrainedExecutorBackend processes are killed, the > job will be blocked > --- > > Key: SPARK-23252 > URL: https://issues.apache.org/jira/browse/SPARK-23252 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Bang Xiao >Priority: Major > > This happens when 'spark.dynamicAllocation.enabled' is set to 'true'. We > use Yarn as our resource manager. > 1. spark-submit the "JavaWordCount" application in yarn-client mode > 2. Kill the NodeManager and CoarseGrainedExecutorBackend processes on one node > while the job is in stage 0 > If we kill only the CoarseGrainedExecutorBackend processes on that node, the TaskSetManager > marks the failed tasks as pending and resubmits them. But if the NodeManager and > CoarseGrainedExecutorBackend processes are killed simultaneously, the whole job > is blocked.
[jira] [Commented] (SPARK-23252) When NodeManager and CoarseGrainedExecutorBackend processes are killed, the job will be blocked
[ https://issues.apache.org/jira/browse/SPARK-23252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343288#comment-16343288 ] Sean Owen commented on SPARK-23252: --- Blocked how? Waiting for the NodeManager? YARN would know the NM is down shortly. > When NodeManager and CoarseGrainedExecutorBackend processes are killed, the > job will be blocked > --- > > Key: SPARK-23252 > URL: https://issues.apache.org/jira/browse/SPARK-23252 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Bang Xiao >Priority: Major > > This happens when 'spark.dynamicAllocation.enabled' is set to 'true'. We > use Yarn as our resource manager. > 1. spark-submit the "JavaWordCount" application in yarn-client mode > 2. Kill the NodeManager and CoarseGrainedExecutorBackend processes on one node > while the job is in stage 0 > If we kill only the CoarseGrainedExecutorBackend processes on that node, the TaskSetManager > marks the failed tasks as pending and resubmits them. But if the NodeManager and > CoarseGrainedExecutorBackend processes are killed simultaneously, the whole job > is blocked.
[jira] [Assigned] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-23108: -- Assignee: Nick Pentreath > ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-23108 > URL: https://issues.apache.org/jira/browse/SPARK-23108 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343278#comment-16343278 ] Nick Pentreath edited comment on SPARK-23108 at 1/29/18 12:14 PM: -- Went through {{Experimental}} APIs, there could be a case for: * {{Regression / Binary / Multiclass}} evaluators as they've been around for a long time. * Linear regression summary (since {{1.5.0}}). * {{AFTSurvivalRegression}} (since {{1.6.0}}). I think at this late stage we should not open up anything, unless anyone feels very strongly? was (Author: mlnick): I think at this late stage we should not open up anything, unless anyone feels very strongly? > ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-23108 > URL: https://issues.apache.org/jira/browse/SPARK-23108 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-23108. Resolution: Resolved Fix Version/s: 2.3.0 > ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-23108 > URL: https://issues.apache.org/jira/browse/SPARK-23108 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath >Priority: Blocker > Fix For: 2.3.0 > > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343290#comment-16343290 ] Nick Pentreath commented on SPARK-23108: Also checked ml {{DeveloperAPI}}, nothing to graduate there I would say. > ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-23108 > URL: https://issues.apache.org/jira/browse/SPARK-23108 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23238) Externalize SQLConf spark.sql.execution.arrow.enabled
[ https://issues.apache.org/jira/browse/SPARK-23238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23238: - Fix Version/s: 2.3.0 > Externalize SQLConf spark.sql.execution.arrow.enabled > -- > > Key: SPARK-23238 > URL: https://issues.apache.org/jira/browse/SPARK-23238 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.0 > >
[jira] [Resolved] (SPARK-23238) Externalize SQLConf spark.sql.execution.arrow.enabled
[ https://issues.apache.org/jira/browse/SPARK-23238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23238. -- Resolution: Fixed Fixed in https://github.com/apache/spark/pull/20403 > Externalize SQLConf spark.sql.execution.arrow.enabled > -- > > Key: SPARK-23238 > URL: https://issues.apache.org/jira/browse/SPARK-23238 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Hyukjin Kwon >Priority: Major >
[jira] [Assigned] (SPARK-23238) Externalize SQLConf spark.sql.execution.arrow.enabled
[ https://issues.apache.org/jira/browse/SPARK-23238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23238: Assignee: Hyukjin Kwon > Externalize SQLConf spark.sql.execution.arrow.enabled > -- > > Key: SPARK-23238 > URL: https://issues.apache.org/jira/browse/SPARK-23238 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.0 > >
[jira] [Commented] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343278#comment-16343278 ] Nick Pentreath commented on SPARK-23108: I think at this late stage we should not open up anything, unless anyone feels very strongly? > ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-23108 > URL: https://issues.apache.org/jira/browse/SPARK-23108 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343279#comment-16343279 ] Sean Owen commented on SPARK-23157: --- Agree this should not work. You are selecting a column from a different Dataset. Having it happen to work because the number of columns matches, or because the function is the identity, sounds as much like a way to write bugs as a convenience. > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Priority: Minor > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
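The failure and the natural workaround can be sketched in spark-shell (a sketch only, assuming the usual `spark` session from the shell; it mirrors the report above rather than describing new behavior):

```scala
case class R(id: String)
val ds = spark.createDataset(Seq(R("1")))

// Fails with AnalysisException: the column belongs to the Dataset
// returned by map(), not to `ds`, so the analyzer cannot resolve it
// against ds's logical plan.
// ds.withColumn("n", ds.map(a => a).col("id"))

// Works: resolve the column against the same Dataset it came from.
val mapped = ds.map(a => a)
mapped.withColumn("n", mapped.col("id")) // DataFrame [id: string, n: string]
```

As the comment above notes, making the cross-Dataset form work only when column counts happen to line up would invite subtle bugs.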
[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343276#comment-16343276 ] Nick Pentreath commented on SPARK-23109: Created SPARK-23256 to track {{columnSchema}} in Python API. > ML 2.3 QA: API: Python API coverage > --- > > Key: SPARK-23109 > URL: https://issues.apache.org/jira/browse/SPARK-23109 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Bryan Cutler >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23256) Add columnSchema method to PySpark image reader
Nick Pentreath created SPARK-23256: -- Summary: Add columnSchema method to PySpark image reader Key: SPARK-23256 URL: https://issues.apache.org/jira/browse/SPARK-23256 Project: Spark Issue Type: Documentation Components: ML, PySpark Affects Versions: 2.3.0 Reporter: Nick Pentreath SPARK-21866 added support for reading image data into a DataFrame. The PySpark API is missing the {{columnSchema}} method that exists in the Scala API.
[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343269#comment-16343269 ] Nick Pentreath commented on SPARK-23109: So [~bryanc] I think this is done then? Can you confirm? > ML 2.3 QA: API: Python API coverage > --- > > Key: SPARK-23109 > URL: https://issues.apache.org/jira/browse/SPARK-23109 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Bryan Cutler >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343266#comment-16343266 ] Nick Pentreath commented on SPARK-21866: Ok, added SPARK-23255 to track user guide additions > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter >Assignee: Ilya Matiach >Priority: Major > Labels: SPIP > Fix For: 2.3.0 > > Attachments: SPIP - Image support for Apache Spark V1.1.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. 
> This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. 
are not > supported > * this format is specialized to images and does not attempt to solve the > more general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 > (value 32 in the table) with the channel order specified by convention. > ** The exact channel ordering and meaning of each channel is dictated by > convention. By default, the order is RGB (3 channels) and B
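The "depth plus number of channels" encoding mentioned in the SPIP text above can be sketched using OpenCV's CV_MAKETYPE convention. Note this sketch is an assumption drawn from OpenCV's own headers (`type = depth + ((channels - 1) << CV_CN_SHIFT)` with `CV_CN_SHIFT = 3`), not from the table the SPIP refers to, which is not reproduced here:

```scala
object CvTypes {
  // Depth codes in OpenCV's order.
  val CV_8U = 0; val CV_8S = 1; val CV_16U = 2; val CV_16S = 3
  val CV_32S = 4; val CV_32F = 5; val CV_64F = 6

  val CV_CN_SHIFT = 3

  /** Combine a depth code and a channel count into an OpenCV type code. */
  def cvType(depth: Int, channels: Int): Int =
    depth + ((channels - 1) << CV_CN_SHIFT)
}

object Main extends App {
  import CvTypes._
  println(cvType(CV_8U, 3))  // CV_8UC3, "3 channel unsigned bytes"
  println(cvType(CV_32F, 1)) // CV_32FC1, single-channel float
}
```

So a single integer in the {{mode}} column suffices to recover both the element type and the channel count of an image.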
[jira] [Created] (SPARK-23255) Add user guide and examples for DataFrame image reading functions
Nick Pentreath created SPARK-23255: -- Summary: Add user guide and examples for DataFrame image reading functions Key: SPARK-23255 URL: https://issues.apache.org/jira/browse/SPARK-23255 Project: Spark Issue Type: Documentation Components: ML, PySpark Affects Versions: 2.3.0 Reporter: Nick Pentreath SPARK-21866 added built-in support for reading image data into a DataFrame. This new functionality should be documented in the user guide, with example usage.
[jira] [Updated] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-23107: --- Description: Audit new public Scala APIs added to MLlib & GraphX. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please create JIRAs and link them to this issue. For *user guide issues* link the new JIRAs to the relevant user guide QA issue (SPARK-23111 for {{2.3}}) was: Audit new public Scala APIs added to MLlib & GraphX. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please create JIRAs and link them to this issue. > ML, Graph 2.3 QA: API: New Scala APIs, docs > --- > > Key: SPARK-23107 > URL: https://issues.apache.org/jira/browse/SPARK-23107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA > issue (SPARK-23111 for {{2.3}}) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23227) Add user guide entry for collecting sub models for cross-validation classes
[ https://issues.apache.org/jira/browse/SPARK-23227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-23227: --- Priority: Minor (was: Major) > Add user guide entry for collecting sub models for cross-validation classes > --- > > Key: SPARK-23227 > URL: https://issues.apache.org/jira/browse/SPARK-23227 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Minor >
[jira] [Updated] (SPARK-23127) Update FeatureHasher user guide for catCols parameter
[ https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-23127: --- Priority: Minor (was: Major) > Update FeatureHasher user guide for catCols parameter > - > > Key: SPARK-23127 > URL: https://issues.apache.org/jira/browse/SPARK-23127 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Assignee: Nick Pentreath >Priority: Minor > Fix For: 2.3.0 > > > SPARK-22801 added the {{categoricalCols}} parameter and updated the Scala and > Python doc, but did not update the user guide entry discussing feature > handling.