[jira] [Updated] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-23265: --- Description: SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If both single- and multi-column params are set (specifically {{inputCol}} / {{inputCols}}) an error is thrown. However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that for this transformer, it is acceptable to set the single-column param for {{numBuckets}} when transforming multiple columns, since that is then applied to all columns. was: SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If both single- and multi-column params are set (specifically {{inputCol}} / {{inputCols}}) an error is thrown. However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that for this transformer, it is acceptable to set the single-column param for {{numBuckets}}, since that is then applied to all columns. > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Major > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and multi-column params are set (specifically {{inputCol}} / > {{inputCols}}) an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that > for this transformer, it is acceptable to set the single-column param for > {{numBuckets}} when transforming multiple columns, since that is then applied > to all columns.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
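[Editorial note] The validation rules described in the issue above can be sketched as a small, self-contained model. This is a hypothetical helper in plain Python, not Spark's actual API: exactly one of {{inputCol}} / {{inputCols}} must be set (the SPARK-22799-style check), and the single-column {{numBuckets}} param may be broadcast to all columns in multi-column mode. The function name `resolve_num_buckets` and the default of 2 buckets are assumptions for illustration.

```python
def resolve_num_buckets(input_col=None, input_cols=None,
                        num_buckets=2, num_buckets_array=None):
    """Toy model of the single- vs. multi-column param rules for
    QuantileDiscretizer (hypothetical helper, not Spark's API)."""
    # Exactly one of inputCol / inputCols must be set, else raise.
    if (input_col is None) == (input_cols is None):
        raise ValueError("exactly one of inputCol or inputCols must be set")
    if input_col is not None:
        # Single-column mode: the array-valued param must not be set.
        if num_buckets_array is not None:
            raise ValueError("numBucketsArray is only valid with inputCols")
        return {input_col: num_buckets}
    # Multi-column mode: an explicit per-column array wins; otherwise the
    # single-column numBuckets is applied to every input column, as the
    # issue description notes is acceptable for this transformer.
    if num_buckets_array is not None:
        if len(num_buckets_array) != len(input_cols):
            raise ValueError("numBucketsArray length must match inputCols")
        return dict(zip(input_cols, num_buckets_array))
    return {c: num_buckets for c in input_cols}
```

For example, `resolve_num_buckets(input_cols=["a", "b"], num_buckets=3)` assigns 3 buckets to both columns, while setting both `input_col` and `input_cols` raises an error.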
[jira] [Commented] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344604#comment-16344604 ] Nick Pentreath commented on SPARK-23265: cc [~huaxing] > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Major > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and multi-column params are set (specifically {{inputCol}} / > {{inputCols}}) an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-23265: --- Issue Type: Improvement (was: Documentation) > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Major > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and multi-column params are set (specifically {{inputCol}} / > {{inputCols}}) an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
Nick Pentreath created SPARK-23265: -- Summary: Update multi-column error handling logic in QuantileDiscretizer Key: SPARK-23265 URL: https://issues.apache.org/jira/browse/SPARK-23265 Project: Spark Issue Type: Documentation Components: ML Affects Versions: 2.3.0 Reporter: Nick Pentreath SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If both single- and multi-column params are set (specifically {{inputCol}} / {{inputCols}}) an error is thrown. However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. The logic for {{QuantileDiscretizer}} should be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-23265: --- Description: SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If both single- and multi-column params are set (specifically {{inputCol}} / {{inputCols}}) an error is thrown. However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that for this transformer, it is acceptable to set the single-column param for {{numBuckets}}, since that is then applied to all columns. was: SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If both single- and multi-column params are set (specifically {{inputCol}} / {{inputCols}}) an error is thrown. However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. The logic for {{QuantileDiscretizer}} should be updated to match. > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Major > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and multi-column params are set (specifically {{inputCol}} / > {{inputCols}}) an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that > for this transformer, it is acceptable to set the single-column param for > {{numBuckets}}, since that is then applied to all columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23138) Add user guide example for multiclass logistic regression summary
[ https://issues.apache.org/jira/browse/SPARK-23138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-23138. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20332 [https://github.com/apache/spark/pull/20332] > Add user guide example for multiclass logistic regression summary > - > > Key: SPARK-23138 > URL: https://issues.apache.org/jira/browse/SPARK-23138 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.3.0 >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.3.0 > > > We haven't updated the user guide to reflect the multiclass logistic > regression summary added in SPARK-17139. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23138) Add user guide example for multiclass logistic regression summary
[ https://issues.apache.org/jira/browse/SPARK-23138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-23138: -- Assignee: Seth Hendrickson > Add user guide example for multiclass logistic regression summary > - > > Key: SPARK-23138 > URL: https://issues.apache.org/jira/browse/SPARK-23138 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.3.0 >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.3.0 > > > We haven't updated the user guide to reflect the multiclass logistic > regression summary added in SPARK-17139. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20928) SPIP: Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344565#comment-16344565 ] liweisheng commented on SPARK-20928: What about introducing a non-blocking shuffle, where the shuffle reader and shuffle writer work at the same time, and the reader fetches data from the writer as soon as it is produced? > SPIP: Continuous Processing Mode for Structured Streaming > - > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Assignee: Jose Torres >Priority: Major > Labels: SPIP > Attachments: Continuous Processing in Structured Streaming Design > Sketch.pdf > > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. > For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. 
Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. */ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user-requested change in parallelism). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
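[Editorial note] The key idea in the Epoch sketch above is that the end offset is only known incrementally, so a reader can consume records as they arrive instead of waiting for a fixed batch boundary. The toy below models that in plain Python (a single-threaded illustration, not Spark's API; a real implementation would poll or block across threads):

```python
from collections import deque

class ToySource:
    """Toy model of the continuous-processing idea: the reader drains
    records as they are produced, so per-record latency is not bounded
    by a fixed batch's start/end offsets."""
    def __init__(self):
        self.buffer = deque()
        self.closed = False

    def write(self, record):
        # Writer side: append a record; no end offset is declared up front.
        self.buffer.append(record)

    def read_continuously(self):
        # Reader side: yield records as soon as they are available; the
        # "end offset" advances incrementally, as in the Epoch sketch.
        while self.buffer or not self.closed:
            if self.buffer:
                yield self.buffer.popleft()

src = ToySource()
for r in (1, 2, 3):
    src.write(r)
src.closed = True
out = list(src.read_continuously())  # records drained in arrival order
```

In the current getBatch model, by contrast, no task could launch until all three records and the final offset were known.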
[jira] [Assigned] (SPARK-23264) Support interval values without INTERVAL clauses
[ https://issues.apache.org/jira/browse/SPARK-23264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23264: Assignee: (was: Apache Spark) > Support interval values without INTERVAL clauses > > > Key: SPARK-23264 > URL: https://issues.apache.org/jira/browse/SPARK-23264 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > The master currently cannot parse the SQL query below: > {code:java} > SELECT cast('2017-08-04' as date) + 1 days; > {code} > Since other DBMS-like systems support this syntax (e.g., Hive and MySQL), it > would help to support it in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23264) Support interval values without INTERVAL clauses
[ https://issues.apache.org/jira/browse/SPARK-23264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23264: Assignee: Apache Spark > Support interval values without INTERVAL clauses > > > Key: SPARK-23264 > URL: https://issues.apache.org/jira/browse/SPARK-23264 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark >Priority: Minor > > The master currently cannot parse the SQL query below: > {code:java} > SELECT cast('2017-08-04' as date) + 1 days; > {code} > Since other DBMS-like systems support this syntax (e.g., Hive and MySQL), it > would help to support it in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23264) Support interval values without INTERVAL clauses
[ https://issues.apache.org/jira/browse/SPARK-23264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344561#comment-16344561 ] Apache Spark commented on SPARK-23264: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/20433 > Support interval values without INTERVAL clauses > > > Key: SPARK-23264 > URL: https://issues.apache.org/jira/browse/SPARK-23264 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > The master currently cannot parse the SQL query below: > {code:java} > SELECT cast('2017-08-04' as date) + 1 days; > {code} > Since other DBMS-like systems support this syntax (e.g., Hive and MySQL), it > would help to support it in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23157. - Resolution: Invalid > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Priority: Minor > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at 
org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23264) Support interval values without INTERVAL clauses
Takeshi Yamamuro created SPARK-23264: Summary: Support interval values without INTERVAL clauses Key: SPARK-23264 URL: https://issues.apache.org/jira/browse/SPARK-23264 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.1 Reporter: Takeshi Yamamuro The master currently cannot parse the SQL query below: {code:java} SELECT cast('2017-08-04' as date) + 1 days; {code} Since other DBMS-like systems support this syntax (e.g., Hive and MySQL), it would help to support it in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
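[Editorial note] The feature requested above amounts to treating a trailing "N unit" token as an interval literal. The sketch below illustrates the idea in plain Python (not Spark's parser; the helper name `parse_interval` and the supported unit set are assumptions), turning the "1 days" part of the example query into a timedelta that can be added to the casted date:

```python
import re
from datetime import date, timedelta

# Map a few interval unit spellings to timedelta keyword arguments.
_UNITS = {"day": "days", "days": "days",
          "hour": "hours", "hours": "hours",
          "minute": "minutes", "minutes": "minutes"}

def parse_interval(text):
    """Parse an 'N unit' interval literal such as '1 days' (toy sketch)."""
    m = re.fullmatch(r"\s*(\d+)\s+(\w+)\s*", text)
    if not m or m.group(2).lower() not in _UNITS:
        raise ValueError(f"cannot parse interval literal: {text!r}")
    return timedelta(**{_UNITS[m.group(2).lower()]: int(m.group(1))})

# Models: SELECT cast('2017-08-04' as date) + 1 days;
result = date(2017, 8, 4) + parse_interval("1 days")  # -> 2017-08-05
```

In Spark's grammar the same effect would be achieved by letting the parser accept the unit suffix without a leading INTERVAL keyword, as Hive and MySQL do.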
[jira] [Commented] (SPARK-23174) Fix pep8 to latest official version
[ https://issues.apache.org/jira/browse/SPARK-23174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344553#comment-16344553 ] Apache Spark commented on SPARK-23174: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/20432 > Fix pep8 to latest official version > --- > > Key: SPARK-23174 > URL: https://issues.apache.org/jira/browse/SPARK-23174 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.1 >Reporter: Rekha Joshi >Assignee: Rekha Joshi >Priority: Trivial > Fix For: 2.4.0 > > > As per discussion with [~hyukjin.kwon], this Jira is to fix the Python code style > to the latest official version. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23222) Flaky test: DataFrameRangeSuite
[ https://issues.apache.org/jira/browse/SPARK-23222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23222: Assignee: Apache Spark > Flaky test: DataFrameRangeSuite > --- > > Key: SPARK-23222 > URL: https://issues.apache.org/jira/browse/SPARK-23222 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Major > > I've seen this test fail a few times in unrelated PRs. e.g.: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86605/testReport/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86656/testReport/ > {noformat} > Error Message > org.scalatest.exceptions.TestFailedException: Expected exception > org.apache.spark.SparkException to be thrown, but no exception was thrown > Stacktrace > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > Expected exception org.apache.spark.SparkException to be thrown, but no > exception was thrown > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at org.scalatest.Assertions$class.intercept(Assertions.scala:822) > at org.scalatest.FunSuite.intercept(FunSuite.scala:1560) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4$$anonfun$apply$2.apply$mcV$sp(DataFrameRangeSuite.scala:168) > at > org.apache.spark.sql.catalyst.plans.PlanTestBase$class.withSQLConf(PlanTest.scala:176) > at > org.apache.spark.sql.DataFrameRangeSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(DataFrameRangeSuite.scala:33) > at > org.apache.spark.sql.test.SQLTestUtilsBase$class.withSQLConf(SQLTestUtils.scala:167) > at > org.apache.spark.sql.DataFrameRangeSuite.withSQLConf(DataFrameRangeSuite.scala:33) > at > 
org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4.apply(DataFrameRangeSuite.scala:166) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4.apply(DataFrameRangeSuite.scala:165) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply$mcV$sp(DataFrameRangeSuite.scala:165) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply(DataFrameRangeSuite.scala:154) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply(DataFrameRangeSuite.scala:154) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23222) Flaky test: DataFrameRangeSuite
[ https://issues.apache.org/jira/browse/SPARK-23222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23222: Assignee: (was: Apache Spark) > Flaky test: DataFrameRangeSuite > --- > > Key: SPARK-23222 > URL: https://issues.apache.org/jira/browse/SPARK-23222 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > I've seen this test fail a few times in unrelated PRs. e.g.: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86605/testReport/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86656/testReport/ > {noformat} > Error Message > org.scalatest.exceptions.TestFailedException: Expected exception > org.apache.spark.SparkException to be thrown, but no exception was thrown > Stacktrace > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > Expected exception org.apache.spark.SparkException to be thrown, but no > exception was thrown > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at org.scalatest.Assertions$class.intercept(Assertions.scala:822) > at org.scalatest.FunSuite.intercept(FunSuite.scala:1560) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4$$anonfun$apply$2.apply$mcV$sp(DataFrameRangeSuite.scala:168) > at > org.apache.spark.sql.catalyst.plans.PlanTestBase$class.withSQLConf(PlanTest.scala:176) > at > org.apache.spark.sql.DataFrameRangeSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(DataFrameRangeSuite.scala:33) > at > org.apache.spark.sql.test.SQLTestUtilsBase$class.withSQLConf(SQLTestUtils.scala:167) > at > org.apache.spark.sql.DataFrameRangeSuite.withSQLConf(DataFrameRangeSuite.scala:33) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4.apply(DataFrameRangeSuite.scala:166) 
> at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4.apply(DataFrameRangeSuite.scala:165) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply$mcV$sp(DataFrameRangeSuite.scala:165) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply(DataFrameRangeSuite.scala:154) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply(DataFrameRangeSuite.scala:154) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23222) Flaky test: DataFrameRangeSuite
[ https://issues.apache.org/jira/browse/SPARK-23222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344548#comment-16344548 ] Apache Spark commented on SPARK-23222: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/20431 > Flaky test: DataFrameRangeSuite > --- > > Key: SPARK-23222 > URL: https://issues.apache.org/jira/browse/SPARK-23222 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > I've seen this test fail a few times in unrelated PRs. e.g.: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86605/testReport/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86656/testReport/ > {noformat} > Error Message > org.scalatest.exceptions.TestFailedException: Expected exception > org.apache.spark.SparkException to be thrown, but no exception was thrown > Stacktrace > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > Expected exception org.apache.spark.SparkException to be thrown, but no > exception was thrown > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at org.scalatest.Assertions$class.intercept(Assertions.scala:822) > at org.scalatest.FunSuite.intercept(FunSuite.scala:1560) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4$$anonfun$apply$2.apply$mcV$sp(DataFrameRangeSuite.scala:168) > at > org.apache.spark.sql.catalyst.plans.PlanTestBase$class.withSQLConf(PlanTest.scala:176) > at > org.apache.spark.sql.DataFrameRangeSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(DataFrameRangeSuite.scala:33) > at > org.apache.spark.sql.test.SQLTestUtilsBase$class.withSQLConf(SQLTestUtils.scala:167) > at > 
org.apache.spark.sql.DataFrameRangeSuite.withSQLConf(DataFrameRangeSuite.scala:33) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4.apply(DataFrameRangeSuite.scala:166) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2$$anonfun$apply$mcV$sp$4.apply(DataFrameRangeSuite.scala:165) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply$mcV$sp(DataFrameRangeSuite.scala:165) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply(DataFrameRangeSuite.scala:154) > at > org.apache.spark.sql.DataFrameRangeSuite$$anonfun$2.apply(DataFrameRangeSuite.scala:154) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gaurav Garg updated SPARK-18016: Comment: was deleted (was: [~kiszk], this program also gives the Constant pool error in my environment. Have you set any extra Spark property? I have tested the code on a single node with 64g RAM and 4 CPU cores, as well as in cluster mode with 6 nodes, each with the same configuration. It seems the problem is not the hardware but an issue somewhere else. Observations when I run the same on a single node: - RAM consumption was only 23g when it threw the constant pool exception. - I have tried the same logic using RDDs instead of DataFrames; it works fine. ) > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.0 > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0xFFFF > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > 
org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.ja
[jira] [Comment Edited] (SPARK-23252) When NodeManager and CoarseGrainedExecutorBackend processes are killed, the job will be blocked
[ https://issues.apache.org/jira/browse/SPARK-23252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344484#comment-16344484 ] Bang Xiao edited comment on SPARK-23252 at 1/30/18 4:30 AM: After the executor and the NodeManager are killed, the failed tasks are never relaunched because the executor's loss reason is not yet known.
{code:java}
CoarseGrainedSchedulerBackend.scala:

protected def disableExecutor(executorId: String): Boolean = {
  val shouldDisable = CoarseGrainedSchedulerBackend.this.synchronized {
    if (executorIsAlive(executorId)) {
      executorsPendingLossReason += executorId
      true
    } else {
      // Returns true for explicitly killed executors, we also need to get pending loss reasons;
      // For others return false.
      executorsPendingToRemove.contains(executorId)
    }
  }
  if (shouldDisable) {
    logInfo(s"Disabling executor $executorId.")
    scheduler.executorLost(executorId, LossReasonPending)
  }
  shouldDisable
}
{code}
TaskSchedulerImpl then handles executorLost and removeExecutor:
{code:java}
TaskSchedulerImpl.scala:

private def removeExecutor(executorId: String, reason: ExecutorLossReason) {
  // The tasks on the lost executor may not send any more status updates (because the executor
  // has been lost), so they should be cleaned up here.
  executorIdToRunningTaskIds.remove(executorId).foreach { taskIds =>
    logDebug("Cleaning up TaskScheduler state for tasks " +
      s"${taskIds.mkString("[", ",", "]")} on failed executor $executorId")
    // We do not notify the TaskSetManager of the task failures because that will
    // happen below in the rootPool.executorLost() call.
    taskIds.foreach(cleanupTaskState)
  }

  val host = executorIdToHost(executorId)
  val execs = hostToExecutors.getOrElse(host, new HashSet)
  execs -= executorId
  if (execs.isEmpty) {
    hostToExecutors -= host
    for (rack <- getRackForHost(host); hosts <- hostsByRack.get(rack)) {
      hosts -= host
      if (hosts.isEmpty) {
        hostsByRack -= rack
      }
    }
  }

  if (reason != LossReasonPending) {
    executorIdToHost -= executorId
    rootPool.executorLost(executorId, host, reason)
  }
  blacklistTrackerOpt.foreach(_.handleRemovedExecutor(executorId))
}
{code}
But if the reason is LossReasonPending, this does not trigger relaunching of the lost tasks. This is consistent with what I've observed in the log. > When NodeManager and CoarseGrainedExecutorBackend processes are killed, the > job will be blocked > --- > > Key: SPARK-23252 > URL: https://issues.apac
[jira] [Commented] (SPARK-23252) When NodeManager and CoarseGrainedExecutorBackend processes are killed, the job will be blocked
[ https://issues.apache.org/jira/browse/SPARK-23252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344484#comment-16344484 ] Bang Xiao commented on SPARK-23252: --- After the executor and the NodeManager are killed, the failed tasks are never relaunched because the executor's loss reason is not yet known.
{code:java}
CoarseGrainedSchedulerBackend.scala:

protected def disableExecutor(executorId: String): Boolean = {
  val shouldDisable = CoarseGrainedSchedulerBackend.this.synchronized {
    if (executorIsAlive(executorId)) {
      executorsPendingLossReason += executorId
      true
    } else {
      // Returns true for explicitly killed executors, we also need to get pending loss reasons;
      // For others return false.
      executorsPendingToRemove.contains(executorId)
    }
  }
  if (shouldDisable) {
    logInfo(s"Disabling executor $executorId.")
    scheduler.executorLost(executorId, LossReasonPending)
  }
  shouldDisable
}
{code}
TaskSchedulerImpl then handles executorLost and removeExecutor:
{code:java}
TaskSchedulerImpl.scala:

private def removeExecutor(executorId: String, reason: ExecutorLossReason) {
  // The tasks on the lost executor may not send any more status updates (because the executor
  // has been lost), so they should be cleaned up here.
  executorIdToRunningTaskIds.remove(executorId).foreach { taskIds =>
    logDebug("Cleaning up TaskScheduler state for tasks " +
      s"${taskIds.mkString("[", ",", "]")} on failed executor $executorId")
    // We do not notify the TaskSetManager of the task failures because that will
    // happen below in the rootPool.executorLost() call.
    taskIds.foreach(cleanupTaskState)
  }

  val host = executorIdToHost(executorId)
  val execs = hostToExecutors.getOrElse(host, new HashSet)
  execs -= executorId
  if (execs.isEmpty) {
    hostToExecutors -= host
    for (rack <- getRackForHost(host); hosts <- hostsByRack.get(rack)) {
      hosts -= host
      if (hosts.isEmpty) {
        hostsByRack -= rack
      }
    }
  }

  if (reason != LossReasonPending) {
    executorIdToHost -= executorId
    rootPool.executorLost(executorId, host, reason)
  }
  blacklistTrackerOpt.foreach(_.handleRemovedExecutor(executorId))
}
{code}
But if the reason is LossReasonPending, this does not trigger relaunching of the lost tasks. This is consistent with what I've observed in the log. > When NodeManager and CoarseGrainedExecutorBackend processes are killed, the > job will be blocked > --- > > Key: SPARK-23252 > URL: https://issues.apache.org/jira/browse/SPARK-23252 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Bang Xiao >Priority: Major > > This happens when 'spark.dynamicAllocation.enabled' is set to 'true'. We > use Yarn as our resource manager. > 1, spark-submit the "JavaWordCount" application in yarn-client mode > 2, kill the NodeManager and CoarseGrainedExecutorBackend processes on one node > when the job is in stage 0 > If we just kill all CoarseGrainedExecutorBackend processes on that node, TaskSetManager > will pend the failed tasks for resubmission. But if the NodeManager and > CoarseGrainedExecutorBackend processes are killed simultaneously, the whole job > will be blocked. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
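The blocking behavior described above can be sketched outside Spark as a minimal, purely illustrative Python simulation (the names `remove_executor` and `LOSS_REASON_PENDING` mirror the Scala code but are not Spark's API): when the loss reason is still pending, the `rootPool.executorLost()` step that would resubmit the executor's tasks is skipped, so those tasks are simply dropped.

```python
# Illustrative simulation of TaskSchedulerImpl.removeExecutor: an executor
# disabled with a pending loss reason skips the step that resubmits its tasks.

LOSS_REASON_PENDING = "LossReasonPending"

def remove_executor(state, executor_id, reason):
    """Drop the executor's bookkeeping; resubmit its tasks only if the
    loss reason is known (i.e. not pending)."""
    lost_tasks = state["running_tasks"].pop(executor_id, set())
    if reason != LOSS_REASON_PENDING:
        # analogous to rootPool.executorLost(): tasks get resubmitted
        state["resubmitted"] |= lost_tasks
    return lost_tasks

state = {"running_tasks": {"exec-1": {1, 2}, "exec-2": {3}}, "resubmitted": set()}

# Executor killed together with its NodeManager: the reason is never resolved,
# so tasks 1 and 2 are lost without being resubmitted.
remove_executor(state, "exec-1", LOSS_REASON_PENDING)
# Executor whose loss reason is known: its task is resubmitted.
remove_executor(state, "exec-2", "ExecutorProcessLost")

print(state["resubmitted"])  # prints {3}: exec-1's tasks were never resubmitted
```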
[jira] [Assigned] (SPARK-23263) create table stored as parquet should update table size if automatic update table size is enabled
[ https://issues.apache.org/jira/browse/SPARK-23263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23263: Assignee: (was: Apache Spark) > create table stored as parquet should update table size if automatic update > table size is enabled > - > > Key: SPARK-23263 > URL: https://issues.apache.org/jira/browse/SPARK-23263 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {noformat} > bin/spark-sql --conf spark.sql.statistics.size.autoUpdate.enabled=true > {noformat} > {code:sql} > spark-sql> create table test_create_parquet stored as parquet as select 1; > spark-sql> desc extended test_create_parquet; > {code} > The table statistics will not exist. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23263) create table stored as parquet should update table size if automatic update table size is enabled
[ https://issues.apache.org/jira/browse/SPARK-23263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344460#comment-16344460 ] Apache Spark commented on SPARK-23263: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/20430 > create table stored as parquet should update table size if automatic update > table size is enabled > - > > Key: SPARK-23263 > URL: https://issues.apache.org/jira/browse/SPARK-23263 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {noformat} > bin/spark-sql --conf spark.sql.statistics.size.autoUpdate.enabled=true > {noformat} > {code:sql} > spark-sql> create table test_create_parquet stored as parquet as select 1; > spark-sql> desc extended test_create_parquet; > {code} > The table statistics will not exist. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23263) create table stored as parquet should update table size if automatic update table size is enabled
[ https://issues.apache.org/jira/browse/SPARK-23263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23263: Assignee: Apache Spark > create table stored as parquet should update table size if automatic update > table size is enabled > - > > Key: SPARK-23263 > URL: https://issues.apache.org/jira/browse/SPARK-23263 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > How to reproduce: > {noformat} > bin/spark-sql --conf spark.sql.statistics.size.autoUpdate.enabled=true > {noformat} > {code:sql} > spark-sql> create table test_create_parquet stored as parquet as select 1; > spark-sql> desc extended test_create_parquet; > {code} > The table statistics will not exist. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23263) create table stored as parquet should update table size if automatic update table size is enabled
Yuming Wang created SPARK-23263: --- Summary: create table stored as parquet should update table size if automatic update table size is enabled Key: SPARK-23263 URL: https://issues.apache.org/jira/browse/SPARK-23263 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Yuming Wang How to reproduce:
{noformat}
bin/spark-sql --conf spark.sql.statistics.size.autoUpdate.enabled=true
{noformat}
{code:sql}
spark-sql> create table test_create_parquet stored as parquet as select 1;
spark-sql> desc extended test_create_parquet;
{code}
The table statistics will not exist. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
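The expected fix can be sketched, purely illustratively, in plain Python: after the CTAS command finishes writing the table, it should consult the autoUpdate flag and record the computed size in the catalog. The `catalog` dict and the `create_table_as_select` helper below are hypothetical stand-ins, not Spark's internal API (the real fix would run something analogous to `CommandUtils.updateTableStats` after the write).

```python
# Sketch: "create table ... stored as parquet as select ..." should update the
# table's size statistic when spark.sql.statistics.size.autoUpdate.enabled is true.

def create_table_as_select(catalog, conf, name, files):
    """Register a table and, if auto-update is enabled, record its total size."""
    catalog[name] = {"files": files, "stats": None}
    if conf.get("spark.sql.statistics.size.autoUpdate.enabled", "false") == "true":
        # analogous to updating table stats right after the write completes
        catalog[name]["stats"] = {"sizeInBytes": sum(files.values())}

catalog = {}
conf = {"spark.sql.statistics.size.autoUpdate.enabled": "true"}
create_table_as_select(catalog, conf, "test_create_parquet", {"part-0.parquet": 512})

print(catalog["test_create_parquet"]["stats"])  # prints {'sizeInBytes': 512}
```

With the flag off (the default in the sketch), `stats` stays `None`, which matches the missing statistics reported in the issue.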
[jira] [Updated] (SPARK-23246) (Py)Spark OOM because of iteratively accumulated metadata that cannot be cleared
[ https://issues.apache.org/jira/browse/SPARK-23246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MBA Learns to Code updated SPARK-23246: --- Description: I am having consistent OOM crashes when trying to use PySpark for iterative algorithms in which I create new DataFrames per iteration (e.g. by sampling from a "mother" DataFrame), do something with such DataFrames, and never need those DataFrames again in future iterations. The below script simulates such OOM failures. Even when one explicitly tries to .unpersist() the temporary DataFrames (by using the --unpersist flag below) and/or deletes and garbage-collects the Python objects (by using the --py-gc flag below), the Java objects seem to stay on and accumulate until they exceed the JVM/driver memory. The more complex the temporary DataFrames in each iteration (illustrated by the --n-partitions flag below), the faster OOM occurs. The typical error messages include: - "java.lang.OutOfMemoryError : GC overhead limit exceeded" - "Java heap space" - "ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=6053742323219781161, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]}} to /; closing connection" Please suggest how I may overcome this so that we can have long-running iterative programs using Spark that use resources only up to a bounded, controllable limit.
{code:java}
from __future__ import print_function

import argparse
import gc

import pandas
import pyspark

arg_parser = argparse.ArgumentParser()
arg_parser.add_argument('--unpersist', action='store_true')
arg_parser.add_argument('--py-gc', action='store_true')
arg_parser.add_argument('--n-partitions', type=int, default=1000)
args = arg_parser.parse_args()

# create SparkSession (*** set spark.driver.memory to 512m in spark-defaults.conf ***)
spark = pyspark.sql.SparkSession.builder \
    .config('spark.executor.instances', 2) \
    .config('spark.executor.cores', 2) \
    .config('spark.executor.memory', '512m') \
    .config('spark.ui.enabled', False) \
    .config('spark.ui.retainedJobs', 10) \
    .config('spark.ui.retainedStages', 10) \
    .config('spark.ui.retainedTasks', 10) \
    .enableHiveSupport() \
    .getOrCreate()

# create Parquet file for subsequent repeated loading
df = spark.createDataFrame(
    pandas.DataFrame(
        dict(
            row=range(args.n_partitions),
            x=args.n_partitions * [0]
        )
    )
)

parquet_path = '/tmp/TestOOM-{}Partitions.parquet'.format(args.n_partitions)

df.write.parquet(
    path=parquet_path,
    partitionBy='row',
    mode='overwrite'
)

i = 0

# The below loop simulates an iterative algorithm that creates new DataFrames
# in each iteration (e.g. sampling from a "mother" DataFrame), does something,
# and never needs those DataFrames again in future iterations.
# We are having a problem cleaning up the built-up metadata,
# hence the program will crash after a while because of OOM.
while True:
    _df = spark.read.parquet(parquet_path)

    if args.unpersist:
        _df.unpersist()

    if args.py_gc:
        del _df
        gc.collect()

    i += 1
    print('COMPLETED READ ITERATION #{}\n'.format(i))
{code}
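The failure mode (driver-side state accumulating per iteration with no way to clear it from Python) can be illustrated without Spark at all. The sketch below only mirrors the shape of the reported problem; the `Driver` class and its `metadata` list are stand-ins for JVM-side bookkeeping, not actual Spark internals.

```python
# Shape of the reported problem: each iteration registers metadata with a
# long-lived driver-side registry; deleting the Python handle (del / gc)
# does not shrink the registry, so memory grows without bound.

class Driver:
    def __init__(self):
        self.metadata = []  # grows each iteration, never cleared

    def read(self, path):
        self.metadata.append({"path": path})  # stand-in for JVM-side state
        return object()                       # the Python-side handle

driver = Driver()
for i in range(100):
    df = driver.read("/tmp/TestOOM.parquet")
    del df  # frees only the Python handle, not the driver-side entry

print(len(driver.metadata))  # prints 100: driver-side state survived every `del`
```

This is why the reporter's `--unpersist` and `--py-gc` flags make no difference in the script above: they only act on the Python side of the boundary.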
[jira] [Resolved] (SPARK-23088) History server not showing incomplete/running applications
[ https://issues.apache.org/jira/browse/SPARK-23088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-23088. - Resolution: Fixed Assignee: paul mackles Fix Version/s: 2.4.0 Issue resolved by pull request 20335 https://github.com/apache/spark/pull/20335 > History server not showing incomplete/running applications > -- > > Key: SPARK-23088 > URL: https://issues.apache.org/jira/browse/SPARK-23088 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.1.2, 2.2.1 >Reporter: paul mackles >Assignee: paul mackles >Priority: Minor > Fix For: 2.4.0 > > > History server not showing incomplete/running applications when > _spark.history.ui.maxApplications_ property is set to a value that is smaller > than the total number of applications. > I believe this is because any applications where completed=false wind up at > the end of the list of apps returned by the /applications endpoint and when > _spark.history.ui.maxApplications_ is set, that list gets truncated and the > running apps are never returned. > The fix I have in mind is to modify the history template to start passing the > _status_ parameter when calling the /applications endpoint (status=completed > is the default). > I am running Spark in a Mesos environment but I don't think that is relevant > to this issue -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
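The fix described above (have the history page pass the status parameter) amounts to querying the applications endpoint for both states instead of relying on the completed-only default. A hedged sketch of the request-URL construction (the `applications_url` helper is illustrative; the `/api/v1/applications` endpoint and its repeated `status` parameter are part of Spark's documented REST API):

```python
# The /applications endpoint returns completed apps by default; listing
# running apps as well requires explicit status parameters in the query.

from urllib.parse import urlencode

def applications_url(base, statuses=("completed", "running")):
    """Build the history-server REST URL listing apps in the given states."""
    query = urlencode([("status", s) for s in statuses])
    return "{}/api/v1/applications?{}".format(base.rstrip("/"), query)

print(applications_url("http://history-server:18080"))
# http://history-server:18080/api/v1/applications?status=completed&status=running
```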
[jira] [Commented] (SPARK-23237) Add UI / endpoint for threaddumps for executors with active tasks
[ https://issues.apache.org/jira/browse/SPARK-23237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344434#comment-16344434 ] Imran Rashid commented on SPARK-23237: -- Can you expand a bit about what you are worried about? Confusing UI? Too tempting for users to refresh it constantly, not realizing the impact it has? If it's just the UI, I'd be fine with only making it an endpoint in the api. And on the expense – well, the existing thread dump page already would have that problem. Perhaps it's just so inconvenient that nobody has been tempted to abuse it :P. But I think it's better to warn in the docs and then the user is allowed to shoot themselves in the foot. > Add UI / endpoint for threaddumps for executors with active tasks > - > > Key: SPARK-23237 > URL: https://issues.apache.org/jira/browse/SPARK-23237 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Major > > Frequently, when there are a handful of straggler tasks, users want to know > what is going on in those executors running the stragglers. Currently, that > is a bit of a pain to do: you have to go to the page for your active stage, > find the task, figure out which executor it's on, then go to the executors > page, and get the thread dump. Or maybe you just go to the executors page, > find the executor with an active task, and then click on that, but that > doesn't work if you've got multiple stages running. > Users could figure this out by extracting the info from the stage rest endpoint, > but it's such a common thing to do that we should make it easy. > I realize that figuring out a good way to do this is a little tricky. We > don't want to make it easy to end up pulling thread dumps from 1000 executors > back to the driver. So we've got to come up with a reasonable heuristic for > choosing which executors to poll. And we've also got to find a suitable > place to put this.
> My suggestion is that the stage page always has a link to the thread dumps > for the *one* executor with the longest running task. And there would be a > corresponding endpoint in the rest api with the same info, maybe at > {{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/slowestTaskThreadDump}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
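The suggested heuristic (link to the one executor with the longest-running task) can be sketched as a small selection function. Everything below is illustrative: `slowest_task_executor` and the task-dict shape are assumptions for the sketch, not a proposed Spark signature.

```python
# Sketch of the suggested heuristic: among a stage's active tasks, find the
# one that has been running longest (earliest launch time) and report its
# executor, so the UI / REST layer can link to just that executor's thread dump.

def slowest_task_executor(active_tasks):
    """active_tasks: list of dicts with 'executor_id' and 'launch_time'.
    Returns the executor running the longest-active task, or None."""
    if not active_tasks:
        return None
    oldest = min(active_tasks, key=lambda t: t["launch_time"])
    return oldest["executor_id"]

tasks = [
    {"executor_id": "2", "launch_time": 100},  # launched earliest: the straggler
    {"executor_id": "5", "launch_time": 180},
]
print(slowest_task_executor(tasks))  # prints 2
```

This keeps the thread-dump fan-out bounded to a single executor per request, which addresses the "don't pull dumps from 1000 executors" concern in the description.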
[jira] [Comment Edited] (SPARK-23236) Make it easier to find the rest API, especially in local mode
[ https://issues.apache.org/jira/browse/SPARK-23236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344433#comment-16344433 ] Imran Rashid edited comment on SPARK-23236 at 1/30/18 3:09 AM: --- {quote} 1. /api and /api/v1 to give the same results (a redirect) as /api/v1/applications Based on that I'm on the fence about #1, it may be ok, but I'm not sure if it's best. {quote} actually, I want them to be a very simple html page, with just a single link. I don't think they should be a redirect, as I don't think they should appear to actually be part of the api. The html page could even have a little disclaimer "Are you looking for the rest api? There is no endpoint here in this version, maybe you want ..." {quote} 2. You want /api/v1/applications/{app-id} to include a list of rest api end points As for #2 I'm not sure how you're envisioning it, but it seems like a bad idea in my head. {quote} So to be clear, really I just want a little help with my super lazy, forgetful side, which can't remember the exact endpoint off the top of my head. There are many variations which would satisfy this. The entire UI could *just* provide a link to {{/api/v1/applications/[app-id]}} ... but that's a little weird since there actually isn't anything there. That's why I was suggesting putting something there, even if it's just another simple html page, "Nothing here, maybe you want ...". We could go even simpler -- have every page in the UI have a link to some random app-specific endpoint in the UI, eg. {{/api/v1/applications/[app-id]/jobs}} . It would be a little weird if you're on the stage page in the UI, and you follow a link for the REST api, and it takes you to the jobs data ... but it would at least make it easier to get the base URL right. I also don't really care if {{/api/v1/applications/[app-id]}} actually lists *every* endpoint below that. I think it's totally fine if I need to remember the choices below that -- that is actually the important part of the choice I need to make. I want something to fill in the rest of the automatic stuff for me. The only reason I suggested putting the list of endpoints is, I want to put *something* there so the UI can link to it. {quote} 3. You want UI pages to include links to the rest api calls that they get their info from And for #3, I would be ok with this, but again I'm not sure how it's useful. {quote} yeah this is the least important, but perhaps you can see where I'm going with this after #2, it just feels like a natural extension. Certainly not worth a huge amount of effort. > Make it easier to find th
[jira] [Commented] (SPARK-23236) Make it easier to find the rest API, especially in local mode
[ https://issues.apache.org/jira/browse/SPARK-23236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344433#comment-16344433 ] Imran Rashid commented on SPARK-23236: -- bq. 1. /api and /api/v1 to give the same results (a redirect) as /api/v1/applications bq. Based on that I'm on the fence about #1, it may be ok, but I'm not sure if it's best. actually, I want them to be a very simple html page, with just a single link. I don't think they should be a redirect, as I don't think they should appear to actually be part of the api. The html page could even have a little disclaimer "Are you looking for the rest api? There is no endpoint here in this version, maybe you want ..." bq. 2. You want /api/v1/applications/{app-id} to include a list of rest api end points bq. As for #2 I'm not sure how you're envisioning it, but it seems like a bad idea in my head. So to be clear, really I just want a little help with my super lazy, forgetful side, which can't remember the exact endpoint off the top of my head. There are many variations which would satisfy this. The entire UI could *just* provide a link to /api/v1/applications/{app-id} ... but that's a little weird since there actually isn't anything there. That's why I was suggesting putting something there, even if it's just another simple html page, "Nothing here, maybe you want ...". We could go even simpler -- have every page in the UI have a link to some random app-specific endpoint in the UI, eg. /api/v1/applications/{app-id}/jobs . It would be a little weird if you're on the stage page in the UI, and you follow a link for the REST api, and it takes you to the jobs data ... but it would at least make it easier to get the base URL right. I also don't really care if /api/v1/applications/{app-id} actually lists every endpoint below that. I think it's totally fine if I need to remember the choices below that -- that is actually the important part of the choice I need to make. I want something to fill in the rest of the automatic stuff for me. bq. 3. You want UI pages to include links to the rest api calls that they get their info from bq. And for #3, I would be ok with this, but again I'm not sure how it's useful. yeah this is the least important, but perhaps you can see where I'm going with this after #2, it just feels like a natural extension. Certainly not worth a huge amount of effort. > Make it easier to find the rest API, especially in local mode > - > > Key: SPARK-23236 > URL: https://issues.apache.org/jira/browse/SPARK-23236 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Trivial > Labels: newbie > > This is really minor, but it always takes me a little bit to figure out how > to get from the UI to the rest api. It's especially a pain in local-mode, > where you need the app-id, though in general I don't know the app-id, so I have > to either look in logs or go to another endpoint first in the ui just to find > the app-id. While it wouldn't really help anybody accessing the endpoints > programmatically, we could make it easier for someone doing exploration via > their browser. > Some things which could be improved: > * /api/v1 just provides a link to "/api/v1/applications" > * /api provides a link to "/api/v1/applications" > * /api/v1/applications/[app-id] gives a list of links for the other endpoints > * on the UI, there is a link to at least /api/v1/applications/[app-id] -- > better still if each UI page links to the corresponding endpoint, eg. the all > jobs page would link to /api/v1/applications/[app-id]/jobs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
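The "each UI page links to its REST endpoint" idea from the discussion above can be sketched as a tiny page-to-endpoint mapping. The `UI_TO_REST` table and `rest_link` helper are illustrative assumptions; the `/api/v1/applications/[app-id]/...` paths themselves are Spark's documented REST routes.

```python
# Sketch of the suggested convenience: given the app id a UI page already
# knows, build the matching REST endpoint so each page can link to it,
# falling back to /api/v1/applications for pages with no direct counterpart.

UI_TO_REST = {
    "jobs": "/api/v1/applications/{app}/jobs",
    "stages": "/api/v1/applications/{app}/stages",
    "executors": "/api/v1/applications/{app}/executors",
}

def rest_link(page, app_id):
    template = UI_TO_REST.get(page)
    return template.format(app=app_id) if template else "/api/v1/applications"

print(rest_link("jobs", "local-1517300000000"))
# prints /api/v1/applications/local-1517300000000/jobs
```

This addresses the "I can never remember the base URL" complaint directly: the UI fills in the app-id and prefix, and the user only has to remember the final path segment.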
[jira] [Commented] (SPARK-23262) mix-in interface should extend the interface it aimed to mix in
[ https://issues.apache.org/jira/browse/SPARK-23262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344396#comment-16344396 ] Apache Spark commented on SPARK-23262: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20427 > mix-in interface should extend the interface it aimed to mix in > --- > > Key: SPARK-23262 > URL: https://issues.apache.org/jira/browse/SPARK-23262 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23262) mix-in interface should extend the interface it aimed to mix in
[ https://issues.apache.org/jira/browse/SPARK-23262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23262: Assignee: Wenchen Fan (was: Apache Spark) > mix-in interface should extend the interface it aimed to mix in > --- > > Key: SPARK-23262 > URL: https://issues.apache.org/jira/browse/SPARK-23262 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23262) mix-in interface should extend the interface it aimed to mix in
[ https://issues.apache.org/jira/browse/SPARK-23262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23262: Assignee: Apache Spark (was: Wenchen Fan) > mix-in interface should extend the interface it aimed to mix in > --- > > Key: SPARK-23262 > URL: https://issues.apache.org/jira/browse/SPARK-23262 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23262) mix-in interface should extend the interface it aimed to mix in
Wenchen Fan created SPARK-23262: --- Summary: mix-in interface should extend the interface it aimed to mix in Key: SPARK-23262 URL: https://issues.apache.org/jira/browse/SPARK-23262 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
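The issue title can be illustrated with a minimal, hypothetical Scala sketch (these are not the actual Spark SQL / DataSourceV2 interfaces): a capability mix-in trait should extend the interface it is meant to be mixed into, so that any implementor of the mix-in is automatically usable wherever the base interface is expected.

```scala
// Hypothetical sketch (not the real Spark classes): a base reader interface
// and a capability mix-in for readers that can push down filters.
trait DataReader {
  def read(): Seq[Int]
}

// Because the mix-in extends DataReader, every class that mixes it in is
// also a DataReader -- callers can treat it uniformly as the base interface.
trait SupportsFilterPushdown extends DataReader {
  def pushMinFilter(min: Int): Unit
}

class RangeReader extends SupportsFilterPushdown {
  private var lo = 0
  def pushMinFilter(min: Int): Unit = { lo = min }
  def read(): Seq[Int] = (lo until 5).toSeq
}

// The mix-in type can be assigned to the base interface without a cast.
val reader: DataReader = new RangeReader
```

If the mix-in did not extend `DataReader`, the last line would not compile, and framework code would need extra casts to use a pushdown-capable reader as a plain reader.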
[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344339#comment-16344339 ] Alex Bozarth commented on SPARK-18085: -- [~vanzin] since this is complete and going into 2.3 I was hoping you could help write a short overview of the SPIP, like what's changed and why. Given how much this evolved during the project I'm not sure if the original pitch is the best description anymore and I would like to have a nice summary to describe the project's impact. Thanks > SPIP: Better History Server scalability for many / large applications > - > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Major > Labels: SPIP > Fix For: 2.3.0 > > Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-21664) Use the column name as the file name.
[ https://issues.apache.org/jira/browse/SPARK-21664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jifei_yang closed SPARK-21664. -- We can use the partition to save the column names, such as:
{code:java}
case class UserInfo(name: String, favorite_number: Int, favorite_color: String) extends Serializable

def mainSaveAsParquet(args: Array[String]): Unit = {
  val fileName = new Random().nextInt(43952858)
  val outPath = s"G:/project/idea15/xlwl/bigdata002/bigdata/sparkmvn/outpath/user/spark/parquet/temp/$fileName"
  val sparkConf = new SparkConf().setAppName("Spark Avro Test").setMaster("local[4]")
  MyKryoRegistrator.register(sparkConf)
  val sc = new SparkContext(sparkConf)
  val sqlContext = new SQLContext(sc)
  val array = new Array[UserInfo](3001)
  for (i <- 0 to 3000) {
    i % 2 match {
      case 0 => array(i) = UserInfo("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36", 256 + (i / 102), "blue")
      case 1 => array(i) = UserInfo("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063", 256 + i, "blue")
    }
  }
  import sqlContext.implicits._
  val records: DataFrame = sc.parallelize(array).toDF()
  records.repartition(1).write.partitionBy("name", "favorite_number").format("parquet").mode(SaveMode.ErrorIfExists).save(outPath)
  sc.stop()
}
{code}
This encodes the name and favorite_number columns as partition directories in the output path. > Use the column name as the file name. > -- > > Key: SPARK-21664 > URL: https://issues.apache.org/jira/browse/SPARK-21664 > Project: Spark > Issue Type: Question > Components: Input/Output >Affects Versions: 2.2.0 >Reporter: jifei_yang >Priority: Major > > When we save the dataframe, we want to use the column name as the file name. > This is achievable with PairRDDFunctions. Can DataFrame be implemented? Thank you. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23246) (Py)Spark OOM because of iteratively accumulated metadata that cannot be cleared
[ https://issues.apache.org/jira/browse/SPARK-23246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23246. --- Resolution: Not A Problem Yes, did you have a look? It's dominated by things like {{class org.apache.spark.ui.jobs.UIData$TaskMetricsUIData}}. Turn down the values of the "retained*" options you see at https://spark.apache.org/docs/latest/configuration.html#spark-ui > (Py)Spark OOM because of iteratively accumulated metadata that cannot be > cleared > > > Key: SPARK-23246 > URL: https://issues.apache.org/jira/browse/SPARK-23246 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: MBA Learns to Code >Priority: Critical > Attachments: SparkProgramHeapDump.bin.tar.xz > > > I am having consistent OOM crashes when trying to use PySpark for iterative > algorithms in which I create new DataFrames per iteration (e.g. by sampling > from a "mother" DataFrame), do something with such DataFrames, and never need > such DataFrames ever in future iterations. > The below script simulates such OOM failures. Even when one tries explicitly > .unpersist() the temporary DataFrames (by using the --unpersist flag below) > and/or deleting and garbage-collecting the Python objects (by using the > --py-gc flag below), the Java objects seem to stay on and accumulate until > they exceed the JVM/driver memory. > The more complex the temporary DataFrames in each iteration (illustrated by > the --n-partitions flag below), the faster OOM occurs. 
> The typical error messages include:
> - "java.lang.OutOfMemoryError : GC overhead limit exceeded"
> - "Java heap space"
> - "ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=6053742323219781161, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]}} to /; closing connection"
> Please suggest how I may overcome this so that we can have long-running iterative programs using Spark that use resources only up to a bounded, controllable limit.
>
> {code:java}
> from __future__ import print_function
> import argparse
> import gc
> import pandas
> import pyspark
>
> arg_parser = argparse.ArgumentParser()
> arg_parser.add_argument('--unpersist', action='store_true')
> arg_parser.add_argument('--py-gc', action='store_true')
> arg_parser.add_argument('--n-partitions', type=int, default=1000)
> args = arg_parser.parse_args()
>
> # create SparkSession (*** set spark.driver.memory to 512m in spark-defaults.conf ***)
> spark = pyspark.sql.SparkSession.builder \
>     .config('spark.executor.instances', 2) \
>     .config('spark.executor.cores', 2) \
>     .config('spark.executor.memory', '512m') \
>     .config('spark.ui.enabled', False) \
>     .config('spark.ui.retainedJobs', 10) \
>     .enableHiveSupport() \
>     .getOrCreate()
>
> # create Parquet file for subsequent repeated loading
> df = spark.createDataFrame(
>     pandas.DataFrame(
>         dict(
>             row=range(args.n_partitions),
>             x=args.n_partitions * [0]
>         )
>     )
> )
> parquet_path = '/tmp/TestOOM-{}Partitions.parquet'.format(args.n_partitions)
> df.write.parquet(
>     path=parquet_path,
>     partitionBy='row',
>     mode='overwrite'
> )
>
> i = 0
> # the loop below simulates an iterative algorithm that creates new DataFrames in each iteration (e.g. sampling from a "mother" DataFrame), does something, and never needs those DataFrames again in future iterations
> # we are having a problem cleaning up the built-up metadata
> # hence the program will crash after a while because of OOM
> while True:
>     _df = spark.read.parquet(parquet_path)
>     if args.unpersist:
>         _df.unpersist()
>     if args.py_gc:
>         del _df
>         gc.collect()
>     i += 1; print('COMPLETED READ ITERATION #{}\n'.format(i))
> {code}
>
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
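The "retained*" advice from the resolution above can be applied in spark-defaults.conf. A sketch with illustrative values only (lower values bound the driver's retained UI metadata at the cost of less history in the Web UI):

```
spark.ui.retainedJobs               100
spark.ui.retainedStages             100
spark.ui.retainedTasks              1000
spark.sql.ui.retainedExecutions     100
spark.streaming.ui.retainedBatches  100
```

The same properties can also be passed via SparkSession's builder `.config(...)` calls, as the reporter already does for `spark.ui.retainedJobs` in the reproduction script.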
[jira] [Commented] (SPARK-23235) Add executor Threaddump to api
[ https://issues.apache.org/jira/browse/SPARK-23235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344168#comment-16344168 ] Alex Bozarth commented on SPARK-23235: -- Your discussion clarified my concern for me. I think I wanted to see how he was going to do it first, but based on your implementation comments, this looks like a good add. > Add executor Threaddump to api > -- > > Key: SPARK-23235 > URL: https://issues.apache.org/jira/browse/SPARK-23235 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Minor > Labels: newbie > > It looks like the thread dump {{/executors/threadDump/?executorId=[id]}} > is only available in the UI, not in the rest api at all. This is especially > a pain because that page in the UI has extra formatting which makes it a pain > to send the output to somebody else (most likely you click "expand all" and > then copy paste that, which is OK, but is formatted weirdly). We might also > just want a "format=raw" option even on the UI. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23235) Add executor Threaddump to api
[ https://issues.apache.org/jira/browse/SPARK-23235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344140#comment-16344140 ] Imran Rashid commented on SPARK-23235: -- [~ajbozarth] can you explain your concern? [~attilapiros] definitely a new endpoint. Taking a thread dump is somewhat expensive, so we don't want it to be part of the regular info. > Add executor Threaddump to api > -- > > Key: SPARK-23235 > URL: https://issues.apache.org/jira/browse/SPARK-23235 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Minor > Labels: newbie > > It looks like the thread dump {{/executors/threadDump/?executorId=[id]}} > is only available in the UI, not in the rest api at all. This is especially > a pain because that page in the UI has extra formatting which makes it a pain > to send the output to somebody else (most likely you click "expand all" and > then copy paste that, which is OK, but is formatted weirdly). We might also > just want a "format=raw" option even on the UI. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
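For context on the request above, a transcript-style sketch of the difference today (assuming a driver UI reachable on the default port 4040; not runnable without a live Spark application):

```
# Existing REST API endpoints return JSON under /api/v1:
curl http://localhost:4040/api/v1/applications

# The thread dump is currently only an HTML page in the UI:
curl http://localhost:4040/executors/threadDump/?executorId=0
```

The proposal is a new JSON endpoint for the second case, kept separate from the regular executor info since taking a thread dump is relatively expensive.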
[jira] [Assigned] (SPARK-23209) HiveDelegationTokenProvider throws an exception if Hive jars are not the classpath
[ https://issues.apache.org/jira/browse/SPARK-23209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-23209: Assignee: Marcelo Vanzin > HiveDelegationTokenProvider throws an exception if Hive jars are not the > classpath > -- > > Key: SPARK-23209 > URL: https://issues.apache.org/jira/browse/SPARK-23209 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 > Environment: OSX, Java(TM) SE Runtime Environment (build > 1.8.0_92-b14), Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode) >Reporter: Sahil Takiar >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 2.3.0 > > > While doing some Hive-on-Spark testing against the Spark 2.3.0 release > candidates we came across a bug (see HIVE-18436). > Stack-trace: > {code} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/hadoop/hive/conf/HiveConf > at > org.apache.spark.deploy.security.HadoopDelegationTokenManager.getDelegationTokenProviders(HadoopDelegationTokenManager.scala:68) > at > org.apache.spark.deploy.security.HadoopDelegationTokenManager.(HadoopDelegationTokenManager.scala:54) > at > org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager.(YARNHadoopDelegationTokenManager.scala:44) > at org.apache.spark.deploy.yarn.Client.(Client.scala:123) > at > org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1502) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.ClassNotFoundException: > org.apache.hadoop.hive.conf.HiveConf > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at 
java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 10 more > {code} > Looks like the bug was introduced by SPARK-20434. SPARK-20434 changed > {{HiveDelegationTokenProvider}} so that it constructs > {{o.a.h.hive.conf.HiveConf}} inside {{HiveCredentialProvider#hiveConf}} > rather than trying to manually load the class via the class loader. Looks > like with the new code the JVM tries to load {{HiveConf}} as soon as > {{HiveDelegationTokenProvider}} is referenced. Since there is no try-catch > around the construction of {{HiveDelegationTokenProvider}} a > {{ClassNotFoundException}} is thrown, which causes spark-submit to crash. > Spark's {{docs/running-on-yarn.md}} says "a Hive token will be obtained if > Hive is on the classpath". This behavior would seem to contradict that. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23209) HiveDelegationTokenProvider throws an exception if Hive jars are not the classpath
[ https://issues.apache.org/jira/browse/SPARK-23209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-23209. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20399 [https://github.com/apache/spark/pull/20399] > HiveDelegationTokenProvider throws an exception if Hive jars are not the > classpath > -- > > Key: SPARK-23209 > URL: https://issues.apache.org/jira/browse/SPARK-23209 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 > Environment: OSX, Java(TM) SE Runtime Environment (build > 1.8.0_92-b14), Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode) >Reporter: Sahil Takiar >Priority: Blocker > Fix For: 2.3.0 > > > While doing some Hive-on-Spark testing against the Spark 2.3.0 release > candidates we came across a bug (see HIVE-18436). > Stack-trace: > {code} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/hadoop/hive/conf/HiveConf > at > org.apache.spark.deploy.security.HadoopDelegationTokenManager.getDelegationTokenProviders(HadoopDelegationTokenManager.scala:68) > at > org.apache.spark.deploy.security.HadoopDelegationTokenManager.(HadoopDelegationTokenManager.scala:54) > at > org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager.(YARNHadoopDelegationTokenManager.scala:44) > at org.apache.spark.deploy.yarn.Client.(Client.scala:123) > at > org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1502) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.ClassNotFoundException: > org.apache.hadoop.hive.conf.HiveConf > at 
java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 10 more > {code} > Looks like the bug was introduced by SPARK-20434. SPARK-20434 changed > {{HiveDelegationTokenProvider}} so that it constructs > {{o.a.h.hive.conf.HiveConf}} inside {{HiveCredentialProvider#hiveConf}} > rather than trying to manually load the class via the class loader. Looks > like with the new code the JVM tries to load {{HiveConf}} as soon as > {{HiveDelegationTokenProvider}} is referenced. Since there is no try-catch > around the construction of {{HiveDelegationTokenProvider}} a > {{ClassNotFoundException}} is thrown, which causes spark-submit to crash. > Spark's {{docs/running-on-yarn.md}} says "a Hive token will be obtained if > Hive is on the classpath". This behavior would seem to contradict that. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
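The failure mode above (the JVM resolving {{HiveConf}} as soon as {{HiveDelegationTokenProvider}} is referenced) is commonly avoided by probing for the optional class reflectively before touching any code that links against it. A minimal, hypothetical sketch (not the actual Spark fix; the provider names are illustrative):

```scala
import scala.util.Try

// Probe for an optional dependency without linking against it at compile
// time; a missing jar yields `false` instead of a NoClassDefFoundError
// thrown at class-resolution time.
def classAvailable(name: String): Boolean =
  Try(Class.forName(name)).isSuccess

// Only register the Hive provider when the Hive jars are actually present
// on the classpath.
val providers: Seq[String] =
  if (classAvailable("org.apache.hadoop.hive.conf.HiveConf"))
    Seq("hadoopfs", "hbase", "hive")
  else
    Seq("hadoopfs", "hbase")
```

The key point is that `Class.forName` turns a hard linkage error into a catchable `ClassNotFoundException`, which matches the docs' promise that "a Hive token will be obtained if Hive is on the classpath" without crashing when it is not.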
[jira] [Assigned] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23157: Assignee: Apache Spark > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Assignee: Apache Spark >Priority: Minor > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at 
org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23157: Assignee: (was: Apache Spark) > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Priority: Minor > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > 
at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344078#comment-16344078 ] Apache Spark commented on SPARK-23157: -- User 'henryr' has created a pull request for this issue: https://github.com/apache/spark/pull/20429 > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Priority: Minor > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
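A workaround consistent with the analysis error above (an unverified sketch, mirroring the REPL session in the report): materialize the mapped Dataset once and reference the column from that same Dataset, so the resolved attribute ids line up with the plan that `withColumn` is applied to:

```
scala> val mapped = ds.map(a => a)
scala> mapped.withColumn("n", mapped.col("id"))
```

The original failure comes from mixing attributes of two different plans: `ds.map(a => a).col("id")` resolves `id` against the mapped plan (`id#55`) while `ds.withColumn` projects over the original plan (`id#4`).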
[jira] [Assigned] (SPARK-23261) Rename Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-23261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23261: Assignee: Apache Spark (was: Xiao Li) > Rename Pandas UDFs > -- > > Key: SPARK-23261 > URL: https://issues.apache.org/jira/browse/SPARK-23261 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Rename the public APIs of pandas udfs from > - PANDAS SCALAR UDF -> SCALAR PANDAS UDF > - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF > - PANDAS GROUP AGG UDF -> PANDAS UDAF -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23261) Rename Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-23261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344076#comment-16344076 ] Apache Spark commented on SPARK-23261: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/20428 > Rename Pandas UDFs > -- > > Key: SPARK-23261 > URL: https://issues.apache.org/jira/browse/SPARK-23261 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > > Rename the public APIs of pandas udfs from > - PANDAS SCALAR UDF -> SCALAR PANDAS UDF > - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF > - PANDAS GROUP AGG UDF -> PANDAS UDAF -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23261) Rename Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-23261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23261: Assignee: Xiao Li (was: Apache Spark) > Rename Pandas UDFs > -- > > Key: SPARK-23261 > URL: https://issues.apache.org/jira/browse/SPARK-23261 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > > Rename the public APIs of pandas udfs from > - PANDAS SCALAR UDF -> SCALAR PANDAS UDF > - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF > - PANDAS GROUP AGG UDF -> PANDAS UDAF -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
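To make the rename list above concrete, a hedged PySpark sketch of the post-rename usage (requires PySpark 2.3 with Arrow support; not runnable standalone):

```python
# Sketch of the renamed pandas UDF types (formerly "PANDAS SCALAR UDF" and
# "PANDAS GROUP MAP UDF" respectively):
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def plus_one(v):
    # v is a pandas Series; the UDF is applied column-wise in batches
    return v + 1
```

The JIRA only renames the public-facing type names; the execution model of the UDFs is unchanged.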
[jira] [Comment Edited] (SPARK-23246) (Py)Spark OOM because of iteratively accumulated metadata that cannot be cleared
[ https://issues.apache.org/jira/browse/SPARK-23246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344068#comment-16344068 ] MBA Learns to Code edited comment on SPARK-23246 at 1/29/18 9:45 PM: - [~srowen] the Java driver heap dump is attached above for your review. I ran the job with Spark UI disabled (spark.ui.enabled = 'false'). was (Author: mbalearnstocode): [~srowen] the Java driver heap dump is attached above for your review. > (Py)Spark OOM because of iteratively accumulated metadata that cannot be > cleared > > > Key: SPARK-23246 > URL: https://issues.apache.org/jira/browse/SPARK-23246 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: MBA Learns to Code >Priority: Critical > Attachments: SparkProgramHeapDump.bin.tar.xz > > > I am having consistent OOM crashes when trying to use PySpark for iterative > algorithms in which I create new DataFrames per iteration (e.g. by sampling > from a "mother" DataFrame), do something with such DataFrames, and never need > such DataFrames ever in future iterations. > The below script simulates such OOM failures. Even when one tries explicitly > .unpersist() the temporary DataFrames (by using the --unpersist flag below) > and/or deleting and garbage-collecting the Python objects (by using the > --py-gc flag below), the Java objects seem to stay on and accumulate until > they exceed the JVM/driver memory. > The more complex the temporary DataFrames in each iteration (illustrated by > the --n-partitions flag below), the faster OOM occurs. 
> The typical error messages include: > - "java.lang.OutOfMemoryError : GC overhead limit exceeded" > - "Java heap space" > - "ERROR TransportRequestHandler: Error sending result > RpcResponse{requestId=6053742323219781 > 161, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 > cap=64]}} to /; closing connection" > Please suggest how I may overcome this so that we can have long-running > iterative programs using Spark that uses resources only up to a bounded, > controllable limit. > > {code:java} > from __future__ import print_function > import argparse > import gc > import pandas > import pyspark > arg_parser = argparse.ArgumentParser() > arg_parser.add_argument('--unpersist', action='store_true') > arg_parser.add_argument('--py-gc', action='store_true') > arg_parser.add_argument('--n-partitions', type=int, default=1000) > args = arg_parser.parse_args() > # create SparkSession (*** set spark.driver.memory to 512m in > spark-defaults.conf ***) > spark = pyspark.sql.SparkSession.builder \ > .config('spark.executor.instances', 2) \ > .config('spark.executor.cores', 2) \ > .config('spark.executor.memory', '512m') \ > .config('spark.ui.enabled', False) \ > .config('spark.ui.retainedJobs', 10) \ > .enableHiveSupport() \ > .getOrCreate() > # create Parquet file for subsequent repeated loading > df = spark.createDataFrame( > pandas.DataFrame( > dict( > row=range(args.n_partitions), > x=args.n_partitions * [0] > ) > ) > ) > parquet_path = '/tmp/TestOOM-{}Partitions.parquet'.format(args.n_partitions) > df.write.parquet( > path=parquet_path, > partitionBy='row', > mode='overwrite' > ) > i = 0 > # the below loop simulates an iterative algorithm that creates new DataFrames > in each iteration (e.g. 
sampling from a "mother" DataFrame), do something, > and never need those DataFrames again in future iterations > # we are having a problem cleaning up the built-up metadata > # hence the program will crash after while because of OOM > while True: > _df = spark.read.parquet(parquet_path) > if args.unpersist: > _df.unpersist() > if args.py_gc: > del _df > gc.collect() > i += 1; print('COMPLETED READ ITERATION #{}\n'.format(i)) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23246) (Py)Spark OOM because of iteratively accumulated metadata that cannot be cleared
[ https://issues.apache.org/jira/browse/SPARK-23246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344068#comment-16344068 ] MBA Learns to Code commented on SPARK-23246: [~srowen] the Java driver heap dump is attached above for your review. > (Py)Spark OOM because of iteratively accumulated metadata that cannot be > cleared > > > Key: SPARK-23246 > URL: https://issues.apache.org/jira/browse/SPARK-23246 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: MBA Learns to Code >Priority: Critical > Attachments: SparkProgramHeapDump.bin.tar.xz > > > I am having consistent OOM crashes when trying to use PySpark for iterative > algorithms in which I create new DataFrames per iteration (e.g. by sampling > from a "mother" DataFrame), do something with such DataFrames, and never need > such DataFrames ever in future iterations. > The below script simulates such OOM failures. Even when one tries explicitly > .unpersist() the temporary DataFrames (by using the --unpersist flag below) > and/or deleting and garbage-collecting the Python objects (by using the > --py-gc flag below), the Java objects seem to stay on and accumulate until > they exceed the JVM/driver memory. > The more complex the temporary DataFrames in each iteration (illustrated by > the --n-partitions flag below), the faster OOM occurs. > The typical error messages include: > - "java.lang.OutOfMemoryError : GC overhead limit exceeded" > - "Java heap space" > - "ERROR TransportRequestHandler: Error sending result > RpcResponse{requestId=6053742323219781 > 161, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 > cap=64]}} to /; closing connection" > Please suggest how I may overcome this so that we can have long-running > iterative programs using Spark that uses resources only up to a bounded, > controllable limit. 
> > {code:java} > from __future__ import print_function > import argparse > import gc > import pandas > import pyspark > arg_parser = argparse.ArgumentParser() > arg_parser.add_argument('--unpersist', action='store_true') > arg_parser.add_argument('--py-gc', action='store_true') > arg_parser.add_argument('--n-partitions', type=int, default=1000) > args = arg_parser.parse_args() > # create SparkSession (*** set spark.driver.memory to 512m in > spark-defaults.conf ***) > spark = pyspark.sql.SparkSession.builder \ > .config('spark.executor.instances', 2) \ > .config('spark.executor.cores', 2) \ > .config('spark.executor.memory', '512m') \ > .config('spark.ui.enabled', False) \ > .config('spark.ui.retainedJobs', 10) \ > .enableHiveSupport() \ > .getOrCreate() > # create Parquet file for subsequent repeated loading > df = spark.createDataFrame( > pandas.DataFrame( > dict( > row=range(args.n_partitions), > x=args.n_partitions * [0] > ) > ) > ) > parquet_path = '/tmp/TestOOM-{}Partitions.parquet'.format(args.n_partitions) > df.write.parquet( > path=parquet_path, > partitionBy='row', > mode='overwrite' > ) > i = 0 > # the below loop simulates an iterative algorithm that creates new DataFrames > in each iteration (e.g. sampling from a "mother" DataFrame), do something, > and never need those DataFrames again in future iterations > # we are having a problem cleaning up the built-up metadata > # hence the program will crash after while because of OOM > while True: > _df = spark.read.parquet(parquet_path) > if args.unpersist: > _df.unpersist() > if args.py_gc: > del _df > gc.collect() > i += 1; print('COMPLETED READ ITERATION #{}\n'.format(i)) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
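One mitigation pattern worth noting here, which is not proposed in the ticket itself: if driver-side metadata accumulates per iteration and cannot be cleared, accumulation can be bounded by recycling the session every N iterations. The sketch below is generic and hypothetical — `run_with_recycling`, `FakeSession`, and `make_session` are illustrative names only; in a real program the factory would rebuild a `pyspark.sql.SparkSession`.

```python
def run_with_recycling(make_session, work, iterations, recycle_every):
    """Run work(session, i) each iteration, replacing the session every
    recycle_every iterations so accumulated per-session state stays bounded."""
    session = make_session()
    restarts = 0
    try:
        for i in range(iterations):
            if i > 0 and i % recycle_every == 0:
                session.stop()          # release driver-side state
                session = make_session()
                restarts += 1
            work(session, i)
    finally:
        session.stop()
    return restarts


class FakeSession:
    """Stand-in for a SparkSession so the pattern can run without Spark."""
    def __init__(self):
        self.metadata = []   # grows each iteration, like the leaked metadata
        self.stopped = False

    def stop(self):
        self.stopped = True


sessions = []

def make_session():
    s = FakeSession()
    sessions.append(s)
    return s

def work(session, i):
    session.metadata.append(i)  # simulate per-iteration accumulation

restarts = run_with_recycling(make_session, work, iterations=10, recycle_every=4)
print(restarts)                                 # 2 (sessions replaced at i=4 and i=8)
print(max(len(s.metadata) for s in sessions))   # 4: accumulation is bounded per session
```

The trade-off is that any cached data and temporary views die with each recycled session, so this only fits algorithms whose per-iteration state is truly disposable.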
[jira] [Updated] (SPARK-23246) (Py)Spark OOM because of iteratively accumulated metadata that cannot be cleared
[ https://issues.apache.org/jira/browse/SPARK-23246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MBA Learns to Code updated SPARK-23246: --- Attachment: SparkProgramHeapDump.bin.tar.xz > (Py)Spark OOM because of iteratively accumulated metadata that cannot be > cleared > > > Key: SPARK-23246 > URL: https://issues.apache.org/jira/browse/SPARK-23246 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: MBA Learns to Code >Priority: Critical > Attachments: SparkProgramHeapDump.bin.tar.xz > > > I am having consistent OOM crashes when trying to use PySpark for iterative > algorithms in which I create new DataFrames per iteration (e.g. by sampling > from a "mother" DataFrame), do something with such DataFrames, and never need > such DataFrames again in future iterations. > The script below simulates such OOM failures. Even when one explicitly tries > to .unpersist() the temporary DataFrames (by using the --unpersist flag below) > and/or deleting and garbage-collecting the Python objects (by using the > --py-gc flag below), the Java objects seem to stay on and accumulate until > they exceed the JVM/driver memory. > The more complex the temporary DataFrames in each iteration (illustrated by > the --n-partitions flag below), the faster OOM occurs. > The typical error messages include: > - "java.lang.OutOfMemoryError : GC overhead limit exceeded" > - "Java heap space" > - "ERROR TransportRequestHandler: Error sending result > RpcResponse{requestId=6053742323219781 > 161, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 > cap=64]}} to /; closing connection" > Please suggest how I may overcome this so that we can have long-running > iterative programs using Spark that use resources only up to a bounded, > controllable limit. 
> > {code:java} > from __future__ import print_function > import argparse > import gc > import pandas > import pyspark > arg_parser = argparse.ArgumentParser() > arg_parser.add_argument('--unpersist', action='store_true') > arg_parser.add_argument('--py-gc', action='store_true') > arg_parser.add_argument('--n-partitions', type=int, default=1000) > args = arg_parser.parse_args() > # create SparkSession (*** set spark.driver.memory to 512m in > spark-defaults.conf ***) > spark = pyspark.sql.SparkSession.builder \ > .config('spark.executor.instances', 2) \ > .config('spark.executor.cores', 2) \ > .config('spark.executor.memory', '512m') \ > .config('spark.ui.enabled', False) \ > .config('spark.ui.retainedJobs', 10) \ > .enableHiveSupport() \ > .getOrCreate() > # create Parquet file for subsequent repeated loading > df = spark.createDataFrame( > pandas.DataFrame( > dict( > row=range(args.n_partitions), > x=args.n_partitions * [0] > ) > ) > ) > parquet_path = '/tmp/TestOOM-{}Partitions.parquet'.format(args.n_partitions) > df.write.parquet( > path=parquet_path, > partitionBy='row', > mode='overwrite' > ) > i = 0 > # the below loop simulates an iterative algorithm that creates new DataFrames > in each iteration (e.g. sampling from a "mother" DataFrame), do something, > and never need those DataFrames again in future iterations > # we are having a problem cleaning up the built-up metadata > # hence the program will crash after while because of OOM > while True: > _df = spark.read.parquet(parquet_path) > if args.unpersist: > _df.unpersist() > if args.py_gc: > del _df > gc.collect() > i += 1; print('COMPLETED READ ITERATION #{}\n'.format(i)) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23261) Rename Pandas UDFs
Xiao Li created SPARK-23261: --- Summary: Rename Pandas UDFs Key: SPARK-23261 URL: https://issues.apache.org/jira/browse/SPARK-23261 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li Rename the public APIs of pandas udfs from - PANDAS SCALAR UDF -> SCALAR PANDAS UDF - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF - PANDAS GROUP AGG UDF -> PANDAS UDAF -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23157) withColumn fails for a column that is a result of a mapped Dataset
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343962#comment-16343962 ] Henry Robinson commented on SPARK-23157: [~kretes] - I can see an argument for the behaviour you're describing, but that's apparently not the way the API is intended to work. Like Sean says, there are way too many ways to shoot yourself in the foot if you can stitch together arbitrary Datasets like this when they are column-wise incompatible, and allowing the relatively small subset of cases where it would work would lead to a more confusing API, IMO. The documentation for {{withColumn()}} could be updated to make this clearer; if I get a moment today I'll submit a PR. > withColumn fails for a column that is a result of a mapped Dataset > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Priority: Minor > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23260) remove V2 from the class name of data source reader/writer
[ https://issues.apache.org/jira/browse/SPARK-23260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343878#comment-16343878 ] Apache Spark commented on SPARK-23260: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20427 > remove V2 from the class name of data source reader/writer > -- > > Key: SPARK-23260 > URL: https://issues.apache.org/jira/browse/SPARK-23260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23260) remove V2 from the class name of data source reader/writer
[ https://issues.apache.org/jira/browse/SPARK-23260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23260: Assignee: Apache Spark (was: Wenchen Fan) > remove V2 from the class name of data source reader/writer > -- > > Key: SPARK-23260 > URL: https://issues.apache.org/jira/browse/SPARK-23260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23260) remove V2 from the class name of data source reader/writer
[ https://issues.apache.org/jira/browse/SPARK-23260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23260: Assignee: Wenchen Fan (was: Apache Spark) > remove V2 from the class name of data source reader/writer > -- > > Key: SPARK-23260 > URL: https://issues.apache.org/jira/browse/SPARK-23260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23260) remove V2 from the class name of data source reader/writer
Wenchen Fan created SPARK-23260: --- Summary: remove V2 from the class name of data source reader/writer Key: SPARK-23260 URL: https://issues.apache.org/jira/browse/SPARK-23260 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23207) Shuffle+Repartition on a DataFrame could lead to incorrect answers
[ https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343840#comment-16343840 ] Apache Spark commented on SPARK-23207: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/20426 > Shuffle+Repartition on a DataFrame could lead to incorrect answers > --- > > Key: SPARK-23207 > URL: https://issues.apache.org/jira/browse/SPARK-23207 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Blocker > Fix For: 2.3.0, 2.4.0 > > > Currently shuffle repartition uses RoundRobinPartitioning; the generated > result is nondeterministic since the ordering of input rows is not > determined. > The bug can be triggered when there is a repartition call following a shuffle > (which would lead to non-deterministic row ordering), as the pattern shows > below: > upstream stage -> repartition stage -> result stage > (-> indicates a shuffle) > When one of the executor processes goes down, some tasks on the repartition > stage will be retried and generate inconsistent ordering, and some tasks of > the result stage will be retried, generating different data. > The following code returns 931532, instead of 1000000: > {code} > import scala.sys.process._ > import org.apache.spark.TaskContext > val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x => > x > }.repartition(200).map { x => > if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) { > throw new Exception("pkill -f java".!!) > } > x > } > res.distinct().count() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
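The round-robin nondeterminism at the heart of SPARK-23207 can be illustrated without Spark. This is a plain-Python sketch, not Spark's implementation: round-robin assignment depends only on each row's position in the input sequence, so a retried task that produces the same rows in a different order places them in different partitions.

```python
def round_robin_partition(rows, num_partitions):
    """Assign each row to a partition by its position in the input sequence."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

first_attempt = round_robin_partition([1, 2, 3, 4, 5, 6], 2)
retry_attempt = round_robin_partition([2, 1, 3, 4, 6, 5], 2)  # same rows, reordered

print(first_attempt)  # [[1, 3, 5], [2, 4, 6]]
print(retry_attempt)  # [[2, 3, 6], [1, 4, 5]]
```

Both attempts see the same multiset of rows but produce different partition contents; if only some downstream tasks are retried, they read data that disagrees with the first attempt, which is how rows get lost or duplicated in the count above.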
[jira] [Assigned] (SPARK-23259) Clean up legacy code around hive external catalog
[ https://issues.apache.org/jira/browse/SPARK-23259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23259: Assignee: Apache Spark > Clean up legacy code around hive external catalog > - > > Key: SPARK-23259 > URL: https://issues.apache.org/jira/browse/SPARK-23259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Feng Liu >Assignee: Apache Spark >Priority: Major > > Some legacy code around the hive metastore catalog needs to be removed for > further code improvement: > # in HiveExternalCatalog: The `withClient` wrapper is not necessary for the > private method `getRawTable`. > # in HiveClientImpl: The statement `runSqlHive()` is not necessary for the > `addJar` method, after the jar is added to the single class loader. > # in HiveClientImpl: There is some redundant code in both the `tableExists` > and `getTableOption` methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23259) Clean up legacy code around hive external catalog
[ https://issues.apache.org/jira/browse/SPARK-23259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23259: Assignee: (was: Apache Spark) > Clean up legacy code around hive external catalog > - > > Key: SPARK-23259 > URL: https://issues.apache.org/jira/browse/SPARK-23259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Feng Liu >Priority: Major > > Some legacy code around the hive metastore catalog needs to be removed for > further code improvement: > # in HiveExternalCatalog: The `withClient` wrapper is not necessary for the > private method `getRawTable`. > # in HiveClientImpl: The statement `runSqlHive()` is not necessary for the > `addJar` method, after the jar is added to the single class loader. > # in HiveClientImpl: There is some redundant code in both the `tableExists` > and `getTableOption` methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23259) Clean up legacy code around hive external catalog
[ https://issues.apache.org/jira/browse/SPARK-23259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343814#comment-16343814 ] Apache Spark commented on SPARK-23259: -- User 'liufengdb' has created a pull request for this issue: https://github.com/apache/spark/pull/20425 > Clean up legacy code around hive external catalog > - > > Key: SPARK-23259 > URL: https://issues.apache.org/jira/browse/SPARK-23259 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Feng Liu >Priority: Major > > Some legacy code around the hive metastore catalog needs to be removed for > further code improvement: > # in HiveExternalCatalog: The `withClient` wrapper is not necessary for the > private method `getRawTable`. > # in HiveClientImpl: The statement `runSqlHive()` is not necessary for the > `addJar` method, after the jar is added to the single class loader. > # in HiveClientImpl: There is some redundant code in both the `tableExists` > and `getTableOption` methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23259) Clean up legacy code around hive external catalog
Feng Liu created SPARK-23259: Summary: Clean up legacy code around hive external catalog Key: SPARK-23259 URL: https://issues.apache.org/jira/browse/SPARK-23259 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Feng Liu Some legacy code around the hive metastore catalog needs to be removed for further code improvement: # in HiveExternalCatalog: The `withClient` wrapper is not necessary for the private method `getRawTable`. # in HiveClientImpl: The statement `runSqlHive()` is not necessary for the `addJar` method, after the jar is added to the single class loader. # in HiveClientImpl: There is some redundant code in both the `tableExists` and `getTableOption` methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23240) PythonWorkerFactory issues unhelpful message when pyspark.daemon produces bogus stdout
[ https://issues.apache.org/jira/browse/SPARK-23240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23240: Assignee: (was: Apache Spark) > PythonWorkerFactory issues unhelpful message when pyspark.daemon produces > bogus stdout > -- > > Key: SPARK-23240 > URL: https://issues.apache.org/jira/browse/SPARK-23240 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >Reporter: Bruce Robbins >Priority: Minor > > Environmental issues or site-local customizations (e.g., sitecustomize.py > present in the python install directory) can interfere with daemon.py’s > output to stdout. PythonWorkerFactory produces unhelpful messages when this > happens, causing some head scratching before the actual issue is determined. > Case #1: Extraneous data in pyspark.daemon’s stdout. In this case, > PythonWorkerFactory uses the output as the daemon’s port number and ends up > throwing an exception when creating the socket: > {noformat} > java.lang.IllegalArgumentException: port out of range:1819239265 > at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143) > at java.net.InetSocketAddress.(InetSocketAddress.java:188) > at java.net.Socket.(Socket.java:244) > at > org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:78) > {noformat} > Case #2: No data in pyspark.daemon’s stdout. In this case, > PythonWorkerFactory throws an EOFException while reading from the > Process input stream. > The second case is somewhat less mysterious than the first, because > PythonWorkerFactory also displays the stderr from the python process. > When there is unexpected or missing output in pyspark.daemon’s stdout, > PythonWorkerFactory should say so. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23240) PythonWorkerFactory issues unhelpful message when pyspark.daemon produces bogus stdout
[ https://issues.apache.org/jira/browse/SPARK-23240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23240: Assignee: Apache Spark > PythonWorkerFactory issues unhelpful message when pyspark.daemon produces > bogus stdout > -- > > Key: SPARK-23240 > URL: https://issues.apache.org/jira/browse/SPARK-23240 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Minor > > Environmental issues or site-local customizations (e.g., sitecustomize.py > present in the python install directory) can interfere with daemon.py’s > output to stdout. PythonWorkerFactory produces unhelpful messages when this > happens, causing some head scratching before the actual issue is determined. > Case #1: Extraneous data in pyspark.daemon’s stdout. In this case, > PythonWorkerFactory uses the output as the daemon’s port number and ends up > throwing an exception when creating the socket: > {noformat} > java.lang.IllegalArgumentException: port out of range:1819239265 > at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143) > at java.net.InetSocketAddress.(InetSocketAddress.java:188) > at java.net.Socket.(Socket.java:244) > at > org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:78) > {noformat} > Case #2: No data in pyspark.daemon’s stdout. In this case, > PythonWorkerFactory throws an EOFException while reading from the > Process input stream. > The second case is somewhat less mysterious than the first, because > PythonWorkerFactory also displays the stderr from the python process. > When there is unexpected or missing output in pyspark.daemon’s stdout, > PythonWorkerFactory should say so. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23240) PythonWorkerFactory issues unhelpful message when pyspark.daemon produces bogus stdout
[ https://issues.apache.org/jira/browse/SPARK-23240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343792#comment-16343792 ] Apache Spark commented on SPARK-23240: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/20424 > PythonWorkerFactory issues unhelpful message when pyspark.daemon produces > bogus stdout > -- > > Key: SPARK-23240 > URL: https://issues.apache.org/jira/browse/SPARK-23240 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >Reporter: Bruce Robbins >Priority: Minor > > Environmental issues or site-local customizations (e.g., sitecustomize.py > present in the python install directory) can interfere with daemon.py’s > output to stdout. PythonWorkerFactory produces unhelpful messages when this > happens, causing some head scratching before the actual issue is determined. > Case #1: Extraneous data in pyspark.daemon’s stdout. In this case, > PythonWorkerFactory uses the output as the daemon’s port number and ends up > throwing an exception when creating the socket: > {noformat} > java.lang.IllegalArgumentException: port out of range:1819239265 > at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143) > at java.net.InetSocketAddress.(InetSocketAddress.java:188) > at java.net.Socket.(Socket.java:244) > at > org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:78) > {noformat} > Case #2: No data in pyspark.daemon’s stdout. In this case, > PythonWorkerFactory throws an EOFException while reading from the > Process input stream. > The second case is somewhat less mysterious than the first, because > PythonWorkerFactory also displays the stderr from the python process. > When there is unexpected or missing output in pyspark.daemon’s stdout, > PythonWorkerFactory should say so. 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
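The friendlier diagnostic the ticket asks for can be sketched in a few lines. This is a hypothetical helper, not the actual PythonWorkerFactory code (which is Scala): instead of blindly treating whatever appears first on the daemon's stdout as a port number, it validates the value and includes the raw output in the error message.

```python
def read_daemon_port(stdout_line):
    """Parse a daemon port announcement, failing with a diagnostic that
    shows the unexpected stdout instead of a bare 'port out of range'."""
    text = stdout_line.strip() if stdout_line else ""
    if not text:
        raise RuntimeError("pyspark.daemon produced no output on stdout; "
                           "expected a port number")
    try:
        port = int(text)
    except ValueError:
        raise RuntimeError("unexpected output from pyspark.daemon stdout: "
                           f"{text!r} (expected a port number)") from None
    if not (0 < port <= 65535):
        raise RuntimeError(f"pyspark.daemon announced invalid port {port}; "
                           "stdout may be polluted (e.g. by sitecustomize.py)")
    return port

print(read_daemon_port("45123\n"))  # 45123
```

Fed the value from the stack trace above ("1819239265"), this raises an error naming the bogus value rather than an IllegalArgumentException deep in socket construction.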
[jira] [Commented] (SPARK-22221) Add User Documentation for Working with Arrow in Spark
[ https://issues.apache.org/jira/browse/SPARK-22221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343785#comment-16343785 ] Apache Spark commented on SPARK-22221: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/20423 > Add User Documentation for Working with Arrow in Spark > -- > > Key: SPARK-22221 > URL: https://issues.apache.org/jira/browse/SPARK-22221 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.3.0 > > > There needs to be user facing documentation that will show how to enable/use > Arrow with Spark, what the user should expect, and describe any differences > with similar existing functionality. > A comment from Xiao Li on https://github.com/apache/spark/pull/18664 > Given the users/applications contain the Timestamp in their Dataset and their > processing algorithms also need to have the codes based on the corresponding > time-zone related assumptions. > * For the new users/applications, they first enabled Arrow and later hit an > Arrow bug? Can they simply turn off spark.sql.execution.arrow.enable? If not, > what should they do? > * For the existing users/applications, they want to utilize Arrow for better > performance. Can they just turn on spark.sql.execution.arrow.enable? What > should they do? > Note Hopefully, the guides/solutions are user-friendly. That means, it must > be very simple to understand for most users. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22221) Add User Documentation for Working with Arrow in Spark
[ https://issues.apache.org/jira/browse/SPARK-22221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22221. - Resolution: Fixed Assignee: Bryan Cutler Fix Version/s: 2.3.0 > Add User Documentation for Working with Arrow in Spark > -- > > Key: SPARK-22221 > URL: https://issues.apache.org/jira/browse/SPARK-22221 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.3.0 > > > There needs to be user facing documentation that will show how to enable/use > Arrow with Spark, what the user should expect, and describe any differences > with similar existing functionality. > A comment from Xiao Li on https://github.com/apache/spark/pull/18664 > Given the users/applications contain the Timestamp in their Dataset and their > processing algorithms also need to have the codes based on the corresponding > time-zone related assumptions. > * For the new users/applications, they first enabled Arrow and later hit an > Arrow bug? Can they simply turn off spark.sql.execution.arrow.enable? If not, > what should they do? > * For the existing users/applications, they want to utilize Arrow for better > performance. Can they just turn on spark.sql.execution.arrow.enable? What > should they do? > Note Hopefully, the guides/solutions are user-friendly. That means, it must > be very simple to understand for most users. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23258) Should not split Arrow record batches based on row count
Bryan Cutler created SPARK-23258: Summary: Should not split Arrow record batches based on row count Key: SPARK-23258 URL: https://issues.apache.org/jira/browse/SPARK-23258 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Bryan Cutler Currently when executing scalar {{pandas_udf}} or using {{toPandas()}} the Arrow record batches are split up once the record count reaches a max value, which is configured with "spark.sql.execution.arrow.maxRecordsPerBatch". This is not ideal because the number of columns is not taken into account and if there are many columns, then OOMs can occur. An alternative approach could be to look at the size of the Arrow buffers being used and cap it at a certain size. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
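The size-capped alternative SPARK-23258 suggests can be sketched with assumed semantics (this is not Spark's implementation; `batch_by_size` and `row_size_fn` are hypothetical names): group rows into batches whose estimated byte size stays under a cap, so wide rows yield smaller batches instead of a fixed maxRecordsPerBatch regardless of width.

```python
def batch_by_size(rows, row_size_fn, max_batch_bytes):
    """Group rows into batches whose estimated total size stays at or under
    max_batch_bytes; each batch always holds at least one row."""
    batches, current, current_bytes = [], [], 0
    for row in rows:
        size = row_size_fn(row)
        if current and current_bytes + size > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(row)
        current_bytes += size
    if current:
        batches.append(current)
    return batches

# Narrow rows (8 bytes each) pack many per batch; wide rows (40 bytes each)
# pack few, rather than both being cut at the same record count.
narrow = batch_by_size(range(10), lambda r: 8, max_batch_bytes=32)
wide = batch_by_size(range(10), lambda r: 40, max_batch_bytes=32)
print([len(b) for b in narrow])  # [4, 4, 2]
print([len(b) for b in wide])    # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

In practice, the estimate would come from the sizes of the Arrow buffers being filled, as the ticket proposes, rather than a per-row callback.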
[jira] [Commented] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343693#comment-16343693 ] Marcelo Vanzin commented on SPARK-23020: :-/ It's getting harder and harder to reproduce these races locally... this one may take a while. > Re-enable Flaky Test: > org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Sameer Agarwal >Assignee: Marcelo Vanzin >Priority: Blocker > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23109) ML 2.3 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16332698#comment-16332698 ] Bryan Cutler edited comment on SPARK-23109 at 1/29/18 5:26 PM: --- I did the following: generated HTML doc and checked for consistency with Scala, did not see any API breaking changes, checked for missing items (see list below), checked default param values match. No blocking or major issues found. Items requiring follow up, I will create (related) JIRAs to fix: classification: GBTClassifier - missing featureSubsetStrategy, should be moved to TreeEnsembleParams GBTClassificationModel - missing numClasses, should inherit from JavaClassificationModel for both of the above SPARK-23161 clustering: GaussianMixtureModel - missing gaussians, need to serialize Array[MultivariateGaussian]? LDAModel - missing topicsMatrix - can send Matrix through Py4J? evaluation: ClusteringEvaluator - DOC describe silhouette like scaladoc feature: Bucketizer - multiple input/output cols, splitsArray - SPARK-22797 ChiSqSelector - DOC selectorType desc missing new types QuantileDiscretizer - multiple input/output cols - SPARK-22796 fpm: DOC associationRules should say return "DataFrame" image: missing columnSchema, get*, scala missing toNDArray - SPARK-23256 regression: LinearRegressionSummary - missing r2adj - SPARK-23162 stat: missing Summarizer class - SPARK-21741 tuning: missing subModels, hasSubModels - SPARK-22005 for the above DOC issues SPARK-23163 was (Author: bryanc): I did the following: generated HTML doc and checked for consistency with Scala, did not see any API breaking changes, checked for missing items (see list below), checked default param values match. No blocking or major issues found. 
Items requiring follow-up, I will create (related) JIRAs to fix: classification: GBTClassifier - missing featureSubsetStrategy, should be moved to TreeEnsembleParams GBTClassificationModel - missing numClasses, should inherit from JavaClassificationModel for both of the above https://issues.apache.org/jira/browse/SPARK-23161 clustering: GaussianMixtureModel - missing gaussians, need to serialize Array[MultivariateGaussian]? LDAModel - missing topicsMatrix - can send Matrix through Py4J? evaluation: ClusteringEvaluator - DOC describe silhouette like scaladoc feature: Bucketizer - multiple input/output cols, splitsArray - https://issues.apache.org/jira/browse/SPARK-22797 ChiSqSelector - DOC selectorType desc missing new types QuantileDiscretizer - multiple input/output cols - https://issues.apache.org/jira/browse/SPARK-22796 fpm: DOC associationRules should say return "DataFrame" image: missing columnSchema, get*, scala missing toNDArray - SPARK-23256 regression: LinearRegressionSummary - missing r2adj - https://issues.apache.org/jira/browse/SPARK-23162 stat: missing Summarizer class - https://issues.apache.org/jira/browse/SPARK-21741 tuning: missing subModels, hasSubModels - https://issues.apache.org/jira/browse/SPARK-22005 for the above DOC issues https://issues.apache.org/jira/browse/SPARK-23163 > ML 2.3 QA: API: Python API coverage > --- > > Key: SPARK-23109 > URL: https://issues.apache.org/jira/browse/SPARK-23109 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Bryan Cutler >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? 
> * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.*
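The QuantileDiscretizer and Bucketizer parity items above concern computing quantile-based split points per input column and assigning each value to a bucket. As a rough single-column illustration of that operation (illustrative only, not Spark's implementation, which computes approximate quantiles distributed across the cluster; the function names here are hypothetical):

```python
import bisect

def quantile_splits(values, num_buckets):
    """Interior split points at the 1/k, 2/k, ... quantiles (illustrative only)."""
    s = sorted(values)
    return [s[(len(s) * i) // num_buckets] for i in range(1, num_buckets)]

def bucketize(value, splits):
    """Bucket index for a value, given ascending split points."""
    return bisect.bisect_right(splits, value)

vals = [1, 3, 5, 7, 9, 11, 13, 15]
splits = quantile_splits(vals, 4)   # [5, 9, 13]
print(bucketize(9, splits))         # 2
```

The multi-column variant applies the same per-column computation to each input column, which is also why a single {{numBuckets}} value can sensibly be shared across all columns.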
[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343665#comment-16343665 ] Bryan Cutler commented on SPARK-23109: -- Thanks [~mlnick], yes this is done. > ML 2.3 QA: API: Python API coverage > --- > > Key: SPARK-23109 > URL: https://issues.apache.org/jira/browse/SPARK-23109 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Bryan Cutler >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23109) ML 2.3 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-23109. -- Resolution: Done > ML 2.3 QA: API: Python API coverage > --- > > Key: SPARK-23109 > URL: https://issues.apache.org/jira/browse/SPARK-23109 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Bryan Cutler >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17006) WithColumn Performance Degrades with Number of Invocations
[ https://issues.apache.org/jira/browse/SPARK-17006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17006. --- Resolution: Fixed Assignee: Herman van Hovell Fix Version/s: 2.3.0 > WithColumn Performance Degrades with Number of Invocations > -- > > Key: SPARK-17006 > URL: https://issues.apache.org/jira/browse/SPARK-17006 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hamel Ajay Kothari >Assignee: Herman van Hovell >Priority: Major > Fix For: 2.3.0 > > > Consider the following test case. We create a dataframe with 100 withColumn > statements, then 100 more, then 100 more, then 100 more. Each time we do this > it gets slower pretty drastically. If we sub in the optimized plan, we end up > with drastically better performance. > Consider the following code: > {code} > val raw = sc.parallelize(Range(1, 100)).toDF > val s1 = System.nanoTime() > var mapped = Range(1, 100).foldLeft(raw) { (df, i) => > df.withColumn(s"value${i}", df("value") + i) > } > val s2 = System.nanoTime() > val mapped2 = Range(1, 100).foldLeft(mapped) { (df, i) => > df.withColumn(s"value${i}_2", df("value") + i) > } > val s3 = System.nanoTime() > val mapped3 = Range(1, 100).foldLeft(mapped2) { (df, i) => > df.withColumn(s"value${i}_3", df("value") + i) > } > val s4 = System.nanoTime() > val mapped4 = Range(1, 100).foldLeft(mapped3) { (df, i) => > df.withColumn(s"value${i}_4", df("value") + i) > } > val s5 = System.nanoTime() > val plan = mapped3.queryExecution.optimizedPlan > val optimizedMapped3 = new org.apache.spark.sql.DataFrame(spark, plan, > org.apache.spark.sql.catalyst.encoders.RowEncoder(mapped3.schema)) > val s6 = System.nanoTime() > val mapped5 = Range(1, 100).foldLeft(optimizedMapped3) { (df, i) => > df.withColumn(s"value${i}_4", df("value") + i) > } > val s7 = System.nanoTime() > val mapped6 = Range(1, 100).foldLeft(mapped3) { (df, i) => > df.withColumn(s"value${i}_4", df("value") + i) > 
} > val s8 = System.nanoTime() > val plan = mapped3.queryExecution.analyzed > val analyzedMapped4 = new org.apache.spark.sql.DataFrame(spark, plan, > org.apache.spark.sql.catalyst.encoders.RowEncoder(mapped3.schema)) > val mapped7 = Range(1, 100).foldLeft(analyzedMapped4) { (df, i) => > df.withColumn(s"value${i}_4", df("value") + i) > } > val s9 = System.nanoTime() > val secondsToNanos = 1000*1000*1000.0 > val stage1 = (s2-s1)/secondsToNanos > val stage2 = (s3-s2)/secondsToNanos > val stage3 = (s4-s3)/secondsToNanos > val stage4 = (s5-s4)/secondsToNanos > val stage5 = (s6-s5)/secondsToNanos > val stage6 = (s7-s6)/secondsToNanos > val stage7 = (s8-s7)/secondsToNanos > val stage8 = (s9-s8)/secondsToNanos > println(s"First 100: ${stage1}") > println(s"Second 100: ${stage2}") > println(s"Third 100: ${stage3}") > println(s"Fourth 100: ${stage4}") > println(s"Fourth 100 Optimization time: ${stage5}") > println(s"Fourth 100 Optimized ${stage6}") > println(s"Fourth Unoptimized (to make sure no caching/etc takes place, > reusing analyzed etc: ${stage7}") > println(s"Fourth selects: ${stage8}") > {code} > This results in the following performance: > {code} > First 100: 4.873489454 > Second 100: 14.982028303 seconds > Third 100: 38.775467952 seconds > Fourth 100: 73.429119675 seconds > Fourth 100 Optimization time: 1.777374175 seconds > Fourth 100 Optimized 22.514489934 seconds > Fourth Unoptimized (to make sure no caching/etc takes place, reusing analyzed > etc: 69.616112734 seconds > Fourth 100 using analyzed plan: 67.641982709 seconds > {code} > Now, I suspect that we can't just sub in the optimized plan for the logical > plan because we lose a bunch of information which may be useful for > optimization later. But, I suspect there's something we can do in the case of > Projects at least that might be useful. 
[jira] [Resolved] (SPARK-23223) Stacking dataset transforms performs poorly
[ https://issues.apache.org/jira/browse/SPARK-23223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-23223. --- Resolution: Fixed Fix Version/s: 2.3.0 > Stacking dataset transforms performs poorly > --- > > Key: SPARK-23223 > URL: https://issues.apache.org/jira/browse/SPARK-23223 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Major > Fix For: 2.3.0 > > > It is a common pattern to apply multiple transforms to a {{Dataset}} (using > {{Dataset.withColumn}}, for example). This is currently quite expensive because > we run {{CheckAnalysis}} on the full plan and create an encoder for each > intermediate {{Dataset}}. > {{CheckAnalysis}} only needs to be run for the newly added plan components, > and not for the full plan. The addition of the {{AnalysisBarrier}} created > this issue.
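The cost described here can be illustrated with a toy model (hypothetical numbers, not Spark's actual analyzer): if each stacked transform re-runs analysis over the whole plan built so far, adding n columns costs O(n^2) node visits, whereas checking only newly added components, or one pass over the final plan, is linear:

```python
def incremental_analysis_cost(n_columns, base_plan_size=1):
    """Node visits when each transform re-analyzes the whole plan so far."""
    size, cost = base_plan_size, 0
    for _ in range(n_columns):
        size += 1      # plan grows by one Project node
        cost += size   # analysis revisits every node in the plan
    return cost

def batch_analysis_cost(n_columns, base_plan_size=1):
    """Node visits when only the newly built plan is analyzed once."""
    return base_plan_size + n_columns

print(incremental_analysis_cost(100))  # 5150: quadratic growth
print(batch_analysis_cost(100))        # 101: linear
```

This matches the benchmark pattern reported in SPARK-17006, where each additional batch of 100 {{withColumn}} calls takes noticeably longer than the previous one.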
[jira] [Resolved] (SPARK-23059) Correct some improper with view related method usage
[ https://issues.apache.org/jira/browse/SPARK-23059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23059. - Resolution: Fixed Fix Version/s: 2.4.0 > Correct some improper with view related method usage > > > Key: SPARK-23059 > URL: https://issues.apache.org/jira/browse/SPARK-23059 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.1 >Reporter: xubo245 >Priority: Minor > Fix For: 2.4.0 > > > And correct some improper usage like: > {code:java} > test("list global temp views") { > try { > sql("CREATE GLOBAL TEMP VIEW v1 AS SELECT 3, 4") > sql("CREATE TEMP VIEW v2 AS SELECT 1, 2") > checkAnswer(sql(s"SHOW TABLES IN $globalTempDB"), > Row(globalTempDB, "v1", true) :: > Row("", "v2", true) :: Nil) > > assert(spark.catalog.listTables(globalTempDB).collect().toSeq.map(_.name) == > Seq("v1", "v2")) > } finally { > spark.catalog.dropTempView("v1") > spark.catalog.dropGlobalTempView("v2") > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23059) Correct some improper with view related method usage
[ https://issues.apache.org/jira/browse/SPARK-23059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-23059: --- Assignee: xubo245 > Correct some improper with view related method usage > > > Key: SPARK-23059 > URL: https://issues.apache.org/jira/browse/SPARK-23059 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.1 >Reporter: xubo245 >Assignee: xubo245 >Priority: Minor > Fix For: 2.4.0 > > > And correct some improper usage like: > {code:java} > test("list global temp views") { > try { > sql("CREATE GLOBAL TEMP VIEW v1 AS SELECT 3, 4") > sql("CREATE TEMP VIEW v2 AS SELECT 1, 2") > checkAnswer(sql(s"SHOW TABLES IN $globalTempDB"), > Row(globalTempDB, "v1", true) :: > Row("", "v2", true) :: Nil) > > assert(spark.catalog.listTables(globalTempDB).collect().toSeq.map(_.name) == > Seq("v1", "v2")) > } finally { > spark.catalog.dropTempView("v1") > spark.catalog.dropGlobalTempView("v2") > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23199) improved Removes repetition from group expressions in Aggregate
[ https://issues.apache.org/jira/browse/SPARK-23199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23199. - Resolution: Fixed Assignee: caoxuewen Fix Version/s: 2.3.0 > improved Removes repetition from group expressions in Aggregate > --- > > Key: SPARK-23199 > URL: https://issues.apache.org/jira/browse/SPARK-23199 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: caoxuewen >Assignee: caoxuewen >Priority: Major > Fix For: 2.3.0 > > > Currently, all Aggregate operations go through > RemoveRepetitionFromGroupExpressions, but when there are no group expressions, or > no duplicate group expressions among them, there is no need to copy the logical > plan.
[jira] [Resolved] (SPARK-23219) Rename ReadTask to DataReaderFactory
[ https://issues.apache.org/jira/browse/SPARK-23219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23219. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20397 [https://github.com/apache/spark/pull/20397] > Rename ReadTask to DataReaderFactory > > > Key: SPARK-23219 > URL: https://issues.apache.org/jira/browse/SPARK-23219 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23219) Rename ReadTask to DataReaderFactory
[ https://issues.apache.org/jira/browse/SPARK-23219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23219: --- Assignee: Gengliang Wang > Rename ReadTask to DataReaderFactory > > > Key: SPARK-23219 > URL: https://issues.apache.org/jira/browse/SPARK-23219 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20129) JavaSparkContext should use SparkContext.getOrCreate
[ https://issues.apache.org/jira/browse/SPARK-20129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20129. --- Resolution: Won't Fix Assignee: (was: Xiangrui Meng) Per PR discussion, I believe this should simply be Won't Fix. > JavaSparkContext should use SparkContext.getOrCreate > > > Key: SPARK-20129 > URL: https://issues.apache.org/jira/browse/SPARK-20129 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Priority: Minor > > It should re-use an existing SparkContext if there is a live one. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
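The getOrCreate pattern requested above — reuse a live context if one exists instead of constructing a second one — can be sketched generically as follows (illustrative Python, not the actual SparkContext code; the class and method names are hypothetical):

```python
class Context:
    """Minimal singleton-style context with a getOrCreate entry point."""
    _active = None

    def __init__(self, name):
        self.name = name

    @classmethod
    def get_or_create(cls, name="default"):
        # Reuse the live instance if one exists; otherwise create and register it.
        if cls._active is None:
            cls._active = cls(name)
        return cls._active

a = Context.get_or_create("app")
b = Context.get_or_create("other")
print(a is b)    # True: the existing context is reused
print(b.name)    # app: the second name is ignored
```

A subtlety noted in such designs is that configuration passed to later calls is silently ignored once a context is live, which is part of why the PR discussion concluded with Won't Fix.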
[jira] [Commented] (SPARK-23252) When NodeManager and CoarseGrainedExecutorBackend processes are killed, the job will be blocked
[ https://issues.apache.org/jira/browse/SPARK-23252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343358#comment-16343358 ] Sean Owen commented on SPARK-23252: --- That much looks normal if the executor is removed and the tasks relaunched. What happens next? > When NodeManager and CoarseGrainedExecutorBackend processes are killed, the > job will be blocked > --- > > Key: SPARK-23252 > URL: https://issues.apache.org/jira/browse/SPARK-23252 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Bang Xiao >Priority: Major > > This happens when 'spark.dynamicAllocation.enabled' is set to 'true'. We > use Yarn as our resource manager. > 1. spark-submit the "JavaWordCount" application in yarn-client mode > 2. Kill the NodeManager and CoarseGrainedExecutorBackend processes on one node > while the job is in stage 0 > If we kill only the CoarseGrainedExecutorBackend processes on that node, the TaskSetManager > marks the failed tasks as pending and resubmits them. But if the NodeManager and > CoarseGrainedExecutorBackend processes are killed simultaneously, the whole job > is blocked.
[jira] [Created] (SPARK-23257) Implement Kerberos Support in Kubernetes resource manager
Rob Keevil created SPARK-23257: -- Summary: Implement Kerberos Support in Kubernetes resource manager Key: SPARK-23257 URL: https://issues.apache.org/jira/browse/SPARK-23257 Project: Spark Issue Type: Wish Components: Kubernetes Affects Versions: 2.3.0 Reporter: Rob Keevil On the forked k8s branch of Spark at [https://github.com/apache-spark-on-k8s/spark/pull/540] , Kerberos support has been added to the Kubernetes resource manager. The Kubernetes code between these two repositories appears to have diverged, so this commit cannot be merged in simply. Are there any plans to re-implement this work on the main Spark repository? [ifilonenko|https://github.com/ifilonenko] I could not find any discussion about this specific topic online. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23252) When NodeManager and CoarseGrainedExecutorBackend processes are killed, the job will be blocked
[ https://issues.apache.org/jira/browse/SPARK-23252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343317#comment-16343317 ] Bang Xiao commented on SPARK-23252: --- [~srowen] it seems the job waits for the results of those tasks that have failed, but it will never get those results since the failed tasks have not been resubmitted. When the NodeManager and CoarseGrainedExecutorBackend processes are killed simultaneously, the following log appears on the driver end: {code:java} 18/01/29 14:32:35 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 24. 18/01/29 14:32:35 INFO DAGScheduler: Executor lost: 24 (epoch 1) 18/01/29 14:32:35 ERROR TransportClient: Failed to send RPC 8557067791977911361 to /10.142.103.168:14733: java.nio.channels.ClosedChannelException java.nio.channels.ClosedChannelException at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) 18/01/29 14:32:35 ERROR TransportClient: Failed to send RPC 7460664886971675621 to /10.142.103.168:14751: java.nio.channels.ClosedChannelException java.nio.channels.ClosedChannelException at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) 18/01/29 14:32:35 ERROR TransportClient: Failed to send RPC 5802441956021450458 to /10.142.103.168:14750: java.nio.channels.ClosedChannelException java.nio.channels.ClosedChannelException at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) 18/01/29 14:32:35 ERROR TransportClient: Failed to send RPC 9203205043102726551 to /10.142.103.168:14739: java.nio.channels.ClosedChannelException java.nio.channels.ClosedChannelException at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) 18/01/29 14:32:35 ERROR TransportClient: Failed to send RPC 5217847872442409416 to /10.142.103.168:14744: java.nio.channels.ClosedChannelException java.nio.channels.ClosedChannelException at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) 
18/01/29 14:32:35 INFO BlockManagerMasterEndpoint: Trying to remove executor 24 from BlockManagerMaster. 18/01/29 14:32:35 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(24, rsync.slave06.jupiter.zw.ted, 46509, None){code} > When NodeManager and CoarseGrainedExecutorBackend processes are killed, the > job will be blocked > --- > > Key: SPARK-23252 > URL: https://issues.apache.org/jira/browse/SPARK-23252 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Bang Xiao >Priority: Major > > This happens when 'spark.dynamicAllocation.enabled' is set to 'true'. We > use Yarn as our resource manager. > 1. spark-submit the "JavaWordCount" application in yarn-client mode > 2. Kill the NodeManager and CoarseGrainedExecutorBackend processes on one node > while the job is in stage 0 > If we kill only the CoarseGrainedExecutorBackend processes on that node, the TaskSetManager > marks the failed tasks as pending and resubmits them. But if the NodeManager and > CoarseGrainedExecutorBackend processes are killed simultaneously, the whole job > is blocked.
[jira] [Commented] (SPARK-23252) When NodeManager and CoarseGrainedExecutorBackend processes are killed, the job will be blocked
[ https://issues.apache.org/jira/browse/SPARK-23252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343288#comment-16343288 ] Sean Owen commented on SPARK-23252: --- Blocked how? Waiting for the NodeManager? YARN would know the NM is down shortly. > When NodeManager and CoarseGrainedExecutorBackend processes are killed, the > job will be blocked > --- > > Key: SPARK-23252 > URL: https://issues.apache.org/jira/browse/SPARK-23252 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Bang Xiao >Priority: Major > > This happens when 'spark.dynamicAllocation.enabled' is set to 'true'. We > use Yarn as our resource manager. > 1. spark-submit the "JavaWordCount" application in yarn-client mode > 2. Kill the NodeManager and CoarseGrainedExecutorBackend processes on one node > while the job is in stage 0 > If we kill only the CoarseGrainedExecutorBackend processes on that node, the TaskSetManager > marks the failed tasks as pending and resubmits them. But if the NodeManager and > CoarseGrainedExecutorBackend processes are killed simultaneously, the whole job > is blocked.
[jira] [Assigned] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-23108: -- Assignee: Nick Pentreath > ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-23108 > URL: https://issues.apache.org/jira/browse/SPARK-23108 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343278#comment-16343278 ] Nick Pentreath edited comment on SPARK-23108 at 1/29/18 12:14 PM: -- Went through {{Experimental}} APIs, there could be a case for: * {{Regression / Binary / Multiclass}} evaluators as they've been around for a long time. * Linear regression summary (since {{1.5.0}}). * {{AFTSurvivalRegression}} (since {{1.6.0}}). I think at this late stage we should not open up anything, unless anyone feels very strongly? was (Author: mlnick): I think at this late stage we should not open up anything, unless anyone feels very strongly? > ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-23108 > URL: https://issues.apache.org/jira/browse/SPARK-23108 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-23108. Resolution: Resolved Fix Version/s: 2.3.0 > ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-23108 > URL: https://issues.apache.org/jira/browse/SPARK-23108 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath >Priority: Blocker > Fix For: 2.3.0 > > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343290#comment-16343290 ] Nick Pentreath commented on SPARK-23108: Also checked ml {{DeveloperAPI}}, nothing to graduate there I would say. > ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-23108 > URL: https://issues.apache.org/jira/browse/SPARK-23108 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23238) Externalize SQLConf spark.sql.execution.arrow.enabled
[ https://issues.apache.org/jira/browse/SPARK-23238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23238: - Fix Version/s: 2.3.0 > Externalize SQLConf spark.sql.execution.arrow.enabled > -- > > Key: SPARK-23238 > URL: https://issues.apache.org/jira/browse/SPARK-23238 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.0 > >
[jira] [Resolved] (SPARK-23238) Externalize SQLConf spark.sql.execution.arrow.enabled
[ https://issues.apache.org/jira/browse/SPARK-23238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23238. -- Resolution: Fixed Fixed in https://github.com/apache/spark/pull/20403 > Externalize SQLConf spark.sql.execution.arrow.enabled > -- > > Key: SPARK-23238 > URL: https://issues.apache.org/jira/browse/SPARK-23238 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Hyukjin Kwon >Priority: Major >
[jira] [Assigned] (SPARK-23238) Externalize SQLConf spark.sql.execution.arrow.enabled
[ https://issues.apache.org/jira/browse/SPARK-23238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23238: Assignee: Hyukjin Kwon > Externalize SQLConf spark.sql.execution.arrow.enabled > -- > > Key: SPARK-23238 > URL: https://issues.apache.org/jira/browse/SPARK-23238 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.0 > >
[jira] [Commented] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343278#comment-16343278 ] Nick Pentreath commented on SPARK-23108: I think at this late stage we should not open up anything, unless anyone feels very strongly? > ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-23108 > URL: https://issues.apache.org/jira/browse/SPARK-23108 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343279#comment-16343279 ] Sean Owen commented on SPARK-23157: --- Agree this should not work. You are selecting a column from a different Dataset. Having it happen to work because the number of columns matches, or because the function is the identity, sounds as much like a way to write bugs as a convenience. > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Priority: Minor > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
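The failure and the natural workaround can be sketched in spark-shell (a sketch only, assuming the usual `spark` session from the shell; it mirrors the report above rather than describing new behavior):

```scala
case class R(id: String)
val ds = spark.createDataset(Seq(R("1")))

// Fails with AnalysisException: the column belongs to the Dataset
// returned by map(), not to `ds`, so the analyzer cannot resolve it
// against ds's logical plan.
// ds.withColumn("n", ds.map(a => a).col("id"))

// Works: resolve the column against the same Dataset it came from.
val mapped = ds.map(a => a)
mapped.withColumn("n", mapped.col("id")) // DataFrame [id: string, n: string]
```

As the comment above notes, making the cross-Dataset form work only when column counts happen to line up would invite subtle bugs.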
[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343276#comment-16343276 ] Nick Pentreath commented on SPARK-23109: Created SPARK-23256 to track {{columnSchema}} in Python API. > ML 2.3 QA: API: Python API coverage > --- > > Key: SPARK-23109 > URL: https://issues.apache.org/jira/browse/SPARK-23109 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Bryan Cutler >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23256) Add columnSchema method to PySpark image reader
Nick Pentreath created SPARK-23256: -- Summary: Add columnSchema method to PySpark image reader Key: SPARK-23256 URL: https://issues.apache.org/jira/browse/SPARK-23256 Project: Spark Issue Type: Documentation Components: ML, PySpark Affects Versions: 2.3.0 Reporter: Nick Pentreath SPARK-21866 added support for reading image data into a DataFrame. The PySpark API is missing the {{columnSchema}} method that exists in the Scala API.
[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343269#comment-16343269 ] Nick Pentreath commented on SPARK-23109: So [~bryanc] I think this is done then? Can you confirm? > ML 2.3 QA: API: Python API coverage > --- > > Key: SPARK-23109 > URL: https://issues.apache.org/jira/browse/SPARK-23109 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Bryan Cutler >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343266#comment-16343266 ] Nick Pentreath commented on SPARK-21866: Ok, added SPARK-23255 to track user guide additions > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter >Assignee: Ilya Matiach >Priority: Major > Labels: SPIP > Fix For: 2.3.0 > > Attachments: SPIP - Image support for Apache Spark V1.1.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. 
> This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. 
are not > supported > * this format is specialized to images and does not attempt to solve the > more general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 > (value 32 in the table) with the channel order specified by convention. > ** The exact channel ordering and meaning of each channel is dictated by > convention. By default, the order is RGB (3 channels) and B
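The "depth plus number of channels" encoding mentioned in the SPIP text above can be sketched using OpenCV's CV_MAKETYPE convention. Note this sketch is an assumption drawn from OpenCV's own headers (`type = depth + ((channels - 1) << CV_CN_SHIFT)` with `CV_CN_SHIFT = 3`), not from the table the SPIP refers to, which is not reproduced here:

```scala
object CvTypes {
  // Depth codes in OpenCV's order.
  val CV_8U = 0; val CV_8S = 1; val CV_16U = 2; val CV_16S = 3
  val CV_32S = 4; val CV_32F = 5; val CV_64F = 6

  val CV_CN_SHIFT = 3

  /** Combine a depth code and a channel count into an OpenCV type code. */
  def cvType(depth: Int, channels: Int): Int =
    depth + ((channels - 1) << CV_CN_SHIFT)
}

object Main extends App {
  import CvTypes._
  println(cvType(CV_8U, 3))  // CV_8UC3, "3 channel unsigned bytes"
  println(cvType(CV_32F, 1)) // CV_32FC1, single-channel float
}
```

So a single integer in the {{mode}} column suffices to recover both the element type and the channel count of an image.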
[jira] [Created] (SPARK-23255) Add user guide and examples for DataFrame image reading functions
Nick Pentreath created SPARK-23255: -- Summary: Add user guide and examples for DataFrame image reading functions Key: SPARK-23255 URL: https://issues.apache.org/jira/browse/SPARK-23255 Project: Spark Issue Type: Documentation Components: ML, PySpark Affects Versions: 2.3.0 Reporter: Nick Pentreath SPARK-21866 added built-in support for reading image data into a DataFrame. This new functionality should be documented in the user guide, with example usage.
[jira] [Updated] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-23107: --- Description: Audit new public Scala APIs added to MLlib & GraphX. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please create JIRAs and link them to this issue. For *user guide issues* link the new JIRAs to the relevant user guide QA issue (SPARK-23111 for {{2.3}}) was: Audit new public Scala APIs added to MLlib & GraphX. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please create JIRAs and link them to this issue. > ML, Graph 2.3 QA: API: New Scala APIs, docs > --- > > Key: SPARK-23107 > URL: https://issues.apache.org/jira/browse/SPARK-23107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA > issue (SPARK-23111 for {{2.3}}) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23227) Add user guide entry for collecting sub models for cross-validation classes
[ https://issues.apache.org/jira/browse/SPARK-23227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-23227: --- Priority: Minor (was: Major) > Add user guide entry for collecting sub models for cross-validation classes > --- > > Key: SPARK-23227 > URL: https://issues.apache.org/jira/browse/SPARK-23227 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Minor >
[jira] [Updated] (SPARK-23127) Update FeatureHasher user guide for catCols parameter
[ https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-23127: --- Priority: Minor (was: Major) > Update FeatureHasher user guide for catCols parameter > - > > Key: SPARK-23127 > URL: https://issues.apache.org/jira/browse/SPARK-23127 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Assignee: Nick Pentreath >Priority: Minor > Fix For: 2.3.0 > > > SPARK-22801 added the {{categoricalCols}} parameter and updated the Scala and > Python doc, but did not update the user guide entry discussing feature > handling.